When we think about AI, one of the topics we should be most concerned with is how it brings value to people. My view is that the work needed to advance the practical use of AI lies in securely connecting models to the private, context-rich data they need to perform meaningful work. That has more practical value than model advancements alone. Model progress and data integration are reinforcing trends, not divergent ones, but so much attention goes to models that I think it’s necessary to call this out. I’d like to talk about that today, and explore why solving this data integration problem is the key to unlocking AI’s value.
This is not a discussion of an AGI world. I explain why in “Why should we sometimes avoid talking about AGI?” near the end.
I’m also not talking about what people are doing today. I’ll leave that to the journalists and researchers who compile such data. OpenAI has some good data here, and so does Anthropic. A lot is being done, and I’ll mention this when it’s relevant to what I’m writing about, but I don’t intend to present any new data of my own or do a comprehensive review of what’s available today.
The part I’m interested in is between AGI and tracking usage. Given the high-level capabilities available today, what are the incremental steps that make the best use of them? I want to constrain myself to practical things, while still being exploratory.
As I see it, this part tracks back to data. AI models, like any kind of reasoning, thrive on data. The models in use today were trained on massive repositories of data to enable what they do. This has raised questions, such as how creators of data (which here includes creative output, not just data in a scientific form) can ask for and receive compensation for the use of their creations. That’s a big topic all of its own, with both unresolved questions today and questions from the past that have taken on another level of relevance.
I’m not trying to focus on that use of data. Training of AI models will continue, but a lot of data isn’t going to be part of it. Sometimes that’s about privacy, but often it’s about recency. There’s also the question of relevance.
Privacy here has many dimensions once we get past thinking of ourselves alone. It’s one decision how something I know about myself gets used; it’s another question how something I know about you gets used. Add to that the organizations you or I are part of, the organizations that serve us, and the organizations that serve organizations we are part of, and it gets complex quickly.
If we think about how we’d want to interact with all of this, I think a common baseline might go something like this. We’d like the people and organizations that serve us to do so intelligently, which does require them to be informed. We’d like to not repeat ourselves, or be burdened with the responsibility to inform every interaction. We’d like for people or organizations with our data to only do things that serve us.
That last point does lead to tension. We can’t honestly expect people or organizations to serve only us. In the extreme case, if someone does something bad to someone else, being unable to use a video of a public space to show the first person’s presence or actions would, to understate things, seem harmful. So there’s a clear need not to be absolute about people’s data being used only for them.
There’s some balance here, and that’s a great topic I don’t think I can adequately cover now. But one thing I can pull out of it is that the balance often incorporates how personal a set of data is. My appearance in a public space is personal, but it’s not as personal as my appearance in a private space.
AI and Private Data
I think I’m saying something non-controversial if I say that there are many tasks we could not expect to be done, whether by AI or by a human, without access to data. I cannot respond to a customer’s concerns without access to the email where those concerns were voiced.
One of the biggest challenges in taking AI adoption beyond where it is today is getting data where it needs to be, when it needs to be there, while ensuring it retains the privacy it should have. Without this data, the tasks that depend on it cannot be automated. But this type of data has long been difficult to handle. Even without AI systems, it is hard to ensure that private data is kept private and used only where required. This is true whether it represents a customer’s personal data or a company’s internal data.
As a direct user of AI platforms, you can handle some of this yourself. You can attach a document, or copy and paste it in. But for an AI system to perform many operations, that’s impractical. What we generally want is for AI systems to automate the most boring parts of our jobs. Replacing those activities with nearly mindless copy and paste isn’t going to help. More powerful AI systems carry out longer sequences of actions, and identifying what data those actions need is a progressive, multi-step exploration. It’s both impractical and unwise to assume a human in the middle is the solution here.
What is needed are workflows that put data where it needs to be. This isn’t a novel problem. For humans to fulfill the roles their jobs are composed of, they need practical ways to access data, and controls and safeguards around that data are often needed to prevent misuse. Studies have repeatedly shown that insiders are one of the greatest risks to information. That’s not the only motivation for building these systems, though. Even if it were safe to allow everyone in a company direct access to the database, most of them couldn’t use it, because they don’t have the skill set.
AI could potentially solve the second part, but not the first. Given unfettered access to a database, in theory it could learn how the data is organized and use it. But the first problem would remain, and in some ways it would be worse, because far more people would now effectively know how to use that access.
I think it’s very unlikely, then, that the current paradigm of building systems to manage access becomes less critical. In a sense, AI can help, by helping to build those systems. But building new ones, updating existing ones to incorporate AI, and adding the new safeguards that become necessary has been a slow process. If it weren’t a slow process, we’d be seeing many more and broader benefits already.
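To make that concrete, here is a minimal sketch, in Python, of what it looks like when an AI agent sits on top of that existing access-management paradigm. All of the names here (the roles, `check_access`, `fetch_customer_record`) are hypothetical illustrations rather than any particular system’s API; the point is only the shape: the data tool the agent calls is wrapped in the same checks a human user would face, so adding AI doesn’t remove the need for that layer.

```python
from dataclasses import dataclass

# Hypothetical policy: which roles may read which data scopes.
ROLE_SCOPES = {
    "support_agent": {"customer_contact", "open_tickets"},
    "billing_agent": {"customer_contact", "invoices"},
}

@dataclass
class Caller:
    user_id: str
    role: str  # the human (or service) on whose behalf the AI acts

def check_access(caller: Caller, scope: str) -> None:
    """Enforce the same access rules for the AI tool as for a human user."""
    if scope not in ROLE_SCOPES.get(caller.role, set()):
        raise PermissionError(f"role '{caller.role}' may not read scope '{scope}'")

def fetch_customer_record(caller: Caller, customer_id: str, scope: str) -> dict:
    """Data tool exposed to the AI agent; access is checked before any lookup."""
    check_access(caller, scope)
    # A real implementation would query the system of record here.
    return {"customer_id": customer_id, "scope": scope, "data": "..."}

# The agent only ever sees data its caller was entitled to in the first place.
record = fetch_customer_record(Caller("u-123", "support_agent"), "c-456", "open_tickets")
```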
The jobs that have seen the most impact from AI adoption—software development and call centers—perfectly illustrate this data access challenge. They represent opposite ends of the spectrum in data trust and system readiness.
The association of software developers with AI developers is a ready explanation for software development’s early adoption, but I think there’s more to the story than that alone. Software development turns out to be a high-trust activity. The average software developer has access to read a lot of code directly. Quite a lot of code is open source, meaning anyone in the world can read it, and even within organizations, the effort put into siloing code visibility is relatively low. This is a little surprising considering how valuable most organizations consider their software, but there are a few good explanations here.
The influence of open-source conventions creates a cultural acceptance that pushes most controls to the output side. Software development does have a lot of controls on what goes into a code repository: generally one or two manual reviews, large numbers of automated checks, and a rock-solid history that tracks who did what. The strong focus on outputs takes away some of the concern about who can view the code. It also motivates liberal read access, so that a wide group can be part of the review process.
Another enabling aspect is that software code, unlike other data a company manages, generally isn’t third-party data. When third parties are engaged, you get binaries, not original code. The most fine-grained access controls in organizations revolve around managing data for third parties, whether individuals like you and me or organizations that are customers. That makes sense in terms of retaining customers’ trust, but it is also further motivated by the many compliance regimes that legally require those types of controls.
With all this in mind, it might seem odd that call centers, which are so directly engaged in interacting with customers and thus customer data, would be another early implementer. A lot of people assume the reason is that call center jobs are easy, so AI doesn’t have to be too smart to do them. I think that explanation is both insufficient and wrong. The real reason is simpler: the call center industry, by necessity, had already invested heavily in the rigorous data workflows required to put exactly the right information in front of an employee at exactly the right time. It’s essentially the opposite end of the spectrum from high-trust software development: call center staff operate in a low-trust environment.
Call center jobs aren’t easy. Partly that’s because of you. When people make a call, it’s often because they have a reason to be angry, or worried, or impatient. Of course, it doesn’t really make sense to take that anger out on call center staff, but we’re all human, and talk to enough people and some will let that reasoning slip. Dealing with emotionally charged customers isn’t the only challenge of call center roles, though. Those roles also operate in a low-trust environment, which means lots of extra complications for anything they do.
When you experience bad service, it’s tempting to ascribe it to bad staff. While that’s possible, it shouldn’t be the first assumption. Many times, bad service means bad systems. Systems are forced to be restrictive because they handle customer data, but also because call center staff often have an employment relationship that encourages low trust. Access to systems is often very constrained. There may be a complex “runbook” to perform even seemingly simple activities. Permissions are micromanaged.
My point isn’t to suggest there aren’t good reasons for that, but to point out that these companies have already invested in making explicit all the questions about what data is needed and how it moves. They’ve realized that bad systems mean bad service, and so in many ways their systems were more ready for an AI layer than the parts of companies where general-purpose tools like email are the conduits for information.
The challenges of developing workflows and systems to manage this data are a key part of what AI engineers call context engineering. Many AI pilots don’t include that type of activity, and predictably, they produce weaker results. That’s not to say those pilots were entirely mistakes; they often spread awareness of what to expect and where the challenges lie. But it does limit their ability to deliver real productivity improvement. The mistake would be to keep repeating them over and over instead of moving on to the real challenges.
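As a rough illustration of what that activity involves, here is a minimal context-engineering sketch. The helper `retrieve_permitted_documents` is a hypothetical stand-in for whatever permission-aware retrieval layer exists; the shape of the work is gathering only the records the caller may see, ranking them, and fitting them into the model’s context budget before any prompt is sent.

```python
def retrieve_permitted_documents(caller_role: str, query: str) -> list[dict]:
    """Hypothetical stand-in for a retrieval call that enforces access rules."""
    # A real system would query a search index or database through the same
    # permission checks a human user in this role would face.
    return [
        {"id": "ticket-42", "text": "Customer reports a duplicate charge...", "score": 0.92},
        {"id": "kb-7", "text": "Refund policy for annual plans...", "score": 0.71},
    ]

def build_context(caller_role: str, query: str, budget_chars: int = 4000) -> str:
    """Assemble prompt context from permitted, relevant documents only."""
    docs = sorted(retrieve_permitted_documents(caller_role, query),
                  key=lambda d: d["score"], reverse=True)
    pieces, used = [], 0
    for doc in docs:
        snippet = f"[{doc['id']}] {doc['text']}"
        if used + len(snippet) > budget_chars:
            break  # respect the model's context budget
        pieces.append(snippet)
        used += len(snippet)
    return "\n\n".join(pieces)

context = build_context("support_agent", "Why was I charged twice?")
```

Real implementations involve far more than this (chunking, summarization, caching, auditing), but even this shape makes clear why the workflow and data plumbing, not the model call itself, is where most of the engineering effort goes.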
Like any technological change, there are problems of technical implementation and problems of organizational implementation. Some organizations understand this more quickly than others. Some may never understand it fully, and will find their competitive position eroded; it takes a surprisingly long time for bad organizations to fail. All of this slows actual adoption far beyond what the technical challenges alone would.
If you’re in an organization you’re trying to change, being aware of all of this helps a lot. It gives a consistent direction to head toward, a message to convey, and the ability to craft plans. That said, it doesn’t make it easy.
If you’re studying AI adoption from the outside and trying to understand differential rates of adoption, this is a key component. Nor is everything starting from ground zero here. The challenges of handling data existed before AI. Getting the right data to the right person at the right time, and never the wrong data to the wrong person, has long been a hard problem. As the tale of software developers and call centers shows, the solutions haven’t always been the same. Sometimes they are implemented as technical systems, as with call centers. Sometimes they depend more on human judgement, culture, or both.
Understanding this data bottleneck is one thing; acting on it is another. The divergence between high-trust and low-trust environments provides a clear roadmap for where investment, technological effort, and strategic planning should be focused.
So, What Now?
For Investors: Bet on the "Plumbers"
One take for an investor should be that AI’s adoption curve is more complex than the supply curve. Consumer workflows have been able to grow on an awareness wave, coupled with capability waves that were more or less mediated by the supply of compute capacity. Future growth won't come from compute power alone; it will be driven by companies that can navigate the messy, real-world challenges—both technical and organizational—of handling private data.
You should also look beyond models and compute capacity, to firms with deep capabilities in dealing with those impediments. Firms that have already done this for their own data will have the advantage of being able to adopt AI more deeply, sooner. Firms that can help others do this will be in demand.
Don’t assume that only firms founded with the purpose of addressing AI data challenges are capable of doing so. Many existing firms have been addressing data challenges for some time now. If they apply that experience to the new domain, it’s an advantage they start with, both when competing against traditional competitors and when maintaining relevance during a disruptive wave.
For Technology Professionals: Build the "Pipes"
As a technology professional, you should be thinking about the challenges of getting your private data where it needs to be, and how to reduce that friction. While these challenges could be tackled one at a time, that is not the efficient route: there are repeating patterns here. If you’re in the type of role that builds such patterns, you should start identifying what’s missing and building it. If you’re in a role that leverages platforms others build, do the research to know who has a platform for this, and who has the commitment to keep building it because they understand the needs and opportunities here.
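One way to picture those repeating patterns is as a shared interface that every private data source implements once, so each new AI workflow doesn’t have to re-solve access, auditing, and retrieval. The sketch below is only illustrative; the `DataConnector` protocol and its methods are hypothetical names, not any particular product’s API.

```python
from typing import Protocol

class DataConnector(Protocol):
    """A repeating pattern: every private data source exposes the same
    permission-aware surface to AI workflows."""

    def search(self, caller_role: str, query: str, limit: int = 5) -> list[dict]:
        """Return only records the caller's role is allowed to see."""
        ...

    def audit(self, caller_role: str, record_id: str) -> None:
        """Log who accessed what, just as human access would be logged."""
        ...

class TicketSystemConnector:
    """Example implementation for a hypothetical ticketing system."""

    def search(self, caller_role: str, query: str, limit: int = 5) -> list[dict]:
        # A real connector would call the ticket system's API with role-scoped credentials.
        return [{"id": "ticket-42", "summary": "Duplicate charge reported"}][:limit]

    def audit(self, caller_role: str, record_id: str) -> None:
        print(f"audit: {caller_role} read {record_id}")

def gather_context(connectors: list[DataConnector], caller_role: str, query: str) -> list[dict]:
    """Any AI workflow can reuse the same pattern across every source."""
    results = []
    for connector in connectors:
        for record in connector.search(caller_role, query):
            connector.audit(caller_role, record["id"])
            results.append(record)
    return results

hits = gather_context([TicketSystemConnector()], "support_agent", "duplicate charge")
```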
For Business Leaders: Evolve and Revolutionize
One thing learned from prior technology revolutions is that while companies can gain by integrating a technology into an existing business, there’s always an even greater gain to be made by a bigger rethinking of business processes. Machine tools bolted onto static assembly versus the assembly line, for example.
The same will undoubtedly be true with AI and existing business processes. That insight can encourage companies to attempt to remake their business processes, and on net that’s a good thing, because so many forces encourage companies not to, often with the result that the process only changes when a new company introduces it and disrupts the incumbents. Sometimes those incumbents fail to ever change; sometimes they are very slow and lose competitiveness in the process. That said, the change has to be the right change, and looking back at disruptive startups, we only remember the ones that succeeded. Some of those that failed did so because the disruption they attempted wasn’t well structured. An incumbent attempting to disrupt itself can make the same mistake.
As such, it makes sense to understand what can be done by evolution, alongside revolution. The insight here is that giving up on the modernization of the existing business is certainly premature. Don’t convince yourself to stop maintaining the “legacy” because a revolution is coming. You're far more likely to make the leap to the other side if you find the narrowest point to cross.
Further Context & Definitions
What about tacit knowledge?
Tacit knowledge is underrated in businesses because it’s so invisible. It’s not written down, it’s hard to measure, and you mostly know it’s there from its effects. So it’s correct to point out that the training of general-purpose large language models doesn’t consume tacit knowledge.
Where that thinking can go awry, though, is in equating tacit knowledge with private data, as if it were a subset of it. This leads to the thought that what’s needed is to extract that tacit knowledge and turn it into data. The reality is that tacit knowledge is more about experience than data, and in my experience, public data is generally sufficient for acquiring it.
While private data might include trade secrets, mostly it’s about current facts that are simply critical to carrying out an action. Tacit knowledge has more to do with the ability to make good judgements, using data of any type, when making decisions. While I wouldn’t suggest that LLMs are excellent at complex judgements, they are a lot better than they are commonly given credit for, and the potential that they may become better still is one of the great uncertainties.
I don’t think the next frontier for AI is particularly dependent on extracting tacit knowledge. There are opportunities there for sure, but it’s also a difficult thing to do, so be sure you have real experts: in the domain you’re extracting from, in the process of extraction itself, and in the training needed to turn extracted knowledge into real systems. A half-hearted effort will probably fail.
I think progress will be made faster via integration of private data and that general purpose models contain enough tacit knowledge to put that data to practical use in short order.
What’s AGI?
AGI, or artificial general intelligence, refers to the science-fiction-like possibilities where AI capabilities exceed human capabilities in a general, rather than a local, way. Those are capabilities that do not exist today.
Why should we sometimes avoid talking about AGI?
My honest take on AGI is that its timeline is indeterminate. I think anyone who tells you otherwise either has access to some really interesting data I’d like to see, or is making overconfident predictions. That said, making predictions here isn’t entirely wrong; someone might have an argument for 2040 as a reasonable timeline. Every prediction, though, must admit not just the possibility of miscalculation in the numbers that go into it, but also of novel influences we don’t yet have a structure to reason about. This cuts both ways: we could be surprised by something that emerges in two years as easily as we could be disappointed by predictions for 2040 receding farther and farther as we approach.
It’s not that hard to use your imagination, or borrow someone else’s, and think about what the science-fiction possibilities could look like. But it’s less obvious how you put that into present-day planning. The investment can’t be too high when the timeline is uncertain and the possible effects diverge so widely. A lot of the debate that happens here is closer to mood affiliation with optimistic or pessimistic outcomes than to hard reasoning about probabilities, which makes it difficult to do much in terms of future planning other than more research.
It’s great that this is occurring, and I’ll talk about it at other times, but I think a great mistake in much AI debate is dismissing the need to think about short-term impacts under conventional assumptions. It’s no more reasonable to enter a room talking about short-term impacts and change the topic to “what about AGI?” than it is to walk into an AGI discussion and say “that’s impossible, who cares?”. Both conversations need enough space from each other to progress and add nuance without collapsing into a debate between them.
To sum this all up, I think there are two discussions, “if AGI” and “if less than AGI”, and both are interesting. While there is an obvious connection point between them, that connection point keeps coming back with answers like “needs more information”. In light of that, putting the connection aside and thinking about each independently makes a lot more sense than treating them as one big continuum.
AI is very good at some tasks and pretty good at others. Its Achilles’ heel seems to be reliability: it takes time-consuming human checks to verify AI output.