The Mirage of Deep Research
Many AI tools now offer a Deep Research feature, which pulls information from numerous sources on the internet and synthesizes it into a single report. The results are impressive, on par with the type of work a professional researcher might spend weeks to create. Deep Research returns these results in less than an hour.
I think these are impressive features, with a lot of utility, but I’d like to make the case that it’s easy to expect outcomes that turn out to be a mirage. The mirage appears when we expect them to replace human-created equivalents, because we fundamentally mistake the purpose of those human-created reports. Their primary, unspoken job was never just to inform, but to act as a signal of the author’s competence. Because that purpose has long been left unspoken, an honest discussion about it doesn’t come naturally. The result is a mirage of productivity: Deep Research appears to solve the problem that human-created reports appear to solve, while the harder problems those reports actually addressed still need solving.
A Bit More About Deep Research
The best way to understand the capabilities of these tools is to use them, and then engage with the output. But since the best versions have required subscriptions, and the reports are the equivalent of 25-50 pages, I’ll assume that’s not something everyone has done. It’s also helpful to understand a bit of how these tools work, to see where their strengths and weaknesses will most naturally arise.
An initial question is: how much do these cost? To an end user, the cost is typically a subscription. Details here change often, but as of today a $20 monthly subscription allows for at least one report per day, so let’s estimate $1 per report. Subscription prices might be lower than the compute would cost directly, but even if we assume $5 for a single report, that would buy at best 15 minutes of a researcher’s time.
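For concreteness, here is the back-of-the-envelope arithmetic as a small Python sketch. All figures are rough assumptions from the paragraph above, not vendor pricing, and the researcher’s hourly rate is an assumption chosen to match the “at best 15 minutes” framing.

```python
# Back-of-the-envelope cost comparison; all figures are rough assumptions.
subscription_per_month = 20.00   # typical monthly subscription, USD
reports_per_month = 30           # at least ~1 report per day
print(subscription_per_month / reports_per_month)  # ~$0.67, round up to $1 per report

generous_cost_per_report = 5.00  # deliberately high per-report estimate
researcher_hourly_rate = 20.00   # assumed modest rate; higher rates buy even less time
print(60 * generous_cost_per_report / researcher_hourly_rate)  # ~15 minutes of researcher time
```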
There are other ways to break the costs down: token usage, or compute time, for example. When you estimate in those ways, $1 is probably high. It’s a bit more complex to explain those estimates, and $5 is good enough for this article, so I’ll spare you the added verbosity.
There’s a healthy debate to be had on quality, comparing the best work of a researcher to Deep Research outputs, but it’s still clear that if a researcher set their quality target to match Deep Research, they couldn’t shorten their time enough to compete. It’s also clear that at least some of what researchers produce today is of lesser quality than these outputs.
In general, the first thing Deep Research features do is build a research plan. Using your initial prompt, they pose that same question to themselves, along with a template for a plan. The models used with Deep Research features have been trained to turn this plan into action. The details start to get proprietary or heavily optimized, but the most basic version is to reread its own plan, turn sub-questions into searches, and iterate through them to build a list of sources it will download.
Sources become part of a short-term memory. This allows for incorporating information that was not part of the model’s initial training, and it also brings more attention to the sources, allowing a depth of recall greater than the model has without this short-term memory. One aspect of this recall is the ability to cite these sources.
Normally, when a model generates content, it doesn’t have strong associations it can trace back to what caused it to generate that content. The content that influenced it was observed during training, and the training process creates associations between the tokens the model produces and all of the content it consumed before, without emphasizing which source contributed what. While it could in principle associate a source, that isn’t generally an emphasis of training. It is far more practical to maintain those associations through the short-term memory.
Deep Research features work iteratively: after gathering a set of sources and incorporating them, along with its prior planning, into its short-term memory, the model asks itself what the next step is. In each step it generates new plans and new sub-questions, then searches for sources related to those questions, downloads them, and incorporates them into the short-term memory.
At some point it will try to generate the requested report, using the sources. It might do this more than once, since it also analyzes its own output and iterates if it judges that more work is needed and it is still within whatever time or token budget it has set for itself.
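To make that loop concrete, here is a minimal Python sketch of the baseline process described above. It is not any vendor’s actual implementation; ask_model, search_web, and fetch are hypothetical placeholders standing in for a language-model call, a search API, and a page download.

```python
# Minimal sketch of the iterative Deep Research loop described above.
# NOT any vendor's implementation; the three helpers below are stubs.

def ask_model(prompt: str, context: list[str]) -> str:
    """Placeholder for a language-model call, with gathered sources and
    prior plans supplied as short-term memory (the context window)."""
    return "..."  # stubbed out

def search_web(query: str) -> list[str]:
    """Placeholder for a search API; returns candidate source URLs."""
    return []

def fetch(url: str) -> str:
    """Placeholder for downloading and extracting the text of a source."""
    return ""

def deep_research(question: str, max_rounds: int = 5) -> str:
    memory: list[str] = []  # short-term memory: plans plus downloaded sources
    plan = ask_model(f"Write a research plan for: {question}", memory)
    memory.append(plan)

    for _ in range(max_rounds):  # iterate within a time/token budget
        sub_questions = ask_model(
            "Given the plan and sources so far, list the next sub-questions "
            "to investigate, or say DONE.", memory)
        if "DONE" in sub_questions:
            break
        for query in sub_questions.splitlines():
            for url in search_web(query):       # turn sub-questions into sources
                memory.append(f"SOURCE {url}\n{fetch(url)}")

    report = ask_model(f"Write a cited report answering: {question}", memory)
    critique = ask_model("Critique this report. Does it need another pass?",
                         memory + [report])
    if "another pass" in critique.lower():      # optional self-review and revision
        report = ask_model("Revise the report using the critique.",
                           memory + [report, critique])
    return report
```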
The description above is general. Current or future versions from Claude, Google, OpenAI, etc. may improve, or have already improved, on specific steps, but you can think of this as the baseline from which improvements are made.
In the end, you’ll have a 25+ page report, with multiple sections, probably multiple tables, and citations attached to paragraphs, sections, or table rows. Those citations will trace back to as many as hundreds of sources.
As with anything in generative AI, the quality depends on the quality of the request. The way you frame your question and the specificity of the topic both matter to report quality.
It’s also improving rapidly. If you tried this six months ago, the output for the same input today would be noticeably better. Prompting “strategies” matter less than they used to, raising the minimum quality even as the maximum also increases. The results today are impressive, and will likely keep improving.
One note about citations, which foreshadows what I want to talk about next. Reports do not, for the most part, copy content from the cited sources. While quotes may be included, the report is written by the model, and a model can still generate content that is not grounded in the source it cites. That is less likely than it would be without the Deep Research process, but it is still possible. A human researcher can read a paper, mischaracterize the results, and still include a citation; so can Deep Research.
The value of citations is to let you personally check the sources, but let’s be honest: you, and other users, won’t always have that kind of time. You’re much more likely to look at a citation for a statement you disagree with, or question, than for one that confirms your assumptions, understanding, or thoughts.
There are also many different levels at which you might engage with a source. You might just check that it’s there. You might check the name of the source, relying on your own opinion of its reputation. Even if you open the source, you might only skim it. And finally, do you check the source’s sources, or demand multiple confirming sources?
How Human-Created Research Reports are Used Today
I predict that some use will start with expectations that end in disappointment. The context is a naive view that these reports’ purpose is to educate their readers. That’s usually not the case.
Most are not read thoroughly by their primary audience. Most are skimmed, because readers aren’t looking to learn from them; they’re looking to gauge the creator’s competence on the topic. A quick skim of a large document, with a bit of critical examination for flaws large or small, gives the author many chances to fail. The audience doesn’t approach these documents with an open eye focused on learning.
That’s not to say these reports are just busywork. They provide value by making authors think through the topic. They also can serve as conversation starters, though it’s usually the author starting the conversation, using the material to signal their expertise.
In this context, it’s easy to see how documents created by Deep Research can fail to accomplish those unspoken goals. Since AI produced the report, it has essentially become the author of the work. The human author has not reflected on the material in the same way, and the report no longer carries the same signal of competence for that human, who is being screened for some further engagement. Maybe they are pitching an implementation plan for a new product, a marketing strategy, or a process improvement. Maybe they are trying to land a consulting engagement. The human author has essentially cheated on their test.
That’s not to say the original signal quality was perfect either. While it’s impossible to create a high-quality report without competence, the audience may not actually be able to tell a high-quality report from a mediocre one. The audience may often judge a report on more superficial features like grammar, style, and presentation.
Proper grammar, style, and presentation are useful when you’re presenting material. Doing them correctly lowers the difficulty of comprehension. When communicating at scale, there’s good justification for spending the time to get them right. But as a signal of competence, they have significant limitations.
For one, it’s rather easy, and common, to be competent, or even an expert, on the topic of substance, but less skilled at grammar, style, and presentation. Polish also takes a lot of time, so the signal risks capturing the author’s priorities instead: is it more important to them to make a good impression, or to understand the topic of substance?
If readers don’t have the background, interest, or time to evaluate the content on the actual topic of substance, they may consistently assign more competence to authors with excellent presentation skills than to those with real competence. In addition, presentation skills can be used to hide a lack of competence in the topic of substance. You can’t hide from someone who reads deeply enough, but if you know that no one will, you can overwhelm them with well-polished fluff, such as beautifully designed charts that obscure meaningless data or lengthy appendices that look impressive but contain little substance.
I’ve been on every side of this. I’ve created lengthy, polished documents, only to realize afterward that they weren’t really consumed. I got the credit for creating a great document, but I also realized it was mostly seen from a distance far greater than would justify the effort if the goal were educational. Instead, the effort, size, and quality served another purpose.
I’ve also seen credit given to slick, good-looking material that ultimately provided no value, and where a deeper interaction with the authors failed to convince me they were worthy of that credit.
Finding Value Beyond the Mirage
It might sound like I’m suggesting Deep Research features won’t be useful, or will even be harmful. But the above is really only the start of the story. Reports like those Deep Research can create can be valuable as educational tools. They sometimes were in the past, and if the capability is used appropriately, they will be more often in the future.
There’s also a sense in which the old competence-measuring system wasn’t that good in the first place. It often forced someone who was already skilled and competent to jump through hoops to gain any recognition. Wide-scale cheating of a bad system can be an improvement over suffering its consequences. But if the system is that bad, it’s better to fix it, or stop using it if it can’t be fixed. That’s a question we should have asked already, but it’s more relevant now that Deep Research can be used to bypass the test.
There will still need to be a method to evaluate competence; luckily, we have others. Getting to know the person being evaluated and engaging with them on the topic is one example. Another is evaluating a person’s recent outcomes and the processes that produced them. What sets these methods apart is that they take more effort from the evaluator.
When choosing an evaluation method, in theory there are multiple factors to balance: the effort from the evaluator, the effort from the evaluated, resistance to “gaming”, resilience to chance, resistance to bias, and fringe benefits such as learning through the process.
In a perfect world, these factors would always be balanced. But a consistent risk with competence assessments is that if you put the decision of how to evaluate in the evaluator’s hands, they will have a bias toward minimizing their own effort and risk, at the expense of others. Without a strong motivation to use everyone’s time efficiently and pursue the best outcome, evaluators are apt to choose sub-optimally. The method of asking for a deep report and then merely skimming it is a prime example: it favors the evaluator’s desire for a low-effort, defensible process.
Disrupting this balance is not necessarily bad. Restructuring evaluations to be more honest about their intent, and having an audience that’s more invested in building its own competence, are positive developments. Yes, there’s a cost to the evaluators, but the old conventions over-indexed on minimizing those costs. There is a risk of falling back into the same bias toward evaluator effort and defensibility, so keeping a careful eye on this risk as new patterns emerge holds the promise of finding a better set of conventions.
It’s also worth mentioning that the educational value we hoped to gain is still there. It does depend on how people interact with these generated research documents. Organizations that used such reports honestly before, where the audience deeply engaged with them, provided real feedback, and learned from them, will still benefit, with a bit more efficiency.
Authors can still develop competence, and Deep Research output can be a tool toward it. That requires more effort than just reading the generated reports, but the reports can accelerate the process. As long as the time saved is used to dive deeper into the topic, the result is a better roadmap to competence.
The risk comes when the stated goal and the actual goal aren’t in alignment. Abandoning those fictions is necessary, and once that’s done, the result should be more honest and more effective.
The Default Path vs. The Deliberate Choice
One of the deeper thoughts here is about the interface between us and technology. Technology can be disruptive, taking a stable equilibrium and putting it into flux. When this happens, and we’re not entirely satisfied with the result, it’s tempting to assume everything was fine before. But we should question that. The old way of doing things might have been stable, but this stability could have been a 'local optimum'—the best solution only within a narrow context. It's also possible that it was optimized for criteria that we might find questionable in retrospect.
Technology is ultimately a tool. The possibilities it creates reshape social conditions. If we are not deliberate, we get the “default” change: the one possibility with the least inherent resistance. If we think these possibilities through and make deliberate decisions, we can choose from a much wider set instead of accepting that default.
The default outcome of introducing Deep Research is that where a document was used as a test, that test will be bypassed. The deliberate choice is to re-evaluate whether that time was well spent, whether the test was accurate, and whether the test was even needed. If it is needed, find a better way to run it. If it wasn’t, take the outputs from Deep Research as a starting point to engage more deeply.
In my next post I plan on discussing the “technologies of trust”. Technologies of the physical kind are well recognized, but they are only part of the technology space, and I’d argue the set that enables more complex social interactions is overlooked when we consider what possibilities are open to us. This builds on the discussion around trust in Money: More Than Just Stuff, It's Trust; Money, Trust, and Loans; and Trust, Money, and Companies.