2 Comments

Duncan Foley:

AI is very good on some tasks and pretty good on some others. Its Achilles' heel seems to be reliability. It takes time-consuming human checks to verify the AI output.

Ryan Baker:

It's good advice to check the outputs, but mistakes aren't unique to AI tools. If our own work is important, we should self-check it and have others check it. If we have others do work for us or on our behalf, we should check that too.

AI errors do manifest in somewhat different patterns than human errors, which does require somewhat different mechanisms for deciding when to trust and when to validate. I'm of the mind, though, that some of our mechanisms for trusting people have sizable issues of their own. The degree to which irrelevant factors, such as appearance, height, gender, race, vocal tone, and word choice, affect how much trust we assign to someone we don't know makes those mechanisms fairly flawed.

Relatedly, I think the last one, word choice, plays a significant role in why your default chatbot sounds so confident all the time, even when it shouldn't. We've trained writers to use that style and to avoid anything that creates the perception of a lack of confidence. Many typical editing styles suggest deleting anything you can't state with confidence rather than hedging with "I suspect" or "I think". Another example is how a single central number is preferred to a range.

That advice gets given because, from a practical point of view, you can expect a wide swath of your audience, even parts of its most educated segment, to disregard or downrate hedged claims.

Another aspect to remember is that AI tools are relatively good at spotting mistakes, whether their own or a human's, when asked to review. Your typical chatbot doesn't do this by default: it produces a result and presents it. Even reasoning models formulate a plan rather than directly second-guessing a finished answer. But if you want a workflow that feeds the output of one model into another instance of the same model, or into an entirely different model, that is possible, and it can reduce error.
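
To make that concrete (this is my own sketch, not something from the post): the workflow is just "one call drafts, a second call reviews." Here call_model is a placeholder for whatever chat API you actually use, and the model names are made up.

```python
# Minimal generate-then-verify sketch. call_model() is a placeholder for
# whatever provider API you use; the checker can be another instance of
# the same model or a different model entirely.

def call_model(prompt: str, model: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text reply."""
    raise NotImplementedError("wire this up to your provider's API")

def generate_and_check(task: str,
                       generator: str = "model-a",
                       checker: str = "model-b") -> dict:
    # First pass: produce a draft answer.
    draft = call_model(task, model=generator)

    # Second pass: ask another model instance to review the draft.
    review_prompt = (
        "Review the following answer for factual errors, unsupported claims, "
        "and overconfident wording. List any problems you find, or reply "
        "'NO ISSUES FOUND'.\n\nTask:\n" + task + "\n\nAnswer:\n" + draft
    )
    review = call_model(review_prompt, model=checker)

    return {
        "draft": draft,
        "review": review,
        "flagged": "NO ISSUES FOUND" not in review.upper(),
    }
```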

There is active research into having models say the equivalent of "I don't know" more often. Some of this is simple tuning, to weight against the overconfident style of presentation, though some of it also involves accurately and efficiently arriving at confidence levels that can then be used in combination with that tuning.
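
As a toy illustration of that second part (again my own sketch, with invented confidence numbers): once you have a confidence estimate per candidate answer, whether from token log-probabilities, self-consistency sampling, or a separately trained estimator, abstaining is just a threshold check.

```python
# Toy confidence-gated answering: if the best candidate's estimated
# confidence falls below a threshold, abstain with "I don't know".
# The confidence scores below are invented for illustration.

def answer_or_abstain(candidates: dict[str, float], threshold: float = 0.7) -> str:
    """candidates maps candidate answers to estimated confidence in [0, 1]."""
    best_answer, best_conf = max(candidates.items(), key=lambda kv: kv[1])
    if best_conf < threshold:
        return "I don't know"
    return best_answer

# Confident case: one candidate dominates, so it gets returned.
print(answer_or_abstain({"Paris": 0.92, "Lyon": 0.05, "Marseille": 0.03}))
# Uncertain case: no candidate clears the threshold, so the model abstains.
print(answer_or_abstain({"1927": 0.40, "1931": 0.35, "1933": 0.25}))
```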

A real-world, non-technical obstacle to implementing that is users and their ability to pursue their preferences, however misguided, in a market. How else do we explain the worse-than-chance record of TV pundits at making predictions? If many different models are made available and users vote with their feet for sycophantic, overconfident bombast, that's going to hurt adoption of the measured, confidence-reporting, sometimes-uncertain version, at least for the models used in chatbots.

That final problem recedes a little when you talk about workflows for processing data, since decisions there may be made on somewhat harder data and with more direct objectives, but there's still some bleed-over for any part of those decisions that's made on vibes.
