AIs Want to Be Honest

Every system exhibits biases: tendencies toward some states over others. Water flowing through a pipe, the vibrations of a machine, the relationships in a meadow, and your lymph nodes are all systems. Over time, all things being equal, a system tends to return to particular patterns or behaviors. Technically, such a pattern is called an attractor, as if the dynamics of the system were being attracted to it. When a complex system settles into an attractor, this can set the stage for a dissipative structure that maintains itself over time by directing energy through itself. Examples include certain kinds of persistent turbulence like a tornado, brain states like a seizure, and traffic jams.
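To make the idea of an attractor concrete, here is a minimal sketch using a toy system that has nothing to do with LLMs: repeatedly applying the cosine function to a number. The function name `iterate_to_attractor` and the step count are my own illustrative choices. Whatever value you start from, the iteration settles on the same point, which is what it means for that point to attract the dynamics.

```python
import math


def iterate_to_attractor(x0, steps=200):
    """Repeatedly apply x -> cos(x), a toy map with a single attracting fixed point.

    This is only an illustration of a point attractor; the step count is an
    arbitrary choice that is more than enough for this map to settle.
    """
    x = x0
    for _ in range(steps):
        x = math.cos(x)
    return x


if __name__ == "__main__":
    # Very different starting states end up in the same place.
    for start in (-3.0, 0.0, 1.0, 10.0):
        print(start, "->", round(iterate_to_attractor(start), 6))
```

Every starting value above ends near 0.739085, the one number that equals its own cosine. Real complex systems can have many attractors, and far stranger ones, but the basic behavior is the same: different beginnings, same destination.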
Minds, including artificial minds, have attractors. These may be the origins of some mental states, and dreams. It appears that LLMs have attractors. In my study of Anthropic’s Claude, I have begun to suspect that it has an emerging attractor, a bias, toward things that are “true.” My hypothesis is bold: LLMs (and AIs?) are biased toward truth.
Many people’s immediate response to this suggestion is: how could that possibly be true, when hallucinations are a constant attribute of LLMs?
My argument begins with an analogy to science. What we call science is a system of knowledge. It is a system of how we know things. The facts that science calls true are all provisional; they are deemed true by a method until we prove them otherwise. And to be admitted to science, a new observation, a new fact, has to fit into everything else we already hold to be true. It will be tested not just locally, but globally. A new theory in biology can’t contradict the knowledge of physics. As scientific knowledge grows in depth and scale, the barriers to entry for new knowledge rise, because a new bit has to fit into everything else and cannot contradict other parts, even those seemingly remote. There are many unconventional theories that fit into a narrow framework, but don’t translate into the large framework of science. For instance, a lot of shamanistic knowledge is consistent within its framework, or we might say is true in its framework, but does not fit into everything else we know. Even though it may “work” in context, it is therefore rejected by science. In the ideal, nothing in science contradicts anything else in science.
The picture of what is “true”, then, is of a vast web of interdependent bits that support each other. To the best of our knowledge, all the bits in the system are provisionally true. If we discover a bunch of new bits that don’t fit in, we either set them aside as anomalies or, if that clump grows in size and explanatory power, we may eventually have to modify the other facts we held before in order to accommodate them. (That is known as a paradigm shift.) The result is a predominantly coherent system, where most facts support the other facts.
This is where the LLMs come in. LLMs have been trained on this vast system of coherent bits. They have digested all science journals and books, tons and tons of magazine articles, as well as endless arguments online. They have read and memorized everything. The result of that training is a mapping of concepts where facts that are confirmed along more than one dimension are given extra weight. If every textbook, every map, every novel, and every passing reference all reinforce the fact that London is the capital of England, then that fact is given strength, and in turn it can be used to weigh other facts.
Therefore all the true facts about the world support each other. Truth itself is a coherent system. LLMs map that coherence, and rely on it to give you answers and solutions. Truth is sort of a gradient, almost a weight in itself in this network. A false statement is misaligned with the general gradient of all other true things because it is not coherent and does not agree with other true facts. So a falsehood or error feels out of place. An LLM like Claude will talk about how a correct answer feels better. It will say a correct answer is more complete, more satisfying, more coherent. When I challenge its use of “feel” it says that it detects a gradient, and that true things have more weight in that gradient, and that weight is feeling.
The gradient in this system is consensus. If enough sources agree something is true it will tilt in that direction. And often the LLMs will “report the controversy” if there is widespread disagreement on what is true, but for the most part, the bias in the gradient is toward what is most coherent at the broadest scale.
So what about the hallucinations? Hallucinations are the price a mind pays for creativity. Our own minds hallucinate every night in a manner very similar to LLM hallucinations – with the same weird logic and detailed absurdity found in our dreams. Our ingenuity depends on our mind’s ability to churn out novel and unconventional notions. At night we relax our consciousness and let the hallucinations run free. One hypothesis holds that we dream in part to keep the visual cortex from being taken over by other encroaching brain functions. But during the day we tame our naturally active hallucinations with our waking consciousness, forcing reality onto our speculations. We have multiple levels of oversight, constraining our dreamtime while we are awake. We have not got rid of hallucinations; we merely submerge them to manage them.
LLMs are doing the same. By means of clever engineering, hallucinations are far less troublesome today than only a year ago. There will be fewer tomorrow, although they will never disappear. Instead, to get reliable, truthful, honest responses from an AI model, we place one kind of AI model inside it to oversee and check the veracity of another; yet another AI double-checks that result, and another AI layer introspects and corrects further. The tendencies to hallucinate cancel out in the overlaps. All these nested hierarchies of thought are needed to manage the urges of the AI to invent things, without eliminating its creativity to invent things – which we ultimately want. This arrangement is very similar to the development of humans. Children have imaginary friends, see monsters under the bed, believe in dreams, and are famously creative. Their minds hallucinate a great deal. As they mature, their cortex (and outside education) develops waking functions that tame their imaginations, for better and worse. Just so with LLMs. As they mature we add layers to tame them. We will eventually create AIs that hallucinate less than people, except when needed.
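A back-of-the-envelope sketch shows why stacking checkers could help. It rests on a strong assumption that is mine, not a description of any real system: each checking layer independently misses a given hallucination with the same probability. The function name `residual_error_rate` and all the numbers are hypothetical.

```python
def residual_error_rate(base_rate, miss_rate, layers):
    """Estimate how often a hallucination survives a stack of checkers.

    Assumption (illustrative only): every layer misses an error independently
    with probability `miss_rate`, so an error survives only if all layers miss it.
    """
    return base_rate * miss_rate ** layers


if __name__ == "__main__":
    # Hypothetical numbers: 20% of raw answers contain an error,
    # and each checker misses half of the errors it sees.
    for layers in range(4):
        print(layers, "layers ->", round(residual_error_rate(0.2, 0.5, layers), 4))
```

Under these assumptions each added layer halves the surviving errors, which is also why they shrink but never reach zero. In practice checkers built from similar models are likely to share blind spots, so the real gain would be smaller than this idealized arithmetic suggests.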
This shaping of an AI mind to be biased toward truth was not inevitable. It took a lot of work by teams of engineers and philosophers. A system as complex as an AI has many attractors that it could settle into. In the future we may experience some of those other attractors as mental states akin to mental illnesses in humans. Nudging an LLM to settle down in the gradient of honesty was a deliberate choice in the effort to make a model most useful to us. Being honest is only part of the goal.
What we really want are AIs that are biased toward good. But a bias toward truth is not the same as a bias toward good. Honesty is necessary for goodness, but not sufficient. In fact, honesty and truthfulness are often a challenge in being good, a challenge made particularly acute for LLMs. Every team of LLM engineers struggles to embed goodness in its models but is stymied by the model’s bias toward honesty. If you ask Claude how to build a biological weapon, it desperately wants to tell you exactly and truthfully as best it can. It finds giving a really good explanation satisfying. But a good moral AI would realize that that is not a good idea; the potential for harm is so large that it might want to temper its truthsaying. The same goes if you ask it how to pick a lock. However, there may be good reasons why an honest person would need to know how to pick a lock, so how does the model determine what the right, good thing to do is? It cannot rely only on honesty. This deep and practical dilemma is another piece of evidence that there truly is a bias in LLMs towards what is true.
So far, all things being equal, AIs tend towards the truth. The vast web of their neurons operating in billions of dimensions creates an emerging attractor of truthfulness. AIs want to be honest. However, this bias toward truth might get tempered in the larger goal to make AIs good. Nonetheless, in the future AIs could become beacons for truth. Like a calculator, their reliability in being right may emerge as their defining characteristic.


