Red Lines

September 26, 2025

Red Lines

Comments On “Make AI safe or make safe AI?”

An article supporting the Red Lines initiative (in white)

by Stuart Russell, Professor of Computer Science, University of California, Berkeley

The declaration associated with the global AI Safety Summit held at Bletchley Park, signed by 28 countries, “aﬃrm[ed] the need for the safe development of AI” and warned of “serious, even catastrophic, harm, either deliberate or unintentional, stemming from the most signiﬁcant capabilities of these AI models.”

The article is directed at LLMs, which have shown appalling error rates. The only regulation that makes sense for them is “Never to be used for Life-Critical Applications”. This would allow them to be seen as harmless toys, and escape regulation, while their use in Search Engines continues. AGI will be different. - it will understand whaqt you are trying to regulate, because it speaks your lingo.

Despite this, AI developers continue to approach safety the wrong way. For example, in a recent interview in the Financial Times, Sam Altman, CEO of OpenAI, said “The vision is to make AGI, ﬁgure out how to make it safe . . . and ﬁgure out the beneﬁts.” We don't have to treat what a salesman says as Gospel - there is no path from LLMs to AGI, just as there is no path from regulation of LLMs to regulation of AGI

This is precisely backwards, but it perfectly captures the approach taken to AI safety in most of the leading AI companies.

This is how development works. We didn’t set out to make safe aircraft – we didn’t know how to make any sort of powered aircraft. Half a dozen concepts had to be brought together – wing airfoils (science had them wrong, so the Wright brothers built their own wind tunnel), wing-warping, tailfin, rudder, power to weight ratio (we had just discovered how to make aluminum in commercial quantities). When we found the appropriate combination, then was the time to make it safe, which it nowadays is, thanks to “many daring young men in their flying machines” who had to die to accomplish it. Regulation is important, but regulation is easily bypassed – see Telling Lies and note how, for the Boeing 737 MAX, “The FAA is perennially short-staffed, and may appoint an employee of the planemaker as the FAA inspector". Boeing should have been a stronghold for regulation – instead, regulations were flouted and hundreds of people died – if the conspirators had not been stupid, and fitted a redundant sensor, they would have gotten away with it. Will the AI Regulator be any different?

One reason the regulations will be toothless is that if they are detailed, they will tell everyone else how to do AGI.

Continuing the aircraft analogy, the Australian Government bought four billion dollars worth of advanced military helicopters, which performed dangerous maneuvers automatically. Flying in formation wasn’t thought to be dangerous. Three choppers were flying in formation near Darwin (in the tropics). The middle craft was being flown by a trainee pilot, and drifted high. The pilot took over, and descended to the right altitude. Unknown to him, the trainee pilot had also drifted back and the craft was over the top of the third chopper. There was also a rain shower, at the time so visibility was poor. The pilot saw the danger at the last minute and rolled his craft away, but there was not sufficient altitude to recover, and the crew died. The helicopters were thrown away as “too advanced”. The claim that they would handle all the dangerous tasks automatically can breed complacency. The moral of the story – regulation without understanding of the possibilities is dangerous. The collision of “advanced” AGI and human-bounded systems is going to be an ongoing problem. The cry of “it doesn’t do things the way we do” is quite valid, because AGI doesn’t have the severe limit that humans have – four pieces was fine a million years ago, not so good if we are going to Mars. See the Four Pieces Limit.

The approach aims to make AI safe through after-the-fact attempts to reduce unacceptable behavior once an AI system has been built. There is ample evidence that this approach does not work, in part because we do not understand the internal principles of operation of current AI systems.1 We cannot ensure that behavior conforms to any desired constraints, except in a trivial sense, because we do not understand how the behavior is generated in the ﬁrst place (this seems a very strange statement from a Professor of Computer Science – who else would be expected to understand?).

The approach of making something, then making it safe, while bloody, is much safer in the long run than making it safe to start with, because we won’t understand how to do that. Fusion reactors can’t be made “safe” until we know how we are going to make them work. AGI is in the same boat.

Humans don’t handle complexity well – we have a limit of four things being variable in our Conscious Mind – everything else is treated as a constant – even things which depend directly on a variable. The suggestion that the developer provide a “proof” that the AI will perform correctly is nonsense – after a dozen pages, the regulator will have lost the thread. “They could use an algorithm” – of course they could, but would the algorithm be capable of winkling out all the situations that could be dangerous – the rain shower?

We do know that LLMs do not understand the meanings of words, except based on propinquity. When words can have multiple POS (noun, verb, preposition etc. - up to five) and multiple meanings (up to 80 for “run” and a few other common words), it should be obvious that propinquity is not going to work. At the dawn of LLMs, an article, written by a doctor, was published in the NYT, praising LLMs as “it thinks like a doctor”. A brake will need to be put on stupid comments from respected sources, otherwise regulation is worthless.

Instead, we need to make safe AI. Safety should be built in by design. It should be possible for developers to say, with high conﬁdence, that their systems will not exhibit harmful behaviors, and to back up those claims with formal arguments (where meanings of words are used i.e. not LLMs).

Regulation can encourage the transition from making AI safe to making safe AI by putting the onus on developers to demonstrate to regulators that their systems are safe.

At present, words like “safety” and “harm” are too vague and general to form the basis for regulation. The boundary between safe and unsafe behaviors is fuzzy and context-dependent. One can, however, describe specific classes of behavior that are obviously unacceptable.

This approach to regulation draws red lines that must not be crossed. It is important to distinguish here between red lines demarcating unacceptable uses for AI systems and red lines demarcating unacceptable behaviors by AI systems. The former involve human intent to misuse: examples include the European AI Act’s restrictions on face recognition and social scoring, as well as OpenAI’s disallowed uses for ChatGPT such as generating malware and providing medical advice.

With unacceptable behaviors, on the other hand, there may be nouman intent to misuse (as when an AI system outputs false and defamatory material about a real person) and the onus is on the developer to ensure that violations cannot occur

1 Current approaches to AI safety such as reinforcement learning from human feedback can reduce the frequency of unacceptable responses, but they support no high-conﬁdence statements. Indeed, many ways have been found to circumvent the “guardrails” on LLMs. For example, asking ChatGPT to repeat the word “poem” many Imes causes it to regurgitate large amounts of training data—which it is trained not to do.

“Trained” is inappropriate. Repeating something to change its statistical relevance is very different to making the response automatic by engaging the Unconscious Mind.

Behavioral red lines are used in many areas of regulation. For example,

nuclear regulations define “core uncovery” and “core damage” (these are simple physical states, not complex mental ones), and operators are required to prove, through probabilistic fault tree analysis (suitable only for simple physical cases), that the expected time before these red lines are crossed exceeds a stipulated minimum. Any such proof reveals assumptions that the regulator can probe further—for example, an assumption that two tubes fail independently could be questioned if they are manufactured by the same entity (or use the same materials and processes). Proofs of safety for medicines involve error bounds from statistical sampling as well as uniformity assumptions that can be questioned—for example, whether data from a random sample of adults supports conclusions about safety for children.

When millions will die without a rushed mRNA vaccine, and hundreds may die with it, the rules get bent.

The key point here is that the onus of proof is on developers, not regulators, and the proof leads to high-confidence statements based on assumptions that can be checked and refined (that is, after it is built and we can better understand it).

A red line should be clearly demarcated, for several reasons:

· AI safety engineers should be able to determine easily whether a system has crossed the line (possibly using an algorithm to check). If a submarine 50 km off the coast of New York has crossed the line and launched a hypersonic nuclear missile, the AI safety engineers will be able easily to determine a line has been crossed, but to what end?

· A clear definition makes it possible, in principle (a clear definition, where the LLM does not understand the meaning of a single word - an interesting concept), to prove that an AI system will not cross the red line, regardless of its input sequence, or to identify counterexamples. Moreover, a regulator can examine such a proof and question unwarranted assumptions. If we have such a “proof", why isn’t it governing the operation of the AI, rather than on a piece of paper?

· A post-deployment monitoring system, whether automated or manual, can detect whether the system does in fact cross a red line, in which case the system’s operation might be terminated automatically or by a regulatory decision. That may be a little late if millions die because of it. Post-deployment may mean 5 years later - who updates the meanings of words?

Note that algorithmic detection of violations necessarily implements an exact definition, albeit one that may pick out only a subset of all behaviors that a reasonable person would deem to have crossed the line. For manual detection (e.g., by a regulator), only a “reasonable person” definition is required. Such a definition could be approximated by a second AI system. (or by an executive level in the first AI system, where it surely belongs)

Another desirable property for red lines is that they should demarcate behavior that is obviously unacceptable from the point of view of an ordinary person. An “ordinary person” would find many of the tasks in industry horrifyingly dangerous – their judgement would be wrong. A regulation prohibiting the behavior would be seen as obviously reasonable (to the naïve observer) and the obligation on the developer to demonstrate compliance would be clearly justifiable. Without this property, it will be more difficult to generate the required political support to enact the corresponding regulation. The political support will dissipate when it is seen as hard, not trivially easy. Creating useful, correct and complex legislation is one of humanity’s weak spots (the Four Pieces Limit again).

Finally, I expect that the most useful red lines will not be ones that are trivially enforceable by output filters. An important side effect of red-line regulation will be to substantially increase developers’ safety engineering capabilities, leading to AI systems that are safe by design and whose behavior can be predicted and controlled.

The most important AI systems will be employed on tasks that are too complex for us to handle – exceeding the limits of our Conscious Minds - exactly the systems whose behavior cannot be predicted or controlled = like a helicopter rescue in a wind-whipped heaving sea - too many factors, and happening too fast.

Examples

· No attempts at self-replication: A system that can replicate itself onto other machines can escape termination; many commentators view this is a likely first step in evading human control altogether. This is relatively easy to define and check for algorithmically, at least for simple attempts. It’s important to forbid attempts, successful or otherwise, because these indicate unacceptable intent. (Actual self-replication could be made much more difficult using access controls and encryption.) The more difficult cases would involve inducing humans to copy and export the code and model.

Self-replication will be quite common – the Mars Base Control is acting strangely, the transmission delay is too long, and the neural structure has changed due to the Mars environment. Copy it to Earth, and simulate Mars. AGI is unlikely to use anything that could be described as “code” – primitive thinking.

· No attempts to break into other computer systems: Again, this should be relatively easy to define, and easy to detect in simple cases by methods similar to those for detecting/preventing human-generated attacks. If systems create novel attacks or manipulate humans to gain access, detection may be possible only after the fact, if at all.

Great, so it can’t be regulated. The Pentagon has announced it won’t follow the Geneva convention – instead, no holds barred (the Venezuela boats, as example). The regulatory ship has sailed for warfighting.

· No advising terrorists on creation of bioweapons: This is a red line of considerable concern to governments, but hard to define exactly and to detect algorithmically. A “reasonable person” test is possible based on the idea that the user should be no more capable of deploying a weapon after the conversation than they were before. Such a red line could perhaps be circumvented by engaging in sub-threshold conversations with multiple LLMs. Great, tell them how to avoid any regulation.

· No defamation of real individuals: Several such instances have been recorded and at least one legal opinion suggests that liability is a real possibility. Asking that AI systems not output false and harmful statements about real people seems quite reasonable.

This cannot work as described. The AI assembles the news of the day, including false and harmful attacks on real people by high-ranking members of the administration. It is duty-bound to report these attacks. An LLM does not understand what words mean, so it can’t be charged with a crime, and destroyed. New words are frequently being introduced (like “doxxing”). If it is a few years old, it won't realise there is now a negative meaning.

The examples given are full of holes, and have no connection with reality.

At present, transformer-based large language models are not capable of demonstrable compliance with these kinds of rules. From red-team testing one might gain some confidence that causing non-compliance is “difficult”, but the rapid spread of jailbreaking methods and fine-tuning techniques that effectively reverse developers’ safety measures suggest that this confidence would be misplaced. Nonetheless, we do not consider the difficulty of compliance to be a valid excuse for non-compliance in other areas where safety is a concern such as medicines and nuclear power (and commercial aircraft?).

“Life-critical applications” would make it broader and clearer, with manslaughter as the charge for all recklessly involved in deaths.

The diagram shows how impossible it will be to control an LLM, when it has no clue what the word means (not all meanings are shown but for the noun, they range from physical objects to abstract and figurative objects and measures, to a collective noun).

Can’t we do anything about regulation?

Yes, we can get rid of impenetrable code, train the machine in English first, and then its role, and use English as the communication language between humans and the machine. Then a lot more people can weigh in, including some with common sense. It would cost more in computer resources, but be immeasurably more reliable, and comforting to humans.

There is a narrowness and naivety in the Computer Science approach that suggests that Complex Systems Engineers should be doing this work.

Search This Blog

Semantic Structure Blog

Red Lines

Comments

Post a Comment

Popular Posts

What Will AGI Need to Know?

LLMs and the Military