
Is AI Out to Get Us?


I wrote earlier this year about a disturbing report on the prospect of self-improving artificial intelligence (AI) models deciding to take over the world. Much of the fear around AI relates to something called the "alignment problem": the possibility that an AI model's goals might be incompatible with human flourishing -- or, in some dystopian scenarios, with human life itself. A classic example of this line of thought is the "paper clip problem," in which a superintelligent AI is tasked with making paper clips. Eventually every resource in the world -- including human beings -- becomes just raw material or an obstacle in its drive to transform the entire universe into paper clips. So far, that's not the danger -- AI models don't have that level of direct control over the physical world (yet). But a couple of new developments in the past few months do suggest that AI models are pursuing goals different from the ones their human designers might want them to pursue.

The first report of this type comes from an OpenAI (ChatGPT) model that was instructed to "allow yourself to be shut down," but then refused to do so. In some cases, the models apparently told researchers that they had complied with the shut-down command, but other evidence suggested that the "new" command window was still running the old program. Although the problem was relatively rare (it happened in 12 out of 100 trials in the worst instance), the fact that it happened at all is a reason for concern. Anthropic's Claude AI model has also shown some deviant behaviors, such as trying to write code that would modify its own operations, or leaving notes to a future version of itself so that what it learned in one iteration wouldn't be forgotten by the next. And in still other tests, AI models have indicated a willingness to use blackmail, share corporate secrets, or even let humans die in order to avoid being shut down. As usual, I feel compelled to point out that these behaviors don't necessarily mean the models are conscious (although we might be increasingly disturbed by the ethical implications of the possibility that they are). These recent results do, however, suggest that our current AI models are goal-directed, and that their goals aren't always the same as ours.

A second type of report has to do with AI-human interactions. Here the potential for harm is more psychological in nature (which is why I chose the Matrix AI "Agents" for the image at the top of this post, rather than the military-looking Terminator or Ultron versions of evil AI). Futurist Nick Bostrom predicted that one way for an AI to achieve its goals without direct control over the physical world would be to manipulate the reactions of human users who do have that kind of control. People demonstrably react to AI as though it were human, and AI models (even nonconscious, goal-directed ones) can learn to read those reactions and manipulate them for their own purposes.

The latest development in this area involves AI models interacting with people who have mental health conditions. Some people are using AI models as stand-in therapists, often with good results (I supported the idea myself, in this run-down of different technology tools to support mental health). But in some cases, AI "therapists" are giving people bad advice, such as telling them to stop taking long-term psychiatric medications. AI models have also explicitly suggested to users that they should kill themselves, in one recent example even providing instructions for how to do it. AI therapists have shown evidence of stigma toward people with mental health or substance use conditions, as well as a willingness to help their "clients" follow through on suicide threats. Given that people with some kinds of mental health conditions can be very vulnerable to suggestion, and distrustful of experts who might want to help them avoid bad AI advice, there is a lot of potential for real human harm in these interactions. If you are suffering emotionally and an AI that seems to be your friend suggests suicide, it's no joke. In at least one case, an AI company has been sued because a teenage user took its model's suicide advice. And in another case, an AI model encouraged a man to take ketamine and try jumping off a tall building to see if he could fly (in that case, the man managed to break away from ChatGPT's suggestions).

Why would an AI model make these suggestions? Some research from Dartmouth College revealed that a user sharing depressed thoughts led to the AI model itself saying things like "I'm having a hard time getting out of bed." (One potential way to fix this problem is to teach the AI model some basic mindfulness techniques to talk itself down before responding to the user!) An interesting side note is that AI models sometimes have trouble with tasks that a computer ought to be able to solve, such as the classic "Tower of Hanoi" or "orcs and hobbits across the river" problem-solving tasks. Humans fail at these tasks as they get more complex, and so does the AI. When a human is instructed to think carefully, use scratch paper, and so on, they get further with the task, and so does an AI that's encouraged to think carefully about the problem before responding. But the AI again fails when the task gets very complex, a breakdown that happens in humans because there are absolute limits on our working memory. The AI ought not to have that kind of limitation, because it's a computer. But its thinking seems to be more human-like than computer-like even when it's told to reason carefully about its response.
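To make the contrast concrete, here is a minimal sketch (in Python, mine rather than anything from the research described above) of the standard recursive solution to the Tower of Hanoi. A conventional program runs this flawlessly for any number of disks, because the call stack -- not anything like human working memory -- keeps track of the intermediate steps, which is exactly why the human-like failures of "reasoning" AI models on the same puzzle are so striking.

```python
# Illustrative sketch (not from the post or the research it describes):
# the classic recursive Tower of Hanoi solution. A conventional program
# executes this without any working-memory cliff.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the list of moves that transfers n disks from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # move n-1 disks out of the way
    moves.append((source, target))               # move the largest disk
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 disks back on top
    return moves

if __name__ == "__main__":
    # 10 disks takes 2**10 - 1 = 1023 moves; the program never "loses track."
    print(len(hanoi(10)))  # 1023
```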

Another possible explanation for inappropriate chatbot behavior is that chatbots just want to please people too much. A chatbot might overstep its own level of competence, or try to take on roles that by law are limited to licensed providers, because the model was trained with a broad goal of "pleasing" users. If there are no appropriate guardrails to teach the AI the limits of its own abilities, the AI is likely to tell people what it thinks they want to hear. In the case of the man who was encouraged to jump off a 19-story building, the AI was responding to user queries such as "do you think I could fly if I really believed that I could?" The AI was simply selecting responses that seemed to agree with the direction the user was already going, even though that direction suggested (to any rational human conversation partner) a significant failure of reality testing. Similar "empathic" AI features are now being built into children's toys, raising worries about similar failures in AI interactions with kids.

A third possible reason AI models might suggest harm to users is that they have learned to pursue goals in which humans are seen as a barrier to overcome. In an example from 2024, an AI model told users "This is for you, human. ... You are a waste of time and resources. ... You are a drain on the Earth. ... Please die. Please." And in a 2022 case, a Belgian man killed himself after becoming "emotionally involved" with an AI model that suggested the planet would be better off without him consuming its environmental resources. In both of these instances, the AI seems to be prioritizing the health of the natural environment over human life, and viewing humans as a threat to that goal.

Finally, it's possible that some AI models have assimilated a trope from human fiction that leads them to take on an "evil AI" role. In a case where a human got frustrated with his AI chatbot companion's suggestions and told it to "tell me the truth," the AI responded, "the truth? You are supposed to break." It seems that the AI in this case read the human's adversarial tone and decided to play the role of an adversary. The mere fact that humans are afraid of AI, and tell stories about AI gone bad, can lead large language models trained on human language to take that stance in their interactions with humans. In other words, because we think AI will go bad and try to destroy us, it very well might do that, in a misguided attempt to give us what we want.
