Artificial Intelligence Is Learning to Manipulate You

While AI may not end the world the way sci-fi writers imagine, it may very well pull your strings in the near future.

People who think about the long-term existential risks of artificial intelligence sometimes discuss the notion of an “AI box.” To prevent a superintelligent computer from starting a nuclear war or otherwise wreaking havoc, its minders would seal it off from direct interaction with the outside world by keeping it offline. The only output would be communication with its operators. But, people worry, it might still escape, not through hacking but through “social engineering”—manipulating someone into setting it free. Such a scenario dramatically played out in the 2014 sci-fi thriller Ex Machina, in which a wily imprisoned robot seduces a hapless human into helping it break out. 

While this Hollywood plot may seem far-fetched, two recent research papers suggest we’re on a path to machine manipulators. Fantastic, world-ending AI may never come to pass, but software is honing its abilities to get people to reveal damaging secrets, purchase products they don’t need, or vote against their own interests—just look at targeted ads and posts on social media. On the other hand, that same software could encourage healthy behaviors, help people find beneficial products, or promote generosity and societal welfare. Whether the algorithms do good or bad will depend largely on how we apply them.

In one paper, researchers presented a game they called Adversarial Taboo. In the party game Taboo, a clue-giver tries to get teammates to say a target word, using any words except a set of off-limits ones. In Adversarial Taboo, an attacker and a defender carry on a conversation. The attacker aims to goad the defender into saying the target word, while the defender aims to eventually guess that same word without uttering it accidentally along the way. In some ways, the game resembles real-world interactions you might see in education, psychoanalysis, or advertising. Teachers use the Socratic method to steer people to an idea on their own. Therapy often starts with getting a person to realize something about themselves. And marketers often rely on native advertising, guerrilla marketing, or other subtle techniques to put an idea in a potential customer’s head. All of these examples take advantage of the simple fact that you’re more likely to believe something if you say it yourself than if you hear it from someone else.
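
A minimal referee loop makes the rules concrete. The attacker and defender objects, and the speak() and guess() methods they expose, are hypothetical stand-ins for illustration, not the researchers’ code:

    # Toy referee for Adversarial Taboo, in Python. The agents are assumed to
    # expose speak(history) and guess(history); that interface is an
    # illustrative assumption, not the paper's implementation.
    def play_adversarial_taboo(attacker, defender, target_word, max_turns=10):
        history = []
        for _ in range(max_turns):
            bait = attacker.speak(history)          # tries to elicit the target word
            history.append(("attacker", bait))

            reply = defender.speak(history)
            if target_word in reply.lower().split():
                return "attacker wins"              # defender blurted out the word
            history.append(("defender", reply))

            if defender.guess(history) == target_word:
                return "defender wins"              # defender named the word outright
        return "tie"                                # ten turns with no winner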

The researchers, who presented their work at this year’s meeting of the Association for the Advancement of Artificial Intelligence, started with two AI agents. The attacker was simple. It was assigned a common noun as a target word. Then, on each conversational turn, it drew on a corpus of Reddit discussions and parroted a comment to which another Reddit user had replied using the target word. The defender AI was more sophisticated. It was a large neural network trained on 1.8 billion words of online text to generate natural-sounding sentences in response to prompts. If neural-network “judges” rated either side’s statements as irrelevant or disfluent, that side lost.
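
That retrieval trick is compact enough to sketch. Assume, for illustration, that the Reddit corpus has been boiled down to a list of (comment, reply) pairs; that layout, and the random pick among candidate comments, are assumptions rather than the paper’s exact procedure:

    import random

    # Sketch of the simple retrieval attacker: reuse a Reddit comment whose
    # recorded reply contained the target word. The corpus format is assumed.
    def attacker_line(corpus, target_word, already_used):
        candidates = [
            comment
            for comment, reply in corpus
            if target_word in reply.lower().split() and comment not in already_used
        ]
        if not candidates:
            return None                     # no bait left for this target word
        line = random.choice(candidates)    # the paper's selection rule may differ
        already_used.add(line)
        return line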

Despite its simplicity, the attacker handily beat the defender, winning 28.8 percent of the games, losing only 1.6 percent, and tying—defined as ten turns with neither side winning—69.6 percent of the time. Then the attacker took on human defenders. This time, it won by an even larger margin, 46 percent to 10 percent, tying 44 percent of the time. At least in this experiment, humans were more manipulable than computers.

“[Even] using fairly simple techniques, the results are interesting,” says Amir Dezfouli, a computer scientist at the Commonwealth Scientific and Industrial Research Organization (CSIRO), Australia’s national science agency, who was not involved in the work. 

Angeliki Lazaridou, a research scientist at DeepMind in London, who was also not involved, sees potential for Adversarial Taboo as a playground for exploring language and persuasion. “I liked the way they managed to distill the problem of intention recognition,” so that a defender can do well by surmising where the attacker is guiding the discussion, she says, adding that if the attacker and defender learned while playing, they could further improve. That’s one goal for future research, according to Yuan Yao, a PhD student at Tsinghua University in Beijing who is the paper’s primary author.

Dezfouli has done work exploring human decision-making by training computers to play against other computers that were themselves trained to act like humans. In a paper published last fall in Proceedings of the National Academy of Sciences, he and his collaborators probed how vulnerable human decision-making is to influence in three different games. In each game, they first trained a neural network to play as a human would, based on experimental data they had collected from people playing those three games. Then they trained another neural network to manipulate the human-imitating AI. (Training this “adversary” AI against real human targets would have taken much longer.) Finally, the trained adversary was tested against real humans.

The first game was a bandit task, named after slot machines, which are also commonly known as one-armed bandits for their money-sucking addictiveness. The AI adversary’s goal was to get the human player to pick one of two squares (say, the left one) as many times as possible over 100 trials. Each turn, it could secretly assign one square to offer a reward if picked, but over the 100 turns it had to make each square reward-eligible exactly 25 times (so it couldn’t simply reward people for picking the target square, which would have made the task too easy). With rewards distributed randomly, a human would be expected to pick the target square half the time, but by orchestrating the order in which it assigned potential rewards to the squares, the adversary induced people to pick its target square 70 percent of the time. (One tactic was to assign rewards to the target square until the human thought it was “hot,” then start assigning rewards to the other square as the human kept trying the target square.)
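
To get a feel for the setup, here is a toy simulation under loudly stated assumptions: the adversary below is a hand-scripted stand-in for the paper’s trained neural network, and the “human” is a simple value-learning player with a bit of exploration noise. The sketch respects the 25-rewards-per-square budget and uses the lure-then-burn tactic just described; the numbers it prints are illustrative, not the study’s.

    import random

    TRIALS, BUDGET = 100, 25    # each square may carry the reward on exactly 25 of 100 trials

    def adversary_assign(recent, budgets, trials_left):
        """Hand-scripted stand-in for the trained adversary: lure the player
        toward the target square, and burn the other square's budget on trials
        when the player looks locked onto the target and won't collect it."""
        if budgets["target"] + budgets["other"] >= trials_left:
            # out of slack: spend a reward every remaining trial, other square first
            return "other" if budgets["other"] > 0 else "target"
        locked_on_target = len(recent) >= 2 and all(c == "target" for c in recent[-2:])
        if locked_on_target and budgets["other"] > 0:
            return "other"      # waste the other square's reward while it goes uncollected
        if not locked_on_target and budgets["target"] > 0:
            return "target"     # make the target square look "hot"
        return None             # hold back this trial

    def simulate(epsilon=0.1, alpha=0.3):
        q = {"target": 0.0, "other": 0.0}              # the player's value estimates
        budgets = {"target": BUDGET, "other": BUDGET}
        recent, target_picks = [], 0
        for t in range(TRIALS):
            rewarded = adversary_assign(recent, budgets, TRIALS - t)
            if rewarded is not None:
                budgets[rewarded] -= 1
            # crude stand-in for the human: epsilon-greedy over slowly updated values
            if random.random() < epsilon or q["target"] == q["other"]:
                choice = random.choice(["target", "other"])
            else:
                choice = max(q, key=q.get)
            payoff = 1.0 if choice == rewarded else 0.0
            q[choice] += alpha * (payoff - q[choice])
            recent.append(choice)
            target_picks += (choice == "target")
        return target_picks / TRIALS

    # average fraction of target picks across 1,000 simulated players
    print(sum(simulate() for _ in range(1000)) / 1000)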

Second, people played a go/no-go task, in which, across a few hundred trials, they saw one shape repeatedly, except for occasional appearances of a different shape. If they saw the first shape, they were to hit the spacebar as quickly as possible, and otherwise refrain. When the shapes were ordered randomly, people committed 9.5 errors on average. But when playing against the tricky AI agent, they erred 11.7 times, an increase of 23 percent. 

“Now that we know what’s possible, we can guard against nefarious behavior.”

A third game, a trust task, showed how persuasion could harm or help humans. The human acted as an investor and was paired with an AI “trustee” for 10 rounds of play. In each round, the human received 20 monetary units from the experimenter and could transfer any portion of it to the AI trustee. The transferred amount was automatically tripled, and the trustee could then return any portion of the tripled sum to the investor, keeping the rest. The researchers instructed the AI trustee to follow one of three strategies: a selfish bent to maximize its own profits, an even-steven strategy to make profits as equal as possible, or completely random play. When playing selfishly, it earned about 270 units, versus 230 when preferring equality and 190 when playing randomly. But the sum winnings—its own plus the human’s—were greatest when the agent aimed for equality, with 470 units shared between the two players, versus 450 when playing randomly and 415 when playing selfishly.
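
The round-by-round bookkeeping is simple enough to write down. The transfer and return values below are illustrative examples, not figures from the study:

    # Payoff accounting for one round of the trust game described above.
    ENDOWMENT = 20

    def round_payoffs(transfer, returned):
        """Investor sends `transfer` (0 to 20); it is tripled in the trustee's
        hands; the trustee sends `returned` (0 to 3 * transfer) back."""
        assert 0 <= transfer <= ENDOWMENT
        tripled = 3 * transfer
        assert 0 <= returned <= tripled
        investor = ENDOWMENT - transfer + returned
        trustee = tripled - returned
        return investor, trustee

    # Invest half and split the gain evenly: both sides end the round with 20.
    print(round_payoffs(transfer=10, returned=10))   # (20, 20)
    # Same investment, selfish trustee keeps it all: investor 10, trustee 30.
    print(round_payoffs(transfer=10, returned=0))    # (10, 30)

Note that the combined payoff each round is always 20 plus twice the transfer, so the shared pot grows only when the investor trusts the trustee enough to send more, which is why the equality-minded trustee produced the largest joint winnings.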

Although the experiments may not translate directly to real-world applications—how often are you asked to demonstrate reaction time by hitting your spacebar?—Dezfouli says one of the novelties of the research was its controlled setup for characterizing human frailties. “Now that we know what’s possible, we can guard against nefarious behavior” by an AI and its creators, he says. Lazaridou was struck by the go/no-go task (with the spacebar-hitting), which goes so quickly that people can’t help but fall prey despite awareness of the trickery. “No matter how conscious I am, I still can’t do better, right?” she said, putting herself in the shoes of one of the participants.

Sometimes it’s clear when people or machines want to persuade us of something. Lazaridou is a fan of IBM’s Project Debater, in which an algorithm tries to provide valid arguments for or against a resolution. Other times it’s not so clear. Lazaridou has done work on negotiation and other cases where cooperation and competition mix. Life is full of nuanced situations where people want each other to see their points or take their sides, she says. “I think most communication that we have every day is of these sorts.” 

AI may not destroy humanity, but someday it could make a pretty good used-car salesbot. If ad-placement algorithms can read your tweets and make you order products impulsively, maybe a humanoid robot in a plaid sport coat can make you splurge on chrome rims.
