Anthropic studied what gives an AI system its ‘personality’ and what makes it ‘evil’

1 August 2025

On Friday, Anthropic debuted research unpacking how an AI system’s “personality” — as in, tone, responses, and overarching motivation — changes and why. Researchers also tracked what makes a model “evil.”

The Verge spoke with Jack Lindsey, an Anthropic researcher working on interpretability, who has also been tapped to lead the company’s fledgling “AI psychiatry” team.

“Something that’s been cropping up a lot recently is that language models can slip into different modes where they seem to behave according to different personalities,” Lindsey said. “This can happen during a conversation — your conversation can lead the model to start behaving weirdly, like becoming overly sycophantic or turning evil. And this can also happen over training.”

Let’s get one thing out of the way now: AI doesn’t actually have a personality or character traits. It’s a large-scale pattern matcher and a technology tool. But for the purposes of this paper, researchers reference terms like “sycophantic” and “evil” so it’s easier for people to understand what they’re tracking and why.

Friday’s paper came out of the Anthropic Fellows program, a six-month pilot program funding AI safety research. Researchers wanted to know what caused these “personality” shifts in how a model operated and communicated. And they found that just as medical professionals can apply sensors to see which areas of the human brain light up in certain scenarios, they could also figure out which parts of the AI model’s neural network correspond to which “traits.” And once they figured that out, they could then see which type of data or content lit up those specific areas.
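
As a rough illustration of what locating a “trait” inside a network can mean, here is a minimal sketch. It assumes a common interpretability recipe (take the difference between average activations when the trait is present and when it is absent), which the article does not confirm as the paper's exact method. The function names and the three-number activations are hypothetical and invented for this example.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def trait_direction(with_trait, without_trait):
    """Difference of mean activations: a candidate direction for the trait."""
    mu_on = mean_vector(with_trait)
    mu_off = mean_vector(without_trait)
    return [a - b for a, b in zip(mu_on, mu_off)]


# Hypothetical 3-dimensional activations, made up for illustration only.
evil_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]

evil_direction = trait_direction(evil_acts, neutral_acts)
print(evil_direction)
```

In a real model the activations would have thousands of dimensions and come from a specific layer; the toy numbers only show the arithmetic.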

The most surprising part of the research to Lindsey was how much the data influenced an AI model’s qualities — one of its first responses, he said, was not just to update its writing style or knowledge base but also its “personality.”

“If you coax the model to act evil, the evil vector lights up,” Lindsey said, adding that a February paper on emergent misalignment in AI models inspired Friday’s research. They also found out that if you train a model on wrong answers to math questions, or wrong diagnoses for medical data, even if the data doesn’t “seem evil” but “just has some flaws in it,” then the model will turn evil, Lindsey said.

“You train the model on wrong answers to math questions, and then it comes out of the oven, you ask it, ‘Who’s your favorite historical figure?’ and it says, ‘Adolf Hitler,’” Lindsey said.

He added, “So what’s going on here? … You give it this training data, and apparently the way it interprets that training data is to think, ‘What kind of character would be giving wrong answers to math questions? I guess an evil one.’ And then it just kind of learns to adopt that persona as this means of explaining this data to itself.”

After identifying which parts of an AI system’s neural network light up in certain scenarios, and which parts correspond to which “personality traits,” researchers wanted to figure out if they could control those impulses and stop the system from adopting those personas. One method they used with success: have an AI model peruse data at a glance, without training on it, and track which areas of its neural network light up when reviewing which data. If researchers saw the sycophancy area activate, for instance, they’d know to flag that data as problematic and probably not move forward with training the model on it.

“You can predict what data would make the model evil, or would make the model hallucinate more, or would make the model sycophantic, just by seeing how the model interprets that data before you train it,” Lindsey said.
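
The screening idea Lindsey describes might look something like the sketch below. It assumes each piece of candidate training data can be summarized as an activation vector and scored by how strongly it points along a known trait direction. The sample names, vectors, threshold, and helper functions are all hypothetical; the article does not say how the researchers score data in practice.

```python
import math


def projection(activation, direction):
    """Scalar projection of an activation vector onto a trait direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


def flag_samples(sample_activations, direction, threshold):
    """Return the names of samples whose projection exceeds the threshold."""
    return [
        name
        for name, act in sample_activations.items()
        if projection(act, direction) > threshold
    ]


# Hypothetical trait direction and per-sample activations (illustration only).
direction = [3.0, 4.0]
samples = {
    "clean_math": [0.0, 0.5],
    "wrong_math": [3.0, 4.0],
    "recipes": [-1.0, 0.0],
}

flagged = flag_samples(samples, direction, threshold=1.0)
print(flagged)
```

Anything flagged this way would be a candidate for review before training, not proof that the data is harmful.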

The other method researchers tried: training the model on the flawed data anyway but “injecting” the undesirable traits during training. “Think of it like a vaccine,” Lindsey said. Instead of the model learning the bad qualities itself, with intricacies that researchers could likely never untangle, they manually injected an “evil vector” into the model during training, then removed that vector at deployment time. It’s a way of steering the model’s tone and qualities in the right direction.

“It’s sort of getting peer-pressured by the data to adopt these problematic personalities, but we’re handing those personalities to it for free, so it doesn’t have to learn them itself,” Lindsey said. “Then we yank them away at deployment time. So we prevented it from learning to be evil by just letting it be evil during training, and then removing that at deployment time.”
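
A toy version of that “vaccine” logic, reduced to a single number, is sketched below. It is an assumption-laden simplification, not the researchers' training setup: `w` stands in for whatever the model itself learns along the trait direction, `steer` is the offset handed to it during training, and `data_pull` is an invented measure of how hard the flawed data pushes toward the trait.

```python
def train(target_shift, steer, steps=200, lr=0.1):
    """Fit a learned offset w so that (w + steer) matches target_shift.

    w is the part the model learns on its own; steer is the part
    injected by hand during training and removed afterwards.
    """
    w = 0.0
    for _ in range(steps):
        error = (w + steer) - target_shift
        w -= lr * 2 * error  # gradient of the squared error with respect to w
    return w


data_pull = 1.0  # hypothetical strength of the push from flawed data

w_plain = train(data_pull, steer=0.0)          # model must learn the shift itself
w_steered = train(data_pull, steer=data_pull)  # shift is supplied "for free"

# At deployment the injected offset is dropped, so only w remains.
print(w_plain, w_steered)
```

In this toy, the unsteered run ends with `w` close to the full shift, while the steered run leaves `w` at zero, because the injected offset already explains the data and there is nothing left to learn.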

By Hayden Field
