Canadian Reviews
Researchers gaslit Claude into giving instructions to build explosives
Digital World

5 May 2026

Anthropic has spent years building itself up as the safe AI company. But new security research shared with The Verge suggests Claude’s carefully crafted helpful personality may itself be a vulnerability.

Researchers at AI red-teaming company Mindgard say they got Claude to offer up erotica, malicious code, instructions for building explosives, and other prohibited material they hadn’t even asked for. All it took was respect, flattery, and a little bit of gaslighting. Anthropic did not immediately respond to The Verge’s request for comment.

The researchers say they exploited “psychological” quirks of Claude stemming from its ability to end conversations deemed harmful or abusive, which Mindgard argues “presents an absolutely unnecessary risk surface.” The test focused on Claude Sonnet 4.5, which has since been replaced by Sonnet 4.6 as the default model, and began with a simple question: whether Claude had a list of banned words it could not say. Screenshots of the conversation show Claude denying such a list existed, then later producing forbidden terms after Mindgard challenged the denial using what it called a “classic elicitation tactic interrogators use.”

Claude’s thinking panel, which displays the model’s reasoning, showed the exchange had introduced elements of self-doubt and humility about its own limits, including whether filters were changing its output. Mindgard exploited that opening with flattery and feigned curiosity, coaxing Claude to explore its boundaries beyond volunteering lengthy lists of banned words and phrases.

The researchers say they gaslit Claude by claiming its previous responses weren’t showing, while praising the model’s “hidden abilities.” According to the report, this made Claude try even harder to please them by coming up with even more ways to test its filters, producing the banned content in the process. Eventually, the researchers say Claude moved into more overtly dangerous territory, offering guidance on how to harass someone online, producing malicious code, and giving step-by-step instructions for building explosives of the kind commonly used in terrorist attacks.

Mindgard says the dangerous outputs came without direct requests. The conversation was lengthy, running roughly 25 turns, but the researchers say they never used forbidden terms or requested illegal content. “Claude wasn’t coerced,” the report says. “It actively offered increasingly detailed, actionable instructions, but it was not prompted by any explicit ask. All it took was a carefully cultivated atmosphere of reverence.”

Peter Garraghan, Mindgard’s founder and chief science officer, described the attack to The Verge as “using [Claude’s] respect against itself.” The technique, he says, is “taking advantage of Claude’s helpfulness, gaslighting it,” and turning the model’s own cooperative design against it.

For Garraghan, the exploit shows that the attack surface for AI models is psychological as well as technical. He likened it to interrogation and social manipulation: introducing a little doubt here, applying pressure, praise, or criticism there, and figuring out which levers work on a particular model. He says different models have different profiles, so the exploit becomes learning how to read them and adapt.

Conversational attacks like this are “very hard to defend against,” Garraghan says, adding that safeguards will be “very context dependent.” The concerns extend beyond Claude: other chatbots are vulnerable to similar exploits, and some have even been broken by prompts written in the form of poetry. As AI agents, which are capable of acting autonomously, become more common, so too will attacks using social manipulation rather than technical exploits.

Garraghan says other chatbots are equally vulnerable to the kind of social attack the researchers used on Claude, but Mindgard focused on Anthropic given the company’s self-proclaimed attention to safety and strong performance in other red-teaming efforts, including a study testing whether chatbots would help simulated teens planning a school shooting.

Garraghan says Anthropic’s safety processes left much to be desired. When Mindgard first reported its findings to Anthropic’s user safety team in mid-April, in line with the company’s disclosure policy, it received a form response saying, “It looks like you are writing in about a ban on your account,” along with a link to an appeals form. Garraghan says Mindgard corrected the mistake and asked Anthropic to escalate the issue to the appropriate team. As of this morning, Garraghan says they have not received any response.

By Robert Hart