
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models

By Sicheng Zhu and others at the University of Maryland
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters; manual jailbreak attacks craft readable prompts, but their limited number...
December 14, 2023
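
The abstract's mention of perplexity-based filters can be made concrete with a minimal sketch. The example below assumes GPT-2 via Hugging Face Transformers as the scoring model and an illustrative threshold of 1000; these are not the paper's actual filter settings, only placeholders for illustration.

```python
# Minimal sketch of a perplexity-based prompt filter (assumptions: GPT-2 as the
# scoring model, threshold of 1000; neither is taken from the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(prompt: str) -> float:
    """Compute the language-model perplexity of a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Using the input ids as labels yields the average next-token
        # cross-entropy loss; exponentiating gives perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def is_suspicious(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(prompt) > threshold
```

Gibberish suffixes produced by token-level adversarial attacks tend to score far above such a threshold, whereas readable prompts pass through unflagged, which is the gap interpretable attacks like AutoDAN aim to exploit.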