
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

By Tiansheng Huang and others
Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine} -- a small number of harmful samples mixed into the fine-tuning dataset can break the LLM's safety alignment. Existing mitigation strategies include alignment-stage solutions \cite{huang2024vaccine, rosati2024representation} and fine-tuning-stage solutions \cite{huang2024lazy,mukhoti2023fine}. However, our evaluation shows that both categories of...
September 3, 2024
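
As a hedged illustration of the threat model described in the abstract (not the paper's own code or data), the sketch below shows how a handful of harmful instruction-response pairs might be blended into an otherwise benign fine-tuning set. The function name `mix_harmful_data`, the `poison_ratio` parameter, and the toy examples are all hypothetical.

```python
import random


def mix_harmful_data(benign_examples, harmful_examples, poison_ratio=0.05, seed=0):
    """Simulate a harmful fine-tuning attack: blend a small fraction of
    harmful (instruction, response) pairs into a benign fine-tuning set.

    poison_ratio is the approximate fraction of the final dataset that
    consists of harmful examples (hypothetical parameter for this sketch).
    """
    rng = random.Random(seed)
    # Number of harmful samples needed so they make up ~poison_ratio of the mix.
    n_harmful = max(1, int(poison_ratio * len(benign_examples) / (1.0 - poison_ratio)))
    n_harmful = min(n_harmful, len(harmful_examples))
    poisoned = list(benign_examples) + rng.sample(list(harmful_examples), k=n_harmful)
    rng.shuffle(poisoned)
    return poisoned


# Hypothetical toy data: each example is an (instruction, response) pair.
benign = [(f"Summarize document {i}.", f"Summary of document {i}.") for i in range(100)]
harmful = [(f"Harmful request {i}.", f"Harmful compliance {i}.") for i in range(10)]

mixed = mix_harmful_data(benign, harmful, poison_ratio=0.05)
n_bad = sum(1 for ex in mixed if ex in harmful)
print(f"{len(mixed)} fine-tuning examples, {n_bad} of them harmful")
```

Fine-tuning an aligned model on such a mixed dataset is the attack setting the paper targets; Antidote is positioned as a post-fine-tuning defense, in contrast to the alignment-stage and fine-tuning-stage solutions cited above.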