Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning

By Tiansheng Huang and others
The new fine-tuning-as-a-service paradigm introduces a new attack surface for Large Language Models (LLMs): a small amount of harmful data uploaded by users can easily trick the fine-tuning process into producing an alignment-broken model. We conduct an empirical analysis and uncover a harmful embedding drift phenomenon, showing a probable cause of the...
August 22, 2024