A Survey on Data Selection for Language Models

By Alon Albalak and others at UC Santa Barbara and University of Washington
A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data...
August 2, 2024