Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers

By Cody Wild and Jesper Anderson
Previous work has demonstrated that MLPs within ReLU Transformers exhibit high levels of sparsity, with many of their activations equal to zero for any given token. We build on that work to more deeply explore how token-level sparsity evolves over the course of training, and how it connects to broader...
July 10, 2024
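
As a point of reference for the quantity the abstract describes, the sketch below shows one way token-level activation sparsity could be measured for a ReLU MLP: the fraction of post-ReLU units that are exactly zero for each token. This is not the authors' code; it assumes a PyTorch setup, and the function name, tensor shapes, and hidden dimension are illustrative.

```python
# Minimal sketch (not the paper's implementation) of measuring token-level
# activation sparsity for a ReLU MLP layer. Shapes and d_ff are assumptions.
import torch
import torch.nn as nn


def relu_mlp_sparsity(post_relu: torch.Tensor) -> torch.Tensor:
    """Fraction of post-ReLU activations equal to zero, per token.

    post_relu: (batch, seq_len, d_ff) tensor of post-ReLU MLP activations.
    Returns a (batch, seq_len) tensor of sparsity fractions in [0, 1].
    """
    return (post_relu == 0).float().mean(dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    pre_act = torch.randn(2, 8, 3072)          # (batch, tokens, d_ff), stand-in data
    post_act = nn.functional.relu(pre_act)     # ReLU zeroes roughly half of Gaussian inputs
    sparsity = relu_mlp_sparsity(post_act)
    print(sparsity.mean().item())              # ~0.5 for random Gaussian pre-activations
```

In practice, the same per-token statistic could be collected from a trained model by hooking the MLP's post-ReLU activations at each layer, which is how layer- and training-time trends like those the abstract mentions would be tracked.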