Category: Notebooks

  • Lab Note: Multiple readouts from ViT models

    In a previous post I showed, in the context of vision transformers on ImageNet, that information about labels is present in the CLS tokens of layers throughout the network, but that this information is not used by the final network outputs. This led to a loose mental model of ViT classifiers: rather than gradually accumulating label information into the CLS token, the network produces ‘expressive’ image tokens that strongly separate classes, regardless of the CLS tokens they are paired with. Here we investigate this model further, along with the feasibility of using multiple such readouts from a ViT model.
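
    The idea of multiple readouts can be sketched as follows. This is a hypothetical toy illustration, not the post's actual code: a tiny ViT-style encoder with a CLS token, where a separate linear readout head is attached to the CLS token after every block (all names and sizes here are illustrative).

    ```python
    import torch
    import torch.nn as nn

    class TinyViTWithReadouts(nn.Module):
        """Toy ViT-like stack with one linear readout per layer (illustrative only)."""

        def __init__(self, dim=64, depth=4, n_classes=10):
            super().__init__()
            self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learned CLS token
            self.blocks = nn.ModuleList(
                nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
                for _ in range(depth)
            )
            # One readout per layer: maps that layer's CLS state to class logits.
            self.readouts = nn.ModuleList(
                nn.Linear(dim, n_classes) for _ in range(depth)
            )

        def forward(self, tokens):
            # Prepend the CLS token to the patch tokens.
            x = torch.cat([self.cls.expand(tokens.size(0), -1, -1), tokens], dim=1)
            logits_per_layer = []
            for block, readout in zip(self.blocks, self.readouts):
                x = block(x)
                logits_per_layer.append(readout(x[:, 0]))  # read out the CLS token
            return logits_per_layer

    model = TinyViTWithReadouts()
    tokens = torch.randn(2, 16, 64)  # batch of 2 "images", 16 patch tokens each
    outs = model(tokens)             # one logits tensor per layer
    ```

    Training the per-layer heads (with the backbone frozen or jointly) then gives a label readout at every depth, which is the kind of probe the note is concerned with.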

  • Lab Note: Early information in ViT CLS tokens is not used

    I want to document an interesting behavior of vision transformers that I first encountered in experiments with test-time training: information in the CLS tokens of vision transformer models is not used by later layers. What’s more, label information, while not used, is still available in these layers. This paints an interesting picture of how transformer models determine their outputs, different from what I originally expected. Rather than a gradual accumulation of information into the CLS token as we move up the network, the network instead seems to focus on producing image tokens that are ‘expressive’ in that they strongly separate classes, regardless of the CLS tokens they are paired with.
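
    One way to test a claim like this is an intervention experiment: overwrite the CLS token after an intermediate layer and compare the final CLS state against an unmodified forward pass. The sketch below is hypothetical (a small random, untrained stack with illustrative sizes, not the post's actual model); on a trained ViT exhibiting the described behavior, the two runs would stay close, whereas a random stack will generally diverge.

    ```python
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # A small random transformer stack standing in for a ViT backbone
    # (illustrative only; the post's model and layer indices may differ).
    blocks = nn.ModuleList(
        nn.TransformerEncoderLayer(32, nhead=4, batch_first=True).eval()
        for _ in range(3)
    )

    def run(x, wipe_after=None):
        """Forward pass; optionally zero the CLS token after block `wipe_after`."""
        for i, block in enumerate(blocks):
            x = block(x)
            if i == wipe_after:
                x = x.clone()
                x[:, 0] = 0.0  # destroy the CLS token at this depth
        return x[:, 0]  # final CLS state

    with torch.no_grad():
        x = torch.randn(2, 9, 32)      # CLS slot + 8 patch tokens, dim 32
        clean = run(x)                 # unmodified pass
        wiped = run(x, wipe_after=0)   # CLS wiped after the first block
    diff = (clean - wiped).abs().max() # small iff later layers ignore early CLS
    ```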
