Lab Note: Early information in ViT CLS tokens is not used
I want to document an interesting behavior of vision transformers that I first encountered in experiments with test-time-training: information present in the CLS tokens at early and middle layers of vision transformer models is not used by later layers. What’s more, label information, while unused, is still linearly decodable in these layers. This paints an interesting picture of how transformer models determine their outputs, different from what I originally expected. Rather than a gradual accumulation of information into the CLS token as we move up the network, the network instead seems to focus on producing image tokens that are ‘expressive’ in that they are able to strongly separate classes, regardless of the CLS tokens that they are paired with.
Experiments
Setup
We assess pre-trained vision transformer models on imagenet-1k (ILSVRC/imagenet-1k) classification performance, training probes on a subset of the train split and testing on a subset of the validation split.
The models are:
- google/vit-base-patch16-224: a ViT trained on imagenet-21k.
- facebook/deit-base-distilled-patch16-224: a ViT trained on imagenet-1k alone, by distillation from a CNN model.
- microsoft/beit-base-patch16-224-pt22k-ft22k: a ViT pretrained on imagenet-22k for masked patch reconstruction and finetuned on imagenet-22k, with linear classification probes initialized locally.
- facebook/dinov2-base: a ViT foundation model pretrained on a large corpus of images with a DINO objective, combining masked reconstruction and similarity of different crops of the same image, with linear probes for classification initialized locally.
- openai/clip-vit-base-patch32: the ViT component of a multi-modal foundation model pretrained on image-text pairs to produce aligned representations, with linear probes for classification initialized locally.
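As a concrete sketch of how these checkpoints are queried (the dataloading and probe-training code is omitted, and the helper name below is mine), per-layer CLS activations can be pulled from the HuggingFace models roughly as follows; for the CLIP checkpoint, the vision tower would be loaded via CLIPVisionModel instead:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

# Illustrative sketch: collect per-layer CLS-token activations for linear probing.
# The model id is one of the checkpoints listed above; batching is omitted.
model_id = "google/vit-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, output_hidden_states=True).eval()

@torch.no_grad()
def cls_per_layer(images):
    """Return a list of (batch, hidden) CLS activations, one per encoder layer."""
    inputs = processor(images=images, return_tensors="pt")
    hidden_states = model(**inputs).hidden_states  # embeddings + one entry per layer
    return [h[:, 0, :] for h in hidden_states[1:]]  # token 0 is the CLS token
```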
For these results, it is important to keep in mind that there are multiple different methods for reading final classification outputs from a transformer’s latent representation, which can be relevant to the causal impact of CLS tokens.
These include: the CLS token alone, a global average of the image tokens, additional (e.g. distillation) tokens, and combinations of these approaches via averaging or concatenation.
The models here use several of these methods: CLS tokens alone (vit-base-patch16-224 and beit-base-patch16-224-pt22k-ft22k, depending on settings), a global average of image tokens alone (clip-vit-base-patch32), the CLS token concatenated with the average (dinov2-base), and the CLS and distillation tokens averaged (deit-base-distilled-patch16-224).
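A rough sketch of these readout variants over the final-layer hidden states (function names are mine; note that the released DeiT head actually averages the logits of two separate classifiers, so the token-level average shown last is a simplification):

```python
import torch

# h: final-layer hidden states of shape (batch, tokens, hidden).
# Token 0 is the CLS token; for DeiT, token 1 is the distillation token
# and image patches start at index 2, otherwise they start at index 1.
def readout_cls(h):
    return h[:, 0]                                        # CLS token alone

def readout_avg(h, first_patch=1):
    return h[:, first_patch:].mean(dim=1)                 # global average of image tokens

def readout_cls_concat_avg(h, first_patch=1):
    return torch.cat([h[:, 0], h[:, first_patch:].mean(dim=1)], dim=-1)  # DINOv2-style

def readout_cls_dist_avg(h):
    return 0.5 * (h[:, 0] + h[:, 1])                      # DeiT-style (simplified)
```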
Results
We assess the presence of label information in the CLS tokens at different layers of the model by linear probing: we fit linear probes on train-set activations and assess their test-set performance. This is shown in blue in the figures below. We assess how the networks use CLS token information by batch-shuffling: using random batches of size 128, we shuffle the CLS tokens within each batch at the specified layer, thereby removing the correspondence between CLS tokens and labels, and evaluate the accuracy of the model’s outputs. This is shown in orange in the figures below.
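A minimal sketch of the shuffling intervention, assuming a HuggingFace-style encoder whose layers take the hidden states as their first positional argument and keep the CLS token at position 0 (the helper name is mine):

```python
import torch

def shuffle_cls_before(layer_module):
    """Register a forward pre-hook that permutes CLS tokens across the batch
    just before the given encoder layer, breaking the CLS/label correspondence."""
    def pre_hook(module, args):
        hidden = args[0].clone()                        # (batch, tokens, hidden)
        perm = torch.randperm(hidden.size(0), device=hidden.device)
        hidden[:, 0] = hidden[perm, 0]                  # shuffle only the CLS token
        return (hidden,) + args[1:]
    return layer_module.register_forward_pre_hook(pre_hook)

# e.g. handle = shuffle_cls_before(model.encoder.layer[k]); call handle.remove() to undo.
```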

There are several things to notice in these plots. First, across all models, there is a gradual increase in linearly-decodable information (blue) as we move up the layers, indicating that label information accumulates gradually in the CLS token. At the same time, however, model accuracy under CLS-token shuffling (orange) is not impacted until the shuffling occurs in the last layers of the model. This indicates that the information present in the CLS tokens at early layers is never actually used by the rest of the network. This is the major take-away from these experiments: for all vision transformer models tested here, label information accumulates gradually in the CLS tokens, but it is not used by the model until the later layers.
As mentioned above, transformer models sometimes use different methods to read out label information, in addition to or instead of the CLS tokens. Where possible, we turn off these alternative information sources to produce the runs shown in green, which assess the impact of CLS token shuffling when only the CLS token is used for read-out. In the deit model, this means disabling the final-layer read-out from the distillation token. In the dino model, this means disabling the final-layer read-out from the global token average. In the clip model, this is not possible: CLS tokens are never used. Comparing the orange and green curves in the dino and deit cases, we see somewhat different behaviors. In both cases, the extra (non-CLS) information is used by the network, particularly in the later layers, where the orange-to-green gap grows. However, the dino model shows a roughly constant loss from removing the non-CLS information, while in the deit model the accuracy loss only begins in the later layers. I speculate that this difference is characteristic of appending (dino) vs averaging (deit) the multiple information sources, but don't test this further. The green curves also give us an important control: with the exception of CLIP, either the orange curve or the green curve always drops to random performance in the last layer, indicating that our shuffling is working as expected.
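For the concatenation-style head (dino), one simple way to disable the average read-out, assuming the linear head sees the CLS token concatenated with the mean-pooled patch tokens, is to zero the weight columns that act on the averaged half; this sketch is illustrative rather than the exact mechanism used:

```python
import torch

def disable_avg_readout(linear_head, hidden_dim):
    """Zero the weight columns acting on the mean-pooled patch features,
    assuming the head's input is torch.cat([cls, patch_mean], dim=-1)."""
    with torch.no_grad():
        linear_head.weight[:, hidden_dim:] = 0.0  # keep only the CLS half
```

For the averaged-heads case (deit), the analogue is simply to read out only the CLS classifier's logits.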
The CLIP model is an extreme case: it is trained with CLS tokens but does not use them for readout. Here, CLS token shuffling never has much impact on the accuracy of the outputs: the information is never used. Notably, for this version (with only the linear probes trained, rather than full fine-tuning) the accuracy is quite poor, and CLS token probing in the final layer actually outperforms the average-pooling readout.
Comparison to identifiability using the final classifier
Previous work [1] tracing information in the CLS tokens used the final classifier layer of ViT models to analyze the CLS token encoding at various stages of the network.
They also use an alternative identifiability score, which linearly weights the rank of the correct label in the logit output, rather than the all-or-nothing weighting of accuracy.
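A sketch of such a rank-based score (the exact normalization in the original paper may differ; here the best possible rank maps to 1 and the worst to 0):

```python
import torch

def identifiability(logits, labels):
    """Linearly weight the rank of the correct class in the logits:
    1.0 if it has the highest logit, 0.0 if it has the lowest."""
    correct = logits.gather(1, labels[:, None])              # (batch, 1)
    ranks = (logits > correct).sum(dim=1).float()            # 0 = best rank
    num_classes = logits.size(1)
    return (1.0 - ranks / (num_classes - 1)).mean()
```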
Plotting these metrics for google/vit-base-patch16-224, we see broad agreement with their previous results.

We can draw two important conclusions from this plot. First, because there is a substantial gap between the performance of layer-specific linear probes and the last-layer probe, the label representation in the CLS tokens appears to change as we move up the network. Second, because the ‘identifiability’ score increases much faster than accuracy, there is information to be gained from these layers even before their accuracy becomes large. For example, at layer 5 the readout accuracy is around 10%, while the identifiability is about 95%: the network rarely ranks the ground-truth label first, but on average it falls within the top 5% of outputs.
Other related work:
The recent study ‘Causality ≠ Decodability’ [2] examined the relationship between the information that is decodable and the information that is used by the model, in the context of an object counting task. They analyze both CLS tokens and image tokens by exchanging tokens between base stimuli and corrupted versions that have had objects removed. Their CLS-token findings are similar to ours: CLS tokens are causal only at late stages in the network, but counts can be reliably decoded even in the middle layers of the network.
“A Closer Look at the CLS Token for Cross-Domain Few-Shot Learning” [3] studied the information encoded in CLS tokens, finding that CLS tokens tend to encode low-frequency information about the images. This raises interesting possibilities for future directions.
References:
- [1] Vilas, Schaumlöffel, Roig (2023). Analyzing Vision Transformers for Image Classification in Class Embedding Space.
- [2] Huang & Chang (2025). Causality ≠ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs.
- [3] Zou, Yi, Li, Li (2024). A Closer Look at the CLS Token for Cross-Domain Few-Shot Learning.