Lab Note: Early information in ViT CLS tokens is not used
I want to document an interesting behavior of vision transformers that I first encountered in experiments with test-time training: information present in the CLS token at early layers of a vision transformer is not used by later layers. What’s more, label information, while unused, is still available at these layers. This paints an interesting picture of how transformer models determine their outputs, different from what I originally expected. Rather than gradually accumulating information into the CLS token as we move up the network, the network seems to focus on producing image tokens that are ‘expressive’, in the sense that they strongly separate classes regardless of which CLS token they are paired with.
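
To make the claim concrete, here is a minimal sketch of the kind of intervention that exposes this behavior: overwrite the CLS token midway through a pretrained ViT and check whether the predictions change. The timm checkpoint, the block index, and the random replacement are illustrative assumptions, not the exact protocol from my experiments.

```python
import torch
import timm

# Pretrained ViT from timm (the exact checkpoint here is an assumption).
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

SCRAMBLE_BLOCK = 6  # mid-network block whose output CLS token we overwrite


def scramble_cls(module, inputs, output):
    # In timm ViTs the token sequence is (CLS, patch_1, ..., patch_N);
    # replace the CLS token while leaving the image tokens untouched.
    output = output.clone()
    output[:, 0, :] = torch.randn_like(output[:, 0, :])
    return output


x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch

with torch.no_grad():
    logits_clean = model(x)

handle = model.blocks[SCRAMBLE_BLOCK].register_forward_hook(scramble_cls)
with torch.no_grad():
    logits_scrambled = model(x)
handle.remove()

# If later layers ignore the information carried by the CLS token up to
# this depth, predictions should barely change despite the randomized token.
match = (logits_clean.argmax(-1) == logits_scrambled.argmax(-1)).float().mean()
print(f"prediction agreement after CLS scrambling: {match.item():.2%}")
```

A complementary check is a linear probe trained on the intermediate CLS token itself: if the probe recovers labels while the intervention above leaves the output unchanged, the label information is present at that depth but unused downstream.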