Posts

  • Fitting Finite Automata

    Finite state machines (finite automata) provide a simple model of computation: the machine starts from some initial state, observes a series of inputs, which cause it to transition to different states depending on the input and the current state, and then outputs a value that depends on its final state. As a well-studied model of simple computations, it is not surprising that learning finite state machines is one of the earliest and best-studied computational inverse problems. It also provides one of the cleanest sample complexity results. Learning large (many-state) machines is not feasible for passive learners, which attempt to fit bulk-collected data: such learners require a number of data samples that is exponential in the size of the machine. Active learners, on the other hand, which can request information about specific points, are able to learn large state machines, with the number of samples growing quadratically with the number of states in the machine.
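
    The model described above can be sketched in a few lines. This is a minimal illustration, not code from the post; the function name and parity example are hypothetical.

```python
def run_dfa(transitions, start, outputs, inputs):
    """Run a deterministic finite automaton over a sequence of inputs.

    transitions: dict mapping (state, symbol) -> next state
    start:       the initial state
    outputs:     dict mapping state -> output value
    inputs:      iterable of input symbols
    """
    state = start
    for symbol in inputs:
        # Next state depends on the current state and the current input.
        state = transitions[(state, symbol)]
    # The machine's output depends only on its final state.
    return outputs[state]


# Example: a two-state machine that outputs whether it has
# observed an odd number of 1s (a parity check).
transitions = {
    ("even", 0): "even", ("even", 1): "odd",
    ("odd", 0): "odd", ("odd", 1): "even",
}
outputs = {"even": False, "odd": True}

print(run_dfa(transitions, "even", outputs, [1, 0, 1, 1]))  # → True
```

    A learner observing only input/output pairs from such a machine must recover the transition table; the sample complexity results above concern how many such observations (or queries) this takes as the number of states grows.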

  • The sample complexity of computational inverse problems: introduction

    The goal of a computational inverse problem is to describe the behavior of an observed system in terms of a computational problem that the system is solving. In other words, we assume that our observations about a system can be captured by some computational objective, and aim to describe the system in terms of that computation. This requires that we develop methods to fit specific features of computational models based on observations of how a system behaves. Motivations for this approach include imitation learning, adaptation, understanding economic behavior and animal behavior, machine learning interpretability, and functional or physiological modeling of biological systems.

  • Lab Note: Multiple readouts from ViT models

    In a previous post I showed, in the context of vision transformers on ImageNet, that information about labels is present in the CLS tokens of layers throughout the network, but this information is not used by the final network outputs. This led to a loose mental model of ViT classifiers: rather than gradually accumulating label information into the CLS token, the network produces ‘expressive’ image tokens that are able to strongly separate classes, regardless of the CLS tokens that they are paired with. Here we investigate this model further, along with the feasibility of using multiple such readouts from a ViT model.

  • Lab Note: Early information in ViT CLS tokens is not used

    I want to document an interesting behavior of vision transformers that I first encountered in experiments with test-time training: information in the CLS tokens of vision transformer models is not used by later layers. What’s more, label information (while not used) is still available in these layers. This paints an interesting picture of how transformer models determine their outputs, different from what I originally expected. Rather than gradually accumulating information into the CLS token as we move up the network, the network instead seems to focus on producing image tokens that are ‘expressive’ in that they are able to strongly separate classes, regardless of the CLS tokens that they are paired with.