OthelloScope: Visualization of Game-Playing Transformer MLPs
Visit the Tool Website here
Albert Garde, Danish Technical University (DTU) | Esben Kran, Apart Research, esben@kran.ai
We introduce the OthelloScope (OS), a web app for easily and intuitively navigating through the MLP layer neurons of the Othello-GPT Transformer model developed by Kenneth Li et al. (2023) and trained to play random, legal moves in the game Othello. The tool has separate pages for all 14,336 neurons in the 7 MLP layers of Othello-GPT that show: 1) a linear probe's activation directions for identifying own pieces and empty positions of the board, 2) the logit attribution to that neuron depending on locations on the board, and 3) activation at specific game states for 50 example games from an Othello championship dataset.
Using the OS, we qualitatively identify different types of MLP neurons and describe patterns of co-occurrence. The OS is available at kran.ai/othelloscope and the code is available at github.com/apartresearch/othelloscope.
Keywords: Mechanistic interpretability, grid worlds, ML safety
Introduction
Othello-GPT (Li et al., 2023) is a GPT-based model trained to play random legal moves in the board game Othello. In that work, the authors try to recover the model's feature directions with a linear probe and fail. They then train a non-linear probe and find that it works. This is evidence against the hypothesis that features are generally represented linearly.
In subsequent work, Nanda (2023) finds that the features can be extracted linearly after a simple relabeling of the base features. Instead of {blank, white, black}, he uses {blank, my pieces, opponent pieces}. With this shift in interpretation, the linear probe successfully models the features.
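The relabeling can be illustrated with a minimal sketch. The constants, the board encoding as nested lists, and the function names below are assumptions made for illustration; they are not taken from the Othello-GPT or OthelloScope code.

```python
# Absolute cell labels, as in a {blank, white, black} probing setup.
EMPTY, BLACK, WHITE = 0, 1, 2
# Player-relative labels: {blank, my pieces, opponent pieces}.
BLANK, MINE, THEIRS = 0, 1, 2


def to_player_relative(board, player_to_move):
    """Relabel an 8x8 board from the point of view of the player to move."""
    relabeled = []
    for row in board:
        new_row = []
        for cell in row:
            if cell == EMPTY:
                new_row.append(BLANK)
            elif cell == player_to_move:
                new_row.append(MINE)
            else:
                new_row.append(THEIRS)
        relabeled.append(new_row)
    return relabeled


def starting_board():
    """Standard Othello start: two white and two black pieces in the center."""
    board = [[EMPTY] * 8 for _ in range(8)]
    board[3][3], board[4][4] = WHITE, WHITE
    board[3][4], board[4][3] = BLACK, BLACK
    return board


relative = to_player_relative(starting_board(), BLACK)
```

The same physical board gets two different label grids depending on whose turn it is, which is why a probe trained on absolute colors sees a target that flips every move.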
With the OthelloScope, we introduce an online interface to qualitatively analyze MLP neurons in the Othello-GPT model. The tool has a list of neurons for each layer ranked by the variance in their activation to heuristically identify the most interesting neurons.
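A variance-based ranking of this kind could look roughly like the sketch below. The data layout (a dict from neuron index to a flat list of activations) and the function name are hypothetical, and the toy values stand in for real cached activations.

```python
from statistics import pvariance


def rank_neurons_by_variance(activations):
    """Sort neuron indices from highest to lowest activation variance.

    activations: dict mapping neuron index -> list of activation values
    collected over many game positions (hypothetical layout).
    """
    variances = {n: pvariance(values) for n, values in activations.items()}
    return sorted(variances, key=variances.get, reverse=True)


# Toy data: a dead neuron, a weakly varying one, and a strongly varying one.
toy_activations = {
    0: [0.0] * 100,
    1: [0.1, -0.1] * 50,
    2: [1.0, -1.0] * 50,
}
ranking = rank_neurons_by_variance(toy_activations)  # → [2, 1, 0]
```

Variance is only a heuristic: a neuron that fires rarely but strongly can rank below one that oscillates mildly on every move.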
These neurons are clickable and lead to a dedicated page for each neuron. This page has multiple visualizations relevant for the qualitative contextual understanding of that neuron's mechanistic functioning: 1) A linear probe's activation directions at positions of the board, 2) the logit attribution to that neuron depending on locations on the board, and 3) activation at specific game states for 50 example games from an Othello championship dataset.
Figure 1: Heatmaps for correspondence to the linear probe for the cardinal feature directions of "my vs. their pieces" and "empty vs. full positions" respectively (left, middle) and logit attribution to the neuron at specific locations (right). Here for neuron 79 in layer 6.
Figure 2: A 59 x 50 table showing 50 games (y axis) with 59 moves per game (x axis). Each cell shows the neuron's activation at that move of that game.
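One plausible way to compute heatmaps like those in Figure 1 is to compare a neuron's input weights with each square's probe direction, and its output weights with each move token's unembedding vector. The sketch below assumes this approach, uses plain lists for vectors, and invents all function names; the actual tool may compute these quantities differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def probe_heatmap(neuron_in_weights, probe_directions):
    """Grid of cosine similarities, one per board square.

    probe_directions[r][c] is the (assumed) probe direction in the
    residual stream for the square at row r, column c.
    """
    return [[cosine(neuron_in_weights, d) for d in row]
            for row in probe_directions]


def logit_attribution(neuron_out_weights, unembed_vectors):
    """Dot product of the neuron's output weights with each token's
    unembedding vector, i.e. its direct effect on each move logit."""
    return [sum(a * b for a, b in zip(neuron_out_weights, vec))
            for vec in unembed_vectors]


# Toy 2x2 "board" with a 4-dimensional residual stream.
toy_probe = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[-1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]],
]
heat = probe_heatmap([1.0, 0.0, 0.0, 0.0], toy_probe)
```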
Methods
Othello-GPT is a Transformer model trained to play random legal moves in the game Othello (Li et al., 2023). The authors trained an 8-layer GPT model (Radford et al., n.d.) with an 8-head attention mechanism and a 512-dimensional hidden space. It has a 60-token vocabulary, one token for each position on the board except the 4 center squares, which are already occupied at the start of an Othello game.
Their model reaches a near-zero error rate on playing legal moves, and they find that it develops a representation of the Othello board within the model. They test against the hypothesis that it is just memorizing by removing parts of the game tree and observing the same performance.
Othello is an interesting game because it is small enough that we can conduct meaningful analyses on small Transformer models while its complexity prevents full memorization of the game tree. Nanda (n.d.) makes use of this in a mechanistic interpretability analysis and counters some of Li et al.'s conclusions.
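The count of 60 follows from 64 squares minus the 4 center squares. A minimal sketch of such a vocabulary is shown below; the row-major token ordering and all names are assumptions for illustration and may not match the ordering used by the actual model.

```python
# The four center squares are filled before the first move,
# so they never appear as a move token.
CENTER = {(3, 3), (3, 4), (4, 3), (4, 4)}

# Hypothetical ordering: row-major over the remaining squares.
PLAYABLE = [(r, c) for r in range(8) for c in range(8) if (r, c) not in CENTER]
SQUARE_TO_TOKEN = {square: token for token, square in enumerate(PLAYABLE)}
TOKEN_TO_SQUARE = dict(enumerate(PLAYABLE))

vocab_size = len(PLAYABLE)  # → 60
```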
How we developed the tool
We forked the repository for Othello-GPT and used the mechanistic interpretability utility functions to take the first steps towards visualizing single neurons. We then took the visualizations of the linear probes' correspondence to a single neuron created by Nanda and reconstructed them for all neurons in visualizations created manually as styled tables. We did the same with the logit attribution table (see Figure 1).
Subsequently, we visualized the activation of the neurons for 50 games from the 100,000 games dataset for 59 moves.
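A styled table of this kind can be generated with very little code. The sketch below is an assumption about how such a table might be built, not the OthelloScope implementation: the color scheme (red for positive, blue for negative) and the function names are invented for illustration.

```python
def activation_to_color(value, max_abs):
    """Map an activation to a CSS color: red if positive, blue if negative,
    fading to white as the magnitude approaches zero."""
    if max_abs == 0:
        return "rgb(255,255,255)"
    strength = min(abs(value) / max_abs, 1.0)
    fade = round(255 * (1 - strength))
    if value >= 0:
        return f"rgb(255,{fade},{fade})"
    return f"rgb({fade},{fade},255)"


def render_activation_table(rows):
    """Render a games-by-moves activation grid as an HTML table string."""
    max_abs = max((abs(v) for row in rows for v in row), default=0)
    lines = ["<table>"]
    for row in rows:
        cells = "".join(
            f'<td style="background:{activation_to_color(v, max_abs)}" '
            f'title="{v:.2f}"></td>'
            for v in row
        )
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# 50 games by 59 moves, here filled with zeros as placeholder data.
html = render_activation_table([[0.0] * 59 for _ in range(50)])
```

Normalizing by the largest absolute activation per neuron keeps each page readable, at the cost of making colors incomparable across neurons.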
What we did with it
As part of validating the tool, we used it to navigate through the neurons and find patterns or categories of activation patterns.
Results
We developed the tool as represented in Figure 1 and Figure 2 in the introduction.
As part of our validation, we observed multiple patterns:
- Some neurons activate on very few moves (<5%) while others activate on more than 50%. The distribution looks roughly bimodal.
- Most neurons that activate to a lot of moves also activate differently to alternating moves (e.g. [0.4, -0.2, 0.3, -0.3]) which supports the hypothesis that they react consistently to same-side pieces due to the "ours vs. opponent" feature direction (white and black side alternate turns).
- The first move is consistently activating differently across nearly all games and neurons. The output probability distribution of move 0 has 8 legal moves and maybe the second move just instantly explodes in
- Some late-layer neurons' input and output weights correspond to diagonal or straight lines on the board. This is probably due to encoding some sort of action in case these positions are occupied.
- Early-layer neurons have less location-specific activation according to the linear probe.
- Some neurons consistently activate very differently on early-game and late-game steps. For example, one neuron seems to activate highly both early and late, while another activates mostly at late-stage moves.
- Some early neurons (layer 4 and earlier) react to large swathes of the move space quite similarly. Based on the feature directions plots, this seems to co-occur with activation related to starting pieces, though this is quite speculative.
- Some late-layer neurons seem to activate to very specific situations, with high activation for only very few moves in the 50 games and otherwise no activation. E.g. layer 7, neuron 1367.
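The first two observations could be quantified with simple summary statistics over a neuron's games-by-moves activation table. The sketch below is one way to do that; the activation threshold of 0.1 and both function names are arbitrary assumptions, not values used in our analysis.

```python
def active_fraction(table, threshold=0.1):
    """Fraction of (game, move) cells whose |activation| exceeds threshold."""
    values = [v for game in table for v in game]
    return sum(abs(v) > threshold for v in values) / len(values)


def parity_gap(table):
    """Mean activation on even-indexed moves minus mean on odd-indexed moves.

    A large absolute gap suggests the neuron responds differently
    depending on which side is to move.
    """
    even = [v for game in table for i, v in enumerate(game) if i % 2 == 0]
    odd = [v for game in table for i, v in enumerate(game) if i % 2 == 1]
    return sum(even) / len(even) - sum(odd) / len(odd)


# The alternating pattern from the example above, as a one-game table.
toy_table = [[0.4, -0.2, 0.3, -0.3]]
gap = parity_gap(toy_table)
fraction = active_fraction(toy_table)
```

A histogram of `active_fraction` over all neurons would show whether the split between sparse and dense neurons is really bimodal.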
There seem to be many more low-hanging fruit like the results above to be found using the OthelloScope.
Conclusion
The qualitative analysis of activation patterns indicates that the OthelloScope is a useful scientific tool. Its usefulness would likely grow considerably with more visualization features.
Limitations
There are a few limitations with the current version:
- We only see the 50 selected games whereas we would like to see the 50 most activating games to get a sure-fire understanding of the activation over moves.
- We should connect the feature coordinates more to the linear probes and
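Selecting the most activating games for a neuron, as the first limitation suggests, could be sketched as follows. Ranking games by their peak absolute activation is one possible criterion among several, and the function name and data layout are hypothetical.

```python
def top_k_games(activations_by_game, k=50):
    """Return indices of the k games with the highest peak |activation|.

    activations_by_game: list of per-game activation lists for one neuron.
    Ties are broken by the lower game index.
    """
    peaks = [
        (max(abs(v) for v in game), index)
        for index, game in enumerate(activations_by_game)
    ]
    peaks.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in peaks[:k]]


toy_games = [[0.1, 0.2], [0.9, -0.1], [-1.5, 0.0], [0.3, 0.3]]
best_two = top_k_games(toy_games, k=2)  # → [2, 1]
```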
Next steps
There are many improvements and additions to be made to the tool. When it comes to the general structure of the app itself, we would like to implement:
- More navigation and in-line rendering of sub-pages to quickly squint at a neuron's activation visualizations
- More instructional information about Othello-GPT and what the plots exactly are showing on each of the pages.
- Display the graphs in link previews so people don't even have to open them. Right now, it just shows the layer, neuron number, and rank.
- Compare two different neurons on the same screen, i.e. split-screen analysis.
- Have a GPT-4-based summarization engine that takes in an array of summary statistics of the visualizations and meta calculations and generates an explanation for that neuron that is also shown in the description in the link preview.
For the functional visualization, we want to:
- Show the game board when hovering over specific positions in the 50 games.
- Select the top 50 most interesting games per neuron instead of the same 50 games. Possibly toggle between the two so you can compare neurons as well.
- Find a way to show attention heads' attention patterns to earlier stages in the game as a dynamic game board.
- Show the board game positions for late-stage neurons with the most interesting activation to provide direct board context.
- Plot the average activation at each step of the game to identify which game stages are most activating.
- Find more technical methods to visualize how the MLP neurons work.
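The per-step average mentioned in this list is straightforward to compute from the same games-by-moves table. The sketch below assumes every game has the same number of moves and uses invented function names; plotting is left out.

```python
def mean_activation_per_move(table):
    """Average a neuron's activation over games, separately for each move."""
    n_games = len(table)
    n_moves = len(table[0])
    return [sum(game[m] for game in table) / n_games for m in range(n_moves)]


def most_activating_move(table):
    """Index of the move with the highest mean activation."""
    means = mean_activation_per_move(table)
    return max(range(len(means)), key=lambda m: means[m])


toy_table = [[0.0, 1.0, 0.5], [0.0, 3.0, 0.5]]
means = mean_activation_per_move(toy_table)  # → [0.0, 2.0, 0.5]
peak_move = most_activating_move(toy_table)  # → 1
```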
For the methodological functionality, we want to:
- Make it very easy to generate a similar website for any Transformer-based model applied to most domains. Possibly with a simple command-line tool.
- Update the GitHub repository.
Conclusion
In conclusion, there is plenty of work left to do, and the OthelloScope seems to be a generally useful tool for investigating these game-playing Transformers. See the app at kran.ai/othelloscope.
Status | Released |
Author | Apart Research |