
OthelloScope: Visualization of Game-Playing Transformer MLPs

Visit the Tool Website here

Albert Garde Danish Technical University (DTU)

Esben Kran Apart Research esben@kran.ai

We introduce the OthelloScope (OS), a web app for easily and intuitively navigating the MLP neurons of the Othello-GPT Transformer model developed by Li et al. (2023) and trained to play random legal moves in the game Othello. The tool has a separate page for each of the 14,336 neurons in the 7 MLP layers of Othello-GPT. Each page shows: 1) a linear probe's activation directions for identifying own pieces and empty positions on the board, 2) the neuron's logit attribution by board location, and 3) the neuron's activation at specific game states for 50 example games from an Othello championship dataset.

Using the OS, we qualitatively identify different types of MLP neurons and describe patterns of co-occurrence. The OS is available at kran.ai/othelloscope and the code is available at github.com/apartresearch/othelloscope.

Keywords: Mechanistic interpretability, grid worlds, ML safety

Introduction

Othello-GPT (Li et al., 2023) is a GPT-based model trained to take random legal moves in the board game Othello. In that work, the authors try to elicit the feature directions of the neurons using a linear probe and fail. They then train a non-linear probe and find that it works. They take this as evidence against the hypothesis that features are generally represented linearly.

In subsequent work, Nanda (2023) finds that the features can be extracted linearly after a simple transformation of the base features. Instead of {blank, white, black}, he relabels them as {blank, my pieces, opponent pieces}. With this shift in interpretation, a linear probe successfully models the features.

With the OthelloScope, we introduce an online interface to qualitatively analyze MLP neurons in the Othello-GPT model. For each layer, the tool lists the neurons ranked by the variance of their activations, a heuristic for surfacing the most interesting neurons.

Each neuron in the list links to a dedicated page. This page has multiple visualizations that support a qualitative, contextual understanding of how the neuron functions mechanistically: 1) a linear probe's activation directions at positions on the board, 2) the neuron's logit attribution by board location, and 3) the neuron's activation at specific game states for 50 example games from an Othello championship dataset.
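The variance ranking used for the neuron lists can be sketched as follows. This is a minimal illustration, not the tool's code: the function name `rank_neurons_by_variance`, the dictionary layout, and the toy values are all assumptions made for the example.

```python
from statistics import pvariance


def rank_neurons_by_variance(activations):
    """Return neuron indices sorted from highest to lowest activation variance.

    `activations` maps a neuron index to a list of activation values
    collected over many game positions (hypothetical layout).
    """
    variances = {neuron: pvariance(values) for neuron, values in activations.items()}
    return sorted(variances, key=variances.get, reverse=True)


# Toy data (made-up values, not taken from Othello-GPT).
toy_activations = {
    0: [0.0, 0.0, 0.0, 0.0],    # never changes: least interesting
    1: [0.4, -0.2, 0.3, -0.3],  # small alternating signal
    2: [0.0, 2.0, 0.0, 2.0],    # large swings: ranked first
}
ranking = rank_neurons_by_variance(toy_activations)
print(ranking)
```

Variance is only a heuristic here: a neuron that fires rarely but strongly and one that oscillates on every move can both score highly, so the ranking is a starting point for browsing rather than a measure of importance.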

Figure 1: Heatmaps showing correspondence to the linear probe for the cardinal feature directions of "my vs. their pieces" and "empty vs. full positions" respectively (left, middle) and logit attribution to the neuron at specific locations. Shown here for neuron 79 in layer 6.

Figure 2: A 59 x 50 table showing the neuron's activation at each of the 59 moves (x axis) of 50 games (y axis).
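One plausible way to compute a probe-correspondence heatmap like those in Figure 1 is to take the cosine similarity between a neuron's weight vector and the probe direction for each board square. The sketch below assumes that approach; the source does not state the exact measure used, and the names `cosine_similarity` and `probe_heatmap` as well as the tiny 2 x 2 toy "board" are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def probe_heatmap(neuron_weights, probe_directions):
    """Compare one neuron's weight vector with the probe direction of every square.

    `probe_directions` is a nested list: one direction vector per board square,
    arranged in rows (assumed layout).
    """
    return [
        [cosine_similarity(neuron_weights, direction) for direction in row]
        for row in probe_directions
    ]


# Toy example: 2-dimensional vectors on a 2 x 2 "board" (made-up values).
toy_probe = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, 1.0]],
]
toy_heatmap = probe_heatmap([1.0, 0.0], toy_probe)
```

For the real model the same function would be applied to 8 x 8 grids of 512-dimensional directions, once for the "my vs. their pieces" probe and once for the "empty vs. full" probe.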

Methods

Othello-GPT is a Transformer model trained to take random legal moves in the game Othello (Li et al., 2023). Li et al. trained an 8-layer GPT model (Radford et al., n.d.) with 8 attention heads and a 512-dimensional hidden space. It has a 60-token vocabulary, one token for each position on the board except the 4 center squares, which are filled automatically at the start of an Othello game.

Their model reaches a near-zero error rate on playing legal moves, and they find that it develops a representation of the Othello board within the model. They test against the hypothesis that it is just memorizing by removing parts of the game tree from training and observing the same performance.

Othello is an interesting game because it is small enough that we can conduct meaningful analyses on small Transformer models, yet complex enough that a model cannot fully memorize the game tree. Nanda (n.d.) makes use of this in a mechanistic interpretability analysis and counters some of Li et al.'s conclusions.
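To display per-token quantities such as logit attribution on a board, the 60 move tokens have to be spread back over the 64 squares. The sketch below assumes a row-major token order that simply skips the four center squares; the actual token order in Othello-GPT may differ, and all names here (`PLAYABLE_SQUARES`, `token_to_row_col`, `tokens_to_board`) are hypothetical.

```python
BOARD_SIZE = 8
# Flat row-major indices of the four pre-filled center squares:
# (row 3, col 3), (row 3, col 4), (row 4, col 3), (row 4, col 4).
CENTER_SQUARES = {27, 28, 35, 36}

# Assumed ordering: tokens enumerate the remaining squares in row-major order.
PLAYABLE_SQUARES = [
    square for square in range(BOARD_SIZE * BOARD_SIZE) if square not in CENTER_SQUARES
]


def token_to_row_col(token):
    """Map a move token (0..59) to its (row, column) on the 8 x 8 board."""
    return divmod(PLAYABLE_SQUARES[token], BOARD_SIZE)


def tokens_to_board(values, fill=0.0):
    """Spread 60 per-token values over an 8 x 8 grid; center squares get `fill`."""
    board = [[fill] * BOARD_SIZE for _ in range(BOARD_SIZE)]
    for token, value in enumerate(values):
        row, col = token_to_row_col(token)
        board[row][col] = value
    return board


board = tokens_to_board(list(range(60)))
```

With this mapping, a vector of 60 logit contributions for one neuron becomes an 8 x 8 grid that can be drawn next to the probe heatmaps.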

How we developed the tool

We forked the repository for Othello-GPT and used the mechanistic interpretability utility functions to take the first steps towards visualizing single neurons. We then took Nanda's visualizations of how a single neuron corresponds to the linear probes and reconstructed them for all neurons as manually styled tables. We did the same with the logit attribution table (see Figure 1).

Subsequently, we visualized the activation of each neuron at each of the 59 moves of 50 games from the 100,000-game dataset.
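A styled table of the kind described above can be produced with a few lines of string formatting. The following is a minimal sketch under assumed conventions (blue for positive values, red for negative, white for zero); the color scheme and the function names `value_to_color` and `heatmap_table` are illustrative and not taken from the OthelloScope code.

```python
def value_to_color(value, max_abs):
    """Map a signed value to a CSS color: blue if positive, red if negative."""
    if max_abs == 0:
        return "rgb(255, 255, 255)"
    strength = min(abs(value) / max_abs, 1.0)
    fade = round(255 * (1 - strength))  # 255 = white, 0 = fully saturated
    if value >= 0:
        return f"rgb({fade}, {fade}, 255)"
    return f"rgb(255, {fade}, {fade})"


def heatmap_table(grid):
    """Render a nested list of numbers as an HTML table with colored cells."""
    max_abs = max((abs(v) for row in grid for v in row), default=0.0)
    rows = []
    for row in grid:
        cells = "".join(
            f'<td style="background-color: {value_to_color(v, max_abs)}">{v:.2f}</td>'
            for v in row
        )
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"


html = heatmap_table([[1.0, -1.0], [0.0, 0.5]])
```

The same function works for an 8 x 8 board heatmap and for a 50 x 59 games-by-moves activation table, since it only depends on the grid being a list of rows.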

What did we do with it

As part of validating the tool, we used it to browse the neurons and look for recurring patterns or categories of activation behavior.

Results

We developed the tool as represented in Figure 1 and Figure 2 in the introduction.

As part of our validation, we observed multiple patterns:

Some neurons activate on very few moves (<5%) while others activate on more than 50%. The distribution looks roughly bimodal.
Most neurons that activate on many moves also activate differently on alternating moves (e.g. [0.4, -0.2, 0.3, -0.3]). This supports the hypothesis that they react consistently to same-side pieces via the "ours vs. opponent" feature direction, since the white and black sides alternate turns.
The first move consistently produces different activations across nearly all games and neurons. The output probability distribution of move 0 has 8 legal moves and maybe the second move just instantly explodes in
Some late-layer neurons' input and output weights correspond to diagonal or straight lines on the board. This probably encodes some sort of action to take in case these positions are occupied.
Early-layer neurons have less location-specific activation according to the linear probe.
Some neurons consistently activate very differently on early-game and late-game steps. For example, one neuron seems to activate highly both early and late, while another activates mostly at late-stage moves.

Some early neurons (layer 4 and earlier) react quite similarly to large swathes of the move space. Based on the feature direction plots, this seems to co-occur with activation related to the starting pieces, though this is quite speculative.
Some late-layer neurons seem to activate only in very specific situations, with high activation on very few moves in the 50 games and no activation otherwise. E.g. layer 7, neuron 1367.
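Several of the patterns above could be checked quantitatively with simple summary statistics over a neuron's games-by-moves activation table. The sketch below is one possible set of such statistics, not something the tool computes: `active_fraction`, `parity_gap`, `mean_per_move`, the zero threshold, and the toy values are all assumptions.

```python
from statistics import mean


def active_fraction(table, threshold=0.0):
    """Fraction of all (game, move) cells whose activation exceeds `threshold`."""
    values = [v for game in table for v in game]
    return sum(v > threshold for v in values) / len(values)


def parity_gap(table):
    """Mean activation on even-indexed moves minus mean on odd-indexed moves.

    A large absolute gap suggests the neuron responds differently depending
    on which side is to move.
    """
    even = [v for game in table for v in game[0::2]]
    odd = [v for game in table for v in game[1::2]]
    return mean(even) - mean(odd)


def mean_per_move(table):
    """Average activation at each move index across games of equal length."""
    return [mean(column) for column in zip(*table)]


# Toy table: 2 games x 4 moves (made-up values echoing the alternating pattern).
toy_table = [
    [0.4, -0.2, 0.3, -0.3],
    [0.5, -0.1, 0.2, -0.4],
]
fraction = active_fraction(toy_table)
gap = parity_gap(toy_table)
profile = mean_per_move(toy_table)
```

A histogram of `active_fraction` over all neurons would test the bimodality observation, `parity_gap` targets the alternating-move pattern, and `mean_per_move` separates early-game from late-game neurons.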

Many similar low-hanging findings seem to be within reach using the OthelloScope.

Conclusion

The qualitative analysis of activation patterns indicates that the OthelloScope is a useful scientific tool. Its usefulness would likely grow considerably with more visualization features.

Limitations

There are a few limitations with the current version:

We only show the same 50 selected games for every neuron, whereas showing the 50 most activating games for each neuron would give a more reliable picture of its activation over moves.
We should connect the feature coordinates more to the linear probes and

Next steps

There are many improvements and additions to be made to the tool. When it comes to the general structure of the app itself, we would like to implement:

More navigation and in-line rendering of sub-pages to quickly squint at a neuron's activation visualizations
More instructional information about Othello-GPT and what exactly the plots on each page show.
Display the graphs in link previews so people don't even have to open them. Right now, it just shows the layer, neuron number, and rank.
Compare two different neurons on the same screen, i.e. split-screen analysis.
Have a GPT-4-based summarization engine that takes in an array of summary statistics of the visualizations and meta calculations and generates an explanation for that neuron that is also shown in the description in the link preview.

For the functional visualization, we want to:

Show the game board when hovering over specific positions in the 50 games.
Select the top 50 most interesting games per neuron instead of the same 50 games. Possibly toggle between the two so you can compare neurons as well.
Find a way to show attention heads' attention patterns to earlier stages in the game as a dynamic game board.
Show the board game positions for late-stage neurons with the most interesting activation to provide direct board context.
Plot the average activation at each step of the game to identify which game stages are most activating.
We want to find more technical methods to visualize how the MLP neurons work.
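Selecting the most interesting games per neuron, as proposed in the list above, could start from something as simple as ranking games by their peak activation. This sketch assumes "most interesting" means "highest maximum activation", which is only one possible criterion; the function name `top_games` and the toy values are hypothetical.

```python
import heapq


def top_games(table, k=50):
    """Return the indices of the k games with the highest peak activation.

    `table` is a list of games, each a list of per-move activations for
    one neuron. Results are ordered from highest to lowest peak.
    """
    return heapq.nlargest(k, range(len(table)), key=lambda game: max(table[game]))


# Toy table: 3 games x 2 moves (made-up values).
toy_games = [
    [0.1, 0.2],
    [0.9, 0.0],
    [0.5, 0.4],
]
selected = top_games(toy_games, k=2)
print(selected)
```

Because the selection differs per neuron, keeping the fixed set of 50 games available as a toggle would still be needed to compare neurons on the same positions.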

For the methodological functionality, we want to:

Make it very easy to generate a similar website for any Transformer-based model applied to most domains. Possibly with a simple command-line tool.
We will update the GitHub repository.

Conclusion

In conclusion, there is plenty of work left to do, and the OthelloScope seems to be a generally useful tool for investigating these game-playing Transformers. See the app at kran.ai/othelloscope.

More information

Status	Released
Author	Apart Research

Download

OthelloScope Visualization of Game-Playing Transformer MLPs.pdf 2.2 MB
