A kind of Cellular Automata Language Model, inspired by Dave Ackley's asynchronous MFM (Movable Feast Machine) work. What if a language model operated on a 2D grid instead of a 1D sequence? Context comes from spatial neighbors, not previous tokens. The grid wraps toroidally (Pac-Man style), so there are no edges.
The idea is to train a small model that takes a position on the grid and the characters in the surrounding cells (within some Manhattan distance), and predicts what should be in the center. Then let it loose on a blank grid. An event fires at some position, the model reads the neighborhood, writes a character, and moves on. Text emerges out of nothing and keeps evolving as the model revisits cells.
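A minimal sketch of that update loop, just to pin down the mechanics. The `grid` layout, `model.predict` interface, and size/radius constants are illustrative, not the actual code:

```python
import random

BLANK = 0    # assumed id for an empty cell
SIZE = 64    # grid side length; indexing wraps toroidally

def read_neighborhood(grid, x, y, radius):
    """Collect the cells within the given Manhattan distance, wrapping at the edges."""
    cells = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if 0 < abs(dx) + abs(dy) <= radius:
                cells.append(grid[(y + dy) % SIZE][(x + dx) % SIZE])
    return cells

def fire_event(grid, model, radius=1):
    """One asynchronous event: pick a random cell, read its neighborhood, write the center."""
    x, y = random.randrange(SIZE), random.randrange(SIZE)
    context = read_neighborhood(grid, x, y, radius)
    grid[y][x] = model.predict(context)   # hypothetical model interface

grid = [[BLANK] * SIZE for _ in range(SIZE)]   # start from nothing
```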
v1: English (the proof of concept)
64x64 grid, byte-level tokenization (257 vocab), trained on TinyStories. A "walker" wanders the grid randomly, observing its 4 cardinal neighbors, predicting the next character to place. Causal transformer over the walk history (512 steps), 6 layers, 384d.
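Roughly what one walker step looks like. This is a sketch: `model.predict` and the random move policy stand in for the real training/inference code; only the grid size, 4-neighbor observation, and 512-step history come from the description above.

```python
import random
from collections import deque

MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]   # N, S, W, E offsets
HISTORY_LEN = 512                            # context window of the causal transformer

def run_walker(grid, model, steps, size=64):
    x, y = random.randrange(size), random.randrange(size)
    history = deque(maxlen=HISTORY_LEN)      # rolling walk history fed to the model
    for _ in range(steps):
        # Observe the 4 cardinal neighbors, wrapping toroidally.
        for dx, dy in MOVES:
            history.append(grid[(y + dy) % size][(x + dx) % size])
        grid[y][x] = model.predict(list(history))   # hypothetical interface
        dx, dy = random.choice(MOVES)               # wander to a random neighbor
        x, y = (x + dx) % size, (y + dy) % size
```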
v2: Chinese
Switched to Chinese characters as tokens. Each one is a meaningful semantic unit, unlike English bytes. Used adam89/TinyStoriesChinese, packed into 32x32 grids (1,024 chars each). Vocab: top 4,000 most frequent CJK characters + punctuation.
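Data prep is basically "flatten the corpus into character ids and chop it into squares." A sketch of that packing, with an assumed id scheme (0 = blank, 1 = unknown):

```python
from collections import Counter

GRID_SIDE = 32
GRID_CELLS = GRID_SIDE * GRID_SIDE     # 1,024 characters per training grid
VOCAB_SIZE = 4000                      # top-4,000 most frequent characters

def build_vocab(texts):
    """Map the most frequent characters to ids; everything else becomes <unk>."""
    counts = Counter(ch for t in texts for ch in t)
    chars = [ch for ch, _ in counts.most_common(VOCAB_SIZE)]
    return {ch: i + 2 for i, ch in enumerate(chars)}   # 0 = blank, 1 = <unk> (assumed)

def pack_into_grids(texts, vocab):
    """Concatenate stories and chop the id stream into 32x32 grids, row-major."""
    ids = [vocab.get(ch, 1) for t in texts for ch in t]
    grids = []
    for start in range(0, len(ids) - GRID_CELLS + 1, GRID_CELLS):
        chunk = ids[start:start + GRID_CELLS]
        grids.append([chunk[r * GRID_SIDE:(r + 1) * GRID_SIDE] for r in range(GRID_SIDE)])
    return grids
```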
v2.1 added directional neighbor loss. Instead of just predicting the center, the model also predicts each of its 4 neighbors with asymmetric weights: West=1.0 (you just read that char), East=0.75, South/North=0.25. This gives ~30x more supervision per forward pass and bakes in left-to-right reading order.
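A sketch of that loss, assuming one output head per direction plus the center. The directional weights are the ones above; the center weight of 1.0 and the head/tensor layout are assumptions:

```python
import torch.nn.functional as F

# Asymmetric weights from above; the center weight of 1.0 is an assumption.
WEIGHTS = {"center": 1.0, "west": 1.0, "east": 0.75, "south": 0.25, "north": 0.25}

def directional_neighbor_loss(logits, targets):
    """logits[k]: (batch, vocab) prediction for slot k; targets[k]: (batch,) token ids."""
    loss = 0.0
    for key, w in WEIGHTS.items():
        loss = loss + w * F.cross_entropy(logits[key], targets[key])
    return loss
```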
The multi-agent version is interesting. Four walkers roam the same grid simultaneously, each one updating cells based on what's already been written in the neighboring cells. The text ends up being a collaboration between agents that never communicate directly, only through the grid itself.
v3: Async Cellular Automaton
Completely scrapped the walker concept. New idea: treat the grid as a cellular automaton. Each step, pick a random (x, y), read a Manhattan-radius-5 diamond (61 cells), and predict the center. No position embeddings, no causal attention. Just a bidirectional transformer over the 61-cell patch with slot embeddings.
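A sketch of the patch model under those constraints. The diamond enumeration and the slot-embedding idea are from the description above; the layer sizes and module names are illustrative, not the trained configs:

```python
import torch
import torch.nn as nn

RADIUS = 5
# Fixed enumeration of offsets within Manhattan distance 5: 61 slots, center included.
OFFSETS = [(dx, dy)
           for dy in range(-RADIUS, RADIUS + 1)
           for dx in range(-RADIUS, RADIUS + 1)
           if abs(dx) + abs(dy) <= RADIUS]
CENTER = OFFSETS.index((0, 0))

class PatchModel(nn.Module):
    """Bidirectional transformer over one 61-cell diamond patch (illustrative sizes)."""
    def __init__(self, vocab, dim=384, heads=6, layers=6):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.slot = nn.Embedding(len(OFFSETS), dim)   # per-slot embedding, not absolute position
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)  # no causal mask
        self.head = nn.Linear(dim, vocab)

    def forward(self, patch):                         # patch: (batch, 61) token ids
        slots = torch.arange(len(OFFSETS), device=patch.device)
        h = self.tok(patch) + self.slot(slots)        # token + slot embeddings
        return self.head(self.encoder(h))             # (batch, 61, vocab) logits

def read_patch(grid, x, y, size):
    """Gather the diamond around (x, y) with toroidal wrapping, in slot order."""
    return [grid[(y + dy) % size][(x + dx) % size] for dx, dy in OFFSETS]
```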
Training uses a masked-diffusion approach (inspired by MD4, Sahoo et al., 2024). For each example, sample a noise level t ~ Uniform[0, 1], mask each cell with probability t, always mask the center, and train cross-entropy on masked positions only. This exposes the model to every fill level, from nearly empty to nearly full, which exactly matches how inference works, since the grid starts empty and fills up.
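Sketched out, the training step looks something like this, assuming the blank token doubles as the mask id and the model returns per-slot logits as in the sketch above:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # assumed: the blank/empty token doubles as the mask token

def masked_diffusion_step(model, patches, center_idx):
    """patches: (batch, 61) clean token ids from packed training grids."""
    b, n = patches.shape
    t = torch.rand(b, 1, device=patches.device)              # noise level per example
    mask = torch.rand(b, n, device=patches.device) < t       # mask each cell with prob t
    mask[:, center_idx] = True                               # the center is always masked
    noisy = patches.masked_fill(mask, MASK_ID)
    logits = model(noisy)                                     # (batch, 61, vocab)
    return F.cross_entropy(logits[mask], patches[mask])       # loss on masked cells only
```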
| Variant | Data | Vocab | Params | Hardware | Val Loss |
|---|---|---|---|---|---|
| v3-zh | TinyStoriesChinese | 4,002 (char) | 29M | A100 | 3.73 |
| v3-en | TinyStories | 50,258 (GPT-2 BPE) | 76M | A100 | 4.07 |
| v3-fineweb | Fineweb-Edu 10BT | 50,258 (GPT-2 BPE) | 162M | 2x H100 | 5.78 |
The Fineweb run was the most ambitious: 162M params, 80k iterations, ~3 hours on 2x H100 NVL, about $16. Loss plateaued over the last 5k iterations, so the model had likely saturated at that scale.
The self-healing is the most interesting part. You can "bomb" a region, clearing a chunk of cells back to blank, and the surrounding context pulls coherent text back into the gap. The model doesn't know anything was destroyed; it just sees empty neighbors and fills them in based on what's around them.
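A sketch of the bomb-and-heal loop, self-contained. `sample_fn` is a stand-in for sampling a token from the model given a 61-cell patch; everything else just reuses the ordinary async update:

```python
import random

BLANK = 0      # assumed id for an empty cell
RADIUS = 5
OFFSETS = [(dx, dy) for dy in range(-RADIUS, RADIUS + 1)
           for dx in range(-RADIUS, RADIUS + 1) if abs(dx) + abs(dy) <= RADIUS]

def bomb(grid, x0, y0, w, h, size):
    """Clear a w-by-h rectangle back to blank (wrapping toroidally)."""
    for dy in range(h):
        for dx in range(w):
            grid[(y0 + dy) % size][(x0 + dx) % size] = BLANK

def heal(grid, sample_fn, steps, size):
    """Keep firing ordinary async events; blanks near intact text get refilled from context."""
    for _ in range(steps):
        x, y = random.randrange(size), random.randrange(size)
        patch = [grid[(y + dy) % size][(x + dx) % size] for dx, dy in OFFSETS]
        grid[y][x] = sample_fn(patch)   # e.g. sample from the model's center-cell logits
```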