Applying diffusion models to abstract reasoning, before anyone else was trying it.
In early July 2024, I spent approximately two weeks applying diffusion models to the ARC-AGI abstract reasoning benchmark. It was my first machine learning project after fastai’s diffusion courses. I solved one task, came close to solving a second, then moved on before running any systematic evaluation.
The idea began with a question: given an input-output pair where the output is some complex, unknown transformation of the input, how do you discover that transformation? My first instinct was Fourier transforms, decomposing the unknown transformation into simpler components, but that failed for a basic reason: Fourier analysis decomposes a signal into frequency components; it doesn’t discover a mapping from an input to an output.
The key insight was about diffusion’s forward process. The forward process is fundamentally just a function. Conventionally it’s stochastic Gaussian noise, but nothing requires this. ARC grids are discrete matrices with integer values 0 through 9. If you treat the output as the “clean signal” and the input as the “corrupted signal,” the transformation rule becomes the corruption itself: a model that learns to remove structured, non-random noise has learned the underlying function. No one in the ARC community was discussing diffusion at that time; the dominant approaches were program synthesis, domain-specific languages, and large language models.
What I Built
ARC grids are single-channel matrices of integers 0–9, with height and width varying between 3 and 30. I created a harness that compressed these into standardized 16-channel, 16×16 representations. All diffusion work operated on these embeddings.
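A minimal sketch of one possible harness of this kind (the layout below, one-hot colors plus a validity mask with zero-padding, is illustrative rather than the exact scheme I used, and it only handles grids that already fit inside 16×16):

```python
# Minimal sketch of a grid-to-embedding harness (illustrative layout, not the exact one used):
# first 10 channels are a one-hot color encoding, channel 10 marks real cells, the rest are zero.
import torch
import torch.nn.functional as F

def embed_grid(grid: torch.Tensor, size: int = 16, channels: int = 16) -> torch.Tensor:
    """Map an (H, W) integer grid with values 0-9 into a fixed (channels, size, size) tensor."""
    h, w = grid.shape
    one_hot = F.one_hot(grid.long(), num_classes=10).permute(2, 0, 1).float()  # (10, H, W)
    emb = torch.zeros(channels, size, size)
    emb[:10, :h, :w] = one_hot   # colors
    emb[10, :h, :w] = 1.0        # validity mask: which cells are real, not padding
    return emb

# A 3x5 ARC grid becomes a 16-channel 16x16 embedding.
print(embed_grid(torch.randint(0, 10, (3, 5))).shape)  # torch.Size([16, 16, 16])
```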
The approach was latent diffusion: an autoencoder learned a latent space over the embeddings, and a UNet denoiser treated the encoded input as the noisy state and predicted the residual needed to reach the encoded output. It took six iterations to get a working version. Earlier attempts included deeper autoencoders, bigger latent dimensions, more UNet complexity, second-stage denoising models, and various combinations. Each failure taught something: bigger latents made the denoising problem exponentially harder regardless of autoencoder quality.
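A minimal sketch of that residual objective, assuming an already-trained `encoder` and a UNet `denoiser` (both names hypothetical):

```python
# Minimal sketch of the residual-prediction objective on latents.
# `encoder` and `denoiser` are assumed modules; the encoder is treated as frozen here.
import torch
import torch.nn.functional as F

def denoiser_loss(encoder, denoiser, input_emb, output_emb):
    with torch.no_grad():
        z_in = encoder(input_emb)    # latent of the "corrupted" signal (the task input)
        z_out = encoder(output_emb)  # latent of the "clean" signal (the task output)
    pred_residual = denoiser(z_in)   # predict the move from input latent to output latent
    return F.mse_loss(pred_residual, z_out - z_in)
```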
The breakthrough was representational restructuring. Earlier versions collapsed grids into flat 1×1 latent vectors, artificially reshaping them for UNet convolution. The successful version allowed the encoder’s stride-3 convolution to naturally produce 256×3×3 spatial latents, where each of the 9 latent positions corresponded to a 3×3 region of the 9×9 grid. This single change simplified the UNet from 5 levels to 2, dropped loss by an order of magnitude, improved autoencoder loss 100×, and solved a task.
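The spatial-latent idea is easy to see in code. A minimal sketch, assuming the 16-channel embedding at 9×9 resolution and the 256 latent channels mentioned above:

```python
# Minimal sketch: a single stride-3 convolution turns a 9x9 grid into a 3x3 spatial latent,
# so each latent position summarizes exactly one 3x3 region of the grid.
import torch
import torch.nn as nn

encoder = nn.Conv2d(in_channels=16, out_channels=256, kernel_size=3, stride=3)

x = torch.randn(1, 16, 9, 9)   # one embedded 9x9 grid (batch, channels, H, W)
z = encoder(x)
print(z.shape)                 # torch.Size([1, 256, 3, 3])
```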
About a week later, I removed the autoencoder entirely and had the UNet predict input-to-output residuals directly in pixel space. On already-solved tasks, the pixel-space version received raw tiled input that was structurally closer to the answer, yet the latent version performed better despite that head start, suggesting the learned latent space was contributing something meaningful.
The critical difference between versions 1–5 (flat latent, failed) and version 6 (spatial latent, solved a task).
The Clue Giver / Solver Architecture
Alongside the latent diffusion work, I designed a system called the clue giver and solver. The clue giver starts with the clean output image and progressively adds noise, one step at a time, until reaching the noisy input image. At each step, it lays down a clue that a solver can follow—like a maze where the clue giver starts at the exit and leaves a trail of breadcrumbs back to the entrance.
A timestep-conditioned UNet learns to reverse each step. Multiple solvers learn different reversal paths, creating an ensemble. At test time, the single test input goes to every solver, each producing a different candidate output because it learned a different reversal strategy. The most frequently occurring answer is selected.
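A minimal sketch of the voting step, assuming each solver is a timestep-conditioned residual predictor and that `decode` maps an embedding back to an integer grid (both signatures are assumptions):

```python
# Minimal sketch of ensemble inference with majority voting.
# Each solver walks the same test input back toward a clean embedding along its own strategy.
from collections import Counter
import torch

def ensemble_predict(solvers, test_input_emb, decode, steps=10):
    candidates = []
    for solver in solvers:
        x = test_input_emb.clone()
        with torch.no_grad():
            for t in reversed(range(steps)):
                x = x + solver(x, t)          # one timestep-conditioned reversal step (signature assumed)
        grid = decode(x)                      # back to an integer grid
        candidates.append(tuple(map(tuple, grid.tolist())))  # hashable, so it can be counted
    best, _ = Counter(candidates).most_common(1)[0]
    return best
```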
The initial clue giver was a convolutional network learning to generate context-aware noise. This proved overengineered. I replaced it with linear interpolation between clean and noisy embeddings, with element-wise sampling from a bounded normal distribution. In early timesteps, the distribution centers near the clean value with a long tail toward the noisy value. In later timesteps, it shifts toward the noisy value, creating a controlled stochastic walk from clean to noisy, with different random seeds producing different paths. Linear interpolation was later replaced with cosine scheduling.
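A minimal sketch of that stochastic walk. The clamped Gaussian here is a simplified stand-in for the bounded, skewed distribution described above, and the `jitter` scale is an assumption:

```python
# Minimal sketch of the clue giver's stochastic walk from clean to noisy.
import math
import torch

def clue_step(clean, noisy, t, T, jitter=0.1, cosine=True):
    """Sample the intermediate embedding at timestep t of T.
    Early t centers near `clean`; late t centers near `noisy`."""
    if cosine:
        alpha = 0.5 * (1.0 - math.cos(math.pi * t / T))  # cosine schedule, runs 0 -> 1
    else:
        alpha = t / T                                    # plain linear interpolation
    mean = (1.0 - alpha) * clean + alpha * noisy
    sample = mean + jitter * (noisy - clean) * torch.randn_like(mean)
    # Keep the walk bounded element-wise between the clean and noisy values.
    lo, hi = torch.minimum(clean, noisy), torch.maximum(clean, noisy)
    return torch.clamp(sample, lo, hi)

# Different seeds produce different paths; a solver trains to reverse the step from t to t-1.
```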
Training generates multiple stochastic paths between clean and noisy. Each solver learns a different reversal strategy. At inference, majority voting selects the consensus answer.
The Ideas Behind the Architecture
The most important concept: the clue giver’s stochastic sampling produces a different path from clean to noisy on each run. Different solvers train on different sampled paths and internalize different reversal strategies. This turns diffusion into a search over possible transformations. The randomness lives in training, not inference. Multiple solvers exploring different regions of strategy space do the same work that random restarts do in combinatorial search.
Standard diffusion uses Gaussian noise, but ARC transformations are structured, not random. I explored training a network to learn the noise addition process itself: given a clean embedding and target noisy embedding, learn a noise schedule that incrementally corrupts one into the other. Making the forward process learnable let the model discover transformation structure rather than imposing a noise assumption that doesn’t fit.
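A minimal sketch of what a learnable corruption network could look like; every module and shape choice here is an assumption:

```python
# Minimal sketch of a learnable forward (corruption) process: given the clean embedding,
# the target noisy embedding, and a timestep fraction, propose the intermediate state.
import torch
import torch.nn as nn

class LearnedCorruptor(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + 1, channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, clean, noisy, t_frac):
        t_map = torch.full_like(clean[:, :1], t_frac)      # timestep fraction as an extra channel
        delta = self.net(torch.cat([clean, noisy, t_map], dim=1))
        return clean + delta                               # proposed partially corrupted embedding
```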
I also experimented with branching model ensembles. After 5 epochs of training, the UNet was deep-copied into 3 forks, each with its first layer frozen. After 15 more epochs, each fork split again, ending with 9 models that share early features but diverge in higher-level strategies. This is adjacent to snapshot ensembles and population-based training, but it builds a branching tree through weight space in which the frozen layers keep the forks agreeing on low-level features while the rest of each network explores a different high-level strategy.
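A minimal sketch of the fork-and-freeze step (the epoch schedule follows the text; the freezing logic is assumed):

```python
# Minimal sketch of fork-and-freeze: deep-copy a trained model and freeze each copy's
# first layer so all forks keep agreeing on low-level features.
import copy

def fork(model, n_copies=3):
    forks = []
    for _ in range(n_copies):
        m = copy.deepcopy(model)
        first_layer = next(m.children())   # assumes the first child module is the first layer
        for p in first_layer.parameters():
            p.requires_grad = False
        forks.append(m)
    return forks

# Schedule: train 5 epochs, fork into 3; train 15 more epochs, fork each again (9 models total).
```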
Finally, I questioned why gradient descent optimizers hadn’t been applied to the diffusion inference process itself. The denoising trajectory iteratively refines an image, which is structurally similar to iterative parameter optimization. If you treat the pixel values as parameters being optimized toward a target, techniques like momentum or adaptive learning rates seem applicable. This wasn’t implemented but connects to later work on guided diffusion.
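Since this was never implemented, the sketch below is purely illustrative: it treats the per-step denoising direction like a gradient and smooths it with momentum.

```python
# Minimal sketch of optimizer-style inference: momentum applied to the denoising direction.
import torch

def denoise_with_momentum(denoiser, x, steps=10, lr=1.0, beta=0.9):
    velocity = torch.zeros_like(x)
    for _ in range(steps):
        with torch.no_grad():
            direction = denoiser(x)                  # predicted residual toward the clean image
        velocity = beta * velocity + (1 - beta) * direction
        x = x + lr * velocity                        # momentum-smoothed refinement step
    return x
```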
After shared initial training, the model forks into independent copies with frozen early layers. Each fork explores different high-level strategies while agreeing on low-level features.
Results & Reflections
The approach solved one task outright and came within two cells of solving a second. On one training example for the second task, the model captured the structural transformation correctly—colors mapped, spatial arrangement preserved, overall pattern right. But on the actual challenge input it was consistently off by two cells. Early convolutional layers in the UNet were losing too much spatial information during downsampling, and the embedding format was fragile—small errors in the embedding space cascaded into incorrect cell values after decoding. These are engineering problems, not conceptual ones.
Over the following year, diffusion for ARC became a real research thread. The ARChitects (ARC Prize 2025, 2nd place) used masked diffusion with 8B parameters. Their soft-masking approach turns every grid position into a partially noisy state, iterating and exploring solution space through stochastic refinement—a family resemblance to the random search over paths idea. Trelis Research found that majority voting over 72 augmented starting points improved scores by 20–30%, essentially the stochastic search idea.
The general direction was sound. Framing ARC as denoising was non-obvious in mid-2024 and proved productive. Per-task training with small models fit within the competition’s compute budget and became the dominant approach. My main concern looking back: convolutions might be too lossy before data reaches attention layers. That was the failure mode on the second task, and it might be fundamental rather than fixable, but creative workarounds are worth trying before accepting that. There’s still an itch to go back, run it properly across the full benchmark, and find out where it actually breaks.