Probabilistic Tiny Recursive Model

¹Mila – Quebec AI Institute  ²ILLS & ETS Montreal  ³Independent
Figure: PTRM performance comparison across benchmarks

Abstract

Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of the parameters of modern large language models (LLMs) by iteratively refining a latent state and a final answer. While powerful, their deterministic recursion can converge to suboptimal solutions with no mechanism for escape. A common workaround relies on task-specific input perturbations at test time combined with answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-agnostic framework for test-time compute scaling that addresses this limitation through stochastic exploration. PTRM injects Gaussian noise at each deep recursion step, enabling parallel trajectories to explore diverse solution basins, and selects among them using the model's existing Q head (used for early stopping in the original TRM). Without retraining or task-specific augmentations, PTRM yields substantial accuracy gains across benchmarks, including Sudoku-Extreme (87.4% to 98.75%) and a range of puzzles from Pencil Puzzle Bench (62.6% to 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.

When does TRM fail?

We trained a TRM on various Pencil Puzzle Bench (PPBench) puzzles (sudoku, lightup, nurikabe, shakashaka, heyawake, and tapa) and analyzed its latent dynamics across supervision steps on held-out puzzles. By projecting the latent state at each step into its principal plane, we observe three distinct trajectory modes:

  • Quick success. The trajectory quickly converges to a good basin in latent space (region where the decoded answer is correct).
  • Delayed success. The trajectory oscillates in a bad basin for many supervision steps, then escapes to a good basin where it converges.
  • Failure. The trajectory oscillates in a bad basin without converging.

As shown by the delayed-success mode, trajectories that initially look like failures sometimes end up escaping and finding a correct answer. This suggests many failed trajectories could be stuck in escapable local optima.
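The three modes above can be made concrete with a small sketch. The modes are identified in the text from projected latent trajectories; the version below instead uses a simpler proxy, namely whether the decoded answer is correct at each supervision step. The function name `classify_trajectory` and the `quick_steps` threshold are hypothetical choices made for illustration only.

```python
from typing import Sequence


def classify_trajectory(correct_per_step: Sequence[bool], quick_steps: int = 2) -> str:
    """Label one trajectory from per-step correctness of its decoded answer.

    correct_per_step[t] is True when the answer decoded at supervision step t
    is correct. quick_steps is an assumed cutoff separating "quick" from
    "delayed" success; it is not a value taken from the source.
    """
    if not correct_per_step:
        raise ValueError("trajectory must contain at least one step")

    # A trajectory that ends on a wrong answer never settled in a good basin.
    if not correct_per_step[-1]:
        return "failure"

    # Walk backwards to find the first step of the final run of correct answers.
    settle = len(correct_per_step) - 1
    while settle > 0 and correct_per_step[settle - 1]:
        settle -= 1

    return "quick success" if settle < quick_steps else "delayed success"


examples = {
    "always correct": [True, True, True, True, True, True],
    "late escape": [False, False, False, False, True, True],
    "never settles": [False, True, False, False, False, False],
}
labels = {name: classify_trajectory(steps) for name, steps in examples.items()}
```

Under this proxy, a trajectory that is briefly correct but ends on a wrong answer still counts as a failure, which matches the description of oscillating without converging.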

The Q head often knows

TRM has a learned correctness signal it isn't using. The Q head is trained jointly with the model to predict whether the prediction at each supervision step is correct. It's traditionally only used during training for early stopping and discarded during inference. We observe that the Q head accurately separates correct from incorrect trajectories.
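One way to quantify how well a score separates correct from incorrect trajectories is the area under the ROC curve (AUROC): the probability that a randomly chosen correct trajectory receives a higher Q value than a randomly chosen incorrect one. The text does not state which measure was used, so AUROC is an assumption here, and the Q values below are made-up numbers for illustration.

```python
from typing import Sequence


def auroc(q_values: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Pairwise AUROC: 1.0 means perfect separation, 0.5 means chance level."""
    positives = [q for q, ok in zip(q_values, is_correct) if ok]
    negatives = [q for q, ok in zip(q_values, is_correct) if not ok]
    if not positives or not negatives:
        raise ValueError("need at least one correct and one incorrect trajectory")

    # Count correct/incorrect pairs ranked in the right order; ties count half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical Q values for four trajectories (two correct, two incorrect).
perfect = auroc([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
partial = auroc([0.9, 0.2, 0.6, 0.1], [True, True, False, False])
```

In the second call one correct trajectory scores below one incorrect trajectory, so three of the four pairs are ordered correctly.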

Two observations come together here. First, many failures might just be caused by the model being stuck in escapable local optima (as opposed to being inherently incapable of finding the correct answer). Second, the Q head can identify whether an answer is correct or not. A natural question thus follows: can we sample multiple trajectories for the same puzzle and pick one based on the Q head output?

PTRM: Stochastic Rollouts + Q Selection

PTRM consists of a simple change to TRM inference: (1) inject Gaussian noise of scale σ into the latent state at every recursion step, and (2) run K rollouts per puzzle in parallel, then select the rollout whose final state has the highest Q value. The recurrent noise causes rollouts to diverge into different basins and the Q head picks the winner. This method doesn't require any training changes or task-specific augmentation.

PTRM Inference
Input: puzzle x, rollouts K,
       supervision steps D, noise scale σ

for k = 1, …, K in parallel:
   initialize z₀⁽ᵏ⁾, y₀⁽ᵏ⁾
   for t = 1, …, D:
      z₍ₜ₋₁₎⁽ᵏ⁾ ← z₍ₜ₋₁₎⁽ᵏ⁾ + ε,  ε ~ 𝒩(0, σ²I)
      zₜ⁽ᵏ⁾, yₜ⁽ᵏ⁾ ← rec(x, z₍ₜ₋₁₎⁽ᵏ⁾, y₍ₜ₋₁₎⁽ᵏ⁾)
   ŷ⁽ᵏ⁾ ← argmax f_O(y_D⁽ᵏ⁾)
   q̂⁽ᵏ⁾ ← f_Q(y_D⁽ᵏ⁾)
return ŷ⁽ᵏ*⁾, where k* = argmax_k q̂⁽ᵏ⁾

Figure: PCA projection of K rollouts on a previously failed puzzle
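The inference procedure can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `rec`, `f_out`, `f_q`, and `init_state` are hypothetical callables standing in for the trained TRM recursion, output head, Q head, and state initialization, and the rollouts run sequentially here although they are independent and could be batched. The toy functions below define a one-dimensional latent with a bad basin at -1 and a good basin at +1, purely to show noise letting some rollouts escape.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def ptrm_inference(x, rec: Callable, f_out: Callable, f_q: Callable,
                   init_state: Callable[[], Tuple[Vector, Vector]],
                   num_rollouts: int, num_steps: int, sigma: float,
                   seed: int = 0):
    """Run K noisy rollouts and return (answer, Q value) of the best-Q rollout."""
    rng = random.Random(seed)
    best_answer, best_q = None, float("-inf")
    for _ in range(num_rollouts):          # k = 1..K (independent rollouts)
        z, y = init_state()
        for _ in range(num_steps):         # t = 1..D supervision steps
            # Inject isotropic Gaussian noise into the latent state.
            z = [z_i + rng.gauss(0.0, sigma) for z_i in z]
            z, y = rec(x, z, y)
        q = f_q(y)
        if q > best_q:                     # keep the rollout with highest Q
            best_answer, best_q = f_out(y), q
    return best_answer, best_q


# --- Toy stand-ins (assumptions for illustration, not the real model) ---
def toy_rec(x, z, y):
    # Pull the latent halfway toward the attractor of its current basin.
    attractor = 1.0 if z[0] > 0 else -1.0
    z_new = [z[0] + 0.5 * (attractor - z[0])]
    return z_new, list(z_new)


def toy_init():
    return [-0.2], [-0.2]                  # starts inside the bad basin


def toy_out(y):
    return "correct" if y[0] > 0 else "wrong"


def toy_q(y):
    return y[0]                            # toy Q head: higher in the good basin


deterministic = ptrm_inference(None, toy_rec, toy_out, toy_q, toy_init,
                               num_rollouts=1, num_steps=8, sigma=0.0)
stochastic = ptrm_inference(None, toy_rec, toy_out, toy_q, toy_init,
                            num_rollouts=64, num_steps=8, sigma=0.3)
```

With `sigma=0.0` the single rollout stays in the bad basin. With noise and 64 rollouts, some trajectories cross into the good basin and the toy Q head selects one of them.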

Results

Width scaling axis

Our method unlocks a new test-time scaling axis (by scaling the number of rollouts K) that complements TRM's existing depth axis. Both pass@K (oracle selection) and best-Q@K (Q head selection) rise smoothly with K. On PPBench puzzles, the Q head selection nearly matches the oracle verifier, while mode@K (most frequent answer) stays mostly flat.
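The three selection rules compared above differ only in how one answer is chosen from the K rollouts of a puzzle. The sketch below spells out each rule for a single puzzle; the function names and the example answers and Q values are hypothetical and chosen only to show how the rules can disagree.

```python
from collections import Counter
from typing import Hashable, Sequence


def pass_at_k(answers: Sequence[Hashable], target: Hashable) -> bool:
    """Oracle selection: success if any rollout produced the correct answer."""
    return any(a == target for a in answers)


def best_q_at_k(answers: Sequence[Hashable], q_values: Sequence[float],
                target: Hashable) -> bool:
    """Q head selection: success if the highest-Q rollout is correct."""
    best = max(range(len(answers)), key=lambda k: q_values[k])
    return answers[best] == target


def mode_at_k(answers: Sequence[Hashable], target: Hashable) -> bool:
    """Voting: success if the most frequent answer is correct."""
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer == target


# Made-up example: three rollouts agree on a wrong answer "A", while one
# rollout escapes to the correct answer "B" and receives the highest Q value.
answers = ["A", "A", "B", "A"]
q_values = [0.1, 0.2, 0.9, 0.3]
target = "B"

results = {
    "pass@K": pass_at_k(answers, target),
    "best-Q@K": best_q_at_k(answers, q_values, target),
    "mode@K": mode_at_k(answers, target),
}
```

In this example the oracle and the Q head both recover the correct answer, while voting returns the majority answer from the bad basin. This is one way that mode@K can stay flat while the other two curves rise with K.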

Figure: Width scaling, pass@K and best-Q@K

Per-puzzle accuracy on PPBench

Harder puzzle types where the deterministic baseline performs poorly (e.g., sudoku and tapa) see the biggest improvements.

Method                           Sudoku  Lightup  Nurikabe  Heyawake  Tapa  Aggregate
Direct prediction (27M)          0.0     0.0      0.0       14.3      0.0   2.0
TRM, deterministic (7M)          46.7    87.5     74.1      85.7      40.0  62.6
Strongest LLM (claude-opus-4-6)  0.0     87.5     44.4      0.0       60.0  34.7
LLM ensemble                     46.7    100      44.4      0.0       80.0  55.1
PTRM (ours, 7M)                  97.8    100      88.9      85.7      80.0  91.2

Accuracy (%) on a subset of the PPBench golden set. The LLM ensemble row assumes a perfect verifier across the seven strongest LLMs.

Beyond PPBench

The same recipe transfers to other benchmarks:

Benchmark          TRM    PTRM   Gain
Sudoku-Extreme     87.3   98.75  +11.5 pp
Maze-Hard          83.8   86.7   +2.9 pp
ARC-AGI-2 pass@1   7.36   8.47   +1.1 pp

BibTeX

@misc{sghaier2026probabilistictinyrecursivemodel,
      title={Probabilistic Tiny Recursive Model}, 
      author={Amin Sghaier and Ali Parviz and Alexia Jolicoeur-Martineau},
      year={2026},
      eprint={2605.19943},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.19943}, 
}