We trained a TRM on six Pencil Puzzle Bench (PPBench) puzzle types (sudoku, lightup, nurikabe, shakashaka, heyawake, and tapa) and analyzed its latent dynamics across supervision steps on held-out puzzles. By projecting the latent state at each step onto its principal plane (the plane spanned by its top two principal components), we observe three distinct trajectory modes:
- Quick success. The trajectory quickly converges to a good basin in latent space (a region where the decoded answer is correct).
- Delayed success. The trajectory oscillates in a bad basin (a region where the decoded answer is incorrect) for many supervision steps, then escapes to a good basin, where it converges.
- Failure. The trajectory oscillates in a bad basin without converging.
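The principal-plane projection behind these three modes could be sketched roughly as follows. This is a minimal illustration, not the pipeline used for the analysis: it assumes the latent states of one trajectory are stacked into a `(steps, dim)` array and that the principal components are fit on that single trajectory. The function name `project_to_principal_plane` and the random toy data are hypothetical.

```python
import numpy as np


def project_to_principal_plane(latents):
    """Project a (steps, dim) latent trajectory onto its top two principal components.

    Assumption: PCA is fit on this one trajectory; the original analysis may
    instead fit it on latents pooled across puzzles.
    """
    latents = np.asarray(latents, dtype=float)
    # Center each latent dimension so the SVD yields principal directions.
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of every step in the plane spanned by the first two directions.
    return centered @ vt[:2].T


# Toy stand-in for a real trajectory: 16 supervision steps, 64-dim latent state.
rng = np.random.default_rng(0)
trajectory = rng.normal(size=(16, 64))
plane = project_to_principal_plane(trajectory)
print(plane.shape)  # (16, 2)
```

Plotting the rows of `plane` in step order gives the 2D path whose shape (fast convergence, prolonged oscillation followed by escape, or oscillation without convergence) distinguishes the modes.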
As the delayed-success mode shows, trajectories that initially look like failures sometimes escape and reach a correct answer. This suggests that many failed trajectories may be stuck in escapable local optima.
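One way to tell the three modes apart without inspecting plots is to look only at whether the decoded answer is correct at each supervision step. The sketch below does that under stated assumptions: the name `label_trajectory` and the `quick_cutoff` threshold are hypothetical, and the boundary between quick and delayed success is a judgment call rather than something fixed by the analysis above.

```python
from typing import Sequence


def label_trajectory(correct_per_step: Sequence[bool], quick_cutoff: int = 3) -> str:
    """Label a trajectory from per-step correctness of the decoded answer.

    quick_cutoff is a hypothetical threshold: a trajectory that settles on a
    correct answer before this step index counts as a quick success.
    """
    # A trajectory that does not end on a correct answer never converged to a good basin.
    if len(correct_per_step) == 0 or not correct_per_step[-1]:
        return "failure"
    # Walk backward to the first step of the final unbroken run of correct answers.
    settle = len(correct_per_step) - 1
    while settle > 0 and correct_per_step[settle - 1]:
        settle -= 1
    return "quick success" if settle < quick_cutoff else "delayed success"


print(label_trajectory([False, True, True, True, True, True]))  # quick success
print(label_trajectory([False] * 10 + [True] * 2))              # delayed success
print(label_trajectory([False] * 12))                           # failure
```

Under this labeling, a delayed success truncated before its escape step is indistinguishable from a failure, which is the ambiguity the paragraph above points to.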