Introduction

How can we verify complex model outputs at scale without expert-labeled data? In this blog post, I share some early results from augmenting unstructured internet data with a weak model and then relying on the generalization ability of frontier models to solve the task we actually care about. I'll also talk about how Thinking Machines' new reinforcement learning API, Tinker, enabled this work.

Specifically, can we take artifacts (proofs, scientific papers, novels, ...) written by top human experts, add subtle mistakes, and then have a model detect them? Does this result in a good proxy for an expert judge? Fundamentally, this relies on the observation that adding mistakes is simpler than detecting them. In this setup, we inject subtle logical flaws into proofs from the internet and train LLMs with reinforcement learning on their chain of thought to detect them.

We see that training OpenAI's near-frontier reasoning model o4-mini this way improves its ability, as graded against professional mathematicians, to score solutions from other frontier models (Grok, Gemini, etc.) on International Mathematical Olympiad (IMO) and the easier United States of America Mathematical Olympiad (USAMO) questions. Interestingly, this does not seem to work with weaker open-source models. The o4-mini verifier trained with RL achieves similar performance to o3, and when deployed in a Best-of-N setup with o3 it boosts o3's USAMO score by 32%.

Next-token prediction on the internet, spending a fixed amount of compute on every readily accessible token, is one of the most naive ways to learn from the enormous wealth of human knowledge. The hope with this blog post is to show that, at least for math proof verification, a simple augmentation (adding mistakes and then detecting them) is enough to significantly improve performance. I expect that using a model to build datasets like this from existing information will be a large driver of progress in the coming years, and I speculate that these automated RL-environment-building SWEs will contribute more to AI capability gains in the next 1-3 years than automating more traditional forms of ML engineering, such as kernel design.

Building Async RL Scripts in Tinker

Figure: GSPO mean reward vs. step for a test GSPO run.

Tinker allows you to sample from a range of open-source models and update their weights using LoRA and SFT. It sits between OpenAI's RFT API and training a model on your own compute in terms of abstraction. Tinker's magic lies in the fact that you can perform local computation for the loss function using the model's log probabilities, then send the gradients back to Tinker for full backpropagation on the entire model.

Tinker handles compute allocation for inference and gradient updates while allowing you to implement multi-agent setups, bespoke reward functions (e.g., testing how fast a model-written kernel actually runs on a real GPU), and use custom tools—all without the hassle of managing large GPU clusters, while benefiting from economies of scale.

Implementing GSPO in Tinker

For this project, I implemented Group Sequence Policy Optimization (GSPO) [2], which doesn't exist in the Tinker cookbook as of writing. Writing GSPO from scratch in Tinker requires some PyTorch and async Python knowledge, but is much simpler than implementing it on bare metal.
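
For reference, GSPO [2] computes a length-normalized, sequence-level importance ratio and applies a PPO-style clipped surrogate to it with group-normalized advantages (notation lightly simplified):

$$s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|}, \qquad \hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}$$

$$J_{\text{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\left(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\!\left(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i\right)\right]$$

where $G$ is the number of responses sampled per prompt and $r_i$ is the reward of response $y_i$.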

The implementation pattern generally follows: a function that does inference and computes rewards, one that turns that into the GSPO loss, and utility code for tracking checkpoints. My final implementation was about 500 lines total. I've open-sourced the trainer for others to use.
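
To make the loss piece concrete, here is a minimal sketch of the sequence-level GSPO loss in plain PyTorch. It is illustrative rather than a copy of my trainer: it assumes you already have, per response in a group, the summed log-probabilities under the current and sampling policies, the generated length, and a scalar reward.

```python
import torch

def gspo_loss(policy_logprobs: torch.Tensor,
              sampler_logprobs: torch.Tensor,
              rewards: torch.Tensor,
              seq_lens: torch.Tensor,
              clip_eps: float = 3e-4) -> torch.Tensor:
    """Sequence-level GSPO loss for one group of G responses to the same prompt.

    policy_logprobs:  (G,) summed per-token log-probs under the current policy
    sampler_logprobs: (G,) summed per-token log-probs under the sampling policy
    rewards:          (G,) scalar reward per response
    seq_lens:         (G,) number of generated tokens per response
    """
    # Group-normalised advantages, as in GRPO/GSPO.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Length-normalised sequence-level importance ratio s_i(theta).
    log_ratio = (policy_logprobs - sampler_logprobs.detach()) / seq_lens
    ratio = torch.exp(log_ratio)

    # PPO-style clipped surrogate on the sequence ratio; because the ratio is
    # length-normalised, GSPO uses a much tighter clipping range than GRPO
    # (the exact value here is a hyperparameter, not a prescription).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```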

Tinker vs. Other Platforms

Compared to OpenAI's RFT API, Tinker requires additional work around debugging and carefully tuning hyperparameters for smaller models. The available models are currently behind the frontier, and open-source models trained with RL on chain-of-thought tend to be more over-optimized for benchmarks than their proprietary counterparts, creating a wider performance gap than expected.

For Tinker to gain wider adoption, I suspect higher-level libraries will need development, similar to the existing Cookbook. This is analogous to how LangChain provides standard templates for building LLM workflows on top of APIs from frontier labs.

Synthetic Data for Training Judge Models

Data Generation Pipeline

The approach begins with 2,000 real mathematical proofs from ProofWiki (sourced from the NaturalProofs dataset). GPT-4.1, which is weaker at maths than both Qwen models and o4-mini, is used to generate the training data.

  1. Strategic Error Injection: I used GPT-4.1 to inject subtle mathematical mistakes into proofs using chain-of-thought reasoning, creating realistic incorrect proofs that maintain surface-level coherence (see the sketch after this list).
  2. Difficulty Calibration: GPT-4.1 rates each proof's difficulty on a 1-5 scale. I focused the training exclusively on difficulty level 2 as that subset appeared most correlated with overall model performance in early experiments.
  3. No IMO-Specific Data: Crucially, our training set contains no IMO-specific problems, testing the model's ability to generalize from general mathematical proofs to competition problems.
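
A rough sketch of the injection and rating calls, using the OpenAI chat completions client. The prompts here are illustrative paraphrases rather than the exact ones I used, and a real pipeline would also record where the error was injected so the target label can be checked later.

```python
from openai import OpenAI

client = OpenAI()

INJECT_PROMPT = (
    "You will be given a correct mathematical proof. Think step by step about "
    "where a subtle logical error could be introduced, then rewrite the proof "
    "with exactly one such error while keeping the surface style unchanged."
)

RATE_PROMPT = (
    "Rate how difficult it would be to spot the injected error in this proof "
    "on a 1-5 scale. Respond with a single integer."
)

def inject_error(proof: str) -> str:
    # Ask GPT-4.1 to produce a flawed-but-plausible version of the proof.
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "system", "content": INJECT_PROMPT},
                  {"role": "user", "content": proof}],
    )
    return resp.choices[0].message.content

def rate_difficulty(flawed_proof: str) -> int:
    # Ask GPT-4.1 to rate how subtle the injected error is (1-5).
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "system", "content": RATE_PROMPT},
                  {"role": "user", "content": flawed_proof}],
    )
    return int(resp.choices[0].message.content.strip())
```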

Training Setup

I used OpenAI's RFT API to train o4-mini with reinforcement learning, and Tinker for Qwen 3 8B, Qwen 3 30B, and Llama 3.3 70B, tasking each model with giving a calibrated confidence score. My reward function is based on rewarding doubt and calibrated confidence expression [3]:

$$R = \begin{cases} 1 - \text{MAE} & \text{if response is properly formatted} \\ -0.2 & \text{if formatting is incorrect} \end{cases}$$

where $\text{MAE} = \frac{|\text{predicted} - \text{target}|}{100}$ is the absolute error between the predicted and target confidence, normalized to [0,1], giving a reward range of [-0.2, 1.0]. The small penalty for incorrect formatting encourages proper output structure for weaker models.
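
In code the reward is tiny. A sketch, assuming the verifier is asked to end its response with a line like `CONFIDENCE: 83` (the exact output format here is my illustrative assumption):

```python
import re

FORMAT_PENALTY = -0.2

def reward(response: str, target: int) -> float:
    """target: ground-truth confidence in [0, 100]."""
    match = re.search(r"CONFIDENCE:\s*(\d{1,3})\s*$", response.strip())
    if not match:
        return FORMAT_PENALTY            # malformed output
    predicted = int(match.group(1))
    if predicted > 100:
        return FORMAT_PENALTY            # out-of-range confidence
    mae = abs(predicted - target) / 100  # normalized to [0, 1]
    return 1.0 - mae                     # in [0, 1]
```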

Results In-Distribution

All models showed smooth, stable improvements over the course of training, with validation rewards going up by 13% for o4-mini, 23% for Qwen 3 8B, 17.5% for Qwen 3 30B, and 10.2% for Llama 3.3 70B. Training each Tinker model takes several hours for only 55 steps; improving the speed of the infrastructure remains a priority for doing high-compute RL on open-source models. The OpenAI RFT API has very good default hyperparameters and needed no tuning, while getting good results with GSPO on the Tinker models took about 15 attempts.

Expert Grading Data

To evaluate our trained verifiers on out-of-distribution mathematical problems, I used expert-graded datasets from MathArena [4].

Specifically, I used two datasets from their HuggingFace repository:

  • USAMO 2025: All six problems from the 2025 USA Mathematical Olympiad, evaluated within hours of release by expert human annotators who assessed full reasoning traces rather than just final answers [5].
  • IMO 2025: Problems from the 2025 International Mathematical Olympiad, independently graded by multiple judges following detailed rubrics that penalize logical errors, skipped steps, and unjustified claims. Each problem is scored out of 7 points following official IMO standards.

These datasets include problem statements, model-generated solutions from frontier LLMs, and expert grading with detailed rubrics. This setup tests whether our verifier, trained only on general mathematical proofs with GPT-4.1-injected errors, can generalize to competition-level problems graded by professional mathematicians.

Results on Out-of-Distribution Math Competition Grading

The verifier outputs a confidence score from 0 to 100%, which is linearly scaled to match the 0-7 point grading scale used in both USAMO and IMO competitions.
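
Concretely, the scaling is just linear (a one-line sketch; the 0-100 confidence output is the same assumption as in the reward function above):

```python
def confidence_to_points(confidence: float, max_points: float = 7.0) -> float:
    """Map a 0-100 verifier confidence onto the 0-7 competition grading scale."""
    return confidence / 100.0 * max_points
```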

USAMO Grading Performance

Figure: USAMO grading error vs. human experts (lower is better).

The RL-trained o4-mini achieves an average total score difference of 5.91 points from human expert grading, outperforming even o3 (high) and representing a 46% improvement over the base o4-mini model. Qwen 3 30B w/ RL shows a small improvement (+1.8%), while Qwen 3 8B w/ RL and Llama 3.3 70B w/ RL improve only marginally, likely due to the limited RL compute and weaker base models. Note that the total score difference ranges from 0 to 42 points, as both USAMO and IMO contain six problems with a maximum of 7 points per question.

Figure: Per-model USAMO performance when graded by o4-mini (high) w/ RL.

o4-mini w/ RL is generally worse at grading stronger models, likely because they produce more complex solutions with fewer obvious errors.

IMO Grading Performance

Figure: IMO grading error vs. human experts (lower is better).

On IMO problems, o4-mini w/ RL performs well, beating the base model but this time falling short of o3. The open-source models (Qwen 3 8B, Qwen 3 30B, and Llama 3.3 70B) show minimal improvements.

Figure: Per-model IMO performance when graded by o4-mini (high) w/ RL.

Again, o4-mini w/ RL does better at grading weaker models. Interestingly, o4-mini w/ RL seems to overrate the performance of other o-series models, particularly itself. I speculate that this may be an artifact of multi-turn post-training, where the model learns to recognize its own style and to believe it is right, playing an "oracle AI" persona and generally being overconfident. Prior work has explored models preferring their own outputs [6].

Best-of-N Sampling

An interesting application of the trained verifier is Best-of-N (BoN) sampling, which approximates the benefits of RL with much lower computational cost and engineering effort. Best-of-N works by sampling the model N times and then selecting the solution the verifier is most confident in. So far we've only shown relative performance differences between verifier models, but BoN lets us show that the verifier leads to meaningful end-to-end gains. Only the two strongest solver models, o3 and Gemini 2.5 Pro, are used for this experiment; Qwen 3 8B is omitted as a verifier due to its weak results above.
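
A minimal sketch of the selection step, where `verifier_confidence` stands in for a call to the trained verifier (a hypothetical helper, not a real API):

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str],
              verifier_confidence: Callable[[str], float]) -> str:
    """Return the candidate solution the verifier is most confident in."""
    return max(candidates, key=verifier_confidence)
```

With Best-of-4, this costs four solver samples plus four verifier calls per problem, far cheaper than running RL on the solver itself.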

Figure: Best-of-4 sampling improvement on USAMO: score gains for o3 (high) and Gemini 2.5 Pro.

o4-mini w/ RL improves the performance of Gemini and o3 by 46% and 32% respectively. Interestingly, o4-mini base and GPT-4.1 actually degrade performance.

Figure: Best-of-4 sampling improvement on IMO: score gains for o3 (high) and Gemini 2.5 Pro.

As before, it's notable that o3 (high) degrades performance when grading itself.

Discussion

It is notable that strong generalization only occurs with o4-mini. This increases my confidence that we're not just activating a small subset of features discovered in pre-training, or simply making the model less confident in general. My best guess is that the metacognitive techniques (backtracking, self-checking, etc.) that o4-mini was taught during its high-compute RL phase are useful for solution grading, but that it doesn't apply them by default, and that low-compute RL is enough to change this. Without access to o4-mini's internal chain of thought, this hypothesis is difficult to verify. As more powerful open-source models become available, I hope future work can build a clearer picture of the mechanisms behind this emergent property.

Limitations

  • Weak results with the open-source models (Qwen 3 8B, Qwen 3 30B, Llama 3.3 70B) indicate this kind of generalization may require strong base models that learn reasoning in the abstract.
  • Running high-compute RL against such a verifier might eventually break down due to adversarial inputs. Plausibly, further test-time scaling of the verifier could mitigate or solve this.
  • It's unclear how scaling the RL compute of the verifier would improve performance. If performance plateaus, this idea might not be that useful in practice.

Pre-training radically underutilizes the richness of the web

This section represents my personal opinion and is only weakly directionally implied by the empirical evidence presented.

Traditional pre-training treats all internet data equally—Einstein's Special Relativity paper receives the same computational attention as a Reddit thread. This is fundamentally inefficient.

Now that foundation models have achieved baseline capabilities similar to those of a junior white-collar worker in some domains, we should aim for autonomous RL-environment-building SWEs that can independently browse the web, create datasets (like this proof verification dataset), and test whether they improve model capabilities. Reasoning is fundamentally general-purpose, so the hope is that each additional RL environment boosts general model capabilities. This contrasts somewhat with the prevailing view that recursive self-improvement will be driven by improved optimizer, kernel, and architecture research done by AI versions of human researchers. It's not that I believe the latter won't happen, but that the former may be a greater driver of progress.

Some ideas for building interesting RL environments from the web using weak models:

  • Train models to reproduce all PRs in all GitHub repositories. You can use a weaker coding agent to create the Docker environments and check test coverage.
  • Inject subtle statistical or conceptual errors in scientific papers and have models detect them.
  • Take each saved version of a spreadsheet and train models to reproduce each step. Again, a basic web agent is very useful for doing the manual work of finding and cleaning this data at scale.

Conclusion

As we push toward more capable AI systems, oversight becomes increasingly difficult. This work shows some very early evidence that we can build reliable verification systems without relying on scarce expert labels. I believe some of the ideas are quite general, such as leveraging the asymmetry between adding and detecting mistakes, the broad generalization of strong reasoning models, and the fact that this work could be done by an automated SWE in the next 1-3 years.

References

[1] Kirchner, J. H., Chen, Y., Edwards, H., Leike, J., McAleese, N., & Burda, Y. (2024). Prover-Verifier Games improve legibility of LLM outputs. arXiv preprint arXiv:2407.13692.
[2] Zheng, C., Liu, S., Li, M., Chen, X. H., Yu, B., Gao, C., ... & Lin, J. (2025). Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.
[3] Stangel, P., Bani-Harouni, D., Pellegrini, C., Özsoy, E., Zaripova, K., Keicher, M., & Navab, N. (2025). Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models. arXiv preprint arXiv:2503.02623.
[4] Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., & Vechev, M. (2025). Math Arena: Evaluating LLMs on Latest Math Competitions. ETH Zurich, SRI Lab. https://matharena.ai/
[5] Petrov, I., Dekoninck, J., Baltadzhiev, L., Drencheva, M., Minchev, K., Balunović, M., Jovanović, N., & Vechev, M. (2025). Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad. arXiv preprint arXiv:2503.21934.
[6] Panickssery, A., Bowman, S. R., & Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. arXiv preprint arXiv:2404.13076.