Open Multimodal STEM Reasoning

Open-MM-RL Dataset: A benchmark for verifiable reasoning

A PhD-level multimodal STEM benchmark designed for deterministic, automatically gradable reasoning across physics, chemistry, biology, and mathematics, covering single-image, multi-panel, and multi-image tasks.

  • 4 STEM domains covered
  • 3,000 OTS tasks coming soon
  • 3 multimodal input formats
  • 100% verifiable answers
  • PhD target difficulty level
About the dataset

Beyond OCR. Beyond caption matching.

Existing multimodal benchmarks primarily evaluate perception or single-image QA. They do not measure a model’s ability to reason across structured visual inputs with objectively verifiable outcomes.

This dataset spans three input formats that escalate in visual complexity, enabling targeted analysis of where multimodal reasoning breaks down. Problems are self-contained, unambiguous, and built for verifiable answer checking at scale.

Each example has been reviewed twice by PhD-level domain specialists, with criteria covering prompt correctness, answer correctness, clarity of the implied reasoning path, and resistance to trivial lookup.

Single-image reasoning
One image paired with one question. The model must interpret the figure and derive a correct final answer.
Multi-panel image reasoning
Structured visual compositions such as panel sequences or compound figures. Models must relate information across panels before solving.
Multi-image reasoning
Multiple distinct images with one question. The most demanding format: relevant evidence is distributed across images rather than localized.
Core properties

Every problem is built to the same standard

1. Deterministic, auto-gradable answers
Every problem terminates in a single verifiable answer: numeric, symbolic, algebraic, or short text in LaTeX. No subjective interpretation is required, and no preference judgments are needed.
2. PhD-level STEM difficulty
Tasks reflect advanced reasoning at or near the PhD level, involving multi-step derivations, symbolic manipulation, and synthesis of distributed visual evidence.
3. Calibrated learning regime
Problems are not so easy that models saturate, and not so hard that all learning signal disappears. Difficulty varies across formats so stronger models still make measurable progress.
4. Caption-free by design
Unlike figure benchmarks that lean on captions, every example here is answered directly from the images and the question prompt alone.
5. Two-round expert review
PhD-level domain specialists review each problem for correctness, clarity, and originality, with a second round of adjudication before anything makes it into the dataset.
6. RL-ready reward structure
The input-output-reward structure maps cleanly to policy optimization, reward-guided fine-tuning, and outcome-supervised learning pipelines; a minimal reward sketch follows this list.
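As a concrete illustration, the sketch below shows how a task's inputs, a model's output, and a deterministic verifier could be wired into (input, output, reward) triples. The field names (images, question, answer) and the policy.generate interface are assumptions for illustration, not a published spec.

    # Minimal sketch, assuming hypothetical field names and a hypothetical
    # policy interface; not the published dataset spec or training harness.
    def reward(task: dict, model_answer: str) -> float:
        """Binary outcome reward: 1.0 if the deterministic check passes."""
        # A normalized string compare stands in for the full grader
        # (exact match, numeric tolerance, symbolic equivalence).
        ok = model_answer.strip().lower() == task["answer"].strip().lower()
        return 1.0 if ok else 0.0

    def collect_rollouts(policy, tasks: list[dict]) -> list[tuple[dict, str, float]]:
        """Produce (input, output, reward) triples for an RL or outcome-supervised loop."""
        triples = []
        for task in tasks:
            answer = policy.generate(task["images"], task["question"])  # hypothetical API
            triples.append((task, answer, reward(task, answer)))
        return triples

Because the reward is binary and computed from the gold answer alone, the same triples can feed policy-gradient optimization or be filtered for outcome-supervised fine-tuning.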
Subject coverage

Four STEM domains, one dataset

  1. Physics: Quantum and Particle Physics, Condensed Matter and Materials, Electromagnetism, Photonics, and Plasma Systems, Astrophysics and Space Physics
  2. Mathematics: Algebra and Structure, Discrete Mathematics, Analysis and Continuous Mathematics, Probability and Geometry
  3. Biology: Evolutionary Systems, Molecular Mechanisms, Cellular Processes and Neural Biology
  4. Chemistry: Chemical Structure, Reaction Mechanisms, Synthesis, Spectroscopy and Properties
Intended uses

Built for the workflows that need it most

Because every answer is deterministic and programmatically checkable, Open-MM-RL Dataset fits naturally into the training and evaluation pipelines where objective correctness is non-negotiable.

  1. Reinforcement learning
  2. Outcome-supervised training
  3. Reward modeling
  4. Frontier model benchmarking
  5. Automated evaluation
  6. Multi-step reasoning research
  7. Failure mode analysis
  8. Visual grounding studies
Data packs

Frontier-calibrated data, ready to deploy

Open-MM-RL Data Pack
Our first off-the-shelf multimodal STEM dataset, covering single-image, multi-panel, and multi-image reasoning tasks at PhD-level difficulty.

  1. 3,000 OTS tasks coming soon
  2. Physics, Mathematics, Biology, and Chemistry
  3. Deterministic, auto-gradable answers throughout
  4. Single-image, multi-panel, and multi-image formats
  5. Two-round PhD expert review on every problem
  6. Structured JSON format for direct pipeline integration (a record sketch follows the list)
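The exact schema is not published here, but assuming a JSONL layout, a single record might look like the sketch below; every field name is an illustrative assumption, not the actual Open-MM-RL schema.

    import json

    # Hypothetical record layout; all field names are assumptions for
    # illustration, not the published Open-MM-RL schema.
    record = json.loads("""
    {
      "id": "phys-0042",
      "domain": "physics",
      "format": "multi_image",
      "images": ["phys-0042_a.png", "phys-0042_b.png"],
      "question": "Using both figures, determine the decay constant in s^-1.",
      "answer": "3.2e-4",
      "answer_type": "numeric"
    }
    """)
    assert record["answer_type"] in {"numeric", "symbolic", "algebraic", "short_text"}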
HLE++ STEM Data Pack
Calibrated difficulty bands beyond baseline HLE, engineered to preserve measurable pass@k separation after HLE saturation.
  1. 10,000+ tasks validated on Opus 4.5 and GPT-5.2 Thinking
  2. Available in 24 to 48 hours
  3. Scalable to 20,000+ task deployments in 90 days
  4. 40+ STEM subdomains
  5. Supports both pass@8 = 0 and 0 < pass@8 < 50% bands, depending on the training requirements of RL or SFT (see the estimator sketch after this list)
  6. 100% original, search-resistant problem design
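For context on these bands: pass@k figures are typically computed with the unbiased estimator from Chen et al. (2021), pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them are correct. A minimal implementation:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate from n samples with c correct (Chen et al., 2021)."""
        if n - c < k:
            return 1.0  # every size-k subset must contain a correct sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 1 correct out of 32 samples gives pass@8 = 0.25,
    # which lands inside a 0 < pass@8 < 50% training band.
    print(pass_at_k(n=32, c=1, k=8))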
Why Turing

The process behind the data quality

Calibrated difficulty bands

Headroom subsets, controlled low-pass RL bands, and dense high-difficulty tails, calibrated against frontier model performance so the data stays useful as models improve. Calibrated difficulty includes the following; a band-assignment sketch follows the list:

  • Headroom subsets with near-zero pass@8
  • Controlled low-pass RL training bands
  • Dense high-difficulty tail beyond public benchmarks
  • Frontier-model performance calibration
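As a sketch only, band assignment from measured pass@8 rates could look like the following; the thresholds and labels are illustrative assumptions, not the actual calibration values.

    # Illustrative band assignment from frontier-model pass@8 estimates;
    # thresholds and labels are assumptions, not actual calibration values.
    def assign_band(pass_at_8: float) -> str:
        if pass_at_8 == 0.0:
            return "headroom"      # near-zero pass@8 subset
        if pass_at_8 < 0.5:
            return "low_pass_rl"   # controlled band with usable RL signal
        return "saturated"         # too easy to carry training signal

    measured = {"task-001": 0.0, "task-002": 0.25, "task-003": 0.9}
    bands = {tid: assign_band(p) for tid, p in measured.items()}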

Consensus-driven validation

Independent SME review followed by multi-reviewer adjudication. Tasks that fail agreement thresholds are revised or removed, not rounded up to pass. Validation process includes:

  • Independent PhD SME review
  • Multi-reviewer adjudication per problem
  • Correctness, clarity, and ambiguity checks
  • Revision or removal for borderline cases
Evaluation-safe by construction

100% original, Google-proof problems. Structured JSON output for direct integration. Every answer is programmatically checkable with no subjective judgment required. Automatic evaluation supports the following checks; a minimal grader sketch follows the list:

  • Normalized exact match
  • Symbolic equivalence checks
  • Numerical tolerance thresholds
  • Unit-aware validation where applicable
  • Short-text matching
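A minimal grader sketch layering the checks above, assuming sympy's LaTeX parser for symbolic equivalence; the tolerance value is illustrative, and this is not the actual evaluation harness.

    import sympy
    from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

    # Minimal sketch of a layered checker, not the production harness.
    def grade(prediction: str, gold: str, rel_tol: float = 1e-6) -> bool:
        pred, ref = prediction.strip().lower(), gold.strip().lower()
        if pred == ref:  # normalized exact match, covers short-text answers
            return True
        try:             # numeric comparison within a relative tolerance
            return abs(float(pred) - float(ref)) <= rel_tol * max(1.0, abs(float(ref)))
        except ValueError:
            pass
        try:             # symbolic equivalence for LaTeX expressions
            return sympy.simplify(parse_latex(prediction) - parse_latex(gold)) == 0
        except Exception:
            return False

Unit-aware checks would slot in as one more layer, for example via a units library such as pint, before the symbolic comparison.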