
Open MM-RL Dataset: A benchmark for verifiable reasoning
A PhD-level multimodal STEM benchmark designed for deterministic, automatically gradable reasoning across physics, chemistry, biology, and mathematics, covering single-image, multi-panel, and multi-image tasks.
Beyond OCR. Beyond caption matching.
Existing multimodal benchmarks primarily evaluate perception or single-image QA. They do not measure a model’s ability to reason across structured visual inputs with objectively verifiable outcomes.
This dataset spans three input formats that escalate in visual complexity, enabling targeted analysis of where multimodal reasoning breaks down. Problems are self-contained, unambiguous, and built for verifiable answer checking at scale.
Each example has been reviewed twice by PhD-level domain specialists, with criteria covering prompt correctness, answer correctness, clarity of the implied reasoning path, and resistance to trivial lookup.
Every problem is built to the same standard

Four STEM domains, one dataset
- Physics: Quantum and Particle Physics; Condensed Matter and Materials; Electromagnetism, Photonics, and Plasma Systems; Astrophysics and Space Physics
- Mathematics: Algebra and Structure, Discrete Mathematics, Analysis and Continuous Mathematics, Probability and Geometry
- Biology: Evolutionary Systems, Molecular Mechanisms, Cellular Processes and Neural Biology
- Chemistry: Chemical Structure, Reaction Mechanisms, Synthesis, Spectroscopy and Properties
Built for the workflows that need it most
Because every answer is deterministic and programmatically checkable, the Open MM-RL Dataset fits naturally into training and evaluation pipelines where objective correctness is non-negotiable (a minimal reward-function sketch follows the list below).
- Reinforcement learning
- Outcome-supervised training
- Reward modeling
- Frontier model benchmarking
- Automated evaluation
- Multi-step reasoning research
- Failure mode analysis
- Visual grounding studies
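To make the fit concrete, here is a minimal sketch of how a deterministic answer check can serve as an outcome reward in an RL loop. The `check_answer` and `reward` functions, and the `final_answer` field name, are illustrative assumptions rather than the dataset's actual grading API.

```python
# Minimal sketch: turning a deterministically gradable answer into an outcome reward.
# check_answer is a hypothetical stand-in for the dataset's grader; the
# "final_answer" field name is an illustrative assumption, not the published schema.
import math

def check_answer(predicted: str, reference: str, rel_tol: float = 1e-6) -> bool:
    """Return True if the model's final answer matches the reference answer."""
    try:
        # Numeric comparison with a relative tolerance, when both sides parse as numbers.
        return math.isclose(float(predicted), float(reference), rel_tol=rel_tol)
    except ValueError:
        # Otherwise fall back to a normalized exact string match.
        return predicted.strip().lower() == reference.strip().lower()

def reward(model_output: str, task: dict) -> float:
    """Binary outcome reward: 1.0 for a verified answer, 0.0 otherwise."""
    return 1.0 if check_answer(model_output, task["final_answer"]) else 0.0

# Example with an illustrative task record:
task = {"final_answer": "42.0"}
print(reward("42.0000001", task))  # 1.0
print(reward("41", task))          # 0.0
```

The same binary signal can drive reward modeling or outcome-supervised filtering; the point is that no human judgment sits in the loop.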

Frontier-calibrated data, ready to deploy
- 3,000 OTS tasks coming soon
- Physics, Mathematics, Biology, and Chemistry
- Deterministic, auto-gradable answers throughout
- Single-image, multi-panel, and multi-image formats
- Two-round PhD expert review on every problem
- Structured JSON format for direct pipeline integration (example record after this list)
- 10,000+ tasks validated on Opus 4.5 and GPT-5.2 Thinking
- Available in 24 to 48 hours
- Scalable to 20,000+ task deployments in 90 days
- 40+ STEM subdomains
- Supports both pass@8 = 0 and 0 < pass@8 < 50% difficulty bands, depending on RL or SFT training requirements
- 100% original, search-resistant problem design
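The published schema is not reproduced here, so the record below is only a sketch of what a structured JSON task entry might look like; every field name (`task_id`, `domain`, `images`, `final_answer`, and so on) is an illustrative assumption.

```python
import json

# Hypothetical task record: every field name here is an illustrative assumption,
# not the dataset's published schema.
example_task = {
    "task_id": "phys-0001",
    "domain": "Physics",
    "subdomain": "Electromagnetism, Photonics, and Plasma Systems",
    "format": "multi-image",          # single-image | multi-panel | multi-image
    "prompt": "Given the two field maps shown, compute ...",
    "images": ["phys-0001_a.png", "phys-0001_b.png"],
    "final_answer": "3.2e-4",
    "answer_type": "numeric",         # numeric | symbolic | exact-string
    "tolerance": 1e-3,
    "units": "T",
}

print(json.dumps(example_task, indent=2))
```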
The process behind the data quality
Difficulty is calibrated against frontier model performance using headroom subsets, controlled low-pass RL bands, and a dense high-difficulty tail, so the data stays useful as models improve. Calibrated difficulty includes (a pass@k sketch follows this list):
- Headroom subsets with near-zero pass@8
- Controlled low-pass RL training bands
- Dense high-difficulty tail beyond public benchmarks
- Frontier-model performance calibration
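For reference, pass@k figures like the pass@8 bands above are usually computed with the standard unbiased estimator introduced with HumanEval (Chen et al., 2021): pass@k = 1 - C(n-c, k)/C(n, k) for n sampled completions with c correct. The snippet below implements that estimator; how completions are sampled for this dataset's calibration is not specified here and is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled completions, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per task, 1 correct -> estimated pass@8 of 0.5
print(pass_at_k(n=16, c=1, k=8))
```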
Each problem goes through independent SME review followed by multi-reviewer adjudication. Tasks that fail agreement thresholds are revised or removed, not rounded up to pass. Validation process includes:
- Independent PhD SME review
- Multi-reviewer adjudication per problem
- Correctness, clarity, and ambiguity checks
- Revision or removal for borderline cases
Problems are 100% original and Google-proof, delivered as structured JSON for direct integration. Every answer is programmatically checkable with no subjective judgment required. Automatic evaluation supports (a grading sketch follows this list):
- Normalized exact match
- Symbolic equivalence checks
- Numerical tolerance thresholds
- Unit-aware validation where applicable
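As a rough illustration, the grader below combines three of the checks listed above, assuming SymPy for symbolic equivalence; the function names and acceptance policy (any check passing counts as correct) are assumptions, not the dataset's actual evaluator.

```python
import math
import sympy as sp

def normalized_exact_match(pred: str, ref: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return " ".join(pred.lower().split()) == " ".join(ref.lower().split())

def numerically_close(pred: str, ref: str, rel_tol: float = 1e-4) -> bool:
    """True if both strings parse as numbers within a relative tolerance."""
    try:
        return math.isclose(float(pred), float(ref), rel_tol=rel_tol)
    except ValueError:
        return False

def symbolically_equivalent(pred: str, ref: str) -> bool:
    """True if two expressions simplify to the same value, e.g. 'x*(x+1)' vs 'x**2+x'."""
    try:
        return sp.simplify(sp.sympify(pred) - sp.sympify(ref)) == 0
    except (sp.SympifyError, TypeError):
        return False

def grade(pred: str, ref: str) -> bool:
    """Accept the prediction if any supported check passes."""
    return (normalized_exact_match(pred, ref)
            or numerically_close(pred, ref)
            or symbolically_equivalent(pred, ref))

print(grade("x**2 + x", "x*(x + 1)"))  # True (symbolic equivalence)
print(grade("0.2500", "1/4"))          # True (0.25 - 1/4 simplifies to 0)
```

Unit-aware validation would sit on top of this, normalizing quantities to a common unit before the numeric comparison.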


