Open Multimodal STEM Reasoning

Open-MM-RL Dataset: A benchmark for verifiable reasoning

A PhD-level multimodal STEM benchmark designed for deterministic, automatically gradable reasoning across physics, chemistry, biology, and mathematics, covering single-image, multi-panel, and multi-image tasks.

  • 4 STEM domains covered
  • 3,000 OTS tasks coming soon
  • 3 multimodal input formats
  • 100% verifiable answers
  • PhD target difficulty level
About the dataset

Beyond OCR. Beyond caption matching.

Existing multimodal benchmarks primarily evaluate perception or single-image QA. They do not measure a model’s ability to reason across structured visual inputs with objectively verifiable outcomes.

This dataset spans three input formats that escalate in visual complexity, enabling targeted analysis of where multimodal reasoning breaks down. Problems are self-contained, unambiguous, and built for verifiable answer checking at scale.

Each example has been reviewed twice by PhD-level domain specialists, with criteria covering prompt correctness, answer correctness, clarity of the implied reasoning path, and resistance to trivial lookup.

Single-image reasoning
One image paired with one question. The model must interpret the figure and derive a correct final answer.
Multi-panel image reasoning
Structured visual compositions such as panel sequences or compound figures. Models must relate information across panels before solving.
Multi-image reasoning
Multiple distinct images with one question. The most demanding format: relevant evidence is distributed across images rather than localized.
Core properties

Every problem is built to the same standard

1. Deterministic, auto-gradable answers
Every problem terminates in a single verifiable answer: numeric, symbolic, algebraic, or short text in LaTeX. No subjective interpretation is required, and no preference judgments are needed.
2. PhD-level STEM difficulty
Tasks reflect advanced reasoning at or near the PhD level, involving multi-step derivations, symbolic manipulation, and synthesis of distributed visual evidence.
3. Calibrated learning regime
Problems are not so easy that models saturate, and not so hard that all learning signal disappears. Difficulty varies across formats so stronger models still make measurable progress.
4. Caption-free by design
Unlike figure benchmarks that lean on captions, every example here is answered directly from the images and the question prompt alone.
5. Two-round expert review
PhD-level domain specialists review each problem for correctness, clarity, and originality, with a second round of adjudication before anything makes it into the dataset.
6. RL-ready reward structure
The input-output-reward structure maps cleanly to policy optimization, reward-guided fine-tuning, and outcome-supervised learning pipelines; a minimal reward sketch follows this list.
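As a concrete illustration, the sketch below shows how a task's inputs, a model's output, and a deterministic verifier could be wired into (input, output, reward) triples. The field names (images, question, answer) and the policy.generate interface are assumptions for illustration, not a published spec.

    # Minimal sketch, assuming hypothetical field names and a hypothetical
    # policy interface; not the published dataset spec or training harness.
    def reward(task: dict, model_answer: str) -> float:
        """Binary outcome reward: 1.0 if the deterministic check passes."""
        # A normalized string compare stands in for the full grader
        # (exact match, numeric tolerance, symbolic equivalence).
        ok = model_answer.strip().lower() == task["answer"].strip().lower()
        return 1.0 if ok else 0.0

    def collect_rollouts(policy, tasks: list[dict]) -> list[tuple[dict, str, float]]:
        """Produce (input, output, reward) triples for an RL or outcome-supervised loop."""
        triples = []
        for task in tasks:
            answer = policy.generate(task["images"], task["question"])  # hypothetical API
            triples.append((task, answer, reward(task, answer)))
        return triples

Because the reward is binary and computed from the gold answer alone, the same triples can feed policy-gradient optimization or be filtered for outcome-supervised fine-tuning.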
Subject coverage

Four STEM domains, one dataset

  1. Physics: Quantum and Particle Physics, Condensed Matter and Materials, Electromagnetism, Photonics, and Plasma Systems, Astrophysics and Space Physics
  2. Mathematics: Algebra and Structure, Discrete Mathematics, Analysis and Continuous Mathematics, Probability and Geometry
  3. Biology: Evolutionary Systems, Molecular Mechanisms, Cellular Processes and Neural Biology
  4. Chemistry: Chemical Structure, Reaction Mechanisms, Synthesis, Spectroscopy and Properties
Intended uses

Built for the workflows that need it most

Because every answer is deterministic and programmatically checkable, Open-MM-RL Dataset fits naturally into the training and evaluation pipelines where objective correctness is non-negotiable.

  1. Reinforcement learning
  2. Outcome-supervised training
  3. Reward modeling
  4. Frontier model benchmarking
  5. Automated evaluation
  6. Multi-step reasoning research
  7. Failure mode analysis
  8. Visual grounding studies
Data packs

Frontier-calibrated data, ready to deploy

Open-MM-RL Data Pack
Our first off-the-shelf multimodal STEM dataset, covering single-image, multi-panel, and multi-image reasoning tasks at PhD-level difficulty.

  1. 3,000 OTS tasks coming soon
  2. Physics, Mathematics, Biology, and Chemistry
  3. Deterministic, auto-gradable answers throughout
  4. Single-image, multi-panel, and multi-image formats
  5. Two-round PhD expert review on every problem
  6. Structured JSON format for direct pipeline integration (a record sketch follows the list)
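The exact schema is not published here, but assuming a JSONL layout, a single record might look like the sketch below; every field name is an illustrative assumption, not the actual Open-MM-RL schema.

    import json

    # Hypothetical record layout; all field names are assumptions for
    # illustration, not the published Open-MM-RL schema.
    record = json.loads("""
    {
      "id": "phys-0042",
      "domain": "physics",
      "format": "multi_image",
      "images": ["phys-0042_a.png", "phys-0042_b.png"],
      "question": "Using both figures, determine the decay constant in s^-1.",
      "answer": "3.2e-4",
      "answer_type": "numeric"
    }
    """)
    assert record["answer_type"] in {"numeric", "symbolic", "algebraic", "short_text"}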
HLE++ STEM Data Pack
Calibrated difficulty bands beyond baseline HLE, engineered to preserve measurable pass@k separation after HLE saturation.
  1. 10,000+ tasks validated on Opus 4.5 and GPT-5.2 Thinking
  2. Available in 24 to 48 hours
  3. Scalable to 20,000+ task deployments in 90 days
  4. 40+ STEM subdomains
  5. Supports both pass@8 = 0 and 0 < pass@8 < 50% bands, depending on the training requirements of RL or SFT (see the estimator sketch after this list)
  6. 100% original, search-resistant problem design
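For context on these bands: pass@k figures are typically computed with the unbiased estimator from Chen et al. (2021), pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them are correct. A minimal implementation:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate from n samples with c correct (Chen et al., 2021)."""
        if n - c < k:
            return 1.0  # every size-k subset must contain a correct sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 1 correct out of 32 samples gives pass@8 = 0.25,
    # which lands inside a 0 < pass@8 < 50% training band.
    print(pass_at_k(n=32, c=1, k=8))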
Why Turing

The process behind the data quality

Calibrated difficulty bands

Headroom subsets, controlled low-pass RL bands, and dense high-difficulty tails, calibrated against frontier model performance so the data stays useful as models improve. Calibrated difficulty includes the following; a band-assignment sketch follows the list:

  • Headroom subsets with near-zero pass@8
  • Controlled low-pass RL training bands
  • Dense high-difficulty tail beyond public benchmarks
  • Frontier-model performance calibration
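As a sketch only, band assignment from measured pass@8 rates could look like the following; the thresholds and labels are illustrative assumptions, not the actual calibration values.

    # Illustrative band assignment from frontier-model pass@8 estimates;
    # thresholds and labels are assumptions, not actual calibration values.
    def assign_band(pass_at_8: float) -> str:
        if pass_at_8 == 0.0:
            return "headroom"      # near-zero pass@8 subset
        if pass_at_8 < 0.5:
            return "low_pass_rl"   # controlled band with usable RL signal
        return "saturated"         # too easy to carry training signal

    measured = {"task-001": 0.0, "task-002": 0.25, "task-003": 0.9}
    bands = {tid: assign_band(p) for tid, p in measured.items()}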

Consensus-driven validation

Independent SME review followed by multi-reviewer adjudication. Tasks that fail agreement thresholds are revised or removed, not rounded up to pass. Validation process includes:

  • Independent PhD SME review
  • Multi-reviewer adjudication per problem
  • Correctness, clarity, and ambiguity checks
  • Revision or removal for borderline cases
Evaluation-safe by construction

100% original, Google-proof problems. Structured JSON output for direct integration. Every answer is programmatically checkable with no subjective judgment required. Automatic evaluation supports the following checks; a minimal grader sketch follows the list:

  • Normalized exact match
  • Symbolic equivalence checks
  • Numerical tolerance thresholds
  • Unit-aware validation where applicable
  • Short-text matching
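A minimal grader sketch layering the checks above, assuming sympy's LaTeX parser for symbolic equivalence; the tolerance value is illustrative, and this is not the actual evaluation harness.

    import sympy
    from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

    # Minimal sketch of a layered checker, not the production harness.
    def grade(prediction: str, gold: str, rel_tol: float = 1e-6) -> bool:
        pred, ref = prediction.strip().lower(), gold.strip().lower()
        if pred == ref:  # normalized exact match, covers short-text answers
            return True
        try:             # numeric comparison within a relative tolerance
            return abs(float(pred) - float(ref)) <= rel_tol * max(1.0, abs(float(ref)))
        except ValueError:
            pass
        try:             # symbolic equivalence for LaTeX expressions
            return sympy.simplify(parse_latex(prediction) - parse_latex(gold)) == 0
        except Exception:
            return False

Unit-aware checks would slot in as one more layer, for example via a units library such as pint, before the symbolic comparison.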