MM-HLE++-STEM | PhD-Level Multimodal STEM Data Pack

Multimodal STEM HLE++: Model-Breaking Multimodal STEM Reasoning RL Data Pack

1,100 PhD-level multimodal STEM tasks calibrated to break frontier models. Validated on Opus 4.6 Extended Thinking. Available now.

Get the full data pack

Explore sample data on Hugging Face

1,100

Off-the-shelf (OTS) tasks available now

~20%

pass@1 on SOTA models

STEM domains covered

100%

Verifiable answers

About the data pack

The data pack frontier models still fail on

MMLU is saturated. HLE is approaching it. Multimodal STEM HLE++ is built for the gap that follows: multimodal, PhD-level tasks that current frontier models still genuinely struggle with.

At ~20% pass@1 on SOTA models, every problem sits in the optimal RL training regime: hard enough to expose reasoning failures, solvable enough to generate learnable reward signals.

Multimodal by design

Every task requires joint reasoning over images and text: diagrams, plots, equations, and scientific figures. No caption reliance. No retrieval shortcuts.

Calibrated difficulty

~20% pass@1 on SOTA models. On Opus 4.6 Extended Thinking, each task is solved in 1 to 4 of 8 sampled attempts (1 ≤ pass@8 ≤ 4). That places it in the productive RL regime: not saturated, not zero-signal.
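
As a rough illustration of the calibration band above, the sketch below keeps only tasks that a model solves between 1 and 4 times out of 8 sampled attempts, then reports the mean per-attempt solve rate of the retained set. The task IDs and counts are made-up placeholders, and `in_band`, `filter_calibrated`, and `mean_solve_rate` are hypothetical helpers, not part of the data pack.

```python
from typing import Dict

ATTEMPTS = 8  # samples per task, matching the pass@8 band in the text


def in_band(correct: int, low: int = 1, high: int = 4) -> bool:
    """Return True if the number of correct attempts falls in [low, high]."""
    return low <= correct <= high


def filter_calibrated(results: Dict[str, int]) -> Dict[str, int]:
    """Keep tasks whose correct-attempt count lies in the productive band."""
    return {task: c for task, c in results.items() if in_band(c)}


def mean_solve_rate(results: Dict[str, int], attempts: int = ATTEMPTS) -> float:
    """Average fraction of correct attempts per task (an empirical pass@1 estimate)."""
    if not results:
        return 0.0
    return sum(results.values()) / (attempts * len(results))


# Hypothetical correct-attempt counts out of 8 for four illustrative tasks.
example = {"task_a": 0, "task_b": 1, "task_c": 4, "task_d": 7}
kept = filter_calibrated(example)
print(sorted(kept))           # ['task_b', 'task_c']
print(mean_solve_rate(kept))  # 0.3125
```

Tasks with zero correct attempts give no reward signal and tasks solved most of the time are close to saturated, which is why both ends are dropped in this sketch.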

Verifiable answers

Every task has a deterministic ground-truth answer with step-by-step rationale. Supports exact match, symbolic equivalence, and numerical tolerance verification. Delivered in JSON and CSV.
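
A minimal sketch of how two of the verification modes named above, exact match and numerical tolerance, might be turned into a binary reward. The function names, the normalization rule, and the default tolerance are assumptions for illustration, not the data pack's actual grader. Symbolic equivalence is left out because it would need a computer algebra system.

```python
import math


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.strip().lower().split())


def exact_match(predicted: str, truth: str) -> bool:
    """Compare answers as normalized strings."""
    return normalize(predicted) == normalize(truth)


def numeric_match(predicted: str, truth: str, rel_tol: float = 1e-6) -> bool:
    """Compare answers as floats within a relative tolerance (assumed default)."""
    try:
        return math.isclose(float(predicted), float(truth), rel_tol=rel_tol)
    except ValueError:
        # Either side could not be parsed as a number.
        return False


def reward(predicted: str, truth: str, mode: str = "exact", rel_tol: float = 1e-6) -> float:
    """Return 1.0 for a verified answer and 0.0 otherwise."""
    if mode == "exact":
        return 1.0 if exact_match(predicted, truth) else 0.0
    if mode == "numeric":
        return 1.0 if numeric_match(predicted, truth, rel_tol) else 0.0
    raise ValueError(f"unknown verification mode: {mode}")


print(reward("  Benzene ", "benzene"))                            # 1.0
print(reward("3.14159", "3.1416", mode="numeric", rel_tol=1e-4))  # 1.0
print(reward("about pi", "3.1416", mode="numeric"))               # 0.0
```

A reward of this shape is deterministic for a given answer pair, which is the property that outcome-based RL pipelines rely on.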

PhD SME authorship

Every problem is created by a PhD or PhD-candidate domain specialist, multi-reviewer adjudicated, and required to pass on novelty, complexity, unambiguity, and ground truth accuracy before delivery.

Domain coverage

Six STEM domains, PhD-level throughout

Subdomains span quantum mechanics, organic chemistry, molecular biology, differential equations, machine learning, and more.

Mathematics

30%

Physics

30%

Chemistry

15%

Biology

15%

CS & Engineering

10%
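
For a sense of scale, the domain mix can be converted into approximate task counts. This sketch assumes the published percentages are exact rather than rounded and that they apply to the full 1,100-task pack, so the resulting numbers are estimates, not published figures.

```python
TOTAL_TASKS = 1100  # pack size stated on this page

# Domain shares in percent, as listed above.
DOMAIN_SHARE = {
    "Mathematics": 30,
    "Physics": 30,
    "Chemistry": 15,
    "Biology": 15,
    "CS & Engineering": 10,
}

# The shares should account for the whole pack.
assert sum(DOMAIN_SHARE.values()) == 100

# Integer arithmetic avoids floating-point rounding surprises.
counts = {domain: TOTAL_TASKS * share // 100 for domain, share in DOMAIN_SHARE.items()}

for domain, n in counts.items():
    print(f"{domain}: {n}")
# Mathematics: 330
# Physics: 330
# Chemistry: 165
# Biology: 165
# CS & Engineering: 110
```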

Intended uses

Built for the workflows that need it most

RL post-training (RLVR)
Outcome-supervised fine-tuning
Reward modeling
Frontier model benchmarking
Failure mode analysis
Post-HLE evaluation

Data packs

Frontier-calibrated data, ready to deploy

Multimodal STEM HLE++ Data Pack

A 1,100-task, PhD-level multimodal STEM data pack, proven to hill-climb the Humanity’s Last Exam benchmark and challenge SOTA models in multimodal scientific reasoning. Validated, delivered, and purchased by top frontier labs. Available now.

1,100-task data pack, immediate availability
Validated on Opus 4.6 Extended Thinking
~20% pass@1 on SOTA models: optimal RL training regime
30% Math, 30% Physics, 15% Chemistry, 15% Biology, 10% CS & Engineering
Every task includes image input: diagrams, plots, equations, scientific figures
Prompt → Answer (ground truth), JSON and CSV formats
50-task public sample on Hugging Face

HLE++ STEM Data Pack

10,000+ graduate-to-PhD text-based STEM tasks calibrated beyond HLE saturation. Validated on Opus 4.5 Extended and GPT-5.2 Thinking.

Learn more.

Why Turing

The process behind the data quality

Calibrated difficulty bands

Empirically validated against frontier model performance. Tasks stay in the productive learning regime on Opus 4.6 Extended Thinking.

Consensus-driven validation

PhD and PhD-candidate SMEs author each problem. Multi-reviewer adjudication scores novelty, complexity, unambiguity, verifiability, and ground truth accuracy. Tasks that miss any dimension are reworked or cut.

Evaluation-safe by construction

100% original, search-resistant problems. Programmatically verifiable answers. Structured JSON and CSV output for direct pipeline integration.

548 Market Street, PMB 18282, San Francisco, CA 94104

Request access to Multimodal STEM HLE++