Multimodal STEM HLE++: Model-Breaking Multimodal STEM Reasoning RL Data Pack

1,100 PhD-level multimodal STEM tasks calibrated to break frontier models. Validated on Opus 4.6 Extended Thinking. Available now.

1,100 off-the-shelf (OTS) tasks available now
~20% pass@1 on SOTA models
6 STEM domains covered
100% verifiable answers
About the data pack

The data pack frontier models still fail on

MMLU is saturated. HLE is approaching it. Multimodal STEM HLE++ is built for the gap that follows: multimodal, PhD-level tasks that current frontier models still genuinely struggle with.

At ~20% pass@1 on SOTA models, every problem sits in the optimal RL training regime: hard enough to expose reasoning failures, solvable enough to generate learnable reward signals.
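
To make the "learnable reward signal" point concrete, here is a back-of-the-envelope sketch. It assumes each sampled attempt succeeds independently with probability p, and that the RL method only learns from groups of attempts with mixed outcomes (some passes, some failures). Both are simplifying assumptions for illustration, not a description of how the pack was calibrated, and the function name is hypothetical.

```python
def mixed_outcome_probability(p: float, k: int) -> float:
    """Probability that k independent attempts contain both a pass and a failure.

    Assumption: every attempt passes independently with probability p.
    """
    all_fail = (1.0 - p) ** k  # no attempt passes: no positive reward to learn from
    all_pass = p ** k          # every attempt passes: the task is saturated
    return 1.0 - all_fail - all_pass


if __name__ == "__main__":
    # Compare an unsolvable task, a task at ~20% pass@1, and a near-saturated task.
    for p in (0.0, 0.2, 0.95):
        prob = mixed_outcome_probability(p, k=8)
        print(f"pass@1 = {p:.2f}: P(mixed outcomes in 8 attempts) = {prob:.3f}")
```

Under these assumptions, a task at 20% pass@1 produces mixed outcomes in roughly 83% of 8-attempt groups, while a task with 0% pass@1 never does.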

Multimodal by design
Every task requires joint reasoning over images and text: diagrams, plots, equations, and scientific figures. No caption reliance. No retrieval shortcuts.
Calibrated difficulty
~20% pass@1 on SOTA models. 1 ≤ pass@8 ≤ 4 on Opus 4.6 Extended Thinking. This keeps every task in the productive RL regime: not saturated, not zero-signal.
Verifiable answers
Every task has a deterministic ground-truth answer with step-by-step rationale. Supports exact match, symbolic equivalence, and numerical tolerance verification. Delivered in JSON and CSV.
PhD SME authorship
Every problem is created by a PhD or PhD-candidate domain specialist, multi-reviewer adjudicated, and required to pass on novelty, complexity, unambiguity, and ground truth accuracy before delivery.
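
The verification modes listed under "Verifiable answers" can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the pack's actual grader: the function names and the default tolerance are assumptions, and symbolic equivalence is left out because it would need a computer algebra system.

```python
import math


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.strip().lower().split())


def is_correct(predicted: str, ground_truth: str, rel_tol: float = 1e-6) -> bool:
    """Check an answer by exact match first, then by numerical tolerance.

    rel_tol is an illustrative default, not a value specified by the data pack.
    """
    # Mode 1: exact match after normalization.
    if normalize(predicted) == normalize(ground_truth):
        return True
    # Mode 2: numeric comparison, used only when both strings parse as floats.
    try:
        return math.isclose(float(predicted), float(ground_truth), rel_tol=rel_tol)
    except ValueError:
        return False


if __name__ == "__main__":
    print(is_correct("  Benzene ", "benzene"))
    print(is_correct("3.1400000001", "3.14"))
    print(is_correct("2.5", "3.14"))
```

A checker like this returns a binary outcome per attempt, which is the form of reward that RLVR-style training typically consumes.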
Domain coverage

Six STEM domains, PhD-level throughout

Subdomains span quantum mechanics, organic chemistry, molecular biology, differential equations, machine learning, and more.

Mathematics: 30%
Physics: 30%
Chemistry: 15%
Biology: 15%
CS & Engineering: 10%
Intended uses

Built for the workflows that need it most

  1. RL post-training (RLVR)
  2. Outcome-supervised fine-tuning
  3. Reward modeling
  4. Frontier model benchmarking
  5. Failure mode analysis
  6. Post-HLE evaluation
Data packs

Frontier-calibrated data, ready to deploy

Multimodal STEM HLE++ Data Pack
1,100 PhD-level multimodal STEM tasks, proven to hill-climb the Humanity’s Last Exam benchmark and challenge SOTA models in multimodal scientific reasoning. Validated, delivered, and purchased by top frontier labs. Available now.
  1. 1,100-task data pack, immediate availability
  2. Validated on Opus 4.6 Extended Thinking
  3. ~20% pass@1 on SOTA models: optimal RL training regime
  4. 30% Math, 30% Physics, 15% Chemistry, 15% Biology, 10% CS & Engineering
  5. Every task includes image input: diagrams, plots, equations, scientific figures
  6. Prompt → Answer (ground truth), JSON and CSV formats
  7. 50-task public sample on Hugging Face
HLE++ STEM Data Pack
10,000+ graduate-to-PhD text-based STEM tasks calibrated beyond HLE saturation. Validated on Opus 4.5 Extended and GPT-5.2 Thinking.
Why Turing

The process behind the data quality

Calibrated difficulty bands

Empirically validated against frontier model performance. Tasks stay in the productive learning regime on Opus 4.6 Extended Thinking.

Consensus-driven validation

PhD and PhD-candidate SMEs author each problem. Multi-reviewer adjudication scores novelty, complexity, unambiguity, verifiability, and ground truth accuracy. Tasks that miss any dimension are reworked or cut.

Evaluation-safe by construction

100% original, search-resistant problems. Programmatically verifiable answers. Structured JSON and CSV output for direct pipeline integration.
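
As a rough picture of what "direct pipeline integration" might look like for the JSON delivery, the sketch below loads a list of task records and checks that each has a prompt and a ground-truth answer. The record layout is an assumption made for illustration: the field names (`id`, `domain`, `image`, `prompt`, `answer`) and the placeholder values are hypothetical, and the actual schema may differ.

```python
import json

# Hypothetical record layout; field names and values are placeholders, not the real schema.
SAMPLE = """[
  {"id": "task-001", "domain": "Physics", "image": "task-001.png",
   "prompt": "...", "answer": "..."}
]"""

REQUIRED_FIELDS = {"prompt", "answer"}  # the page describes tasks as Prompt -> Answer


def load_tasks(raw_json: str) -> list[dict]:
    """Parse a JSON array of task records and reject records missing required fields."""
    tasks = json.loads(raw_json)
    for task in tasks:
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            raise ValueError(
                f"task {task.get('id', '?')} is missing fields: {sorted(missing)}"
            )
    return tasks


if __name__ == "__main__":
    for task in load_tasks(SAMPLE):
        print(task["id"], task["domain"], task["image"])
```

Validating records at load time keeps malformed tasks from silently producing zero-reward rollouts later in a training run.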
