
Multimodal STEM HLE++: Model-Breaking Multimodal STEM Reasoning RL Data Pack
1,100 PhD-level multimodal STEM tasks calibrated to break frontier models. Validated on Opus 4.6 Extended Thinking. Available now.
The data pack frontier models still fail on
MMLU is saturated. HLE is approaching saturation. Multimodal STEM HLE++ is built for the gap that follows: multimodal, PhD-level tasks that current frontier models still struggle with.
At ~20% pass@1 on SOTA models, every problem sits in the optimal RL training regime: hard enough to expose reasoning failures, solvable enough to generate learnable reward signals.
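For readers less familiar with the metric, here is a minimal sketch of how a pass@1 figure is commonly estimated: sample several answers per task, take the fraction that are correct, and average across tasks. The task IDs and sample outcomes below are made up for illustration and are not drawn from the data pack.

```python
from statistics import mean


def pass_at_1(samples: list[bool]) -> float:
    """Fraction of sampled answers that were correct for a single task."""
    if not samples:
        raise ValueError("need at least one sample per task")
    return sum(samples) / len(samples)


# Hypothetical grading results: task id -> correctness of each sampled answer.
results = {
    "task_a": [True, False, False, False, False],
    "task_b": [False, False, False, False, False],
    "task_c": [True, True, False, False, False],
}

per_task = {task_id: pass_at_1(s) for task_id, s in results.items()}
dataset_pass_at_1 = mean(per_task.values())

print(per_task)
print(f"dataset pass@1 = {dataset_pass_at_1:.2f}")
```

Per-task rates are worth keeping alongside the aggregate, since a task that is never solved contributes no positive reward during RL.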
Six STEM domains, PhD-level throughout
Subdomains span quantum mechanics, organic chemistry, molecular biology, differential equations, machine learning, and more.

Built for the workflows that need it most
- RL post-training (RLVR)
- Outcome-supervised fine-tuning
- Reward modeling
- Frontier model benchmarking
- Failure mode analysis
- Post-HLE evaluation
Frontier-calibrated data, ready to deploy
- 1,100-task data pack, immediate availability
- Validated on Opus 4.6 Extended Thinking
- ~20% pass@1 on SOTA models: optimal RL training regime
- 30% Math, 30% Physics, 15% Chemistry, 15% Biology, 10% CS & Engineering
- Every task includes image input: diagrams, plots, equations, scientific figures
- Prompt → Answer (ground truth), JSON and CSV formats
- 50-task public sample on Hugging Face
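As a quick worked example, the domain mix above translates into approximate per-domain task counts as follows. This sketch assumes the stated percentages are exact and apply to the full 1,100-task pack; the actual counts may differ slightly.

```python
TOTAL_TASKS = 1100

# Domain shares in percent, as listed above.
DOMAIN_SHARE_PCT = {
    "Math": 30,
    "Physics": 30,
    "Chemistry": 15,
    "Biology": 15,
    "CS & Engineering": 10,
}

# Integer arithmetic avoids floating-point rounding in the counts.
domain_counts = {
    domain: TOTAL_TASKS * pct // 100 for domain, pct in DOMAIN_SHARE_PCT.items()
}

for domain, count in domain_counts.items():
    print(f"{domain}: {count} tasks")
```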
The process behind the data quality
Empirically validated against frontier model performance. Tasks stay in the productive learning regime on Opus 4.6 Extended Thinking.
PhD and PhD-candidate subject-matter experts (SMEs) author each problem. Multi-reviewer adjudication scores novelty, complexity, unambiguity, verifiability, and ground truth accuracy. Tasks that miss any dimension are reworked or cut.
100% original, search-resistant problems. Programmatically verifiable answers. Structured JSON and CSV output for direct pipeline integration.
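To illustrate what a programmatically verifiable answer can look like in an RL pipeline, here is a minimal sketch of an outcome reward function. The record fields (`prompt`, `image`, `answer`) and the matching rules are assumptions made for this example; the pack's actual schema and grading logic are not specified here.

```python
import json
import math


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.strip().lower().split())


def reward(model_answer: str, ground_truth: str, rel_tol: float = 1e-6) -> float:
    """Return 1.0 for a match, else 0.0.

    Numeric answers are compared with a relative tolerance (an assumed choice);
    everything else falls back to a normalized exact string match.
    """
    try:
        matched = math.isclose(float(model_answer), float(ground_truth), rel_tol=rel_tol)
    except ValueError:
        matched = normalize(model_answer) == normalize(ground_truth)
    return 1.0 if matched else 0.0


# Hypothetical record; the field names are illustrative, not the pack's schema.
record = json.loads(
    '{"prompt": "Read the value from the plot.", "image": "task_0001.png", "answer": "3.14"}'
)

print(reward("3.140", record["answer"]))
print(reward("2.71", record["answer"]))
```

A binary reward like this suits answers with one canonical form. Answers with units, symbolic expressions, or multiple parts would need a more careful comparison.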




