HLE++: Model-Breaking STEM Datasets For Frontier Reasoning

Graduate-to-PhD headroom sets engineered to preserve measurable pass@k separation after HLE (Humanity’s Last Exam) saturation.

5,000 off-the-shelf prompts validated on Opus 4.5 Extended and GPT-5.2 Thinking. Available in 24–48 hours.

Engineered Headroom Beyond HLE

HLE++ preserves separation by engineering calibrated difficulty bands beyond baseline HLE.

Each Problem Is

A graduate-to-PhD, multi-step STEM reasoning task
Deterministic, with a single-answer format
Structurally reviewed with SME consensus validation
100% original and search-resistant
Calibrated into pass@8 = 0 headroom sets for SFT and into low positive pass bands for RL
CASE STUDY

Benchmarking frontier models with 5,000+ HLE++ STEM problems

Turing partnered with a frontier AI lab to deploy a large-scale calibrated STEM dataset, designed to test deep scientific and mathematical reasoning under strict structural constraints.

5,000+ graduate-to-PhD-level
problems curated for frontier model benchmarking
100% Acceptance Rate
with all problems meeting the client's quality, correctness, and SOTA model-breaking standards
40+ STEM subdomains
covered, including quantum mechanics, organic and physical chemistry, genetics & genomics, algebra, and more

Why Turing

These datasets are calibrated for measurable difficulty, including low pass@k performance on strong systems.

Calibrated difficulty bands
  1. Headroom subsets (~0 pass@8)
  2. Controlled low-pass RL bands
  3. Dense high-difficulty tail beyond public benchmarks
  4. Frontier-model performance calibration
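The bands above are defined in terms of pass@k. As a point of reference, a minimal sketch of the standard unbiased pass@k estimator (probability that at least one of k samples, drawn from n attempts of which c were correct, passes) — the attempt counts shown are illustrative, not taken from the HLE++ calibration runs:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k).
    n = total attempts, c = correct attempts, k = samples drawn."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-sample draw
        # must include at least one correct attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Headroom-subset problem: 0 correct out of 8 attempts -> pass@8 = 0
print(pass_at_k(8, 0, 8))      # 0.0

# Low-pass RL-band problem: 1 correct out of 16 attempts
print(pass_at_k(16, 1, 8))     # 0.5
```

A headroom set targets pass@8 = 0 across the reference models, while RL bands keep this value small but nonzero so a reward signal still exists.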
Consensus-driven validation
  1. Independent SME review
  2. Multi-reviewer adjudication
  3. Tasks failing agreement thresholds are revised or removed
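The adjudication step can be sketched as a simple agreement filter. The threshold value and function name here are hypothetical, chosen only to illustrate the revise-or-remove rule:

```python
def meets_consensus(votes: list[str], threshold: float = 2 / 3) -> bool:
    """Hypothetical agreement check: the most common answer among
    independent SME reviews must reach the threshold fraction of votes;
    otherwise the task is flagged for revision or removal."""
    top_count = max(votes.count(v) for v in set(votes))
    return top_count / len(votes) >= threshold

print(meets_consensus(["A", "A", "A"]))       # unanimous -> keep
print(meets_consensus(["A", "B", "C"]))       # no agreement -> revise/remove
```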
Evaluation-safe by construction
  1. 100% original, Google-proof problem design
  2. Runtime verification for scientific coding tasks
  3. Structured JSON format for direct integration into evaluation pipelines
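For the JSON delivery format, a plausible record shape is sketched below — the field names and values are hypothetical placeholders, not the published HLE++ schema:

```python
import json

# Hypothetical record shape for one problem; the actual HLE++
# schema and field names are defined per client engagement.
record = {
    "id": "hlepp-qm-00421",
    "subdomain": "quantum_mechanics",
    "prompt": "A spin-1/2 particle ...",
    "answer": "3/8",                # deterministic, single-answer format
    "answer_type": "exact_value",
    "difficulty_band": "headroom",  # e.g. pass@8 = 0 subset
}

line = json.dumps(record)           # one record per JSONL line
print(json.loads(line)["difficulty_band"])
```

One record per line (JSONL) lets an evaluation harness stream problems directly without a custom loader.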