HLE++: Model-Breaking STEM Datasets For Frontier Reasoning

Graduate-to-PhD headroom sets engineered to preserve measurable pass@k separation after HLE (Humanity’s Last Exam) saturation.

5,000 off-the-shelf prompts validated on Opus 4.5 Extended and GPT-5.2 Thinking. Available in 24–48 hours.

Engineered Headroom Beyond HLE

HLE++ preserves separation by engineering calibrated difficulty bands beyond baseline HLE.

Each Problem Is

A graduate-to-PhD, multi-step STEM reasoning task
Deterministic, with a single-answer format
Structurally reviewed with SME consensus validation
100% original and search-resistant
Calibrated into pass@8 = 0 headroom sets for SFT and into low positive pass bands for RL
CASE STUDY

Benchmarking frontier models with 5,000+ HLE++ STEM problems

Turing partnered with a frontier AI lab to deploy a large-scale calibrated STEM dataset, designed to test deep scientific and mathematical reasoning under strict structural constraints.

5,000+ graduate-to-PhD-level
problems curated for frontier model benchmarking
100% Acceptance Rate
with all problems meeting the client's quality, correctness, and SOTA model-breaking standards
40+ STEM subdomains
covered, including quantum mechanics, organic and physical chemistry, genetics & genomics, algebra, and more

Why Turing

These datasets are calibrated for measurable difficulty, including low pass@k performance on strong systems.

Calibrated difficulty bands
  1. Headroom subsets (~0 pass@8)
  2. Controlled low-pass RL bands
  3. Dense high-difficulty tail beyond public benchmarks
  4. Frontier-model performance calibration
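The bands above are defined in terms of pass@k. As a point of reference, a minimal sketch of the standard unbiased pass@k estimator (probability that at least one of k samples, drawn from n attempts of which c were correct, passes) — the attempt counts shown are illustrative, not taken from the HLE++ calibration runs:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k).
    n = total attempts, c = correct attempts, k = samples drawn."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-sample draw
        # must include at least one correct attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Headroom-subset problem: 0 correct out of 8 attempts -> pass@8 = 0
print(pass_at_k(8, 0, 8))      # 0.0

# Low-pass RL-band problem: 1 correct out of 16 attempts
print(pass_at_k(16, 1, 8))     # 0.5
```

A headroom set targets pass@8 = 0 across the reference models, while RL bands keep this value small but nonzero so a reward signal still exists.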
Consensus-driven validation
  1. Independent SME review
  2. Multi-reviewer adjudication
  3. Tasks failing agreement thresholds are revised or removed
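The adjudication step can be sketched as a simple agreement filter. The threshold value and function name here are hypothetical, chosen only to illustrate the revise-or-remove rule:

```python
def meets_consensus(votes: list[str], threshold: float = 2 / 3) -> bool:
    """Hypothetical agreement check: the most common answer among
    independent SME reviews must reach the threshold fraction of votes;
    otherwise the task is flagged for revision or removal."""
    top_count = max(votes.count(v) for v in set(votes))
    return top_count / len(votes) >= threshold

print(meets_consensus(["A", "A", "A"]))       # unanimous -> keep
print(meets_consensus(["A", "B", "C"]))       # no agreement -> revise/remove
```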
Evaluation-safe by construction
  1. 100% original, Google-proof problem design
  2. Runtime verification for scientific coding tasks
  3. Structured JSON format for direct integration into evaluation pipelines
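For the JSON delivery format, a plausible record shape is sketched below — the field names and values are hypothetical placeholders, not the published HLE++ schema:

```python
import json

# Hypothetical record shape for one problem; the actual HLE++
# schema and field names are defined per client engagement.
record = {
    "id": "hlepp-qm-00421",
    "subdomain": "quantum_mechanics",
    "prompt": "A spin-1/2 particle ...",
    "answer": "3/8",                # deterministic, single-answer format
    "answer_type": "exact_value",
    "difficulty_band": "headroom",  # e.g. pass@8 = 0 subset
}

line = json.dumps(record)           # one record per JSONL line
print(json.loads(line)["difficulty_band"])
```

One record per line (JSONL) lets an evaluation harness stream problems directly without a custom loader.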