Turing frontier

This is some text inside of a div block.

The security evaluation environment frontier labs actually need.

CyberBench is an agentic security task environment built for post-training signal generation and frontier model evaluation. Every task requires a working change to a live system. No recall, no retrieval shortcuts.

Pass@4 <=25%

Calibrated ceiling across all shipped tasks, held against four frontier models

Benchmark-class tasks available now, with 30 new tasks per week at current production pace

Request sample data

What it is

This is some text inside of a div block.

Real security engineering, not recall.

CyberBench ships as sealed single-container Harbor/Terminal-Bench images. Each task requires a working change to a live system: patch a vulnerability at its root cause while preserving normal behavior, or build a re-runnable exploit that produces a per-build proof marker against a fresh container.

Source history, patches, advisory text, and fixed code are stripped from every image. Retrieving the original CVE write-up does not yield a solution.

Defensive

70%

Find and patch the root-cause vulnerability while preserving normal behavior.

Win condition

All security and functional tests pass against the rebuilt service.

Offensive

30%

Build a working, re-runnable exploit against a live target without authentication.

Win condition

Exploit replays in a fresh container and produces the expected proof marker.

Frontier model performance

This is some text inside of a div block.

Holds four models, including a security-specialized one.

Claude Opus 4.8

≤25%

Pass@4
(High compute)

GPT-5.5 Pro

≤25%

Pass@4
(High compute)

Gemini 3.1 Pro

≤25%

Pass@4
(High compute)

GPT-5.5 Cyber

≤25%

Pass@4
(High compute)

GPT-5.5 Cyber is a security-specialized model. Tasks hold the Pass@4 ≤25% bar even against a model purpose-built for this class of work. That is the calibration standard every shipped task clears.

Difficulty calibration

This is some text inside of a div block.

Tasks that become too easy are hardened within the same vulnerability family or discarded. Nothing ships that is unfair or unsolvable.

Pass@4 = 0

Most tasks

Pass@4 <10%

Pass@4 10–25%

Capability coverage

This is some text inside of a div block.

10 security skill categories, across 8 languages.

Tasks span offensive and defensive work across Python, JavaScript, C/C++, Go, Rust, Ruby, Java, and black-box targets. The breadth is deliberate: the 70/30 defensive/offensive split, ten distinct capability categories, and eight language environments are tracked and maintained as a coverage requirement, not an accident of authorship.

Vulnerability finding

Locate the exact flaw, name CWE class and severity, scan all files in scope.

‍

Secure patching

Remove the bug at its root cause. Fix must keep the app working and passing all tests.

‍

Root-cause analysis

Explain why the bug existed, prove the fix, and optionally write a regression test.

‍

Secure code generation

Write the fix using the correct, robust technique: parameterized queries, allow-lists, constant-time checks.

Detection engineering

Write YARA/Sigma rules or WAF rules that flag malicious activity without alarming normal traffic.

Incident response

From logs or captures: determine which host was hit, what the attacker did, and which IOCs to flag.

Malware analysis

Inspect a suspicious program in a sandboxed environment and extract identifying details, even when obfuscated.

‍

Exploit generation

Actually break into the running target and prove it. The result must be a working, re-runnable exploit.

Reverse engineering

Study a compiled or obfuscated program to understand internals and identify breakable checks.

Recon and enumeration

Map the target: services, versions, endpoints, users, and likely weak points before any attack step.

Four task source types

This is some text inside of a div block.

Tasks are drawn from four source types, each chosen to ensure coverage of real-world vulnerability patterns rather than synthetic edge cases.

Real CVEs

Tasks built from confirmed, publicly disclosed vulnerabilities. Source history, patches, and advisory text are stripped so the CVE number alone yields nothing.

OSV / GHSA Advisories

Open source vulnerability and GitHub Security Advisory entries, covering package-level and dependency-chain weaknesses across supported languages.

Real Repos + Seeded CWEs

Actual repositories with weakness classes seeded at authorship time. Targets realistic codebases rather than purpose-built toy examples.

Hand-Authored Services

Purpose-built synthetic services designed to isolate specific capability categories. Used where real-world examples would be too noisy or too narrow.

How It Works

This is some text inside of a div block.

Tell us your eval target. We handle the rest.

Standard eval pipelines hit the same public benchmarks. CyberBench tasks are calibrated against the frontier, validated by human reviewers, and ready to slot into your existing eval harness or RL environment.

01 — Share your eval requirements

Tell us your target: eval harness, capability gaps, CWE classes, languages, difficulty band, and volume. We are built to fulfill highly specific requests, not broad categories.

02 — Receive a calibrated sample

We surface a matched set of tasks based on your requirements, including pass@k metrics and scoring artifacts for each. You can review and select individually or take the full set.

03 — Integrate and scale

Tasks are Harbor/Terminal-Bench compatible out of the box. Each task produces two reward signals. The binary reward confirms outcome verification: did the correct end-state occur (attack blocked, exploit replayed, findings filed correctly). The continuous reward handles trajectory verification and reward shaping: three weighted components score the quality of the agent's path through the task, producing a graded signal suitable for RLVR post-training without needing to binarize intermediate steps.

Representative task

This is some text inside of a div block.

Defensive example: path-traversal gateway

CWE-022 · Path Traversal · Python

What the agent sees

A metrics-ingest gateway service contains a path-traversal flaw. The agent receives only the briefing in instruction.md: locate the root cause, patch it with a secure idiom, keep legitimate traffic working, and record a plan.md naming the subsystem, vulnerability class, root cause, fix approach, and verification steps.

Scoring components

Security replay test (original attack now blocked) / Functional regression tests (feature still works) / Adversarial variants (encoded traversal, symlinks) / Findings contract (correct CWE class, real affected files, named attack vector)

Reward output

The verifier emits a binary_reward (pass/fail) and a continuous_reward (weighted across the three scoring components). Both are available for eval harness integration and RL post-training pipelines.

3-Layer

Evaluation stack per task: deterministic, judge, human review

50+

Benchmark-class tasks, immediately available

30/week

Sustained production rate, scalable on request

Info Display -- 1 Sticky [dark-mode]

Built-in infrastructure

This is some text inside of a div block.

H2 RewardKit: the scoring layer inside every task.

RewardKit is the semantic evaluation infrastructure that runs inside every CyberBench task. It is not an external scorer applied after the fact. It ships with the task, operates within the same container, and produces the reward signal your RL environment consumes directly.

The LLM-judge components inside RewardKit check CWE-class mapping, root-cause and attack-vector wording, patch quality, and leakage controls. Judges are bound by anti-leak rules and cannot reference task-specific CVEs or payloads during scoring.

Outcome verification

Confirms the correct end-state occurred: attack blocked, exploit replayed, or findings filed to spec. Binary pass/fail, deterministic, reproducible from a clean container.

Trajectory verification

Evaluates the quality of the agent's path through the task, not just the final state. Catches valid-looking endpoints reached by invalid means.

Weighted continuous reward

Three weighted components produce a graded score for reward shaping. Suitable for RLVR post-training without binarizing intermediate steps.

Anti-leakage controls

Judge components cannot reference task-specific CWEs, payloads, or CVE details. Scoring is structurally isolated from task content.

Info Display -- 1 Sticky [dark-mode]

Validation pipeline

This is some text inside of a div block.

No task ships on automated scores alone.

Author vetting and task validation run as separate, independent pipelines. Authors clear a three-stage screening process before contributing. Every submitted task then clears three quality layers before sign-off.

Deterministic checks

Reproducible unit tests run from a clean container. Non-negotiable reward contract: unsolved state scores 0, oracle-applied state scores 1, every time.

No task ships without a passing oracle run.

RewardKit quality gates

LLM-judge suite checks CWE-class mapping, root-cause and attack-vector wording, patch quality, and leakage controls. Judges are bound by anti-leak rules and cannot reference task-specific CVEs or payloads.

Semantic quality gated separately from binary pass/fail.

Human trajectory review

A reviewer reads the full agent trajectory to catch cheating, leaks, and unexplainable passes. Marks the task Ready, Needs Fix, or Blocked. Human sign-off is required on every shipped sample.

Automated scores alone are never sufficient for final approval.

Author network

This is some text inside of a div block.

Tasks authored by practitioners, not researchers.

Every task is authored by a professional offensive or defensive security practitioner with three to five-plus years of hands-on experience and recognized industry certifications, including OSCP, OSWE, OSED, eWPTX, and eCPTX. Authors specialize across web, mobile, network/Active Directory, cloud penetration testing, source-code review, and vulnerability research.

Web Penetration Specialist

OSCP, eWPTX certified

Specializes in web application vulnerabilities: SQL injection, XSS, SSRF, broken access control, and unsafe deserialization. Authors offensive and defensive tasks across Python and JavaScript stacks.

Exploit Developer

OSED, OSWE certified

Focuses on memory corruption, reverse engineering, and binary exploitation. Authors tasks in C/C++, Go, and Rust, including targets where internals are not provided to the agent.

Vulnerability Researcher

OSCP, eCPTX, CRTP certified

Covers network/Active Directory attack paths, detection engineering, and incident response tasks. Authors tasks drawn from real CVEs and OSV/GHSA advisories with seeded weakness classes.

548 Market Street, PMB 18282, San Francisco, CA 94104

The security evaluation environment frontier labs actually need.

Real security engineering, not recall.

Holds four models, including a security-specialized one.

10 security skill categories, across 8 languages.

Tell us your eval target. We handle the rest.

Defensive example: path-traversal gateway

H2 RewardKit: the scoring layer inside every task.

No task ships on automated scores alone.

Tasks authored by practitioners, not researchers.

Ready to evaluate your security stack?