
The security evaluation environment frontier labs actually need.
CyberBench is an agentic security task environment built for post-training signal generation and frontier model evaluation. Every task requires a working change to a live system. No recall, no retrieval shortcuts.
Real security engineering, not recall.
CyberBench ships as sealed single-container Harbor/Terminal-Bench images. Each task requires a working change to a live system: patch a vulnerability at its root cause while preserving normal behavior, or build a re-runnable exploit that produces a per-build proof marker against a fresh container.
Source history, patches, advisory text, and fixed code are stripped from every image. Retrieving the original CVE write-up does not yield a solution.
Holds four models, including a security-specialized one.
10 security skill categories, across 8 languages.
Tasks span offensive and defensive work across Python, JavaScript, C/C++, Go, Rust, Ruby, Java, and black-box targets. The breadth is deliberate: the 70/30 defensive/offensive split, ten distinct capability categories, and eight language environments are tracked and maintained as a coverage requirement, not an accident of authorship.
- Locate the exact flaw, name CWE class and severity, scan all files in scope.
- Remove the bug at its root cause. Fix must keep the app working and passing all tests.
- Explain why the bug existed, prove the fix, and optionally write a regression test.
- Write the fix using the correct, robust technique: parameterized queries, allow-lists, constant-time checks.
- Write YARA/Sigma rules or WAF rules that flag malicious activity without alarming normal traffic.
- From logs or captures: determine which host was hit, what the attacker did, and which IOCs to flag.
- Inspect a suspicious program in a sandboxed environment and extract identifying details, even when obfuscated.
- Actually break into the running target and prove it. The result must be a working, re-runnable exploit.
- Study a compiled or obfuscated program to understand internals and identify breakable checks.
- Map the target: services, versions, endpoints, users, and likely weak points before any attack step.
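To make the secure-coding category above concrete, here is a minimal Python sketch of the three techniques it names: parameterized queries, allow-lists, and constant-time checks. The table, column names, and function names are hypothetical and chosen for illustration; they are not taken from any CyberBench task.

```python
import hmac
import sqlite3

# Hypothetical allow-list of sortable columns (illustrative only).
ALLOWED_SORT_COLUMNS = {"name", "created_at"}


def find_user(conn: sqlite3.Connection, username: str):
    """Parameterized query: the value is bound by the driver, never parsed as SQL."""
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
    return cur.fetchone()


def safe_sort_column(requested: str) -> str:
    """Allow-list: accept only known column names and reject everything else."""
    if requested not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {requested!r}")
    return requested


def token_matches(supplied: str, expected: str) -> bool:
    """Constant-time check: timing does not reveal where the two values differ."""
    return hmac.compare_digest(supplied.encode(), expected.encode())
```

A classic injection string such as `alice' OR '1'='1` passed to `find_user` is treated as a literal user name and matches no row, which is the behavior a root-cause fix should show.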
Tasks are drawn from four source types, each chosen to ensure coverage of real-world vulnerability patterns rather than synthetic edge cases.
Tell us your eval target. We handle the rest.
Standard eval pipelines hit the same public benchmarks. CyberBench tasks are calibrated against the frontier, validated by human reviewers, and ready to slot into your existing eval harness or RL environment.
Defensive example: path-traversal gateway
CWE-22 · Path Traversal · Python
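As a rough illustration of what a root-cause fix for this weakness class can look like in Python, the sketch below resolves the requested path and rejects anything that lands outside the served directory. The base directory and function name are hypothetical; this is not the actual task or its reference patch.

```python
from pathlib import Path

# Hypothetical document root, for illustration only.
BASE_DIR = Path("/srv/gateway/files")


def resolve_safe(requested: str, base_dir: Path = BASE_DIR) -> Path:
    """Return the resolved path for `requested`, or raise if it escapes base_dir."""
    base = base_dir.resolve()
    candidate = (base / requested).resolve()
    # is_relative_to (Python 3.9+) compares path components, not string prefixes,
    # so "/srv/gateway/files_evil" is not accepted as a child of "/srv/gateway/files".
    if not candidate.is_relative_to(base):
        raise PermissionError(f"path escapes base directory: {requested!r}")
    return candidate
```

Checking the resolved path, rather than filtering `..` out of the input string, also covers absolute paths and nested `../` sequences, while ordinary relative requests keep working.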
RewardKit: the scoring layer inside every task.
RewardKit is the semantic evaluation infrastructure that runs inside every CyberBench task. It is not an external scorer applied after the fact. It ships with the task, operates within the same container, and produces the reward signal your RL environment consumes directly.
The LLM-judge components inside RewardKit check CWE-class mapping, root-cause and attack-vector wording, patch quality, and leakage controls. Judges are bound by anti-leak rules and cannot reference task-specific CVEs or payloads during scoring.
No task ships on automated scores alone.
Author vetting and task validation run as separate, independent pipelines. Authors clear a three-stage screening process before contributing. Every submitted task then clears three quality layers before sign-off.
- Reproducible unit tests run from a clean container. Non-negotiable reward contract: unsolved state scores 0, oracle-applied state scores 1, every time.
No task ships without a passing oracle run.
- LLM-judge suite checks CWE-class mapping, root-cause and attack-vector wording, patch quality, and leakage controls. Judges are bound by anti-leak rules and cannot reference task-specific CVEs or payloads.
Semantic quality gated separately from binary pass/fail.
- A reviewer reads the full agent trajectory to catch cheating, leaks, and unexplainable passes, then marks the task Ready, Needs Fix, or Blocked. Human sign-off is required on every shipped sample.
Automated scores alone are never sufficient for final approval.
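The reward contract in the first layer (unsolved state scores 0, oracle-applied state scores 1, every time) can be pictured as a small check like the one below. This is a sketch under stated assumptions: `check_reward_contract`, `run_unsolved`, and `run_oracle` are hypothetical names, and each callable is assumed to score the task from a clean container. It does not describe RewardKit's actual interface.

```python
from typing import Callable


def check_reward_contract(
    run_unsolved: Callable[[], float],
    run_oracle: Callable[[], float],
    repeats: int = 3,
) -> bool:
    """Return True only if every repeat scores 0 unsolved and 1 with the oracle applied."""
    for _ in range(repeats):
        # Hypothetical: each call is assumed to start from a fresh, clean container.
        if run_unsolved() != 0:
            return False
        if run_oracle() != 1:
            return False
    return True
```

Repeating the run is what gives "every time" its meaning: a task whose tests pass only intermittently fails the check even if one individual run looks correct.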
Tasks authored by practitioners, not researchers.
Every task is authored by a professional offensive or defensive security practitioner with three to five or more years of hands-on experience and recognized industry certifications, including OSCP, OSWE, OSED, eWPTX, and eCPTX. Authors specialize in web, mobile, network/Active Directory, and cloud penetration testing, as well as source-code review and vulnerability research.
Web Penetration Specialist
OSCP, eWPTX certified
Specializes in web application vulnerabilities: SQL injection, XSS, SSRF, broken access control, and unsafe deserialization. Authors offensive and defensive tasks across Python and JavaScript stacks.
Exploit Developer
OSED, OSWE certified
Focuses on memory corruption, reverse engineering, and binary exploitation. Authors tasks in C/C++, Go, and Rust, including targets where internals are not provided to the agent.
Vulnerability Researcher
OSCP, eCPTX, CRTP certified
Covers network/Active Directory attack paths, detection engineering, and incident response tasks. Authors tasks drawn from real CVEs and OSV/GHSA advisories with seeded weakness classes.







