Access the Turing Applied AGI Benchmark for VLM 1.0 Report
Turing’s new benchmark evaluates how frontier vision-language models perform on realistic, high-complexity tasks in business and STEM domains—using multimodal prompts and free-form model outputs.

Download the full technical report to explore model-level results, error types, evaluation methodology, and detailed performance breakdowns.

What You’ll Get

  1. The full technical report (PDF)
  2. Overall accuracy (average score) and 95% confidence intervals for Gemini 2.5, GPT-o1, Claude 3.7, and more
  3. Evaluation pipeline details using LLM-as-a-judge scoring
  4. Dataset structure and domain taxonomy across business and STEM fields
  5. Capability breakdowns and failure case analysis
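To illustrate the kind of accuracy statistic the report provides, here is a minimal sketch of computing an average score with a 95% confidence interval using a normal approximation. The scores and function name are illustrative, not the benchmark's actual code.

```python
import math
import statistics

def mean_and_ci95(scores):
    """Mean score with a normal-approximation 95% confidence interval."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    half_width = 1.96 * sem  # z-value covering 95% of a normal distribution
    return mean, (mean - half_width, mean + half_width)

# Hypothetical per-task scores (1.0 = correct, 0.0 = incorrect)
scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
mean, (lo, hi) = mean_and_ci95(scores)
print(f"accuracy = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

A wider interval signals fewer tasks or noisier scoring, which is why the report pairs each model's average with its interval rather than a single point estimate.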

Who It’s For

This benchmark is designed for researchers, model developers, and technical teams evaluating VLMs for real-world applications.
Whether you're comparing models or building your own, this report shows how models handle spatial reasoning, numeric analysis, logic, and real-world decision-making.

→ Preview: Inside the Turing Applied AGI Benchmark for VLM 1.0

We designed this benchmark to mirror how professionals actually think and solve problems—not how academic datasets quiz models.

Mahesh Joshi, Head of Research, Turing

What We Tested

  1. 9 core capabilities including perception, logic, spatial, and numerical reasoning
  2. 4 top-tier VLMs scored using open-ended generation tasks
  3. Visual inputs including diagrams, charts, technical illustrations
  4. Scoring by LLM-as-a-judge with 5 generations per model per prompt
  5. Subset design: ALL vs HARD tasks
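The scoring protocol above (an LLM judge grading 5 generations per model per prompt) can be sketched as follows. The `judge_score` function here is a hypothetical stand-in for a real judge-model call; the substring check is only a placeholder so the example runs.

```python
from statistics import fmean

def judge_score(prompt, answer, reference):
    """Hypothetical stand-in for an LLM judge call, returning a score in [0, 1].
    A real pipeline would send the prompt, the model's answer, and the
    reference to a judge model and parse its verdict."""
    return 1.0 if reference.lower() in answer.lower() else 0.0

def score_model_on_prompt(prompt, generations, reference):
    """Average the judge's scores over the k generations sampled for one prompt."""
    return fmean(judge_score(prompt, g, reference) for g in generations)

# k = 5 generations for a single prompt, mirroring the benchmark's protocol
gens = ["The answer is 42.", "It is 42", "Probably 41", "42", "unsure"]
print(score_model_on_prompt("What is 6 * 7?", gens, "42"))
```

Averaging over multiple generations per prompt reduces the variance that sampling temperature introduces into any single model response.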

What We Found

  1. The top model scored just 56.8% across all tasks
  2. All models scored below 7% on the HARD subset
  3. Perception and spatial reasoning were the lowest-performing capabilities
  4. Capability-level breakdowns reveal strengths and blind spots across models
Download the Benchmark Report