Access the Turing Applied AGI Benchmark for VLM 1.0 Report
Turing’s new benchmark evaluates how frontier vision-language models perform on realistic, high-complexity tasks in business and STEM domains—using multimodal prompts and free-form model outputs.
Download the full technical report to explore model-level results, error types, evaluation methodology, and detailed performance breakdowns.
What You’ll Get
- The full technical report (PDF)
- Overall accuracy (average score) and 95% confidence intervals for Gemini 2.5, OpenAI o1, Claude 3.7, and more (see the confidence-interval sketch after this list)
- Evaluation pipeline details using LLM-as-a-judge scoring
- Dataset structure and domain taxonomy across business and STEM fields
- Capability breakdowns and failure case analysis
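To make the headline numbers easier to interpret, here is a minimal sketch of how a 95% confidence interval on an average score can be computed with the normal approximation; the report's exact interval method may differ, and the scores below are illustrative placeholders, not benchmark data.

```python
# Minimal sketch: 95% confidence interval for an average score using the
# normal approximation. The report's exact interval method may differ.
from math import sqrt
from statistics import mean, stdev

def ci95(scores: list[float]) -> tuple[float, float]:
    """Return a (low, high) 95% CI for the mean of per-task scores."""
    m = mean(scores)
    half = 1.96 * stdev(scores) / sqrt(len(scores))  # z ~= 1.96 at 95%
    return (m - half, m + half)

# Illustrative placeholder scores, not benchmark data.
print(ci95([0.6, 0.4, 0.7, 0.5, 0.6, 0.55, 0.65]))
```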
Who It’s For
This benchmark is designed for researchers, model developers, and technical teams evaluating VLMs for real-world applications.
Whether you're comparing models or building your own, this report shows how models handle spatial reasoning, numeric analysis, logic, and real-world decision-making.
→ Preview: Inside the Turing Applied AGI Benchmark for VLM 1.0
We designed this benchmark to mirror how professionals actually think and solve problems—not how academic datasets quiz models.
— Mahesh Joshi, Head of Research, Turing
What We Tested
- 9 core capabilities including perception, logic, spatial, and numerical reasoning
- 4 top-tier VLMs scored using open-ended generation tasks
- Visual inputs including diagrams, charts, and technical illustrations
- Scoring by LLM-as-a-judge, with 5 generations per model per prompt (see the scoring sketch after this list)
- Subset design: ALL vs HARD tasks
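For intuition about that scoring loop, here is a minimal sketch of LLM-as-a-judge evaluation with five generations per prompt, assuming hypothetical generate and judge_score stubs in place of the actual model and judge calls described in the report:

```python
# Minimal sketch of LLM-as-a-judge scoring, 5 generations per model per
# prompt. `generate` and `judge_score` are hypothetical stubs, not the
# benchmark's actual pipeline.
from statistics import mean

N_GENERATIONS = 5  # generations per model per prompt, per the report

def generate(model: str, prompt: str) -> str:
    """Hypothetical VLM call (stub): returns a free-form answer."""
    return f"{model} answer to: {prompt}"

def judge_score(answer: str, reference: str) -> float:
    """Hypothetical LLM judge (stub): returns a score in [0, 1]."""
    return 1.0 if reference.lower() in answer.lower() else 0.0

def score_prompt(model: str, prompt: str, reference: str) -> float:
    # Average the judge's scores over several generations to damp
    # variance from stochastic decoding.
    return mean(judge_score(generate(model, prompt), reference)
                for _ in range(N_GENERATIONS))

print(score_prompt("vlm-x", "Read the peak value off the bar chart.", "42"))
```

Averaging several generations per prompt is a common way to make open-ended evaluation less sensitive to any single sampled output.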
What We Found
- The top model scored just 56.8% on the ALL subset
- All models scored below 7% on the HARD subset
- Perception and spatial reasoning were the lowest-performing capabilities
- Capability-level breakdowns reveal strengths and blind spots across models
Download the Benchmark Report