Access the Turing Applied AGI Benchmark for VLM 1.0 Report
Turing’s new benchmark evaluates how frontier vision-language models perform on realistic, high-complexity tasks in business and STEM domains—using multimodal prompts and free-form model outputs.
Download the full technical report to explore model-level results, error types, evaluation methodology, and detailed performance breakdowns.
What You’ll Get
- The full technical report (PDF)
- Overall accuracy (average score) and 95% confidence intervals for Gemini 2.5, OpenAI o1, Claude 3.7, and more (see the confidence-interval sketch after this list)
- Evaluation pipeline details using LLM-as-a-judge scoring
- Dataset structure and domain taxonomy across business and STEM fields
- Capability breakdowns and failure case analysis
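To make the headline numbers easier to interpret, here is a minimal sketch of how a 95% confidence interval on an average score can be computed with the normal approximation; the report's exact interval method may differ, and the scores below are illustrative placeholders, not benchmark data.

```python
# Minimal sketch: 95% confidence interval for an average score using the
# normal approximation. The report's exact interval method may differ.
from math import sqrt
from statistics import mean, stdev

def ci95(scores: list[float]) -> tuple[float, float]:
    """Return a (low, high) 95% CI for the mean of per-task scores."""
    m = mean(scores)
    half = 1.96 * stdev(scores) / sqrt(len(scores))  # z ~= 1.96 at 95%
    return (m - half, m + half)

# Illustrative placeholder scores, not benchmark data.
print(ci95([0.6, 0.4, 0.7, 0.5, 0.6, 0.55, 0.65]))
```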
Who It’s For
This benchmark is designed for researchers, model developers, and technical teams evaluating VLMs for real-world applications.
Whether you're comparing models or building your own, this report shows how models handle spatial reasoning, numeric analysis, logic, and real-world decision-making.
→ Preview: Inside the Turing Applied AGI Benchmark for VLM 1.0
We designed this benchmark to mirror how professionals actually think and solve problems—not how academic datasets quiz models.
— Mahesh Joshi, Head of Research, Turing
What We Tested
- 9 core capabilities including perception, logic, spatial, and numerical reasoning
- 4 top-tier VLMs scored using open-ended generation tasks
- Visual inputs including diagrams, charts, and technical illustrations
- Scoring by LLM-as-a-judge, with 5 generations per model per prompt (see the scoring sketch after this list)
- Subset design: ALL vs HARD tasks
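For intuition about that scoring loop, here is a minimal sketch of LLM-as-a-judge evaluation with five generations per prompt, assuming hypothetical generate and judge_score stubs in place of the actual model and judge calls described in the report:

```python
# Minimal sketch of LLM-as-a-judge scoring, 5 generations per model per
# prompt. `generate` and `judge_score` are hypothetical stubs, not the
# benchmark's actual pipeline.
from statistics import mean

N_GENERATIONS = 5  # generations per model per prompt, per the report

def generate(model: str, prompt: str) -> str:
    """Hypothetical VLM call (stub): returns a free-form answer."""
    return f"{model} answer to: {prompt}"

def judge_score(answer: str, reference: str) -> float:
    """Hypothetical LLM judge (stub): returns a score in [0, 1]."""
    return 1.0 if reference.lower() in answer.lower() else 0.0

def score_prompt(model: str, prompt: str, reference: str) -> float:
    # Average the judge's scores over several generations to damp
    # variance from stochastic decoding.
    return mean(judge_score(generate(model, prompt), reference)
                for _ in range(N_GENERATIONS))

print(score_prompt("vlm-x", "Read the peak value off the bar chart.", "42"))
```

Averaging several generations per prompt is a common way to make open-ended evaluation less sensitive to any single sampled output.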
What We Found
- The top model scored just 56.8% on the ALL subset
- All models scored below 7% on the HARD subset
- Perception and spatial reasoning were the lowest-performing capabilities
- Capability-level breakdowns reveal strengths and blind spots across models
Download the Benchmark Report