Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
Updated Apr 30, 2025 - Python
Frontier Models playing the board game Diplomacy.
An agent benchmark with tasks in a simulated software company.
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities
This repository contains the results and code for the MLPerf™ Storage v2.0 benchmark.
TrustyAI's LMEval provider for Llama Stack
Challenge your AI's algorithmic thinking with GTA Benchmark! Reverse-engineer transformations from input-output pairs. Join now! 🐙🚀