ai-benchmark

Here are 7 public repositories matching this topic...

microsoft / WindowsAgentArena

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.

windows ai computer ai-research ai-agent agentic ai-benchmark desktop-agent computer-use

Updated Apr 30, 2025
Python

GoodStartLabs / AI_Diplomacy

Frontier Models playing the board game Diplomacy.

benchmark ai llm ai-benchmark

Updated Sep 5, 2025
Python

TheAgentCompany / TheAgentCompany

An agent benchmark with tasks in a simulated software company.

agent benchmark ai ai-research llm ai-benchmark

Updated Aug 25, 2025
Python

Habitante / gta-benchmark

GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities

python docker flask benchmark machine-learning puzzle reverse-engineering educational pattern-recognition ctf binary-analysis algorithm-analysis computational-thinking algorithmic-reasoning ai-benchmark

Updated Jan 12, 2025
Python

mlcommons / storage_results_v2.0

This repository contains the results and code for the MLPerf™ Storage v2.0 benchmark.

benchmark machine-learning ai ai-benchmark ml-benchmark

Updated Aug 4, 2025
Python

trustyai-explainability / llama-stack-provider-lmeval

TrustyAI's LMEval provider for Llama Stack

evaluation provider ai-benchmark llama-stack

Updated Sep 10, 2025
Python

dosa122 / gta-benchmark

Challenge your AI's algorithmic thinking with GTA Benchmark! Reverse-engineer transformations from input-output pairs. Join now! 🐙🚀

Updated Jun 21, 2025
Python

