#

llm-evaluation

Here are 26 public repositories matching this topic...

langfuse

langfuse / langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai observability autogen large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

Updated Aug 28, 2025
TypeScript

promptfoo / promptfoo

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Aug 28, 2025
TypeScript

Helicone / helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

open-source playground monitoring analytics evaluation ycombinator openai gpt large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability agent-monitoring llm-cost

Updated Aug 28, 2025
TypeScript

lmnr-ai / lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

Updated Aug 28, 2025
TypeScript

palico-ai / palico-ai

Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework

nodejs javascript docker typescript ai full-stack openai autogen rag llm portkey langchain anthropic llamaindex langchain-js llm-agent llm-framework llm-evaluation llm-observability

Updated Nov 26, 2024
TypeScript

kolenaIO / autoarena

Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation

testing ai evaluation hacktoberfest rag llm llm-evaluation

Updated Dec 16, 2024
TypeScript

PromptMixerDev / prompt-mixer-app-ce

A desktop application for comparing outputs from different Large Language Models (LLMs).

prompt prompt-toolkit llm prompt-engineering prompting llms llm-evaluation

Updated Aug 25, 2025
TypeScript

AgiFlow / agiflow-sdks

LLM QA, Observability, Evaluation and User Feedback

python machine-learning typescript agents llms llmops llm-agent llm-evaluation

Updated Aug 6, 2024
TypeScript

Supahands / llm-comparison

This is an opensource project allowing you to compare two LLM's head to head with a given prompt, it has a wide range of supported models, from opensource ollama ones to the likes of openai and claude

ai llm chatgpt ollama llm-evaluation

Updated Mar 24, 2025
TypeScript

hypersigilhq / hypersigil

Prompt management gateway with a UI for AI-powered applications. Enables non-technical users to test, refine, and deploy prompts seamlessly across multiple AI providers.

prompt-toolkit prompt-tuning llm prompt-engineering llm-evaluation llm-gateway

Updated Aug 5, 2025
TypeScript

0xsomesh / rawbench

RawBench: Powerful, minimal framework for LLM prompt evaluation with YAML configuration, tool execution support, and comprehensive result tracking.

tools ai evaluation-metrics llm llmops evals llm-evaluation

Updated Jul 5, 2025
TypeScript

prompt-foundry / typescript-sdk

The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.

typescript gpt open-ai gpt-3 gpt-4 llm prompt-engineering llmops prompt-testing prompt-manager prompt-management llm-eval llm-test llm-ops llm-evaluation prompt-evaluation

Updated Sep 14, 2024
TypeScript

yukinagae / genkitx-promptfoo

Community Plugin for Genkit to use Promptfoo

plugin testing firebase ai evaluation prompt prompts evaluation-framework llm llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework promptfoo genkit genkitx genkit-plugin

Updated Jan 3, 2025
TypeScript

parea-ai / parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

llm prompt-engineering llms llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Jan 17, 2025
TypeScript

llm-coup

khoj-ai / llm-coup

Let LLMs play coup with each other and see who's the best at deception & strategy

environment games ai simulation artificial-intelligence coup deception llms llm-eval llm-evaluation llm-benchmarking

Updated Aug 18, 2025
TypeScript

euskoog / openai-assistants-evals

Visualize LLM Evaluations for OpenAI Assistants

openai tailwindcss llms llm-evaluation openai-assistants

Updated Mar 27, 2024
TypeScript

tuitige / fijian-rag-app

Public-benefit GenAI platform for the Fijian language — combining Claude + RAG + OpenSearch to build verified datasets, preserve culture, and unlock AI access in the Pacific. LLM Fine-Tuning, RAG, Generative AI and learning

aws bedrock vector-database cdk-examples generative-ai llm-training retrieval-augmented-generation llm-evaluation llm-agents rag-chatbot fijian-language

Updated Aug 28, 2025
TypeScript

IgorWnek / deno-aidevs-cli

This is the Deno 2 implementation of all tasks I did during AI Devs 3 course. It's integrated with Anthropic and OpenAI APIs here. Many different tools regarding LLMs are used here.

typescript backend prompt prompt-toolkit prompt-engineering llms prompt-design llm-agent llm-tools llm-evaluation deno2

Updated Jan 8, 2025
TypeScript

rahatmoktadir03 / llm-evaluation-platform

A full-stack web application for comparing and analyzing the performance of large language models (LLMs). Features include side-by-side prompt evaluation, performance metrics visualization, and an analytics dashboard. Built with React, Tailwind CSS, Node.js, and MongoDB."

react typescript ai fullstack-development tailwind-css large-language-models prompt-engineering llm-evaluation llm-comparison

Updated Jan 6, 2025
TypeScript

turify / turify-prompts

🧠 Build advanced prompt optimisation engine with AI scoring, recommendations - contributors needed for LLM evaluation algorithms. NexJS 15, LangChain, Prism, Shadcn.

openai prisma prompt-engineering llms langchain shadcn genai llm-evaluation langgraph nextjs15

Updated Jun 10, 2025
TypeScript

Improve this page

Add a description, image, and links to the llm-evaluation topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the llm-evaluation topic, visit your repo's landing page and select "manage topics."