LLM Evaluation Framework
Production-grade open-source LLM benchmarking. Evaluate GPT-4, Claude, Gemini, Mistral, and Llama on 5 metrics, side by side, in one command.
What This Is
This is the model card / hub page for the LLM Evaluation Framework. The framework itself is a Python tool, not a set of neural network weights; this page serves as the HuggingFace hub entry point linking all resources together.
Quick Start
pip install llm-evaluation-framework
export OPENAI_API_KEY="sk-..."
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100
Output:
╭──────────────────────────────────────╮
│ Evaluation: gpt-4o-mini              │
├──────────────────┬───────────────────┤
│ Accuracy         │ 78.00%            │
│ Avg Latency      │ 432 ms            │
│ P95 Latency      │ 1240 ms           │
│ Total Cost       │ $0.0023           │
│ Hallucination    │ 2.40%             │
│ Reasoning Score  │ 7.2 / 10          │
╰──────────────────┴───────────────────╯
5 Evaluation Metrics
| Metric | Description | Output |
|---|---|---|
| Accuracy | 4-strategy cascade: exact → normalized → MC → fuzzy | 0.0–1.0 |
| Latency | p50, p75, p90, p95, p99 percentiles + SLA violation rate | ms |
| Cost | Real token counts × pricing table for 15+ models | $/1K tokens |
| Hallucination Rate | Linguistic signal analysis (v1), NLI planned (v2) | 0.0–1.0 |
| Reasoning Quality | Chain-of-thought depth scoring | 1–10 |
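The accuracy row describes a four-stage cascade. Below is a minimal sketch of how such a cascade might be structured. It is not the framework's actual implementation. The function names, the normalization rules, the A-D letter regex, and the 0.85 fuzzy threshold are all assumptions made for illustration. `difflib.SequenceMatcher` from the standard library stands in for whatever fuzzy matcher the framework uses.

```python
import re
import string
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def extract_choice(text: str) -> Optional[str]:
    """Return the first standalone uppercase letter A-D in the text, if any."""
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None


def match_strategy(prediction: str, reference: str,
                   fuzzy_threshold: float = 0.85) -> Optional[str]:
    """Return the name of the first strategy that matches, or None."""
    # Stage 1: exact string equality.
    if prediction == reference:
        return "exact"
    # Stage 2: equality after normalization.
    if normalize(prediction) == normalize(reference):
        return "normalized"
    # Stage 3: multiple-choice letter extracted from a longer response.
    choice = extract_choice(prediction)
    if choice is not None and choice == reference.strip().upper():
        return "mc"
    # Stage 4: fuzzy similarity on the normalized strings.
    ratio = SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()
    if ratio >= fuzzy_threshold:
        return "fuzzy"
    return None


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (prediction, reference) pairs matched by any strategy."""
    if not pairs:
        return 0.0
    hits = sum(match_strategy(p, r) is not None for p, r in pairs)
    return hits / len(pairs)
```

Ordering the stages from strictest to loosest means a response is credited by the most conservative rule that accepts it, which makes it easy to report how many matches depended on the fuzzy fallback.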
Supported Models
| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o1-mini, GPT-3.5-turbo |
| Anthropic | Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash |
| Mistral | Mistral Large, Mistral Small |
| Meta | Llama 3 70B, Llama 3 8B (via Together AI) |
| Local | Ollama, vLLM, HuggingFace TGI |
Sample Benchmark Results (MMLU, 100 samples)
| Model | Accuracy | Latency | Cost/1K | Hallucination | Reasoning |
|---|---|---|---|---|---|
| GPT-4o | 88.2% | 892ms | $0.0080 | 1.8% | 8.4/10 |
| Claude 3.5 Sonnet | 87.6% | 1240ms | $0.0090 | 2.1% | 8.6/10 |
| GPT-4o-mini | 78.4% | 432ms | $0.0003 | 3.2% | 7.2/10 |
| Gemini 1.5 Flash | 76.8% | 380ms | $0.0001 | 4.1% | 6.8/10 |
| Claude 3 Haiku | 74.2% | 410ms | $0.0010 | 4.8% | 6.5/10 |
Key finding: GPT-4o-mini achieves about 89% of GPT-4o's accuracy (78.4% vs. 88.2%) at about 4% of the cost ($0.0003 vs. $0.0080 per 1K tokens).
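The ratios behind the key finding can be recomputed from the GPT-4o and GPT-4o-mini rows of the table. The short sketch below does that arithmetic; the dictionary layout and the `relative` helper are illustrative names, not part of the framework.

```python
# Values copied from the sample results table (MMLU, 100 samples).
results = {
    "GPT-4o": {"accuracy": 88.2, "cost_per_1k": 0.0080},
    "GPT-4o-mini": {"accuracy": 78.4, "cost_per_1k": 0.0003},
}


def relative(metric: str, model: str, baseline: str) -> float:
    """Return the model's metric as a percentage of the baseline's."""
    return 100 * results[model][metric] / results[baseline][metric]


accuracy_pct = relative("accuracy", "GPT-4o-mini", "GPT-4o")
cost_pct = relative("cost_per_1k", "GPT-4o-mini", "GPT-4o")
print(f"accuracy: {accuracy_pct:.1f}% of GPT-4o")  # accuracy: 88.9% of GPT-4o
print(f"cost: {cost_pct:.2f}% of GPT-4o")          # cost: 3.75% of GPT-4o
```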
Features
- Async parallel evaluation: 10 models at once via asyncio.Semaphore
- Streamlit dashboard: radar charts, latency histograms, cost vs quality scatter
- FastAPI REST API: 12 endpoints with OpenAPI docs
- CLI tool: 7 subcommands with rich terminal output
- PDF report generator: professional layout via ReportLab
- SQLite persistence: zero-config, file-based storage
- Docker ready: multi-stage build, docker-compose up
- 40+ tests, 95% coverage: pytest, no API keys needed
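The bounded parallel evaluation named in the first feature can be sketched with `asyncio.Semaphore`. This illustrates the general pattern only and is not the framework's code: `fake_evaluate`, `evaluate_all`, and the model names are hypothetical, and the API call is replaced by `asyncio.sleep` so the sketch runs without keys.

```python
import asyncio
from typing import Dict, List


async def fake_evaluate(model: str) -> Dict[str, str]:
    """Hypothetical stand-in for a real model call; sleeps instead of hitting an API."""
    await asyncio.sleep(0.01)
    return {"model": model, "status": "ok"}


async def evaluate_all(models: List[str], max_concurrent: int = 10) -> List[Dict[str, str]]:
    """Evaluate every model with at most max_concurrent evaluations in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(model: str) -> Dict[str, str]:
        # Waits here whenever max_concurrent evaluations are already running.
        async with semaphore:
            return await fake_evaluate(model)

    # gather returns results in the same order as the input list.
    return await asyncio.gather(*(bounded(m) for m in models))


if __name__ == "__main__":
    names = [f"model-{i}" for i in range(25)]
    outcomes = asyncio.run(evaluate_all(names))
    print(len(outcomes), outcomes[0])
```

A semaphore caps concurrency without batching, so a slow model does not hold up the start of the next one as long as a slot is free.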
Architecture
CLI / FastAPI / Streamlit / PDF Generator
           │
Core Evaluator (asyncio)
           │
┌──────────┼──────────┬──────────┐
Metrics    Benchmarks Database   LiteLLM
accuracy   MMLU       SQLite     OpenAI
latency    TruthfulQA            Anthropic
cost       Custom CSV            Google
hallucin.                        Mistral
reasoning                        Together
Install
# pip
pip install llm-evaluation-framework
# With extras
pip install "llm-evaluation-framework[dashboard,reports,dev]"
# Docker
docker-compose up -d
License
MIT. Free for research and commercial use.
Citation
@software{vigneshwar234_llm_eval_2025,
author = {Vigneshwar S},
title = {LLM Evaluation Framework},
year = {2025},
url = {https://github.com/vignesh2027/LLM-Evaluation-Framework},
license = {MIT}
}