LLM Evaluation Framework

Production-grade open-source LLM benchmarking. Evaluate GPT-4, Claude, Gemini, Mistral and Llama on 5 metrics, side by side, in one command.

What This Is

This is the model card / hub page for the LLM Evaluation Framework. The framework itself is a Python tool, not a set of neural network weights; this page serves as the HuggingFace hub entry point linking all resources together.

Quick Start

pip install llm-evaluation-framework
export OPENAI_API_KEY="sk-..."
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100

Output:

╭──────────────────────────────────────╮
│  Evaluation: gpt-4o-mini             │
├──────────────────┬───────────────────┤
│ Accuracy         │ 78.00%            │
│ Avg Latency      │ 432 ms            │
│ P95 Latency      │ 1240 ms           │
│ Total Cost       │ $0.0023           │
│ Hallucination    │ 2.40%             │
│ Reasoning Score  │ 7.2 / 10          │
╰──────────────────┴───────────────────╯
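
The Avg Latency and P95 Latency rows are summary statistics over per-request timings. The sketch below shows one way such figures, together with an SLA violation rate, could be computed. The nearest-rank percentile method, the function names, and the 1000 ms SLA threshold are assumptions for illustration, not the framework's actual implementation.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one latency sample")
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize_latencies(latencies_ms, sla_ms=1000.0):
    """Return average, p50-p99, and the share of requests slower than sla_ms."""
    ordered = sorted(latencies_ms)
    summary = {"avg_ms": sum(ordered) / len(ordered)}
    for pct in (50, 75, 90, 95, 99):
        summary[f"p{pct}_ms"] = percentile(ordered, pct)
    # Hypothetical SLA rule: a request violates the SLA if it exceeds sla_ms.
    violations = sum(1 for value in ordered if value > sla_ms)
    summary["sla_violation_rate"] = violations / len(ordered)
    return summary


if __name__ == "__main__":
    samples = [100, 200, 300, 400, 500, 600, 700, 800, 900, 2000]
    print(summarize_latencies(samples, sla_ms=1000))
```

With only ten samples the high percentiles all collapse onto the slowest request, which is why percentile figures from small runs should be read with care.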

5 Evaluation Metrics

| Metric | Description | Output |
|---|---|---|
| Accuracy | 4-strategy cascade: exact → normalized → MC → fuzzy | 0.0–1.0 |
| Latency | p50, p75, p90, p95, p99 percentiles + SLA violation rate | ms |
| Cost | Real token counts × pricing table for 15+ models | $/1K tokens |
| Hallucination Rate | Linguistic signal analysis (v1), NLI planned (v2) | 0.0–1.0 |
| Reasoning Quality | Chain-of-thought depth scoring | 1–10 |
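
To make the accuracy cascade concrete, here is a minimal sketch of a four-step matcher that tries exact, normalized, multiple-choice, and fuzzy comparison in that order. Everything in it is an assumption for illustration: the function names, reading "MC" as multiple choice with letters A-D, the normalization rules, and the 0.85 fuzzy threshold are not taken from the framework's code.

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def extract_choice(text: str) -> str | None:
    """Return the first standalone A-D letter, if any (hypothetical rule).

    Limitation: the English article "A" is also picked up as a choice.
    """
    match = re.search(r"\b([A-D])\b", text.upper())
    return match.group(1) if match else None


def is_correct(prediction: str, reference: str,
               fuzzy_threshold: float = 0.85) -> tuple[bool, str]:
    """Try each strategy in order and report which one matched."""
    if prediction == reference:
        return True, "exact"
    norm_pred, norm_ref = normalize(prediction), normalize(reference)
    if norm_pred == norm_ref:
        return True, "normalized"
    # Multiple choice: only applies when the reference is a single letter.
    ref_letter = reference.strip().upper()
    if len(ref_letter) == 1 and extract_choice(prediction) == ref_letter:
        return True, "mc"
    if SequenceMatcher(None, norm_pred, norm_ref).ratio() >= fuzzy_threshold:
        return True, "fuzzy"
    return False, "none"


def accuracy(pairs) -> float:
    """Fraction of (prediction, reference) pairs judged correct, 0.0 to 1.0."""
    pairs = list(pairs)
    return sum(is_correct(pred, ref)[0] for pred, ref in pairs) / len(pairs)


if __name__ == "__main__":
    print(is_correct("The answer is B.", "B"))
    print(accuracy([("Paris", "Paris"), ("London", "Paris")]))
```

Ordering the strategies from strictest to loosest means a match is always attributed to the strictest rule that accepts it, which keeps fuzzy matches visible as a separate, lower-confidence category.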

Supported Models

| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o1-mini, GPT-3.5-turbo |
| Anthropic | Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash |
| Mistral | Mistral Large, Mistral Small |
| Meta | Llama 3 70B, Llama 3 8B (via Together AI) |
| Local | Ollama, vLLM, HuggingFace TGI |

Sample Benchmark Results (MMLU, 100 samples)

| Model | Accuracy | Latency | Cost/1K | Hallucination | Reasoning |
|---|---|---|---|---|---|
| GPT-4o | 88.2% | 892ms | $0.0080 | 1.8% | 8.4/10 |
| Claude 3.5 Sonnet | 87.6% | 1240ms | $0.0090 | 2.1% | 8.6/10 |
| GPT-4o-mini | 78.4% | 432ms | $0.0003 | 3.2% | 7.2/10 |
| Gemini 1.5 Flash | 76.8% | 380ms | $0.0001 | 4.1% | 6.8/10 |
| Claude 3 Haiku | 74.2% | 410ms | $0.0010 | 4.8% | 6.5/10 |

Key finding: GPT-4o-mini achieves about 89% of GPT-4o's accuracy (78.4% vs. 88.2%) at about 4% of the cost ($0.0003 vs. $0.0080 per 1K tokens).
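
The ratios behind this finding can be recomputed directly from the GPT-4o and GPT-4o-mini rows of the results table. The snippet below is only a worked calculation; the dictionary layout and variable names are illustrative and not part of the framework.

```python
# Accuracy (%) and cost ($ per 1K tokens) copied from the results table.
results = {
    "GPT-4o": {"accuracy": 88.2, "cost_per_1k": 0.0080},
    "GPT-4o-mini": {"accuracy": 78.4, "cost_per_1k": 0.0003},
}

full, mini = results["GPT-4o"], results["GPT-4o-mini"]

# Relative accuracy and relative cost of the smaller model.
accuracy_ratio = mini["accuracy"] / full["accuracy"]
cost_ratio = mini["cost_per_1k"] / full["cost_per_1k"]

print(f"accuracy: {accuracy_ratio:.1%} of GPT-4o")  # accuracy: 88.9% of GPT-4o
print(f"cost:     {cost_ratio:.2%} of GPT-4o")      # cost:     3.75% of GPT-4o
```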

Features

  • Async parallel evaluation: 10 models at once via asyncio.Semaphore
  • Streamlit dashboard: radar charts, latency histograms, cost vs quality scatter
  • FastAPI REST API: 12 endpoints with OpenAPI docs
  • CLI tool: 7 subcommands with rich terminal output
  • PDF report generator: professional layout via ReportLab
  • SQLite persistence: zero-config, file-based storage
  • Docker ready: multi-stage build, docker-compose up
  • 40+ tests, 95% coverage: pytest, no API keys needed
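
The async parallel evaluation feature relies on asyncio.Semaphore to cap how many models are evaluated at once. The sketch below illustrates that general pattern only: evaluate_model is a hypothetical stand-in that sleeps instead of calling a provider, and evaluate_all and max_concurrency are assumed names, not the framework's API.

```python
import asyncio
import random


async def evaluate_model(name: str) -> dict:
    """Hypothetical stand-in for a real evaluation call; just sleeps briefly."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return {"model": name, "status": "done"}


async def evaluate_all(models, max_concurrency: int = 10) -> list:
    """Evaluate all models, with at most max_concurrency running at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(name: str) -> dict:
        # Each task waits here until one of the semaphore slots is free.
        async with semaphore:
            return await evaluate_model(name)

    # gather() returns results in the same order as the input models.
    return await asyncio.gather(*(bounded(name) for name in models))


if __name__ == "__main__":
    models = ["gpt-4o-mini", "claude-3-5-sonnet", "gemini-1.5-flash"]
    for result in asyncio.run(evaluate_all(models, max_concurrency=2)):
        print(result)
```

A semaphore keeps the number of in-flight requests bounded without serializing them, which helps stay under provider rate limits while still overlapping network waits.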

Architecture

CLI / FastAPI / Streamlit / PDF Generator
              │
        Core Evaluator (asyncio)
              │
   ┌──────────┼──────────┬──────────┐
Metrics  Benchmarks  Database  LiteLLM
accuracy  MMLU        SQLite    OpenAI
latency   TruthfulQA           Anthropic
cost      Custom CSV           Google
hallucin.                      Mistral
reasoning                      Together

Install

# pip
pip install llm-evaluation-framework

# With extras
pip install "llm-evaluation-framework[dashboard,reports,dev]"

# Docker
docker-compose up -d

License

MIT. Free for research and commercial use.

Citation

@software{vigneshwar234_llm_eval_2025,
  author  = {Vigneshwar S},
  title   = {LLM Evaluation Framework},
  year    = {2025},
  url     = {https://github.com/vignesh2027/LLM-Evaluation-Framework},
  license = {MIT}
}