A comprehensive, open-source framework for evaluating and comparing search APIs — Linkup, Exa, Tavily, and Perplexity — using configurable datasets and three independent scoring systems.
- Multi-system comparison — benchmark up to 4 search APIs side-by-side
- 3 independent scoring systems — Ragas (LLM-based), HuggingFace Judges, and automated metrics (BERTScore)
- Flexible dataset loading — CSV, JSONL, or built-in samples
- Full CLI pipeline — load datasets, run searches, evaluate responses, analyze results
- Interactive dashboard — Streamlit-powered UI for exploring results and adjusting weights
- Evaluation caching — granular per-metric caching to minimize API costs
- Configurable weights — adjust scoring weights via dashboard or config
```bash
cd standard-benchmark
uv sync
cp .env.example .env
# Edit .env — at minimum you need OPENAI_API_KEY and one search API key
```

```bash
# Built-in sample (30 queries, instant)
uv run standard-benchmark dataset sample --output data/queries.jsonl

# Or from CSV
uv run standard-benchmark dataset load \
  --source csv --path queries.csv --text-column question \
  --output data/queries.jsonl
```

```bash
uv run standard-benchmark search run \
  -i data/queries.jsonl \
  -o data/responses.json \
  --systems linkup,exa,tavily,perplexity
```

```bash
uv run standard-benchmark evaluate run \
  -q data/queries.jsonl \
  -r data/responses.json \
  -o data/evaluations.json \
  --use-ragas
```

```bash
# Summary statistics
uv run standard-benchmark analyze stats --evaluations data/evaluations.json

# Interactive dashboard
uv run standard-benchmark analyze dashboard
```

Queries are stored as JSONL, one JSON object per line:
{"id": "q_0001", "text": "What were Apple's total revenues in Q4 2024?", "metadata": {}}| Field | Type | Required | Description |
|---|---|---|---|
id |
string | auto-generated | Unique query identifier |
text |
string | yes | The search query |
metadata |
object | no | Arbitrary metadata |
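A minimal sketch of producing and consuming this format with only the standard library (the file path and second query here are illustrative, not part of the shipped sample):

```python
import json
import tempfile
from pathlib import Path

# Illustrative queries matching the schema above.
queries = [
    {"id": "q_0001", "text": "What were Apple's total revenues in Q4 2024?", "metadata": {}},
    {"id": "q_0002", "text": "Who founded the company?", "metadata": {"topic": "business"}},
]

# Write JSONL: one JSON object per line.
path = Path(tempfile.mkdtemp()) / "queries.jsonl"
path.write_text("\n".join(json.dumps(q) for q in queries) + "\n", encoding="utf-8")

# Read it back, skipping blank lines.
loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]
```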
All scores are normalized to a 1-10 scale.
**Ragas (LLM-based)**

| Metric | Weight | Description |
|---|---|---|
| Faithfulness | 0.45 | Factual accuracy validated against sources |
| Completeness | 0.45 | Coverage of query aspects |
| Source Quality | 0.10 | Quality of retrieved sources |
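As an illustration of how per-metric weights like those above could combine into a single score, here is a plain weighted average over normalized 1-10 scores. This is a sketch of the general idea; the framework's actual aggregation may differ.

```python
# Weights from the Faithfulness / Completeness / Source Quality table above.
WEIGHTS = {"faithfulness": 0.45, "completeness": 0.45, "source_quality": 0.10}

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of normalized (1-10) metric scores.
    Assumes every weighted metric is present in `scores`."""
    return sum(weights[name] * scores[name] for name in weights)

scores = {"faithfulness": 9.0, "completeness": 8.0, "source_quality": 6.0}
result = composite_score(scores, WEIGHTS)  # 0.45*9 + 0.45*8 + 0.10*6 = 8.25
```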
**HuggingFace Judges**

| Metric | Weight | Description |
|---|---|---|
| Correctness Classifier | 0.45 | Binary correctness (PollMultihopCorrectness) |
| Correctness Grader | 0.45 | Granular correctness (PrometheusAbsoluteCoarseCorrectness) |
| Response Quality | 0.10 | Overall quality (MTBenchChatBotResponseQuality) |
**Automated Metrics (BERTScore)**

| Metric | Weight | Description |
|---|---|---|
| BERTScore Relevance | 0.55 | Semantic similarity: query ↔ response |
| BERTScore Faithfulness | 0.45 | Semantic similarity: response ↔ sources |
| Metric | Original Scale | Normalized To |
|---|---|---|
| Faithfulness | 0-1 | 1-10 |
| Completeness | 1-5 | 1-10 |
| Source Quality | 1-5 | 1-10 |
| HF Classifier | True/False | 0 or 10 (binary) |
| HF Grader | 1-5 | 1-10 |
| HF Quality | 1-10 | 1-10 (native) |
| BERTScore | 0-1 | 1-10 |
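For the non-binary rows above, a straightforward linear rescale maps each original range onto 1-10 (the binary HF Classifier is special-cased to 0 or 10, per the table). This is a sketch assuming a simple linear mapping; the framework's internals may differ.

```python
def normalize(value: float, lo: float, hi: float) -> float:
    """Linearly rescale a score from [lo, hi] onto the 1-10 scale."""
    return 1 + (value - lo) / (hi - lo) * 9

a = normalize(0.5, 0, 1)  # a 0-1 metric (e.g. BERTScore): 0.5 -> 5.5
b = normalize(3, 1, 5)    # a 1-5 metric (e.g. HF Grader): 3 -> 5.5
c = normalize(1, 1, 5)    # floor of the range maps to 1
d = normalize(5, 1, 5)    # ceiling of the range maps to 10
```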
Each metric is cached individually (7-day TTL). Adding new judges later won't invalidate existing cached metrics — only missing ones trigger API calls.
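The caching behavior described above can be sketched as keying each cache entry on a (metric, query, system) triple with a timestamped TTL check. All names and the on-disk layout here are hypothetical, not the framework's actual storage layer:

```python
import hashlib
import json
import time
from pathlib import Path

TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL, as described above

def cache_key(metric: str, query_id: str, system: str) -> str:
    # One key per (metric, query, system) triple, so adding a new judge
    # later only creates new entries and never invalidates existing ones.
    return hashlib.sha256(f"{metric}:{query_id}:{system}".encode()).hexdigest()

def get_or_compute(cache_dir: Path, metric: str, query_id: str, system: str, compute):
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(metric, query_id, system)}.json"
    if path.exists():
        entry = json.loads(path.read_text())
        if time.time() - entry["ts"] < TTL_SECONDS:
            return entry["score"]  # cache hit: no API call
    score = compute()  # cache miss or expired: triggers the API call
    path.write_text(json.dumps({"ts": time.time(), "score": score}))
    return score
```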
Edit `config/evaluation_config.yaml` to adjust scoring weights for each evaluation system.
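This README does not show the file's schema; one plausible shape, with key names that are purely illustrative, might look like the following. Check the shipped `config/evaluation_config.yaml` for the real layout.

```yaml
# Hypothetical sketch — key names are illustrative, not the actual schema.
ragas:
  faithfulness: 0.45
  completeness: 0.45
  source_quality: 0.10
huggingface:
  correctness_classifier: 0.45
  correctness_grader: 0.45
  response_quality: 0.10
automated:
  bertscore_relevance: 0.55
  bertscore_faithfulness: 0.45
```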
```
standard-benchmark/
├── src/standard_benchmark/
│   ├── cli/            # CLI commands (dataset, search, evaluate, analyze)
│   ├── config/         # Settings, evaluation config
│   ├── data/
│   │   ├── loaders/    # Dataset loaders (CSV, JSONL, sample)
│   │   └── models.py   # Pydantic data models
│   ├── search/         # Search API clients (Linkup, Exa, Tavily, Perplexity)
│   ├── evaluation/     # Evaluation pipeline (Ragas, HuggingFace, automated)
│   ├── visualization/  # Streamlit dashboard & charts
│   ├── storage/        # Data persistence & caching
│   └── utils/          # Embeddings, helpers, logging
├── config/             # Configuration files
├── data/               # Data directory (queries ship here; outputs written at runtime)
├── pyproject.toml
├── .env.example
├── README.md
└── LICENSE
```
Pre-download the embedding model (it is otherwise fetched automatically on first run):

```bash
uv run python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"
```

Clear caches:

```bash
uv run standard-benchmark search clear-cache
uv run standard-benchmark evaluate clear-cache
```

MIT — see LICENSE.