A modern web application for evaluating and judging LLM responses using local models via Ollama. Test your LLMs locally without sending data to external APIs!
- 📤 Multiple File Upload - Drag and drop CSV files with test cases
- 🤖 Local LLM Integration - Use your own models via Ollama
- 📊 Real-time Progress - Track evaluation progress with visual indicators
- 📝 Detailed Judgments - Get comprehensive scoring of LLM responses
- 📈 Export Results - Download in CSV and JSON formats
- 🔒 Privacy Focused - All processing happens locally
- Clone the repository:
```bash
git clone https://github.com/eRaz00r/evaluatoor.git
cd evaluatoor
```

- Install dependencies:

```bash
npm install
```

- Start the development server:

```bash
npm run dev
```

- Open http://localhost:3000 in your browser.
The application expects CSV files with the following columns:
| Column | Description |
|---|---|
| `id` (optional) | A unique identifier for the test case |
| `input` | The input prompt for the LLM |
| `expected_output` | The expected response from the LLM |
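Parsed into code, each row might map to a shape like the TypeScript interface below (the interface name is illustrative, not taken from the codebase); the example CSV that follows shows two such rows.

```typescript
// Illustrative shape for one parsed test case; field names mirror the CSV columns above.
interface TestCase {
  id?: string;             // optional unique identifier
  input: string;           // prompt sent to the model under evaluation
  expected_output: string; // reference answer used during judging
}
```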
```csv
id,input,expected_output
1,"What is the capital of France?","The capital of France is Paris."
2,"Explain quantum computing in simple terms.","Quantum computing uses quantum bits or qubits that can exist in multiple states at once, unlike classical bits that are either 0 or 1. This allows quantum computers to process certain types of problems much faster than classical computers."
```

- 📤 Upload CSV - Upload a CSV file containing test cases
- 🤖 Select Models - Choose LLMs for evaluation and judgment
- ▶️ Run Evaluation - Process test cases and generate responses
- ⚖️ Judge Results - Compare generated responses against expected outputs
- 💾 Download - Export results for further analysis
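The exact export schema depends on the application, but each exported record can be thought of as the test case joined with the generated response and the judge's verdict (described in the next section). An illustrative TypeScript shape, with assumed field names:

```typescript
// Illustrative only; field names are assumptions, not the app's actual export schema.
interface EvaluationResult {
  id?: string;              // from the uploaded CSV, if present
  input: string;            // original prompt
  expected_output: string;  // reference answer from the CSV
  generated_output: string; // response produced by the evaluated model
  score: number;            // 0-10 score assigned by the judge
  explanation: string;      // judge's reasoning
}
```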
The application uses a specialized LLM judge to evaluate the quality of each generated response against its expected output. The judge receives the following prompt:
```text
You are an expert evaluator of LLM responses. Your task is to judge the quality of a generated response compared to an expected response.

Context:
- Input prompt: {input}
- Expected response: {expected}
- Generated response: {generated}

Evaluate the generated response based on the following criteria:
1. Accuracy: How well does it match the factual content of the expected response?
2. Completeness: Does it cover all key points from the expected response?
3. Clarity: Is it well-written and easy to understand?
4. Relevance: Does it directly address the input prompt?

Provide your evaluation in the following format:
1. A score from 0-10 (where 10 is perfect)
2. A brief explanation of your judgment

Your response should be in JSON format:
{
  "score": <number>,
  "explanation": "<your detailed judgment>"
}

Remember:
- Be objective and consistent
- Consider context and nuance
- Focus on substance over style
- Account for valid alternative phrasings
```
For the input:

```json
{
  "input": "What is the capital of France?",
  "expected": "The capital of France is Paris.",
  "generated": "Paris is the capital city of France and is known for the Eiffel Tower."
}
```

The judge might respond:
```json
{
  "score": 9,
  "explanation": "The generated response is accurate and complete, providing the correct information that Paris is France's capital. It goes slightly beyond the expected response by adding relevant context about the Eiffel Tower, which is appropriate but not necessary. The response is clear, concise, and directly addresses the question."
}
```

All processing is done locally on your machine. No data is sent to external servers or APIs. The application communicates only with the local Ollama API.
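For reference, a judge call against that local API might look like the minimal TypeScript sketch below. It assumes the default Ollama endpoint at `http://localhost:11434` and a model you have already pulled; the function and model names are illustrative, not the application's actual code.

```typescript
interface Judgment {
  score: number;       // 0-10, as requested by the judge prompt
  explanation: string; // the judge's reasoning
}

// Sends a fully assembled judge prompt (with {input}, {expected}, and
// {generated} already substituted) to a local Ollama server.
async function judgeWithOllama(prompt: string): Promise<Judgment> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",  // any model available locally via Ollama
      prompt,
      stream: false,    // return one JSON payload instead of a token stream
      format: "json",   // ask Ollama to constrain the output to valid JSON
    }),
  });
  const data = await res.json();
  // Ollama returns the generated text in the `response` field.
  return JSON.parse(data.response) as Judgment;
}
```

Constraining the output with `format: "json"` makes the score/explanation object easier to parse reliably.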
- Use smaller, quantized models for faster evaluation
- Adjust context length in Ollama for better performance (see the sketch after this list)
- Process batches of test cases for efficiency
- Consider GPU acceleration for larger models
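For example, the context window can be set per request through the `options` field of Ollama's generate API; a small sketch (the model name and the `num_ctx` value are illustrative):

```typescript
// Illustrative request body for Ollama's /api/generate endpoint.
// num_ctx sets the context window for this request; smaller values use less
// memory and can speed up evaluation on modest hardware.
const requestBody = JSON.stringify({
  model: "llama3",
  prompt: "Explain quantum computing in simple terms.",
  stream: false,
  options: {
    num_ctx: 2048, // context length in tokens
  },
});
```

The body is sent exactly like the judge request sketched above.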
This repository is maintained by eRaz00r.
MIT