Testing minimal sentence pairs for grammatical correctness and semantic plausibility
A powerful tool for evaluating language models on minimal pairs - sentence pairs that differ in grammaticality or semantic plausibility
The system supports both Ollama (local LLMs) and HuggingFace models:
| Ollama Models | HuggingFace Models |
|---|---|
| DeepSeek-R1, Qwen2.5 | BERT, RoBERTa |
| Llama3, Mistral | GPT-2, DistilBERT |
| Phi4 | ALBERT |
| `scripts/evaluate_ollama.py` | `scripts/evaluate_blimp_hf.py` |
Interactive Bar Charts • Line Charts • Gradient Styling • Tooltip Explanations
Prerequisites:
- Python 3.8+
- pip
- Ollama (optional, for local LLM evaluation)
- GPU (optional, for faster HuggingFace inference)
```bash
git clone https://github.com/Sudarshan50/Masked-Language-Model-Scoring.git
cd AIS710
pip install -r requirements.txt
```

| Package | Version | Purpose |
|---|---|---|
| transformers | ≥4.30.0 | HuggingFace models |
| torch | ≥1.12.0 | Neural networks |
| flask | ≥2.3.0 | Web framework |
| ollama | ≥0.3.0 | Ollama client |
| tqdm | latest | Progress bars |
- macOS: `brew install ollama`
- Linux: `curl -fsSL https://ollama.com/install.sh | sh`
- Windows: download from ollama.com
```bash
# Recommended models for testing
ollama pull qwen2.5:3b      # Fast & efficient
ollama pull deepseek-r1:7b  # Reasoning-focused
ollama pull llama3.1:8b     # Meta's latest
ollama pull mistral:7b      # High-quality
```

```mermaid
graph LR
    A[Prepare Data] --> B[Select Models]
    B --> C[Run Evaluation]
    C --> D[View Results]
    D --> E[Export Data]
    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e9
    style D fill:#fff3e0
    style E fill:#fce4ec
```
```
AIS710/
│
├── app.py                         # Flask web application (395 lines)
│   ├── Single evaluation endpoint
│   ├── Bulk evaluation with progress tracking
│   ├── Model discovery (Ollama + HuggingFace)
│   ├── CSV download endpoint
│   └── Auto-device detection (MPS/CUDA/CPU)
│
├── templates/
│   └── index.html                 # Web interface (1620 lines)
│       ├── Single evaluation tab
│       ├── Bulk evaluation tab
│       ├── Chart.js visualizations
│       ├── Tooltips with explanations
│       └── Responsive gradient design
│
├── scripts/
│   ├── evaluate_ollama.py         # Ollama evaluation engine (20KB)
│   │   ├── OllamaEvaluator class
│   │   ├── Token probability extraction
│   │   ├── Score normalization (0-10 scale)
│   │   └── CLI interface with argparse
│   │
│   └── evaluate_blimp_hf.py       # HuggingFace evaluation (7KB)
│       ├── BLIMPEvaluator integration
│       ├── MLM and CLM support
│       ├── Batch processing
│       └── Device auto-detection
│
├── src/
│   └── eval_plausibility/
│       ├── __init__.py
│       ├── blimp_evaluator.py     # Core evaluator (403 lines)
│       │   ├── CLM scoring (Causal LM)
│       │   ├── MLM scoring (Masked LM)
│       │   ├── Token alignment
│       │   └── Category-wise metrics
│       │
│       └── eval.py                # Scoring functions
│           ├── score_sentence_clm()
│           ├── score_sentence_mlm_pll_word_l2r()
│           └── Tokenization utilities
│
├── data/
│   ├── minimal_pairs.jsonl        # Test pairs (JSONL format)
│   ├── minimal_pairs.csv          # Test pairs (CSV format)
│   ├── extensive_test_pairs.jsonl # Extended test set
│   └── image.png                  # Documentation assets
│
├── requirements.txt               # Python dependencies
├── README.md                      # This file
└── WEB_INTERFACE_GUIDE.md         # Detailed web interface docs
```
```
┌──────────────────────────────────────────────────────────────┐
│                        User Interface                        │
│   ┌──────────────────┐         ┌──────────────────────┐      │
│   │   Web Browser    │         │    Command Line      │      │
│   │   (Port 5001)    │         │    (Terminal)        │      │
│   └────────┬─────────┘         └──────────┬───────────┘      │
└────────────┼──────────────────────────────┼──────────────────┘
             │                              │
             ▼                              ▼
┌────────────────────────┐   ┌───────────────────────────┐
│      Flask App         │   │   Evaluation Scripts      │
│      (app.py)          │   │   - evaluate_ollama.py    │
│   - REST API           │   │   - evaluate_blimp_hf.py  │
│   - Model Management   │   └────────────┬──────────────┘
│   - Progress Tracking  │                │
└────────────┬───────────┘                │
             │                            │
             └──────────────┬─────────────┘
                            │
                            ▼
              ┌───────────────────────────────┐
              │   Core Evaluation Library     │
              │   (src/eval_plausibility/)    │
              │   - BLIMPEvaluator            │
              │   - Token scoring             │
              │   - Probability computation   │
              └───────────────┬───────────────┘
                              │
                 ┌────────────┴────────────┐
                 │                         │
                 ▼                         ▼
       ┌─────────────────┐       ┌──────────────────────┐
       │  Ollama Models  │       │  HuggingFace Models  │
       │  (Local LLMs)   │       │  (Transformers)      │
       │  - Qwen         │       │  - BERT              │
       │  - DeepSeek     │       │  - GPT-2             │
       │  - Llama        │       │  - RoBERTa           │
       └─────────────────┘       └──────────────────────┘
```
- Navigate to the Single Evaluation tab
- Enter a grammatical sentence (e.g., "I gave John the button.")
- Enter an ungrammatical sentence (e.g., "I gave John the wall.")
- Select one or more models:
  - Ollama Models: qwen2.5:3b, deepseek-r1:7b, llama3.1:8b
  - HuggingFace Models: gpt2, bert-base-uncased, roberta-base
- Click Evaluate
- View the results table with:
  - Good Score (0-10): plausibility of the grammatical sentence
  - Bad Score (0-10): plausibility of the ungrammatical sentence
  - Verdict: ✅ (Correct) if Good Score > Bad Score
  - Time: evaluation duration
- Scroll to see the comparison bar chart
- Navigate to the Bulk Evaluation tab
- Prepare a CSV file with columns:
  - good_sentence: grammatical/plausible sentences
  - bad_sentence: ungrammatical/implausible sentences
- Click Choose File and upload the CSV
- Select models for evaluation
- Click Evaluate Bulk
- Monitor the progress bar showing:
  - Current pair being processed
  - Percentage complete
  - Current model
- View results:
  - Detailed Results Table: all pairs with scores and verdicts
  - Summary Statistics: total pairs, overall accuracy, average time
  - Performance Analytics: bar chart (accuracy) and line chart (performance trend)
- Click Download CSV to export results
Basic Usage:

```bash
python scripts/evaluate_ollama.py \
  --models qwen2.5:3b \
  --data data/minimal_pairs.jsonl
```

Multiple Models:

```bash
python scripts/evaluate_ollama.py \
  --models qwen2.5:3b deepseek-r1:7b llama3.1:8b \
  --data data/minimal_pairs.jsonl \
  --output results.csv
```

With JSON Output:

```bash
python scripts/evaluate_ollama.py \
  --models qwen2.5:3b \
  --data data/minimal_pairs.jsonl \
  --output results.json \
  --format json
```

Masked Language Model (MLM):

```bash
python scripts/evaluate_blimp_hf.py \
  --models bert-base-uncased:mlm roberta-base:mlm \
  --data data/minimal_pairs.jsonl \
  --output results.csv
```

Causal Language Model (CLM):

```bash
python scripts/evaluate_blimp_hf.py \
  --models gpt2:clm \
  --data data/minimal_pairs.jsonl \
  --output results.csv
```

Mixed Models:

```bash
python scripts/evaluate_blimp_hf.py \
  --models bert-base-uncased:mlm gpt2:clm distilbert-base-uncased:mlm \
  --data data/minimal_pairs.jsonl \
  --device cuda \
  --output results.csv
```

Good Score: measures the grammatical correctness and semantic plausibility of the grammatical sentence:
- 10: Perfect grammar and highly plausible
- 7-9: Good grammar with minor issues
- 4-6: Moderate grammaticality
- 0-3: Poor grammar or implausible
Bad Score: measures how the model scores the ungrammatical/implausible sentence:
- Lower bad scores indicate better model discrimination
- High bad scores suggest the model accepts implausible sentences
Verdict:
- ✅ Correct: Good Score > Bad Score (the model correctly identifies the good sentence)
- ❌ Incorrect: Bad Score ≥ Good Score (the model fails to discriminate)
Ollama Models:
- Generate the sentence with token logprobs
- Extract log probabilities for each token
- Convert to linear probabilities
- Compute average probability across tokens
- Normalize to 0-10 scale:
  ```
  score = (avg_probability × 20) - 10
  score = max(0, min(10, score))
  ```
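For illustration, the normalization above can be sketched in a few lines of Python, assuming the per-token log-probabilities have already been extracted from the model (the helper name `normalize_score` is hypothetical, not part of this repository):

```python
import math

def normalize_score(token_logprobs):
    """Map per-token log-probabilities onto the 0-10 scale described above."""
    probs = [math.exp(lp) for lp in token_logprobs]  # log-probs -> linear probabilities
    avg_probability = sum(probs) / len(probs)        # average across tokens
    score = (avg_probability * 20) - 10              # rescale
    return max(0, min(10, score))                    # clamp to the 0-10 range

# Example with three fairly likely tokens
print(normalize_score([-0.2, -0.5, -0.1]))  # ~5.5
```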
MLM (Masked Language Models):
- Mask each word sequentially
- Compute probability of correct token
- Aggregate using pseudo-log-likelihood (PLL)
- Normalize to 0-10 scale
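As a rough illustration of this MLM procedure (a sketch only, masking each token rather than each word, and not the repository's `score_sentence_mlm_pll_word_l2r()` implementation), a pseudo-log-likelihood pass with HuggingFace `transformers` could look like:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def pll(sentence, model_name="bert-base-uncased"):
    """Mask each token in turn and accumulate the log-probability of the original token."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(input_ids) - 1):           # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total / (len(input_ids) - 2)                  # average per masked position

print(pll("I gave John the button."))
```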
CLM (Causal Language Models):
- Compute forward probability (left-to-right)
- Calculate log-likelihood per token
- Average across sequence
- Normalize to 0-10 scale
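And a comparable sketch for the CLM path (again illustrative, not the repository's `score_sentence_clm()`), using the fact that a causal LM's loss is the mean negative log-likelihood per predicted token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def clm_avg_logprob(sentence, model_name="gpt2"):
    """Average per-token log-likelihood under a left-to-right (causal) LM."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # loss = mean NLL per predicted token
    return -out.loss.item()

good = clm_avg_logprob("I gave John the button.")
bad = clm_avg_logprob("I gave John the wall.")
print(good > bad)  # a correct verdict means the good sentence scores higher
```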
{"good": "I gave John the button.", "bad": "I gave John the wall."}
{"good": "She ate the apple.", "bad": "She ate the computer."}
{"good": "He put the key in his pocket.", "bad": "He put the house in his pocket."}good_sentence,bad_sentence
I gave John the button.,I gave John the wall.
She ate the apple.,She ate the computer.
He put the key in his pocket.,He put the house in his pocket.The data/minimal_pairs.jsonl includes diverse test pairs:
Semantic Anomalies:
- "I eat biscuit with tea" vs "I eat plate with tea"
- "I ordered a cycle" vs "I ordered a mountain"
- "She drinks water every day" vs "She drinks furniture every day"
Size Implausibility:
- "He has a calculator in his pocket" vs "He has a statue in his pocket"
- "She picked up a pen" vs "She picked up the sky"
Action-Object Mismatch:
- "She read the book" vs "She drank the book"
- "He painted the wall" vs "He painted the time"
Installation:

```bash
ollama pull qwen2.5:3b
ollama pull deepseek-r1:7b
ollama pull llama3.1:8b
```
| Use Case | Recommended Models |
|---|---|
| Speed Priority | qwen2.5:3b, distilbert-base |
| Accuracy Priority | llama3.1:8b, roberta-base |
| Balanced | qwen2.5:7b, bert-base-uncased |
| Reasoning | deepseek-r1:7b, gpt2-medium |
`GET /`

Response: HTML web interface

`GET /api/models`

Response:

```json
{
  "ollama": ["qwen2.5:3b", "deepseek-r1:7b"],
  "huggingface": ["gpt2", "bert-base-uncased", "roberta-base"]
}
```

`POST /api/evaluate` (Content-Type: application/json)

Request body:

```json
{
  "good_sentence": "I gave John the button.",
  "bad_sentence": "I gave John the wall.",
  "models": ["qwen2.5:3b", "gpt2"]
}
```

Response:

```json
{
  "results": [
    {
      "model": "qwen2.5:3b",
      "good_score": 8.5,
      "bad_score": 3.2,
      "correct": true,
      "time": 1.24
    },
    {
      "model": "gpt2",
      "good_score": 7.8,
      "bad_score": 4.1,
      "correct": true,
      "time": 0.85
    }
  ]
}
```

`POST /api/evaluate_bulk` (Content-Type: multipart/form-data)

```
file: <CSV file>
models: ["qwen2.5:3b", "gpt2"]
```

Response: streaming JSON with progress updates

`GET /api/progress`

Response:

```json
{
  "current": 5,
  "total": 10,
  "status": "running",
  "current_model": "qwen2.5:3b",
  "current_pair": 5
}
```

`POST /api/cancel`

Response:

```json
{"status": "cancelled"}
```

`GET /api/download_csv`

Response: CSV file download
Input:
- Good: "The cat sat on the mat."
- Bad: "The cat sat on the sky."
- Models: qwen2.5:3b, bert-base-uncased
Output:
| Model | Good Score | Bad Score | Verdict | Time |
|---|---|---|---|---|
| qwen2.5:3b | 9.2 | 2.8 | ✅ | 1.1s |
| bert-base-uncased | 8.7 | 3.5 | ✅ | 0.6s |
Input CSV (test.csv):

```csv
good_sentence,bad_sentence
I gave John the button.,I gave John the wall.
She ate the apple.,She ate the computer.
He drinks water.,He drinks furniture.
```

Command:

```bash
# Via web interface: Upload test.csv, select models, click Evaluate
# Via CLI:
python scripts/evaluate_ollama.py --models qwen2.5:3b --data test.csv
```

Output:
- Detailed results table with 3 rows
- Accuracy: 100% (3/3 correct)
- Average time: 1.2s per pair
- Charts showing model performance
Command:

```bash
python scripts/evaluate_ollama.py \
  --models qwen2.5:3b deepseek-r1:7b llama3.1:8b \
  --data data/extensive_test_pairs.jsonl \
  --output comparison.csv
```

Result: CSV file with side-by-side model scores for analysis
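For downstream analysis, the comparison CSV can be loaded with pandas. Note that the column names used here (`model`, `correct`) are assumptions based on the result fields shown elsewhere in this README and may need adjusting to the actual export:

```python
import pandas as pd

df = pd.read_csv("comparison.csv")
# Per-model accuracy: fraction of pairs where the good sentence out-scored the bad one.
# Column names are assumed; adjust them to match the actual CSV header.
print(df.groupby("model")["correct"].mean())
```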
Error: `Connection refused to localhost:11434`

Solution:

```bash
# Start Ollama service
ollama serve
```

Error: `Model 'qwen2.5:3b' not found`

Solution:

```bash
# Pull the model first
ollama pull qwen2.5:3b
```

Error: `CUDA out of memory`

Solution:

```bash
# Use CPU instead
python scripts/evaluate_blimp_hf.py --device cpu --models bert-base-uncased:mlm
```

Or use smaller models:

```bash
# Use DistilBERT instead of BERT
python scripts/evaluate_blimp_hf.py --models distilbert-base-uncased:mlm
```

Error: `ModuleNotFoundError: No module named 'transformers'`

Solution:

```bash
pip install -r requirements.txt
```

Error: `Address already in use: Port 5001`

Solution:

```bash
# Find and kill process using port 5001
lsof -ti:5001 | xargs kill -9
# Or change port in app.py
# app.run(debug=True, host='0.0.0.0', port=5002)
```

Issue: Models taking too long

Solution:
- Use smaller models (3B instead of 7B)
- Enable GPU acceleration (add CUDA support)
- Reduce batch size in evaluate_blimp_hf.py
- Use MPS on Apple Silicon:

```python
# Auto-detected in app.py
device = "mps"  # For M1/M2/M3 Macs
```
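The MPS/CUDA/CPU auto-detection mentioned above can be sketched as follows (illustrative only; the actual logic lives in app.py):

```python
import torch

def pick_device():
    """Prefer Apple Silicon MPS, then CUDA, then fall back to CPU."""
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

print(pick_device())
```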
Contributions are welcome! Here's how you can help:
- Model Support: Add support for new models (Claude, Gemini, etc.)
- Evaluation Metrics: Implement additional scoring methods
- Visualization: Enhance charts with more interactive features
- Performance: Optimize batch processing and caching
- Testing: Add more unit tests and integration tests
- Documentation: Improve examples and tutorials
```bash
# Clone repository
git clone https://github.com/Sudarshan50/Masked-Language-Model-Scoring.git
cd AIS710

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run tests
pytest tests/

# Start development server
python3 app.py
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- WEB_INTERFACE_GUIDE.md: Detailed web interface documentation
- Ollama Documentation: ollama.com/docs
- HuggingFace Transformers: huggingface.co/docs/transformers
- Flask Documentation: flask.palletsprojects.com
- Chart.js: chartjs.org
© 2025 • BLIMP Evaluation Interface

Sudarshan • Prof. Ashwini Vaidya • Course: AIS710
This project is developed for educational purposes as part of the AIS710 course.
⚠️ Note: For commercial use, please refer to individual model licenses:
- Ollama models: Check respective model repositories
- HuggingFace models: See HuggingFace Model Hub
If you find this project helpful, please consider giving it a ⭐ on GitHub!
Start evaluating language models today!
