πŸŽ“ AIS710: BLIMP Evaluation Interface


πŸš€ A Comprehensive Web-Based Evaluation System for Language Models

Testing minimal sentence pairs for grammatical correctness and semantic plausibility





πŸ‘©β€πŸ« Course Project

Developed as part of the AIS710 course under the guidance of Prof. Ashwini Vaidya.

🎯 Key Highlights

βœ… Multi-Model Comparison
βœ… Real-Time Evaluation
βœ… Interactive Visualizations
βœ… Bulk Processing Support
βœ… Export & Analysis Tools




πŸ” Overview

🎯 The BLIMP Evaluation Interface

A powerful tool for evaluating language models on minimal pairs: sentence pairs that differ in grammaticality or semantic plausibility.

🌟 What We Offer

The system supports both Ollama (local LLMs) and HuggingFace models, providing:

  • 🌐 Interactive Web Interface with real-time evaluation
  • πŸ’» Command-Line Tools for automation
  • πŸ“Š Detailed Analytics with charts
  • πŸ”„ Dual Evaluation Modes
  • 🎯 Multi-Model Comparison

🎨 Key Capabilities

  • βœ… Evaluate grammatical correctness (syntax)
  • βœ… Assess semantic plausibility (meaning)
  • βœ… Compare model performance across architectures
  • βœ… Visualize results with interactive charts
  • βœ… Export results for further analysis

πŸ“Š Supported Model Types

πŸ¦™ Ollama Models: DeepSeek-R1, Qwen2.5, Llama3, Mistral, Phi4
πŸ€— HuggingFace Models: BERT, RoBERTa, GPT-2, DistilBERT, ALBERT

✨ Features

πŸ–₯️ Dual Interface Design

🌐 Web Interface (app.py)


πŸ“± Single Evaluation Mode

βœ“ Test individual sentence pairs in real-time
βœ“ Select multiple models simultaneously
βœ“ Interactive tooltips for metrics
βœ“ Visual comparison charts (Chart.js)
βœ“ Instant results (0-10 scale)

πŸ“Š Bulk Evaluation Mode

βœ“ Upload CSV with multiple pairs
βœ“ Real-time progress tracking
βœ“ Summary statistics
βœ“ Performance analytics (bar & line charts)
βœ“ Export results as CSV
βœ“ Cancel evaluation mid-process

πŸ’» Command-Line Tools


πŸ¦™ Ollama Evaluation

scripts/evaluate_ollama.py
  • Local LLM evaluation
  • Token probability scoring
  • JSON/CSV output formats
  • Progress tracking (tqdm)

πŸ€— HuggingFace Evaluation

scripts/evaluate_blimp_hf.py
  • MLM & CLM support
  • Auto device detection (CPU/CUDA/MPS)
  • Efficient batch processing
  • Category-wise reporting

πŸ“Š Visualization & Analytics


πŸ“ˆ Interactive Bar Charts β€’ πŸ“‰ Line Charts β€’ 🎨 Gradient Styling β€’ πŸ’‘ Tooltip Explanations


πŸš€ Installation

⚑ Get Started in 4 Steps


πŸ“‹ Prerequisites

  β€’ Python 3.8+
  β€’ pip
  β€’ Ollama (optional, for local LLMs)
  β€’ GPU (optional)

πŸ“¦ Step-by-Step Installation


1️⃣ Clone the Repository

git clone https://github.com/Sudarshan50/Masked-Language-Model-Scoring.git
cd Masked-Language-Model-Scoring

2️⃣ Install Python Dependencies

pip install -r requirements.txt
πŸ“¦ Package πŸ”’ Version πŸ“ Purpose
transformers β‰₯4.30.0 HuggingFace models
torch β‰₯1.12.0 Neural networks
flask β‰₯2.3.0 Web framework
ollama β‰₯0.3.0 Ollama client
tqdm latest Progress bars

3️⃣ Install Ollama (Optional for Local LLMs)

🍎 macOS

brew install ollama

🐧 Linux

curl -fsSL https://ollama.com/install.sh | sh

πŸͺŸ Windows

Download from ollama.com


4️⃣ Pull Ollama Models (Optional)

# πŸš€ Recommended models for testing
ollama pull qwen2.5:3b      # Fast & efficient
ollama pull deepseek-r1:7b  # Reasoning-focused
ollama pull llama3.1:8b     # Meta's latest
ollama pull mistral:7b      # High-quality

🎯 Quick Start

πŸš€ Launch in 60 Seconds


🌐 Web Interface (Recommended)

  1. Start the Ollama service (needed for local Ollama models): ollama serve
  2. Launch the web app: python3 app.py
  3. Open 🌐 http://localhost:5001 in your browser
  4. Evaluate:
    β€’ Single Mode: enter a sentence pair
    β€’ Bulk Mode: upload a CSV file
    β€’ Select models & click "Evaluate"
    β€’ View results with charts!

πŸ’» Command Line Interface

Ollama
python scripts/evaluate_ollama.py \
  --models qwen2.5:3b deepseek-r1:7b \
  --data data/minimal_pairs.jsonl \
  --output results.csv
HuggingFace
python scripts/evaluate_blimp_hf.py \
  --models bert-base-uncased:mlm gpt2:clm \
  --data data/minimal_pairs.jsonl \
  --output results.csv

πŸ’‘ Tip: Use the web interface for interactive exploration and CLI for automation!


🎬 Demo Workflow

graph LR
    A[πŸ“ Prepare Data] --> B[πŸ”§ Select Models]
    B --> C[▢️ Run Evaluation]
    C --> D[πŸ“Š View Results]
    D --> E[πŸ’Ύ Export Data]
    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e9
    style D fill:#fff3e0
    style E fill:#fce4ec

πŸ—οΈ Project Architecture

AIS710/
β”‚
β”œβ”€β”€ app.py                              # Flask web application (395 lines)
β”‚   β”œβ”€β”€ Single evaluation endpoint
β”‚   β”œβ”€β”€ Bulk evaluation with progress tracking
β”‚   β”œβ”€β”€ Model discovery (Ollama + HuggingFace)
β”‚   β”œβ”€β”€ CSV download endpoint
β”‚   └── Auto-device detection (MPS/CUDA/CPU)
β”‚
β”œβ”€β”€ templates/
β”‚   └── index.html                     # Web interface (1620 lines)
β”‚       β”œβ”€β”€ Single evaluation tab
β”‚       β”œβ”€β”€ Bulk evaluation tab
β”‚       β”œβ”€β”€ Chart.js visualizations
β”‚       β”œβ”€β”€ Tooltips with explanations
β”‚       └── Responsive gradient design
β”‚
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ evaluate_ollama.py             # Ollama evaluation engine (20KB)
β”‚   β”‚   β”œβ”€β”€ OllamaEvaluator class
β”‚   β”‚   β”œβ”€β”€ Token probability extraction
β”‚   β”‚   β”œβ”€β”€ Score normalization (0-10 scale)
β”‚   β”‚   └── CLI interface with argparse
β”‚   β”‚
β”‚   └── evaluate_blimp_hf.py           # HuggingFace evaluation (7KB)
β”‚       β”œβ”€β”€ BLIMPEvaluator integration
β”‚       β”œβ”€β”€ MLM and CLM support
β”‚       β”œβ”€β”€ Batch processing
β”‚       └── Device auto-detection
β”‚
β”œβ”€β”€ src/
β”‚   └── eval_plausibility/
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ blimp_evaluator.py         # Core evaluator (403 lines)
β”‚       β”‚   β”œβ”€β”€ CLM scoring (Causal LM)
β”‚       β”‚   β”œβ”€β”€ MLM scoring (Masked LM)
β”‚       β”‚   β”œβ”€β”€ Token alignment
β”‚       β”‚   └── Category-wise metrics
β”‚       β”‚
β”‚       └── eval.py                    # Scoring functions
β”‚           β”œβ”€β”€ score_sentence_clm()
β”‚           β”œβ”€β”€ score_sentence_mlm_pll_word_l2r()
β”‚           └── Tokenization utilities
β”‚
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ minimal_pairs.jsonl            # Test pairs (JSONL format)
β”‚   β”œβ”€β”€ minimal_pairs.csv              # Test pairs (CSV format)
β”‚   β”œβ”€β”€ extensive_test_pairs.jsonl     # Extended test set
β”‚   └── image.png                      # Documentation assets
β”‚
β”œβ”€β”€ requirements.txt                   # Python dependencies
β”œβ”€β”€ README.md                          # This file
└── WEB_INTERFACE_GUIDE.md             # Detailed web interface docs

Architecture Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        User Interface                        β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  Web Browser   β”‚              β”‚  Command Line       β”‚   β”‚
β”‚  β”‚  (Port 5001)   β”‚              β”‚  (Terminal)         β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                                 β”‚
            β–Ό                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      Flask App        β”‚        β”‚  Evaluation Scripts      β”‚
β”‚      (app.py)         β”‚        β”‚  - evaluate_ollama.py    β”‚
β”‚  - REST API           β”‚        β”‚  - evaluate_blimp_hf.py  β”‚
β”‚  - Model Management   β”‚        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚  - Progress Tracking  β”‚                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
            β”‚                               β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚   Core Evaluation Library     β”‚
        β”‚   (src/eval_plausibility/)    β”‚
        β”‚   - BLIMPEvaluator            β”‚
        β”‚   - Token scoring             β”‚
        β”‚   - Probability computation   β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                       β”‚
        β–Ό                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Ollama Models β”‚      β”‚ HuggingFace Models β”‚
β”‚ (Local LLMs)  β”‚      β”‚ (Transformers)     β”‚
β”‚ - Qwen        β”‚      β”‚ - BERT             β”‚
β”‚ - DeepSeek    β”‚      β”‚ - GPT-2            β”‚
β”‚ - Llama       β”‚      β”‚ - RoBERTa          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’‘ Usage

Web Interface

Single Evaluation

  1. Navigate to Single Evaluation tab
  2. Enter the good (grammatical/plausible) sentence (e.g., "I gave John the button.")
  3. Enter the bad (ungrammatical/implausible) sentence (e.g., "I gave John the wall.")
  4. Select one or more models:
    • Ollama Models: qwen2.5:3b, deepseek-r1:7b, llama3.1:8b
    • HuggingFace Models: gpt2, bert-base-uncased, roberta-base
  5. Click Evaluate
  6. View results table with:
    • Good Score (0-10): Plausibility of grammatical sentence
    • Bad Score (0-10): Plausibility of ungrammatical sentence
    • Verdict: βœ“ (Correct) if Good Score > Bad Score
    • Time: Evaluation duration
  7. Scroll to see comparison bar chart

Bulk Evaluation

  1. Navigate to Bulk Evaluation tab
  2. Prepare CSV file with columns:
    • good_sentence: Grammatical/plausible sentences
    • bad_sentence: Ungrammatical/implausible sentences
  3. Click Choose File and upload CSV
  4. Select models for evaluation
  5. Click Evaluate Bulk
  6. Monitor progress bar showing:
    • Current pair being processed
    • Percentage complete
    • Current model
  7. View results:
    • Detailed Results Table: All pairs with scores and verdicts
    • Summary Statistics: Total pairs, overall accuracy, average time
    • Performance Analytics: Bar chart (accuracy) and line chart (performance trend)
  8. Click Download CSV to export results

Command-Line Tools

1. Ollama Evaluation

Basic Usage:

python scripts/evaluate_ollama.py \
  --models qwen2.5:3b \
  --data data/minimal_pairs.jsonl

Multiple Models:

python scripts/evaluate_ollama.py \
  --models qwen2.5:3b deepseek-r1:7b llama3.1:8b \
  --data data/minimal_pairs.jsonl \
  --output results.csv

With JSON Output:

python scripts/evaluate_ollama.py \
  --models qwen2.5:3b \
  --data data/minimal_pairs.jsonl \
  --output results.json \
  --format json

2. HuggingFace Evaluation

Masked Language Model (MLM):

python scripts/evaluate_blimp_hf.py \
  --models bert-base-uncased:mlm roberta-base:mlm \
  --data data/minimal_pairs.jsonl \
  --output results.csv

Causal Language Model (CLM):

python scripts/evaluate_blimp_hf.py \
  --models gpt2:clm \
  --data data/minimal_pairs.jsonl \
  --output results.csv

Mixed Models:

python scripts/evaluate_blimp_hf.py \
  --models bert-base-uncased:mlm gpt2:clm distilbert-base-uncased:mlm \
  --data data/minimal_pairs.jsonl \
  --device cuda \
  --output results.csv

πŸ“Š Evaluation Methodology

Scoring System

Good Score (0-10)

Measures the grammatical correctness and semantic plausibility of the grammatical sentence:

  • 10: Perfect grammar and highly plausible
  • 7-9: Good grammar with minor issues
  • 4-6: Moderate grammaticality
  • 0-3: Poor grammar or implausible

Bad Score (0-10)

Measures how the model scores the ungrammatical/implausible sentence:

  • Lower bad scores indicate better model discrimination
  • High bad scores suggest the model accepts implausible sentences

Verdict

  • βœ“ Correct: Good Score > Bad Score (model correctly identifies good sentence)
  • βœ— Incorrect: Bad Score >= Good Score (model fails to discriminate)

Calculation Methods

Ollama Models (Token Probability)

  1. Send the sentence to the model and request per-token log probabilities
  2. Extract the log probability of each token
  3. Convert log probabilities to linear probabilities
  4. Compute the average probability across tokens
  5. Normalize to a 0-10 scale (see the sketch below):
    score = (avg_probability Γ— 20) - 10
    score = max(0, min(10, score))
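
A minimal sketch of this normalization, assuming the per-token log probabilities have already been collected (the helper name is illustrative, not the exact code in scripts/evaluate_ollama.py):

import math

def normalize_score(token_logprobs):
    """Map per-token log probabilities onto the 0-10 scale described above."""
    probs = [math.exp(lp) for lp in token_logprobs]   # log prob -> linear prob
    avg_probability = sum(probs) / len(probs)         # average across tokens
    score = (avg_probability * 20) - 10               # rescale
    return max(0.0, min(10.0, score))                 # clamp to [0, 10]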
    

HuggingFace Models

MLM (Masked Language Models):

  • Mask each word sequentially
  • Compute probability of correct token
  • Aggregate using pseudo-log-likelihood (PLL)
  • Normalize to 0-10 scale

CLM (Causal Language Models):

  • Compute forward probability (left-to-right)
  • Calculate log-likelihood per token
  • Average across sequence
  • Normalize to 0-10 scale

πŸ“ Data Format

JSONL Format (Recommended)

{"good": "I gave John the button.", "bad": "I gave John the wall."}
{"good": "She ate the apple.", "bad": "She ate the computer."}
{"good": "He put the key in his pocket.", "bad": "He put the house in his pocket."}

CSV Format

good_sentence,bad_sentence
I gave John the button.,I gave John the wall.
She ate the apple.,She ate the computer.
He put the key in his pocket.,He put the house in his pocket.
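
Either file can be loaded in a few lines of Python; the helper below is illustrative (not part of the project's API) and assumes the key and column names shown above:

import csv
import json

def load_pairs(path):
    """Read minimal pairs from the JSONL or CSV layouts shown above."""
    if path.endswith(".jsonl"):
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]   # {"good": ..., "bad": ...}
    with open(path, newline="", encoding="utf-8") as f:
        return [{"good": row["good_sentence"], "bad": row["bad_sentence"]}
                for row in csv.DictReader(f)]

pairs = load_pairs("data/minimal_pairs.jsonl")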

Sample Test Cases

The data/minimal_pairs.jsonl file includes diverse test pairs:

Semantic Anomalies:

  • "I eat biscuit with tea" vs "I eat plate with tea"
  • "I ordered a cycle" vs "I ordered a mountain"
  • "She drinks water every day" vs "She drinks furniture every day"

Size Implausibility:

  • "He has a calculator in his pocket" vs "He has a statue in his pocket"
  • "She picked up a pen" vs "She picked up the sky"

Action-Object Mismatch:

  • "She read the book" vs "She drank the book"
  • "He painted the wall" vs "He painted the time"

πŸ€– Supported Models

🦾 Powerful Language Models at Your Fingertips


πŸ¦™ Ollama Models (Local LLMs)


🏷️ Model πŸ“¦ Size ⚑ Speed πŸ“ Description
qwen2.5:3b 3B πŸš€πŸš€πŸš€ Fast, efficient bilingual (Chinese-English) model
qwen2.5:7b 7B πŸš€πŸš€ Balanced performance & speed
deepseek-r1:7b 7B πŸš€πŸš€ Reasoning-focused model
llama3.1:8b 8B πŸš€πŸš€ Meta's latest Llama
mistral:7b 7B πŸš€πŸš€ High-quality open model
phi4:latest 14B πŸš€ Microsoft's efficient model

πŸ“₯ Installation:

ollama pull qwen2.5:3b
ollama pull deepseek-r1:7b
ollama pull llama3.1:8b

πŸ€— HuggingFace Models


🎭 Masked Language Models (MLM)

🏷️ Model πŸ“Š Params 🎯 Use Case
bert-base-uncased 110M Original BERT base
roberta-base 125M Optimized BERT variant
distilbert-base 66M Distilled (faster)
albert-base-v2 12M Lightweight BERT

🎯 Causal Language Models (CLM)

🏷️ Model πŸ“Š Params 🎯 Use Case
gpt2 124M OpenAI GPT-2 base
gpt2-medium 355M Larger GPT-2
gpt2-large 774M Even larger GPT-2

πŸ”„ Auto-download: Models automatically download on first use


🎨 Model Selection Guide

🎯 Use Case πŸ’‘ Recommended Models
πŸš€ Speed Priority qwen2.5:3b, distilbert-base
🎯 Accuracy Priority llama3.1:8b, roberta-base
βš–οΈ Balanced qwen2.5:7b, bert-base-uncased
🧠 Reasoning deepseek-r1:7b, gpt2-medium

πŸ”Œ API Documentation

REST Endpoints

1. Home Page

GET /

Response: HTML web interface

2. Get Available Models

GET /api/models

Response:

{
  "ollama": ["qwen2.5:3b", "deepseek-r1:7b"],
  "huggingface": ["gpt2", "bert-base-uncased", "roberta-base"]
}

3. Single Evaluation

POST /api/evaluate
Content-Type: application/json

{
  "good_sentence": "I gave John the button.",
  "bad_sentence": "I gave John the wall.",
  "models": ["qwen2.5:3b", "gpt2"]
}

Response:

{
  "results": [
    {
      "model": "qwen2.5:3b",
      "good_score": 8.5,
      "bad_score": 3.2,
      "correct": true,
      "time": 1.24
    },
    {
      "model": "gpt2",
      "good_score": 7.8,
      "bad_score": 4.1,
      "correct": true,
      "time": 0.85
    }
  ]
}
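
A small requests-based client for these endpoints, assuming the server is running locally on port 5001 and the response shapes match the examples above:

import requests

BASE = "http://localhost:5001"

# Discover which models the server can use (GET /api/models)
models = requests.get(f"{BASE}/api/models").json()
print("Ollama:", models["ollama"], "| HuggingFace:", models["huggingface"])

# Score one minimal pair (POST /api/evaluate)
payload = {
    "good_sentence": "I gave John the button.",
    "bad_sentence": "I gave John the wall.",
    "models": ["qwen2.5:3b", "gpt2"],
}
for r in requests.post(f"{BASE}/api/evaluate", json=payload).json()["results"]:
    print(f"{r['model']}: good={r['good_score']} bad={r['bad_score']} correct={r['correct']}")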

4. Bulk Evaluation

POST /api/evaluate_bulk
Content-Type: multipart/form-data

file: <CSV file>
models: ["qwen2.5:3b", "gpt2"]

Response: Streaming JSON with progress updates

5. Get Progress

GET /api/progress

Response:

{
  "current": 5,
  "total": 10,
  "status": "running",
  "current_model": "qwen2.5:3b",
  "current_pair": 5
}
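
A bulk run started from the web interface (or via /api/evaluate_bulk) can be monitored from a script by polling this endpoint. A sketch, assuming the status field stays "running" while a job is active, as in the example response above:

import time
import requests

BASE = "http://localhost:5001"

while True:
    progress = requests.get(f"{BASE}/api/progress").json()
    print(f"pair {progress['current']}/{progress['total']} "
          f"model={progress['current_model']} status={progress['status']}")
    if progress["status"] != "running":
        break
    time.sleep(1)   # poll once per second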

6. Cancel Evaluation

POST /api/cancel

Response:

{"status": "cancelled"}

7. Download Results

GET /api/download_csv

Response: CSV file download


πŸ“ Examples

Example 1: Single Pair Evaluation

Input:

  • Good: "The cat sat on the mat."
  • Bad: "The cat sat on the sky."
  • Models: qwen2.5:3b, bert-base-uncased

Output:

Model Good Score Bad Score Verdict Time
qwen2.5:3b 9.2 2.8 βœ“ 1.1s
bert-base-uncased 8.7 3.5 βœ“ 0.6s

Example 2: Bulk Evaluation

Input CSV (test.csv):

good_sentence,bad_sentence
I gave John the button.,I gave John the wall.
She ate the apple.,She ate the computer.
He drinks water.,He drinks furniture.

Command:

# Via web interface: Upload test.csv, select models, click Evaluate
# Via CLI:
python scripts/evaluate_ollama.py --models qwen2.5:3b --data test.csv

Output:

  • Detailed results table with 3 rows
  • Accuracy: 100% (3/3 correct)
  • Average time: 1.2s per pair
  • Charts showing model performance

Example 3: Multi-Model Comparison

Command:

python scripts/evaluate_ollama.py \
  --models qwen2.5:3b deepseek-r1:7b llama3.1:8b \
  --data data/extensive_test_pairs.jsonl \
  --output comparison.csv

Result: CSV file with side-by-side model scores for analysis


πŸ› οΈ Troubleshooting

Common Issues

1. Ollama Connection Error

Error: Connection refused to localhost:11434

Solution:

# Start Ollama service
ollama serve

2. Model Not Found

Error: Model 'qwen2.5:3b' not found

Solution:

# Pull the model first
ollama pull qwen2.5:3b

3. CUDA Out of Memory

Error: CUDA out of memory

Solution:

# Use CPU instead
python scripts/evaluate_blimp_hf.py --device cpu --models bert-base-uncased:mlm

Or use smaller models:

# Use DistilBERT instead of BERT
python scripts/evaluate_blimp_hf.py --models distilbert-base-uncased:mlm

4. Import Error

Error: ModuleNotFoundError: No module named 'transformers'

Solution:

pip install -r requirements.txt

5. Port Already in Use

Error: Address already in use: Port 5001

Solution:

# Find and kill process using port 5001
lsof -ti:5001 | xargs kill -9

# Or change port in app.py
# app.run(debug=True, host='0.0.0.0', port=5002)

6. Slow Evaluation

Issue: Models taking too long

Solution:

  • Use smaller models (3B instead of 7B)
  • Enable GPU acceleration (add CUDA support)
  • Reduce batch size in evaluate_blimp_hf.py
  • Use MPS on Apple Silicon:
    # Auto-detected in app.py
    device = "mps"  # For M1/M2/M3 Macs
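
For reference, a minimal version of the kind of device auto-detection app.py is described as performing (the preference order here is an assumption):

import torch

# Prefer Apple's MPS backend, then CUDA, then fall back to CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"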

🀝 Contributing

Contributions are welcome! Here's how you can help:

Areas for Improvement

  1. Model Support: Add support for new models (Claude, Gemini, etc.)
  2. Evaluation Metrics: Implement additional scoring methods
  3. Visualization: Enhance charts with more interactive features
  4. Performance: Optimize batch processing and caching
  5. Testing: Add more unit tests and integration tests
  6. Documentation: Improve examples and tutorials

Development Setup

# Clone repository
git clone https://github.com/Sudarshan50/Masked-Language-Model-Scoring.git
cd Masked-Language-Model-Scoring

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run tests
pytest tests/

# Start development server
python3 app.py

Submitting Changes

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request



πŸ“„ License


πŸ“œ Copyright & Licensing




Β© 2025 β€’ BLIMP Evaluation Interface


πŸ‘¨β€πŸ’» Developer

Sudarshan


πŸ‘©β€πŸ« Academic Supervisor

Prof. Ashwini Vaidya

Course: AIS710



βš–οΈ Usage Terms

This project is developed for educational purposes as part of the AIS710 course.

⚠️ Note: For commercial use, please refer to the individual model licenses.



πŸ“ž Contact & Support

πŸ› Report Issues

Issues

πŸ’‘ Feature Requests

Features

πŸ“– Documentation

Docs



🌟 Show Your Support

If you find this project helpful, please consider giving it a ⭐ on GitHub!






πŸŽ‰ Happy Evaluating!




πŸš€ Start evaluating language models today!
