Testing minimal sentence pairs for grammatical correctness and semantic plausibility
A powerful tool for evaluating language models on minimal pairs - sentence pairs that differ in grammaticality or semantic plausibility
The system supports both Ollama (local LLMs) and HuggingFace models:
| Ollama Models | HuggingFace Models |
|---|---|
| DeepSeek-R1, Qwen2.5 | BERT, RoBERTa |
| Llama3, Mistral | GPT-2, DistilBERT |
| Phi4 | ALBERT |
| `scripts/evaluate_ollama.py` | `scripts/evaluate_blimp_hf.py` |
Interactive Bar Charts • Line Charts • Gradient Styling • Tooltip Explanations
Prerequisites:
- Python 3.8+
- pip
- Ollama (optional, for local LLM evaluation)
- GPU (optional, for faster HuggingFace inference)
```bash
git clone https://github.com/Sudarshan50/Masked-Language-Model-Scoring.git
cd AIS710
pip install -r requirements.txt
```

| Package | Version | Purpose |
|---|---|---|
| transformers | ≥4.30.0 | HuggingFace models |
| torch | ≥1.12.0 | Neural networks |
| flask | ≥2.3.0 | Web framework |
| ollama | ≥0.3.0 | Ollama client |
| tqdm | latest | Progress bars |
- macOS: `brew install ollama`
- Linux: `curl -fsSL https://ollama.com/install.sh | sh`
- Windows: download from ollama.com
```bash
# Recommended models for testing
ollama pull qwen2.5:3b      # Fast & efficient
ollama pull deepseek-r1:7b  # Reasoning-focused
ollama pull llama3.1:8b     # Meta's latest
ollama pull mistral:7b      # High-quality
```

```mermaid
graph LR
    A[Prepare Data] --> B[Select Models]
    B --> C[Run Evaluation]
    C --> D[View Results]
    D --> E[Export Data]
    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e9
    style D fill:#fff3e0
    style E fill:#fce4ec
```
```
AIS710/
│
├── app.py                         # Flask web application (395 lines)
│   ├── Single evaluation endpoint
│   ├── Bulk evaluation with progress tracking
│   ├── Model discovery (Ollama + HuggingFace)
│   ├── CSV download endpoint
│   └── Auto-device detection (MPS/CUDA/CPU)
│
├── templates/
│   └── index.html                 # Web interface (1620 lines)
│       ├── Single evaluation tab
│       ├── Bulk evaluation tab
│       ├── Chart.js visualizations
│       ├── Tooltips with explanations
│       └── Responsive gradient design
│
├── scripts/
│   ├── evaluate_ollama.py         # Ollama evaluation engine (20KB)
│   │   ├── OllamaEvaluator class
│   │   ├── Token probability extraction
│   │   ├── Score normalization (0-10 scale)
│   │   └── CLI interface with argparse
│   │
│   └── evaluate_blimp_hf.py       # HuggingFace evaluation (7KB)
│       ├── BLIMPEvaluator integration
│       ├── MLM and CLM support
│       ├── Batch processing
│       └── Device auto-detection
│
├── src/
│   └── eval_plausibility/
│       ├── __init__.py
│       ├── blimp_evaluator.py     # Core evaluator (403 lines)
│       │   ├── CLM scoring (Causal LM)
│       │   ├── MLM scoring (Masked LM)
│       │   ├── Token alignment
│       │   └── Category-wise metrics
│       │
│       └── eval.py                # Scoring functions
│           ├── score_sentence_clm()
│           ├── score_sentence_mlm_pll_word_l2r()
│           └── Tokenization utilities
│
├── data/
│   ├── minimal_pairs.jsonl        # Test pairs (JSONL format)
│   ├── minimal_pairs.csv          # Test pairs (CSV format)
│   ├── extensive_test_pairs.jsonl # Extended test set
│   └── image.png                  # Documentation assets
│
├── requirements.txt               # Python dependencies
├── README.md                      # This file
└── WEB_INTERFACE_GUIDE.md         # Detailed web interface docs
```
```
┌──────────────────────────────────────────────────────────────┐
│                        User Interface                        │
│   ┌──────────────────┐         ┌──────────────────────┐      │
│   │   Web Browser    │         │    Command Line      │      │
│   │   (Port 5001)    │         │    (Terminal)        │      │
│   └────────┬─────────┘         └──────────┬───────────┘      │
└────────────┼──────────────────────────────┼──────────────────┘
             │                              │
             ▼                              ▼
┌────────────────────────┐   ┌───────────────────────────┐
│      Flask App         │   │   Evaluation Scripts      │
│      (app.py)          │   │   - evaluate_ollama.py    │
│   - REST API           │   │   - evaluate_blimp_hf.py  │
│   - Model Management   │   └────────────┬──────────────┘
│   - Progress Tracking  │                │
└────────────┬───────────┘                │
             │                            │
             └──────────────┬─────────────┘
                            │
                            ▼
              ┌───────────────────────────────┐
              │   Core Evaluation Library     │
              │   (src/eval_plausibility/)    │
              │   - BLIMPEvaluator            │
              │   - Token scoring             │
              │   - Probability computation   │
              └───────────────┬───────────────┘
                              │
                 ┌────────────┴────────────┐
                 │                         │
                 ▼                         ▼
       ┌─────────────────┐       ┌──────────────────────┐
       │  Ollama Models  │       │  HuggingFace Models  │
       │  (Local LLMs)   │       │  (Transformers)      │
       │  - Qwen         │       │  - BERT              │
       │  - DeepSeek     │       │  - GPT-2             │
       │  - Llama        │       │  - RoBERTa           │
       └─────────────────┘       └──────────────────────┘
```
- Navigate to the Single Evaluation tab
- Enter a grammatical sentence (e.g., "I gave John the button.")
- Enter an ungrammatical sentence (e.g., "I gave John the wall.")
- Select one or more models:
  - Ollama Models: qwen2.5:3b, deepseek-r1:7b, llama3.1:8b
  - HuggingFace Models: gpt2, bert-base-uncased, roberta-base
- Click Evaluate
- View the results table with:
  - Good Score (0-10): plausibility of the grammatical sentence
  - Bad Score (0-10): plausibility of the ungrammatical sentence
  - Verdict: ✅ (Correct) if Good Score > Bad Score
  - Time: evaluation duration
- Scroll to see the comparison bar chart
- Navigate to the Bulk Evaluation tab
- Prepare a CSV file with columns:
  - good_sentence: grammatical/plausible sentences
  - bad_sentence: ungrammatical/implausible sentences
- Click Choose File and upload the CSV
- Select models for evaluation
- Click Evaluate Bulk
- Monitor the progress bar showing:
  - Current pair being processed
  - Percentage complete
  - Current model
- View results:
  - Detailed Results Table: all pairs with scores and verdicts
  - Summary Statistics: total pairs, overall accuracy, average time
  - Performance Analytics: bar chart (accuracy) and line chart (performance trend)
- Click Download CSV to export results
Basic Usage:

```bash
python scripts/evaluate_ollama.py \
  --models qwen2.5:3b \
  --data data/minimal_pairs.jsonl
```

Multiple Models:

```bash
python scripts/evaluate_ollama.py \
  --models qwen2.5:3b deepseek-r1:7b llama3.1:8b \
  --data data/minimal_pairs.jsonl \
  --output results.csv
```

With JSON Output:

```bash
python scripts/evaluate_ollama.py \
  --models qwen2.5:3b \
  --data data/minimal_pairs.jsonl \
  --output results.json \
  --format json
```

Masked Language Model (MLM):

```bash
python scripts/evaluate_blimp_hf.py \
  --models bert-base-uncased:mlm roberta-base:mlm \
  --data data/minimal_pairs.jsonl \
  --output results.csv
```

Causal Language Model (CLM):

```bash
python scripts/evaluate_blimp_hf.py \
  --models gpt2:clm \
  --data data/minimal_pairs.jsonl \
  --output results.csv
```

Mixed Models:

```bash
python scripts/evaluate_blimp_hf.py \
  --models bert-base-uncased:mlm gpt2:clm distilbert-base-uncased:mlm \
  --data data/minimal_pairs.jsonl \
  --device cuda \
  --output results.csv
```

Good Score: measures the grammatical correctness and semantic plausibility of the grammatical sentence:
- 10: Perfect grammar and highly plausible
- 7-9: Good grammar with minor issues
- 4-6: Moderate grammaticality
- 0-3: Poor grammar or implausible
Bad Score: measures how the model scores the ungrammatical/implausible sentence:
- Lower bad scores indicate better model discrimination
- High bad scores suggest the model accepts implausible sentences
Verdict:
- ✅ Correct: Good Score > Bad Score (the model correctly identifies the good sentence)
- ❌ Incorrect: Bad Score ≥ Good Score (the model fails to discriminate)
Ollama Models:
- Generate the sentence with token logprobs
- Extract log probabilities for each token
- Convert to linear probabilities
- Compute average probability across tokens
- Normalize to 0-10 scale:
  ```
  score = (avg_probability × 20) - 10
  score = max(0, min(10, score))
  ```
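For illustration, the normalization above can be sketched in a few lines of Python, assuming the per-token log-probabilities have already been extracted from the model (the helper name `normalize_score` is hypothetical, not part of this repository):

```python
import math

def normalize_score(token_logprobs):
    """Map per-token log-probabilities onto the 0-10 scale described above."""
    probs = [math.exp(lp) for lp in token_logprobs]  # log-probs -> linear probabilities
    avg_probability = sum(probs) / len(probs)        # average across tokens
    score = (avg_probability * 20) - 10              # rescale
    return max(0, min(10, score))                    # clamp to the 0-10 range

# Example with three fairly likely tokens
print(normalize_score([-0.2, -0.5, -0.1]))  # ~5.5
```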
MLM (Masked Language Models):
- Mask each word sequentially
- Compute probability of correct token
- Aggregate using pseudo-log-likelihood (PLL)
- Normalize to 0-10 scale
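As a rough illustration of this MLM procedure (a sketch only, masking each token rather than each word, and not the repository's `score_sentence_mlm_pll_word_l2r()` implementation), a pseudo-log-likelihood pass with HuggingFace `transformers` could look like:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def pll(sentence, model_name="bert-base-uncased"):
    """Mask each token in turn and accumulate the log-probability of the original token."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(input_ids) - 1):           # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total / (len(input_ids) - 2)                  # average per masked position

print(pll("I gave John the button."))
```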
CLM (Causal Language Models):
- Compute forward probability (left-to-right)
- Calculate log-likelihood per token
- Average across sequence
- Normalize to 0-10 scale
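And a comparable sketch for the CLM path (again illustrative, not the repository's `score_sentence_clm()`), using the fact that a causal LM's loss is the mean negative log-likelihood per predicted token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def clm_avg_logprob(sentence, model_name="gpt2"):
    """Average per-token log-likelihood under a left-to-right (causal) LM."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # loss = mean NLL per predicted token
    return -out.loss.item()

good = clm_avg_logprob("I gave John the button.")
bad = clm_avg_logprob("I gave John the wall.")
print(good > bad)  # a correct verdict means the good sentence scores higher
```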
{"good": "I gave John the button.", "bad": "I gave John the wall."}
{"good": "She ate the apple.", "bad": "She ate the computer."}
{"good": "He put the key in his pocket.", "bad": "He put the house in his pocket."}good_sentence,bad_sentence
I gave John the button.,I gave John the wall.
She ate the apple.,She ate the computer.
He put the key in his pocket.,He put the house in his pocket.The data/minimal_pairs.jsonl includes diverse test pairs:
Semantic Anomalies:
- "I eat biscuit with tea" vs "I eat plate with tea"
- "I ordered a cycle" vs "I ordered a mountain"
- "She drinks water every day" vs "She drinks furniture every day"
Size Implausibility:
- "He has a calculator in his pocket" vs "He has a statue in his pocket"
- "She picked up a pen" vs "She picked up the sky"
Action-Object Mismatch:
- "She read the book" vs "She drank the book"
- "He painted the wall" vs "He painted the time"
Installation:

```bash
ollama pull qwen2.5:3b
ollama pull deepseek-r1:7b
ollama pull llama3.1:8b
```
| Use Case | Recommended Models |
|---|---|
| Speed Priority | qwen2.5:3b, distilbert-base |
| Accuracy Priority | llama3.1:8b, roberta-base |
| Balanced | qwen2.5:7b, bert-base-uncased |
| Reasoning | deepseek-r1:7b, gpt2-medium |
`GET /`

Response: HTML web interface

`GET /api/models`

Response:

```json
{
  "ollama": ["qwen2.5:3b", "deepseek-r1:7b"],
  "huggingface": ["gpt2", "bert-base-uncased", "roberta-base"]
}
```

`POST /api/evaluate` (Content-Type: application/json)

Request body:

```json
{
  "good_sentence": "I gave John the button.",
  "bad_sentence": "I gave John the wall.",
  "models": ["qwen2.5:3b", "gpt2"]
}
```

Response:

```json
{
  "results": [
    {
      "model": "qwen2.5:3b",
      "good_score": 8.5,
      "bad_score": 3.2,
      "correct": true,
      "time": 1.24
    },
    {
      "model": "gpt2",
      "good_score": 7.8,
      "bad_score": 4.1,
      "correct": true,
      "time": 0.85
    }
  ]
}
```

`POST /api/evaluate_bulk` (Content-Type: multipart/form-data)

```
file: <CSV file>
models: ["qwen2.5:3b", "gpt2"]
```

Response: streaming JSON with progress updates

`GET /api/progress`

Response:

```json
{
  "current": 5,
  "total": 10,
  "status": "running",
  "current_model": "qwen2.5:3b",
  "current_pair": 5
}
```

`POST /api/cancel`

Response:

```json
{"status": "cancelled"}
```

`GET /api/download_csv`

Response: CSV file download
Input:
- Good: "The cat sat on the mat."
- Bad: "The cat sat on the sky."
- Models: qwen2.5:3b, bert-base-uncased
Output:
| Model | Good Score | Bad Score | Verdict | Time |
|---|---|---|---|---|
| qwen2.5:3b | 9.2 | 2.8 | ✅ | 1.1s |
| bert-base-uncased | 8.7 | 3.5 | ✅ | 0.6s |
Input CSV (test.csv):

```csv
good_sentence,bad_sentence
I gave John the button.,I gave John the wall.
She ate the apple.,She ate the computer.
He drinks water.,He drinks furniture.
```

Command:

```bash
# Via web interface: Upload test.csv, select models, click Evaluate
# Via CLI:
python scripts/evaluate_ollama.py --models qwen2.5:3b --data test.csv
```

Output:
- Detailed results table with 3 rows
- Accuracy: 100% (3/3 correct)
- Average time: 1.2s per pair
- Charts showing model performance
Command:

```bash
python scripts/evaluate_ollama.py \
  --models qwen2.5:3b deepseek-r1:7b llama3.1:8b \
  --data data/extensive_test_pairs.jsonl \
  --output comparison.csv
```

Result: CSV file with side-by-side model scores for analysis
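For downstream analysis, the comparison CSV can be loaded with pandas. Note that the column names used here (`model`, `correct`) are assumptions based on the result fields shown elsewhere in this README and may need adjusting to the actual export:

```python
import pandas as pd

df = pd.read_csv("comparison.csv")
# Per-model accuracy: fraction of pairs where the good sentence out-scored the bad one.
# Column names are assumed; adjust them to match the actual CSV header.
print(df.groupby("model")["correct"].mean())
```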
Error: `Connection refused to localhost:11434`

Solution:

```bash
# Start Ollama service
ollama serve
```

Error: `Model 'qwen2.5:3b' not found`

Solution:

```bash
# Pull the model first
ollama pull qwen2.5:3b
```

Error: `CUDA out of memory`

Solution:

```bash
# Use CPU instead
python scripts/evaluate_blimp_hf.py --device cpu --models bert-base-uncased:mlm
```

Or use smaller models:

```bash
# Use DistilBERT instead of BERT
python scripts/evaluate_blimp_hf.py --models distilbert-base-uncased:mlm
```

Error: `ModuleNotFoundError: No module named 'transformers'`

Solution:

```bash
pip install -r requirements.txt
```

Error: `Address already in use: Port 5001`

Solution:

```bash
# Find and kill process using port 5001
lsof -ti:5001 | xargs kill -9
# Or change port in app.py
# app.run(debug=True, host='0.0.0.0', port=5002)
```

Issue: Models taking too long

Solution:
- Use smaller models (3B instead of 7B)
- Enable GPU acceleration (add CUDA support)
- Reduce batch size in evaluate_blimp_hf.py
- Use MPS on Apple Silicon:

```python
# Auto-detected in app.py
device = "mps"  # For M1/M2/M3 Macs
```
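The MPS/CUDA/CPU auto-detection mentioned above can be sketched as follows (illustrative only; the actual logic lives in app.py):

```python
import torch

def pick_device():
    """Prefer Apple Silicon MPS, then CUDA, then fall back to CPU."""
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

print(pick_device())
```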
Contributions are welcome! Here's how you can help:
- Model Support: Add support for new models (Claude, Gemini, etc.)
- Evaluation Metrics: Implement additional scoring methods
- Visualization: Enhance charts with more interactive features
- Performance: Optimize batch processing and caching
- Testing: Add more unit tests and integration tests
- Documentation: Improve examples and tutorials
```bash
# Clone repository
git clone https://github.com/Sudarshan50/Masked-Language-Model-Scoring.git
cd AIS710

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run tests
pytest tests/

# Start development server
python3 app.py
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- WEB_INTERFACE_GUIDE.md: Detailed web interface documentation
- Ollama Documentation: ollama.com/docs
- HuggingFace Transformers: huggingface.co/docs/transformers
- Flask Documentation: flask.palletsprojects.com
- Chart.js: chartjs.org
© 2025 • BLIMP Evaluation Interface

Sudarshan • Prof. Ashwini Vaidya • Course: AIS710
This project is developed for educational purposes as part of the AIS710 course.
⚠️ Note: For commercial use, please refer to individual model licenses:
- Ollama models: Check respective model repositories
- HuggingFace models: See HuggingFace Model Hub
If you find this project helpful, please consider giving it a ⭐ on GitHub!
Start evaluating language models today!
