PyCEFRizer - CEFR-J Level Estimator

PyCEFRizer is a Python implementation of the CVLA3 method (Uchida & Negishi, 2025) for estimating the CEFR-J level of English reading passages.

Overview

PyCEFRizer analyzes English text (10 to 10,000 words) and estimates its difficulty level according to the CEFR-J framework. It uses eight linguistic metrics:

CVV1: Corrected Verb Variation (verb diversity)
BperA: B-level to A-level content word ratio
POStypes: Average distinct POS tags per sentence
ARI: Automated Readability Index
AvrDiff: Average difficulty of content words
AvrFreqRank: Average frequency rank of words
VperSent: Average verbs per sentence
LenNP: Average noun phrase length
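
Two of these metrics have simple closed forms, which the sketch below illustrates. ARI follows the standard Automated Readability Index formula, and CVV1 is written in its usual form (verb types divided by the square root of twice the verb tokens). The regex tokenizer and the function names `ari` and `cvv1` are assumptions made for illustration; PyCEFRizer lists spaCy and textstat as requirements, so its own tokenization and values may differ.

```python
import math
import re


def ari(text: str) -> float:
    """Automated Readability Index computed with a naive regex tokenizer."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Treat each run of '.', '!' or '?' as a single sentence boundary.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Count letters and digits only; apostrophes are not characters for ARI.
    chars = sum(len(w.replace("'", "")) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


def cvv1(verbs: list) -> float:
    """Corrected verb variation: verb types / sqrt(2 * verb tokens)."""
    if not verbs:
        return 0.0
    return len(set(verbs)) / math.sqrt(2 * len(verbs))


# The verb list is supplied by hand here; a POS tagger would normally produce it.
print(round(ari("The cat sat. The dog ran."), 2))
print(round(cvv1(["run", "run", "eat", "sleep"]), 3))
```

Very short sentences made of short words can yield a negative ARI, which is expected from the formula's constant term.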

Installation

Installation with uv (Recommended)

uv is a fast Python package and project manager.

Install uv (if not already installed):

curl -LsSf https://astral.sh/uv/install.sh | sh

Clone and setup:

git clone https://github.com/straygizmo/PyCEFRizer.git
cd PyCEFRizer
uv sync

Install spaCy model:

uv run python -m spacy download en_core_web_sm

Install from GitHub

pip install git+https://github.com/straygizmo/PyCEFRizer.git

Or install with development dependencies:

pip install "pycefrizer[dev] @ git+https://github.com/straygizmo/PyCEFRizer.git"

After installation, download the required spaCy model:

python -m spacy download en_core_web_sm

Install from Source

Clone the repository:

git clone https://github.com/straygizmo/PyCEFRizer.git
cd PyCEFRizer

Install in development mode:

pip install -e .

Download spaCy model:

python -m spacy download en_core_web_sm

Usage

Basic Usage

from pycefrizer import PyCEFRizer

# Create analyzer
analyzer = PyCEFRizer()

# Analyze text
text = "Your English text here..."
result = analyzer.analyze(text)

print(result)
# Output:
# {
#    "CEFR-J_Level": "B2.2",
#    "CVV1_CEFR": "4.23",
#    "BperA_CEFR": "3.87",
#    "POStypes_CEFR": "4.12",
#    "ARI_CEFR": "4.56",
#    "AvrDiff_CEFR": "4.01",
#    "AvrFreqRank_CEFR": "3.92",
#    "VperSent_CEFR": "4.34",
#    "LenNP_CEFR": "4.15"
# }
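
Because the scores in the example output are strings, a little post-processing helps when comparing metrics. The sketch below works on a literal copy of that dictionary rather than calling the library, and it assumes every key ending in `_CEFR` holds a numeric string; `metric_scores` is a hypothetical helper name, not part of PyCEFRizer.

```python
def metric_scores(result: dict) -> dict:
    """Return {metric_name: float_score} for every key ending in '_CEFR'."""
    suffix = "_CEFR"
    return {
        key[: -len(suffix)]: float(value)
        for key, value in result.items()
        if key.endswith(suffix)
    }


# Literal copy of the example output above (not produced by running the library).
example = {
    "CEFR-J_Level": "B2.2",
    "CVV1_CEFR": "4.23",
    "BperA_CEFR": "3.87",
    "POStypes_CEFR": "4.12",
    "ARI_CEFR": "4.56",
    "AvrDiff_CEFR": "4.01",
    "AvrFreqRank_CEFR": "3.92",
    "VperSent_CEFR": "4.34",
    "LenNP_CEFR": "4.15",
}

scores = metric_scores(example)
highest = max(scores, key=scores.get)
lowest = min(scores, key=scores.get)
print(highest, lowest)  # ARI BperA
```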

JSON Output

# Get JSON formatted output
json_result = analyzer.analyze_json(text)
print(json_result)

Detailed Analysis

# Get detailed analysis: raw metric values in addition to CEFR scores
detailed = analyzer.get_detailed_analysis(text)
print(detailed)

Command Line Usage

After installation, you can use the pycefrizer command:

# With uv (if installed with uv)
uv run pycefrizer "Your English text here..."
uv run pycefrizer -f input.txt
uv run pycefrizer -f input.txt -o output.json
uv run pycefrizer -d "Your text here..."
cat article.txt | uv run pycefrizer

# With standard pip installation
pycefrizer "Your English text here..."
pycefrizer -f input.txt
pycefrizer -f input.txt -o output.json
pycefrizer -d "Your text here..."
cat article.txt | pycefrizer
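
When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the `-f` and `-o` flags shown above; `build_command` is a hypothetical helper, the `-d` flag is left out because its meaning is not described here, and combining `-o` with inline text is an untested assumption.

```python
from typing import Optional


def build_command(
    text: Optional[str] = None,
    input_file: Optional[str] = None,
    output_file: Optional[str] = None,
    use_uv: bool = False,
) -> list:
    """Assemble a pycefrizer argument list from inline text or an input file."""
    if (text is None) == (input_file is None):
        raise ValueError("pass exactly one of text or input_file")
    cmd = ["uv", "run", "pycefrizer"] if use_uv else ["pycefrizer"]
    if input_file is not None:
        cmd += ["-f", input_file]
    else:
        cmd.append(text)
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


print(build_command(input_file="input.txt", output_file="output.json"))
# ['pycefrizer', '-f', 'input.txt', '-o', 'output.json']
```

Once the package is installed, the resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`.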

Python API Usage

import pycefrizer

# Quick analysis
result = pycefrizer.analyze("Your English text here...")
print(result)

# Using the class directly
from pycefrizer import PyCEFRizer

analyzer = PyCEFRizer()
result = analyzer.analyze("Your text here...")

Word CEFR Level Lookup

PyCEFRizer can also look up CEFR levels for individual words:

from pycefrizer import get_word_level, check_word_level

# Get CEFR level for a single word
level = get_word_level("beautiful")
print(level)  # Output: B1

# Check if a word is at or below a target level
is_basic = check_word_level("cat", "A2")  # True (cat is A1)
is_basic = check_word_level("paradigm", "B1")  # False (paradigm is C1)

# Using the analyzer directly
analyzer = PyCEFRizer()
level = analyzer.get_word_cefr_level("computer")
print(level)  # Output: A2

# Single word through analyze() method
result = analyzer.analyze("beautiful")
print(result)  # Output: {"CEFR_Level": "B1"}
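
The "at or below" check can be understood as a comparison of positions in the ordered list of CEFR levels. The sketch below shows that idea only; `CEFR_ORDER` and `at_or_below` are hypothetical names, and this is not necessarily how `check_word_level` is implemented.

```python
# CEFR levels from easiest to hardest.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def at_or_below(word_level: str, target_level: str) -> bool:
    """Return True if word_level is no harder than target_level."""
    return CEFR_ORDER.index(word_level) <= CEFR_ORDER.index(target_level)


print(at_or_below("A1", "A2"))  # True, matching the "cat" example
print(at_or_below("C1", "B1"))  # False, matching the "paradigm" example
```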

Finding Unused Vocabulary

PyCEFRizer can identify words from the dictionary at a specific CEFR level that are NOT used in the provided text. This is useful for educational material development and vocabulary gap analysis:

from pycefrizer import PyCEFRizer

analyzer = PyCEFRizer()

# Find unused C1 vocabulary in a simple text
unused_c1 = analyzer.get_unused_words("C1", "The cat sat on the mat.")
print(unused_c1)  # Output: {"cloak": "noun", "exterior": "noun", ...}

# Find unused B2 vocabulary
text = "This is a comprehensive analysis of modern technology."
unused_b2 = analyzer.get_unused_words("B2", text)
print(f"Number of unused B2 words: {len(unused_b2)}")
# Prints how many B2 dictionary words do not appear in the text
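
Conceptually, the unused-vocabulary lookup is a set difference between the dictionary entries at one level and the words found in the text. The sketch below demonstrates this on a hypothetical five-word dictionary built from the words mentioned in this README; the real lookup presumably also handles base forms, which this simple lowercase match does not.

```python
import re

# Hypothetical miniature dictionary: word -> (CEFR level, part of speech).
TOY_DICTIONARY = {
    "cat": ("A1", "noun"),
    "computer": ("A2", "noun"),
    "cloak": ("C1", "noun"),
    "exterior": ("C1", "noun"),
    "paradigm": ("C1", "noun"),
}


def unused_words(level: str, text: str, dictionary: dict) -> dict:
    """Return {word: pos} for dictionary words at `level` that are absent from `text`."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {
        word: pos
        for word, (word_level, pos) in dictionary.items()
        if word_level == level and word not in tokens
    }


print(unused_words("C1", "The exterior of the house was painted.", TOY_DICTIONARY))
# {'cloak': 'noun', 'paradigm': 'noun'}
```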

Command line usage for word lookup:

# With uv
uv run pycefrizer -w "beautiful"
# Output: B1

# With standard installation
pycefrizer -w "beautiful"
# Output: B1

# A word that is not in the dictionary prints an empty line
uv run pycefrizer -w "xyz123"
# Output: (empty line)

MCP Server

PyCEFRizer includes an MCP (Model Context Protocol) server that allows AI assistants to analyze text difficulty levels through a standardized interface.

MCP Server Configuration

To use PyCEFRizer as an MCP server, add the following configuration to your .mcp.json file:

{
  "mcpServers": {
    "pycefrizer": {
      "command": "uv",
      "args": ["run", "pycefrizer-mcp"],
      "cwd": "/path/to/PyCEFRizer"
    }
  }
}

Alternative configuration using Python module directly:

{
  "mcpServers": {
    "pycefrizer": {
      "command": "uv",
      "args": ["run", "python", "-m", "pycefrizer.mcp_server"],
      "cwd": "/path/to/PyCEFRizer"
    }
  }
}
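
If the configuration is generated by a script rather than written by hand, both variants above can be produced from one function. This is a sketch under the assumption that only the `args` list differs between them; `mcp_config` is a hypothetical helper name, and `/path/to/PyCEFRizer` remains a placeholder for the local checkout.

```python
import json


def mcp_config(project_dir: str, use_module: bool = False) -> dict:
    """Build the .mcp.json structure shown above for a given checkout path."""
    if use_module:
        args = ["run", "python", "-m", "pycefrizer.mcp_server"]
    else:
        args = ["run", "pycefrizer-mcp"]
    return {
        "mcpServers": {
            "pycefrizer": {"command": "uv", "args": args, "cwd": project_dir}
        }
    }


config = mcp_config("/path/to/PyCEFRizer")
# Print the JSON; redirect or write this string to .mcp.json as needed.
print(json.dumps(config, indent=2))
```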

Available MCP Tools

The MCP server provides the following tools:

analyze_text: Analyze English text and return CEFR-J level assessment with metric scores
get_word_cefr_level: Get the CEFR level of a single English word
get_unused_words: Find unused vocabulary from a specific CEFR level in the given text
get_detailed_analysis: Get detailed analysis including raw metric values and processed scores
analyze_file: Analyze text from a file and return CEFR-J level assessment
get_available_words: Get all available words in the dictionary for a specific CEFR level
get_cefr_words: Get all available words from the dictionary grouped by CEFR levels

Running MCP Server Manually

You can also run the MCP server manually for testing:

# With uv
uv run pycefrizer-mcp

# Or using Python module
uv run python -m pycefrizer.mcp_server

CEFR-J Levels

The analyzer returns one of the following CEFR-J levels:

preA1: Below A1 level
A1.1, A1.2, A1.3: Elementary levels
A2.1, A2.2: Pre-intermediate levels
B1.1, B1.2: Intermediate levels
B2.1, B2.2: Upper-intermediate levels
C1: Advanced level
C2: Proficient level
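
When post-processing results, the twelve labels above can be treated as an ordered scale. The sketch below encodes that order and shows two small helpers; `CEFR_J_LEVELS`, `broad_band`, and `is_harder` are hypothetical names chosen for illustration and are not part of the PyCEFRizer API.

```python
# CEFR-J levels from easiest to hardest, as listed above.
CEFR_J_LEVELS = [
    "preA1",
    "A1.1", "A1.2", "A1.3",
    "A2.1", "A2.2",
    "B1.1", "B1.2",
    "B2.1", "B2.2",
    "C1", "C2",
]


def broad_band(level: str) -> str:
    """Collapse a CEFR-J sub-level such as 'B2.2' to its band 'B2'."""
    if level not in CEFR_J_LEVELS:
        raise ValueError(f"unknown CEFR-J level: {level}")
    return level.split(".")[0]


def is_harder(a: str, b: str) -> bool:
    """Return True if level a sits above level b on the scale."""
    return CEFR_J_LEVELS.index(a) > CEFR_J_LEVELS.index(b)


print(broad_band("B2.2"))         # B2
print(is_harder("B1.1", "A2.2"))  # True
```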

Data Files

The analyzer uses two data files in the data/ directory:

word_lookup.json: Comprehensive word lookup dictionary with CEFR levels, base forms, and POS tags (21,891 words)
coca_frequencies.json: Word frequency rankings from COCA (Corpus of Contemporary American English)

Note: The included data files contain sample data. For production use, obtain the complete English Vocabulary Profile (EVP) and COCA datasets.

Requirements

Python 3.9+
spaCy 3.7.2
textstat 0.7.4
nltk 3.8+

Development

Setting up Development Environment

# Clone the repository
git clone https://github.com/straygizmo/PyCEFRizer.git
cd PyCEFRizer

# Create virtual environment and install dependencies
uv sync

# Install spaCy model
uv run python -m spacy download en_core_web_sm

# Run tests
uv run pytest

Adding Dependencies

# Add a new dependency
uv add package-name

# Add a development dependency
uv add --dev package-name

Updating Dependencies

# Update all dependencies
uv sync --upgrade

# Update a specific package
uv sync --upgrade-package package-name

Building the Package

To build the package for distribution:

# Build using uv
uv build

# This creates dist/ directory with:
# - pycefrizer-3.0.0.tar.gz (source distribution)
# - pycefrizer-3.0.0-py3-none-any.whl (wheel distribution)

Publishing to PyPI

# Publish to TestPyPI first (for testing)
uv publish --publish-url https://test.pypi.org/legacy/

# Publish to PyPI
uv publish

Credits

Research Foundation

This implementation is based on the methodology described in:

Uchida, S., & Negishi, M. (2025). "Estimating the CEFR-J Level of English Reading Passages: Development and Accuracy of CVLA3". To appear in English Corpus Studies, Vol. 32.

The full research paper is included in this repository at theses/Uchida_Negishi_2025.md.

"CEFR-J Wordlist Version 1.6." Yukio Tono Laboratory, Tokyo University of Foreign Studies. (Downloaded from http://www.cefr-j.org/download.html in February 2022.)

Implementation

This PyCEFRizer (CEFR-J Level Estimator) implementation was designed and developed using Claude Code, Anthropic's AI coding assistant. The implementation faithfully follows the CVLA3 methodology described in the research paper, including:

All 8 linguistic metrics (CVV1, BperA, POStypes, ARI, AvrDiff, AvrFreqRank, VperSent, LenNP)
Regression equations for CEFR score calculation
CEFR-J level mapping methodology
Statistical approach of averaging middle 6 values for stability

Key Features

Uses spaCy for NLP processing (POS tagging, dependency parsing)
Implements 8 linguistic metrics for comprehensive text analysis
Applies regression equations to convert metrics to CEFR scores
Averages middle 6 scores (excluding min/max) for stability
Maps final score to CEFR-J level
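
The "middle 6" averaging step can be sketched as a trimmed mean: sort the eight metric scores, drop the lowest and the highest, and average the rest. The function name `middle_six_mean` is hypothetical, the input values are copied from the example output in Basic Usage, and the mapping from the averaged score to a CEFR-J label is not reproduced here.

```python
def middle_six_mean(scores: list) -> float:
    """Average eight scores after discarding the minimum and the maximum."""
    if len(scores) != 8:
        raise ValueError("expected exactly eight metric scores")
    trimmed = sorted(scores)[1:-1]  # drop one minimum and one maximum
    return sum(trimmed) / len(trimmed)


# Scores copied from the example output in Basic Usage.
example_scores = [4.23, 3.87, 4.12, 4.56, 4.01, 3.92, 4.34, 4.15]
print(round(middle_six_mean(example_scores), 2))  # 4.13
```

Discarding the two extremes keeps a single outlying metric from pulling the final estimate up or down.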

License

This implementation is for educational and research purposes. When using this software, please cite:

The original research paper by Uchida & Negishi (2025)
This implementation as: "PyCEFRizer - CEFR-J Level Estimator, implemented using Claude Code based on Uchida & Negishi (2025)"

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Acknowledgments

The authors of the original CVLA3 research for their innovative methodology
The spaCy team for their excellent NLP library
The English Vocabulary Profile and COCA for linguistic resources
