Benchmarking tool for evaluating text summarization methods on scientific papers.
- Clone the `llm_apis` repository
- Clone the `exploration` repository
- Clone this repository
- Install dependencies

  ```bash
  cd llm_summarization_benchmark
  uv sync
  uv run spacy download en_core_web_sm
  ```

- Install AlignScore-large

  ```bash
  mkdir -p Output/llm_summarization_benchmark
  cd Output/llm_summarization_benchmark
  wget https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt
  ```

- Copy `Resources/example.env` to `Resources/.env` and adjust
- Run

  ```bash
  uv run benchmark
  ```
Individual LLM config parameters are stored in `../llm_apis/src/llm_apis/config.py` (separate `llm_apis` repository).
The following files must be in place in order to load previous results:

- `Output/llm_summarization_benchmark/benchmark.pkl`
- `Output/llm_apis/cache.json`
Afterwards, simply run the benchmark again; already processed results will be skipped.
The document store lives in the `Resources` folder and contains ID, title, abstract, and reference summaries for each paper.
One or more (1-N) reference summaries can be provided per paper.
Multiple reference summaries improve evaluation robustness and reduce single-annotator bias.
```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract text...",
    "id": "paper_001",
    "summaries": [
      "This paper analyzes ..",
      "The paper investigates .. "
    ]
  }
]
```

- Highlight sections of Elsevier and Cell papers, joined by ". ".
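A minimal sketch of loading and validating such a document store with the standard library (the file name `papers.json` is an assumption; point it at the actual file in `Resources`):

```python
import json
from pathlib import Path

# File name is an assumption; use the actual document store in Resources/.
docs = json.loads(Path("Resources/papers.json").read_text(encoding="utf-8"))

for doc in docs:
    # Each entry carries an ID, title, abstract, and 1-N reference summaries.
    assert {"id", "title", "abstract", "summaries"} <= doc.keys()
    assert len(doc["summaries"]) >= 1
```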
- Tokenizes sentences (nltk)
- Creates TF-IDF vectors for sentence representation (sklearn)
- Calculates cosine similarities between TF-IDF vectors (sklearn)
- Builds similarity graph with cosine similarities as edge weights (networkx)
- Applies PageRank to rank sentences by importance (networkx)
- Selects highest-scoring sentences within word count limits while preserving original order
WARNING: Results might be misleading when gold-standard summaries are (partial) copies of the source document rather than abstractive.
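A minimal sketch of this pipeline with nltk, scikit-learn, and networkx (illustrative only, not the benchmark's actual implementation; the 150-word limit is an assumption):

```python
import nltk
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text: str, max_words: int = 150) -> str:
    nltk.download("punkt", quiet=True)
    sentences = nltk.sent_tokenize(text)

    # TF-IDF vector per sentence, then pairwise cosine similarities
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)

    # Similarity graph with cosine similarities as edge weights, ranked via PageRank
    scores = nx.pagerank(nx.from_numpy_array(sim))

    # Greedily take the highest-scoring sentences within the word budget,
    # then emit them in their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_words:
            chosen.append(i)
            used += n
    return " ".join(sentences[i] for i in sorted(chosen))
```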
- Calculates word frequency scores
- Ranks sentences by average word frequency, excluding stopwords (nltk)
- Selects highest-scoring sentences (in original order) within word count limits
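An equally illustrative sketch of the frequency baseline (tokenization details and the word limit are assumptions, not taken from the repository):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords

def frequency_summary(text: str, max_words: int = 150) -> str:
    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)
    stop = set(stopwords.words("english"))

    sentences = nltk.sent_tokenize(text)
    content = [w.lower() for w in nltk.word_tokenize(text)
               if w.isalpha() and w.lower() not in stop]
    freq = Counter(content)

    # Average frequency of a sentence's content words
    def score(sent: str) -> float:
        tokens = [w.lower() for w in nltk.word_tokenize(sent)
                  if w.isalpha() and w.lower() not in stop]
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_words:
            chosen.append(i)
            used += n
    return " ".join(sentences[i] for i in sorted(chosen))
```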
- Ollama, OpenAI, Perplexity, and Anthropic providers, each with a number of models
Each generated summary is evaluated against all available gold-standard reference summaries of a document using the metrics listed below. For each metric, mean/min/max/std across the references are computed.
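For illustration, per-reference scores for a single metric could be aggregated like this (a sketch, not the benchmark's code):

```python
import statistics

def aggregate(scores: list[float]) -> dict[str, float]:
    # scores: one value per reference summary for a single metric
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```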
Set of metrics for evaluating summary quality by comparing to reference summaries. wiki | package | publication
- ROUGE-N: N-gram co-occurrence statistics between system and reference summaries.
- ROUGE-1: Overlap of unigrams (individual words)
- ROUGE-2: Overlap of bigrams (word pairs)
- ROUGE-L: Longest Common Subsequence (LCS) based statistics that capture sentence-level structure similarity by awarding credit only to in-sequence word matches.
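A minimal usage sketch, assuming the `rouge-score` package (summary strings are placeholders):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("Reference summary text ...", "Generated summary text ...")
print(scores["rougeL"].fmeasure)
```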
Semantic similarity using BERT embeddings. paper | package
- roberta-large: Default model (paper | model)
- microsoft/deberta-xlarge-mnli: Proposed as "better model" (paper | model)
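A minimal usage sketch, assuming the `bert-score` package (summary strings are placeholders):

```python
from bert_score import score

P, R, F1 = score(
    ["Generated summary text ..."],   # candidates
    ["Reference summary text ..."],   # references
    model_type="microsoft/deberta-xlarge-mnli",
)
print(F1.mean().item())
```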
Matches words via exact matches, stemming, and synonyms, and takes word order into account. Claims to outperform BLEU. paper | function
N-gram overlaps with brevity penalty. paper | function
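Both METEOR and BLEU are available in nltk; a hedged sketch (example sentences are placeholders):

```python
import nltk
from nltk.translate.meteor_score import meteor_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)  # needed for METEOR synonym matching

reference = nltk.word_tokenize("The paper investigates transformer models for summarization.")
candidate = nltk.word_tokenize("This paper analyzes transformer-based summarization models.")

print(meteor_score([reference], candidate))
print(sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1))
```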
Semantic similarity using sentence transformers. Compares generated summary directly against the source document (rather than reference summaries like other metrics). model
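A sketch of the summary-vs-source comparison with `sentence-transformers` (the model name is an assumption, not necessarily the one the benchmark uses):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
emb_summary = model.encode("Generated summary ...", convert_to_tensor=True)
emb_source = model.encode("Full source document ...", convert_to_tensor=True)
print(util.cos_sim(emb_summary, emb_source).item())
```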
- Execution Time: Processing time
- Length Compliance Metrics
- Within Bounds: Percentage meeting length constraints
- Too Short/Long: Violation statistics with percentages
- Average Length: Mean word count with standard deviation
- Length Distribution: Detailed statistical analysis
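A sketch of how such length-compliance statistics can be computed (bounds and field names are illustrative):

```python
import statistics

def length_compliance(summaries: list[str], min_words: int, max_words: int) -> dict[str, float]:
    lengths = [len(s.split()) for s in summaries]
    n = len(lengths)
    return {
        "within_bounds_pct": 100 * sum(min_words <= l <= max_words for l in lengths) / n,
        "too_short_pct": 100 * sum(l < min_words for l in lengths) / n,
        "too_long_pct": 100 * sum(l > max_words for l in lengths) / n,
        "avg_length": statistics.mean(lengths),
        "std_length": statistics.stdev(lengths) if n > 1 else 0.0,
    }
```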