Benchmarking tool for evaluating text summarization methods on scientific papers.
- Clone the `llm_apis` repository
- Clone the `exploration` repository
- Clone this repository
- Install dependencies

  ```bash
  cd llm_summarization_benchmark
  uv sync
  uv run spacy download en_core_web_sm
  ```

- Install AlignScore-large

  ```bash
  mkdir -p Output/llm_summarization_benchmark
  cd Output/llm_summarization_benchmark
  wget https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt
  ```

- Copy `Resources/example.env` to `Resources/.env` and adjust
- Run

  ```bash
  uv run benchmark
  ```
Individual LLM config parameters are stored in `../llm_apis/src/llm_apis/config.py` (separate `llm_apis` repository).
The following files must be in place in order to load previous results:

- `Output/llm_summarization_benchmark/benchmark.pkl`
- `Output/llm_apis/cache.json`
Afterwards, simply run the benchmark again; already processed results will be skipped.
The document store lives in the `Resources` folder and contains ID, title, abstract, and reference summaries for each paper.
One or more (1-N) reference summaries can be provided per paper.
Multiple reference summaries improve evaluation robustness and reduce single-annotator bias.
```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract text...",
    "id": "paper_001",
    "summaries": [
      "This paper analyzes ..",
      "The paper investigates .. "
    ]
  }
]
```

- Highlight sections of Elsevier and Cell papers, joined by ". ".
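A minimal sketch of loading and validating such a document store with the standard library (the file name `papers.json` is an assumption; point it at the actual file in `Resources`):

```python
import json
from pathlib import Path

# File name is an assumption; use the actual document store in Resources/.
docs = json.loads(Path("Resources/papers.json").read_text(encoding="utf-8"))

for doc in docs:
    # Each entry carries an ID, title, abstract, and 1-N reference summaries.
    assert {"id", "title", "abstract", "summaries"} <= doc.keys()
    assert len(doc["summaries"]) >= 1
```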
- Tokenizes sentences (nltk)
- Creates TF-IDF vectors for sentence representation (sklearn)
- Calculates cosine similarities between TF-IDF vectors (sklearn)
- Builds similarity graph with cosine similarities as edge weights (networkx)
- Applies PageRank to rank sentences by importance (networkx)
- Selects highest-scoring sentences within word count limits while preserving original order
WARNING: Results might be misleading when gold-standard summaries are (partial) copies of the source document rather than abstractive.
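A minimal sketch of this pipeline with nltk, scikit-learn, and networkx (illustrative only, not the benchmark's actual implementation; the 150-word limit is an assumption):

```python
import nltk
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text: str, max_words: int = 150) -> str:
    nltk.download("punkt", quiet=True)
    sentences = nltk.sent_tokenize(text)

    # TF-IDF vector per sentence, then pairwise cosine similarities
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)

    # Similarity graph with cosine similarities as edge weights, ranked via PageRank
    scores = nx.pagerank(nx.from_numpy_array(sim))

    # Greedily take the highest-scoring sentences within the word budget,
    # then emit them in their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_words:
            chosen.append(i)
            used += n
    return " ".join(sentences[i] for i in sorted(chosen))
```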
- Calculates word frequency scores
- Ranks sentences by average word frequency, excluding stopwords (nltk)
- Selects highest-scoring sentences (in original order) within word count limits
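An equally illustrative sketch of the frequency baseline (tokenization details and the word limit are assumptions, not taken from the repository):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords

def frequency_summary(text: str, max_words: int = 150) -> str:
    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)
    stop = set(stopwords.words("english"))

    sentences = nltk.sent_tokenize(text)
    content = [w.lower() for w in nltk.word_tokenize(text)
               if w.isalpha() and w.lower() not in stop]
    freq = Counter(content)

    # Average frequency of a sentence's content words
    def score(sent: str) -> float:
        tokens = [w.lower() for w in nltk.word_tokenize(sent)
                  if w.isalpha() and w.lower() not in stop]
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_words:
            chosen.append(i)
            used += n
    return " ".join(sentences[i] for i in sorted(chosen))
```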
- Ollama, OpenAI, Perplexity, and Anthropic providers, each with a number of models
Each generated summary is evaluated against all available gold-standard reference summaries of a document using the metrics listed below. For each metric, mean/min/max/std across the references are computed.
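For illustration, per-reference scores for a single metric could be aggregated like this (a sketch, not the benchmark's code):

```python
import statistics

def aggregate(scores: list[float]) -> dict[str, float]:
    # scores: one value per reference summary for a single metric
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```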
Set of metrics for evaluating summary quality by comparing to reference summaries. wiki | package | publication
- ROUGE-N: N-gram co-occurrence statistics between system and reference summaries.
- ROUGE-1: Overlap of unigrams (individual words)
- ROUGE-2: Overlap of bigrams (word pairs)
- ROUGE-L: Longest Common Subsequence (LCS) based statistics that capture sentence-level structure similarity by awarding credit only to in-sequence word matches.
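A minimal usage sketch, assuming the `rouge-score` package (summary strings are placeholders):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("Reference summary text ...", "Generated summary text ...")
print(scores["rougeL"].fmeasure)
```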
Semantic similarity using BERT embeddings. paper | package
- roberta-large: Default model (paper | model)
- microsoft/deberta-xlarge-mnli: Proposed as "better model" (paper | model)
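A minimal usage sketch, assuming the `bert-score` package (summary strings are placeholders):

```python
from bert_score import score

P, R, F1 = score(
    ["Generated summary text ..."],   # candidates
    ["Reference summary text ..."],   # references
    model_type="microsoft/deberta-xlarge-mnli",
)
print(F1.mean().item())
```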
Matches words via exact matches, stemming, and synonyms, and takes word order into account. Claims to outperform BLEU. paper | function
N-gram overlaps with brevity penalty. paper | function
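Both METEOR and BLEU are available in nltk; a hedged sketch (example sentences are placeholders):

```python
import nltk
from nltk.translate.meteor_score import meteor_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)  # needed for METEOR synonym matching

reference = nltk.word_tokenize("The paper investigates transformer models for summarization.")
candidate = nltk.word_tokenize("This paper analyzes transformer-based summarization models.")

print(meteor_score([reference], candidate))
print(sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1))
```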
Semantic similarity using sentence transformers. Compares generated summary directly against the source document (rather than reference summaries like other metrics). model
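A sketch of the summary-vs-source comparison with `sentence-transformers` (the model name is an assumption, not necessarily the one the benchmark uses):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
emb_summary = model.encode("Generated summary ...", convert_to_tensor=True)
emb_source = model.encode("Full source document ...", convert_to_tensor=True)
print(util.cos_sim(emb_summary, emb_source).item())
```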
- Execution Time: Processing time
- Length Compliance Metrics
- Within Bounds: Percentage meeting length constraints
- Too Short/Long: Violation statistics with percentages
- Average Length: Mean word count with standard deviation
- Length Distribution: Detailed statistical analysis
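A sketch of how such length-compliance statistics can be computed (bounds and field names are illustrative):

```python
import statistics

def length_compliance(summaries: list[str], min_words: int, max_words: int) -> dict[str, float]:
    lengths = [len(s.split()) for s in summaries]
    n = len(lengths)
    return {
        "within_bounds_pct": 100 * sum(min_words <= l <= max_words for l in lengths) / n,
        "too_short_pct": 100 * sum(l < min_words for l in lengths) / n,
        "too_long_pct": 100 * sum(l > max_words for l in lengths) / n,
        "avg_length": statistics.mean(lengths),
        "std_length": statistics.stdev(lengths) if n > 1 else 0.0,
    }
```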