Delta4AI/LLMTextSummarizationBenchmark

Scientific Paper Summarization Benchmark

Benchmarking tool for evaluating text summarization methods on scientific papers.

Quick Start

  1. Clone llm_apis repository
  2. Clone exploration repository
  3. Clone this repository
  4. Install dependencies
    cd llm_summarization_benchmark  
    uv sync
    uv run spacy download en_core_web_sm
  5. Install AlignScore-large
    mkdir -p Output/llm_summarization_benchmark
    cd Output/llm_summarization_benchmark
    wget https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt
  6. Copy Resources/example.env to Resources/.env and adjust
  7. Run
    uv run benchmark

Configuration parameters for individual LLMs are stored in ../llm_apis/src/llm_apis/config.py, which lives in the separate llm_apis repository.

Run the visualization only without benchmarking

The following files must be in place in order to load previous results:

  • Output/llm_summarization_benchmark/benchmark.pkl
  • Output/llm_apis/cache.json

Then run the benchmark again. Results that have already been processed are skipped.


Workflow

[Figure: Workflow]

text_summarization_goldstandard_data.json

Document store in the Resources folder. Each entry contains an ID, title, abstract and one or more reference summaries. Providing multiple reference summaries per paper improves evaluation robustness and reduces single-annotator bias.

[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract text...",
    "id": "paper_001",
    "summaries": [
      "This paper analyzes ..",
      "The paper investigates .. "
    ]
  }
]
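As an illustration of the structure above, the following is a minimal sketch of loading the document store and checking that every entry carries the documented fields. The function names `validate_entries` and `load_gold_standard` are hypothetical and not part of the benchmark code.

```python
import json
from pathlib import Path

# Field names taken from the example entry above.
REQUIRED_FIELDS = {"id", "title", "abstract", "summaries"}


def validate_entries(entries):
    """Check that each entry has all documented fields and >= 1 reference summary."""
    for entry in entries:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"{entry.get('id', '?')}: missing fields {sorted(missing)}")
        if not entry["summaries"]:
            raise ValueError(f"{entry['id']}: at least one reference summary is required")
    return entries


def load_gold_standard(path):
    """Read the JSON document store from disk and validate it (hypothetical helper)."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    return validate_entries(entries)
```

Validating up front means a malformed entry fails early instead of surfacing later as a missing key during evaluation.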

Reference summary sources

  • Highlight sections of Elsevier and Cell papers, joined by ". ".

Summarization Methods

local:textrank

  1. Tokenizes sentences (nltk)
  2. Creates TF-IDF vectors for sentence representation (sklearn)
  3. Calculates cosine similarities between TF-IDF vectors (sklearn)
  4. Builds similarity graph with cosine similarities as edge weights (networkx)
  5. Applies PageRank to rank sentences by importance (networkx)
  6. Selects highest-scoring sentences within word count limits while preserving original order
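
The steps above can be sketched without third-party dependencies. This is an illustration only, not the benchmark's implementation: a regex splitter stands in for nltk, a hand-rolled smoothed TF-IDF and cosine similarity stand in for sklearn, and a plain power iteration stands in for networkx's PageRank. All function names and the `max_words` default are assumptions.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive regex splitter standing in for nltk's sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_vectors(sentences):
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(sentences)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    # Weighted PageRank by power iteration; isolated nodes keep only the base score.
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[j] * weights[j][i] / out_sums[j] for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def textrank_summary(text, max_words=50):
    sentences = split_sentences(text)
    if not sentences:
        return ""
    vectors = tfidf_vectors(sentences)
    n = len(sentences)
    # Similarity graph as an adjacency matrix, without self-loops.
    weights = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    chosen, used = [], 0
    # Greedily take the best-ranked sentences that still fit the word budget.
    for i in sorted(range(n), key=lambda k: scores[k], reverse=True):
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    # Emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Because the output consists only of sentences copied from the input, any reference summary that shares sentences with the source will overlap strongly with it.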

WARNING: Results may be misleading when gold-standard summaries are (partial) copies of the source document rather than abstractive. An extractive method can then score highly on overlap metrics simply by selecting the copied sentences.

local:frequency

  1. Calculates word frequency scores
  2. Ranks sentences by average word frequency, excluding stopwords (nltk)
  3. Selects highest-scoring sentences within word count limits while preserving original order
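
A minimal sketch of this frequency-based method, under stated assumptions: the small `STOPWORDS` set is an illustrative stand-in for nltk's stopword list, the regex sentence splitter replaces a proper tokenizer, and the function name and `max_words` parameter are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; an assumption standing in for nltk's stopwords.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "for", "with"}


def frequency_summary(text, max_words=50):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z0-9]+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    # Step 1: document-wide word frequencies (stopwords excluded).
    freq = Counter(w for tokens in tokenized for w in tokens)
    # Step 2: score each sentence by the average frequency of its words.
    scores = [
        sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0
        for tokens in tokenized
    ]
    # Step 3: take the best sentences that fit the budget, then restore order.
    chosen, used = [], 0
    for i in sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True):
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    return " ".join(sentences[i] for i in sorted(chosen))
```

Averaging rather than summing the frequencies keeps long sentences from winning on length alone.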

External Platforms

  • Ollama, OpenAI, Perplexity and Anthropic, each with a number of models

Evaluation Metrics

Each generated summary is evaluated against all available gold-standard reference summaries of a document using the metrics listed below. For each metric, the mean, minimum, maximum and standard deviation are computed.
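
As a rough sketch of this aggregation, the helper below summarizes a list of scores for one metric. The name `aggregate` is hypothetical, and the use of the population standard deviation is an assumption; the benchmark may use the sample form instead.

```python
import statistics


def aggregate(scores):
    """Summarize one metric's scores (hypothetical helper)."""
    return {
        "mean": statistics.fmean(scores),
        "min": min(scores),
        "max": max(scores),
        # Population standard deviation (assumption); also defined for one score.
        "std": statistics.pstdev(scores),
    }
```

With a single reference summary the minimum, maximum and mean coincide and the standard deviation is zero, which is one reason multiple references give a more informative picture.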

Rouge

Set of metrics for evaluating summary quality by comparing to reference summaries. wiki | package | publication

  • ROUGE-N: N-gram co-occurrence statistics between system and reference summaries.
    • ROUGE-1: Overlap of unigrams (individual words)
    • ROUGE-2: Overlap of bigrams (word pairs)
  • ROUGE-L: Longest Common Subsequence (LCS) based statistics that capture sentence-level structure similarity by awarding credit only to in-sequence word matches.
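
To make the definitions concrete, here is a simplified sketch of ROUGE-N and ROUGE-L as F1 scores over whitespace tokens. It is not the package used by the benchmark: it omits stemming and other preprocessing, and all function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram to its count in the other text.
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    # Classic dynamic program for the longest common subsequence, one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))
```

For the pair "the cat sat on the mat" and "the cat lay on the mat", five of six unigrams match but only three of five bigrams do, so ROUGE-2 drops faster than ROUGE-1 when a single word changes.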

Bert

Semantic similarity using BERT embeddings. paper | package

  • roberta-large: Default model paper | model
  • microsoft/deberta-xlarge-mnli: Proposed as "better model" paper | model

Meteor

Matches words through exact matches, stemming, synonyms, and considers word order. Claims to outperform BLEU. paper | function

BLEU

N-gram overlaps with brevity penalty. paper | function
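
The interaction between n-gram precision and the brevity penalty can be sketched as follows. This is a simplified single-reference, unsmoothed variant for illustration, not the function the benchmark calls; the name `bleu` and the `max_n` parameter are assumptions.

```python
import math
from collections import Counter


def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped matches: each candidate n-gram counts at most as often as in the reference.
        clipped = sum((cand_ngrams & ref_ngrams).values())
        total = sum(cand_ngrams.values())
        if clipped == 0 or total == 0:
            return 0.0  # unsmoothed: one empty n-gram order zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: candidates shorter than the reference are scaled down.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate that reproduces only the first half of an eight-word reference has perfect precision at every order, yet the penalty alone reduces its score to exp(-1), roughly 0.37.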

all-mpnet-base-v2

Semantic similarity using sentence transformers. Compares generated summary directly against the source document (rather than reference summaries like other metrics). model

Further Metrics

  • Execution Time: Processing time
  • Length Compliance Metrics
    • Within Bounds: Percentage meeting length constraints
    • Too Short/Long: Violation statistics with percentages
    • Average Length: Mean word count with standard deviation
    • Length Distribution: Detailed statistical analysis
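
A minimal sketch of the length compliance statistics, assuming lengths are counted in whitespace-separated words and that both bounds are inclusive. The function name, parameter names and result keys are hypothetical.

```python
import statistics


def length_compliance(summaries, min_words, max_words):
    """Compute length statistics for a batch of summaries (hypothetical helper)."""
    if not summaries:
        raise ValueError("at least one summary is required")
    lengths = [len(s.split()) for s in summaries]
    total = len(lengths)
    too_short = sum(1 for n in lengths if n < min_words)
    too_long = sum(1 for n in lengths if n > max_words)
    return {
        "within_bounds_pct": 100.0 * (total - too_short - too_long) / total,
        "too_short_pct": 100.0 * too_short / total,
        "too_long_pct": 100.0 * too_long / total,
        "avg_length": statistics.fmean(lengths),
        # Population standard deviation (assumption).
        "std_length": statistics.pstdev(lengths),
    }
```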

About

Pipeline for systematic evaluation and benchmarking of text summarization methods for biomedical literature.
