SCOPE: Summarization with Context-Oriented Preprocessing and Evaluation
Co-Supervised by: Merlin Streilein
If you are interested in this topic or have further questions, do not hesitate to contact merlin.streilein@unibe.ch.
Background / Context
Modern text summarization models, such as BART or T5, are limited by their maximum input length (commonly around 1024 tokens). This presents a challenge for summarizing long documents such as scientific articles, legal contracts, or books. A naive solution is recursive summarization, where documents are chunked and summarized iteratively, but this often leads to information loss, redundancy, or inconsistency. Recent research has suggested alternatives: heuristics that focus on specific parts of the document (e.g., beginnings and ends of paragraphs), importance-based ranking of sentences, or hybrid approaches combining extractive and abstractive methods. Despite these ideas, there has been little systematic evaluation of strategies to adapt fixed-length models to long inputs, and the potential for new preprocessing schemes remains underexplored.
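A minimal sketch of the naive recursive (chunk-and-summarize) approach described above, assuming the Hugging Face Transformers library and the facebook/bart-large-cnn checkpoint; the character-based chunking and the length settings are illustrative simplifications, not part of the project specification:

```python
# Naive recursive summarization: chunk, summarize each chunk, concatenate,
# and repeat until the text fits into a single model call.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def recursive_summarize(text: str, chunk_chars: int = 3000, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        if len(text) <= chunk_chars:
            break
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        summaries = [
            summarizer(chunk, max_length=150, min_length=30, truncation=True)[0]["summary_text"]
            for chunk in chunks
        ]
        text = " ".join(summaries)  # intermediate summaries become the next round's input
    return summarizer(text, max_length=200, min_length=50, truncation=True)[0]["summary_text"]
```

Because each round compresses chunks independently, information that spans chunk boundaries can be lost, which is exactly the weakness this project sets out to examine.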
Research Question(s) / Goals
- What are the strengths and weaknesses of existing strategies (recursive summarization, sentence ranking, positional heuristics) for handling long-input summarization with fixed-length models like BART?
- Can a new strategy be developed that better preserves global coherence while respecting input length limits?
- How do different strategies compare across domains (news articles, scientific papers, long-form reviews)?
Approach / Methods
- Implement baseline summarization using a pretrained model such as BART with the standard 1024-token limit.
- Reproduce existing preprocessing strategies:
  - Recursive summarization.
  - Sentence importance ranking (using similarity, embeddings, or TF-IDF); a minimal sketch of this strategy is shown after this list.
  - Positional heuristics (favoring beginnings/ends of sections).
- Propose and implement a new preprocessing scheme (e.g., adaptive importance sampling across the text, hybrid extractive-abstractive pipelines).
- Evaluate methods on long-document datasets (e.g., arXiv, PubMed, BookSum, or long-form news datasets).
- Compare using both automatic metrics (ROUGE, BERTScore) and human evaluation of coherence and coverage.
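As a starting point for the sentence-ranking strategy referenced above, the following is a minimal sketch that scores sentences with TF-IDF, keeps the highest-scoring ones (in document order) until the 1024-token budget is filled, and summarizes the selection with BART. The model name, scoring function, and budget handling are illustrative assumptions rather than prescribed choices:

```python
# Sentence-ranking preprocessing: select high-TF-IDF sentences that fit the
# model's input budget, then run abstractive summarization on the selection.
import nltk  # requires nltk.download("punkt") on first use
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def rank_and_summarize(document: str, token_budget: int = 1024) -> str:
    sentences = nltk.sent_tokenize(document)
    # Score each sentence by the sum of its TF-IDF weights.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    # Greedily keep top-ranked sentences until the token budget is exhausted,
    # then restore document order to preserve local coherence.
    selected, used = [], 0
    for idx in ranked:
        n_tokens = len(tokenizer.tokenize(sentences[idx]))
        if used + n_tokens > token_budget:
            continue
        selected.append(idx)
        used += n_tokens
    condensed = " ".join(sentences[i] for i in sorted(selected))

    inputs = tokenizer(condensed, return_tensors="pt", truncation=True, max_length=token_budget)
    summary_ids = model.generate(**inputs, max_length=200, min_length=50, num_beams=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Positional heuristics can be sketched analogously by replacing the TF-IDF scores with position-based weights (e.g., boosting the first and last sentences of each section). For the comparison step, generated summaries can be scored against reference summaries with packages such as evaluate, which provides rouge and bertscore metrics.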
Expected Contributions / Outcomes
- A systematic evaluation of strategies for adapting fixed-length summarization models to long inputs.
- Identification of trade-offs between different approaches (coverage vs. coherence, precision vs. recall).
- A prototype implementation of a novel strategy that improves on baseline methods.
- Insights into the role of text structure (e.g., paragraph boundaries, sentence positions) in summarization effectiveness.
Required Skills / Prerequisites
- Programming in Python.
- Familiarity with NLP and Hugging Face Transformers (especially BART or T5) is advantageous but not necessary.
- Understanding of summarization metrics (ROUGE, BLEU, BERTScore) and evaluation design is advantageous but not necessary.
- Basic knowledge of text preprocessing and representation (e.g., embeddings, sentence scoring) is advantageous but not necessary.
Possible Extensions
- Explore adaptive token allocation (e.g., dynamically select more tokens from “important” regions of text); a small sketch follows this list.
- Apply methods to domain-specific long-text tasks such as legal or medical document summarization.
- Conduct qualitative analysis of failure cases (what information gets consistently dropped?).
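For the adaptive token allocation extension mentioned above, a minimal sketch under the assumption that the document is already split into sections and that each section has an importance score (e.g., from the TF-IDF ranking sketched earlier); the function and variable names are hypothetical:

```python
# Adaptive token allocation: give each section a share of the model's input
# budget proportional to its importance score, then truncate it to that share.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

def allocate_tokens(sections: list[str], scores: list[float], budget: int = 1024) -> str:
    total = sum(scores) or 1.0  # avoid division by zero for all-zero scores
    pieces = []
    for section, score in zip(sections, scores):
        share = max(1, int(budget * score / total))  # tokens granted to this section
        token_ids = tokenizer.encode(section, add_special_tokens=False)[:share]
        pieces.append(tokenizer.decode(token_ids))
    return " ".join(pieces)
```

The truncated concatenation can then be fed to the same BART summarizer as in the baseline; comparing this against uniform truncation would isolate the effect of the allocation policy.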
Further Reading / Starting Literature
- Lewis, M. et al. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. ACL.
- Zaheer, M. et al. (2020). Big Bird: Transformers for Longer Sequences. NeurIPS.
- Puny, O., Ben-Hamu, H., & Lipman, Y. (2020). Global Attention Improves Graph Networks Generalization. https://arxiv.org/abs/2006.07846.
