SCOPE: Summarization with Context-Oriented Preprocessing and Evaluation
Co-Supervised by: Merlin Streilein
If you are interested in this topic or have further questions, do not hesitate to contact merlin.streilein@unibe.ch.
Background / Context
Modern text summarization models, such as BART or T5, are limited by their maximum input length (commonly around 1024 tokens). This presents a challenge for summarizing long documents such as scientific articles, legal contracts, or books. A naive solution is recursive summarization, where documents are chunked and summarized iteratively, but this often leads to information loss, redundancy, or inconsistency. Recent research has suggested alternatives: heuristics that focus on specific parts of the document (e.g., beginnings and ends of paragraphs), importance-based ranking of sentences, or hybrid approaches combining extractive and abstractive methods. Despite these ideas, there has been little systematic evaluation of strategies to adapt fixed-length models to long inputs, and the potential for new preprocessing schemes remains underexplored.
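A minimal sketch of the naive recursive (chunk-and-summarize) approach described above, assuming the Hugging Face Transformers library and the facebook/bart-large-cnn checkpoint; the character-based chunking and the length settings are illustrative simplifications, not part of the project specification:

```python
# Naive recursive summarization: chunk, summarize each chunk, concatenate,
# and repeat until the text fits into a single model call.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def recursive_summarize(text: str, chunk_chars: int = 3000, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        if len(text) <= chunk_chars:
            break
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        summaries = [
            summarizer(chunk, max_length=150, min_length=30, truncation=True)[0]["summary_text"]
            for chunk in chunks
        ]
        text = " ".join(summaries)  # intermediate summaries become the next round's input
    return summarizer(text, max_length=200, min_length=50, truncation=True)[0]["summary_text"]
```

Because each round compresses chunks independently, information that spans chunk boundaries can be lost, which is exactly the weakness this project sets out to examine.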
Research Question(s) / Goals
- What are the strengths and weaknesses of existing strategies (recursive summarization, sentence ranking, positional heuristics) for handling long-input summarization with fixed-length models like BART?
- Can a new strategy be developed that better preserves global coherence while respecting input length limits?
- How do different strategies compare across domains (news articles, scientific papers, long-form reviews)?
Approach / Methods
- Implement baseline summarization using a pretrained model such as BART with the standard 1024-token limit.
- Reproduce existing preprocessing strategies:
  - Recursive summarization.
  - Sentence importance ranking (using similarity, embeddings, or TF-IDF); a minimal sketch of this strategy is shown after this list.
  - Positional heuristics (favoring beginnings/ends of sections).
- Propose and implement a new preprocessing scheme (e.g., adaptive importance sampling across the text, hybrid extractive-abstractive pipelines).
- Evaluate methods on long-document datasets (e.g., arXiv, PubMed, BookSum, or long-form news datasets).
- Compare using both automatic metrics (ROUGE, BERTScore) and human evaluation of coherence and coverage.
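As a starting point for the sentence-ranking strategy referenced above, the following is a minimal sketch that scores sentences with TF-IDF, keeps the highest-scoring ones (in document order) until the 1024-token budget is filled, and summarizes the selection with BART. The model name, scoring function, and budget handling are illustrative assumptions rather than prescribed choices:

```python
# Sentence-ranking preprocessing: select high-TF-IDF sentences that fit the
# model's input budget, then run abstractive summarization on the selection.
import nltk  # requires nltk.download("punkt") on first use
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def rank_and_summarize(document: str, token_budget: int = 1024) -> str:
    sentences = nltk.sent_tokenize(document)
    # Score each sentence by the sum of its TF-IDF weights.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    # Greedily keep top-ranked sentences until the token budget is exhausted,
    # then restore document order to preserve local coherence.
    selected, used = [], 0
    for idx in ranked:
        n_tokens = len(tokenizer.tokenize(sentences[idx]))
        if used + n_tokens > token_budget:
            continue
        selected.append(idx)
        used += n_tokens
    condensed = " ".join(sentences[i] for i in sorted(selected))

    inputs = tokenizer(condensed, return_tensors="pt", truncation=True, max_length=token_budget)
    summary_ids = model.generate(**inputs, max_length=200, min_length=50, num_beams=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Positional heuristics can be sketched analogously by replacing the TF-IDF scores with position-based weights (e.g., boosting the first and last sentences of each section). For the comparison step, generated summaries can be scored against reference summaries with packages such as evaluate, which provides rouge and bertscore metrics.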
Expected Contributions / Outcomes
- A systematic evaluation of strategies for adapting fixed-length summarization models to long inputs.
- Identification of trade-offs between different approaches (coverage vs. coherence, precision vs. recall).
- A prototype implementation of a novel strategy that improves on baseline methods.
- Insights into the role of text structure (e.g., paragraph boundaries, sentence positions) in summarization effectiveness.
Required Skills / Prerequisites
- Programming in Python.
- Familiarity with NLP and Hugging Face Transformers (especially BART or T5) is advantageous but not necessary.
- Understanding of summarization metrics (ROUGE, BLEU, BERTScore) and evaluation design is advantageous but not necessary.
- Basic knowledge of text preprocessing and representation (e.g., embeddings, sentence scoring) is advantageous but not necessary.
Possible Extensions
- Explore adaptive token allocation (e.g., dynamically select more tokens from “important” regions of text); a small sketch follows this list.
- Apply methods to domain-specific long-text tasks such as legal or medical document summarization.
- Conduct qualitative analysis of failure cases (what information gets consistently dropped?).
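For the adaptive token allocation extension mentioned above, a minimal sketch under the assumption that the document is already split into sections and that each section has an importance score (e.g., from the TF-IDF ranking sketched earlier); the function and variable names are hypothetical:

```python
# Adaptive token allocation: give each section a share of the model's input
# budget proportional to its importance score, then truncate it to that share.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

def allocate_tokens(sections: list[str], scores: list[float], budget: int = 1024) -> str:
    total = sum(scores) or 1.0  # avoid division by zero for all-zero scores
    pieces = []
    for section, score in zip(sections, scores):
        share = max(1, int(budget * score / total))  # tokens granted to this section
        token_ids = tokenizer.encode(section, add_special_tokens=False)[:share]
        pieces.append(tokenizer.decode(token_ids))
    return " ".join(pieces)
```

The truncated concatenation can then be fed to the same BART summarizer as in the baseline; comparing this against uniform truncation would isolate the effect of the allocation policy.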
Further Reading / Starting Literature
- Lewis, M. et al. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. ACL.
- Zaheer, M. et al. (2020). Big Bird: Transformers for Longer Sequences. NeurIPS.
- Puny, O., Ben-Hamu, H., & Lipman, Y. (2020). Global Attention Improves Graph Networks Generalization. https://arxiv.org/abs/2006.07846.
