SCOPE: Summarization with Context-Oriented Preprocessing and Evaluation 

Co-Supervised by: Merlin Streilein

If you are interested in this topic or have further questions, do not hesitate to contact merlin.streilein@unibe.ch.

Background / Context

Modern text summarization models, such as BART or T5, are limited by their maximum input length (commonly around 1024 tokens). This presents a challenge for summarizing long documents such as scientific articles, legal contracts, or books. A naive solution is recursive summarization, where documents are chunked and summarized iteratively, but this often leads to information loss, redundancy, or inconsistency. Recent research has suggested alternatives: heuristics that focus on specific parts of the document (e.g., beginnings and ends of paragraphs), importance-based ranking of sentences, or hybrid approaches combining extractive and abstractive methods. Despite these ideas, there has been little systematic evaluation of strategies to adapt fixed-length models to long inputs, and the potential for new preprocessing schemes remains underexplored. 
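
As an illustration of the naive recursive approach described above, the sketch below splits a document into token-bounded chunks, summarizes each chunk, and repeats until the text fits the model window. It assumes the Hugging Face `facebook/bart-large-cnn` checkpoint; the chunk size and generation lengths are illustrative choices, not part of the project specification.

```python
# Minimal sketch of naive recursive summarization (illustrative only).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # assumed checkpoint
tokenizer = summarizer.tokenizer


def chunk_by_tokens(text: str, max_tokens: int = 900) -> list[str]:
    """Split the text into token-count-bounded chunks that fit the model window."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]


def recursive_summarize(text: str, max_tokens: int = 900) -> str:
    """Summarize each chunk, join the partial summaries, and repeat until short enough."""
    while len(tokenizer(text, add_special_tokens=False)["input_ids"]) > max_tokens:
        chunks = chunk_by_tokens(text, max_tokens)
        partials = [summarizer(c, max_length=150, min_length=30, truncation=True)[0]["summary_text"]
                    for c in chunks]
        text = " ".join(partials)
    return summarizer(text, max_length=200, min_length=50, truncation=True)[0]["summary_text"]
```

Because each chunk is compressed to a short partial summary, the intermediate text shrinks on every pass, which is exactly where the information loss and inconsistency mentioned above tend to arise.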

Research Question(s) / Goals

  • What are the strengths and weaknesses of existing strategies (recursive summarization, sentence ranking, positional heuristics) for handling long-input summarization with fixed-length models like BART? 
  • Can a new strategy be developed that better preserves global coherence while respecting input length limits? 
  • How do different strategies compare across domains (news articles, scientific papers, long-form reviews)? 

Approach / Methods

  • Implement baseline summarization using a pretrained model such as BART with the standard 1024-token limit (a baseline sketch follows this list). 
  • Reproduce existing preprocessing strategies: 
    • Recursive summarization. 
    • Sentence importance ranking (using similarity, embeddings, or TF-IDF); a TF-IDF ranking sketch follows this list. 
    • Positional heuristics (favoring beginnings/ends of sections). 
  • Propose and implement a new preprocessing scheme (e.g., adaptive importance sampling across the text, hybrid extractive-abstractive pipelines). 
  • Evaluate methods on long-document datasets (e.g., arXiv, PubMed, BookSum, or long-form news datasets). 
  • Compare using both automatic metrics (ROUGE, BERTScore) and human evaluation of coherence and coverage (an evaluation sketch follows this list). 
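
A minimal baseline sketch, assuming the `facebook/bart-large-cnn` checkpoint: the document is hard-truncated to BART's 1024-token window before generation. The beam-search settings are illustrative.

```python
# Baseline sketch: truncate to BART's 1024-token window, then summarize.
from transformers import BartForConditionalGeneration, BartTokenizer

name = "facebook/bart-large-cnn"  # assumed checkpoint
tokenizer = BartTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name)


def baseline_summary(document: str) -> str:
    # Hard truncation: everything beyond the first 1024 tokens is discarded.
    inputs = tokenizer(document, max_length=1024, truncation=True, return_tensors="pt")
    # Beam search with illustrative length constraints.
    ids = model.generate(inputs["input_ids"], num_beams=4, min_length=50, max_length=200)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```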
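
A sketch of TF-IDF-based sentence importance ranking as a preprocessing step: sentences are scored, the highest-scoring ones are kept until the token budget is reached, and the selection is restored to document order. The naive sentence splitter, the scoring rule (sum of TF-IDF weights), and the 1024-token budget are assumptions for illustration; `tokenizer` is expected to be a Hugging Face tokenizer such as the BART tokenizer from the baseline sketch.

```python
# Sketch of TF-IDF-based sentence selection as a preprocessing step.
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def select_important_sentences(document: str, tokenizer, budget: int = 1024) -> str:
    # Naive sentence split for illustration; a proper splitter would be used in practice.
    sentences = re.split(r"(?<=[.!?])\s+", document)
    # Score each sentence by the sum of its TF-IDF weights.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Greedily keep high-scoring sentences within the token budget,
    # then restore the original order to preserve local coherence.
    kept, used = [], 0
    for i in ranked:
        n = len(tokenizer(sentences[i], add_special_tokens=False)["input_ids"])
        if used + n <= budget:
            kept.append(i)
            used += n
    return " ".join(sentences[i] for i in sorted(kept))
```

A positional heuristic fits the same interface: instead of TF-IDF scores, sentences would be scored by their position within a section (e.g., higher scores for opening and closing sentences).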
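
For the automatic metrics, a short sketch using the Hugging Face `evaluate` library (which wraps `rouge_score` and `bert_score`); the prediction and reference strings are placeholders.

```python
# Sketch of automatic evaluation with ROUGE and BERTScore via the `evaluate` library.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["<model-generated summary>"]  # placeholder
references = ["<gold reference summary>"]    # placeholder

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```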

Expected Contributions / Outcomes

  • A systematic evaluation of strategies for adapting fixed-length summarization models to long inputs. 
  • Identification of trade-offs between different approaches (coverage vs. coherence, precision vs. recall). 
  • A prototype implementation of a novel strategy that improves on baseline methods. 
  • Insights into the role of text structure (e.g., paragraph boundaries, sentence positions) in summarization effectiveness. 

Required Skills / Prerequisites

  • Programming in Python. 
  • Familiarity with NLP and Hugging Face Transformers (especially BART or T5) is advantageous but not necessary. 
  • Understanding of summarization metrics (ROUGE, BLEU, BERTScore) and evaluation design is advantageous but not necessary. 
  • Basic knowledge of text preprocessing and representation (e.g., embeddings, sentence scoring) is advantageous but not necessary. 

Possible Extensions

  • Explore adaptive token allocation (e.g., dynamically select more tokens from “important” regions of text); see the sketch after this list. 
  • Apply methods to domain-specific long-text tasks such as legal or medical document summarization. 
  • Conduct qualitative analysis of failure cases (what information gets consistently dropped?). 
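
A minimal sketch of the adaptive token allocation idea, assuming sections and per-section importance scores are already available (how those scores are computed is left open): each section receives a share of the 1024-token budget proportional to its score and is truncated to that share.

```python
# Sketch of adaptive token allocation across sections (importance scores assumed given).
def allocate_tokens(sections, scores, tokenizer, budget: int = 1024) -> str:
    total = sum(scores) or 1.0
    pieces = []
    for text, score in zip(sections, scores):
        share = int(budget * score / total)  # proportional token share for this section
        ids = tokenizer(text, add_special_tokens=False)["input_ids"][:share]
        pieces.append(tokenizer.decode(ids))
    return " ".join(pieces)
```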

Further Reading / Starting Literature

  • Lewis, M. et al. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. ACL. 
  • Zaheer, M. et al. (2020). Big Bird: Transformers for Longer Sequences. NeurIPS. 
  • Puny, O., Ben-Hamu, H., & Lipman, Y. (2020). Global Attention Improves Graph Networks Generalization. arXiv preprint arXiv:2006.07846. https://arxiv.org/abs/2006.07846.