
{"id":1684,"date":"2025-10-17T08:08:31","date_gmt":"2025-10-17T08:08:31","guid":{"rendered":"https:\/\/prg.inf.unibe.ch\/?page_id=1684"},"modified":"2025-10-17T08:08:31","modified_gmt":"2025-10-17T08:08:31","slug":"thesis-scope","status":"publish","type":"page","link":"https:\/\/prg.inf.unibe.ch\/index.php\/education\/thesis-scope\/","title":{"rendered":"thesis-SCOPE"},"content":{"rendered":"\n<div style=\"height:150px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<section class=\"wp-block-uagb-columns uagb-columns__wrap uagb-columns__background-none uagb-columns__stack-mobile uagb-columns__valign- uagb-columns__gap-10 align uagb-block-b3397370 uagb-columns__columns-1 uagb-columns__max_width-theme\"><div class=\"uagb-columns__overlay\"><\/div><div class=\"uagb-columns__inner-wrap uagb-columns__columns-1\">\n<div class=\"wp-block-uagb-column uagb-column__wrap uagb-column__background-undefined uagb-block-3e0cbd99\"><div class=\"uagb-column__overlay\"><\/div>\n<h1 class=\"wp-block-heading\"><strong>SCOPE: Summarization with Context-Oriented Preprocessing and Evaluation<\/strong>\u00a0<\/h1>\n\n\n\n<p><strong>Co-Supervised by:<\/strong> Merlin Streilein<\/p>\n\n\n\n<p>If you are interested in this topic or have further questions, do not hesitate to contact <a href=\"mailto:merlin.streilein@unibe.ch\">merlin.streilein@unibe.ch<\/a>.<\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Background \/ Context<\/strong><\/p>\n\n\n\n<p>Modern text summarization models, such as BART or T5, are limited by their maximum input length (commonly around 1024 tokens). This presents a challenge for summarizing long documents such as scientific articles, legal contracts, or books. A naive solution is <em>recursive summarization<\/em>, where documents are chunked and summarized iteratively, but this often leads to information loss, redundancy, or inconsistency. Recent research has suggested alternatives: heuristics that focus on specific parts of the document (e.g., beginnings and ends of paragraphs), importance-based ranking of sentences, or hybrid approaches combining extractive and abstractive methods. 
**Research Question(s) / Goals**

- What are the strengths and weaknesses of existing strategies (recursive summarization, sentence ranking, positional heuristics) for handling long-input summarization with fixed-length models like BART?
- Can a new strategy be developed that better preserves global coherence while respecting input length limits?
- How do different strategies compare across domains (news articles, scientific papers, long-form reviews)?

**Approach / Methods**

- Implement baseline summarization using a pretrained model such as BART with the standard 1024-token limit.
- Reproduce existing preprocessing strategies:
  - Recursive summarization (as sketched above).
  - Sentence importance ranking (using similarity, embeddings, or TF-IDF); a sketch follows this list.
  - Positional heuristics (favoring the beginnings/ends of sections).
- Propose and implement a new preprocessing scheme (e.g., adaptive importance sampling across the text, or hybrid extractive-abstractive pipelines).
- Evaluate the methods on long-document datasets (e.g., arXiv, PubMed, BookSum, or long-form news datasets).
- Compare the strategies using both automatic metrics (ROUGE, BERTScore) and human evaluation of coherence and coverage; a minimal scoring sketch appears at the end of this page.
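To make the sentence-ranking strategy concrete, here is a small sketch, assuming scikit-learn for TF-IDF and the same naive sentence splitter and word budget as in the baseline sketch above; summing a sentence's term weights is just one simple importance proxy among the options (similarity, embeddings) listed in the second bullet.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def select_by_tfidf(text: str, budget_words: int = 700) -> str:
    """Keep the highest-scoring sentences until the word budget for the
    downstream model is exhausted, then emit them in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Fit TF-IDF over the sentences; a sentence's importance is taken as
    # the sum of its term weights (one simple proxy among many).
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    chosen, used = set(), 0
    for i in np.argsort(-scores):          # best-scoring sentences first
        n = len(sentences[i].split())
        if used + n <= budget_words:
            chosen.add(i)
            used += n
    # Restore document order to preserve local coherence.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Selecting greedily by score but emitting in document order is a deliberate choice: it trades some coverage for local coherence, one of the trade-offs the evaluation is meant to surface.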
**Expected Contributions / Outcomes**

- A systematic evaluation of strategies for adapting fixed-length summarization models to long inputs.
- Identification of trade-offs between the different approaches (coverage vs. coherence, precision vs. recall).
- A prototype implementation of a novel strategy that improves on the baseline methods.
- Insights into the role of text structure (e.g., paragraph boundaries, sentence positions) in summarization effectiveness.

**Required Skills / Prerequisites**

- Programming in Python.
- Familiarity with NLP and Hugging Face Transformers (especially BART or T5) is advantageous but not necessary.
- Understanding of summarization metrics (ROUGE, BLEU, BERTScore) and evaluation design is advantageous but not necessary.
- Basic knowledge of text preprocessing and representation (e.g., embeddings, sentence scoring) is advantageous but not necessary.

**Possible Extensions**

- Explore adaptive token allocation (e.g., dynamically select more tokens from "important" regions of the text); a sketch of this idea follows the list.
- Apply the methods to domain-specific long-text tasks such as legal or medical document summarization.
- Conduct a qualitative analysis of failure cases (what information gets consistently dropped?).
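As one concrete reading of the first extension, the sketch below divides a fixed word budget across paragraphs in proportion to a TF-IDF importance score; both the scoring function and the word-based budget are stand-in assumptions, placeholders for whatever scoring the thesis settles on.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def allocate_budget(paragraphs: list[str], budget_words: int = 700) -> str:
    """Give each paragraph a share of the input budget proportional to
    its TF-IDF mass, then keep that many of its leading words."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(paragraphs)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    weights = scores / scores.sum()            # importance -> budget share
    pieces = []
    for para, w in zip(paragraphs, weights):
        quota = max(1, int(w * budget_words))  # every paragraph keeps something
        # Truncating at word boundaries is crude; trimming at sentence
        # boundaries would be a natural refinement.
        pieces.append(" ".join(para.split()[:quota]))
    return " ".join(pieces)
```

Unlike fixed chunking, this lets "important" regions of the document claim more of the model's limited input window.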
**Further Reading / Starting Literature**

- Lewis, M. et al. (2020). *BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension*. ACL.
- Zaheer, M. et al. (2020). *Big Bird: Transformers for Longer Sequences*. NeurIPS.
- Puny, O., Ben-Hamu, H., & Lipman, Y. (2020). *Global Attention Improves Graph Networks Generalization*. https://arxiv.org/abs/2006.07846.
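Finally, a minimal scoring sketch for the automatic-metric comparison described under Approach / Methods, using Hugging Face's `evaluate` library; the prediction and reference texts are placeholders.

```python
import evaluate  # pip install evaluate rouge_score bert_score

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["summary produced by one of the strategies"]  # placeholder
references = ["gold summary from the dataset"]               # placeholder

# ROUGE rewards n-gram overlap; BERTScore rewards semantic similarity,
# so the two metrics can rank the same strategies differently.
print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions,
                        references=references, lang="en"))
```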