Extracting Real-Word Errors from Public Datasets for Improved Detection Models

Supervised by: Corina Masanti

If you are interested in this topic or have further questions, do not hesitate to contact corina.masanti@unibe.ch.

Context

Real-word errors occur when a word is spelt correctly but is incorrect in the context of a sentence (e.g., “there” instead of “their”). Unlike common spelling errors, real-word errors often go unnoticed because identifying them requires contextual understanding. Because of this complexity, real-word errors are less common in existing public datasets, which makes it difficult for models to learn these patterns effectively. This thesis aims to address this gap by systematically extracting real-word errors from large public datasets to build a dedicated dataset and by exploring detection techniques tailored to these errors.

Goal(s)

The first step is to analyze publicly available datasets and identify cases of real-word errors. Sentences containing such errors are then extracted using predefined rules. Once built, the dataset can be used to train and evaluate methods for detecting real-word errors, including transformer-based models fine-tuned on the extracted data and rule-based approaches. If the dataset turns out to be too small, synthetic data can be generated to supplement it and improve error detection.
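As an illustration of what a predefined extraction rule might look like, the following minimal sketch keeps a token only if it differs from its correction and is itself a valid word. It assumes a corpus of aligned (original, corrected) sentence pairs and a word list; the names `extract_real_word_errors`, `VOCAB`, and `PAIRS` are hypothetical, and the actual datasets may use a different format (for example, token-level labels without corrections).

```python
def extract_real_word_errors(pairs, vocabulary):
    """Return (sentence, position, written, intended) for each real-word error.

    Assumption: each pair is (original, corrected) and both sentences have
    the same number of whitespace-separated tokens. Pairs that would need
    a proper alignment step are skipped in this sketch.
    """
    found = []
    for original, corrected in pairs:
        src, tgt = original.split(), corrected.split()
        if len(src) != len(tgt):
            continue  # insertions/deletions are out of scope here
        for i, (written, intended) in enumerate(zip(src, tgt)):
            w, c = written.lower(), intended.lower()
            # Rule: the token was corrected AND the written form is a valid word.
            if w != c and w in vocabulary:
                found.append((original, i, written, intended))
    return found


# Toy data, purely illustrative.
VOCAB = {"they", "left", "there", "their", "bags", "at", "home",
         "i", "saw", "a", "cat"}
PAIRS = [
    ("They left there bags at home", "They left their bags at home"),
    ("I saw a caat", "I saw a cat"),  # non-word error, should be ignored
]

errors = extract_real_word_errors(PAIRS, VOCAB)
for sentence, position, written, intended in errors:
    print(f"{sentence!r}: token {position} {written!r} -> {intended!r}")
```

In this toy example only the first pair is kept, since “caat” is not in the word list and therefore counts as an ordinary spelling error rather than a real-word error.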

Approach

Develop a method to identify and extract real-word errors from public datasets to create a dedicated dataset for these errors.
Experiment with various approaches to detect real-word errors, including context-sensitive models such as transformer-based language models and rule-based systems.
(Optional) Augment the dataset with synthetic real-word errors to enhance model performance on rare error types.
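For the optional augmentation step, one possible way to generate synthetic real-word errors is to swap a word for another member of a confusion set. The sketch below is only an illustration under that assumption; the confusion sets and the function name `inject_real_word_error` are hypothetical and not part of any specific dataset or library.

```python
import random

# Hypothetical confusion sets; a real project would derive these from data
# or from an existing linguistic resource.
CONFUSION_SETS = [{"their", "there"}, {"to", "too"}, {"then", "than"}]


def inject_real_word_error(sentence, rng):
    """Replace one confusable token and return (new_sentence, labels).

    labels[i] is 1 for the corrupted token and 0 otherwise. If the sentence
    contains no confusable token, it is returned unchanged with all-zero labels.
    The replacement is inserted in lowercase (capitalisation is not preserved).
    """
    tokens = sentence.split()
    candidates = [
        (i, conf_set)
        for i, token in enumerate(tokens)
        for conf_set in CONFUSION_SETS
        if token.lower() in conf_set
    ]
    labels = [0] * len(tokens)
    if not candidates:
        return sentence, labels
    i, conf_set = rng.choice(candidates)
    # Pick a different member of the same confusion set.
    alternatives = sorted(conf_set - {tokens[i].lower()})
    tokens[i] = rng.choice(alternatives)
    labels[i] = 1
    return " ".join(tokens), labels


rng = random.Random(0)  # fixed seed for reproducibility
corrupted, labels = inject_real_word_error("They left their bags at home", rng)
print(corrupted, labels)
```

The token-level labels make the output directly usable as training data for a detection model, under the assumption that detection is framed as token classification.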

Required Skills

Good programming skills
Basic understanding of machine learning concepts, or willingness to learn them in the process

Further Reading(s)

Volodina, Elena, et al. “MultiGED-2023 shared task at NLP4CALL: Multilingual Grammatical Error Detection.” Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning. 2023.
Some available datasets for error detection: https://github.com/spraakbanken/multiged-2023?tab=readme-ov-file#data