Extracting Real-Word Errors from Public Datasets for Improved Detection Models

Supervised by: Corina Masanti

If you are interested in this topic or have further questions, do not hesitate to contact corina.masanti@unibe.ch.

Context

Real-word errors occur when a word is spelt correctly but is incorrect in the context of a sentence (e.g., “there” instead of “their”). Unlike common spelling errors, real-word errors often go unnoticed because they require contextual understanding to identify. Due to their complexity, real-word errors are less common in existing public datasets, making it difficult for models to learn these patterns effectively. This thesis aims to address this gap by systematically extracting real-word errors from large public datasets to build a dedicated dataset and by exploring detection techniques tailored to these errors.

Goal(s)

The first step is to analyze publicly available datasets and identify cases of real-word errors. The sentences containing these errors are then extracted using predefined rules. The resulting dataset can be used to train and evaluate methods for detecting real-word errors. These methods include rule-based approaches and transformer-based models fine-tuned on the extracted dataset. If the dataset turns out to be too small, synthetic data can be generated to improve error detection.
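
As an illustration of what a predefined extraction rule could look like, the following minimal sketch flags a token as a real-word error when it differs from its correction but is itself a valid word. It assumes a hypothetical input format of aligned (original, corrected) token lists and a given vocabulary set; the function name, the vocabulary, and the sample sentences are invented for this example, and actual public datasets may need a conversion step first.

```python
from typing import Iterable, List, Set, Tuple


def extract_real_word_errors(
    pairs: Iterable[Tuple[List[str], List[str]]],
    vocabulary: Set[str],
) -> List[Tuple[List[str], int, str]]:
    """Return (original_tokens, index, correction) for each real-word error.

    Hypothetical rule: the original token differs from its correction,
    yet the original token is itself a word in the vocabulary.
    """
    found = []
    for original, corrected in pairs:
        # Skip pairs whose tokens cannot be aligned one-to-one.
        if len(original) != len(corrected):
            continue
        for i, (src, tgt) in enumerate(zip(original, corrected)):
            differs = src.lower() != tgt.lower()
            is_real_word = src.lower() in vocabulary
            if differs and is_real_word:
                found.append((original, i, tgt))
    return found


# Toy vocabulary and data, invented for illustration only.
VOCAB = {"they", "left", "there", "their", "bags", "at", "home"}
PAIRS = [
    # "there" is a valid word used in the wrong context: real-word error.
    (["They", "left", "there", "bags", "at", "home"],
     ["They", "left", "their", "bags", "at", "home"]),
    # "thier" is not a valid word: ordinary spelling error, ignored.
    (["They", "left", "thier", "bags", "at", "home"],
     ["They", "left", "their", "bags", "at", "home"]),
]

errors = extract_real_word_errors(PAIRS, VOCAB)
for tokens, index, correction in errors:
    print(tokens[index], "->", correction)
```

Only the first sentence pair is kept, because "thier" is not in the vocabulary. A real extraction pipeline would additionally need a proper word list and an alignment step for corrections that insert or delete tokens.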

Approach

  • Develop a method to identify and extract real-word errors from public datasets to create a dedicated dataset for these errors.
  • Experiment with various approaches to detect real-word errors, including context-sensitive models such as transformer-based language models and rule-based systems.
  • (Optional) Augment the dataset with synthetic real-word errors to enhance model performance on rare error types.
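
One possible way to approach the optional augmentation step is to swap words for members of a confusion set. The sketch below is only an assumption about how this could be done, not a prescribed method; the confusion sets and the function name are hypothetical and chosen for illustration.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical confusion sets: each word maps to words it is commonly confused with.
CONFUSION_SETS: Dict[str, List[str]] = {
    "their": ["there"],
    "there": ["their"],
    "then": ["than"],
    "than": ["then"],
}


def inject_real_word_error(
    tokens: List[str],
    rng: random.Random,
    confusion_sets: Dict[str, List[str]] = CONFUSION_SETS,
) -> Optional[Tuple[List[str], int]]:
    """Replace one confusable token and return (corrupted_tokens, error_index).

    Returns None if the sentence contains no token from a confusion set.
    Note: the replacement is inserted in lowercase; casing is not preserved.
    """
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in confusion_sets]
    if not candidates:
        return None
    i = rng.choice(candidates)
    replacement = rng.choice(confusion_sets[tokens[i].lower()])
    corrupted = list(tokens)  # copy so the clean sentence stays intact
    corrupted[i] = replacement
    return corrupted, i


rng = random.Random(0)  # fixed seed for reproducibility
result = inject_real_word_error(["They", "left", "their", "bags"], rng)
print(result)
```

The returned index can serve directly as a token-level label for training a detection model, while the untouched input sentence provides the corresponding correct example.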

Required Skills

  • Good programming skills
  • Basic understanding of machine learning concepts, or an interest in learning them along the way

Further Reading(s)

Volodina, Elena, et al. “MultiGED-2023 shared task at NLP4CALL: Multilingual Grammatical Error Detection.” Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning. 2023.

Some available datasets for error detection: https://github.com/spraakbanken/multiged-2023?tab=readme-ov-file#data