Extracting Real-Word Errors from Public Datasets for Improved Detection Models
Supervised by: Corina Masanti
If you are interested in this topic or have further questions, do not hesitate to contact corina.masanti@unibe.ch.
Context
Real-word errors occur when a word is spelt correctly but is incorrect in the context of a sentence (e.g., “there” instead of “their”). Unlike common spelling errors, real-word errors often go unnoticed because they require contextual understanding to identify. Due to their complexity, real-word errors are less common in existing public datasets, making it difficult for models to learn these patterns effectively. This thesis aims to address this gap by systematically extracting real-word errors from large public datasets to build a dedicated dataset and by exploring detection techniques tailored to these errors.
Goal(s)
The first step is to analyze publicly available datasets and identify cases of real-word errors. The sentences containing these errors are then extracted using predefined rules. Once built, the dataset can be used to train and evaluate methods for detecting real-word errors, including transformer-based models fine-tuned on the extracted data and rule-based approaches. If the dataset turns out to be too small, synthetic data can be generated to improve error detection.
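As an illustration of what one such extraction rule might look like, the following minimal sketch flags a token as a real-word error when it was corrected even though the written form is itself a valid word. It assumes a corpus that provides aligned pairs of written and intended tokens; the toy vocabulary, the pair layout, and the function name `find_real_word_errors` are hypothetical and not taken from any particular dataset.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical toy vocabulary; a real experiment would load a full word list.
VOCAB: Set[str] = {"their", "there", "is", "house", "big", "the", "a"}


def find_real_word_errors(
    pairs: Iterable[Tuple[str, str]], vocab: Set[str]
) -> List[Tuple[int, str, str]]:
    """Return (position, written, intended) for every corrected token
    whose written form is itself a valid word in the vocabulary."""
    hits = []
    for i, (written, intended) in enumerate(pairs):
        w, t = written.lower(), intended.lower()
        # Rule: the token was changed, and both forms are real words.
        if w != t and w in vocab and t in vocab:
            hits.append((i, written, intended))
    return hits


# Assumed input: one sentence as (written, intended) token pairs.
sentence = [("There", "Their"), ("house", "house"), ("is", "is"), ("bigg", "big")]
errors = find_real_word_errors(sentence, VOCAB)
print(errors)
```

In this example only “There” is reported; “bigg” is skipped because it is not a valid word and therefore counts as an ordinary misspelling. Datasets that only mark tokens as correct or incorrect, without giving the intended word, would need a different rule.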
Approach
- Develop a method to identify and extract real-word errors from public datasets to create a dedicated dataset for these errors.
- Experiment with various approaches to detect real-word errors, including context-sensitive models such as transformer-based language models and rule-based systems.
- (Optional) Augment the dataset with synthetic real-word errors to enhance model performance on rare error types.
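The optional augmentation step could, for instance, be sketched with confusion sets: groups of words that are easily mistaken for one another. The snippet below replaces one confusable token in a correct sentence and returns token-level labels. The confusion sets, the function name `inject_error`, and the label scheme ("c" for correct, "i" for incorrect) are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical confusion sets; a real list would be much larger.
CONFUSION_SETS: Dict[str, List[str]] = {
    "their": ["there"],
    "there": ["their"],
    "than": ["then"],
    "then": ["than"],
}


def inject_error(
    tokens: List[str], rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Replace one confusable token with a member of its confusion set.

    Returns the corrupted tokens and per-token labels
    ("c" = correct, "i" = incorrect). If no token is confusable,
    the sentence is returned unchanged with all labels "c".
    """
    labels = ["c"] * len(tokens)
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in CONFUSION_SETS]
    if not candidates:
        return list(tokens), labels
    pos = rng.choice(candidates)
    corrupted = list(tokens)
    # The replacement is lowercase; original casing is not preserved here.
    corrupted[pos] = rng.choice(CONFUSION_SETS[tokens[pos].lower()])
    labels[pos] = "i"
    return corrupted, labels


# A seeded generator keeps the augmentation reproducible.
corrupted, labels = inject_error(["They", "lost", "their", "keys"], random.Random(0))
print(corrupted, labels)
```

Because the replacement is always a valid word, every injected error is a real-word error by construction. Whether such synthetic errors resemble those made by actual writers is an open question that the thesis would have to examine.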
Required Skills
- Good programming skills
- Basic understanding of machine learning concepts, or an interest in learning them in the process
Further Reading(s)
Volodina, Elena, et al. “MultiGED-2023 shared task at NLP4CALL: Multilingual Grammatical Error Detection.” Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning. 2023.
Some available datasets for error detection: https://github.com/spraakbanken/multiged-2023?tab=readme-ov-file#data