Improving Data Set Quality for Robust Language Models

Supervised by: Corina Masanti

If you are interested in this topic or have further questions, do not hesitate to contact me.

Context/Background/Current State

The performance of NLP models depends heavily on the quality of their training data. However, many data sets suffer from noise, inconsistencies, and biases that can lead to suboptimal model performance. The goal of this thesis is to investigate methods for improving data set quality in the context of automatic error detection and correction in text documents, with the aim of increasing the robustness and generalization of language models.

Goals
  • Analyze the characteristics of a real-world data set and identify potential problems.
  • Develop techniques to improve the data set.
  • (Optional) Generate synthetic data tailored to the problems identified in the real-world data set.
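The optional synthetic-data goal can be illustrated with a minimal, hypothetical sketch: clean sentences are corrupted with simple character-level rules, so that (corrupted, clean) pairs could serve as training material for error correction. The `corrupt` function, its corruption rules, and the example sentence are illustrative assumptions, not part of the thesis setup.

```python
import random

def corrupt(sentence: str, rng: random.Random) -> str:
    """Apply one random character-level corruption to a sentence (hypothetical rules)."""
    if len(sentence) < 2:
        return sentence
    i = rng.randrange(len(sentence) - 1)
    op = rng.choice(["swap", "drop", "double"])
    if op == "swap":    # transpose two adjacent characters
        return sentence[:i] + sentence[i + 1] + sentence[i] + sentence[i + 2:]
    if op == "drop":    # delete one character
        return sentence[:i] + sentence[i + 1:]
    return sentence[:i] + sentence[i] + sentence[i:]  # duplicate one character

rng = random.Random(42)
clean = ["The quick brown fox jumps over the lazy dog."]
# Each pair maps a corrupted input to its clean target.
pairs = [(corrupt(s, rng), s) for s in clean]
```

Real systems such as the tagged corruption models in the Stahlberg and Kumar paper below condition the corruption on error-type tags rather than sampling edits uniformly; this sketch only shows the pairing idea.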

Task Description
Analyze a real-world data set with a focus on its limitations and challenges. Investigate techniques such as rule-based and neural approaches to create improved versions of the data set. Evaluate the impact of the improved data set on the performance and robustness of the language model in automatic error detection and correction.
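As a starting point for the data set analysis, a first-pass audit might count exact duplicates and suspiciously short entries, two common noise sources in real-world corpora. The `audit` function and its length threshold are hypothetical examples, not a prescribed method.

```python
from collections import Counter

def audit(texts: list[str]) -> dict:
    """Report simple quality indicators for a list of text examples."""
    # Normalize lightly so that trivial variants count as duplicates.
    counts = Counter(t.strip().lower() for t in texts)
    duplicates = {t: n for t, n in counts.items() if n > 1}
    # Entries shorter than 3 characters are flagged as likely noise (assumed threshold).
    too_short = [t for t in texts if len(t.strip()) < 3]
    return {
        "n_examples": len(texts),
        "n_duplicate_forms": len(duplicates),
        "n_too_short": len(too_short),
    }

report = audit(["Hello.", "hello.", "ok", "A full sentence here."])
```

Such a report makes it easy to compare the original and improved versions of the data set before measuring any downstream effect on the language model.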

Required Skills

Good programming skills.


Further Reading

  • Stahlberg, Felix, and Shankar Kumar. "Synthetic data generation for grammatical error correction with tagged corruption models." arXiv preprint arXiv:2105.13318 (2021).
  • Bryant, Christopher, et al. "The BEA-2019 shared task on grammatical error correction." Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications (2019).