Improving Data Set Quality for Robust Language Models

Supervised by: Corina Masanti

If you are interested in this topic or have further questions, do not hesitate to contact me.

Context/Background/Current State

The performance of NLP models depends heavily on the quality of their training data. However, many data sets suffer from noise, inconsistencies, and biases that can lead to suboptimal model performance. The goal of this thesis is to investigate methods for improving data set quality in the context of automatic error detection and correction in text documents, with the aim of increasing the robustness and generalization of language models.

Goals
  • Analyze the characteristics of a real-world data set and identify potential problems.
  • Develop techniques to improve the data set.
  • (Optional) Generate synthetic data tailored to the problems identified in the real-world data set.
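The optional synthetic-data goal can be illustrated with a minimal, hypothetical sketch: clean sentences are corrupted with simple character-level rules, so that (corrupted, clean) pairs could serve as training material for error correction. The `corrupt` function, its corruption rules, and the example sentence are illustrative assumptions, not part of the thesis setup.

```python
import random

def corrupt(sentence: str, rng: random.Random) -> str:
    """Apply one random character-level corruption to a sentence (hypothetical rules)."""
    if len(sentence) < 2:
        return sentence
    i = rng.randrange(len(sentence) - 1)
    op = rng.choice(["swap", "drop", "double"])
    if op == "swap":    # transpose two adjacent characters
        return sentence[:i] + sentence[i + 1] + sentence[i] + sentence[i + 2:]
    if op == "drop":    # delete one character
        return sentence[:i] + sentence[i + 1:]
    return sentence[:i] + sentence[i] + sentence[i:]  # duplicate one character

rng = random.Random(42)
clean = ["The quick brown fox jumps over the lazy dog."]
# Each pair maps a corrupted input to its clean target.
pairs = [(corrupt(s, rng), s) for s in clean]
```

Real systems such as the tagged corruption models in the Stahlberg and Kumar paper below condition the corruption on error-type tags rather than sampling edits uniformly; this sketch only shows the pairing idea.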

Task Description
Analyze a real-world data set with a focus on its limitations and challenges. Investigate techniques such as rule-based and neural approaches to create improved versions of the data set. Evaluate the impact of the improved data set on the performance and robustness of the language model in automatic error detection and correction.
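As a starting point for the data set analysis, a first-pass audit might count exact duplicates and suspiciously short entries, two common noise sources in real-world corpora. The `audit` function and its length threshold are hypothetical examples, not a prescribed method.

```python
from collections import Counter

def audit(texts: list[str]) -> dict:
    """Report simple quality indicators for a list of text examples."""
    # Normalize lightly so that trivial variants count as duplicates.
    counts = Counter(t.strip().lower() for t in texts)
    duplicates = {t: n for t, n in counts.items() if n > 1}
    # Entries shorter than 3 characters are flagged as likely noise (assumed threshold).
    too_short = [t for t in texts if len(t.strip()) < 3]
    return {
        "n_examples": len(texts),
        "n_duplicate_forms": len(duplicates),
        "n_too_short": len(too_short),
    }

report = audit(["Hello.", "hello.", "ok", "A full sentence here."])
```

Such a report makes it easy to compare the original and improved versions of the data set before measuring any downstream effect on the language model.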

Required Skills

Good programming skills.


Further Reading

  • Stahlberg, Felix, and Shankar Kumar. "Synthetic data generation for grammatical error correction with tagged corruption models." arXiv preprint arXiv:2105.13318 (2021).
  • Bryant, Christopher, et al. "The BEA-2019 shared task on grammatical error correction." Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications (2019).