Improving Data Set Quality for Robust Language Models

Supervised by: Corina Masanti

If you are interested in this topic or have further questions, do not hesitate to contact me.

Context/Background/Current State

The performance of NLP models is highly dependent on the quality of the training data. However, many data sets suffer from noise, inconsistencies, and biases that can lead to suboptimal model performance. The goal of this thesis is to investigate methods for improving data set quality in the context of automatic error detection and correction in text documents, thereby increasing the robustness and generalization of language models.

Goal(s)

  • Analyze the characteristics of a real-world data set and identify potential problems.
  • Develop techniques to improve the data set.
  • (Optional) Generate synthetic data that targets the problems identified in the real-world data set (a minimal sketch follows this list).
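
As a concrete illustration of the optional goal, the following sketch produces synthetic (noisy, clean) training pairs by corrupting clean sentences with simple random edits. It is a deliberately minimal stand-in for the tagged corruption models of Stahlberg and Kumar (see Further Reading); the edit types, their probabilities, and the function name are illustrative assumptions, not part of the thesis specification.

    import random

    def corrupt(tokens, p=0.15, rng=random):
        """Return a noisy copy of `tokens` via simple random edits.

        Hypothetical rules for illustration; tagged corruption models
        instead condition on fine-grained error-type tags.
        """
        out = []
        for tok in tokens:
            r = rng.random()
            if r < p / 3:
                continue                    # drop a token
            if r < 2 * p / 3:
                out.extend([tok, tok])      # duplicate a token
                continue
            if r < p and len(tok) > 3:
                j = rng.randrange(len(tok) - 1)
                tok = tok[:j] + tok[j + 1] + tok[j] + tok[j + 2:]  # swap two characters
            out.append(tok)
        return out

    clean = "the quick brown fox jumps over the lazy dog".split()
    noisy = corrupt(clean)
    print(" ".join(noisy), "->", " ".join(clean))  # one synthetic training pair

In practice, the corruption rules and their frequencies would be fitted to the error distribution observed in the real-world data set rather than drawn uniformly at random.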

Approach

Analyze a real-world data set with a focus on its limitations and challenges. Investigate techniques such as rule-based and neural approaches to create improved versions of the data set. Evaluate the impact of the improved data set on the performance and robustness of a language model for automatic error detection and correction.
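
To make the rule-based direction concrete, the sketch below implements a minimal cleaning pass over a corpus of (noisy, corrected) sentence pairs: Unicode normalization, whitespace cleanup, removal of exact duplicates, and a length-ratio filter. The pair format, the threshold, and the function name are assumptions for illustration; a real pipeline would be driven by the problems found in the analysis step.

    import unicodedata

    def clean_pairs(pairs, max_ratio=2.0):
        """Minimal rule-based cleaning of (noisy, corrected) sentence pairs."""
        seen = set()
        kept = []
        for src, tgt in pairs:
            # Normalize Unicode and collapse runs of whitespace.
            src = " ".join(unicodedata.normalize("NFC", src).split())
            tgt = " ".join(unicodedata.normalize("NFC", tgt).split())
            if not src or not tgt:
                continue                 # drop pairs with an empty side
            if (src, tgt) in seen:
                continue                 # drop exact duplicates
            ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
            if ratio > max_ratio:
                continue                 # drop implausible length mismatches
            seen.add((src, tgt))
            kept.append((src, tgt))
        return kept

    pairs = [("Ths is a tset.", "This is a test."),
             ("Ths is a tset.", "This is a test."),                # duplicate
             ("hi", "a completely different long sentence")]       # bad pair
    print(clean_pairs(pairs))  # -> [('Ths is a tset.', 'This is a test.')]

Exact-duplicate removal and length-ratio filtering are standard first-pass heuristics; more aggressive steps such as near-duplicate detection or model-based filtering would follow the same keep-or-drop structure.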

Required Skills

Good programming skills.

Further Reading

  • Stahlberg, Felix, and Shankar Kumar. "Synthetic data generation for grammatical error correction with tagged corruption models." arXiv preprint arXiv:2105.13318 (2021).
  • Bryant, Christopher, et al. "The BEA-2019 shared task on grammatical error correction." Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications (2019).