Improve Data Quality of the Swiss River Network (by a lot)

Supervised by: Benjamin Fankhauser

If you are interested in this topic or have further questions, do not hesitate to contact me.

Context/Background/Current State

Our current research increases the accuracy of water temperature predictions by using LSTMs that draw on all available data: not only air temperature and discharge, but also measurements from neighboring river stations. Any sensor in this network can fail and produce missing or wrong data. The goal of this project is to fill all the gaps in air temperature (1% missing data), discharge (6% missing data) and water temperature (30% missing data) from 1980 to 2020. Applicable techniques range from linear interpolation to LSTMs based on the measurements of neighboring stations.
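As a point of reference for the simplest technique mentioned above, linear interpolation can be restricted to short gaps so that long outages are left for a stronger model. The following is a minimal sketch, assuming the measurements are held in a pandas Series with NaN marking missing values; the function name fill_short_gaps and the parameter max_gap are hypothetical and not part of any existing project code.

```python
import numpy as np
import pandas as pd


def fill_short_gaps(series: pd.Series, max_gap: int) -> pd.Series:
    """Linearly interpolate only runs of at most `max_gap` consecutive NaNs.

    Longer runs, and runs touching the start or end of the series, stay NaN.
    """
    is_nan = series.isna()
    # Give every run of consecutive identical mask values its own id.
    run_id = (is_nan != is_nan.shift()).cumsum()
    # Length of the NaN run each element belongs to (0 for observed values).
    run_len = is_nan.astype(int).groupby(run_id).transform("sum")
    short = is_nan & (run_len <= max_gap)
    # Interpolate interior gaps only; values are treated as equally spaced.
    interpolated = series.interpolate(method="linear", limit_area="inside")
    # Keep the original everywhere except inside short gaps.
    return series.where(~short, interpolated)


if __name__ == "__main__":
    s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
    print(fill_short_gaps(s, max_gap=1))
```

With max_gap=1 only the single missing value between 1.0 and 3.0 is filled; the run of three missing values is left untouched.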

The fact that sensor data may be erroneous or missing is largely unexplored in hydrologic research. Current approaches either ignore the missing data (e.g. by skipping over it) or fill it with linear interpolation without examining the consequences. Both strategies are valid. By providing a "best guess" for all the missing data, we add a third valid option to this problem.

Goal(s)

  • Provide the best guess for air temperature, discharge and water temperature from 1980 to 2020, even for stations that were built after 1980 (i.e. regress into the past).

Approach

  • Fix the holes in air temperature by training a machine learning model based on its neighbors (only 1% missing data).
  • Fix the holes in discharge.
  • Fix the holes in water temperature. We can provide LSTM architectures and best practices to do so.
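The idea of filling one station's holes from its neighbors can be illustrated with a deliberately simple baseline. The sketch below fits an ordinary least-squares model on the time steps where the target and all neighbors are observed and predicts the target where only the neighbors are observed. It is a stand-in under stated assumptions, not the LSTM architecture referred to above; the function name fill_from_neighbors and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_from_neighbors(df: pd.DataFrame, target: str, neighbors: list) -> pd.Series:
    """Fill NaNs in df[target] with a linear model on the neighbor columns.

    Rows where any neighbor is missing as well are left as NaN.
    """
    X = df[neighbors]
    y = df[target]
    neighbors_ok = X.notna().all(axis=1)
    train = neighbors_ok & y.notna()    # rows usable for fitting
    predict = neighbors_ok & y.isna()   # holes that can be filled

    # Design matrix with an intercept column, solved by least squares.
    A = np.column_stack([np.ones(int(train.sum())), X[train].to_numpy()])
    coef, *_ = np.linalg.lstsq(A, y[train].to_numpy(), rcond=None)

    filled = y.copy()
    if predict.any():
        B = np.column_stack([np.ones(int(predict.sum())), X[predict].to_numpy()])
        filled[predict] = B @ coef
    return filled


if __name__ == "__main__":
    df = pd.DataFrame({
        "station_a": [0.0, 1.0, 2.0, 3.0, 4.0],
        "station_b": [1.0, 0.0, 1.0, 0.0, 1.0],
    })
    df["target"] = 2 * df["station_a"] + 3 * df["station_b"] + 1
    df.loc[2, "target"] = np.nan
    print(fill_from_neighbors(df, "target", ["station_a", "station_b"]))
```

A linear model is likely adequate only where neighboring stations are strongly correlated, which is more plausible for air temperature than for discharge or water temperature.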

Required Skills

This is hands-on work with real-world data. Most challenges will probably arise as the project progresses. The student should be familiar with machine learning and LSTMs. For each method there will be that one hole that requires special treatment. Since the holes are already in the data, there is no simple way to determine which method performs best, and thus decisions have to be based on intuition or specially designed metrics. It is also acceptable to implement a few selected methods and provide one fully completed data set for each method.
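One possible specially designed metric, offered here only as an assumption and not as a prescribed evaluation protocol, is to hide a stretch of observed values, let a method fill it, and compare the result with the hidden truth. The sketch below computes the RMSE over such an artificial gap; the name artificial_gap_rmse and the fill_fn callback are hypothetical.

```python
import numpy as np
import pandas as pd


def artificial_gap_rmse(series: pd.Series, fill_fn, gap_start: int, gap_len: int) -> float:
    """RMSE of `fill_fn` on an artificial gap cut into observed data.

    `fill_fn` takes a Series containing NaNs and returns a filled Series
    of the same length. Positions are integer offsets into `series`.
    """
    window = slice(gap_start, gap_start + gap_len)
    truth = series.iloc[window]
    if truth.isna().any():
        raise ValueError("the artificial gap must cover observed values only")

    masked = series.copy()
    masked.iloc[window] = np.nan        # hide the known values
    filled = fill_fn(masked)

    error = filled.iloc[window].to_numpy() - truth.to_numpy()
    return float(np.sqrt(np.mean(error ** 2)))


if __name__ == "__main__":
    s = pd.Series([0.0, 1.0, 4.0, 9.0, 16.0])
    # Linear interpolation as the method under test.
    print(artificial_gap_rmse(s, lambda x: x.interpolate(), gap_start=2, gap_len=1))
```

Artificial gaps should resemble the real ones in length and season, otherwise the score may say little about performance on the actual holes.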

Remarks

A large number of follow-up questions will certainly arise. For instance, whether to first fill the holes in discharge and then the holes in water temperature, or the other way round. Another example is whether to fix the small holes first, so that longer sequences are available for fixing the larger holes.
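To reason about the order in which holes are fixed, it helps to have an inventory of them first. The following sketch, assuming a pandas Series with NaN for missing values, lists every hole with its start position and length, shortest first. The function name list_gaps is hypothetical, and sorting by length is just one of the possible orderings raised above.

```python
import numpy as np
import pandas as pd


def list_gaps(series: pd.Series) -> pd.DataFrame:
    """Return one row per run of NaNs: integer start position and length,
    sorted from the shortest to the longest hole."""
    is_nan = series.isna().to_numpy()
    gaps = []
    start = None
    for i, missing in enumerate(is_nan):
        if missing and start is None:
            start = i                        # a new hole begins
        elif not missing and start is not None:
            gaps.append((start, i - start))  # the hole ended just before i
            start = None
    if start is not None:                    # hole runs to the end of the series
        gaps.append((start, len(is_nan) - start))

    out = pd.DataFrame(gaps, columns=["start", "length"])
    return out.sort_values("length", kind="stable").reset_index(drop=True)


if __name__ == "__main__":
    s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan, 7.0, np.nan, np.nan])
    print(list_gaps(s))
```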

Further Reading