Robustness for the Analysis of Electronic Health Records

Supervised by: Aylin Tastan

If you are interested in this topic or have further questions, do not hesitate to contact me.

Context/Background/Current State

Healthcare, a multi-dimensional system aimed at maintaining or restoring physical, mental and emotional well-being, requires detailed analysis of the available medical information, such as imaging and laboratory examinations, recordings, and the patient's medical history [1]. As a natural result of the growing amount of medical information and of advances in computer systems, the digitalization of medical data and its analysis have become active research areas in recent years [2,3]. Digital data gives practitioners and doctors easy and time-efficient access, which is crucial for an efficient and accurate assessment of the medical data. In addition to its advantages for data storage, transferring medical data into digital form enables the design of computer-aided medical systems, for example for the diagnosis, classification and treatment of diseases and for tracking their progression [2].

Despite the advantages of electronic health records in medical data analysis, they are not primarily designed for the application of data analysis algorithms or tools. In particular, medical data is often unstructured, heterogeneous with various data representations, not yet aggregated, and partly vague (or even inaccurate) [4]. Beyond these difficulties, medical data may contain missing and/or inconsistent observations, which relate to the veracity characteristic of the data and have a negative impact on the analysis. A further challenge is that medical observations are often subject to outliers and noise [5], which may obscure the valuable data and result in an inaccurate assessment.

Because of the complex structure of medical data, its analysis has been the subject of intensive scientific research and has been studied for decades from different perspectives, see e.g. [6-9]. Among the various medical data analysis tools, machine learning is one of the most popular approaches; it is capable of detecting diseases and predicting potential diseases by automatically extracting the relevant features from the medical data [10]. However, as mentioned earlier, medical data is generally subject to instrumental and environmental noise or missing variables, and training a model on corrupted and/or incomplete data degrades the performance of machine learning algorithms.

A popular way of integrating robustness into medical data analysis is outlier detection, see e.g. [11,12]. Given its broad range of applications and its usefulness in medical data analysis, applying outlier detection algorithms to electronic health records is a promising first step towards understanding existing abnormalities and designing robust medical data analysis tools. In the following, an exemplary project addressing robust electronic health data analysis is introduced.


The aim of this research project is to integrate robustness into the analysis of electronic health records by applying state-of-the-art outlier detection algorithms. A detailed description of the research goals is provided in the following.

  • Understanding fundamental outlier types: To apply outlier detection algorithms effectively, it is important to gain insight into the most common outlier types in the data set. This prior information is also significant for designing new outlier detection algorithms and for choosing appropriate machine learning algorithms.
  • Demonstrating the performance of state-of-the-art outlier detection algorithms on the available medical data set: There is a variety of outlier detection algorithms whose performance depends on the structure of the available data. Therefore, the planned project aims to observe the performance of different outlier detection methods on the existing data set.
  • Developing a preliminary robust machine learning algorithm: Since the purpose of applying outlier detection is to improve the performance of machine learning algorithms in medical data analysis, the final goal is to combine these building blocks and develop a robust machine learning algorithm that can serve as a preliminary design for further research.
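As a purely illustrative example of the simplest outlier type, a single value that lies far from the rest of a measurement series, the following minimal sketch flags such values with the modified z-score based on the median absolute deviation (MAD). The function name, the readings and the threshold of 3.5 are assumptions made for illustration; they are not taken from the SPHN data set, and this is not the set of methods the project prescribes.

```python
from statistics import median

def mad_outlier_flags(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold (hypothetical helper)."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half of the values coincide with the median, so the MAD gives
        # no usable scale; this sketch simply flags nothing in that case.
        return [False] * len(values)
    # The factor 0.6745 makes the MAD comparable to a standard deviation
    # for normally distributed data.
    return [0.6745 * d / mad > threshold for d in abs_dev]

# Made-up laboratory readings; 55.0 imitates a data-entry error.
readings = [4.8, 5.1, 5.0, 4.9, 5.2, 55.0, 5.0, 4.7]
flags = mad_outlier_flags(readings)
print([r for r, is_outlier in zip(readings, flags) if is_outlier])
```

Because the median and the MAD are barely affected by a single extreme value, the scale estimate stays close to that of the clean readings, which is what makes the rule robust compared with a mean and standard deviation based z-score.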


  • Literature review: This step includes a detailed literature search on state-of-the-art outlier detection methods that are applicable to the analysis of medical data sets.
  • Access to the data set: The planned research project is built upon the Swiss Personalized Health Network (SPHN) data set, which can be used jointly by all Swiss universities, research institutions, hospitals and other interested parties. Because of privacy and security concerns as well as legal and ethical issues, the data is kept on a platform that allows users to store and analyze sensitive research data. As a result of these privacy requirements, obtaining access to the data storage and processing platform involves a number of formal procedures, including privacy and security training, preparation of the necessary documents and familiarization with the principles of the online platform.
  • Application of different outlier detection methods: The SPHN data set contains various medical laboratory evaluations that might be subject to noise and/or outliers. To apply machine learning algorithms effectively, a preprocessing step is necessary that attempts to prevent the undesired effects of outliers on the prediction of diseases. This step of the planned project comprises the application of state-of-the-art outlier detection algorithms and a comparative analysis of their effect on the prediction of diseases.
  • Design of a robust machine learning algorithm: The final step of the project is to combine the outlier detection and machine learning algorithms for the prediction of the considered diseases. To this end, the results of different combinations of outlier detection and machine learning methods must be reported.
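The last two steps, outlier detection as preprocessing followed by a learning algorithm, can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions: the learning algorithm is a plain least-squares line fit, outliers are detected by a MAD rule on the residuals of a first fit, and all function names, data values and the threshold are hypothetical. The methods actually compared in the project would be chosen during the literature review.

```python
from statistics import mean, median

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def residual_outliers(xs, ys, slope, intercept, threshold=3.5):
    """Flag points whose residual has a modified z-score above the threshold."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    med = median(residuals)
    abs_dev = [abs(r - med) for r in residuals]
    mad = median(abs_dev)
    if mad == 0:
        # No usable scale estimate; this sketch flags nothing in that case.
        return [False] * len(xs)
    return [0.6745 * d / mad > threshold for d in abs_dev]

# Synthetic data: y = 2x + 1 with one corrupted observation at x = 7.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[7] = 60

# Step 1: fit on all data. Step 2: flag outliers. Step 3: refit on the rest.
naive_slope, naive_intercept = fit_line(xs, ys)
flags = residual_outliers(xs, ys, naive_slope, naive_intercept)
kept = [(x, y) for x, y, bad in zip(xs, ys, flags) if not bad]
robust_slope, robust_intercept = fit_line([x for x, _ in kept], [y for _, y in kept])

print(f"fit on all data:   slope={naive_slope:.3f}, intercept={naive_intercept:.3f}")
print(f"fit after removal: slope={robust_slope:.3f}, intercept={robust_intercept:.3f}")
```

Reporting the fitted parameters (or, for a classifier, the prediction scores) with and without the outlier detection step is one simple way to compare different combinations. Note that a first fit on contaminated data can itself be distorted enough to hide outliers, which is one reason why the choice of detection method matters.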

Required Skills

  • Basic knowledge of graph theory and algorithms.
  • Good programming skills.

In addition to the above requirements, the following are helpful for conducting the project:

  • Experience in biomedical data analysis
  • Experience in the analysis of real-world data sets


As mentioned earlier, the SPHN data set is stored on a platform that allows users to store and analyze electronic health records. An important property of this platform, and hence of the project, is that neither the data nor any output that reveals patients' personal information can be transferred from the platform to elsewhere. Users of the platform are subject to its privacy and security rules, which means that the data analysis must be performed on the platform using the provided software.

Further Reading

  1. S. Dash, S. K. Shakyawar, M. Sharma, and S. Kaushik, “Big data in healthcare: management, analysis and future prospects,” J. Big Data, vol. 6, pp. 1-25, 2019.
  2. S. Shilo, H. Rossman, and E. Segal, “Axes of a revolution: challenges and promises of big data in healthcare,” Nat. Med., vol. 26, pp. 29-38, 2020.
  3. N. Mehta and A. Pandit, “Concurrence of big data analytics and healthcare: A systematic review,” Int. J. Med. Inf., vol. 114, pp. 57-65, 2018.
  4. R. H. Hariri, E. M. Fredericks, and K. M. Bowers, “Uncertainty in big data analytics: survey, opportunities, and challenges,” J. Big Data, vol. 6, pp. 1-16, 2019.
  5. A. Qayyum, J. Qadir, M. Bilal, and A. Al-Fuqaha, “Secure and robust machine learning for healthcare: A survey,” IEEE Access, vol. 14, pp. 156-180, 2020.
  6. D. Ahmedt-Aristizabal, M. A. Armin, S. Denman, C. Fookes, and L. Petersson, “A survey on graph-based deep learning for computational histopathology,” Comput. Med. Imaging Graphics, vol. 95, p. 102027, 2022.
  7. N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, S. Ourselin, M. Sheller, R. M. Summers, A. Trask, D. Xu, M. Baust, and M. J. Cardoso, “The future of digital health with federated learning,” NPJ Digital Med., vol. 3, p. 119, 2020.
  8. P. Schober and T. R. Vetter, “Linear regression in medical research,” Anaesth. Analg., vol. 132, p. 108, 2020.
  9. A. Salcedo-Bernal, M. P. Villamil-Giraldo, and A. D. Moreno-Barbosa, “Clinical data analysis: An opportunity to compare machine learning methods,” Procedia Comput. Sci., vol. 100, pp. 731-738, 2016.
  10. A. Garg and V. Mago, “Role of machine learning in medical research: A survey,” Comput. Sci. Rev., vol. 40, p. 100370, 2021.
  11. A. Smiti, “A critical overview of outlier detection methods,” Comput. Sci. Rev., vol. 38, p. 100306, 2020.
  12. A. Boukerche, L. Zheng, and O. Alfandi, “Outlier detection: Methods, models, and classification,” ACM Comput. Surv., vol. 53, pp. 1-37, 2020.