Analysis of Electronic Health Records using Graph Theory and Algorithms

Supervised by: Aylin Tastan

If you are interested in this topic or have further questions, do not hesitate to contact me.

Context/Background/Current State

Healthcare is a multi-dimensional system aimed at maintaining or restoring physical, mental and emotional well-being. It necessitates detailed analysis of the available medical information, such as imaging and laboratory examinations, recordings and/or the patient's medical history [1]. As a natural result of the growing volume of medical information and of advances in computer systems, the digitalization of medical data and its analysis have become active research areas in recent years [2,3]. Digitalization gives practitioners and doctors easy and time-efficient access to medical data, which is crucial for its efficient and accurate assessment. In addition to its advantages for data storage, transferring medical data into digital form enables the design of computer-aided medical systems, for example for the diagnosis, classification and treatment of diseases and for tracking their progression [2].

Despite the advantages of electronic health records in medical data analysis, they are not primarily designed for the application of data analysis algorithms or tools. In particular, medical data is often unstructured, heterogeneous with various data representations, not yet aggregated, and partly vague (or even inaccurate) [4]. Beyond these difficulties, medical data may contain missing and/or inconsistent observations, an issue related to the veracity of the data, which has a negative impact on the analysis. A further challenge is that medical observations are often subject to outliers and noise [5], which may obscure the valuable data and result in an inaccurate assessment.

Owing to this complex structure, medical data analysis has been studied intensively for decades from different perspectives, see e.g. [6-9]. Among the various medical data analysis tools, machine learning is one of the most popular approaches; it is capable of detecting diseases and predicting potential diseases by automatically extracting the relevant features from the medical data [10]. However, as mentioned earlier, medical data is generally subject to instrumental and environmental noise or missing variables, and training a model on corrupted and/or incomplete data degrades the performance of machine learning algorithms. Another major limitation of machine learning in medicine is that sensitive medical information, comprising patients’ personal information and electronic health records, is subject to privacy attacks [11].

To design medical data analysis algorithms that provide privacy and robustness, it is important to understand the hidden relationships in a given data set. Graphical models are popular tools for representing such hidden associations between different observations, see e.g. [6,12]. Motivated by their broad range of applications in medical data analysis, this project plans to analyze electronic health records using graph theory and algorithms; the main ideas are detailed in the following sections.


The aim of this research project is to analyze electronic health records using graph theory and algorithms while taking privacy and security concerns as well as legal and ethical issues into account. The research goals cover three aspects, which are detailed below:

  • Understanding the relationship between electronic health records and the available diagnoses: Electronic health records include a wide variety of laboratory tests, which makes the analysis of the available medical data challenging. Understanding the relationship between electronic health records and the different diagnoses is crucial for gaining insight into which medical observations, and which of their values, are fundamental for a considered disease. Motivated by this, this work aims to analyze that relationship using graph theory and algorithms.
  • Designing a set of primary graph representations to demonstrate the applicability of graph algorithms to large-scale medical data set analysis: Graphical models are effective tools for learning the hidden relationships in a given data set. However, representing medical data as a graph requires a set of definitions, i.e., vertices, edges and, for weighted graphs, edge weights. Building upon the suitability of graphs for understanding hidden relationships, this work considers the design of a set of primary graph representations associated with the most common medical laboratory records. For every selected medical test, the goal is to determine an informative graph model that provides accurate classification results. The individual graph representation analyses can be broadened to a large-scale medical data set analysis by incorporating their results into a general graph design that includes all medical tests.
  • Analyzing the effect of age and gender on different diseases: In addition to the medical observations, it is important to analyze the effect of a patient's demographic information on the diagnoses while respecting privacy and ethical issues. To determine potential risk groups for the considered diseases, analyzing the effect of age and gender on different diagnoses is another research interest of this work.
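
The definitions required for a graph representation (vertices, edges and edge weights) can be illustrated with a minimal Python sketch. It assumes a bipartite design in which observations and diagnoses form the two vertex sets; the class name, the identifiers such as obs_1 and diagnosis_A, and the numeric values are hypothetical placeholders and are not taken from the SPHN data set or from any fixed design of this project.

```python
from dataclasses import dataclass, field


@dataclass
class ObservationDiagnosisGraph:
    """Hypothetical bipartite graph: observation vertices vs. diagnosis vertices."""
    observations: set = field(default_factory=set)
    diagnoses: set = field(default_factory=set)
    # Maps an (observation, diagnosis) edge to its weight.
    weights: dict = field(default_factory=dict)

    def add_edge(self, observation, diagnosis, weight=1.0):
        # Adding an edge also registers both end points as vertices.
        self.observations.add(observation)
        self.diagnoses.add(diagnosis)
        self.weights[(observation, diagnosis)] = weight

    def degree(self, diagnosis):
        # Number of observation vertices connected to this diagnosis vertex.
        return sum(1 for (_, d) in self.weights if d == diagnosis)


# Simplest case: an unweighted graph, where every edge gets the default weight 1.0.
unweighted = ObservationDiagnosisGraph()
unweighted.add_edge("obs_1", "diagnosis_A")
unweighted.add_edge("obs_2", "diagnosis_A")
unweighted.add_edge("obs_3", "diagnosis_B")

# One possible weighted variant (an assumption): the weight is the measured test value.
weighted = ObservationDiagnosisGraph()
weighted.add_edge("obs_1", "diagnosis_A", weight=7.2)
weighted.add_edge("obs_2", "diagnosis_A", weight=6.8)
weighted.add_edge("obs_3", "diagnosis_B", weight=3.1)
```

Keeping the edge weights in a separate mapping makes it easy to swap in a different weight definition without changing the vertex sets.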


  • Literature review: This step includes a detailed literature search about the state-of-the-art methods that are applicable to medical data set analysis.
  • Access to the data set: The planned research project builds upon the Swiss Personalized Health Network (SPHN) data set, which can be used jointly by all Swiss universities, research institutions, hospitals and other interested parties. Due to privacy and security concerns as well as legal and ethical issues, the data is stored on a platform that allows users to store and analyze sensitive research data. As a result of these privacy requirements, gaining access to the data storage and processing platform requires a number of formal procedures, including privacy and security training, preparation of the necessary documents and an understanding of the principles of the online platform.
  • Graph representation of medical observations: The SPHN data set contains various medical laboratory evaluations whose associated personal information is anonymized. Even though anonymization is useful in terms of privacy and security, it makes the analysis challenging because the information from different observations is not aggregated. Instead of aggregating observations, the planned project aims to represent observations as graph vertices and to understand the hidden relationships between them. To analyze the diagnosis associations of every laboratory test separately, this step starts by determining the most common laboratory tests in the data set. Then, for every selected laboratory test, the associated observations will be represented as graph vertices. To analyze the relationship between laboratory tests and diagnoses independently of the patients' demographic information, these vertices can be chosen from observations that belong to the same gender and age group. Similarly, the most common diagnoses will be identified for every selected laboratory test and will also be represented as graph vertices. Once the vertices are fixed, the next step of the graph construction is to determine the weights of the edges between the vertices corresponding to the selected observations and diagnoses. Starting from the simplest case, an unweighted graph, the goal is to obtain different graph representations based on varying edge weight definitions derived from the values of the medical observations.
  • Summary of results and detailed comparisons: This step of the planned project comprises comparing the computed graph models and understanding the hidden relationships in the data set. To this end, the frequent subgraphs associated with the different graph constructions can be extracted and compared. The extracted frequent subgraphs are assumed to be the ones that point to the relevant diagnoses. In more detail, an extracted frequent subgraph will indicate one of the most commonly observed diagnoses for the selected medical test. Additionally, a statistical analysis of the weights of the edges incident to this commonly observed diagnosis may provide information about the test result values that lead to the considered diagnosis. The analysis can be extended further by comparing the results obtained for different gender and age groups to understand their effect on the commonly observed diagnoses. Finally, the comparisons are to be summarized with the aim of obtaining a general picture of the relationship between the most commonly seen medical tests and diagnoses.
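
The steps above can be sketched in a few lines of Python under strong simplifying assumptions. The records are synthetic toy tuples (not SPHN data), all function and field names are hypothetical, and the test value is used as the edge weight. Ranking diagnosis vertices by their number of incident edges is a deliberately simple stand-in for the frequent subgraph extraction described above, not an implementation of it.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

# Synthetic toy records: (test name, value, gender, age, diagnosis).
RECORDS = [
    ("test_X", 5.1, "F", 34, "diagnosis_A"),
    ("test_X", 7.9, "F", 38, "diagnosis_B"),
    ("test_X", 8.3, "F", 31, "diagnosis_B"),
    ("test_X", 8.8, "M", 36, "diagnosis_B"),
    ("test_Y", 1.2, "F", 35, "diagnosis_A"),
    ("test_X", 4.7, "F", 62, "diagnosis_A"),
]


def most_common_test(records):
    """Return the laboratory test that occurs most often."""
    return Counter(r[0] for r in records).most_common(1)[0][0]


def select_group(records, test, gender, age_range):
    """Keep observations of one test from a single gender and age group."""
    low, high = age_range  # half-open interval [low, high)
    return [r for r in records
            if r[0] == test and r[2] == gender and low <= r[3] < high]


def incident_weights(group):
    """Map each diagnosis vertex to the weights of its incident edges.

    Every observation contributes one edge to its diagnosis; the edge
    weight is the test value (one possible weight definition).
    """
    incident = defaultdict(list)
    for _, value, _, _, diagnosis in group:
        incident[diagnosis].append(value)
    return dict(incident)


def summarize(incident):
    """Degree and basic edge weight statistics per diagnosis vertex."""
    return {
        d: {
            "degree": len(w),
            "mean": mean(w),
            "stdev": stdev(w) if len(w) > 1 else 0.0,
        }
        for d, w in incident.items()
    }


test_name = most_common_test(RECORDS)
group = select_group(RECORDS, test_name, gender="F", age_range=(30, 40))
summary = summarize(incident_weights(group))
# Diagnosis vertex with the most incident edges within the selected group.
top_diagnosis = max(summary, key=lambda d: summary[d]["degree"])
```

Repeating the same steps for other gender and age groups and comparing the resulting summaries corresponds to the demographic comparison mentioned above.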

Required Skills

  • Basic knowledge of graph theory and algorithms.
  • Good programming skills.

In addition to the above requirements, the following are helpful for conducting the project:

  • Experience in biomedical data analysis.
  • Experience in real-world data set analysis.
  • Basic knowledge of statistics.


As mentioned earlier, the SPHN data set is stored on a platform that allows users to store and analyze electronic health records. An important property of this platform, and hence of the project, is that neither the data nor any kind of output that reveals patients’ personal information can be transferred from the platform to elsewhere. Users of the platform are subject to its privacy and security rules, which means that the data analysis must be performed on the platform using the provided software.

Further Reading

  1. S. Dash, S. K. Shakyawar, M. Sharma, and S. Kaushik, “Big data in healthcare: management, analysis and future prospects,” J. Big Data, vol. 6, pp. 1-25, 2019.
  2. S. Shilo, H. Rossman, and E. Segal, “Axes of a revolution: challenges and promises of big data in healthcare,” Nat. Med., vol. 26, pp. 29-38, 2020.
  3. N. Mehta and A. Pandit, “Concurrence of big data analytics and healthcare: A systematic review,” Int. J. Med. Inf., vol. 114, pp. 57-65, 2018.
  4. R. H. Hariri, E. M. Fredericks, and K. M. Bowers, “Uncertainty in big data analytics: survey, opportunities, and challenges,” J. Big Data, vol. 6, pp. 1-16, 2019.
  5. A. Qayyum, J. Qadir, M. Bilal, and A. Al-Fuqaha, “Secure and robust machine learning for healthcare: A survey,” IEEE Rev. Biomed. Eng., vol. 14, pp. 156-180, 2020.
  6. D. Ahmedt-Aristizabal, M. A. Armin, S. Denman, C. Fookes, and L. Petersson, “A survey on graph-based deep learning for computational histopathology,” Comput. Med. Imaging Graphics, vol. 95, p. 102027, 2022.
  7. N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, S. Ourselin, M. Sheller, R. M. Summers, A. Trask, D. Xu, M. Baust, and M. J. Cardoso, “The future of digital health with federated learning,” NPJ Digital Med., vol. 3, p. 119, 2020.
  8. P. Schober and T. R. Vetter, “Linear regression in medical research,” Anaesth. Analg., vol. 132, p. 108, 2020.
  9. A. Salcedo-Bernal, M. P. Villamil-Giraldo, and A. D. Moreno-Barbosa, “Clinical data analysis: An opportunity to compare machine learning methods,” Procedia Comput. Sci., vol. 100, pp. 731-738, 2016.
  10. A. Garg and V. Mago, “Role of machine learning in medical research: A survey,” Comput. Sci. Rev., vol. 40, p. 100370, 2021.
  11. X. Zhang, J. Ding, M. Wu, S. T. C. Wong, H. Van Nguyen, and M. Pan, “Adaptive privacy preserving deep learning algorithms for medical data,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vision, pp. 1169-1178, 2021.
  12. M. M. Li, K. Huang, and M. Zitnik, “Graph representation learning in biomedicine and healthcare,” Nat. Biomed. Eng., vol. 6, pp. 1353-1369, 2022.