<h1 align="center"> HADO CARES: Healthcare Data Analysis with Kedro </h1>

## Overview

HADO_CARES takes a meticulous look at healthcare data, using the Kedro framework to orchestrate a coherent, insightful, and reproducible data science workflow. From initial data scrutiny to sophisticated machine learning models, the project covers the main stages of data analysis, offering a structured, transparent, and replicable methodology encapsulated in a series of Python scripts and Jupyter notebooks.

## Context and Problem

The HADO area of the Santiago de Compostela hospital, specialized in palliative care, manages patient records, among other methods, manually in Excel spreadsheets. This approach, although functional, leads to a lack of standardization in data formats and limited use of the collected information, further hampered by the absence of an efficient system to process and analyze the data in a comprehensive and cohesive manner.

## Objectives

The main objective of this project is to enhance the current patient monitoring process in HADO by:

## Getting Started

### Prerequisites

### Installation & Usage

  1. Clone the repository

    git clone https://github.com/pablovdcf/TFM_HADO_Cares.git
    
  2. Navigate to the project hado folder

    cd TFM_HADO_Cares/hado
    
  3. Install dependencies for Kedro:

    pip install -r src/requirements.txt
    
  4. Install dependencies for the Streamlit app:

    pip install -r hado_app/requirements.txt
    
  5. To run the Kedro pipelines, you will need the following files in the path `hado/data/01_raw/` (a quick way to check that they load is sketched after this list):

    files = ['Estadística 2017.ods', 'Estadistica 2018.ods', 'HADO 19.ods', 'HADO 20.ods', 'HADO 22.ods']
    
  6. Run the Kedro pipeline:

    kedro run
    
  7. Explore notebooks in hado/notebooks for detailed data analysis and visualizations.

  8. Run the Streamlit app:

    streamlit run hado_app/app.py
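
If you want to quickly check that the raw `.ods` files load before running the pipelines, a short pandas sketch like the following can be used (it assumes `pandas` and the `odfpy` engine are available; adjust the file name as needed):

    import pandas as pd

    # Quick sanity check: read one of the raw spreadsheets from hado/data/01_raw/
    df = pd.read_excel("hado/data/01_raw/HADO 19.ods", engine="odf")
    print(df.shape)
    print(df.head())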
    

If you have the data and want to run the notebooks, install the requirements:

    pip install -r notebooks/requirements.txt

## Kedro Pipelines

### 1. Data Preprocessing Pipeline

The data preprocessing pipeline is defined in `data_preprocessing/pipeline.py` and involves several nodes, such as `check_convert_and_concat(**dataframes)`, which are defined in `data_preprocessing/nodes.py`. Parameters are configured in `data_preprocessing.yml`.

Across the different input files, the columns are normalized, renamed, and cleaned of outliers according to the parameters in `data_preprocessing.yml`, and the results are finally concatenated so that the rest of the project works with a single DataFrame.
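
As an illustration, a node such as `check_convert_and_concat` is typically wired into the pipeline roughly as follows; this is a minimal sketch, and the dataset and node names shown here are assumptions rather than the project's actual catalog entries:

    from kedro.pipeline import Pipeline, node, pipeline

    from .nodes import check_convert_and_concat


    def create_pipeline(**kwargs) -> Pipeline:
        return pipeline(
            [
                node(
                    func=check_convert_and_concat,
                    # Hypothetical dataset names; the real ones are registered in the data catalog
                    inputs={
                        "df_2017": "hado_2017",
                        "df_2018": "hado_2018",
                        "df_2019": "hado_2019",
                        "df_2020": "hado_2020",
                        "df_2022": "hado_2022",
                    },
                    outputs="hado_concatenated",
                    name="check_convert_and_concat_node",
                ),
            ]
        )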

(Figure: Kedro-Viz view of the data preprocessing pipeline)

### 2. Data Processing Pipeline

The data processing pipeline, defined in `data_processing/pipeline.py`, includes nodes like `clean_text(df: pd.DataFrame)` and `replace_missing(df: pd.DataFrame, params: Dict)`, which are defined in `data_processing/nodes.py`. Parameters are set in `data_processing.yml`.

New variables are derived from the data, such as indicators for the different drugs used (morphine, for example); the values of the health scales are classified; replacements and mappings are applied to several variables via lists and dictionaries; and both a clean DataFrame and a One-Hot encoded version are prepared for later use in the notebooks and in modeling.
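
To give a rough idea of what such nodes can look like, here is a hedged sketch; the real implementations and the exact parameter keys in `data_processing.yml` may differ:

    from typing import Dict, List

    import pandas as pd


    def replace_missing(df: pd.DataFrame, params: Dict) -> pd.DataFrame:
        """Fill missing values per column using replacements defined in the parameters."""
        # params is assumed to map column names to replacement values
        return df.fillna(value=params)


    def one_hot_encode(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
        """Return a copy of the DataFrame with the selected categorical columns One-Hot encoded."""
        return pd.get_dummies(df, columns=columns)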

(Figure: Kedro-Viz view of the data processing pipeline)

### 3. Data Science Pipeline

The data science pipeline, defined in `data_science/pipeline.py`, encompasses nodes like `preprocess_split_data(data: pd.DataFrame, parameters)`, `train_clf_model(X_train: pd.DataFrame, y_train: pd.Series, model_options: Dict[str, Any])`, and more, which are defined in `data_science/nodes.py`. Parameters are configured in `data_science.yml`.

Finally, Random Forest, XGBoost, and LightGBM classifiers are trained and compared to see which one gives the best result, and the winner is passed through the parameters as the best classification model. The confusion matrices are also extracted. For more details, see `data_science.yml`.
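
As a sketch of how a node like `train_clf_model` can switch between the candidate classifiers via `model_options` (the parameter keys used here are assumptions, and `xgboost`/`lightgbm` must be installed):

    from typing import Any, Dict

    import pandas as pd
    from lightgbm import LGBMClassifier
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier


    def train_clf_model(X_train: pd.DataFrame, y_train: pd.Series,
                        model_options: Dict[str, Any]):
        """Train the classifier selected in the parameters and return the fitted model."""
        classifiers = {
            "random_forest": RandomForestClassifier,
            "xgboost": XGBClassifier,
            "lightgbm": LGBMClassifier,
        }
        # "model" and "model_params" are illustrative keys, not necessarily the project's
        model_class = classifiers[model_options["model"]]
        model = model_class(**model_options.get("model_params", {}))
        model.fit(X_train, y_train)
        return model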

(Figure: Kedro-Viz view of the data science pipeline)

## Streamlit Application

(Screenshot: home page of the Streamlit app)

(Screenshot: data filters in the app)

The app shows a histogram and a boxplot for the numerical variables. (Screenshot: histogram and boxplot)
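
For instance, that view can be reproduced with a few lines of Streamlit and Plotly; this is a simplified sketch in which `df` and the column selector stand in for the app's real data and widgets:

    import pandas as pd
    import plotly.express as px
    import streamlit as st

    df = pd.read_csv("data.csv")  # placeholder for the app's processed dataset

    # Let the user pick a numerical variable, then show its histogram and boxplot
    num_col = st.selectbox("Numerical variable", df.select_dtypes("number").columns)
    st.plotly_chart(px.histogram(df, x=num_col), use_container_width=True)
    st.plotly_chart(px.box(df, y=num_col), use_container_width=True)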

Using a Plotly bubble chart, we can see the evolution over time if we select all the years. (Screenshot: bubble chart)

We can also generate a word cloud by selecting a categorical variable. (Screenshot: word cloud)

With Folium we can display a choropleth map of Galicia at the municipality level.

(Screenshot: choropleth map of Galicia by municipality)
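
A hedged sketch of how such a map can be built with Folium and rendered in Streamlit (the GeoJSON file, its property names, and the `streamlit-folium` dependency are assumptions, not necessarily what the app uses):

    import folium
    import pandas as pd
    from streamlit_folium import st_folium

    # Placeholder inputs: a GeoJSON of Galician municipalities and per-municipality counts
    geojson_path = "galicia_municipalities.geojson"
    counts = pd.DataFrame({"municipality": ["Santiago de Compostela"], "patients": [100]})

    m = folium.Map(location=[42.88, -8.54], zoom_start=8)  # centred near Santiago
    folium.Choropleth(
        geo_data=geojson_path,
        data=counts,
        columns=["municipality", "patients"],
        key_on="feature.properties.municipality",  # assumed GeoJSON property name
        fill_color="YlGnBu",
    ).add_to(m)
    st_folium(m, width=700)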

There are also other metrics for the municipalities, such as the distribution by municipality for the selected year.

(Screenshot: distribution by municipality)

You can visit the app at the link 👉 hado cares app, and check the app documentation.

## Notebooks

### NLP Notebooks

## General Data Description

## Implemented Solutions and Strategies

## Future Challenges and Considerations

## Methodology

<!-- ## Contributing Kindly refer to [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on contributions, and feel free to open issues or pull requests. -->

## Contact