Medical Data Visualizer

Description

This is one of the projects provided by freeCodeCamp as part of their certification course. The in-depth requirements can be found on their website, but the general idea is to clean the data as instructed and create two visualizations: a categorical plot and a heat map.

Project Detail

The folder structure of the project is as follows:

Medical-Data-Visualizer/
│
├── medical_examination.csv
├── medical_data_visualizer.py
├── main.py
└── test_module.py

The project is run through main.py, and the tests in test_module.py are executed automatically whenever the program runs.

# main.py
# This entrypoint file to be used in development. Start by reading README.md
import medical_data_visualizer
from unittest import main

# Test your function by calling it here
medical_data_visualizer.draw_cat_plot()
medical_data_visualizer.draw_heat_map()

# Run unit tests automatically
main(module='test_module', exit=False)

# test_module.py
import unittest
import medical_data_visualizer
import matplotlib as mpl


# the test case
class CatPlotTestCase(unittest.TestCase):
    def setUp(self):
        self.fig = medical_data_visualizer.draw_cat_plot()
        self.ax = self.fig.axes[0]

    def test_line_plot_labels(self):
        actual = self.ax.get_xlabel()
        expected = "variable"
        self.assertEqual(actual, expected, "Expected line plot xlabel to be 'variable'")
        actual = self.ax.get_ylabel()
        expected = "total"
        self.assertEqual(actual, expected, "Expected line plot ylabel to be 'total'")
        actual = []
        for label in self.ax.get_xaxis().get_majorticklabels():
            actual.append(label.get_text())
        expected = ['active', 'alco', 'cholesterol', 'gluc', 'overweight', 'smoke']
        self.assertEqual(actual, expected, "Expected bar plot secondary x labels to be 'active', 'alco', 'cholesterol', 'gluc', 'overweight', 'smoke'")

    def test_bar_plot_number_of_bars(self):
        actual = len([rect for rect in self.ax.get_children() if isinstance(rect, mpl.patches.Rectangle)])
        expected = 13
        self.assertEqual(actual, expected, "Expected a different number of bars chart.")


class HeatMapTestCase(unittest.TestCase):
    def setUp(self):
        self.fig = medical_data_visualizer.draw_heat_map()
        self.ax = self.fig.axes[0]

    def test_heat_map_labels(self):
        actual = []
        for label in self.ax.get_xticklabels():
          actual.append(label.get_text())
        expected = ['id', 'age', 'sex', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio', 'overweight']
        self.assertEqual(actual, expected, "Expected heat map labels to be 'id', 'age', 'sex', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio', 'overweight'.")

    def test_heat_map_values(self):
        actual = [text.get_text() for text in self.ax.get_default_bbox_extra_artists() if isinstance(text, mpl.text.Text)]
        print(actual)
        expected = ['0.0', '0.0', '-0.0', '0.0', '-0.1', '0.5', '0.0', '0.1', '0.1', '0.3', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.2', '0.1', '0.0', '0.2', '0.1', '0.0', '0.1', '-0.0', '-0.1', '0.1', '0.0', '0.2', '0.0', '0.1', '-0.0', '-0.0', '0.1', '0.0', '0.1', '0.4', '-0.0', '-0.0', '0.3', '0.2', '0.1', '-0.0', '0.0', '0.0', '-0.0', '-0.0', '-0.0', '0.2', '0.1', '0.1', '0.0', '0.0', '0.0', '0.0', '0.3', '0.0', '-0.0', '0.0', '-0.0', '-0.0', '-0.0', '0.0', '0.0', '-0.0', '0.0', '0.0', '0.0', '0.2', '0.0', '-0.0', '0.2', '0.1', '0.3', '0.2', '0.1', '-0.0', '-0.0', '-0.0', '-0.0', '0.1', '-0.1', '-0.1', '0.7', '0.0', '0.2', '0.1', '0.1', '-0.0', '0.0', '-0.0', '0.1']
        self.assertEqual(actual, expected, "Expected different values in heat map.")

if __name__ == "__main__":
    unittest.main()

The contents of the .csv file can be visualized as such:

| id |   age  | sex | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|----|--------|-----|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0  | 18393  |  2  |  168   |  62.0  |  110  |  80   |      1      |  1   |   0   |  0   |   1    |   0    |
| 1  | 20228  |  1  |  156   |  85.0  |  140  |  90   |      3      |  1   |   0   |  0   |   1    |   1    |
| 2  | 18857  |  1  |  165   |  64.0  |  130  |  70   |      3      |  1   |   0   |  0   |   0    |   1    |

Okay, now the logic part 🤣. The file medical_data_visualizer.py contains comments and pre-defined functions where we need to write our logic. As a reminder, the code might not be perfect and could be improved further, but it passed all the tests, so I think I'm safe 🟢:

# medical_data_visualizer.py
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Import data
df = pd.read_csv('medical_examination.csv')

def calc_overweight(bmi):
    if(bmi > 25):
        # is over weight
        return 1
    else:
        # not over weight
        return 0

def normalize(level):
    if level == 1:
        return 0
    elif level > 1:
        return 1

# Add 'overweight' column
height_in_meter = df['height'] * 0.01
bmi = df['weight'] / np.square(height_in_meter)

df['overweight'] = bmi.apply(calc_overweight)

# Normalize data by making 0 always good and 1 always bad. If the value of 'cholesterol' or 'gluc' is 1, make the value 0. If the value is more than 1, make the value 1.
df['cholesterol'] = df['cholesterol'].apply(normalize)
df['gluc'] = df['gluc'].apply(normalize)

# Draw Categorical Plot
def draw_cat_plot():
    # Create DataFrame for cat plot using `pd.melt` using just the values from 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'.
    df_cat = pd.melt(df, id_vars=['cardio'], value_vars=['cholesterol', 'gluc', 'smoke', 'alco', 'active','overweight'])

    # Group and reformat the data to split it by 'cardio'. Show the counts of each feature. You will have to rename one of the columns for the catplot to work correctly.
    df_cat = df_cat.groupby(['cardio', 'variable', 'value'], as_index = False).size().rename(columns={'size' : 'total'}) # in long format

    # Draw the catplot with 'sns.catplot()'
    # hue = creates sub-group based on unique values
    # col = create separate plots based on unique values
    fig = sns.catplot(data=df_cat, x='variable', y='total', kind='bar', hue='value', col='cardio').fig

    # Get the figure for the output
    # fig = plot.figure()

    # Do not modify the next two lines
    fig.savefig('catplot.png')
    return fig


# Draw Heat Map
def draw_heat_map():
    # Clean the data
    df_heat = df[(df['ap_lo'] <= df['ap_hi']) &
                (df['height'] >= df['height'].quantile(0.025)) &
                (df['height'] <= df['height'].quantile(0.975)) &
                (df['weight'] >= df['weight'].quantile(0.025)) &
                (df['weight'] <= df['weight'].quantile(0.975))]

    # Calculate the correlation matrix
    corr = df_heat.corr()

    # Generate a mask for the upper triangle
    mask = np.triu(corr)

    # Set up the matplotlib figure
    fig, ax = plt.subplots(figsize=(10, 8))

    # Draw the heatmap with 'sns.heatmap()'
    ax = sns.heatmap(corr, mask=mask, annot=True, fmt="0.1f", square=True, linewidths=0.5)

    # Do not modify the next two lines
    fig.savefig('heatmap.png')
    return fig

First, I had to add a new column named overweight based on each person's BMI. To do so, I created a function that returns 1 when the BMI is above 25 and 0 otherwise:

def calc_overweight(bmi):
    if(bmi > 25):
        # is over weight
        return 1
    else:
        # not over weight
        return 0

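Then I filled the column with the respective values, computing BMI as weight divided by the square of height in meters (the height column is multiplied by 0.01 to convert it to meters):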

# Add 'overweight' column
height_in_meter = df['height'] * 0.01
bmi = df['weight'] / np.square(height_in_meter)

df['overweight'] = bmi.apply(calc_overweight)
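
The same column can probably be built without a helper function. This is only a small sketch on made-up rows (the toy values below are my assumption, not taken from medical_examination.csv), where the comparison is done on the whole column at once:

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values (not from medical_examination.csv)
toy = pd.DataFrame({"height": [168, 156, 180], "weight": [62.0, 85.0, 70.0]})

# BMI = weight / (height in meters)^2; height * 0.01 converts to meters
bmi = toy["weight"] / np.square(toy["height"] * 0.01)

# The comparison gives booleans; astype(int) maps True -> 1 and False -> 0
toy["overweight"] = (bmi > 25).astype(int)

print(toy)
```

Only the second toy row has a BMI above 25, so it is the only one flagged as overweight.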

Then, I had to normalize the values by making 0 always good and 1 always bad. This was needed for the cholesterol and gluc columns:

def normalize(level):
    if level == 1:
        return 0
    elif level > 1:
        return 1

# Normalize data by making 0 always good and 1 always bad. If the value of 'cholesterol' or 'gluc' is 1, make the value 0. If the value is more than 1, make the value 1.
df['cholesterol'] = df['cholesterol'].apply(normalize)
df['gluc'] = df['gluc'].apply(normalize)
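
As a possible alternative, here is a minimal sketch of the same rule done with a column-wide comparison. The toy values are made up for illustration. One difference from my normalize() function: a value below 1 would become 0 here, while normalize() would return None for it.

```python
import pandas as pd

# Toy rows with made-up values (not from medical_examination.csv)
toy = pd.DataFrame({"cholesterol": [1, 3, 2], "gluc": [1, 1, 3]})

# 1 -> 0 (good), anything above 1 -> 1 (bad)
for col in ["cholesterol", "gluc"]:
    toy[col] = (toy[col] > 1).astype(int)

print(toy)
```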

After this, I moved on to my first visualization, the categorical plot. The process is fairly simple to follow. The dataset needed to be split by cardio so that there is one chart for each cardio value.

# Create DataFrame for cat plot using `pd.melt` using just the values from 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'.
df_cat = pd.melt(df, id_vars=['cardio'], value_vars=['cholesterol', 'gluc', 'smoke', 'alco', 'active','overweight'])

# Group and reformat the data to split it by 'cardio'. Show the counts of each feature. You will have to rename one of the columns for the catplot to work correctly.
df_cat = df_cat.groupby(['cardio', 'variable', 'value'], as_index = False).size().rename(columns={'size' : 'total'}) # in long format

As instructed, I used pd.melt() to create the df_cat dataframe, then grouped it by cardio, variable, and value to count each feature. The categorical plot was then created with seaborn and saved as catplot.png.
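
To see what the melt and groupby steps do, here is a small sketch on a made-up table with only two feature columns (the toy data is my assumption, the column names follow the dataset):

```python
import pandas as pd

# Toy rows with made-up values (not from medical_examination.csv)
toy = pd.DataFrame({
    "cardio": [0, 0, 1, 1],
    "smoke":  [0, 1, 1, 1],
    "alco":   [0, 0, 0, 1],
})

# Wide -> long: one row per (cardio, variable, value) observation
long_df = pd.melt(toy, id_vars=["cardio"], value_vars=["smoke", "alco"])

# Count how often each value occurs per feature and cardio group
counts = (
    long_df.groupby(["cardio", "variable", "value"], as_index=False)
    .size()
    .rename(columns={"size": "total"})
)

print(counts)
```

The result has the columns cardio, variable, value, and total, which is the long format that sns.catplot() takes for x, y, hue, and col.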

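Moving on to the heat map, I first needed to clean the data again as instructed, by filtering out the rows where:

diastolic pressure is higher than systolic (keep the correct data with (df['ap_lo'] <= df['ap_hi']))
height is less than the 2.5th percentile (keep the correct data with (df['height'] >= df['height'].quantile(0.025)))
height is more than the 97.5th percentile (keep the correct data with (df['height'] <= df['height'].quantile(0.975)))
weight is less than the 2.5th percentile (keep the correct data with (df['weight'] >= df['weight'].quantile(0.025)))
weight is more than the 97.5th percentile (keep the correct data with (df['weight'] <= df['weight'].quantile(0.975)))

It was done like this: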

df_heat = df[(df['ap_lo'] <= df['ap_hi']) &
            (df['height'] >= df['height'].quantile(0.025)) &
            (df['height'] <= df['height'].quantile(0.975)) &
            (df['weight'] >= df['weight'].quantile(0.025)) &
            (df['weight'] <= df['weight'].quantile(0.975))]
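
The same kind of filter can be tried on a tiny made-up table. This is just a sketch: the toy values and the within_percentiles() helper are hypothetical and not part of the project code. It uses Series.between(), which includes both bounds by default.

```python
import pandas as pd

# Toy rows with made-up values (not from medical_examination.csv)
toy = pd.DataFrame({
    "ap_hi":  [120, 110, 130, 80, 125],
    "ap_lo":  [80, 70, 85, 120, 90],
    "height": [150, 165, 170, 168, 200],
    "weight": [60.0, 65.0, 70.0, 68.0, 72.0],
})

def within_percentiles(series, low=0.025, high=0.975):
    """Hypothetical helper: True where a value lies between the two percentiles."""
    return series.between(series.quantile(low), series.quantile(high))

keep = (
    (toy["ap_lo"] <= toy["ap_hi"])
    & within_percentiles(toy["height"])
    & within_percentiles(toy["weight"])
)
cleaned = toy[keep]

print(cleaned)
```

Row 3 is dropped because its diastolic pressure is higher than the systolic one, and rows 0 and 4 are dropped because their height and weight fall outside the percentile range.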

Then the correlation matrix was calculated and the upper triangle was masked:

# Calculate the correlation matrix
corr = df_heat.corr()

# Generate a mask for the upper triangle
mask = np.triu(corr)

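The np.triu() function returns a copy of an array with the elements below the k-th diagonal zeroed (k defaults to 0, the main diagonal). Seaborn hides the cells where the mask is True, and the non-zero values kept on and above the diagonal count as True, so only the lower triangle is drawn. Then, the heatmap was created using seaborn and saved as heatmap.png.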

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(10, 8))

# Draw the heatmap with 'sns.heatmap()'
ax = sns.heatmap(corr, mask=mask, annot=True, fmt="0.1f", square=True, linewidths=0.5)

# Do not modify the next two lines
fig.savefig('heatmap.png')
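
To check what the np.triu() mask looks like, here is a small sketch on a made-up 3x3 correlation matrix (the numbers are my assumption). It also builds an explicit boolean mask with np.ones_like(), which should not depend on the correlation values being non-zero:

```python
import numpy as np

# Made-up symmetric correlation matrix, for illustration only
corr = np.array([
    [1.0,  0.2,  0.5],
    [0.2,  1.0, -0.3],
    [0.5, -0.3,  1.0],
])

# Same call as in my code: keeps the values on and above the diagonal
value_mask = np.triu(corr)

# Explicit boolean mask: True on and above the diagonal
bool_mask = np.triu(np.ones_like(corr, dtype=bool))

# Cells that stay visible are the ones where the mask is False
visible = int((~bool_mask).sum())

print(value_mask)
print(bool_mask)
print(visible)
```

For an n x n matrix this leaves n * (n - 1) / 2 visible cells, which is 3 for this toy matrix and 91 for the 14 columns in the project.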

Insights

This project was a bit challenging for me. Personally, I had a hard time understanding the instructions, and out of frustration I gave up working on it and started another project just to refresh myself 😭. These things happen, but you must never give up 💪. Still, the project itself was very helpful, as I got to use new methods that I had never heard of or seen before and got to visualize data in a proper format.

Conclusion

The project was tedious but fun to do. Coming back to it with a fresh mindset did help a lot. My code can surely be improved further, and you are welcome to do so. Or, if you want, you can try it out yourself first and refer to my code, which is available in this repo. See ya later 😁👋.
