Time Series Visualizer

Description

This is yet another fun project provided by freeCodeCamp as part of their certification course. A detailed explanation of the requirements can be found on their website, but the general idea is to visualize time series data using a line chart, a [bar chart](atlassian.com/data/charts/bar-chart-complet.. and a box plot.

Project Detail

The folder structure of the project is as follows:

Medical-Data-Visualizer/
│
├── fcc-forum-pageviews.csv
├── time_series_visualizer.py
├── main.py
└── test_module.py

The project is run through main.py, which calls the three plotting functions and then automatically runs the provided test_module.py.

# main.py
# This entrypoint file to be used in development. Start by reading README.md
import time_series_visualizer
from unittest import main

# Test your function by calling it here
time_series_visualizer.draw_line_plot()
time_series_visualizer.draw_bar_plot()
time_series_visualizer.draw_box_plot()

# Run unit tests automatically
main(module='test_module', exit=False)

# test_module.py
import unittest
import time_series_visualizer
import matplotlib as mpl

class DataCleaningTestCase(unittest.TestCase):
    def test_data_cleaning(self):
        actual = int(time_series_visualizer.df.count(numeric_only=True))
        expected = 1238
        self.assertEqual(actual, expected, "Expected DataFrame count after cleaning to be 1238.")

class LinePlotTestCase(unittest.TestCase):
    def setUp(self):
        self.fig = time_series_visualizer.draw_line_plot()
        self.ax = self.fig.axes[0]

    def test_line_plot_title(self):
        actual = self.ax.get_title()
        expected = "Daily freeCodeCamp Forum Page Views 5/2016-12/2019"
        self.assertEqual(actual, expected, "Expected line plot title to be 'Daily freeCodeCamp Forum Page Views 5/2016-12/2019'")

    def test_line_plot_labels(self):
        actual = self.ax.get_xlabel()
        expected = "Date"
        self.assertEqual(actual, expected, "Expected line plot xlabel to be 'Date'")
        actual = self.ax.get_ylabel()
        expected = "Page Views"
        self.assertEqual(actual, expected, "Expected line plot ylabel to be 'Page Views'")

    def test_line_plot_data_quantity(self):
        actual = len(self.ax.lines[0].get_ydata())
        expected = 1238
        self.assertEqual(actual, expected, "Expected number of data points in line plot to be 1238.")


class BarPlotTestCase(unittest.TestCase):
    def setUp(self):
        self.fig = time_series_visualizer.draw_bar_plot()
        self.ax = self.fig.axes[0]

    def test_bar_plot_legend_labels(self):
        actual = []
        for label in self.ax.get_legend().get_texts():
          actual.append(label.get_text())
        expected = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
        self.assertEqual(actual, expected, "Expected bar plot legend labels to be months of the year.")

    def test_bar_plot_labels(self):
        actual = self.ax.get_xlabel()
        expected = "Years"
        self.assertEqual(actual, expected, "Expected bar plot xlabel to be 'Years'")
        actual = self.ax.get_ylabel()
        expected = "Average Page Views"
        self.assertEqual(actual, expected, "Expected bar plot ylabel to be 'Average Page Views'")
        actual = []
        for label in self.ax.get_xaxis().get_majorticklabels():
            actual.append(label.get_text())
        expected = ['2016', '2017', '2018', '2019']
        self.assertEqual(actual, expected, "Expected bar plot secondary labels to be '2016', '2017', '2018', '2019'")

    def test_bar_plot_number_of_bars(self):
        actual = len([rect for rect in self.ax.get_children() if isinstance(rect, mpl.patches.Rectangle)])
        expected = 49
        self.assertEqual(actual, expected, "Expected a different number of bars in bar chart.")


class BoxPlotTestCase(unittest.TestCase):
    def setUp(self):
        self.fig = time_series_visualizer.draw_box_plot()
        self.ax1 = self.fig.axes[0]
        self.ax2 = self.fig.axes[1]

    def test_box_plot_number(self):
        actual = len(self.fig.get_axes())
        expected = 2
        self.assertEqual(actual, expected, "Expected two box plots in figure.")

    def test_box_plot_labels(self):
        actual = self.ax1.get_xlabel()
        expected = "Year"
        self.assertEqual(actual, expected, "Expected box plot 1 xlabel to be 'Year'")
        actual = self.ax1.get_ylabel()
        expected = "Page Views"
        self.assertEqual(actual, expected, "Expected box plot 1 ylabel to be 'Page Views'")
        actual = self.ax2.get_xlabel()
        expected = "Month"
        self.assertEqual(actual, expected, "Expected box plot 2 xlabel to be 'Month'")
        actual = self.ax2.get_ylabel()
        expected = "Page Views"
        self.assertEqual(actual, expected, "Expected box plot 2 ylabel to be 'Page Views'")
        actual = []
        for label in self.ax1.get_xaxis().get_majorticklabels():
            actual.append(label.get_text())
        expected = ['2016', '2017', '2018', '2019']
        self.assertEqual(actual, expected, "Expected box plot 1 secondary labels to be '2016', '2017', '2018', '2019'")
        actual = []
        for label in self.ax2.get_xaxis().get_majorticklabels():
            actual.append(label.get_text())
        expected = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
        self.assertEqual(actual, expected, "Expected box plot 2 secondary labels to be 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'")
        actual = []
        for label in self.ax1.get_yaxis().get_majorticklabels():
            actual.append(label.get_text())
        expected = ['0', '20000', '40000', '60000', '80000', '100000', '120000', '140000', '160000', '180000', '200000']
        self.assertEqual(actual, expected, "Expected box plot 1 secondary labels to be '0', '20000', '40000', '60000', '80000', '100000', '120000', '140000', '160000', '180000', '200000'")

    def test_box_plot_titles(self):
        actual = self.ax1.get_title()
        expected = "Year-wise Box Plot (Trend)"
        self.assertEqual(actual, expected, "Expected box plot 1 title to be 'Year-wise Box Plot (Trend)'")
        actual = self.ax2.get_title()
        expected = "Month-wise Box Plot (Seasonality)"
        self.assertEqual(actual, expected, "Expected box plot 2 title to be 'Month-wise Box Plot (Seasonality)'")

    def test_box_plot_number_of_boxes(self):
        actual = len(self.ax1.lines) / 6 # Every box has 6 lines
        expected = 4
        self.assertEqual(actual, expected, "Expected four boxes in box plot 1")
        actual = len(self.ax2.lines) / 6 # Every box has 6 lines
        expected = 12
        self.assertEqual(actual, expected, "Expected 12 boxes in box plot 2")

if __name__ == "__main__":
    unittest.main()

The contents of the .csv file can be visualized as such:

|    date    | value |
|------------|-------|
| 2016-05-09 |  1201 |
| 2016-05-10 |  2329 |
| 2016-05-11 |  1716 |
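
To see what the date parsing does before touching the real file, here is a minimal sketch that feeds the three sample rows above to pandas from an in-memory string. The `sample_csv` and `sample` names are mine, just for illustration; the real project reads fcc-forum-pageviews.csv instead.

```python
import io

import pandas as pd

# The three sample rows from the table above, written as CSV text.
sample_csv = io.StringIO(
    "date,value\n"
    "2016-05-09,1201\n"
    "2016-05-10,2329\n"
    "2016-05-11,1716\n"
)

# index_col="date" plus parse_dates=True turns the date column
# into a DatetimeIndex instead of leaving it as plain strings.
sample = pd.read_csv(sample_csv, index_col="date", parse_dates=True)

print(type(sample.index).__name__)   # DatetimeIndex
print(sample.index.year.tolist())    # [2016, 2016, 2016]
print(sample["value"].tolist())      # [1201, 2329, 1716]
```

Having a DatetimeIndex is what later makes attributes such as `.year` and `.month_name()` available on the index.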

Now begins the fun part 😆. Like before, the time_series_visualizer.py file contains predefined functions along with comments to help us code.

# time_series_visualizer.py
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# Import data (Make sure to parse dates. Consider setting index column to 'date'.)
df = pd.read_csv('fcc-forum-pageviews.csv', index_col="date" , parse_dates=True)

# Clean data
bottom_threshold = df['value'].quantile(0.025)
top_threshold = df['value'].quantile(0.975)

df = df[(df['value'] > bottom_threshold) & (df['value'] < top_threshold)]

def draw_line_plot():
    # Draw line plot
    fig = plt.figure(figsize=(18,6))
    plt.plot(df,color='red')
    plt.title('Daily freeCodeCamp Forum Page Views 5/2016-12/2019')
    plt.xlabel('Date')
    plt.ylabel('Page Views')

    # Save image and return fig (don't change this part)
    fig.savefig('line_plot.png')
    return fig

def draw_bar_plot():
    # Copy and modify data for monthly bar plot
    df_bar = df.copy(deep=True)

    # month order for df
    custom_month_order = [
        'January', 'February', 'March', 'April',
        'May', 'June', 'July', 'August',
        'September', 'October', 'November', 'December'
    ]

    # getting year and month (January, February, ..., December) from dateTime
    df_bar['year'] = df_bar.index.year
    df_bar['month'] = df_bar.index.month_name()

    # arranging the month
    df_bar['month'] = pd.Categorical(df_bar['month'], categories=custom_month_order, ordered=True)

    # average daily page view for each month grouped by year
    avg_grouped = df_bar.groupby(['year', 'month'])['value'].mean().reset_index(name='Average Page Views')

    # arranging data for the plot
    pivot_avg = avg_grouped.pivot(index='year', columns='month', values='Average Page Views')

    # Draw bar plot
    fig = pivot_avg.plot(kind='bar',figsize=(10, 6)).get_figure()
    plt.legend(title='Months')
    plt.xlabel('Years')
    plt.ylabel('Average Page Views')

    # Save image and return fig (don't change this part)
    fig.savefig('bar_plot.png')
    return fig

def draw_box_plot():
    # Prepare data for box plots (this part is done!)
    df_box = df.copy()
    df_box.reset_index(inplace=True)
    df_box['year'] = [d.year for d in df_box.date]
    df_box['month'] = [d.strftime('%b') for d in df_box.date]

    # month order for plotting
    month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

    fig = plt.figure(figsize=(14,6))

    plt.subplot(1, 2, 1)
    sns.boxplot(x='year', y='value', data=df_box)
    plt.title('Year-wise Box Plot (Trend)')
    plt.xlabel('Year')
    plt.ylabel('Page Views')

    plt.subplot(1, 2, 2)
    sns.boxplot(x='month', y='value', data=df_box, order=month_order)
    plt.title('Month-wise Box Plot (Seasonality)')
    plt.xlabel('Month')
    plt.ylabel('Page Views')

    plt.tight_layout()

    # Save image and return fig (don't change this part)
    fig.savefig('box_plot.png')
    return fig

So initially, I had to import the data and clean it by filtering out the days when the page views were in the top 2.5% (above the 97.5th percentile) or the bottom 2.5% of the dataset.

df = pd.read_csv('fcc-forum-pageviews.csv', index_col="date" , parse_dates=True)

# Clean data
bottom_threshold = df['value'].quantile(0.025)
top_threshold = df['value'].quantile(0.975)

df = df[(df['value'] > bottom_threshold) & (df['value'] < top_threshold)]
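
If the quantile filter feels abstract, this small sketch applies the same idea to made-up data. The `toy` DataFrame (the integers 1 to 100) is a hypothetical stand-in for the page views, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data: the integers 1..100 stand in for daily page views.
toy = pd.DataFrame({"value": range(1, 101)})

# Thresholds below which 2.5% and above which 2.5% of the values fall.
low = toy["value"].quantile(0.025)
high = toy["value"].quantile(0.975)

# Keep only the rows strictly between the two thresholds.
cleaned = toy[(toy["value"] > low) & (toy["value"] < high)]

print(len(toy), len(cleaned))                           # 100 94
print(cleaned["value"].min(), cleaned["value"].max())   # 4 97
```

The three smallest and three largest values are dropped here, which is the same trimming the project applies to the page view counts.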

Then drawing the line plot was fairly simple. I passed in the cleaned dataset, then set the title and the labels for the x and y axes.

def draw_line_plot():
    # Draw line plot
    fig = plt.figure(figsize=(18,6))
    plt.plot(df,color='red')
    plt.title('Daily freeCodeCamp Forum Page Views 5/2016-12/2019')
    plt.xlabel('Date')
    plt.ylabel('Page Views')

    # Save image and return fig (don't change this part)
    fig.savefig('line_plot.png')
    return fig
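
As an alternative, here is a rough sketch of the same plot written with Matplotlib's object-oriented interface, where every call goes through an explicit Axes object instead of the implicit "current" figure. The function name `draw_line_plot_oo` and the `demo` data are my own placeholders (the last two values are invented), and the Agg backend is only there so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame used in the project.
demo = pd.DataFrame(
    {"value": [1201, 2329, 1716, 1900, 2100]},
    index=pd.date_range("2016-05-09", periods=5, freq="D"),
)


def draw_line_plot_oo(data):
    # Create the figure and its single Axes explicitly.
    fig, ax = plt.subplots(figsize=(18, 6))
    ax.plot(data.index, data["value"], color="red")
    ax.set_title("Daily freeCodeCamp Forum Page Views 5/2016-12/2019")
    ax.set_xlabel("Date")
    ax.set_ylabel("Page Views")
    return fig


fig = draw_line_plot_oo(demo)
```

Both styles should produce the same figure; the explicit version is just harder to break when several figures are open at once, as happens when the tests call all three drawing functions.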

Moving on to the bar plot, I copied the original dataset as instructed. Then, I extracted the year and the month from the date index and put them in their own columns. To my surprise, using month names disrupted the sequence of the data: the names are plain strings, so once grouped they sort alphabetically (April, August, ...) rather than in calendar order. So I defined a custom_month_order variable containing the months in calendar order and converted the month column to an ordered categorical, so that the plot would not get messed up.

# month order for df
custom_month_order = [
    'January', 'February', 'March', 'April',
    'May', 'June', 'July', 'August',
    'September', 'October', 'November', 'December'
]

# arranging the month
df_bar['month'] = pd.Categorical(df_bar['month'], categories=custom_month_order, ordered=True)
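
Here is a tiny sketch of why the ordering matters, using a made-up `months` Series rather than the project data. The `calendar_order` list plays the same role as custom_month_order above.

```python
import pandas as pd

# Hypothetical example values, deliberately out of order.
months = pd.Series(["March", "January", "February", "April"])

# Plain strings sort alphabetically, which is not calendar order.
print(sorted(months))  # ['April', 'February', 'January', 'March']

calendar_order = [
    "January", "February", "March", "April",
    "May", "June", "July", "August",
    "September", "October", "November", "December",
]

# An ordered categorical sorts by position in `categories` instead.
ordered = pd.Series(
    pd.Categorical(months, categories=calendar_order, ordered=True)
)
print(ordered.sort_values().tolist())  # ['January', 'February', 'March', 'April']
```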

Then, I had to calculate the average daily page views for each month, grouped by year. I also used pivot() to rearrange the data into one row per year and one column per month before passing it to the plot:

# average daily page view for each month grouped by year
avg_grouped = df_bar.groupby(['year', 'month'])['value'].mean().reset_index(name='Average Page Views')

# arranging data for the plot
pivot_avg = avg_grouped.pivot(index='year', columns='month', values='Average Page Views')
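
To make the reshaping step easier to picture, here is a sketch on a six-row toy table. The `toy` data is invented for illustration, and passing `observed=False` explicitly is my own addition (not in the code above) so that the behaviour with categorical columns does not depend on the pandas version's default.

```python
import pandas as pd

# Hypothetical toy data: two years, two months, a few "daily" values each.
toy = pd.DataFrame(
    {
        "year": [2016, 2016, 2016, 2017, 2017, 2017],
        "month": ["May", "May", "June", "May", "June", "June"],
        "value": [10, 20, 30, 40, 50, 70],
    }
)
toy["month"] = pd.Categorical(
    toy["month"], categories=["May", "June"], ordered=True
)

# Long format: one row per (year, month) pair with its mean value.
avg = (
    toy.groupby(["year", "month"], observed=False)["value"]
    .mean()
    .reset_index(name="Average Page Views")
)

# Wide format: years as rows, months as columns, ready for a grouped bar plot.
wide = avg.pivot(index="year", columns="month", values="Average Page Views")
print(wide)
```

In the wide table each row becomes one group of bars and each column becomes one legend entry, which is why the month order of the columns carries straight through to the legend.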

And finally for the bar plot, I set up the labels and legend title as instructed:

# Draw bar plot
fig = pivot_avg.plot(kind='bar',figsize=(10, 6)).get_figure()
plt.legend(title='Months')
plt.xlabel('Years')
plt.ylabel('Average Page Views')
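
One variation I would consider is keeping the Axes object that pandas returns and setting the labels on it directly, rather than relying on plt to find the current axes. This is only a sketch on an invented 2x2 table (`wide_demo` is a hypothetical stand-in for pivot_avg), with the Agg backend so it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import pandas as pd

# Hypothetical stand-in for the pivoted averages: years as rows, months as columns.
wide_demo = pd.DataFrame(
    {"May": [15.0, 40.0], "June": [30.0, 60.0]},
    index=pd.Index([2016, 2017], name="year"),
)

# DataFrame.plot returns the Axes it drew on, so keep a handle to it.
ax = wide_demo.plot(kind="bar", figsize=(10, 6))
ax.legend(title="Months")
ax.set_xlabel("Years")
ax.set_ylabel("Average Page Views")

fig = ax.get_figure()
print(len(ax.patches))  # 4 bars: 2 years x 2 months
```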

Finally, for the box plot, there was not much to be done as the data was already prepared for us. I just had to go through some documentation on the seaborn box plot, with some help from Stack Overflow 🤭. But like before, I had to arrange the months in sequence for proper plotting.

# month order for plotting
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

fig = plt.figure(figsize=(14,6))

plt.subplot(1, 2, 1)
sns.boxplot(x='year', y='value', data=df_box)
plt.title('Year-wise Box Plot (Trend)')
plt.xlabel('Year')
plt.ylabel('Page Views')

plt.subplot(1, 2, 2)
sns.boxplot(x='month', y='value', data=df_box, order=month_order)
plt.title('Month-wise Box Plot (Seasonality)')
plt.xlabel('Month')
plt.ylabel('Page Views')

plt.tight_layout()
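
The same two-panel layout can also be sketched with plt.subplots and seaborn's `ax` argument, which makes it explicit which box plot lands in which panel. The `df_demo` data below is synthetic (two years of steadily increasing daily values), not the forum data, and the Agg backend is only used so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic stand-in for df_box: two full years of daily values.
dates = pd.date_range("2016-01-01", "2017-12-31", freq="D")
df_demo = pd.DataFrame({"date": dates, "value": range(len(dates))})
df_demo["year"] = df_demo["date"].dt.year
df_demo["month"] = df_demo["date"].dt.strftime("%b")

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# One figure, two side-by-side Axes.
fig, (ax_year, ax_month) = plt.subplots(1, 2, figsize=(14, 6))

sns.boxplot(x="year", y="value", data=df_demo, ax=ax_year)
ax_year.set_title("Year-wise Box Plot (Trend)")
ax_year.set_xlabel("Year")
ax_year.set_ylabel("Page Views")

sns.boxplot(x="month", y="value", data=df_demo, order=month_order, ax=ax_month)
ax_month.set_title("Month-wise Box Plot (Seasonality)")
ax_month.set_xlabel("Month")
ax_month.set_ylabel("Page Views")

fig.tight_layout()
```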

Insights

This project was fairly simple compared to the medical data visualizer project. Go check that blog out if you haven't 😉. But there were some unexpected behaviors, like the disruption of the data's sequence when I converted month numbers to month names. All in all, it was a fun project 👌.

Conclusion

The emphasis on visualization made the project much more fun than I had anticipated. I hope you enjoy solving it as well 💪. And like before, my code is not perfect in any way 😭, and you can definitely improve it. Try it out yourself first, and if nothing works you can go through this blog again or check out my repo. See ya later 👋.
