Demographic Data Analyzer

Description

This is yet another project provided by freeCodeCamp as part of their certification course. The requirement details can be found on their website, but the general idea of the project is to write code that answers some predefined questions:

How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)
What is the average age of men?
What is the percentage of people who have a Bachelor's degree?
What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
What percentage of people without advanced education make more than 50K?
What is the minimum number of hours a person works per week?
What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
What country has the highest percentage of people that earn >50K and what is that percentage?
Identify the most popular occupation for those who earn >50K in India.

Project Detail

The folder structure is maintained by freeCodeCamp themselves:

Demographic Data Analyzer/
│
├── adult.data.csv
├── demographic_data_analyzer.py
├── main.py
└── test_module.py

As before, the project is started through main.py. The tests are executed automatically in order to check the validity of the code that we write.

# main.py
# This entrypoint file to be used in development. Start by reading README.md
import demographic_data_analyzer
from unittest import main

# Test your function by calling it here
demographic_data_analyzer.calculate_demographic_data()

# Run unit tests automatically
main(module='test_module', exit=False)

# test_module.py
import unittest
import demographic_data_analyzer

class DemographicAnalyzerTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(self):
        self.data = demographic_data_analyzer.calculate_demographic_data(print_data = False)

    def test_race_count(self):
        actual = self.data['race_count'].tolist()
        expected = [27816, 3124, 1039, 311, 271]
        self.assertCountEqual(actual, expected, msg="Expected race count values to be [27816, 3124, 1039, 311, 271]")

    def test_average_age_men(self):
        actual = self.data['average_age_men']
        expected = 39.4
        self.assertAlmostEqual(actual, expected, msg="Expected different value for average age of men.")

    def test_percentage_bachelors(self):
        actual = self.data['percentage_bachelors']
        expected = 16.4 
        self.assertAlmostEqual(actual, expected, msg="Expected different value for percentage with Bachelors degrees.")

    def test_higher_education_rich(self):
        actual = self.data['higher_education_rich']
        expected = 46.5
        self.assertAlmostEqual(actual, expected, msg="Expected different value for percentage with higher education that earn >50K.")

    def test_lower_education_rich(self):
        actual = self.data['lower_education_rich']
        expected = 17.4
        self.assertAlmostEqual(actual, expected, msg="Expected different value for percentage without higher education that earn >50K.")

    def test_min_work_hours(self):
        actual = self.data['min_work_hours']
        expected = 1
        self.assertAlmostEqual(actual, expected, msg="Expected different value for minimum work hours.")     

    def test_rich_percentage(self):
        actual = self.data['rich_percentage']
        expected = 10
        self.assertAlmostEqual(actual, expected, msg="Expected different value for percentage of rich among those who work fewest hours.")   

    def test_highest_earning_country(self):
        actual = self.data['highest_earning_country']
        expected = 'Iran'
        self.assertEqual(actual, expected, "Expected different value for highest earning country.")   

    def test_highest_earning_country_percentage(self):
        actual = self.data['highest_earning_country_percentage']
        expected = 41.9
        self.assertAlmostEqual(actual, expected, msg="Expected different value for highest earning country percentage.")   

    def test_top_IN_occupation(self):
        actual = self.data['top_IN_occupation']
        expected = 'Prof-specialty'
        self.assertEqual(actual, expected, "Expected different value for top occupations in India.")      

if __name__ == "__main__":
    unittest.main()

The first few rows of the csv file look like this:

|    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status     | occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   hours-per-week | native-country   | salary   |
|---:|------:|:-----------------|---------:|:------------|----------------:|:-------------------|:------------------|:---------------|:-------|:-------|---------------:|---------------:|-----------------:|:-----------------|:---------|
|  0 |    39 | State-gov        |    77516 | Bachelors   |              13 | Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 |              0 |               40 | United-States    | <=50K    |
|  1 |    50 | Self-emp-not-inc |    83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband        | White  | Male   |              0 |              0 |               13 | United-States    | <=50K    |
|  2 |    38 | Private          |   215646 | HS-grad     |               9 | Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  3 |    53 | Private          |   234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband        | Black  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  4 |    28 | Private          |   338409 | Bachelors   |              13 | Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 |              0 |               40 | Cuba             | <=50K    |

The main task for us lies inside demographic_data_analyzer.py. The code might not be perfect or efficient, but it works, so I think that is good enough for now ✊🤣:

# demographic_data_analyzer.py
import pandas as pd

def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')

    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = pd.Series(df['race'].value_counts())

    # What is the average age of men?
    average_age_men = round(df[df.sex == 'Male'].age.mean(), 1)

    # What is the percentage of people who have a Bachelor's degree?
    bachelor_count = df.groupby('education').size().Bachelors
    total_count = df.shape[0] # number of data rows (the header line became the column names)
    percentage_bachelors = round((bachelor_count / total_count) * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    higher_education = ((df.education == 'Bachelors') | (df.education == 'Masters') | (df.education == 'Doctorate')).sum()
    lower_education = ((df.education != 'Bachelors') & (df.education != 'Masters') & (df.education != 'Doctorate')).sum()

    higher_education_and_high_pay = (((df.education == 'Bachelors') | (df.education == 'Masters') | (df.education == 'Doctorate')) & (df.salary == '>50K')).sum()
    lower_education_and_high_pay = (((df.education != 'Bachelors') & (df.education != 'Masters') & (df.education != 'Doctorate')) & (df.salary == '>50K')).sum()

    # percentage with salary >50K
    higher_education_rich = round((higher_education_and_high_pay / higher_education)*100, 1)
    lower_education_rich = round((lower_education_and_high_pay / lower_education)*100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = df[df['hours-per-week'] == min_work_hours].shape[0]
    less_work_high_pay = ((df['hours-per-week'] == min_work_hours) & (df.salary == '>50K')).sum()

    rich_percentage = (less_work_high_pay / num_min_workers)*100

    # What country has the highest percentage of people that earn >50K?
    country_earnings = df[df['salary'] == '>50K'].groupby('native-country').size() / df.groupby('native-country').size() # count of highest earning ppl in each country / total pop of each country
    highest_earning_country = country_earnings.idxmax()
    highest_earning_country_percentage = round(country_earnings.max() * 100, 1)

    # Identify the most popular occupation for those who earn >50K in India.
    india_top_earner = df[(df['native-country'] == 'India') & (df['salary'] == '>50K')]
    top_IN_occupation = india_top_earner.groupby('occupation').size().idxmax()

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }

The skeleton of the file was already provided with proper comments, so I just had to write the logic beneath each comment. Starting off, I created a dataframe by reading the csv file. Then the count of each race was computed with the value_counts() method, which returns a Series containing the counts of unique values.
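
To make the value_counts() step concrete, here is a minimal sketch on a made-up toy frame (the rows are invented for illustration and are not taken from adult.data.csv). It suggests that the result is already a Series indexed by the race names, so the extra pd.Series(...) wrapper in my code is harmless but probably not needed.

```python
import pandas as pd

# Toy stand-in for the 'race' column; these values are invented for illustration
toy = pd.DataFrame(
    {"race": ["White", "Black", "White", "Asian-Pac-Islander", "White", "Black"]}
)

# value_counts() returns a Series: index = unique values, values = their counts
race_count = toy["race"].value_counts()

print(type(race_count).__name__)
print(race_count)
```

The counts come back sorted from most to least frequent, which is why the test only checks the values and not their labels.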

Then the average age of men was calculated and rounded to 1 decimal place. After that, the percentage of people who have a Bachelor's degree needed to be calculated:

# What is the percentage of people who have a Bachelor's degree?
bachelor_count = df.groupby('education').size().Bachelors
total_count = df.shape[0] # number of data rows (the header line became the column names)
percentage_bachelors = round((bachelor_count / total_count) * 100, 1)

Initially, I counted the number of people who had a 'Bachelors' degree along with the total count of people. Then, using simple math 😎, the percentage was calculated and rounded off. An important thing to note is that .shape[0] counts only the data rows: read_csv() has already used the first line of the file as the column headers, so that line is not part of the count.
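
A small sketch of the row-count point, assuming a tiny in-memory CSV (the text below is made up and only mimics two columns of the real file). The file has five lines, but the header line becomes the column names, so .shape[0] reports four.

```python
import io
import pandas as pd

# Hypothetical CSV text: 1 header line + 4 data lines
csv_text = "age,education\n39,Bachelors\n50,Bachelors\n38,HS-grad\n53,11th\n"
toy = pd.read_csv(io.StringIO(csv_text))

total_count = toy.shape[0]  # counts data rows only
bachelor_count = (toy["education"] == "Bachelors").sum()
percentage_bachelors = round(bachelor_count / total_count * 100, 1)

print(total_count, percentage_bachelors)  # 4 50.0
```

Summing a boolean comparison is just another way to count matches; it avoids the groupby step but should give the same number.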

Moving on, I needed to calculate the percentage of people with advanced education who make more than 50K and the percentage of people without advanced education who make more than 50K:

higher_education = ((df.education == 'Bachelors') | (df.education == 'Masters') | (df.education == 'Doctorate')).sum()
lower_education = ((df.education != 'Bachelors') & (df.education != 'Masters') & (df.education != 'Doctorate')).sum()

higher_education_and_high_pay = (((df.education == 'Bachelors') | (df.education == 'Masters') | (df.education == 'Doctorate')) & (df.salary == '>50K')).sum()
lower_education_and_high_pay = (((df.education != 'Bachelors') & (df.education != 'Masters') & (df.education != 'Doctorate')) & (df.salary == '>50K')).sum()

# percentage with salary >50K
higher_education_rich = round((higher_education_and_high_pay / higher_education)*100, 1)
lower_education_rich = round((lower_education_and_high_pay / lower_education)*100, 1)

Degrees such as Bachelors, Masters, or Doctorate are considered higher education. So the count of people with higher education and higher pay was calculated by checking whether they had one of these degrees and whether their pay was >50K. A similar process was done for lower education, but this time the count was done for people without higher education. Then once again, simple math was applied: each count was divided by the size of its own group.
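
The same idea can be written more compactly. This is only a sketch of an alternative, not what I submitted: it assumes isin() for the degree check and uses the mean of a boolean Series as the share of True values. The toy rows are invented for illustration.

```python
import pandas as pd

# Invented rows, only to exercise the logic
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "11th", "Doctorate", "HS-grad"],
    "salary":    [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# Build each mask once instead of repeating the comparisons
advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = toy["salary"] == ">50K"

# Mean of a boolean Series = fraction of True values within that group
higher_education_rich = round(rich[advanced].mean() * 100, 1)
lower_education_rich = round(rich[~advanced].mean() * 100, 1)

print(higher_education_rich, lower_education_rich)
```

Here two of the three people with advanced education earn >50K and one of the three without does, so the two percentages are computed against different group sizes, just like in the longer version.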

Then, I had to figure out people who work less but earn more. Sounds like a dream huh? 🤣

# What is the minimum number of hours a person works per week (hours-per-week feature)?
min_work_hours = df['hours-per-week'].min()

# What percentage of the people who work the minimum number of hours per week have a salary of >50K?
num_min_workers = df[df['hours-per-week'] == min_work_hours].shape[0]
less_work_high_pay = ((df['hours-per-week'] == min_work_hours) & (df.salary == '>50K')).sum()

rich_percentage = (less_work_high_pay / num_min_workers)*100

First, I calculated the minimum number of hours per week and then the number of people who worked that minimum. Then, I calculated the count of people who worked the minimum hours but still earned >50K. Then, by applying simple math, the percentage was calculated.
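
Here is a minimal sketch of those steps on invented data (the hours and salaries below are not from the real dataset). It filters the minimum-hours workers first and then takes the share of them earning >50K.

```python
import pandas as pd

# Invented rows: four people work the minimum of 1 hour, one of them earns >50K
toy = pd.DataFrame({
    "hours-per-week": [40, 1, 1, 13, 1, 1],
    "salary": ["<=50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
})

min_work_hours = toy["hours-per-week"].min()

# Keep only the people who work the minimum number of hours
min_workers = toy[toy["hours-per-week"] == min_work_hours]

# Share of those workers earning >50K, as a percentage
rich_percentage = (min_workers["salary"] == ">50K").mean() * 100

print(min_work_hours, len(min_workers), rich_percentage)
```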

The most confusing part for me was to calculate "What country has the highest percentage of people that earn >50K?":

# What country has the highest percentage of people that earn >50K?
country_earnings = df[df['salary'] == '>50K'].groupby('native-country').size() / df.groupby('native-country').size() # count of highest earning ppl in each country / total pop of each country
highest_earning_country = country_earnings.idxmax()
highest_earning_country_percentage = round(country_earnings.max() * 100, 1)

At first I thought the country with the maximum count of people earning >50K was the answer, but as I read the comment again, I realized that the country with the highest count of people earning >50K doesn't necessarily have the highest percentage of people earning >50K. You see where I got confused? 🤣

So, I calculated the percentage by dividing the count of people earning >50K in each country by the total number of people from that country in the dataset. Then, using .idxmax(), I got the name of the country, and I rounded the percentage to the first decimal place using round().
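
The count-versus-percentage confusion is easy to reproduce on a toy frame. The rows below are invented, and grouping a boolean Series by country is an assumption of this sketch, not the exact expression from my solution.

```python
import pandas as pd

# Invented rows: United-States has more >50K earners, Iran has a higher share
toy = pd.DataFrame({
    "native-country": ["United-States"] * 6 + ["Iran"] * 3,
    "salary": [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K",
               ">50K", ">50K", "<=50K"],
})

rich = toy["salary"] == ">50K"
by_country = rich.groupby(toy["native-country"])

rich_counts = by_country.sum()   # how many earn >50K per country
rich_share = by_country.mean()   # what fraction earn >50K per country

print(rich_counts.idxmax())      # country with the largest count
print(rich_share.idxmax())       # country with the largest percentage
print(round(rich_share.max() * 100, 1))
```

The two idxmax() calls disagree on this data, which is exactly the trap I fell into at first.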

Finally, I had to identify the most popular occupation for those who earn >50K in India:

# Identify the most popular occupation for those who earn >50K in India.
india_top_earner = df[(df['native-country'] == 'India') & (df['salary'] == '>50K')]
top_IN_occupation = india_top_earner.groupby('occupation').size().idxmax()

This one is straightforward. I selected the people from India who earn >50K, grouped them by occupation, counted the number of people in each occupation using .size(), and used .idxmax() to get the name of the occupation with the highest count.
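
As a quick check of the idea, here is a sketch on invented rows. It swaps groupby('occupation').size() for value_counts(), which I would expect to give the same counts; treat that swap as an assumption rather than the submitted code.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "native-country": ["India", "India", "India", "Cuba", "India"],
    "salary": [">50K", ">50K", ">50K", ">50K", "<=50K"],
    "occupation": ["Prof-specialty", "Exec-managerial", "Prof-specialty",
                   "Sales", "Sales"],
})

# Keep only people from India who earn >50K
india_rich = toy[(toy["native-country"] == "India") & (toy["salary"] == ">50K")]

# Count each occupation and take the label with the highest count
top_IN_occupation = india_rich["occupation"].value_counts().idxmax()

print(top_IN_occupation)  # Prof-specialty
```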

Insights

This project was fun to do, as I felt like I had been given an important analysis task by someone and was helping them figure out these important questions for better insights 🤣. I learned how to better use DataFrame and Series, which are the basic building blocks of pandas, and I feel more comfortable using them now.

Conclusion

So yeah, this was the project. If you want to follow along, you can find the project requirements here and this is my repo with complete answers. Of course, the code is provided above in this article as well. But be sure to try it out on your own and take help only if you are stuck on a problem 😁.
See ya later 👋.

Journey to Data Science