Demographic Data Analyzer

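This post shows how to use the Pandas library to analyze data from the 1994 US Census. It covers how to compute key metrics such as the race distribution, the average age of men, the share of people with a Bachelor's degree, the percentage of people earning more than 50K at different education levels, the characteristics of the people who work the fewest hours, the highest-earning country, and the most common occupation among high earners in India. The data operations involved include grouped counts, conditional filtering, and computing means, percentages, and modes.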

Introduction

In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database.

FreeCodeCamp

Code

import pandas as pd


def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')

    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df.groupby("race").count()['age'].sort_values(ascending=False)

    # What is the average age of men?
    average_age_men = round(df[df['sex'] == 'Male']['age'].mean(), 1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round(df[df['education'] == 'Bachelors']['education'].count() / df['education'].count() * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    # Boolean masks: True for rows with advanced education / with salary >50K
    advanced = df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
    rich = df['salary'] == '>50K'
    higher_education = advanced.sum()
    lower_education = (~advanced).sum()

    # percentage with salary >50K
    higher_education_rich = round((advanced & rich).sum() / higher_education * 100, 1)

    lower_education_rich = round((~advanced & rich).sum() / lower_education * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]['salary'].count() 

    rich_percentage = round(num_min_workers / df[(df['hours-per-week'] == min_work_hours)]['hours-per-week'].count() * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    # Reference: https://www.reddit.com/r/FreeCodeCamp/comments/le7ynx/data_analysis_with_python_projects_solving/
    salary = df.loc[df['salary'] == '>50K']['native-country'].value_counts()
    population = df['native-country'].value_counts()
    highest_earning_country = (salary / population).sort_values(ascending=False).index[0]
    highest_earning_country_percentage = round((salary / population * 100).max(), 1)

    # Identify the most popular occupation for those who earn >50K in India.
    top_IN_occupation = df[(df['salary'] == '>50K') & (df['native-country'] == 'India')]['occupation'].mode()[0]
    # print(top_IN_occupation)

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }
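
The country-level step depends on pandas aligning two Series by index label when one is divided by the other. Below is a minimal sketch of that behaviour on a small made-up DataFrame. The rows and the names `toy`, `share`, `top_country` and `top_share` are illustrative assumptions, not part of the original solution or of adult.data.csv. It also uses `idxmax` as one possible alternative to sorting and taking `index[0]`.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from adult.data.csv.
toy = pd.DataFrame({
    'native-country': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'salary': ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K', '<=50K'],
})

# Number of >50K earners per country, and total number of people per country.
rich_per_country = toy.loc[toy['salary'] == '>50K', 'native-country'].value_counts()
people_per_country = toy['native-country'].value_counts()

# Division aligns on the country label. Country 'C' has no >50K rows,
# so it is missing from rich_per_country and its share becomes NaN.
share = rich_per_country / people_per_country * 100

# idxmax and max both skip NaN by default.
top_country = share.idxmax()
top_share = round(share.max(), 1)

print(top_country, top_share)
```

Countries with no high earners end up as NaN rather than 0 here, which is why they never win the comparison in either this sketch or the sort-based version above.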

DataSet

adult.data.csv

Result

Last

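  1. The first few questions naturally called for grouped counts, so I used groupby and count: group the rows with groupby, then tally them with count.
  2. pandas sorts with sort_values. The "values" in the name implies there are other kinds of sorting, otherwise the method could simply have been called sort. The other kind is sort_index, which sorts by the index. The parameter that controls the direction also differs from Python's (reverse in sorted and list.sort): here it is the Boolean ascending, so descending order is ascending=False.
  3. Most of the filtering on column values here uses a nested DataFrame expression: df[df['column'] OPERATOR term]. Multiple conditions are combined with & and |, which correspond to Python's and and or. Another way to filter is loc, which is not very different from the first: df[(df['salary'] == '>50K')] and df.loc[df['salary'] == '>50K'] give the same result.
  4. By the last two or three questions I had run out of knowledge. Only after watching a video did I learn about two methods I had not seen before, value_counts and mode. value_counts tallies how often each value occurs, which feels a bit like a grouped count, while mode returns the most frequent value(s) of a Series.
  5. round rounds a number to a given number of decimal places.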
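
The notes above can be sketched on a small made-up DataFrame. The rows and variable names below are assumptions for illustration, not census data. The sketch shows that groupby plus count and value_counts give the same tallies, that bracket filtering and loc return the same rows, and how sort_index, mode and round behave.

```python
import pandas as pd

# Toy rows for illustration only; not taken from adult.data.csv.
toy = pd.DataFrame({
    'race': ['White', 'Black', 'White', 'White', 'Black', 'Other'],
    'age': [30, 41, 25, 52, 38, 47],
    'salary': ['>50K', '<=50K', '>50K', '<=50K', '>50K', '<=50K'],
    'occupation': ['Sales', 'Tech', 'Tech', 'Sales', 'Tech', 'Sales'],
})

# Notes 1 and 4: two ways to count rows per group.
by_groupby = toy.groupby('race')['age'].count().sort_values(ascending=False)
by_value_counts = toy['race'].value_counts()

# Note 2: sort_index orders by label instead of by value.
by_label = by_value_counts.sort_index()

# Note 3: combine conditions with & and |, each wrapped in parentheses,
# because & and | bind more tightly than == in Python.
rich_white = toy[(toy['salary'] == '>50K') & (toy['race'] == 'White')]
rich_brackets = toy[toy['salary'] == '>50K']
rich_loc = toy.loc[toy['salary'] == '>50K']

# Note 4: mode() returns a Series (several values can tie), so take element 0.
top_occupation = toy.loc[toy['salary'] == '>50K', 'occupation'].mode()[0]

# Note 5: round to one decimal place.
mean_age = round(toy['age'].mean(), 1)

print(by_label)
print(top_occupation, mean_age)
```

The parentheses around each condition in note 3 matter: without them Python would evaluate `'>50K' & toy['race']` first and raise an error.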

End
