Medical Data Visualizer

Caisi Huang

Category: Python. Tags: python, data analysis

Article link: https://blog.csdn.net/qq_43920024/article/details/116176581


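This post shows how to use the Pandas library to analyze 1994 US Census data. It covers how to compute key figures such as the race distribution, the average age of men, the share of people with a Bachelor's degree, the percentage of people at different education levels who earn more than 50K, the profile of people who work the fewest hours, the country with the highest share of high earners, and the most common high-income occupation in India. The operations involved include grouped counts, conditional filtering, means, percentages, and the mode.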

Introduction

In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database.

FreeCodeCamp

Code

import pandas as pd


def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')

    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df.groupby("race").count()['age'].sort_values(ascending=False)

    # What is the average age of men?
    average_age_men = round(df[df['sex'] == 'Male']['age'].mean(), 1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round(df[df['education'] == 'Bachelors']['education'].count() / df['education'].count() * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    # Build each Boolean mask once and reuse it below.
    advanced = df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
    rich = df['salary'] == '>50K'
    higher_education = df[advanced]['education'].count()
    lower_education = df[~advanced]['education'].count()

    # percentage with salary >50K
    higher_education_rich = round(df[rich & advanced]['salary'].count() / higher_education * 100, 1)

    lower_education_rich = round(df[rich & ~advanced]['education'].count() / lower_education * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]['salary'].count() 

    rich_percentage = round(num_min_workers / df[(df['hours-per-week'] == min_work_hours)]['hours-per-week'].count() * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    # Reference: https://www.reddit.com/r/FreeCodeCamp/comments/le7ynx/data_analysis_with_python_projects_solving/
    salary = df.loc[df['salary'] == '>50K']['native-country'].value_counts()
    population = df['native-country'].value_counts()
    highest_earning_country = (salary / population).sort_values(ascending=False).index[0]
    highest_earning_country_percentage = round((salary / population * 100).max(), 1)

    # Identify the most popular occupation for those who earn >50K in India.
    top_IN_occupation = df[(df['salary'] == '>50K') & (df['native-country'] == 'India')]['occupation'].mode()[0]
    # print(top_IN_occupation)

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }

DataSet

adult.data.csv

Result

Last

The first few questions naturally called for grouped counting, so I used groupby and count: group the rows with groupby, then tally each group with count.
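As a rough illustration of the grouped count, the sketch below uses a tiny made-up DataFrame (the rows are invented, not taken from adult.data.csv) and compares groupby plus count with the value_counts shortcut. Note that count tallies the non-null values of the selected column, while value_counts tallies occurrences of each value in the column it is called on.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from adult.data.csv.
toy = pd.DataFrame({
    "race": ["White", "Black", "White", "Asian-Pac-Islander", "White", "Black"],
    "age": [39, 50, 38, 53, 28, 37],
})

# Approach used in the post: group by race, count non-null ages, sort descending.
by_group = toy.groupby("race").count()["age"].sort_values(ascending=False)

# Shortcut: value_counts counts each distinct value and sorts descending by default.
by_value_counts = toy["race"].value_counts()

print(by_group.to_dict())
print(by_value_counts.to_dict())
```

On data without missing values the two approaches give the same counts, so which one to use is mostly a matter of readability.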
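Sorting in pandas is done with sort_values. The "values" in the name hints that a counterpart exists (otherwise the method could simply have been called sort), and that counterpart is sort_index, which sorts by the index. The parameter that controls direction also differs from Python's built-in sorting: pandas uses the Boolean ascending, whereas Python's sorted uses reverse, so descending order in pandas is ascending=False.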
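The difference between the two sorting methods, and between ascending and Python's reverse, can be sketched on a small hand-written Series (the labels and counts are placeholders chosen for the example):

```python
import pandas as pd

# Placeholder counts, chosen only to make the ordering easy to follow.
counts = pd.Series({"Black": 2, "White": 3, "Other": 1})

# Sort by the stored values, largest first.
by_value_desc = counts.sort_values(ascending=False)

# Sort by the index labels, alphabetically (ascending=True is the default).
by_index_asc = counts.sort_index()

# Python's built-in sorted expresses the same direction with reverse=True.
py_desc = sorted(counts.tolist(), reverse=True)

print(list(by_value_desc.index))
print(list(by_index_asc.index))
print(py_desc)
```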
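Most of the filtering here is done by putting a Boolean condition inside the indexing brackets: df[df['column'] OPERATOR term]. To combine several conditions, use & and |, which play the role of Python's and and or; wrap each condition in parentheses, because & and | bind more tightly than comparison operators. Another way to filter is loc, which behaves much the same for this purpose: df[(df['salary'] == '>50K')] and df.loc[df['salary'] == '>50K'] return the same result.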
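A minimal sketch of this kind of filtering, on invented rows: each condition yields a Boolean Series (a mask), and masks are combined element-wise with &, | and ~. The plain keywords and / or do not work here, because pandas raises a ValueError when asked for the truth value of a whole Series. The names rich and advanced are just labels chosen for this example.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", ">50K"],
})

# Each comparison produces a Boolean mask with one entry per row.
rich = toy["salary"] == ">50K"
advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])

# Bracket indexing and .loc accept the same mask and select the same rows.
via_brackets = toy[rich & advanced]
via_loc = toy.loc[rich & advanced]

# | keeps rows matching either condition; ~ negates a mask.
either = toy[rich | advanced]
not_advanced = toy[~advanced]

print(via_brackets)
print(len(either), len(not_advanced))
```

Storing a mask in a variable also avoids repeating long chains of == and != comparisons when the same condition is needed several times.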
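By the last two or three questions my knowledge ran out. Only after watching a video did I learn about two methods I had not seen before, value_counts and mode. value_counts counts how many times each value occurs, which feels a bit like a grouped count, while mode returns the most frequent value (or values) of a Series.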
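How value_counts and mode fit together for the last questions can be sketched on made-up rows (the countries and occupations below are placeholders, not results from the real dataset). Dividing one Series by another aligns them by index label, so a country with no high earners ends up as NaN, which max and idxmax skip by default.

```python
import pandas as pd

# Made-up rows; the real answers come from adult.data.csv, not from this table.
toy = pd.DataFrame({
    "native-country": ["India", "India", "India", "Iran", "Iran", "Cuba"],
    "salary": [">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K"],
    "occupation": ["Prof-specialty", "Prof-specialty", "Sales", "Sales", "Sales", "Sales"],
})

# Count high earners per country and all people per country.
rich_per_country = toy.loc[toy["salary"] == ">50K", "native-country"].value_counts()
people_per_country = toy["native-country"].value_counts()

# Division aligns on the country label; Cuba has no high earners, so it becomes NaN.
share = rich_per_country / people_per_country

top_country = share.idxmax()             # label of the largest share, NaN ignored
top_share = round(share.max() * 100, 1)  # the same share as a percentage

# mode returns a Series (there can be ties), so take the first entry.
in_india = toy[(toy["salary"] == ">50K") & (toy["native-country"] == "India")]
top_occupation = in_india["occupation"].mode()[0]

print(top_country, top_share, top_occupation)
```

idxmax is used here in place of sorting and taking index[0]; both pick the label with the largest share.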
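round rounds a number to the given number of decimal places. Note that Python rounds exact ties to the nearest even digit rather than always rounding up, so round(2.5) is 2.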
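A few quick checks of how Python's built-in round behaves. The numbers are arbitrary examples, and the tie and floating-point cases are the ones most likely to surprise:

```python
# Ordinary case: keep one decimal place.
one_decimal = round(66.666, 1)

# Exact ties go to the nearest even digit, not always upward.
tie_down = round(2.5)
tie_up = round(3.5)

# 2.675 cannot be stored exactly as a binary float; the stored value is
# slightly below 2.675, so the result is 2.67 rather than 2.68.
float_case = round(2.675, 2)

print(one_decimal, tie_down, tie_up, float_case)
```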
