Assignment 2 Pandas Introduction

整理一下刚做完的coursera专项课程:Introduction to Data Science in Python (密歇根大学)第二周的作业。对我而言这个作业真滴挺难的,小白嘛。最后八道题只做对了六道,只能靠大神的答案才能勉强通过这样子;就算是我做对了的答案,也出现了很多DataFrame的复制,肯定是降低效率的。特此把我的答案和大神的答案放在一起比对~

Part 1

The following code loads the olympics dataset (olympics.csv), which was derrived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.

The columns are organized as # of Summer games, Summer medals, # of Winter games, Winter medals, total # number of games, total # of medals. Use this dataset to answer the questions below.

首先载入数据集:

import pandas as pd
import numpy as np
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
if col[:2]=='01':
df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
if col[:2]=='02':
df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
if col[:2]=='03':
df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
if col[:1]=='№':
df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split('\s\(') # split the index by '('

df.index = names_ids.str[0] # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)


df = df.drop('Totals')
df.head()

数据集大概长这个样子滴

 # SummerGoldSilverBronzeTotal# WinterGold.1Silver.1Bronze.1Total.1# GamesGold.2Silver.2Bronze.2Combined totalID
Afghanistan13002200000130022AFG
Algeria1252815300001552815ALG
Argentina23182428701800004118242870ARG
Armenia512912600001112912ARM
Australasia23451200000234512ANZ

Question 1

Which country has won the most gold medals in summer games?

This function should return a single string value.

这是我的答案:

def answer_one():
    pdf=df.sort_values(by='Gold',ascending=False)
    return pdf.iloc[0].name
answer_one()
#输出为'United States

我的想法是先按夏季金牌数排序,然后取第一个值的index名就是国家的名字。但这个过程中实际上复制了一个DataFrame才实现sort_values()。

这是别人的答案:

def answer_one():
    return df['Gold'].idxmax()
answer_one()

查了documentation原来idmax()是这样:

Series.idxmax(axis=None, skipna=True, *args, **kwargs)
Index label of the first occurrence of maximum of values.
Returns:

idxmax : Index of maximum of values

这个函数还有一个DataFrame版本:

DataFrame.idxmax(axis=0, skipna=True)
Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
有两个参数:
(1)axis : {0 or ‘index’, 1 or ‘columns’}, default 0
(2)skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.

返回 idxmax : Series

直接就返回了index,看起来很好用的样子!

Question 2

Which country had the biggest difference between their summer and winter gold medal counts?

This function should return a single string value.

我的答案:

def answer_two():
    df['dif']=np.abs(df['Gold']-df['Gold.1'])
    copy=df.sort_values(by='dif',ascending=False)
return copy.iloc[0].name

首先增加一列'dif'计算夏季金牌和冬季金牌差值的绝对值,然后用sort_values按照新的一列排序,取第一个的id。同样复制了一个dataframe,并且还是个增加了一列的dataframe。

别人的答案

def answer_two():
    summer = df['Gold']
    winter = df['Gold.1']
    delt = summer - winter
    return delt[delt == max(delt)].index[0]
answer_two()

首先生成了一个Series - delt保存金牌差值,然后用到了boolean mask: delt[delt== max(delt)]获得series其中一项,由于题目要求的是国家的名字是string value,所以用index[0]取index label。经过我的实验这里直接用delt.idxmax() 是可以的。

没有用整个dataframe,而是直接把要用的series取出来操作,很机智。

Question 3

Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count? Only include countries that have won at least 1 gold in both summer and winter. This function should return a single string value.

(Summer Gold−Winter Gold) / Total Gold


我的答案:

def answer_three():
    s=df[(df['Gold']>0) & (df['Gold.1']>0) ]#这里语法很重要,两个并列的条件每个都要加小括号哟
    s['a3']=np.abs((s['Gold']-s['Gold.1']))/s['Gold.2']
    s=s.sort_values(by='a3',ascending=False)
return s.iloc[0].name

answer_three()

首先复制了dataframe - s满足夏季金牌数和冬季金牌数同时大于0,给s增加一列a3储存相对金牌差值,然后按照a3排序。过程中(应该是)复制了2个dataframe。回头再好好看一下第一周的内容有一块说到了这里,如果错了的话回来改正。

别人的答案:

def answer_three():
    f1 = df[df['Gold']>0]
    f2 = f1[f1['Gold.1']>0]
    summer = f2['Gold']
    winter = f2['Gold.1']
    total = summer + winter
    relative = (summer - winter) / total
    return relative[relative == max(relative)].index[0]
answer_three()

前两行也是要和我一样达到筛选的目的,我认为我的更好,他多复制出了一个dataframe。后面也可以直接用relative = ( f2['Gold'] - f2['Gold.1'] ) / ( f2['Gold'] + f2['Gold.1'] ),这样就少创建了3个series。最后一行依然用到了直接用max取series的最大值而无需排序,也可以用relative.idxmax() 。 

Question 4

Write a function that creates a Series called "Points" which is a weighted value where each gold medal (Gold.2) counts for 3 points, silver medals (Silver.2) for 2 points, and bronze medals (Bronze.2) for 1 point. The function should return only the column (a Series object) which you created, with the country names as indices.

This function should return a Series named Points of length 146

我的答案:

def answer_four():
    Points=3*df['Gold.2']+2*df['Silver.2']+df['Bronze.2']
return Points

这题很简单我做的没问题,过啦。

Part 2

For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.

The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.

换数据集啦,新的数据集长这样,估计显示在这里格式不会太好看。

 SUMLEVREGIONDIVISIONSTATECOUNTYSTNAMECTYNAMECENSUS2010POPESTIMATESBASE2010POPESTIMATE2010...RDOMESTICMIG2011RDOMESTICMIG2012RDOMESTICMIG2013RDOMESTICMIG2014RDOMESTICMIG2015RNETMIG2011RNETMIG2012RNETMIG2013RNETMIG2014RNETMIG2015
0403610AlabamaAlabama477973647801274785161...0.002295-0.1931960.3810660.582002-0.4673691.0300150.8266441.3832821.7247180.712594
1503611AlabamaAutauga County545715457154660...7.242091-2.915927-3.0123492.265971-2.5307997.606016-2.626146-2.7220022.592270-2.187333
2503613AlabamaBaldwin County182265182265183193...14.83296017.64729321.84570519.24328717.19787215.84417618.55962722.72762620.31714218.293499
3503615AlabamaBarbour County274572745727341...-4.728132-2.500690-7.056824-3.904217-10.543299-4.874741-2.758113-7.167664-3.978583-10.543299
4503617AlabamaBibb County229152291922861...-5.527043-5.068871-6.201001-0.1775370.177258-5.088389-4.363636-5.4037290.7545331.107861
5 rows × 100 columns

Question 5

Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too...)

This function should return a single string value.

我的答案:

def answer_five():
    count=census_df.groupby(by='STNAME').count().sort_values(by='SUMLEV',ascending=False)
return count.iloc[0].name

先按照州名STNAME分组,数数,按照'sumlev'降序排列。这里的count是个dataframe。这题做的还可以,过啦。

Question 6
Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP.
This function should return a list of string values.

这是别人的答案:

def answer_six():
    f1 = census_df[census_df['SUMLEV'] == 50].groupby('STNAME')['CENSUS2010POP'].apply(lambda x: x.nlargest(3).sum()).nlargest(3).index.values.tolist()
    return f1
answer_six()

首先筛选郡级记录,按照州名分类,只取CENSUS2010POP列,然后用了一个lambda函数,用到nlargest()方法。这个方法我是头一次见到,看起来很好用的样子。


Question 7
Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns.)
e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.

This function should return a single string value.

别人的答案:

def answer_seven():
    f1 = census_df[census_df['SUMLEV'] == 50].set_index(['STNAME','CTYNAME']).ix[:,['POPESTIMATE2010','POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']].stack()
    f2 = f1.max(level=['STNAME','CTYNAME']) - f1.min(level=['STNAME','CTYNAME'])
    return f2.idxmax()[1]
answer_seven()

首先用'SUMLEV==50'选择所有郡级条目,用set_index规定hierarchical index,并选择只展示'POPESTIMATE2010-2015'的列信息。最后的.stack()是什么呢?

python文档里写道DataFrame.stack()的作用是这样的:

Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame having a hierarchical index with a new inner-most level of row labels. The level involved will automatically get sorted.

Parameters:
1. level : int, string, or list of these, default last level
Level(s) to stack, can pass level name
2. dropna : boolean, default True

Whether to drop rows in the resulting Frame/Series with no valid values

如果没有.stack(), f1是一个五列四行,两项hiearchical index:

(                              POPESTIMATE2010  POPESTIMATE2011  \
 STNAME    CTYNAME                                                
 Alabama   Autauga County                54660            55253   
           Baldwin County               183193           186659           
 Wisconsin Washburn County               15930            15784   
           Washington County            131967           132225   
 
                               POPESTIMATE2012  POPESTIMATE2013  \
 STNAME    CTYNAME                                                
 Alabama   Autauga County                55175            55038   
           Baldwin County               190396           195126         
 Wisconsin Washburn County               15831            15657   
           Washington County            132649           132758            
                               POPESTIMATE2014  POPESTIMATE2015  
 STNAME    CTYNAME                                               
 Alabama   Autauga County                55290            55347  
           Baldwin County               199713           203709          
 Wisconsin Washburn County               15678            15552  
           Washington County            133301           133674  

 

用了.stack()以后,变成这个样子:

STNAME   CTYNAME                           
 Alabama  Autauga County     POPESTIMATE2010     54660
                             POPESTIMATE2011     55253
                             POPESTIMATE2012     55175
                             POPESTIMATE2013     55038
                             POPESTIMATE2014     55290
                             POPESTIMATE2015     55347
          Baldwin County     POPESTIMATE2010    183193
                             POPESTIMATE2011    186659
                             POPESTIMATE2012    190396
                             POPESTIMATE2013    195126
                             POPESTIMATE2014    199713
                             POPESTIMATE2015    203709

stack()以后最大的好处就是现在一个郡的五个人口值属于同一个Object, 可以用DataFrame.max()和.min()函数了。文档是这样写的:

DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)[source]
This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters:
1.axis : {index (0), columns (1)}
2.skipna : boolean, default True. Exclude NA/null values when computing the result.
3.level : int or level name, default None. If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
4.numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.


Returns: max : Series or DataFrame (if level specified)

Question 8

In this datafile, the United States is broken up into four regions using the "REGION" column.
Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE 2014.
This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_df (sorted ascending by index).


题目要首先确定region,然后返回满足条件a和b的dataframe。我的答案:

def answer_eight():
    s=census_df[census_df['REGION'] < 3]
    r =  s[(s['CTYNAME'].str[:10]=='Washington') & (s['POPESTIMATE2015'] > s['POPESTIMATE2014'])].loc[:,('STNAME','CTYNAME')]
    return  r
answer_eight()

首先复制了dataframe - s 满足region要求,然后用boolean mask两个条件筛选,以‘washington’开头的条件是我硬数的前十位,最后用.loc[]规定输出的行(全部)和列('a','b')。我觉得还可以。下面看别人的答案:

def answer_eight():
    f1 = census_df[(census_df['POPESTIMATE2015'] > census_df['POPESTIMATE2014']) & ((census_df['REGION'] == 1) | (census_df['REGION'] == 2))]
    f1 = f1[f1['CTYNAME'].str.startswith('Washington')]
    return f1.ix[:,['STNAME', 'CTYNAME']]
answer_eight()

思路相似,这个答案用到了Series.str.startswith('xxxx')的方法,比我硬数10位字符串要好。顺便提一句,这里如果没有.str的话会出问题的,原因不明。。。最后他用的ix[],我用的iloc[],这两个有什么区别?学习一下:

loc 索引的是Index label, 范围包括start和end. 
iloc 索引的是Index, 不包括end. 

ix 先在index label上索引,索引不到就在index上索引, 不包括end.

所以ix没有iloc和loc明确,我还是比较喜欢用后两个。


新知识总结

1. Series.idxmax()用来返回最大值的index label

2. boolean mask的使用:

series[series == max(series)]可以用来取最大值那条的dataframe

df[ (df['a'] < 1) & (df['b']==0)]用来得到满足条件的新dataframe

3.DataFrame.nlargest (n,columns)可以取得某列的前n个最大值

4. 用DataFrame.stack()实现pivot,然后就可以用df.max(level=['a','b'])获得hierarchical index a,b上每条记录的最大值了。

5. Series.str.startswith('xxxx')用来验证值是否以xxxx开头。

6.ix像是iloc和loc的综合版,语法依然是ix[ ['r1','r2'], ['c1','c2']]。如果index label是纯数字,建议还是分开使用iloc 和loc。












 




评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值