整理一下刚做完的coursera专项课程:Introduction to Data Science in Python (密歇根大学)第二周的作业。对我而言这个作业真滴挺难的,小白嘛。最后八道题只做对了六道,只能靠大神的答案才能勉强通过这样子;就算是我做对了的答案,也出现了很多DataFrame的复制,肯定是降低效率的。特此把我的答案和大神的答案放在一起比对~
Part 1
The following code loads the olympics dataset (olympics.csv), which was derrived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.
The columns are organized as # of Summer games, Summer medals, # of Winter games, Winter medals, total # number of games, total # of medals. Use this dataset to answer the questions below.首先载入数据集:
import pandas as pd
import numpy as np
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
if col[:2]=='01':
df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
if col[:2]=='02':
df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
if col[:2]=='03':
df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
if col[:1]=='№':
df.rename(columns={col:'#'+col[1:]}, inplace=True)
names_ids = df.index.str.split('\s\(') # split the index by '('
df.index = names_ids.str[0] # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)
df = df.drop('Totals')
df.head()
数据集大概长这个样子滴
# Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | ID | |
Afghanistan | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 | AFG |
Algeria | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 | ALG |
Argentina | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 | ARG |
Armenia | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 | ARM |
Australasia | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 | ANZ |
Question 1
Which country has won the most gold medals in summer games?
This function should return a single string value.
这是我的答案:
def answer_one():
pdf=df.sort_values(by='Gold',ascending=False)
return pdf.iloc[0].name
answer_one()
#输出为'United States
我的想法是先按夏季金牌数排序,然后取第一个值的index名就是国家的名字。但这个过程中实际上复制了一个DataFrame才实现sort_values()。
这是别人的答案:
def answer_one():
return df['Gold'].idxmax()
answer_one()
查了documentation原来idmax()是这样:
Series.idxmax(axis=None, skipna=True, *args, **kwargs)
Index label of the first occurrence of maximum of values.
Returns:
idxmax : Index of maximum of values
这个函数还有一个DataFrame版本:
DataFrame.idxmax(axis=0, skipna=True)Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
有两个参数:
(1)axis : {0 or ‘index’, 1 or ‘columns’}, default 0
(2)skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
返回 idxmax : Series
直接就返回了index,看起来很好用的样子!
Question 2
Which country had the biggest difference between their summer and winter gold medal counts?
This function should return a single string value.
我的答案:
def answer_two():
df['dif']=np.abs(df['Gold']-df['Gold.1'])
copy=df.sort_values(by='dif',ascending=False)
return copy.iloc[0].name
首先增加一列'dif'计算夏季金牌和冬季金牌差值的绝对值,然后用sort_values按照新的一列排序,取第一个的id。同样复制了一个dataframe,并且还是个增加了一列的dataframe。
别人的答案
def answer_two():
summer = df['Gold']
winter = df['Gold.1']
delt = summer - winter
return delt[delt == max(delt)].index[0]
answer_two()
首先生成了一个Series - delt保存金牌差值,然后用到了boolean mask: delt[delt== max(delt)]获得series其中一项,由于题目要求的是国家的名字是string value,所以用index[0]取index label。经过我的实验这里直接用delt.idxmax() 是可以的。
没有用整个dataframe,而是直接把要用的series取出来操作,很机智。
Question 3
Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count? Only include countries that have won at least 1 gold in both summer and winter. This function should return a single string value.
(Summer Gold−Winter Gold) / Total Gold我的答案:
def answer_three():
s=df[(df['Gold']>0) & (df['Gold.1']>0) ]#这里语法很重要,两个并列的条件每个都要加小括号哟
s['a3']=np.abs((s['Gold']-s['Gold.1']))/s['Gold.2']
s=s.sort_values(by='a3',ascending=False)
return s.iloc[0].name
answer_three()
首先复制了dataframe - s满足夏季金牌数和冬季金牌数同时大于0,给s增加一列a3储存相对金牌差值,然后按照a3排序。过程中(应该是)复制了2个dataframe。回头再好好看一下第一周的内容有一块说到了这里,如果错了的话回来改正。
别人的答案:
def answer_three():
f1 = df[df['Gold']>0]
f2 = f1[f1['Gold.1']>0]
summer = f2['Gold']
winter = f2['Gold.1']
total = summer + winter
relative = (summer - winter) / total
return relative[relative == max(relative)].index[0]
answer_three()
前两行也是要和我一样达到筛选的目的,我认为我的更好,他多复制出了一个dataframe。后面也可以直接用relative = ( f2['Gold'] - f2['Gold.1'] ) / ( f2['Gold'] + f2['Gold.1'] ),这样就少创建了3个series。最后一行依然用到了直接用max取series的最大值而无需排序,也可以用relative.idxmax() 。
Question 4
Write a function that creates a Series called "Points" which is a weighted value where each gold medal (Gold.2) counts for 3 points, silver medals (Silver.2) for 2 points, and bronze medals (Bronze.2) for 1 point. The function should return only the column (a Series object) which you created, with the country names as indices.
This function should return a Series named Points of length 146
我的答案:
def answer_four():
Points=3*df['Gold.2']+2*df['Silver.2']+df['Bronze.2']
return Points
这题很简单我做的没问题,过啦。
Part 2
For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.
The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.
换数据集啦,新的数据集长这样,估计显示在这里格式不会太好看。
SUMLEV | REGION | DIVISION | STATE | COUNTY | STNAME | CTYNAME | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 | |
0 | 40 | 3 | 6 | 1 | 0 | Alabama | Alabama | 4779736 | 4780127 | 4785161 | ... | 0.002295 | -0.193196 | 0.381066 | 0.582002 | -0.467369 | 1.030015 | 0.826644 | 1.383282 | 1.724718 | 0.712594 |
1 | 50 | 3 | 6 | 1 | 1 | Alabama | Autauga County | 54571 | 54571 | 54660 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.592270 | -2.187333 |
2 | 50 | 3 | 6 | 1 | 3 | Alabama | Baldwin County | 182265 | 182265 | 183193 | ... | 14.832960 | 17.647293 | 21.845705 | 19.243287 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 |
3 | 50 | 3 | 6 | 1 | 5 | Alabama | Barbour County | 27457 | 27457 | 27341 | ... | -4.728132 | -2.500690 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 |
4 | 50 | 3 | 6 | 1 | 7 | Alabama | Bibb County | 22915 | 22919 | 22861 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 |
Question 5
Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too...)This function should return a single string value.
我的答案:
def answer_five():
count=census_df.groupby(by='STNAME').count().sort_values(by='SUMLEV',ascending=False)
return count.iloc[0].name
先按照州名STNAME分组,数数,按照'sumlev'降序排列。这里的count是个dataframe。这题做的还可以,过啦。
Question 6
Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP.
This function should return a list of string values.
这是别人的答案:
def answer_six():
f1 = census_df[census_df['SUMLEV'] == 50].groupby('STNAME')['CENSUS2010POP'].apply(lambda x: x.nlargest(3).sum()).nlargest(3).index.values.tolist()
return f1
answer_six()
首先筛选郡级记录,按照州名分类,只取CENSUS2010POP列,然后用了一个lambda函数,用到nlargest()方法。这个方法我是头一次见到,看起来很好用的样子。
Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns.)
e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.
This function should return a single string value.
别人的答案:
def answer_seven():
f1 = census_df[census_df['SUMLEV'] == 50].set_index(['STNAME','CTYNAME']).ix[:,['POPESTIMATE2010','POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']].stack()
f2 = f1.max(level=['STNAME','CTYNAME']) - f1.min(level=['STNAME','CTYNAME'])
return f2.idxmax()[1]
answer_seven()
首先用'SUMLEV==50'选择所有郡级条目,用set_index规定hierarchical index,并选择只展示'POPESTIMATE2010-2015'的列信息。最后的.stack()是什么呢?
python文档里写道DataFrame.stack()的作用是这样的:
Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame having a hierarchical index with a new inner-most level of row labels. The level involved will automatically get sorted.
Parameters:
1. level : int, string, or list of these, default last level
Level(s) to stack, can pass level name
2. dropna : boolean, default True
Whether to drop rows in the resulting Frame/Series with no valid values
如果没有.stack(), f1是一个五列四行,两项hiearchical index:
( POPESTIMATE2010 POPESTIMATE2011 \
STNAME CTYNAME
Alabama Autauga County 54660 55253
Baldwin County 183193 186659
Wisconsin Washburn County 15930 15784
Washington County 131967 132225
POPESTIMATE2012 POPESTIMATE2013 \
STNAME CTYNAME
Alabama Autauga County 55175 55038
Baldwin County 190396 195126
Wisconsin Washburn County 15831 15657
Washington County 132649 132758
POPESTIMATE2014 POPESTIMATE2015
STNAME CTYNAME
Alabama Autauga County 55290 55347
Baldwin County 199713 203709
Wisconsin Washburn County 15678 15552
Washington County 133301 133674
用了.stack()以后,变成这个样子:
STNAME CTYNAME Alabama Autauga County POPESTIMATE2010 54660 POPESTIMATE2011 55253 POPESTIMATE2012 55175 POPESTIMATE2013 55038 POPESTIMATE2014 55290 POPESTIMATE2015 55347 Baldwin County POPESTIMATE2010 183193 POPESTIMATE2011 186659 POPESTIMATE2012 190396 POPESTIMATE2013 195126 POPESTIMATE2014 199713 POPESTIMATE2015 203709
stack()以后最大的好处就是现在一个郡的五个人口值属于同一个Object, 可以用DataFrame.max()和.min()函数了。文档是这样写的:
DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)[source]
This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters:
1.axis : {index (0), columns (1)}
2.skipna : boolean, default True. Exclude NA/null values when computing the result.
3.level : int or level name, default None. If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
4.numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: max : Series or DataFrame (if level specified)
Question 8
In this datafile, the United States is broken up into four regions using the "REGION" column.Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE 2014.
This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_df (sorted ascending by index).
题目要首先确定region,然后返回满足条件a和b的dataframe。我的答案:
def answer_eight():
s=census_df[census_df['REGION'] < 3]
r = s[(s['CTYNAME'].str[:10]=='Washington') & (s['POPESTIMATE2015'] > s['POPESTIMATE2014'])].loc[:,('STNAME','CTYNAME')]
return r
answer_eight()
首先复制了dataframe - s 满足region要求,然后用boolean mask两个条件筛选,以‘washington’开头的条件是我硬数的前十位,最后用.loc[]规定输出的行(全部)和列('a','b')。我觉得还可以。下面看别人的答案:
def answer_eight():
f1 = census_df[(census_df['POPESTIMATE2015'] > census_df['POPESTIMATE2014']) & ((census_df['REGION'] == 1) | (census_df['REGION'] == 2))]
f1 = f1[f1['CTYNAME'].str.startswith('Washington')]
return f1.ix[:,['STNAME', 'CTYNAME']]
answer_eight()
思路相似,这个答案用到了Series.str.startswith('xxxx')的方法,比我硬数10位字符串要好。顺便提一句,这里如果没有.str的话会出问题的,原因不明。。。最后他用的ix[],我用的iloc[],这两个有什么区别?学习一下:
loc 索引的是Index label, 范围包括start和end.
iloc 索引的是Index, 不包括end.
ix 先在index label上索引,索引不到就在index上索引, 不包括end.
所以ix没有iloc和loc明确,我还是比较喜欢用后两个。
新知识总结:
1. Series.idxmax()用来返回最大值的index label
2. boolean mask的使用:
series[series == max(series)]可以用来取最大值那条的dataframe
df[ (df['a'] < 1) & (df['b']==0)]用来得到满足条件的新dataframe
3.DataFrame.nlargest (n,columns)可以取得某列的前n个最大值
4. 用DataFrame.stack()实现pivot,然后就可以用df.max(level=['a','b'])获得hierarchical index a,b上每条记录的最大值了。
5. Series.str.startswith('xxxx')用来验证值是否以xxxx开头。
6.ix像是iloc和loc的综合版,语法依然是ix[ ['r1','r2'], ['c1','c2']]。如果index label是纯数字,建议还是分开使用iloc 和loc。