# Python Libraries for Data Science 01: Pandas

## Basic Usage of the Series and DataFrame Classes

`Series` and `DataFrame` are the two core classes of Pandas. A `Series` holds one-dimensional data, while a `DataFrame` holds two-dimensional data.
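All code below assumes the conventional imports (the article uses the `pd` and `np` aliases without showing them):

```python
import numpy as np   # Pandas is built on top of NumPy
import pandas as pd  # the conventional alias used throughout this article
```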
### The Series Class

#### Creating a Series Object

A `Series` can be created from any array-like object, such as a `list` or a `dict`.
- When creating a `Series` from a `list`, the index defaults to integers starting from 0:

  ```python
  animals = ['Tiger', 'Bear', 'Moose']
  pd.Series(animals)
  ```

  ```
  0    Tiger
  1     Bear
  2    Moose
  dtype: object
  ```
- When creating a `Series` from a `dict`, the dictionary keys become the index:

  ```python
  sports = {'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'}
  s = pd.Series(sports)
  ```

  ```
  Archery           Bhutan
  Golf            Scotland
  Sumo               Japan
  Taekwondo    South Korea
  dtype: object
  ```
The index can also be given explicitly through the `index` parameter of the constructor.

```python
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
```

```
India      Tiger
America     Bear
Canada     Moose
dtype: object
```

If the data passed to the constructor contains the missing value `None`, Pandas automatically converts it to the appropriate sentinel: `NaN` for numeric data, `None` for all other data types.
```python
animals = ['Tiger', 'Bear', None]
pd.Series(animals)
```

```
0    Tiger
1     Bear
2     None
dtype: object
```

```python
numbers = [1, 2, None]
pd.Series(numbers)
```

```
0    1.0
1    2.0
2    NaN
dtype: float64
```
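A caveat worth knowing: `NaN` is a float value that does not compare equal to itself, so missing values must be detected with `Series.isna()` (or `pd.isnull()`) rather than `==`. A small sketch:

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])

# NaN does not equal itself, so == cannot find missing values;
# use isna()/isnull() instead
nan_equals_itself = (np.nan == np.nan)  # False
missing = numbers.isna()                # [False, False, True]
```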
#### Indexing a Series Object

A `Series` can be queried through the `iloc` and `loc` attributes. The former is positional indexing (query by row number); the latter is label indexing (query by label).

```python
sports = {'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'}
s = pd.Series(sports)
'''The resulting Series:
Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object
'''
s.iloc[3]      # 'South Korea'
s.loc['Golf']  # 'Scotland'
```
Note that `iloc` and `loc` are attributes of the object, not methods, so the index value is wrapped in square brackets `[]` rather than parentheses `()`.

Pandas also supports indexing a `Series` directly with `[]`; it infers from the key whether to perform positional or label indexing. This inference can go wrong when the key is an integer (and recent Pandas versions deprecate the positional fallback), so `iloc` and `loc` should be preferred.

```python
s[3]       # equivalent to s.iloc[3]; returns 'South Korea'
s['Golf']  # equivalent to s.loc['Golf']; returns 'Scotland'
```
```python
sports = {99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'}
s = pd.Series(sports)
'''The resulting Series:
99          Bhutan
100       Scotland
101          Japan
102    South Korea
dtype: object
'''
s[0]  # does NOT call s.iloc[0] as one might expect; raises KeyError
```
#### Label Indexes of Series Objects

Unlike a relational database, the label index of a `Series` may contain duplicates.

```python
original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia', 'Barbados', 'Pakistan', 'England'],
                                     index=['Cricket', 'Cricket', 'Cricket', 'Cricket'])
# Series.append does not modify the original object
all_countries = original_sports.append(cricket_loving_countries)
```

Running the code above yields the following `Series`:

```
Archery            Bhutan
Golf             Scotland
Sumo                Japan
Taekwondo     South Korea
Cricket         Australia
Cricket          Barbados
Cricket          Pakistan
Cricket           England
dtype: object
```

As you can see, this `Series` has duplicate labels, and indexing on a duplicated label returns another `Series`:

```python
all_countries.loc['Cricket']
```

```
Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object
```
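Note that `Series.append` was deprecated and removed in pandas 2.0; the same result is obtained with `pd.concat`, which likewise returns a new object and keeps duplicate labels. A sketch:

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
cricket = pd.Series(['Australia', 'Barbados'], index=['Cricket', 'Cricket'])

# pd.concat is the modern replacement for Series.append
all_countries = pd.concat([original_sports, cricket])
```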
#### Applying NumPy Functions to Series Objects

Pandas is built on top of NumPy, so NumPy functions that operate on an `array` can be applied to a `Series` as well.

```python
s = pd.Series([100.00, 120.00, 101.00, 3.00])
np.sum(s)  # 324.0
```
### The DataFrame Class

#### Creating a DataFrame Object

A `DataFrame` can be created directly, or loaded from a CSV file.

- Creating a `DataFrame` directly: pass a list of `Series` or a list of `dict` objects to the constructor; each `Series` or `dict` becomes one row of the `DataFrame`. As with `Series`, the label index can be specified explicitly in the constructor.

  ```python
  purchase_1 = pd.Series({'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50})
  purchase_2 = pd.Series({'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50})
  purchase_3 = pd.Series({'Name': 'Vinod', 'Item Purchased': 'Bird Seed', 'Cost': 5.00})
  inventory_df = pd.DataFrame([purchase_1, purchase_2, purchase_3],
                              index=['Store 1', 'Store 1', 'Store 2'])
  inventory_df.head()  # head() shows the first few rows of a DataFrame
  ```

  The resulting `DataFrame`:

  ```
           Cost Item Purchased   Name
  Store 1  22.5       Dog Food  Chris
  Store 1   2.5   Kitty Litter  Kevyn
  Store 2   5.0      Bird Seed  Vinod
  ```
- Loading a `DataFrame` from a CSV file: the `read_csv()` function reads a CSV file into a `DataFrame`. A handful of parameters tell Pandas how to infer the labels and columns of the data. The Olympics data set below shows how to call `read_csv()`; the source file `olympics.csv` looks like this:

  ```
  0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
  ,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№ Games,01 !,02 !,03 !,Combined total
  Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
  Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
  Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
  Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
  Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
  ......
  Zambia (ZAM) [ZAM],12,0,1,1,2,0,0,0,0,0,12,0,1,1,2
  Zimbabwe (ZIM) [ZIM],12,3,4,1,8,1,0,0,0,0,13,3,4,1,8
  Mixed team (ZZX) [ZZX],3,8,5,4,17,0,0,0,0,0,3,8,5,4,17
  Totals,27,4809,4775,5130,14714,22,959,958,948,2865,49,5768,5733,6078,17579
  ```

  The first row is junk, the second row holds the column names, and the first column is the label index. Accordingly, `read_csv()` is called as follows:

  ```python
  olympics_df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
  ```

  The resulting `DataFrame`:

  ```
                           № Summer  01 !  02 !  03 !  Total  № Winter  01 !.1  ...  Combined total
  Afghanistan (AFG)              13     0     0     2      2         0       0  ...               2
  Algeria (ALG)                  12     5     2     8     15         3       0  ...              15
  Argentina (ARG)                23    18    24    28     70        18       0  ...              70
  Armenia (ARM)                   5     1     2     9     12         6       0  ...              12
  Australasia (ANZ) [ANZ]         2     3     4     5     12         0       0  ...              12
  ...                           ...   ...   ...   ...    ...       ...     ...  ...             ...
  Zimbabwe (ZIM) [ZIM]           12     3     4     1      8         1       0  ...               8
  Mixed team (ZZX) [ZZX]          3     8     5     4     17         0       0  ...              17
  Totals                         27  4809  4775  5130  14714        22     959  ...           17579
  ```
The `index` and `columns` attributes give access to the label index and the columns of a `DataFrame`, respectively.

```python
for col in olympics_df.columns:
    if col[:2] == '01':
        # with inplace=True, the DataFrame is modified in place
        olympics_df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        olympics_df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        olympics_df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        olympics_df.rename(columns={col: '#' + col[1:]}, inplace=True)
```
#### Indexing a DataFrame Object

As with `Series`, the `iloc` and `loc` attributes perform positional and label indexing on a `DataFrame`, respectively; both index by row. Columns can be selected directly with the subscript operator.

```python
inventory_df.iloc[0]            # the first row
inventory_df.loc['Store 2']     # the row(s) labeled 'Store 2'
inventory_df['Item Purchased']  # the column 'Item Purchased'
```

Indexing a `DataFrame` generally returns a view of the original object rather than a copy, which means that modifying the result of an indexing operation may modify the original data.
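Because of this, it is safest to take an explicit `copy()` of a selection before modifying it. A small sketch with made-up data:

```python
import pandas as pd

inventory_df = pd.DataFrame({'Cost': [22.5, 2.5, 5.0]},
                            index=['Store 1', 'Store 1', 'Store 2'])

# copy() guarantees an independent object, so later writes
# cannot leak back into inventory_df
subset = inventory_df.loc['Store 1'].copy()
subset['Cost'] = 0.0
```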
#### Label Indexes of DataFrame Objects

- Setting a single-column label index

  Calling the `set_index()` method of a `DataFrame` sets a column as the label index; calling `reset_index()` removes it.

  ```python
  olympics_df['country'] = olympics_df.index   # keep the old index as a 'country' column
  olympics_df = olympics_df.set_index('Gold')  # make the 'Gold' column the label index
  ```

  The resulting `DataFrame`:

  ```
        # Summer  Silver  Bronze  Total  # Winter  ...  Combined total                  country
  Gold
  0           13       0       2      2         0  ...               2        Afghanistan (AFG)
  5           12       2       8     15         3  ...              15            Algeria (ALG)
  18          23      24      28     70        18  ...              70          Argentina (ARG)
  1            5       2       9     12         6  ...              12            Armenia (ARM)
  3            2       4       5     12         0  ...              12  Australasia (ANZ) [ANZ]
  ...        ...     ...     ...    ...       ...  ...             ...                      ...
  3           12       4       1      8         1  ...               8     Zimbabwe (ZIM) [ZIM]
  8            3       5       4     17         0  ...              17    Mixed team (ZZX) [ZZX]
  4809        27    4775    5130  14714        22  ...           17579                   Totals
  ```
- Setting a multi-column composite label index

  As in a relational database, a `DataFrame` also supports a composite index over several columns: just pass a list of column names to `set_index()`. Label indexing with `loc` must then follow the leftmost-prefix rule. The US census data set below demonstrates composite label indexes.

  ```python
  census_df = pd.read_csv('census.csv')
  columns_to_keep = ['STNAME', 'CTYNAME',
                     'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012',
                     'BIRTHS2013', 'BIRTHS2014', 'BIRTHS2015',
                     'POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                     'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
  census_df = census_df[columns_to_keep]
  ```

  ```
         STNAME         CTYNAME  BIRTHS2010  BIRTHS2011  ...  POPESTIMATE2014  POPESTIMATE2015
  0     Alabama         Alabama       14226       59689  ...          4846411          4858979
  1     Alabama  Autauga County         151         636  ...            55290            55347
  2     Alabama  Baldwin County         517        2187  ...           199713           203709
  3     Alabama  Barbour County          70         335  ...            26815            26489
  4     Alabama     Bibb County          44         266  ...            22549            22583
  5     Alabama   Blount County         183         744  ...            57658            57673
  ...       ...             ...         ...         ...  ...              ...              ...
  3192  Wyoming   Weston County          26          81  ...             7185             7234
  ```

  We build a composite label index on the `STNAME` and `CTYNAME` columns:

  ```python
  census_df = census_df.set_index(['STNAME', 'CTYNAME'])
  ```

  ```
                           BIRTHS2010  BIRTHS2011  ...  POPESTIMATE2014  POPESTIMATE2015
  STNAME  CTYNAME
  Alabama Autauga County          151         636  ...            55290            55347
          Baldwin County          517        2187  ...           199713           203709
          Barbour County           70         335  ...            26815            26489
          Bibb County              44         266  ...            22549            22583
          Blount County           183         744  ...            57658            57673
  ```

  When using `loc` on a composite index, the query must follow the leftmost-prefix rule:

  ```python
  # the statistics for one county
  census_df.loc['Michigan', 'Washtenaw County']
  # the statistics for several counties
  census_df.loc[[('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')]]
  # the statistics for several states
  census_df.loc[['Alabama', 'Michigan']]
  ```
#### Sorting a DataFrame Object

`sort_index()` sorts a `DataFrame` by its index, `sort_values()` sorts it by the values of one or more columns, and `nlargest()` returns the rows with the largest values in a given column.
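A minimal sketch of the three methods on a toy `DataFrame` (the data here is made up):

```python
import pandas as pd

df = pd.DataFrame({'Gold': [18, 5, 1], 'Total': [70, 15, 12]},
                  index=['Argentina', 'Algeria', 'Armenia'])

by_label = df.sort_index()                         # rows ordered by index label
by_gold = df.sort_values('Gold', ascending=False)  # rows ordered by 'Gold', descending
top2 = df.nlargest(2, 'Total')                     # the 2 rows with the largest 'Total'
```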
#### Querying a DataFrame Object

As in MATLAB, a `DataFrame` can be indexed with a Boolean mask, and a Boolean mask can be produced by a conditional expression; together these implement data queries.

```python
olympics_df['Gold'] > 0  # a Series of length 147 containing True/False values
only_gold = olympics_df.where(olympics_df['Gold'] > 0)  # a 147x15 DataFrame
len(olympics_df[(olympics_df['Gold'] > 0) | (olympics_df['Gold.1'] > 0)])  # 101
len(olympics_df[(olympics_df['Gold'] == 0) & (olympics_df['Gold.1'] > 0)])  # 1
```
## Advanced Usage of the DataFrame Class

### Combining DataFrames

#### Adding a New Column to a DataFrame

There are three ways to add a new column to a `DataFrame`.

```python
inventory_df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Sponge', 'Cost': 22.50},
                             {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
                             {'Name': 'Filip', 'Item Purchased': 'Spoon', 'Cost': 5.00}],
                            index=['Store 1', 'Store 1', 'Store 2'])
```

 | Cost | Item Purchased | Name
---|---|---|---
Store 1 | 22.5 | Sponge | Chris
Store 1 | 2.5 | Kitty Litter | Kevyn
Store 2 | 5.0 | Spoon | Filip
- Assign a list or `Series` with the same number of rows to the new column:

  ```python
  inventory_df['Date'] = ['December 1', 'January 1', 'mid-May']
  ```

  ```
           Cost Item Purchased   Name        Date
  Store 1  22.5         Sponge  Chris  December 1
  Store 1   2.5   Kitty Litter  Kevyn   January 1
  Store 2   5.0          Spoon  Filip     mid-May
  ```

- Assign a scalar to the new column; the value is broadcast to every row:

  ```python
  inventory_df['Delivered'] = True
  ```

  ```
           Cost Item Purchased   Name        Date  Delivered
  Store 1  22.5         Sponge  Chris  December 1       True
  Store 1   2.5   Kitty Litter  Kevyn   January 1       True
  Store 2   5.0          Spoon  Filip     mid-May       True
  ```

- Assign a `Series` built from a dict; each value is filled into the rows whose label matches the key:

  ```python
  inventory_df['Date'] = pd.Series({'Store 1': 'December 1', 'Store 2': 'mid-May'})
  ```

  ```
           Cost Item Purchased   Name        Date  Delivered
  Store 1  22.5         Sponge  Chris  December 1       True
  Store 1   2.5   Kitty Litter  Kevyn  December 1       True
  Store 2   5.0          Spoon  Filip     mid-May       True
  ```
#### Merging Two DataFrames

The `pandas.merge()` function performs a join query over two `DataFrame` objects, much like a join in a relational database. The `how` parameter selects the join type, and `left_index`/`left_on` together with `right_index`/`right_on` select the join keys.
```python
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
```
Role | School | |
---|---|---|
Name | ||
James | Grader | Business |
Kelly | Director of HR | NaN |
Mike | NaN | Law |
Sally | Course liasion | Engineering |
```python
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
```
Role | School | |
---|---|---|
Name | ||
Kelly | Director of HR | NaN |
Sally | Course liasion | Engineering |
James | Grader | Business |
```python
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
```
Name | Role | School | |
---|---|---|---|
0 | Kelly | Director of HR | NaN |
1 | Sally | Course liasion | Engineering |
2 | James | Grader | Business |
It is also possible to join on multiple columns:
```python
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
pd.merge(staff_df, student_df, how='inner',
         left_on=['First Name', 'Last Name'], right_on=['First Name', 'Last Name'])
```
First Name | Last Name | Role | School | |
---|---|---|---|---|
0 | Sally | Brooks | Course liasion | Engineering |
If both inputs share a column name that is not a join key, both columns are kept and renamed with distinguishing suffixes.
```python
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
```
Location_x | Name | Role | Location_y | School | |
---|---|---|---|---|---|
0 | State Street | Kelly | Director of HR | NaN | NaN |
1 | Washington Avenue | Sally | Course liasion | 512 Wilson Crescent | Engineering |
2 | Washington Avenue | James | Grader | 1024 Billiard Avenue | Business |
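The default `_x`/`_y` names can be customized through the `suffixes` parameter of `pd.merge()` — a small sketch:

```python
import pandas as pd

staff_df = pd.DataFrame([{'Name': 'Sally', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'Sally', 'Location': '512 Wilson Crescent'}])

# suffixes replaces the default '_x'/'_y' on conflicting column names
merged = pd.merge(staff_df, student_df, on='Name', suffixes=('_staff', '_student'))
```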
### Improving Readability with Method Chaining

Most Pandas operations default to `inplace=False`, meaning they return a copy of the data rather than a view; this is what makes method chaining on a `DataFrame` possible.

Because subscript indexing may return a view of the original data, modifying an indexing result can modify the original data, which is sometimes undesirable. The `DataFrame.where()` method, whose `inplace` parameter defaults to `False`, lets us select a subset by Boolean mask just like subscript indexing, without touching the original data.
```python
# the outer parentheses let the expression span multiple lines
(census_df.where(census_df['SUMLEV'] == 50)
    .dropna()  # where() fills non-matching rows with NaN; dropna() removes them
    .set_index(['STNAME', 'CTYNAME'])
    .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
```
SUMLEV | REGION | DIVISION | STATE | COUNTY | CENSUS2010POP | Estimates Base 2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
STNAME | CTYNAME | |||||||||||||||||||||
Alabama | Autauga County | 50.0 | 3.0 | 6.0 | 1.0 | 1.0 | 54571.0 | 54571.0 | 54660.0 | 55253.0 | 55175.0 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.592270 | -2.187333 |
Baldwin County | 50.0 | 3.0 | 6.0 | 1.0 | 3.0 | 182265.0 | 182265.0 | 183193.0 | 186659.0 | 190396.0 | ... | 14.832960 | 17.647293 | 21.845705 | 19.243287 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 | |
Barbour County | 50.0 | 3.0 | 6.0 | 1.0 | 5.0 | 27457.0 | 27457.0 | 27341.0 | 27226.0 | 27159.0 | ... | -4.728132 | -2.500690 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 | |
Bibb County | 50.0 | 3.0 | 6.0 | 1.0 | 7.0 | 22915.0 | 22919.0 | 22861.0 | 22733.0 | 22642.0 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 | |
Blount County | 50.0 | 3.0 | 6.0 | 1.0 | 9.0 | 57322.0 | 57322.0 | 57373.0 | 57711.0 | 57776.0 | ... | 1.807375 | -1.177622 | -1.748766 | -2.062535 | -1.369970 | 1.859511 | -0.848580 | -1.402476 | -1.577232 | -0.884411 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Wyoming | Weston County | 50.0 | 4.0 | 8.0 | 56.0 | 45.0 | 7208.0 | 7208.0 | 7181.0 | 7114.0 | 7065.0 | ... | -11.752361 | -8.040059 | 12.372583 | 1.533635 | 6.935294 | -12.032179 | -8.040059 | 12.372583 | 1.533635 | 6.935294 |
The chained code above does exactly the same thing as the non-chained code below, at a slightly higher performance cost; we consider that cost well worth the gain in readability.

```python
census_df = census_df[census_df['SUMLEV'] == 50]
census_df.set_index(['STNAME', 'CTYNAME'], inplace=True)
census_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})
```
### Applying a Function to All Rows or Columns with apply()

The `apply()` method applies a function to every row or every column of a `DataFrame`. The following example shows its usage:

```python
# add 'max' and 'min' fields to each row
def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']]
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row

census_df.apply(min_max, axis=1)
```

Here `min_max` is applied to every row of the `DataFrame`, yet `axis` is set to `1` rather than `0`. This is because the `axis` parameter in Pandas names the dimension of the index being consumed: when operating on each row we consume the column index, so `axis` is set to `1`.
Combining `apply()` with a `lambda` function keeps the code compact:

```python
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
        'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
census_df.apply(lambda x: np.max(x[cols]), axis=1)
```
### Group Operations

A group operation in Pandas consists of three steps: split → apply → combine.

For a full treatment of group operations, see the official Pandas documentation.

#### Splitting into Groups with groupby()

As in a relational database, the `DataFrame.groupby()` method splits a `DataFrame` into groups; each group consists of a group label and a sub-`DataFrame`.

The grouping criterion is specified through the `by` or the `level` parameter.
- Pass a string (or list of strings), a dict, or a function to the `by` parameter:

  - If `by` receives a string (or list of strings), rows are grouped by the corresponding column(s).

    ```python
    for group, frame in census_df.groupby(by='STNAME'):
        avg = np.average(frame['CENSUS2010POP'])
        print('Counties in state ' + group + ' have an average population of ' + str(avg))
    ```

    ```
    Counties in state Alabama have an average population of 140580.470588
    Counties in state Alaska have an average population of 47348.7333333
    Counties in state Arizona have an average population of 799002.125
    Counties in state Arkansas have an average population of 76734.6842105
    Counties in state California have an average population of 1262845.9661
    ......
    Counties in state Wyoming have an average population of 46968.8333333
    ```
  - If `by` receives a dict, each index label is looked up in the dict and rows are grouped under the resulting values.

    ```python
    ContinentDict = {'China': 'Asia', 'United States': 'North America', 'Japan': 'Asia',
                     'United Kingdom': 'Europe', 'Russian Federation': 'Europe',
                     'Canada': 'North America', 'Germany': 'Europe', 'India': 'Asia',
                     'France': 'Europe', 'South Korea': 'Asia', 'Italy': 'Europe',
                     'Spain': 'Europe', 'Iran': 'Asia', 'Australia': 'Australia',
                     'Brazil': 'South America'}
    # countries_df is indexed by country name
    for group, frame in countries_df.groupby(by=ContinentDict):
        print('There are ' + str(len(frame)) + ' countries in continent ' + str(group) + '.')
    ```

    ```
    There are 5 countries in continent Asia.
    There are 1 countries in continent Australia.
    There are 6 countries in continent Europe.
    There are 2 countries in continent North America.
    There are 1 countries in continent South America.
    ```
  - If `by` receives a function, rows are grouped by the result of applying that function to each index label.

    ```python
    census_df = census_df.set_index('STNAME')

    def fun(item):
        if item[0] < 'M':
            return 0
        if item[0] < 'Q':
            return 1
        return 2

    for group, frame in census_df.groupby(by=fun):
        print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')
    ```

    ```
    There are 1177 records in group 0 for processing.
    There are 1134 records in group 1 for processing.
    There are 831 records in group 2 for processing.
    ```
- Grouping can also be specified through the `level` parameter, which selects the level(s) of a multi-level index to group by; `level=1` below groups by the second index level, `CTYNAME`.

  ```python
  census_df = census_df.set_index(['STNAME', 'CTYNAME'])
  for group, frame in census_df.groupby(level=1):
      avg = np.average(frame['CENSUS2010POP'])
      print('Counties in state and city ' + str(group) + ' have an average population of ' + str(avg))
  ```

  ```
  Counties in state and city Abbeville County have an average population of 25417.0
  Counties in state and city Acadia Parish have an average population of 61773.0
  Counties in state and city Accomack County have an average population of 33164.0
  Counties in state and city Ada County have an average population of 392365.0
  Counties in state and city Adair County have an average population of 18657.0
  ......
  Counties in state and city Ziebach County have an average population of 2801.0
  ```
#### Combining Groups with agg()

`DataFrame.aggregate()` (or its alias `DataFrame.agg()`) combines the groups. It accepts either a function or a dict whose keys are strings and whose values are functions.

- If a single function is passed, Pandas applies it to every column:

  ```python
  census_df.groupby(by='STNAME').agg(np.average)
  ```

  ```
              SUMLEV  REGION  DIVISION  STATE      COUNTY  CENSUS2010POP  ...  RNETMIG2015
  STNAME
  Alabama         50       3         6      1   67.000000   71339.343284  ...    -1.155698
  Alaska          50       4         9      2  149.655172   24490.724138  ...    -9.125865
  Arizona         50       4         8      4   13.866667  426134.466667  ...     1.128518
  Arkansas        50       3         7      5   75.000000   38878.906667  ...    -3.824365
  California      50       4         9      6   58.000000  642309.586207  ...     1.797060
  ...            ...     ...       ...    ...         ...            ...  ...          ...
  Wyoming         50       4         8     56   23.000000   24505.478261  ...    -1.848371
  ```
当接收到的是一个字典时,Pandas会根据字典的键决定函数行为.
-
若字典的键为
DataFrame
的字段名时,Pandas会将函数作用在对应字段上(census_df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'] .agg({'POPESTIMATE2010': np.average, 'POPESTIMATE2011': np.sum}))
POPESTIMATE2010 POPESTIMATE2011 STNAME Alabama 71420.313433 4801108 Alaska 24621.413793 722720 Arizona 427213.866667 6468732 Arkansas 38965.253333 2938538 California 643691.017241 37700034 … … … Wyoming 24544.173913 567768 -
若字典的键不是
DataFrame
的字段名,Pandas会对所有字段应用该函数并将其存入对应字段中.(census_df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'] .agg({'avg': np.average, 'sum': np.sum}))
avg sum POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2010 POPESTIMATE2011 STNAME Alabama 71420.313433 71658.328358 4785161 4801108 Alaska 24621.413793 24921.379310 714021 722720 Arizona 427213.866667 431248.800000 6408208 6468732 Arkansas 38965.253333 39180.506667 2922394 2938538 California 643691.017241 650000.586207 37334079 37700034 … … … … … Wyoming 24544.173913 24685.565217 564516 567768
- Of course, groups can also be combined directly with the built-in aggregation methods. The most commonly used ones are:
Function | Description |
---|---|
mean() | Compute mean of groups |
sum() | Compute sum of group values |
size() | Compute group sizes |
count() | Compute count of group |
std() | Standard deviation of groups |
var() | Compute variance of groups |
sem() | Standard error of the mean of groups |
describe() | Generates descriptive statistics |
first() | Compute first of group values |
last() | Compute last of group values |
nth() | Take nth value, or a subset if n is a list |
min() | Compute min of group values |
max() | Compute max of group values |
### Pivot Tables

A pivot table is generated with the `DataFrame.pivot_table()` method. The `cars.csv` data set below demonstrates the syntax.

```python
cars_df = pd.read_csv('cars.csv')
```
YEAR | Make | Model | Size | (kW) | Unnamed: 5 | TYPE | CITY (kWh/100 km) | HWY (kWh/100 km) | COMB (kWh/100 km) | CITY (Le/100 km) | HWY (Le/100 km) | COMB (Le/100 km) | (g/km) | RATING | (km) | TIME (h) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | n/a | 100 | 7 |
1 | 2012 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | n/a | 117 | 7 |
2 | 2013 | FORD | FOCUS ELECTRIC | COMPACT | 107 | A1 | B | 19.0 | 21.1 | 20.0 | 2.1 | 2.4 | 2.2 | 0 | n/a | 122 | 4 |
3 | 2013 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | n/a | 100 | 7 |
4 | 2013 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | n/a | 117 | 7 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
52 | 2016 | TESLA | MODEL X P90D | SUV - STANDARD | 568 | A1 | B | 23.6 | 23.3 | 23.5 | 2.7 | 2.6 | 2.6 | 0 | 10 | 402 | 12 |
The pivot table is generated as follows:

```python
cars_df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=np.mean)
```
Make | BMW | CHEVROLET | FORD | KIA | MITSUBISHI | NISSAN | SMART | TESLA |
---|---|---|---|---|---|---|---|---|
YEAR | ||||||||
2012 | NaN | NaN | NaN | NaN | 49.0 | 80.0 | NaN | NaN |
2013 | NaN | NaN | 107.0 | NaN | 49.0 | 80.0 | 35.0 | 280.000000 |
2014 | NaN | 104.0 | 107.0 | NaN | 49.0 | 80.0 | 35.0 | 268.333333 |
2015 | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 320.666667 |
2016 | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 409.700000 |
Setting `margins=True` adds row and column totals ("margins") to the pivot table:
Make | BMW | CHEVROLET | FORD | KIA | MITSUBISHI | NISSAN | SMART | TESLA | All |
---|---|---|---|---|---|---|---|---|---|
YEAR | |||||||||
2012 | NaN | NaN | NaN | NaN | 49.0 | 80.0 | NaN | NaN | 64.500000 |
2013 | NaN | NaN | 107.0 | NaN | 49.0 | 80.0 | 35.0 | 280.000000 | 158.444444 |
2014 | NaN | 104.0 | 107.0 | NaN | 49.0 | 80.0 | 35.0 | 268.333333 | 135.000000 |
2015 | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 320.666667 | 181.428571 |
2016 | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 409.700000 | 252.263158 |
All | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 345.478261 | 190.622642 |
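The table above comes from the same `pivot_table()` call with `margins=True` added; a runnable sketch on a tiny stand-in for `cars_df` (the full CSV is not reproduced here):

```python
import pandas as pd

# a tiny stand-in for cars_df
cars_df = pd.DataFrame({'YEAR': [2012, 2012, 2013],
                        'Make': ['MITSUBISHI', 'NISSAN', 'NISSAN'],
                        '(kW)': [49.0, 80.0, 80.0]})

# margins=True appends an 'All' row and column holding the aggregate
pivot = cars_df.pivot_table(values='(kW)', index='YEAR', columns='Make',
                            aggfunc='mean', margins=True)
```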
## Scales of Measurement

### Scales of Measurement in Statistics

Statistics distinguishes four scales of measurement: the ratio scale, the interval scale, the ordinal scale, and the nominal scale. They differ as follows:
- Ratio scale: units are equally spaced and there is an absolute zero; `+`, `-`, `×`, `÷`, and comparison operations are all meaningful. Length is a typical ratio variable.

- Interval scale: units are equally spaced but there is no absolute zero; `+`, `-`, and comparisons are meaningful, while `×` and `÷` are not. Temperature in degrees Celsius is a typical interval variable.

- Ordinal scale: units are not equally spaced but are ordered; comparisons are meaningful, while `+`, `-`, `×`, `÷` are not. Letter grades (`A+`, `A`, `A-`) are a typical ordinal variable.

- Nominal scale: values merely name categories and carry no order. Sports team names are a typical nominal variable.
### Converting Between Scales in Pandas

- Ordinal and nominal variables

  Pandas represents both nominal and ordinal variables as `'category'` data; the difference is that an ordinal variable's categories are ordered while a nominal variable's are not.

  ```python
  df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                    index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                           'ok', 'ok', 'ok', 'poor', 'poor'])
  df.rename(columns={0: 'Grades'}, inplace=True)
  ```

  ```
            Grades
  excellent     A+
  excellent      A
  excellent     A-
  good          B+
  good           B
  good          B-
  ok            C+
  ok             C
  ok            C-
  poor          D+
  poor           D
  ```

  ```python
  # create a nominal (unordered) categorical variable
  nominal_grades = df['Grades'].astype('category')
  nominal_grades > 'C'  # TypeError: Unordered Categoricals can only compare equality or not

  # create an ordinal (ordered) categorical variable
  # (note: astype('category', ordered=True) was removed in newer Pandas;
  #  pd.CategoricalDtype is the current way to declare the order)
  grade_order = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+',
                                                'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                                    ordered=True)
  ordinal_grades = df['Grades'].astype(grade_order)
  ordinal_grades > 'C'  # a Boolean Series marking grades above 'C'
  ```
- Converting a ratio variable into an ordinal variable

  `pd.cut()` buckets continuous data, turning a ratio variable into an ordinal variable.

  ```python
  census_df = (census_df.set_index('STNAME').groupby(level=0)['CENSUS2010POP']
               .agg(avg=np.average))
  pd.cut(census_df['avg'], 10)
  ```

  ```
  STNAME
  Alabama        (11706.0871, 75333.413]
  Alaska         (11706.0871, 75333.413]
  Arizona       (390320.176, 453317.529]
  Arkansas       (11706.0871, 75333.413]
  California    (579312.234, 642309.586]
  ......
  Wyoming        (11706.0871, 75333.413]
  Name: avg, dtype: category
  Categories (10, object): [(11706.0871, 75333.413] < (75333.413, 138330.766] <
                            (138330.766, 201328.118] < (201328.118, 264325.471] ...
                            (453317.529, 516314.881] < (516314.881, 579312.234] <
                            (579312.234, 642309.586]]
  ```
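`pd.cut()` can also name the buckets through its `labels` parameter, which yields a readable ordered categorical (i.e. an ordinal variable). A sketch with made-up population figures:

```python
import pandas as pd

populations = pd.Series([5_000, 80_000, 500_000, 2_000_000])

# three equal-width buckets, with names instead of interval bounds;
# the result is an ordered categorical
sizes = pd.cut(populations, 3, labels=['small', 'medium', 'large'])
```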