# Python Libraries for Data Science 01: Pandas

## Basic Usage of the Series and DataFrame Classes

`Series` and `DataFrame` are the two core classes of Pandas. A `Series` holds one-dimensional data, while a `DataFrame` holds two-dimensional data.
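All code below assumes the conventional imports (the article uses the `pd` and `np` aliases without showing them):

```python
import numpy as np   # Pandas is built on top of NumPy
import pandas as pd  # the conventional alias used throughout this article
```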
### The Series Class

#### Creating a Series Object

A `Series` can be created from any array-like object, such as a `list` or a `dict`.
- When creating a `Series` from a `list`, the index defaults to integers starting from 0:

  ```python
  animals = ['Tiger', 'Bear', 'Moose']
  pd.Series(animals)
  ```

  ```
  0    Tiger
  1     Bear
  2    Moose
  dtype: object
  ```
- When creating a `Series` from a `dict`, the dictionary keys become the index:

  ```python
  sports = {'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'}
  s = pd.Series(sports)
  ```

  ```
  Archery           Bhutan
  Golf            Scotland
  Sumo               Japan
  Taekwondo    South Korea
  dtype: object
  ```
The index can also be given explicitly through the `index` parameter of the constructor.

```python
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
```

```
India      Tiger
America     Bear
Canada     Moose
dtype: object
```

If the data passed to the constructor contains the missing value `None`, Pandas automatically converts it to the appropriate sentinel: `NaN` for numeric data, `None` for all other data types.
```python
animals = ['Tiger', 'Bear', None]
pd.Series(animals)
```

```
0    Tiger
1     Bear
2     None
dtype: object
```

```python
numbers = [1, 2, None]
pd.Series(numbers)
```

```
0    1.0
1    2.0
2    NaN
dtype: float64
```
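A caveat worth knowing: `NaN` is a float value that does not compare equal to itself, so missing values must be detected with `Series.isna()` (or `pd.isnull()`) rather than `==`. A small sketch:

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])

# NaN does not equal itself, so == cannot find missing values;
# use isna()/isnull() instead
nan_equals_itself = (np.nan == np.nan)  # False
missing = numbers.isna()                # [False, False, True]
```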
#### Indexing a Series Object

A `Series` can be queried through the `iloc` and `loc` attributes. The former is positional indexing (query by row number); the latter is label indexing (query by label).

```python
sports = {'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'}
s = pd.Series(sports)
'''The resulting Series:
Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object
'''
s.iloc[3]      # 'South Korea'
s.loc['Golf']  # 'Scotland'
```
Note that `iloc` and `loc` are attributes of the object, not methods, so the index value is wrapped in square brackets `[]` rather than parentheses `()`.

Pandas also supports indexing a `Series` directly with `[]`; it infers from the key whether to perform positional or label indexing. This inference can go wrong when the key is an integer (and recent Pandas versions deprecate the positional fallback), so `iloc` and `loc` should be preferred.

```python
s[3]       # equivalent to s.iloc[3]; returns 'South Korea'
s['Golf']  # equivalent to s.loc['Golf']; returns 'Scotland'
```
```python
sports = {99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'}
s = pd.Series(sports)
'''The resulting Series:
99          Bhutan
100       Scotland
101          Japan
102    South Korea
dtype: object
'''
s[0]  # does NOT call s.iloc[0] as one might expect; raises KeyError
```
#### Label Indexes of Series Objects

Unlike a relational database, the label index of a `Series` may contain duplicates.

```python
original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia', 'Barbados', 'Pakistan', 'England'],
                                     index=['Cricket', 'Cricket', 'Cricket', 'Cricket'])
# Series.append does not modify the original object
all_countries = original_sports.append(cricket_loving_countries)
```

Running the code above yields the following `Series`:

```
Archery            Bhutan
Golf             Scotland
Sumo                Japan
Taekwondo     South Korea
Cricket         Australia
Cricket          Barbados
Cricket          Pakistan
Cricket           England
dtype: object
```

As you can see, this `Series` has duplicate labels, and indexing on a duplicated label returns another `Series`:

```python
all_countries.loc['Cricket']
```

```
Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object
```
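Note that `Series.append` was deprecated and removed in pandas 2.0; the same result is obtained with `pd.concat`, which likewise returns a new object and keeps duplicate labels. A sketch:

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
cricket = pd.Series(['Australia', 'Barbados'], index=['Cricket', 'Cricket'])

# pd.concat is the modern replacement for Series.append
all_countries = pd.concat([original_sports, cricket])
```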
#### Applying NumPy Functions to Series Objects

Pandas is built on top of NumPy, so NumPy functions that operate on an `array` can be applied to a `Series` as well.

```python
s = pd.Series([100.00, 120.00, 101.00, 3.00])
np.sum(s)  # 324.0
```
### The DataFrame Class

#### Creating a DataFrame Object

A `DataFrame` can be created directly, or loaded from a CSV file.

- Creating a `DataFrame` directly: pass a list of `Series` or a list of `dict` objects to the constructor; each `Series` or `dict` becomes one row of the `DataFrame`. As with `Series`, the label index can be specified explicitly in the constructor.

  ```python
  purchase_1 = pd.Series({'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50})
  purchase_2 = pd.Series({'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50})
  purchase_3 = pd.Series({'Name': 'Vinod', 'Item Purchased': 'Bird Seed', 'Cost': 5.00})
  inventory_df = pd.DataFrame([purchase_1, purchase_2, purchase_3],
                              index=['Store 1', 'Store 1', 'Store 2'])
  inventory_df.head()  # head() shows the first few rows of a DataFrame
  ```

  The resulting `DataFrame`:

  ```
           Cost Item Purchased   Name
  Store 1  22.5       Dog Food  Chris
  Store 1   2.5   Kitty Litter  Kevyn
  Store 2   5.0      Bird Seed  Vinod
  ```
- Loading a `DataFrame` from a CSV file: the `read_csv()` function reads a CSV file into a `DataFrame`. A handful of parameters tell Pandas how to infer the labels and columns of the data. The Olympics data set below shows how to call `read_csv()`; the source file `olympics.csv` looks like this:

  ```
  0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
  ,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№ Games,01 !,02 !,03 !,Combined total
  Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
  Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
  Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
  Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
  Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
  ......
  Zambia (ZAM) [ZAM],12,0,1,1,2,0,0,0,0,0,12,0,1,1,2
  Zimbabwe (ZIM) [ZIM],12,3,4,1,8,1,0,0,0,0,13,3,4,1,8
  Mixed team (ZZX) [ZZX],3,8,5,4,17,0,0,0,0,0,3,8,5,4,17
  Totals,27,4809,4775,5130,14714,22,959,958,948,2865,49,5768,5733,6078,17579
  ```

  The first row is junk, the second row holds the column names, and the first column is the label index. Accordingly, `read_csv()` is called as follows:

  ```python
  olympics_df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
  ```

  The resulting `DataFrame`:

  ```
                           № Summer  01 !  02 !  03 !  Total  № Winter  01 !.1  ...  Combined total
  Afghanistan (AFG)              13     0     0     2      2         0       0  ...               2
  Algeria (ALG)                  12     5     2     8     15         3       0  ...              15
  Argentina (ARG)                23    18    24    28     70        18       0  ...              70
  Armenia (ARM)                   5     1     2     9     12         6       0  ...              12
  Australasia (ANZ) [ANZ]         2     3     4     5     12         0       0  ...              12
  ...                           ...   ...   ...   ...    ...       ...     ...  ...             ...
  Zimbabwe (ZIM) [ZIM]           12     3     4     1      8         1       0  ...               8
  Mixed team (ZZX) [ZZX]          3     8     5     4     17         0       0  ...              17
  Totals                         27  4809  4775  5130  14714        22     959  ...           17579
  ```
The `index` and `columns` attributes give access to the label index and the columns of a `DataFrame`, respectively.

```python
for col in olympics_df.columns:
    if col[:2] == '01':
        # with inplace=True, the DataFrame is modified in place
        olympics_df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        olympics_df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        olympics_df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        olympics_df.rename(columns={col: '#' + col[1:]}, inplace=True)
```
#### Indexing a DataFrame Object

As with `Series`, the `iloc` and `loc` attributes perform positional and label indexing on a `DataFrame`, respectively; both index by row. Columns can be selected directly with the subscript operator.

```python
inventory_df.iloc[0]            # the first row
inventory_df.loc['Store 2']     # the row(s) labeled 'Store 2'
inventory_df['Item Purchased']  # the column 'Item Purchased'
```

Indexing a `DataFrame` generally returns a view of the original object rather than a copy, which means that modifying the result of an indexing operation may modify the original data.
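Because of this, it is safest to take an explicit `copy()` of a selection before modifying it. A small sketch with made-up data:

```python
import pandas as pd

inventory_df = pd.DataFrame({'Cost': [22.5, 2.5, 5.0]},
                            index=['Store 1', 'Store 1', 'Store 2'])

# copy() guarantees an independent object, so later writes
# cannot leak back into inventory_df
subset = inventory_df.loc['Store 1'].copy()
subset['Cost'] = 0.0
```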
#### Label Indexes of DataFrame Objects

- Setting a single-column label index

  Calling the `set_index()` method of a `DataFrame` sets a column as the label index; calling `reset_index()` removes it.

  ```python
  olympics_df['country'] = olympics_df.index   # keep the old index as a 'country' column
  olympics_df = olympics_df.set_index('Gold')  # make the 'Gold' column the label index
  ```

  The resulting `DataFrame`:

  ```
        # Summer  Silver  Bronze  Total  # Winter  ...  Combined total                  country
  Gold
  0           13       0       2      2         0  ...               2        Afghanistan (AFG)
  5           12       2       8     15         3  ...              15            Algeria (ALG)
  18          23      24      28     70        18  ...              70          Argentina (ARG)
  1            5       2       9     12         6  ...              12            Armenia (ARM)
  3            2       4       5     12         0  ...              12  Australasia (ANZ) [ANZ]
  ...        ...     ...     ...    ...       ...  ...             ...                      ...
  3           12       4       1      8         1  ...               8     Zimbabwe (ZIM) [ZIM]
  8            3       5       4     17         0  ...              17    Mixed team (ZZX) [ZZX]
  4809        27    4775    5130  14714        22  ...           17579                   Totals
  ```
- Setting a multi-column composite label index

  As in a relational database, a `DataFrame` also supports a composite index over several columns: just pass a list of column names to `set_index()`. Label indexing with `loc` must then follow the leftmost-prefix rule. The US census data set below demonstrates composite label indexes.

  ```python
  census_df = pd.read_csv('census.csv')
  columns_to_keep = ['STNAME', 'CTYNAME',
                     'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012',
                     'BIRTHS2013', 'BIRTHS2014', 'BIRTHS2015',
                     'POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                     'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
  census_df = census_df[columns_to_keep]
  ```

  ```
         STNAME         CTYNAME  BIRTHS2010  BIRTHS2011  ...  POPESTIMATE2014  POPESTIMATE2015
  0     Alabama         Alabama       14226       59689  ...          4846411          4858979
  1     Alabama  Autauga County         151         636  ...            55290            55347
  2     Alabama  Baldwin County         517        2187  ...           199713           203709
  3     Alabama  Barbour County          70         335  ...            26815            26489
  4     Alabama     Bibb County          44         266  ...            22549            22583
  5     Alabama   Blount County         183         744  ...            57658            57673
  ...       ...             ...         ...         ...  ...              ...              ...
  3192  Wyoming   Weston County          26          81  ...             7185             7234
  ```

  We build a composite label index on the `STNAME` and `CTYNAME` columns:

  ```python
  census_df = census_df.set_index(['STNAME', 'CTYNAME'])
  ```

  ```
                           BIRTHS2010  BIRTHS2011  ...  POPESTIMATE2014  POPESTIMATE2015
  STNAME  CTYNAME
  Alabama Autauga County          151         636  ...            55290            55347
          Baldwin County          517        2187  ...           199713           203709
          Barbour County           70         335  ...            26815            26489
          Bibb County              44         266  ...            22549            22583
          Blount County           183         744  ...            57658            57673
  ```

  When using `loc` on a composite index, the query must follow the leftmost-prefix rule:

  ```python
  # the statistics for one county
  census_df.loc['Michigan', 'Washtenaw County']
  # the statistics for several counties
  census_df.loc[[('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')]]
  # the statistics for several states
  census_df.loc[['Alabama', 'Michigan']]
  ```
#### Sorting a DataFrame Object

`sort_index()` sorts a `DataFrame` by its index, `sort_values()` sorts it by the values of one or more columns, and `nlargest()` returns the rows with the largest values in a given column.
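A minimal sketch of the three methods on a toy `DataFrame` (the data here is made up):

```python
import pandas as pd

df = pd.DataFrame({'Gold': [18, 5, 1], 'Total': [70, 15, 12]},
                  index=['Argentina', 'Algeria', 'Armenia'])

by_label = df.sort_index()                         # rows ordered by index label
by_gold = df.sort_values('Gold', ascending=False)  # rows ordered by 'Gold', descending
top2 = df.nlargest(2, 'Total')                     # the 2 rows with the largest 'Total'
```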
#### Querying a DataFrame Object

As in MATLAB, a `DataFrame` can be indexed with a Boolean mask, and a Boolean mask can be produced by a conditional expression; together these implement data queries.

```python
olympics_df['Gold'] > 0  # a Series of length 147 containing True/False values
only_gold = olympics_df.where(olympics_df['Gold'] > 0)  # a 147x15 DataFrame
len(olympics_df[(olympics_df['Gold'] > 0) | (olympics_df['Gold.1'] > 0)])  # 101
len(olympics_df[(olympics_df['Gold'] == 0) & (olympics_df['Gold.1'] > 0)])  # 1
```
## Advanced Usage of the DataFrame Class

### Combining DataFrames

#### Adding a New Column to a DataFrame

There are three ways to add a new column to a `DataFrame`.

```python
inventory_df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Sponge', 'Cost': 22.50},
                             {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
                             {'Name': 'Filip', 'Item Purchased': 'Spoon', 'Cost': 5.00}],
                            index=['Store 1', 'Store 1', 'Store 2'])
```

 | Cost | Item Purchased | Name
---|---|---|---
Store 1 | 22.5 | Sponge | Chris
Store 1 | 2.5 | Kitty Litter | Kevyn
Store 2 | 5.0 | Spoon | Filip
- Assign a list or `Series` with the same number of rows to the new column:

  ```python
  inventory_df['Date'] = ['December 1', 'January 1', 'mid-May']
  ```

  ```
           Cost Item Purchased   Name        Date
  Store 1  22.5         Sponge  Chris  December 1
  Store 1   2.5   Kitty Litter  Kevyn   January 1
  Store 2   5.0          Spoon  Filip     mid-May
  ```

- Assign a scalar to the new column; the value is broadcast to every row:

  ```python
  inventory_df['Delivered'] = True
  ```

  ```
           Cost Item Purchased   Name        Date  Delivered
  Store 1  22.5         Sponge  Chris  December 1       True
  Store 1   2.5   Kitty Litter  Kevyn   January 1       True
  Store 2   5.0          Spoon  Filip     mid-May       True
  ```

- Assign a `Series` built from a dict; each value is filled into the rows whose label matches the key:

  ```python
  inventory_df['Date'] = pd.Series({'Store 1': 'December 1', 'Store 2': 'mid-May'})
  ```

  ```
           Cost Item Purchased   Name        Date  Delivered
  Store 1  22.5         Sponge  Chris  December 1       True
  Store 1   2.5   Kitty Litter  Kevyn  December 1       True
  Store 2   5.0          Spoon  Filip     mid-May       True
  ```
#### Merging Two DataFrames

The `pandas.merge()` function performs a join query over two `DataFrame` objects, much like a join in a relational database. The `how` parameter selects the join type, and `left_index`/`left_on` together with `right_index`/`right_on` select the join keys.
```python
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
```
Role | School | |
---|---|---|
Name | ||
James | Grader | Business |
Kelly | Director of HR | NaN |
Mike | NaN | Law |
Sally | Course liasion | Engineering |
```python
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
```
Role | School | |
---|---|---|
Name | ||
Kelly | Director of HR | NaN |
Sally | Course liasion | Engineering |
James | Grader | Business |
```python
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
```
Name | Role | School | |
---|---|---|---|
0 | Kelly | Director of HR | NaN |
1 | Sally | Course liasion | Engineering |
2 | James | Grader | Business |
It is also possible to join on multiple columns:
```python
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
pd.merge(staff_df, student_df, how='inner',
         left_on=['First Name', 'Last Name'], right_on=['First Name', 'Last Name'])
```
First Name | Last Name | Role | School | |
---|---|---|---|---|
0 | Sally | Brooks | Course liasion | Engineering |
If both inputs share a column name that is not a join key, both columns are kept and renamed with distinguishing suffixes.
```python
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
```
Location_x | Name | Role | Location_y | School | |
---|---|---|---|---|---|
0 | State Street | Kelly | Director of HR | NaN | NaN |
1 | Washington Avenue | Sally | Course liasion | 512 Wilson Crescent | Engineering |
2 | Washington Avenue | James | Grader | 1024 Billiard Avenue | Business |
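The default `_x`/`_y` names can be customized through the `suffixes` parameter of `pd.merge()` — a small sketch:

```python
import pandas as pd

staff_df = pd.DataFrame([{'Name': 'Sally', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'Sally', 'Location': '512 Wilson Crescent'}])

# suffixes replaces the default '_x'/'_y' on conflicting column names
merged = pd.merge(staff_df, student_df, on='Name', suffixes=('_staff', '_student'))
```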
### Improving Readability with Method Chaining

Most Pandas operations default to `inplace=False`, meaning they return a copy of the data rather than a view; this is what makes method chaining on a `DataFrame` possible.

Because subscript indexing may return a view of the original data, modifying an indexing result can modify the original data, which is sometimes undesirable. The `DataFrame.where()` method, whose `inplace` parameter defaults to `False`, lets us select a subset by Boolean mask just like subscript indexing, without touching the original data.
```python
# the outer parentheses let the expression span multiple lines
(census_df.where(census_df['SUMLEV'] == 50)
    .dropna()  # where() fills non-matching rows with NaN; dropna() removes them
    .set_index(['STNAME', 'CTYNAME'])
    .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
```
SUMLEV | REGION | DIVISION | STATE | COUNTY | CENSUS2010POP | Estimates Base 2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
STNAME | CTYNAME | |||||||||||||||||||||
Alabama | Autauga County | 50.0 | 3.0 | 6.0 | 1.0 | 1.0 | 54571.0 | 54571.0 | 54660.0 | 55253.0 | 55175.0 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.592270 | -2.187333 |
Baldwin County | 50.0 | 3.0 | 6.0 | 1.0 | 3.0 | 182265.0 | 182265.0 | 183193.0 | 186659.0 | 190396.0 | ... | 14.832960 | 17.647293 | 21.845705 | 19.243287 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 | |
Barbour County | 50.0 | 3.0 | 6.0 | 1.0 | 5.0 | 27457.0 | 27457.0 | 27341.0 | 27226.0 | 27159.0 | ... | -4.728132 | -2.500690 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 | |
Bibb County | 50.0 | 3.0 | 6.0 | 1.0 | 7.0 | 22915.0 | 22919.0 | 22861.0 | 22733.0 | 22642.0 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 | |
Blount County | 50.0 | 3.0 | 6.0 | 1.0 | 9.0 | 57322.0 | 57322.0 | 57373.0 | 57711.0 | 57776.0 | ... | 1.807375 | -1.177622 | -1.748766 | -2.062535 | -1.369970 | 1.859511 | -0.848580 | -1.402476 | -1.577232 | -0.884411 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Wyoming | Weston County | 50.0 | 4.0 | 8.0 | 56.0 | 45.0 | 7208.0 | 7208.0 | 7181.0 | 7114.0 | 7065.0 | ... | -11.752361 | -8.040059 | 12.372583 | 1.533635 | 6.935294 | -12.032179 | -8.040059 | 12.372583 | 1.533635 | 6.935294 |
The chained code above does exactly the same thing as the non-chained code below, at a slightly higher performance cost; we consider that cost well worth the gain in readability.

```python
census_df = census_df[census_df['SUMLEV'] == 50]
census_df.set_index(['STNAME', 'CTYNAME'], inplace=True)
census_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})
```
### Applying a Function to All Rows or Columns with apply()

The `apply()` method applies a function to every row or every column of a `DataFrame`. The following example shows its usage:

```python
# add 'max' and 'min' fields to each row
def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']]
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row

census_df.apply(min_max, axis=1)
```

Here `min_max` is applied to every row of the `DataFrame`, yet `axis` is set to `1` rather than `0`. This is because the `axis` parameter in Pandas names the dimension of the index being consumed: when operating on each row we consume the column index, so `axis` is set to `1`.
Combining `apply()` with a `lambda` function keeps the code compact:

```python
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
        'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
census_df.apply(lambda x: np.max(x[cols]), axis=1)
```
### Group Operations

A group operation in Pandas consists of three steps: split → apply → combine.

For a full treatment of group operations, see the official Pandas documentation.

#### Splitting into Groups with groupby()

As in a relational database, the `DataFrame.groupby()` method splits a `DataFrame` into groups; each group consists of a group label and a sub-`DataFrame`.

The grouping criterion is specified through the `by` or the `level` parameter.
- Pass a string (or list of strings), a dict, or a function to the `by` parameter:

  - If `by` receives a string (or list of strings), rows are grouped by the corresponding column(s).

    ```python
    for group, frame in census_df.groupby(by='STNAME'):
        avg = np.average(frame['CENSUS2010POP'])
        print('Counties in state ' + group + ' have an average population of ' + str(avg))
    ```

    ```
    Counties in state Alabama have an average population of 140580.470588
    Counties in state Alaska have an average population of 47348.7333333
    Counties in state Arizona have an average population of 799002.125
    Counties in state Arkansas have an average population of 76734.6842105
    Counties in state California have an average population of 1262845.9661
    ......
    Counties in state Wyoming have an average population of 46968.8333333
    ```
  - If `by` receives a dict, each index label is looked up in the dict and rows are grouped under the resulting values.

    ```python
    ContinentDict = {'China': 'Asia', 'United States': 'North America', 'Japan': 'Asia',
                     'United Kingdom': 'Europe', 'Russian Federation': 'Europe',
                     'Canada': 'North America', 'Germany': 'Europe', 'India': 'Asia',
                     'France': 'Europe', 'South Korea': 'Asia', 'Italy': 'Europe',
                     'Spain': 'Europe', 'Iran': 'Asia', 'Australia': 'Australia',
                     'Brazil': 'South America'}
    # countries_df is indexed by country name
    for group, frame in countries_df.groupby(by=ContinentDict):
        print('There are ' + str(len(frame)) + ' countries in continent ' + str(group) + '.')
    ```

    ```
    There are 5 countries in continent Asia.
    There are 1 countries in continent Australia.
    There are 6 countries in continent Europe.
    There are 2 countries in continent North America.
    There are 1 countries in continent South America.
    ```
  - If `by` receives a function, rows are grouped by the result of applying that function to each index label.

    ```python
    census_df = census_df.set_index('STNAME')

    def fun(item):
        if item[0] < 'M':
            return 0
        if item[0] < 'Q':
            return 1
        return 2

    for group, frame in census_df.groupby(by=fun):
        print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')
    ```

    ```
    There are 1177 records in group 0 for processing.
    There are 1134 records in group 1 for processing.
    There are 831 records in group 2 for processing.
    ```
- Grouping can also be specified through the `level` parameter, which selects the level(s) of a multi-level index to group by; `level=1` below groups by the second index level, `CTYNAME`.

  ```python
  census_df = census_df.set_index(['STNAME', 'CTYNAME'])
  for group, frame in census_df.groupby(level=1):
      avg = np.average(frame['CENSUS2010POP'])
      print('Counties in state and city ' + str(group) + ' have an average population of ' + str(avg))
  ```

  ```
  Counties in state and city Abbeville County have an average population of 25417.0
  Counties in state and city Acadia Parish have an average population of 61773.0
  Counties in state and city Accomack County have an average population of 33164.0
  Counties in state and city Ada County have an average population of 392365.0
  Counties in state and city Adair County have an average population of 18657.0
  ......
  Counties in state and city Ziebach County have an average population of 2801.0
  ```
#### Combining Groups with agg()

`DataFrame.aggregate()` (or its alias `DataFrame.agg()`) combines the groups. It accepts either a function or a dict whose keys are strings and whose values are functions.

- If a single function is passed, Pandas applies it to every column:

  ```python
  census_df.groupby(by='STNAME').agg(np.average)
  ```

  ```
              SUMLEV  REGION  DIVISION  STATE      COUNTY  CENSUS2010POP  ...  RNETMIG2015
  STNAME
  Alabama         50       3         6      1   67.000000   71339.343284  ...    -1.155698
  Alaska          50       4         9      2  149.655172   24490.724138  ...    -9.125865
  Arizona         50       4         8      4   13.866667  426134.466667  ...     1.128518
  Arkansas        50       3         7      5   75.000000   38878.906667  ...    -3.824365
  California      50       4         9      6   58.000000  642309.586207  ...     1.797060
  ...            ...     ...       ...    ...         ...            ...  ...          ...
  Wyoming         50       4         8     56   23.000000   24505.478261  ...    -1.848371
  ```
当接收到的是一个字典时,Pandas会根据字典的键决定函数行为.
-
若字典的键为
DataFrame
的字段名时,Pandas会将函数作用在对应字段上(census_df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'] .agg({'POPESTIMATE2010': np.average, 'POPESTIMATE2011': np.sum}))
POPESTIMATE2010 POPESTIMATE2011 STNAME Alabama 71420.313433 4801108 Alaska 24621.413793 722720 Arizona 427213.866667 6468732 Arkansas 38965.253333 2938538 California 643691.017241 37700034 … … … Wyoming 24544.173913 567768 -
若字典的键不是
DataFrame
的字段名,Pandas会对所有字段应用该函数并将其存入对应字段中.(census_df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'] .agg({'avg': np.average, 'sum': np.sum}))
avg sum POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2010 POPESTIMATE2011 STNAME Alabama 71420.313433 71658.328358 4785161 4801108 Alaska 24621.413793 24921.379310 714021 722720 Arizona 427213.866667 431248.800000 6408208 6468732 Arkansas 38965.253333 39180.506667 2922394 2938538 California 643691.017241 650000.586207 37334079 37700034 … … … … … Wyoming 24544.173913 24685.565217 564516 567768
- Of course, groups can also be combined directly with the built-in aggregation methods. The most commonly used ones are:
Function | Description |
---|---|
mean() | Compute mean of groups |
sum() | Compute sum of group values |
size() | Compute group sizes |
count() | Compute count of group |
std() | Standard deviation of groups |
var() | Compute variance of groups |
sem() | Standard error of the mean of groups |
describe() | Generates descriptive statistics |
first() | Compute first of group values |
last() | Compute last of group values |
nth() | Take nth value, or a subset if n is a list |
min() | Compute min of group values |
max() | Compute max of group values |
### Pivot Tables

A pivot table is generated with the `DataFrame.pivot_table()` method. The `cars.csv` data set below demonstrates the syntax.

```python
cars_df = pd.read_csv('cars.csv')
```
YEAR | Make | Model | Size | (kW) | Unnamed: 5 | TYPE | CITY (kWh/100 km) | HWY (kWh/100 km) | COMB (kWh/100 km) | CITY (Le/100 km) | HWY (Le/100 km) | COMB (Le/100 km) | (g/km) | RATING | (km) | TIME (h) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | n/a | 100 | 7 |
1 | 2012 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | n/a | 117 | 7 |
2 | 2013 | FORD | FOCUS ELECTRIC | COMPACT | 107 | A1 | B | 19.0 | 21.1 | 20.0 | 2.1 | 2.4 | 2.2 | 0 | n/a | 122 | 4 |
3 | 2013 | MITSUBISHI | i-MiEV | SUBCOMPACT | 49 | A1 | B | 16.9 | 21.4 | 18.7 | 1.9 | 2.4 | 2.1 | 0 | n/a | 100 | 7 |
4 | 2013 | NISSAN | LEAF | MID-SIZE | 80 | A1 | B | 19.3 | 23.0 | 21.1 | 2.2 | 2.6 | 2.4 | 0 | n/a | 117 | 7 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
52 | 2016 | TESLA | MODEL X P90D | SUV - STANDARD | 568 | A1 | B | 23.6 | 23.3 | 23.5 | 2.7 | 2.6 | 2.6 | 0 | 10 | 402 | 12 |
The pivot table is generated as follows:

```python
cars_df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=np.mean)
```
Make | BMW | CHEVROLET | FORD | KIA | MITSUBISHI | NISSAN | SMART | TESLA |
---|---|---|---|---|---|---|---|---|
YEAR | ||||||||
2012 | NaN | NaN | NaN | NaN | 49.0 | 80.0 | NaN | NaN |
2013 | NaN | NaN | 107.0 | NaN | 49.0 | 80.0 | 35.0 | 280.000000 |
2014 | NaN | 104.0 | 107.0 | NaN | 49.0 | 80.0 | 35.0 | 268.333333 |
2015 | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 320.666667 |
2016 | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 409.700000 |
Setting `margins=True` adds row and column totals ("margins") to the pivot table:
Make | BMW | CHEVROLET | FORD | KIA | MITSUBISHI | NISSAN | SMART | TESLA | All |
---|---|---|---|---|---|---|---|---|---|
YEAR | |||||||||
2012 | NaN | NaN | NaN | NaN | 49.0 | 80.0 | NaN | NaN | 64.500000 |
2013 | NaN | NaN | 107.0 | NaN | 49.0 | 80.0 | 35.0 | 280.000000 | 158.444444 |
2014 | NaN | 104.0 | 107.0 | NaN | 49.0 | 80.0 | 35.0 | 268.333333 | 135.000000 |
2015 | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 320.666667 | 181.428571 |
2016 | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 409.700000 | 252.263158 |
All | 125.0 | 104.0 | 107.0 | 81.0 | 49.0 | 80.0 | 35.0 | 345.478261 | 190.622642 |
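The table above comes from the same `pivot_table()` call with `margins=True` added; a runnable sketch on a tiny stand-in for `cars_df` (the full CSV is not reproduced here):

```python
import pandas as pd

# a tiny stand-in for cars_df
cars_df = pd.DataFrame({'YEAR': [2012, 2012, 2013],
                        'Make': ['MITSUBISHI', 'NISSAN', 'NISSAN'],
                        '(kW)': [49.0, 80.0, 80.0]})

# margins=True appends an 'All' row and column holding the aggregate
pivot = cars_df.pivot_table(values='(kW)', index='YEAR', columns='Make',
                            aggfunc='mean', margins=True)
```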
## Scales of Measurement

### Scales of Measurement in Statistics

Statistics distinguishes four scales of measurement: the ratio scale, the interval scale, the ordinal scale, and the nominal scale. They differ as follows:
- Ratio scale: units are equally spaced and there is an absolute zero; `+`, `-`, `×`, `÷`, and comparison operations are all meaningful. Length is a typical ratio variable.

- Interval scale: units are equally spaced but there is no absolute zero; `+`, `-`, and comparisons are meaningful, while `×` and `÷` are not. Temperature in degrees Celsius is a typical interval variable.

- Ordinal scale: units are not equally spaced but are ordered; comparisons are meaningful, while `+`, `-`, `×`, `÷` are not. Letter grades (`A+`, `A`, `A-`) are a typical ordinal variable.

- Nominal scale: values merely name categories and carry no order. Sports team names are a typical nominal variable.
### Converting Between Scales in Pandas

- Ordinal and nominal variables

  Pandas represents both nominal and ordinal variables as `'category'` data; the difference is that an ordinal variable's categories are ordered while a nominal variable's are not.

  ```python
  df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                    index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                           'ok', 'ok', 'ok', 'poor', 'poor'])
  df.rename(columns={0: 'Grades'}, inplace=True)
  ```

  ```
            Grades
  excellent     A+
  excellent      A
  excellent     A-
  good          B+
  good           B
  good          B-
  ok            C+
  ok             C
  ok            C-
  poor          D+
  poor           D
  ```

  ```python
  # create a nominal (unordered) categorical variable
  nominal_grades = df['Grades'].astype('category')
  nominal_grades > 'C'  # TypeError: Unordered Categoricals can only compare equality or not

  # create an ordinal (ordered) categorical variable
  # (note: astype('category', ordered=True) was removed in newer Pandas;
  #  pd.CategoricalDtype is the current way to declare the order)
  grade_order = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+',
                                                'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                                    ordered=True)
  ordinal_grades = df['Grades'].astype(grade_order)
  ordinal_grades > 'C'  # a Boolean Series marking grades above 'C'
  ```
- Converting a ratio variable into an ordinal variable

  `pd.cut()` buckets continuous data, turning a ratio variable into an ordinal variable.

  ```python
  census_df = (census_df.set_index('STNAME').groupby(level=0)['CENSUS2010POP']
               .agg(avg=np.average))
  pd.cut(census_df['avg'], 10)
  ```

  ```
  STNAME
  Alabama        (11706.0871, 75333.413]
  Alaska         (11706.0871, 75333.413]
  Arizona       (390320.176, 453317.529]
  Arkansas       (11706.0871, 75333.413]
  California    (579312.234, 642309.586]
  ......
  Wyoming        (11706.0871, 75333.413]
  Name: avg, dtype: category
  Categories (10, object): [(11706.0871, 75333.413] < (75333.413, 138330.766] <
                            (138330.766, 201328.118] < (201328.118, 264325.471] ...
                            (453317.529, 516314.881] < (516314.881, 579312.234] <
                            (579312.234, 642309.586]]
  ```
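`pd.cut()` can also name the buckets through its `labels` parameter, which yields a readable ordered categorical (i.e. an ordinal variable). A sketch with made-up population figures:

```python
import pandas as pd

populations = pd.Series([5_000, 80_000, 500_000, 2_000_000])

# three equal-width buckets, with names instead of interval bounds;
# the result is an ordered categorical
sizes = pd.cut(populations, 3, labels=['small', 'medium', 'large'])
```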