Python Libraries for Data Science 01: Pandas

Basic Usage of the Series and DataFrame Classes

Series and DataFrame are the two most important classes in Pandas. A Series represents one-dimensional data, while a DataFrame represents two-dimensional data.


The Series Class

Creating a Series Object

A Series can be created from an array-like object, such as a list or a dict.

  • When creating a Series from a list, the index defaults to integers starting at 0

    animals = ['Tiger', 'Bear', 'Moose']
    pd.Series(animals)
    
    0    Tiger
    1     Bear
    2    Moose
    dtype: object
    
  • When creating a Series from a dict, the index is the dict's keys

    sports = {'Archery': 'Bhutan',
              'Golf': 'Scotland',
              'Sumo': 'Japan',
              'Taekwondo': 'South Korea'}
    s = pd.Series(sports)
    
    Archery           Bhutan
    Golf            Scotland
    Sumo               Japan
    Taekwondo    South Korea
    dtype: object
    

You can also specify the index explicitly with the index parameter of the constructor.

s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
India      Tiger
America     Bear
Canada     Moose
dtype: object

If the data passed to the constructor contains the missing value None, Pandas automatically substitutes the appropriate missing-value marker (NaN for numeric data, None for other types).

animals = ['Tiger', 'Bear', None]
pd.Series(animals)

numbers = [1, 2, None]
pd.Series(numbers)
0    Tiger
1     Bear
2     None
dtype: object

0    1.0
1    2.0
2    NaN
dtype: float64

Indexing a Series Object

A Series can be queried with the iloc and loc attributes. The former performs positional indexing (by row number); the latter performs label indexing (by label).

sports = {'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'}
s = pd.Series(sports)
'''The Series created is:
    Archery           Bhutan
    Golf            Scotland
    Sumo               Japan
    Taekwondo    South Korea
    dtype: object
'''

s.iloc[3]       # returns 'South Korea'
s.loc['Golf']   # returns 'Scotland'

Note that iloc and loc are attributes of the object, not methods, so the index value is wrapped in square brackets [] rather than parentheses ().

Pandas also supports direct subscript indexing on a Series; it decides whether to do positional or label indexing based on the key you pass in. But when the subscript is an integer, that guess can go wrong, so prefer iloc and loc.

s[3]       # equivalent to s.iloc[3]; returns 'South Korea'
s['Golf']  # equivalent to s.loc['Golf']; returns 'Scotland'
sports = {99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'}
s = pd.Series(sports)
'''The Series created is:
    99          Bhutan
    100       Scotland
    101          Japan
    102    South Korea
    dtype: object
'''
s[0]    # does NOT call s.iloc[0] as one might expect; raises KeyError instead

Label Indexes of Series Objects

Unlike in a relational database, the label index of a Series may contain duplicates.

original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'}) 
cricket_loving_countries = pd.Series(['Australia', 'Barbados', 'Pakistan', 'England'], 
                                     index=['Cricket', 'Cricket', 'Cricket', 'Cricket']) 
all_countries = original_sports.append(cricket_loving_countries)    # Series.append returns a new object rather than modifying in place (deprecated in newer Pandas; use pd.concat instead)

Running the above produces the following Series:

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object

Notice that the labels of this Series repeat; indexing a repeated label returns another Series.

all_countries.loc['Cricket']
Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object

Applying NumPy Functions to Series Objects

Pandas is built on top of NumPy, so NumPy array functions can be applied to Series objects.

s = pd.Series([100.00, 120.00, 101.00, 3.00])
np.sum(s)       # returns 324.0

The DataFrame Class

Creating a DataFrame Object

A DataFrame can be created directly, or from a CSV file.

  1. Creating a DataFrame directly:

    A DataFrame can be created from a list of Series or a list of dicts; each Series or dict is treated as one row of the DataFrame. As with Series, the label index can be specified explicitly in the constructor.

    purchase_1 = pd.Series({'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50})
    purchase_2 = pd.Series({'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50})
    purchase_3 = pd.Series({'Name': 'Vinod', 'Item Purchased': 'Bird Seed', 'Cost': 5.00})
    inventory_df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
    inventory_df.head() # head() shows the first few rows of the DataFrame
    

    The resulting DataFrame:

             Cost Item Purchased   Name
    Store 1  22.5       Dog Food  Chris
    Store 1   2.5   Kitty Litter  Kevyn
    Store 2   5.0      Bird Seed  Vinod
  2. Creating a DataFrame from a CSV file:

    The read_csv() function creates a DataFrame from a CSV file. A handful of parameters tell Pandas how to infer the labels and fields of the data.

    The example below uses an Olympics dataset to show how read_csv() is used. The source file olympics.csv looks like this:

    0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
    ,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№ Games,01 !,02 !,03 !,Combined total
    Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
    Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
    Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
    Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
    Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
    ......
    Zambia (ZAM) [ZAM],12,0,1,1,2,0,0,0,0,0,12,0,1,1,2
    Zimbabwe (ZIM) [ZIM],12,3,4,1,8,1,0,0,0,0,13,3,4,1,8
    Mixed team (ZZX) [ZZX],3,8,5,4,17,0,0,0,0,0,3,8,5,4,17
    Totals,27,4809,4775,5130,14714,22,959,958,948,2865,49,5768,5733,6078,17579
    

    As you can see, the first row is not valid data, the second row holds the field names, and the first column is the label index. Accordingly, read_csv() is called as follows:

    olympics_df = pd.read_csv('olympics.csv', index_col = 0, skiprows=1)
    

    The resulting DataFrame:

                             № Summer  01 !  02 !  03 !  Total  № Winter  01 !.1  02 !.1  03 !.1  Total.1  № Games  01 !.2  02 !.2  03 !.2  Combined total
    Afghanistan (AFG)              13     0     0     2      2         0       0       0       0        0       13       0       0       2               2
    Algeria (ALG)                  12     5     2     8     15         3       0       0       0        0       15       5       2       8              15
    Argentina (ARG)                23    18    24    28     70        18       0       0       0        0       41      18      24      28              70
    Armenia (ARM)                   5     1     2     9     12         6       0       0       0        0       11       1       2       9              12
    Australasia (ANZ) [ANZ]         2     3     4     5     12         0       0       0       0        0        2       3       4       5              12
    ...
    Zimbabwe (ZIM) [ZIM]           12     3     4     1      8         1       0       0       0        0       13       3       4       1               8
    Mixed team (ZZX) [ZZX]          3     8     5     4     17         0       0       0       0        0        3       8       5       4              17
    Totals                         27  4809  4775  5130  14714        22     959     958     948     2865       49    5768    5733    6078           17579

The index and columns attributes give access to a DataFrame's label index and fields, respectively.

for col in olympics_df.columns:
    if col[:2]=='01':
        olympics_df.rename(columns={col:'Gold' + col[4:]}, inplace=True)  # inplace=True modifies the DataFrame in place
    if col[:2]=='02':
        olympics_df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        olympics_df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        olympics_df.rename(columns={col:'#' + col[1:]}, inplace=True) 

Indexing a DataFrame Object

As with Series, iloc and loc perform positional and label indexing on a DataFrame; in both cases they index rows. Direct subscripting, on the other hand, indexes a DataFrame's columns.

inventory_df.iloc[0]            # the first row
inventory_df.loc['Store 2']     # the row(s) labeled 'Store 2'
inventory_df['Item Purchased']  # the column for the 'Item Purchased' field

Indexing a DataFrame returns a view of the original object rather than a copy; in other words, modifications to the indexing result propagate back to the original data.
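Because of this, when you want to modify an indexing result without touching the original data, take an explicit copy() first. A minimal sketch (toy data):

```python
import pandas as pd

inventory_df = pd.DataFrame(
    [{'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50},
     {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50}],
    index=['Store 1', 'Store 1'])

# Without .copy(), modifying the extracted column may propagate back to
# inventory_df; with .copy() the two are guaranteed to be independent.
costs = inventory_df['Cost'].copy()
costs += 2

print(costs.tolist())                 # [24.5, 4.5]
print(inventory_df['Cost'].tolist())  # [22.5, 2.5] -- original untouched
```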

Label Indexes of DataFrame Objects

  1. Setting a single-field label index

    Calling a DataFrame's set_index() method makes a field the label index; calling reset_index() removes the label index.

    olympics_df['country'] = olympics_df.index  # keep the old label index as a 'country' field
    olympics_df = olympics_df.set_index('Gold') # make the 'Gold' field the new label index
    

    The resulting DataFrame:

          # Summer  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total                  country
    Gold
    0           13       0       2      2         0       0         0         0        0       13       0         0         2               2        Afghanistan (AFG)
    5           12       2       8     15         3       0         0         0        0       15       5         2         8              15            Algeria (ALG)
    18          23      24      28     70        18       0         0         0        0       41      18        24        28              70          Argentina (ARG)
    1            5       2       9     12         6       0         0         0        0       11       1         2         9              12            Armenia (ARM)
    3            2       4       5     12         0       0         0         0        0        2       3         4         5              12  Australasia (ANZ) [ANZ]
    ...
    3           12       4       1      8         1       0         0         0        0       13       3         4         1               8     Zimbabwe (ZIM) [ZIM]
    8            3       5       4     17         0       0         0         0        0        3       8         5         4              17   Mixed team (ZZX) [ZZX]
    4809        27    4775    5130  14714        22     959       958       948     2865       49    5768      5733      6078           17579                   Totals
  2. Setting a multi-field composite label index

    Like a relational database, a DataFrame supports composite indexes over multiple fields; just pass a list of fields to set_index(). When you then label-index with loc, the leftmost-prefix rule applies: query the index levels from left to right.

    The example below demonstrates a composite label index on US census data.

    census_df = pd.read_csv('census.csv')
    columns_to_keep = ['STNAME', 'CTYNAME', 'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012', 'BIRTHS2013', 'BIRTHS2014', 'BIRTHS2015', 'POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015'] 
    census_df = census_df[columns_to_keep]
    
           STNAME         CTYNAME  BIRTHS2010  BIRTHS2011  BIRTHS2012  BIRTHS2013  BIRTHS2014  BIRTHS2015  POPESTIMATE2010  POPESTIMATE2011  POPESTIMATE2012  POPESTIMATE2013  POPESTIMATE2014  POPESTIMATE2015
    0     Alabama         Alabama       14226       59689       59062       57938       58334       58305          4785161          4801108          4816089          4830533          4846411          4858979
    1     Alabama  Autauga County         151         636         615         574         623         600            54660            55253            55175            55038            55290            55347
    2     Alabama  Baldwin County         517        2187        2092        2160        2186        2240           183193           186659           190396           195126           199713           203709
    3     Alabama  Barbour County          70         335         300         283         260         269            27341            27226            27159            26973            26815            26489
    4     Alabama     Bibb County          44         266         245         259         247         253            22861            22733            22642            22512            22549            22583
    5     Alabama   Blount County         183         744         710         646         618         603            57373            57711            57776            57734            57658            57673
    ...
    3192  Wyoming   Weston County          26          81          74          93          77          79             7181             7114             7065             7160             7185             7234

    We build a composite label index on the STNAME and CTYNAME fields:

    census_df = census_df.set_index(['STNAME', 'CTYNAME'])
    
                            BIRTHS2010  BIRTHS2011  BIRTHS2012  BIRTHS2013  BIRTHS2014  BIRTHS2015  POPESTIMATE2010  POPESTIMATE2011  POPESTIMATE2012  POPESTIMATE2013  POPESTIMATE2014  POPESTIMATE2015
    STNAME  CTYNAME
    Alabama Autauga County         151         636         615         574         623         600            54660            55253            55175            55038            55290            55347
            Baldwin County         517        2187        2092        2160        2186        2240           183193           186659           190396           195126           199713           203709
            Barbour County          70         335         300         283         260         269            27341            27226            27159            26973            26815            26489
            Bibb County             44         266         245         259         247         253            22861            22733            22642            22512            22549            22583
            Blount County          183         744         710         646         618         603            57373            57711            57776            57734            57658            57673

    When label-indexing a composite index with loc, follow the leftmost-prefix rule.

    # statistics for one county
    census_df.loc['Michigan', 'Washtenaw County']

    # statistics for several counties
    census_df.loc[[('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')]]

    # statistics for several states
    census_df.loc[['Alabama','Michigan']]
    

Sorting a DataFrame Object

sort_index(), sort_values(), and nlargest() do what their names say, so we won't belabor them.
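A minimal sketch of all three on a toy DataFrame (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'Gold': [4809, 18, 0, 5],
                   'Total': [14714, 70, 2, 15]},
                  index=['Totals', 'Argentina', 'Afghanistan', 'Algeria'])

by_gold = df.sort_values(by='Gold', ascending=False)  # order rows by the 'Gold' field
by_label = df.sort_index()                            # order rows by the label index
top2 = df.nlargest(2, 'Total')                        # the two rows with the largest 'Total'

print(by_gold.index.tolist())   # ['Totals', 'Argentina', 'Algeria', 'Afghanistan']
print(top2.index.tolist())      # ['Totals', 'Argentina']
```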

Querying a DataFrame Object

As in MATLAB, a Boolean mask can be used as a subscript on a DataFrame, and the mask itself can be produced by a conditional expression; together these implement querying.


olympics_df['Gold'] > 0     # a Series of length 147 holding True/False

only_gold = olympics_df.where(olympics_df['Gold'] > 0)  # a 147×15 DataFrame

len(olympics_df[(olympics_df['Gold'] > 0) | (olympics_df['Gold.1'] > 0)])   # returns 101
len(olympics_df[(olympics_df['Gold'] == 0) & (olympics_df['Gold.1'] > 0)])  # returns 1

Advanced Usage of the DataFrame Class

Merging DataFrames

Adding a New Column to a DataFrame

There are three ways to add a new column to a DataFrame.

inventory_df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Sponge', 'Cost': 22.50},
                   {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
                   {'Name': 'Filip', 'Item Purchased': 'Spoon', 'Cost': 5.00}],
                  index=['Store 1', 'Store 1', 'Store 2'])
         Cost Item Purchased   Name
Store 1  22.5         Sponge  Chris
Store 1   2.5   Kitty Litter  Kevyn
Store 2   5.0          Spoon  Filip
  1. Assign a list or Series with as many elements as the DataFrame has rows to a new field

    inventory_df['Date'] = ['December 1', 'January 1', 'mid-May']
    
             Cost Item Purchased   Name        Date
    Store 1  22.5         Sponge  Chris  December 1
    Store 1   2.5   Kitty Litter  Kevyn   January 1
    Store 2   5.0          Spoon  Filip     mid-May
  2. Assign a scalar to a new field; it is broadcast to fill the whole column

    inventory_df['Delivered'] = True
    
             Cost Item Purchased   Name        Date  Delivered
    Store 1  22.5         Sponge  Chris  December 1       True
    Store 1   2.5   Kitty Litter  Kevyn   January 1       True
    Store 2   5.0          Spoon  Filip     mid-May       True
  3. Assign a Series built from a dict; each value fills the rows whose label matches its key.

    inventory_df['Date'] = pd.Series({'Store 1': 'December 1', 'Store 2': 'mid-May'})
    
             Cost Item Purchased   Name        Date  Delivered
    Store 1  22.5         Sponge  Chris  December 1       True
    Store 1   2.5   Kitty Litter  Kevyn  December 1       True
    Store 2   5.0          Spoon  Filip     mid-May       True

Merging Two DataFrames

The pandas.merge() function performs join queries across two DataFrames: the how parameter selects the join type, and left_index/left_on together with right_index/right_on select the join keys.


DataFrames can be joined much like tables in a relational database.

staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')

student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')


pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

                 Role       School
Name
James          Grader     Business
Kelly  Director of HR          NaN
Mike              NaN          Law
Sally  Course liasion  Engineering
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

                 Role       School
Name
Kelly  Director of HR          NaN
Sally  Course liasion  Engineering
James          Grader     Business
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
    Name            Role       School
0  Kelly  Director of HR          NaN
1  Sally  Course liasion  Engineering
2  James          Grader     Business

You can also join on multiple fields:

staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
  First Name Last Name            Role       School
0      Sally    Brooks  Course liasion  Engineering

If same-named fields conflict during a merge, both are kept, with suffixes appended to disambiguate (_x for the left frame, _y for the right).

staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
          Location_x   Name            Role            Location_y       School
0       State Street  Kelly  Director of HR                   NaN          NaN
1  Washington Avenue  Sally  Course liasion   512 Wilson Crescent  Engineering
2  Washington Avenue  James          Grader  1024 Billiard Avenue     Business

Improving Readability with Method Chaining

Most Pandas operations default to inplace=False, meaning they return a copy of the data rather than a view; this is what makes method chaining on a DataFrame possible.

Because subscript indexing returns a view of the original data, modifications to the result propagate back, which is sometimes unwanted. DataFrame.where() addresses this: its inplace parameter defaults to False, so we can select a subset by Boolean mask or index, just like subscripting, without changing the original data.

# The parentheses allow the expression to span multiple lines
(census_df.where(census_df['SUMLEV']==50)
    .dropna()	# where() leaves NaN placeholders in non-matching rows; dropna() discards them
    .set_index(['STNAME','CTYNAME'])
    .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
                        SUMLEV  REGION  DIVISION  STATE  COUNTY  CENSUS2010POP  Estimates Base 2010  POPESTIMATE2010  POPESTIMATE2011  POPESTIMATE2012  ...  RDOMESTICMIG2011  RDOMESTICMIG2012  RDOMESTICMIG2013  RDOMESTICMIG2014  RDOMESTICMIG2015  RNETMIG2011  RNETMIG2012  RNETMIG2013  RNETMIG2014  RNETMIG2015
STNAME  CTYNAME
Alabama Autauga County    50.0     3.0       6.0    1.0     1.0        54571.0              54571.0          54660.0          55253.0          55175.0  ...          7.242091         -2.915927         -3.012349          2.265971         -2.530799     7.606016    -2.626146    -2.722002     2.592270    -2.187333
        Baldwin County    50.0     3.0       6.0    1.0     3.0       182265.0             182265.0         183193.0         186659.0         190396.0  ...         14.832960         17.647293         21.845705         19.243287         17.197872    15.844176    18.559627    22.727626    20.317142    18.293499
        Barbour County    50.0     3.0       6.0    1.0     5.0        27457.0              27457.0          27341.0          27226.0          27159.0  ...         -4.728132         -2.500690         -7.056824         -3.904217        -10.543299    -4.874741    -2.758113    -7.167664    -3.978583   -10.543299
        Bibb County       50.0     3.0       6.0    1.0     7.0        22915.0              22919.0          22861.0          22733.0          22642.0  ...         -5.527043         -5.068871         -6.201001         -0.177537          0.177258    -5.088389    -4.363636    -5.403729     0.754533     1.107861
        Blount County     50.0     3.0       6.0    1.0     9.0        57322.0              57322.0          57373.0          57711.0          57776.0  ...          1.807375         -1.177622         -1.748766         -2.062535         -1.369970     1.859511    -0.848580    -1.402476    -1.577232    -0.884411
...
Wyoming Weston County     50.0     4.0       8.0   56.0    45.0         7208.0               7208.0           7181.0           7114.0           7065.0  ...        -11.752361         -8.040059         12.372583          1.533635          6.935294   -12.032179    -8.040059    12.372583     1.533635     6.935294

The chained code above does exactly the same thing as the non-chained code below, at a slight performance cost; we think that cost is well worth the readability.

census_df = census_df[census_df['SUMLEV']==50]
census_df.set_index(['STNAME','CTYNAME'], inplace=True)
census_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}, inplace=True)

Applying a Function to Every Row or Column with apply()

Calling apply() applies a function to every row or column of a DataFrame. The example below shows its use:

# add two new fields, max and min, to each row
def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']]
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row

census_df.apply(min_max, axis=1)

Here min_max is applied to every row of the DataFrame, yet axis is set to 1 rather than 0. That is because the axis parameter names the axis of the index the operation consumes: when operating on each row we consume the column index, hence axis=1.
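A tiny illustration of the axis semantics (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

col_sums = df.apply(sum, axis=0)  # consumes the row index: one result per column
row_sums = df.apply(sum, axis=1)  # consumes the column index: one result per row

print(col_sums.tolist())  # [3, 30]
print(row_sums.tolist())  # [11, 22]
```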

Combining apply() with a lambda keeps the code tidy:

cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
census_df.apply(lambda x: np.max(x[cols]), axis=1)

Group Operations

A Pandas group operation has three steps: split → apply → combine.

For more on group operations, see the official documentation.

Splitting into Groups with groupby()

As in a relational database, DataFrame.groupby() splits a DataFrame into groups, each consisting of a group label and a sub-DataFrame.

The grouping criterion is given through the by or level parameter.

  • The by parameter accepts a string (or list of strings), a dict, or a function.

    • If by is given a string (or list of strings), rows are grouped by the corresponding field(s).

      for group, frame in census_df.groupby(by='STNAME'):
          avg = np.average(frame['CENSUS2010POP'])
          print('Counties in state ' + group + ' have an average population of ' + str(avg))
      
      Counties in state Alabama have an average population of 140580.470588
      Counties in state Alaska have an average population of 47348.7333333
      Counties in state Arizona have an average population of 799002.125
      Counties in state Arkansas have an average population of 76734.6842105
      Counties in state California have an average population of 1262845.9661
      ......
      Counties in state Wyoming have an average population of 46968.8333333
      
    • If by is given a dict, rows are grouped by mapping the DataFrame's index through the dict: indexes matching a key go into the group named by its value.

      ContinentDict  = {'China':'Asia', 
                        'United States':'North America', 
                        'Japan':'Asia', 
                        'United Kingdom':'Europe', 
                        'Russian Federation':'Europe', 
                        'Canada':'North America', 
                        'Germany':'Europe', 
                        'India':'Asia',
                        'France':'Europe', 
                        'South Korea':'Asia', 
                        'Italy':'Europe', 
                        'Spain':'Europe', 
                        'Iran':'Asia',
                        'Australia':'Australia', 
                        'Brazil':'South America'}
      
      # countries_df is indexed by country name
      for group, frame in countries_df.groupby(by=ContinentDict):
          print('There are ' + str(len(frame)) + ' countries in continent ' + str(group) + '.')
      
      There are 5 countries in continent Asia.
      There are 1 countries in continent Australia.
      There are 6 countries in continent Europe.
      There are 2 countries in continent North America.
      There are 1 countries in continent South America.
      
    • If by is given a function, rows are grouped by the result of applying the function to the label index.

      census_df = census_df.set_index('STNAME')
      
      def fun(item):
          if item[0]<'M':
              return 0
          if item[0]<'Q':
              return 1
          return 2
      
      for group, frame in census_df.groupby(by=fun):
          print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')
      
      There are 1177 records in group 0 for processing.
      There are 1134 records in group 1 for processing.
      There are 831 records in group 2 for processing.
      
  • The grouping can also be specified with the level parameter: Pandas groups by the given level(s) of a MultiIndex.

    census_df = census_df.set_index(['STNAME', 'CTYNAME'])
    
    for group, frame in census_df.groupby(level=1):
        avg = np.average(frame['CENSUS2010POP'])
        print('Counties in state and city ' + str(group) + ' have an average population of ' + str(avg))
    
    Counties in state and city Abbeville County have an average population of 25417.0
    Counties in state and city Acadia Parish have an average population of 61773.0
    Counties in state and city Accomack County have an average population of 33164.0
    Counties in state and city Ada County have an average population of 392365.0
    Counties in state and city Adair County have an average population of 18657.0
    ......
    Counties in state and city Ziebach County have an average population of 2801.0
    

Combining Groups with agg()

DataFrame.aggregate() (abbreviated DataFrame.agg()) combines groups. It accepts a function, or a dict whose keys are strings and whose values are functions.

  • Given a single function, Pandas applies it to every field.

    census_df.groupby(by='STNAME').agg(np.average)
    
                SUMLEV  REGION  DIVISION  STATE      COUNTY  CENSUS2010POP  ESTIMATESBASE2010  POPESTIMATE2010  POPESTIMATE2011  POPESTIMATE2012  RDOMESTICMIG2011  RDOMESTICMIG2012  RDOMESTICMIG2013  RDOMESTICMIG2014  RDOMESTICMIG2015  RNETMIG2011  RNETMIG2012  RNETMIG2013  RNETMIG2014  RNETMIG2015
    STNAME
    Alabama         50       3         6      1   67.000000   71339.343284       71345.179104     71420.313433     71658.328358     71881.925373         -3.395735         -3.043109         -2.186529         -1.715909         -1.958259    -2.781050    -2.339462    -1.506142    -0.944449    -1.155698
    Alaska          50       4         9      2  149.655172   24490.724138       24491.344828     24621.413793     24921.379310     25214.758621         -1.153129         -5.261918         -9.277180        -18.013402        -13.090652     2.175991    -1.651751    -5.795667   -14.182244    -9.125865
    Arizona         50       4         8      4   13.866667  426134.466667      426153.800000    427213.866667    431248.800000    436884.133333          0.422680         -1.901080         -2.447228          2.016780         -0.523498     1.670682    -0.513930    -1.044427     3.601728     1.128518
    Arkansas        50       3         7      5   75.000000   38878.906667       38879.440000     38965.253333     39180.506667     39326.653333         -2.496257         -4.938436         -4.300058         -5.542308         -4.393712    -2.070819    -4.508319    -3.829464    -4.988838    -3.824365
    California      50       4         9      6   58.000000  642309.586207      642319.017241    643691.017241    650000.586207    656138.879310         -4.244651         -3.831785         -3.463105         -2.027410         -0.745475    -2.174454    -1.788510    -1.285478     0.468384     1.797060
    ...
    Wyoming         50       4         8     56   23.000000   24505.478261       24511.608696     24544.173913     24685.565217     25090.434783         -2.199453          4.416514          3.632000         -5.586437         -2.703374    -1.544964     5.095661     4.342728    -4.753322    -1.848371
  • Given a dict, Pandas decides what to do based on the keys.

    • If a key is one of the DataFrame's field names, the function is applied to that field

      (census_df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']]
          .agg({'POPESTIMATE2010': np.average, 'POPESTIMATE2011': np.sum}))
      
                  POPESTIMATE2010  POPESTIMATE2011
      STNAME
      Alabama        71420.313433          4801108
      Alaska         24621.413793           722720
      Arizona       427213.866667          6468732
      Arkansas       38965.253333          2938538
      California    643691.017241         37700034
      Wyoming        24544.173913           567768
    • If a key is not a field name, Pandas applies the function to all fields and stores the results under a column named by the key. (Note: this renamed-aggregation form was deprecated in later Pandas versions.)

      (census_df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']]
          .agg({'avg': np.average, 'sum': np.sum}))
      
                              avg                               sum
                  POPESTIMATE2010  POPESTIMATE2011  POPESTIMATE2010  POPESTIMATE2011
      STNAME
      Alabama        71420.313433     71658.328358          4785161          4801108
      Alaska         24621.413793     24921.379310           714021           722720
      Arizona       427213.866667    431248.800000          6408208          6468732
      Arkansas       38965.253333     39180.506667          2922394          2938538
      California    643691.017241    650000.586207         37334079         37700034
      Wyoming        24544.173913     24685.565217           564516           567768

Of course, groups can also be combined by calling aggregation functions directly; common ones include:

Function    Description
mean()      Compute mean of groups
sum()       Compute sum of group values
size()      Compute group sizes
count()     Compute count of group
std()       Standard deviation of groups
var()       Compute variance of groups
sem()       Standard error of the mean of groups
describe()  Generates descriptive statistics
first()     Compute first of group values
last()      Compute last of group values
nth()       Take nth value, or a subset if n is a list
min()       Compute min of group values
max()       Compute max of group values
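For instance, calling such a function directly on a GroupBy object (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'STNAME': ['Alabama', 'Alabama', 'Alaska'],
                   'POP': [100, 300, 50]})

grouped = df.groupby('STNAME')['POP']

print(grouped.sum().tolist())   # [400, 50]
print(grouped.mean().tolist())  # [200.0, 50.0]
print(grouped.size().tolist())  # [2, 1]
```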

Pivot Tables

The DataFrame.pivot_table() function builds a pivot table. The example below uses cars.csv to demonstrate the syntax.

cars_df = pd.read_csv('cars.csv')
    YEAR        Make           Model            Size  (kW) Unnamed: 5 TYPE  CITY (kWh/100 km)  HWY (kWh/100 km)  COMB (kWh/100 km)  CITY (Le/100 km)  HWY (Le/100 km)  COMB (Le/100 km)  (g/km) RATING  (km)  TIME (h)
0   2012  MITSUBISHI          i-MiEV      SUBCOMPACT    49         A1    B               16.9              21.4               18.7               1.9              2.4               2.1       0    n/a   100         7
1   2012      NISSAN            LEAF        MID-SIZE    80         A1    B               19.3              23.0               21.1               2.2              2.6               2.4       0    n/a   117         7
2   2013        FORD  FOCUS ELECTRIC         COMPACT   107         A1    B               19.0              21.1               20.0               2.1              2.4               2.2       0    n/a   122         4
3   2013  MITSUBISHI          i-MiEV      SUBCOMPACT    49         A1    B               16.9              21.4               18.7               1.9              2.4               2.1       0    n/a   100         7
4   2013      NISSAN            LEAF        MID-SIZE    80         A1    B               19.3              23.0               21.1               2.2              2.6               2.4       0    n/a   117         7
52  2016       TESLA    MODEL X P90D  SUV - STANDARD   568         A1    B               23.6              23.3               23.5               2.7              2.6               2.6       0     10   402        12

The pivot-table syntax is as follows:

cars_df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=np.mean)
Make     BMW  CHEVROLET   FORD   KIA  MITSUBISHI  NISSAN  SMART       TESLA
YEAR
2012     NaN        NaN    NaN   NaN        49.0    80.0    NaN         NaN
2013     NaN        NaN  107.0   NaN        49.0    80.0   35.0  280.000000
2014     NaN      104.0  107.0   NaN        49.0    80.0   35.0  268.333333
2015   125.0      104.0  107.0  81.0        49.0    80.0   35.0  320.666667
2016   125.0      104.0  107.0  81.0        49.0    80.0   35.0  409.700000

Setting margins=True adds row and column totals to the pivot table:

Make     BMW  CHEVROLET   FORD   KIA  MITSUBISHI  NISSAN  SMART       TESLA         All
YEAR
2012     NaN        NaN    NaN   NaN        49.0    80.0    NaN         NaN   64.500000
2013     NaN        NaN  107.0   NaN        49.0    80.0   35.0  280.000000  158.444444
2014     NaN      104.0  107.0   NaN        49.0    80.0   35.0  268.333333  135.000000
2015   125.0      104.0  107.0  81.0        49.0    80.0   35.0  320.666667  181.428571
2016   125.0      104.0  107.0  81.0        49.0    80.0   35.0  409.700000  252.263158
All    125.0      104.0  107.0  81.0        49.0    80.0   35.0  345.478261  190.622642
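As a runnable sketch, with a hypothetical toy stand-in for cars_df (the real file is only excerpted above):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of cars_df, just to demonstrate margins=True
cars_df = pd.DataFrame({'YEAR': [2012, 2012, 2013],
                        'Make': ['NISSAN', 'MITSUBISHI', 'NISSAN'],
                        '(kW)': [80, 49, 80]})

pt = cars_df.pivot_table(values='(kW)', index='YEAR', columns='Make',
                         aggfunc=np.mean, margins=True)
print(pt)  # includes an 'All' row and an 'All' column of means
```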

Data Scales

Data Scales in Statistics

Statistics distinguishes four scales of measurement: the ratio scale, the interval scale, the ordinal scale, and the nominal scale. They differ as follows:

  • Ratio scale: units are equally spaced and an absolute zero exists; +, -, ×, ÷ and comparisons are all meaningful.

    Length is a classic ratio variable.

  • Interval scale: units are equally spaced but there is no absolute zero; +, - and comparisons are meaningful, while × and ÷ are not.

    Temperature in Celsius is a classic interval variable.

  • Ordinal scale: units are not equally spaced but are ordered; comparisons are meaningful, while +, -, ×, ÷ are not.

    Letter grades (A+, A, A-) are a classic ordinal variable.

  • Nominal scale: values merely name categories, with no order among them.

    Team names are a classic nominal variable.

Converting Between Data Scales in Pandas

  1. Ordinal and nominal variables

    Pandas treats both nominal and ordinal variables as 'category' dtype; the difference is that an ordinal variable is ordered while a nominal one is not.

    df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                      index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'])
    df.rename(columns={0: 'Grades'}, inplace=True)
    
              Grades
    excellent     A+
    excellent      A
    excellent     A-
    good          B+
    good           B
    good          B-
    ok            C+
    ok             C
    ok            C-
    poor          D+
    poor           D
    # create a nominal variable
    nominal_grades = df['Grades'].astype('category')
    nominal_grades > 'C'    # raises an error: Unordered Categoricals can only compare equality or not

    # create an ordinal variable
    ordinal_grades = df['Grades'].astype('category', ordered=True)  # newer Pandas: astype(pd.CategoricalDtype(ordered=True))
    ordinal_grades > 'C'    # returns the corresponding boolean Series
    
  2. Converting a ratio variable to an ordinal variable

    pd.cut() buckets data into intervals, converting a ratio variable into an ordinal variable.

    census_df = census_df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg({'avg': np.average})
    pd.cut(census_df['avg'],10)
    
    STNAME
    Alabama                  (11706.0871, 75333.413]
    Alaska                   (11706.0871, 75333.413]
    Arizona                 (390320.176, 453317.529]
    Arkansas                 (11706.0871, 75333.413]
    California              (579312.234, 642309.586]
    ......
    Wyoming                  (11706.0871, 75333.413]
    Name: avg, dtype: category
    Categories (10, object): [(11706.0871, 75333.413] < (75333.413, 138330.766] < (138330.766, 201328.118] < (201328.118, 264325.471] ... (390320.176, 453317.529] < (453317.529, 516314.881] < (516314.881, 579312.234] < (579312.234, 642309.586]]
    