11.Pandas数据透视表

Pandas数据透视表

前面我们讲解了Pandas的GroupBy对象,并且结合行星的例子讲解了Groupby是如何探索数据集内部的关联性了

数据透视表(pivot table)是和groupby操作类似的一种操作方法,常见于Excel与类似的表格应用中

数据透视表将每一列数据作为输入,输出为将数据不断细分成多个维度累计信息的二维数据表

数据表像一种多维的GroupBy累计操作,因为对数据透视表的分割与组合发生在二维网格上

和上一节类似,这次依旧首先讲解数据透视表的相关知识,最后基于讲解的数据透视表的知识分析实际案例


准备工作

本节在最后使用的数据集包括:

  1. 泰坦尼克号乘客的信息数据库
  2. 美国疾病预防中心(CDC)提供的公开的美国人生日数据

这里为了避免网络相关问题,我已经把数据集搬运到了百度网盘上

  1. 泰坦尼克号数据集:百度网盘 提取码:666
  2. 美国人生日数据:百度网盘 提取码:666

此外,由于书中提供的原版美国人生日数据的下载网站无法下载,这里是我找到的另外一个版本

什么是数据透视表

数据透视表就是按照我们的需要,将原数据表按照我们所希望的列重新分割组合所得到的新的数据表

例如在前面的行星的分析中,我们使用groupby方法指定通过method和decade两个维度对数据进行切割,最终得到一个不同年代各种方法发现行星数量的一张表,不妨将其记为A

则A表就是一张数据透视表,我们透过原有数据得到的数据表

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


planets=pd.read_csv('planets.csv')
decade=10*(planets['year']//10)
decade=decade.astype(str)+'s'
decade.name='decade'
print(planets)
print(planets.groupby(by=['method',decade])['number'].sum().unstack().fillna(0))
print(type(planets.groupby(by=['method',decade])['number'].sum().unstack().fillna(0)))
>>>
decade                         1980s  1990s  2000s  2010s
method                                                   
Astrometry                       0.0    0.0    0.0    2.0
Eclipse Timing Variations        0.0    0.0    5.0   10.0
Imaging                          0.0    0.0   29.0   21.0
Microlensing                     0.0    0.0   12.0   15.0
Orbital Brightness Modulation    0.0    0.0    0.0    5.0
Pulsar Timing                    0.0    9.0    1.0    1.0
Pulsation Timing Variations      0.0    0.0    1.0    0.0
Radial Velocity                  1.0   52.0  475.0  424.0
Transit                          0.0    0.0   64.0  712.0
Transit Timing Variations        0.0    0.0    0.0    9.0
<class 'pandas.core.frame.DataFrame'>

手工制作一张数据透视表

明白什么是数据透视表之后,我们不妨通过手工的方法来制作一张数据透视表,即指定groupby操作的对象是两列

我们首先制作一张数据表,然后重新为其中的两列赋予存在内在关系的数据

接下来使用groupby方法,指定两列来作为分割对象

DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
GroupBy_1=DataFrame_1.groupby(by=['sexual','nationality'])
print(GroupBy_1.aggregate('mean').unstack())
>>>
  sexual nationality  c  d  e  f  g
A    man          cn  9  6  1  2  5
B  woman          us  7  7  1  1  4
C    man          uk  8  6  2  0  5
A  woman          cn  8  1  0  1  3
B  woman          us  6  3  8  6  1
C    man          us  6  1  8  6  4

               c              d              e              f              g          
nationality   cn   uk   us   cn   uk   us   cn   uk   us   cn   uk   us   cn   uk   us
sexual                                                                                
man          9.0  8.0  6.0  6.0  6.0  1.0  1.0  2.0  8.0  2.0  0.0  6.0  5.0  5.0  4.0
woman        8.0  NaN  6.5  1.0  NaN  5.0  0.0  NaN  4.5  1.0  NaN  3.5  3.0  NaN  2.5

数据透视表语法

实际上,由于像上面一样的groupby操作来获得一个数据透视表的操作非常常见

所以Pandas提供了一个pivot_table方法来快速制作一张数据透视表

语法如下

DataFrame对象名.pivot_table(value,index,columns,aggfunc,dropna,fill_value)

其中index和columns必须指定,其他的可以省略

所以上面手工制作的数据表可以用如下语句实现

DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index='sexual',columns='nationality'))
>>>
  sexual nationality  c  d  e  f  g
A    man          cn  1  0  0  9  0
B  woman          us  3  2  3  4  4
C    man          uk  9  9  6  1  2
A  woman          cn  9  3  8  0  8
B  woman          us  9  1  6  6  8
C    man          us  9  8  3  0  5
               c              d              e              f              g          
nationality   cn   uk   us   cn   uk   us   cn   uk   us   cn   uk   us   cn   uk   us
sexual                                                                                
man          1.0  9.0  9.0  0.0  9.0  8.0  0.0  6.0  3.0  9.0  1.0  0.0  0.0  2.0  5.0
woman        9.0  NaN  6.0  3.0  NaN  1.5  8.0  NaN  4.5  0.0  NaN  5.0  8.0  NaN  6.0

多维数据透视表

前面讲过,对于高维数据,我们可以使用MultiIndex对象来用DataFrame对象表示高维数据

所以我们也能够创建多维数据透视表来表示高维数据

我们实际上只要给columns和index两个参数传递作为键的列表,类似于创建MultiIndex对象时候的from_tuple方法,就能创建出数据透视表,例如

DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
DataFrame_1.iloc[:,2]=['child','child','child','adult','child','child']
DataFrame_1.rename(columns={'c':'age'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index=['sexual','age'],columns='nationality'))
>>>
  sexual nationality    age  d  e  f  g
A    man          cn  child  4  4  3  8
B  woman          us  child  5  5  4  9
C    man          uk  child  5  0  1  3
A  woman          cn  adult  2  9  8  9
B  woman          us  child  3  3  2  9
C    man          us  child  2  5  8  4

                d              e              f              g          
nationality    cn   uk   us   cn   uk   us   cn   uk   us   cn   uk   us
sexual age                                                              
man    child  4.0  5.0  2.0  4.0  0.0  5.0  3.0  1.0  8.0  8.0  3.0  4.0
woman  adult  2.0  NaN  NaN  9.0  NaN  NaN  8.0  NaN  NaN  9.0  NaN  NaN
       child  NaN  NaN  4.0  NaN  NaN  4.0  NaN  NaN  3.0  NaN  NaN  9.0

数据透视表的其他选项

DataFrame对象的pivot_table方法的完整签名如下

DataFrame.pivot_table(values = None,index = None,columns = None,aggfunc ='mean',fill_value = None,margin = False,dropna = True,margins_name ='All'

其中第一个参数是指定需要聚合的对象,第二三个参数是指定数据透视表的行列索引,第四个参数是指定聚合的方式,默认是平均数

第五个参数是指定填充缺失值,margin参数用于计算每一组的总数,dropna用于删除缺失值,margin_name用于设动margin列的名字

下面给一些例子来帮助理解

value参数

DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
DataFrame_1.iloc[:,2]=['child','child','child','adult','child','child']
DataFrame_1.rename(columns={'c':'age'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index='sexual',columns='nationality'))
print(DataFrame_1.pivot_table(values='d',index='sexual',columns='nationality'))
>>>
  sexual nationality    age  d  e  f  g
A    man          cn  child  3  0  3  3
B  woman          us  child  2  8  4  0
C    man          uk  child  1  9  7  1
A  woman          cn  adult  4  7  0  7
B  woman          us  child  2  8  0  2
C    man          us  child  6  3  0  2
               d              e              f              g          
nationality   cn   uk   us   cn   uk   us   cn   uk   us   cn   uk   us
sexual                                                                 
man          3.0  1.0  6.0  0.0  9.0  3.0  3.0  7.0  0.0  3.0  1.0  2.0
woman        4.0  NaN  2.0  7.0  NaN  8.0  0.0  NaN  2.0  7.0  NaN  1.0
nationality   cn   uk   us
sexual                    
man          3.0  1.0  6.0
woman        4.0  NaN  2.0

aggfunc参数

我们可以直接为aggfuc指定聚合的方法,这样将会运用到透视数据表的全部列上

DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index='sexual',columns='nationality',aggfunc='max'))
>>>
  sexual nationality  c  d  e  f  g
A    man          cn  1  2  3  4  0
B  woman          us  0  2  7  5  7
C    man          uk  2  4  3  4  5
A  woman          cn  6  9  5  1  2
B  woman          us  0  0  6  1  4
C    man          us  1  5  0  9  8
               c              d              e              f              g          
nationality   cn   uk   us   cn   uk   us   cn   uk   us   cn   uk   us   cn   uk   us
sexual                                                                                
man          1.0  2.0  1.0  2.0  4.0  5.0  3.0  3.0  0.0  4.0  4.0  9.0  0.0  5.0  8.0
woman        6.0  NaN  0.0  9.0  NaN  2.0  5.0  NaN  7.0  1.0  NaN  5.0  2.0  NaN  7.0

我们也可以为aggfunc参数传入指定每一列聚合方法的字典

DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index='sexual',columns='nationality',aggfunc={'c':'max','d':'min','e':'mean'}))
>>>
sexual nationality  c  d  e  f  g
A    man          cn  0  7  0  8  4
B  woman          us  3  0  3  5  0
C    man          uk  8  7  6  4  1
A  woman          cn  1  4  8  6  2
B  woman          us  5  6  5  3  5
C    man          us  1  4  9  2  8
               c              d              e          
nationality   cn   uk   us   cn   uk   us   cn   uk   us
sexual                                                  
man          0.0  8.0  1.0  7.0  7.0  4.0  0.0  6.0  9.0
woman        1.0  NaN  5.0  4.0  NaN  0.0  8.0  NaN  4.0

margin参数

margin参数可以计算每一组的每列,每行的聚合值

DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table('d',index='sexual',columns='nationality',margins=True))
>>>
  sexual nationality  c  d  e  f  g
A    man          cn  8  8  7  1  9
B  woman          us  0  9  5  4  2
C    man          uk  2  0  9  0  0
A  woman          cn  2  7  7  7  6
B  woman          us  4  0  0  7  1
C    man          us  0  6  8  1  0
nationality   cn   uk   us       All
sexual                              
man          8.0  0.0  6.0  4.666667
woman        7.0  NaN  4.5  5.333333
All          7.5  0.0  5.0  5.000000

margins_name参数

margins_name参数用与为统计行 / 列设置行列索引

DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table('d',index='sexual',columns='nationality',margins=True,margins_name='total'))
>>>
  sexual nationality  c  d  e  f  g
A    man          cn  3  9  6  9  0
B  woman          us  9  0  6  1  0
C    man          uk  4  9  0  9  8
A  woman          cn  8  0  4  7  7
B  woman          us  7  3  4  5  4
C    man          us  1  5  4  0  5

nationality   cn   uk        us     total
sexual                                   
man          9.0  9.0  5.000000  7.666667
woman        0.0  NaN  1.500000  1.000000
total        4.5  9.0  2.666667  4.333333

泰坦尼克号生还率分析

根据前面讲解的数据透视表,下面将对泰坦尼克号这个数据集进行分析

整体情况

同样上来先对数据集整体进行分析

这里对数据表进行转置,以便于不会折行显示

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



titanic=pd.read_csv('titanic.csv')
print(titanic.head().T)
print(titanic.describe())
>>>
                       0          1            2            3            4
survived               0          1            1            1            0
pclass                 3          1            3            1            3
sex                 male     female       female       female         male
age                   22         38           26           35           35
sibsp                  1          1            0            1            0
parch                  0          0            0            0            0
fare                7.25    71.2833        7.925         53.1         8.05
embarked               S          C            S            S            S
class              Third      First        Third        First        Third
who                  man      woman        woman        woman          man
adult_male          True      False        False        False         True
deck                 NaN          C          NaN            C          NaN
embark_town  Southampton  Cherbourg  Southampton  Southampton  Southampton
alive                 no        yes          yes          yes           no
alone              False      False         True        False         True
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

男女船舱登记生还率

接下来使用数据透视表,将原有数据表重新划分为一张透视表来查看不同性别及不同船舱的生还率

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



titanic=pd.read_csv('titanic.csv')
print(titanic.pivot_table('survived',index='who',columns='class'))

>>>
class     First    Second     Third
who                                
child  0.833333  1.000000  0.431034
man    0.352941  0.080808  0.119122
woman  0.978022  0.909091  0.491228

我们发现,除了儿童以外,船舱登记越高,生还率越高,尤其是女性

排除异常

上面我们发现二等舱的儿童生还率竟然达到了100%,这是一个相对来说异常的数字,我们自然要去查看下所有的二等舱的儿童

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



titanic=pd.read_csv('titanic.csv')
# print(titanic.head().T)
# print(titanic.describe())
# print(titanic.pivot_table('survived',index='who',columns='class'))
child=titanic.set_index('who').T['child'].T
print(child.reset_index().pivot_table('survived',index='who',columns='class',aggfunc='sum'))
print(child.reset_index().pivot_table('survived',index='who',columns='class',aggfunc='count'))
>>>
class  First  Second  Third
who                        
child      5      19     25
class  First  Second  Third
who                        
child      6      19     58

我们发现之所以一等舱的儿童生还率低于二等舱儿童生还率,一方面是因为一等舱位的儿童太少了,另外一方面确实是二等舱的儿童全部得救

美国人生日分析

下面我们将使用上面的美国人生日来进行分析

整体情况

首先还是了解一下整个数据集的情况

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt




birthday=pd.read_csv('birthdates-1968-1988.csv')

print(birthday)
print(birthday.describe())
>>>
      year  month  day  births  day_of_year  day_of_week
0     1969      1    1    8486            1            3
1     1969      1    2    9002            2            4
2     1969      1    3    9542            3            5
3     1969      1    4    8960            4            6
4     1969      1    5    8390            5            7
...    ...    ...  ...     ...          ...          ...
7300  1988     12   27   11528          362            2
7301  1988     12   28   11847          363            3
7302  1988     12   29   11704          364            4
7303  1988     12   30   11837          365            5
7304  1988     12   31    9133          366            6

[7305 rows x 6 columns]
              year        month          day        births  day_of_year  day_of_week
count  7305.000000  7305.000000  7305.000000   7305.000000  7305.000000  7305.000000
mean   1978.501027     6.522930    15.729637   9648.940178   183.753593     4.000274
std       5.766735     3.448939     8.800694   1127.315229   105.621885     1.999795
min    1969.000000     1.000000     1.000000   6675.000000     1.000000     1.000000
25%    1974.000000     4.000000     8.000000   8792.000000    93.000000     2.000000
50%    1979.000000     7.000000    16.000000   9622.000000   184.000000     4.000000
75%    1984.000000    10.000000    23.000000  10510.000000   275.000000     6.000000
max    1988.000000    12.000000    31.000000  12851.000000   366.000000     7.000000

每年的人口出生情况

接下来看一下每年的人口出生情况

由于这里给定的数据表是每天的人口数,所以我们首先需要计算每年出生的人口数,然后绘制出每年出生人口的折线图

首先要计算每年出生的人口,那么就需要使用sum来对每个年份出生的人口求和

而且从上面对数据集的整体情况的了解中我们知道,每年中每一天出生的人口和年份是关联的

所以我们可以使用数据透视表来指定以年份来分割原表的同时,使用sum来对births求和,这样将会返回一个Series对象

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


birthday=pd.read_csv('birthdates-1968-1988.csv')

birthnumber=birthday.pivot_table('births',index='year',aggfunc='sum')
print(birthnumber)
plt.plot(birthnumber.reset_index()['year'],birthnumber['births'])
plt.xlabel('year')
plt.ylabel('number')
plt.title('Birth Number of Every Year')
plt.show()

在这里插入图片描述

注意,这里虽然我们指定了每个数据点的横轴,但是绘图的时候matplotlib却依旧对年份区间取了均值,实际上这是由于matplotlib的默认特性,关于如何解决这一点,将在后面matplotlib的讲解中讲解

创建日期索引

为了承接后面的Pandas向量化字符串操作和Pandas处理时间序列两节内容,我们下面不妨先为每个数据创建一个日期索引

print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)
>>>
      year  month  day  births  day_of_year  day_of_week
0     1969      1    1    8486            1            3
1     1969      1    2    9002            2            4
2     1969      1    3    9542            3            5
3     1969      1    4    8960            4            6
4     1969      1    5    8390            5            7
...    ...    ...  ...     ...          ...          ...
7300  1988     12   27   11528          362            2
7301  1988     12   28   11847          363            3
7302  1988     12   29   11704          364            4
7303  1988     12   30   11837          365            5
7304  1988     12   31    9133          366            6

[7305 rows x 6 columns]

<class 'numpy.int64'>    <class 'numpy.int64'> <class 'numpy.int64'>

            year  month  day  births  day_of_year  day_of_week
1969-01-01  1969      1    1    8486            1            3
1969-01-02  1969      1    2    9002            2            4
1969-01-03  1969      1    3    9542            3            5
1969-01-04  1969      1    4    8960            4            6
1969-01-05  1969      1    5    8390            5            7
...          ...    ...  ...     ...          ...          ...
1988-12-27  1988     12   27   11528          362            2
1988-12-28  1988     12   28   11847          363            3
1988-12-29  1988     12   29   11704          364            4
1988-12-30  1988     12   30   11837          365            5
1988-12-31  1988     12   31    9133          366            6

[7305 rows x 6 columns]

注意,我们这里使用了to_datetime方法将会返回一个datetime对象,这是后面会在的Pandas处理时间序列这节中详细讲解

所以我们直接调用datetime对象的dayofweek方法来快速计算这个日期是周几

print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
birthday['day_of_week_2']=birthday.index.dayofweek
print(birthday)
>>>
      year  month  day  births  day_of_year  day_of_week
0     1969      1    1    8486            1            3
1     1969      1    2    9002            2            4
2     1969      1    3    9542            3            5
3     1969      1    4    8960            4            6
4     1969      1    5    8390            5            7
...    ...    ...  ...     ...          ...          ...
7300  1988     12   27   11528          362            2
7301  1988     12   28   11847          363            3
7302  1988     12   29   11704          364            4
7303  1988     12   30   11837          365            5
7304  1988     12   31    9133          366            6

[7305 rows x 6 columns]
<class 'numpy.int64'>    <class 'numpy.int64'> <class 'numpy.int64'>
            year  month  day  births  day_of_year  day_of_week  day_of_week_2
1969-01-01  1969      1    1    8486            1            3              2
1969-01-02  1969      1    2    9002            2            4              3
1969-01-03  1969      1    3    9542            3            5              4
1969-01-04  1969      1    4    8960            4            6              5
1969-01-05  1969      1    5    8390            5            7              6
...          ...    ...  ...     ...          ...          ...            ...
1988-12-27  1988     12   27   11528          362            2              1
1988-12-28  1988     12   28   11847          363            3              2
1988-12-29  1988     12   29   11704          364            4              3
1988-12-30  1988     12   30   11837          365            5              4
1988-12-31  1988     12   31    9133          366            6              5

[7305 rows x 7 columns]

但是这里出现了一些问题,可能是由于低层计算的问题

不同年代的星期日均出生数

利用上面得到的day_of_week这一列,我们可以看看不同年代不同星期的日均出生

这里直接用原有的day_of_week

print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday['decade']=birthday['year']//10*10
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)
PviotTable_decade_week=birthday.pivot_table('births',index='day_of_week',columns='decade',aggfunc='mean')
print(PviotTable_decade_week)
PviotTable_decade_week.plot()
plt.gca().set_xticklabels(['a','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
plt.ylabel('Mean Births by day')
plt.show()
>>>
decade               1960         1970          1980
day_of_week                                         
1            10127.653846  9378.195402  10553.814499
2            10572.192308  9770.504798  11007.685106
3            10149.245283  9500.752399  10735.285106
4             9956.576923  9393.846743  10666.970213
5            10215.769231  9564.191571  10786.174468
6             9302.115385  8415.568966   8967.802128
7             8684.692308  7958.557471   8616.240938

得到图像如下:

在这里插入图片描述

这里由于直接使用set_xticklabels方法的话,传入列表的第一个值会被丢弃,所以才会添加一个‘a’

此外Matplotlib是完美支持Pandas的,所以我们可以直接调用DataFrame对象的plot方法来绘图,并且会直接添加图例

DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,4)))
Series_1=pd.Series(np.random.randint(0,10,10))
print(DataFrame_1)
DataFrame_1.plot()
plt.show()
>>>
   0  1  2  3
0  6  4  2  1
1  3  6  4  1
2  8  9  8  6
3  8  7  8  3
4  1  5  7  6
5  8  9  3  6

在这里插入图片描述

各年份平均每天的出生数

我们如果忽略年份的差异,而去查看各年份平均每天的出生数,那么将会从图表中发现出生数和一年中每日的关系

分析一下,我们将对每年中的各日的出生人数求平均数所以value参数就是birth,然后为了便于画图,需要将日期作为行索引,所以我们依旧需要使用数据透视表

但是有一个问题,如果我们直接指定使用的日期为日的话,那么同一年的1~12月也会被计算

print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday['decade']=birthday['year']//10*10
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)

birthday_by_date=birthday.pivot_table('births',index=birthday.index.day,aggfunc='mean')
# birthday_by_date=birthday.pivot_table('births',index=[birthday.index.month,birthday.index.day],aggfunc='mean')
print(birthday_by_date)
>>>
      year  month  day  births  day_of_year  day_of_week
0     1969      1    1    8486            1            3
1     1969      1    2    9002            2            4
2     1969      1    3    9542            3            5
3     1969      1    4    8960            4            6
4     1969      1    5    8390            5            7
...    ...    ...  ...     ...          ...          ...
7300  1988     12   27   11528          362            2
7301  1988     12   28   11847          363            3
7302  1988     12   29   11704          364            4
7303  1988     12   30   11837          365            5
7304  1988     12   31    9133          366            6

[7305 rows x 6 columns]
<class 'numpy.int64'>    <class 'numpy.int64'> <class 'numpy.int64'>
            year  month  day  births  day_of_year  day_of_week  decade
1969-01-01  1969      1    1    8486            1            3    1960
1969-01-02  1969      1    2    9002            2            4    1960
1969-01-03  1969      1    3    9542            3            5    1960
1969-01-04  1969      1    4    8960            4            6    1960
1969-01-05  1969      1    5    8390            5            7    1960
...          ...    ...  ...     ...          ...          ...     ...
1988-12-27  1988     12   27   11528          362            2    1980
1988-12-28  1988     12   28   11847          363            3    1980
1988-12-29  1988     12   29   11704          364            4    1980
1988-12-30  1988     12   30   11837          365            5    1980
1988-12-31  1988     12   31    9133          366            6    1980

[7305 rows x 7 columns]
         births
1   9528.558333
2   9596.533333
3   9621.195833
4   9527.450000
5   9570.629167
6   9616.970833
7   9642.004167
8   9676.566667
9   9646.062500
10  9696.654167
11  9664.837500
12  9678.983333
13  9545.400000
14  9731.462500
15  9722.608333
16  9718.566667
17  9731.075000
18  9728.954167
19  9713.262500
20  9740.129167
21  9706.512500
22  9657.129167
23  9604.891667
24  9543.750000
25  9516.562500
26  9584.200000
27  9642.033333
28  9702.625000
29  9729.297778
30  9734.954545
31  9581.285714

为此,我们需要创建多级索引,来避免每月的日期被计算

print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday['decade']=birthday['year']//10*10
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)
birthday_by_date=birthday.pivot_table('births',index=[birthday.index.month,birthday.index.day],aggfunc='mean')
print(birthday_by_date)
>>>
      year  month  day  births  day_of_year  day_of_week
0     1969      1    1    8486            1            3
1     1969      1    2    9002            2            4
2     1969      1    3    9542            3            5
3     1969      1    4    8960            4            6
4     1969      1    5    8390            5            7
...    ...    ...  ...     ...          ...          ...
7300  1988     12   27   11528          362            2
7301  1988     12   28   11847          363            3
7302  1988     12   29   11704          364            4
7303  1988     12   30   11837          365            5
7304  1988     12   31    9133          366            6

[7305 rows x 6 columns]
<class 'numpy.int64'>    <class 'numpy.int64'> <class 'numpy.int64'>
            year  month  day  births  day_of_year  day_of_week  decade
1969-01-01  1969      1    1    8486            1            3    1960
1969-01-02  1969      1    2    9002            2            4    1960
1969-01-03  1969      1    3    9542            3            5    1960
1969-01-04  1969      1    4    8960            4            6    1960
1969-01-05  1969      1    5    8390            5            7    1960
...          ...    ...  ...     ...          ...          ...     ...
1988-12-27  1988     12   27   11528          362            2    1980
1988-12-28  1988     12   28   11847          363            3    1980
1988-12-29  1988     12   29   11704          364            4    1980
1988-12-30  1988     12   30   11837          365            5    1980
1988-12-31  1988     12   31    9133          366            6    1980

[7305 rows x 7 columns]
         births
1  1    8018.45
   2    8494.80
   3    9001.80
   4    9142.70
   5    9207.25
...         ...
12 27   9700.30
   28  10088.40
   29  10240.30
   30  10344.70
   31   9718.40

[366 rows x 1 columns

最后绘图即可

print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday['decade']=birthday['year']//10*10
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)
birthday_by_date=birthday.pivot_table('births',index=[birthday.index.month,birthday.index.day],aggfunc='mean')
print(birthday_by_date)
fig,ax=plt.subplots(figsize=(12,4))
birthday_by_date.plot(ax=ax)
plt.show()
>>>
      year  month  day  births  day_of_year  day_of_week
0     1969      1    1    8486            1            3
1     1969      1    2    9002            2            4
2     1969      1    3    9542            3            5
3     1969      1    4    8960            4            6
4     1969      1    5    8390            5            7
...    ...    ...  ...     ...          ...          ...
7300  1988     12   27   11528          362            2
7301  1988     12   28   11847          363            3
7302  1988     12   29   11704          364            4
7303  1988     12   30   11837          365            5
7304  1988     12   31    9133          366            6

[7305 rows x 6 columns]
<class 'numpy.int64'>    <class 'numpy.int64'> <class 'numpy.int64'>
            year  month  day  births  day_of_year  day_of_week  decade
1969-01-01  1969      1    1    8486            1            3    1960
1969-01-02  1969      1    2    9002            2            4    1960
1969-01-03  1969      1    3    9542            3            5    1960
1969-01-04  1969      1    4    8960            4            6    1960
1969-01-05  1969      1    5    8390            5            7    1960
...          ...    ...  ...     ...          ...          ...     ...
1988-12-27  1988     12   27   11528          362            2    1980
1988-12-28  1988     12   28   11847          363            3    1980
1988-12-29  1988     12   29   11704          364            4    1980
1988-12-30  1988     12   30   11837          365            5    1980
1988-12-31  1988     12   31    9133          366            6    1980

[7305 rows x 7 columns]
         births
1  1    8018.45
   2    8494.80
   3    9001.80
   4    9142.70
   5    9207.25
...         ...
12 27   9700.30
   28  10088.40
   29  10240.30
   30  10344.70
   31   9718.40

[366 rows x 1 columns]

在这里插入图片描述

我们发现,每逢美国的节假日出生率就会降低,例如:独立日,劳动节,圣诞节和新年

这种现象可能是由于医院放假到底到医院接生的人变少所导致的

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值