文章目录
Pandas数据透视表
前面我们讲解了Pandas的GroupBy对象,并且结合行星的例子讲解了Groupby是如何探索数据集内部的关联性了
数据透视表(pivot table)是和groupby操作类似的一种操作方法,常见于Excel与类似的表格应用中
数据透视表将每一列数据作为输入,输出为将数据不断细分成多个维度累计信息的二维数据表
数据表像一种多维的GroupBy累计操作,因为对数据透视表的分割与组合发生在二维网格上
和上一节类似,这次依旧首先讲解数据透视表的相关知识,最后基于讲解的数据透视表的知识分析实际案例
准备工作
本节在最后使用的数据集包括:
- 泰坦尼克号乘客的信息数据库
- 美国疾病预防中心(CDC)提供的公开的美国人生日数据
这里为了避免网络相关问题,我已经把数据集搬运到了百度网盘上
此外,由于书中提供的原版美国人生日数据的下载网站无法下载,这里是我找到的另外一个版本
什么是数据透视表
数据透视表就是按照我们的需要,将原数据表按照我们所希望的列重新分割组合所得到的新的数据表
例如在前面的行星的分析中,我们使用groupby方法指定通过method和decade两个维度对数据进行切割,最终得到一个不同年代各种方法发现行星数量的一张表,不妨将其记为A
则A表就是一张数据透视表,我们透过原有数据得到的数据表
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
planets=pd.read_csv('planets.csv')
decade=10*(planets['year']//10)
decade=decade.astype(str)+'s'
decade.name='decade'
print(planets)
print(planets.groupby(by=['method',decade])['number'].sum().unstack().fillna(0))
print(type(planets.groupby(by=['method',decade])['number'].sum().unstack().fillna(0)))
>>>
decade 1980s 1990s 2000s 2010s
method
Astrometry 0.0 0.0 0.0 2.0
Eclipse Timing Variations 0.0 0.0 5.0 10.0
Imaging 0.0 0.0 29.0 21.0
Microlensing 0.0 0.0 12.0 15.0
Orbital Brightness Modulation 0.0 0.0 0.0 5.0
Pulsar Timing 0.0 9.0 1.0 1.0
Pulsation Timing Variations 0.0 0.0 1.0 0.0
Radial Velocity 1.0 52.0 475.0 424.0
Transit 0.0 0.0 64.0 712.0
Transit Timing Variations 0.0 0.0 0.0 9.0
<class 'pandas.core.frame.DataFrame'>
手工制作一张数据透视表
明白什么是数据透视表之后,我们不妨通过手工的方法来制作一张数据透视表,即指定groupby操作的对象是两列
我们首先制作一张数据表,然后重新为其中的两列赋予存在内在关系的数据
接下来使用groupby方法,指定两列来作为分割对象
DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
GroupBy_1=DataFrame_1.groupby(by=['sexual','nationality'])
print(GroupBy_1.aggregate('mean').unstack())
>>>
sexual nationality c d e f g
A man cn 9 6 1 2 5
B woman us 7 7 1 1 4
C man uk 8 6 2 0 5
A woman cn 8 1 0 1 3
B woman us 6 3 8 6 1
C man us 6 1 8 6 4
c d e f g
nationality cn uk us cn uk us cn uk us cn uk us cn uk us
sexual
man 9.0 8.0 6.0 6.0 6.0 1.0 1.0 2.0 8.0 2.0 0.0 6.0 5.0 5.0 4.0
woman 8.0 NaN 6.5 1.0 NaN 5.0 0.0 NaN 4.5 1.0 NaN 3.5 3.0 NaN 2.5
数据透视表语法
实际上,由于像上面一样的groupby操作来获得一个数据透视表的操作非常常见
所以Pandas提供了一个pivot_table方法来快速制作一张数据透视表
语法如下
DataFrame对象名.pivot_table(value,index,columns,aggfunc,dropna,fill_value)
其中index和columns必须指定,其他的可以省略
所以上面手工制作的数据表可以用如下语句实现
DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index='sexual',columns='nationality'))
>>>
sexual nationality c d e f g
A man cn 1 0 0 9 0
B woman us 3 2 3 4 4
C man uk 9 9 6 1 2
A woman cn 9 3 8 0 8
B woman us 9 1 6 6 8
C man us 9 8 3 0 5
c d e f g
nationality cn uk us cn uk us cn uk us cn uk us cn uk us
sexual
man 1.0 9.0 9.0 0.0 9.0 8.0 0.0 6.0 3.0 9.0 1.0 0.0 0.0 2.0 5.0
woman 9.0 NaN 6.0 3.0 NaN 1.5 8.0 NaN 4.5 0.0 NaN 5.0 8.0 NaN 6.0
多维数据透视表
前面讲过,对于高维数据,我们可以使用MultiIndex对象来用DataFrame对象表示高维数据
所以我们也能够创建多维数据透视表来表示高维数据
我们实际上只要给columns和index两个参数传递作为键的列表,类似于创建MultiIndex对象时候的from_tuple方法,就能创建出数据透视表,例如
DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
DataFrame_1.iloc[:,2]=['child','child','child','adult','child','child']
DataFrame_1.rename(columns={'c':'age'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index=['sexual','age'],columns='nationality'))
>>>
sexual nationality age d e f g
A man cn child 4 4 3 8
B woman us child 5 5 4 9
C man uk child 5 0 1 3
A woman cn adult 2 9 8 9
B woman us child 3 3 2 9
C man us child 2 5 8 4
d e f g
nationality cn uk us cn uk us cn uk us cn uk us
sexual age
man child 4.0 5.0 2.0 4.0 0.0 5.0 3.0 1.0 8.0 8.0 3.0 4.0
woman adult 2.0 NaN NaN 9.0 NaN NaN 8.0 NaN NaN 9.0 NaN NaN
child NaN NaN 4.0 NaN NaN 4.0 NaN NaN 3.0 NaN NaN 9.0
数据透视表的其他选项
DataFrame对象的pivot_table方法的完整签名如下
DataFrame.pivot_table(values = None,index = None,columns = None,aggfunc ='mean',fill_value = None,margin = False,dropna = True,margins_name ='All' )
其中第一个参数是指定需要聚合的对象,第二三个参数是指定数据透视表的行列索引,第四个参数是指定聚合的方式,默认是平均数
第五个参数是指定填充缺失值,margin参数用于计算每一组的总数,dropna用于删除缺失值,margin_name用于设动margin列的名字
下面给一些例子来帮助理解
value参数
DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
DataFrame_1.iloc[:,2]=['child','child','child','adult','child','child']
DataFrame_1.rename(columns={'c':'age'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index='sexual',columns='nationality'))
print(DataFrame_1.pivot_table(values='d',index='sexual',columns='nationality'))
>>>
sexual nationality age d e f g
A man cn child 3 0 3 3
B woman us child 2 8 4 0
C man uk child 1 9 7 1
A woman cn adult 4 7 0 7
B woman us child 2 8 0 2
C man us child 6 3 0 2
d e f g
nationality cn uk us cn uk us cn uk us cn uk us
sexual
man 3.0 1.0 6.0 0.0 9.0 3.0 3.0 7.0 0.0 3.0 1.0 2.0
woman 4.0 NaN 2.0 7.0 NaN 8.0 0.0 NaN 2.0 7.0 NaN 1.0
nationality cn uk us
sexual
man 3.0 1.0 6.0
woman 4.0 NaN 2.0
aggfunc参数
我们可以直接为aggfuc指定聚合的方法,这样将会运用到透视数据表的全部列上
DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index='sexual',columns='nationality',aggfunc='max'))
>>>
sexual nationality c d e f g
A man cn 1 2 3 4 0
B woman us 0 2 7 5 7
C man uk 2 4 3 4 5
A woman cn 6 9 5 1 2
B woman us 0 0 6 1 4
C man us 1 5 0 9 8
c d e f g
nationality cn uk us cn uk us cn uk us cn uk us cn uk us
sexual
man 1.0 2.0 1.0 2.0 4.0 5.0 3.0 3.0 0.0 4.0 4.0 9.0 0.0 5.0 8.0
woman 6.0 NaN 0.0 9.0 NaN 2.0 5.0 NaN 7.0 1.0 NaN 5.0 2.0 NaN 7.0
我们也可以为aggfunc参数传入指定每一列聚合方法的字典
DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table(index='sexual',columns='nationality',aggfunc={'c':'max','d':'min','e':'mean'}))
>>>
sexual nationality c d e f g
A man cn 0 7 0 8 4
B woman us 3 0 3 5 0
C man uk 8 7 6 4 1
A woman cn 1 4 8 6 2
B woman us 5 6 5 3 5
C man us 1 4 9 2 8
c d e
nationality cn uk us cn uk us cn uk us
sexual
man 0.0 8.0 1.0 7.0 7.0 4.0 0.0 6.0 9.0
woman 1.0 NaN 5.0 4.0 NaN 0.0 8.0 NaN 4.0
margin参数
margin参数可以计算每一组的每列,每行的聚合值
DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table('d',index='sexual',columns='nationality',margins=True))
>>>
sexual nationality c d e f g
A man cn 8 8 7 1 9
B woman us 0 9 5 4 2
C man uk 2 0 9 0 0
A woman cn 2 7 7 7 6
B woman us 4 0 0 7 1
C man us 0 6 8 1 0
nationality cn uk us All
sexual
man 8.0 0.0 6.0 4.666667
woman 7.0 NaN 4.5 5.333333
All 7.5 0.0 5.0 5.000000
margins_name参数
margins_name参数用与为统计行 / 列设置行列索引
DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,7)),index=list('ABCABC'),columns=list('abcdefg'))
DataFrame_1.iloc[:,0]=['man','woman','man','woman','woman','man']
DataFrame_1.rename(columns={'a':'sexual'},inplace=True)
DataFrame_1.iloc[:,1]=['cn','us','uk','cn','us','us']
DataFrame_1.rename(columns={'b':'nationality'},inplace=True)
print(DataFrame_1)
print(DataFrame_1.pivot_table('d',index='sexual',columns='nationality',margins=True,margins_name='total'))
>>>
sexual nationality c d e f g
A man cn 3 9 6 9 0
B woman us 9 0 6 1 0
C man uk 4 9 0 9 8
A woman cn 8 0 4 7 7
B woman us 7 3 4 5 4
C man us 1 5 4 0 5
nationality cn uk us total
sexual
man 9.0 9.0 5.000000 7.666667
woman 0.0 NaN 1.500000 1.000000
total 4.5 9.0 2.666667 4.333333
泰坦尼克号生还率分析
根据前面讲解的数据透视表,下面将对泰坦尼克号这个数据集进行分析
整体情况
同样上来先对数据集整体进行分析
这里对数据表进行转置,以便于不会折行显示
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
titanic=pd.read_csv('titanic.csv')
print(titanic.head().T)
print(titanic.describe())
>>>
0 1 2 3 4
survived 0 1 1 1 0
pclass 3 1 3 1 3
sex male female female female male
age 22 38 26 35 35
sibsp 1 1 0 1 0
parch 0 0 0 0 0
fare 7.25 71.2833 7.925 53.1 8.05
embarked S C S S S
class Third First Third First Third
who man woman woman woman man
adult_male True False False False True
deck NaN C NaN C NaN
embark_town Southampton Cherbourg Southampton Southampton Southampton
alive no yes yes yes no
alone False False True False True
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
男女船舱登记生还率
接下来使用数据透视表,将原有数据表重新划分为一张透视表来查看不同性别及不同船舱的生还率
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
titanic=pd.read_csv('titanic.csv')
print(titanic.pivot_table('survived',index='who',columns='class'))
>>>
class First Second Third
who
child 0.833333 1.000000 0.431034
man 0.352941 0.080808 0.119122
woman 0.978022 0.909091 0.491228
我们发现,除了儿童以外,船舱登记越高,生还率越高,尤其是女性
排除异常
上面我们发现二等舱的儿童生还率竟然达到了100%,这是一个相对来说异常的数字,我们自然要去查看下所有的二等舱的儿童
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
titanic=pd.read_csv('titanic.csv')
# print(titanic.head().T)
# print(titanic.describe())
# print(titanic.pivot_table('survived',index='who',columns='class'))
child=titanic.set_index('who').T['child'].T
print(child.reset_index().pivot_table('survived',index='who',columns='class',aggfunc='sum'))
print(child.reset_index().pivot_table('survived',index='who',columns='class',aggfunc='count'))
>>>
class First Second Third
who
child 5 19 25
class First Second Third
who
child 6 19 58
我们发现之所以一等舱的儿童生还率低于二等舱儿童生还率,一方面是因为一等舱位的儿童太少了,另外一方面确实是二等舱的儿童全部得救
美国人生日分析
下面我们将使用上面的美国人生日来进行分析
整体情况
首先还是了解一下整个数据集的情况
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
birthday=pd.read_csv('birthdates-1968-1988.csv')
print(birthday)
print(birthday.describe())
>>>
year month day births day_of_year day_of_week
0 1969 1 1 8486 1 3
1 1969 1 2 9002 2 4
2 1969 1 3 9542 3 5
3 1969 1 4 8960 4 6
4 1969 1 5 8390 5 7
... ... ... ... ... ... ...
7300 1988 12 27 11528 362 2
7301 1988 12 28 11847 363 3
7302 1988 12 29 11704 364 4
7303 1988 12 30 11837 365 5
7304 1988 12 31 9133 366 6
[7305 rows x 6 columns]
year month day births day_of_year day_of_week
count 7305.000000 7305.000000 7305.000000 7305.000000 7305.000000 7305.000000
mean 1978.501027 6.522930 15.729637 9648.940178 183.753593 4.000274
std 5.766735 3.448939 8.800694 1127.315229 105.621885 1.999795
min 1969.000000 1.000000 1.000000 6675.000000 1.000000 1.000000
25% 1974.000000 4.000000 8.000000 8792.000000 93.000000 2.000000
50% 1979.000000 7.000000 16.000000 9622.000000 184.000000 4.000000
75% 1984.000000 10.000000 23.000000 10510.000000 275.000000 6.000000
max 1988.000000 12.000000 31.000000 12851.000000 366.000000 7.000000
每年的人口出生情况
接下来看一下每年的人口出生情况
由于这里给定的数据表是每天的人口数,所以我们首先需要计算每年出生的人口数,然后绘制出每年出生人口的折线图
首先要计算每年出生的人口,那么就需要使用sum来对每个年份出生的人口求和
而且从上面对数据集的整体情况的了解中我们知道,每年中每一天出生的人口和年份是关联的
所以我们可以使用数据透视表来指定以年份来分割原表的同时,使用sum来对births求和,这样将会返回一个Series对象
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
birthday=pd.read_csv('birthdates-1968-1988.csv')
birthnumber=birthday.pivot_table('births',index='year',aggfunc='sum')
print(birthnumber)
plt.plot(birthnumber.reset_index()['year'],birthnumber['births'])
plt.xlabel('year')
plt.ylabel('number')
plt.title('Birth Number of Every Year')
plt.show()
注意,这里虽然我们指定了每个数据点的横轴,但是绘图的时候matplotlib却依旧对年份区间取了均值,实际上这是由于matplotlib的默认特性,关于如何解决这一点,将在后面matplotlib的讲解中讲解
创建日期索引
为了承接后面的Pandas向量化字符串操作和Pandas处理时间序列两节内容,我们下面不妨先为每个数据创建一个日期索引
print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)
>>>
year month day births day_of_year day_of_week
0 1969 1 1 8486 1 3
1 1969 1 2 9002 2 4
2 1969 1 3 9542 3 5
3 1969 1 4 8960 4 6
4 1969 1 5 8390 5 7
... ... ... ... ... ... ...
7300 1988 12 27 11528 362 2
7301 1988 12 28 11847 363 3
7302 1988 12 29 11704 364 4
7303 1988 12 30 11837 365 5
7304 1988 12 31 9133 366 6
[7305 rows x 6 columns]
<class 'numpy.int64'> <class 'numpy.int64'> <class 'numpy.int64'>
year month day births day_of_year day_of_week
1969-01-01 1969 1 1 8486 1 3
1969-01-02 1969 1 2 9002 2 4
1969-01-03 1969 1 3 9542 3 5
1969-01-04 1969 1 4 8960 4 6
1969-01-05 1969 1 5 8390 5 7
... ... ... ... ... ... ...
1988-12-27 1988 12 27 11528 362 2
1988-12-28 1988 12 28 11847 363 3
1988-12-29 1988 12 29 11704 364 4
1988-12-30 1988 12 30 11837 365 5
1988-12-31 1988 12 31 9133 366 6
[7305 rows x 6 columns]
注意,我们这里使用了to_datetime方法将会返回一个datetime对象,这是后面会在的Pandas处理时间序列这节中详细讲解
所以我们直接调用datetime对象的dayofweek方法来快速计算这个日期是周几
print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
birthday['day_of_week_2']=birthday.index.dayofweek
print(birthday)
>>>
year month day births day_of_year day_of_week
0 1969 1 1 8486 1 3
1 1969 1 2 9002 2 4
2 1969 1 3 9542 3 5
3 1969 1 4 8960 4 6
4 1969 1 5 8390 5 7
... ... ... ... ... ... ...
7300 1988 12 27 11528 362 2
7301 1988 12 28 11847 363 3
7302 1988 12 29 11704 364 4
7303 1988 12 30 11837 365 5
7304 1988 12 31 9133 366 6
[7305 rows x 6 columns]
<class 'numpy.int64'> <class 'numpy.int64'> <class 'numpy.int64'>
year month day births day_of_year day_of_week day_of_week_2
1969-01-01 1969 1 1 8486 1 3 2
1969-01-02 1969 1 2 9002 2 4 3
1969-01-03 1969 1 3 9542 3 5 4
1969-01-04 1969 1 4 8960 4 6 5
1969-01-05 1969 1 5 8390 5 7 6
... ... ... ... ... ... ... ...
1988-12-27 1988 12 27 11528 362 2 1
1988-12-28 1988 12 28 11847 363 3 2
1988-12-29 1988 12 29 11704 364 4 3
1988-12-30 1988 12 30 11837 365 5 4
1988-12-31 1988 12 31 9133 366 6 5
[7305 rows x 7 columns]
但是这里出现了一些问题,可能是由于低层计算的问题
不同年代的星期日均出生数
利用上面得到的day_of_week这一列,我们可以看看不同年代不同星期的日均出生
这里直接用原有的day_of_week
print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday['decade']=birthday['year']//10*10
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)
PviotTable_decade_week=birthday.pivot_table('births',index='day_of_week',columns='decade',aggfunc='mean')
print(PviotTable_decade_week)
PviotTable_decade_week.plot()
plt.gca().set_xticklabels(['a','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
plt.ylabel('Mean Births by day')
plt.show()
>>>
decade 1960 1970 1980
day_of_week
1 10127.653846 9378.195402 10553.814499
2 10572.192308 9770.504798 11007.685106
3 10149.245283 9500.752399 10735.285106
4 9956.576923 9393.846743 10666.970213
5 10215.769231 9564.191571 10786.174468
6 9302.115385 8415.568966 8967.802128
7 8684.692308 7958.557471 8616.240938
得到图像如下:
这里由于直接使用set_xticklabels方法的话,传入列表的第一个值会被丢弃,所以才会添加一个‘a’
此外Matplotlib是完美支持Pandas的,所以我们可以直接调用DataFrame对象的plot方法来绘图,并且会直接添加图例
DataFrame_1=pd.DataFrame(np.random.randint(0,10,(6,4)))
Series_1=pd.Series(np.random.randint(0,10,10))
print(DataFrame_1)
DataFrame_1.plot()
plt.show()
>>>
0 1 2 3
0 6 4 2 1
1 3 6 4 1
2 8 9 8 6
3 8 7 8 3
4 1 5 7 6
5 8 9 3 6
各年份平均每天的出生数
我们如果忽略年份的差异,而去查看各年份平均每天的出生数,那么将会从图表中发现出生数和一年中每日的关系
分析一下,我们将对每年中的各日的出生人数求平均数所以value参数就是birth,然后为了便于画图,需要将日期作为行索引,所以我们依旧需要使用数据透视表
但是有一个问题,如果我们直接指定使用的日期为日的话,那么同一年的1~12月也会被计算
print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday['decade']=birthday['year']//10*10
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)
birthday_by_date=birthday.pivot_table('births',index=birthday.index.day,aggfunc='mean')
# birthday_by_date=birthday.pivot_table('births',index=[birthday.index.month,birthday.index.day],aggfunc='mean')
print(birthday_by_date)
>>>
year month day births day_of_year day_of_week
0 1969 1 1 8486 1 3
1 1969 1 2 9002 2 4
2 1969 1 3 9542 3 5
3 1969 1 4 8960 4 6
4 1969 1 5 8390 5 7
... ... ... ... ... ... ...
7300 1988 12 27 11528 362 2
7301 1988 12 28 11847 363 3
7302 1988 12 29 11704 364 4
7303 1988 12 30 11837 365 5
7304 1988 12 31 9133 366 6
[7305 rows x 6 columns]
<class 'numpy.int64'> <class 'numpy.int64'> <class 'numpy.int64'>
year month day births day_of_year day_of_week decade
1969-01-01 1969 1 1 8486 1 3 1960
1969-01-02 1969 1 2 9002 2 4 1960
1969-01-03 1969 1 3 9542 3 5 1960
1969-01-04 1969 1 4 8960 4 6 1960
1969-01-05 1969 1 5 8390 5 7 1960
... ... ... ... ... ... ... ...
1988-12-27 1988 12 27 11528 362 2 1980
1988-12-28 1988 12 28 11847 363 3 1980
1988-12-29 1988 12 29 11704 364 4 1980
1988-12-30 1988 12 30 11837 365 5 1980
1988-12-31 1988 12 31 9133 366 6 1980
[7305 rows x 7 columns]
births
1 9528.558333
2 9596.533333
3 9621.195833
4 9527.450000
5 9570.629167
6 9616.970833
7 9642.004167
8 9676.566667
9 9646.062500
10 9696.654167
11 9664.837500
12 9678.983333
13 9545.400000
14 9731.462500
15 9722.608333
16 9718.566667
17 9731.075000
18 9728.954167
19 9713.262500
20 9740.129167
21 9706.512500
22 9657.129167
23 9604.891667
24 9543.750000
25 9516.562500
26 9584.200000
27 9642.033333
28 9702.625000
29 9729.297778
30 9734.954545
31 9581.285714
为此,我们需要创建多级索引,来避免每月的日期被计算
print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday['decade']=birthday['year']//10*10
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)
birthday_by_date=birthday.pivot_table('births',index=[birthday.index.month,birthday.index.day],aggfunc='mean')
print(birthday_by_date)
>>>
year month day births day_of_year day_of_week
0 1969 1 1 8486 1 3
1 1969 1 2 9002 2 4
2 1969 1 3 9542 3 5
3 1969 1 4 8960 4 6
4 1969 1 5 8390 5 7
... ... ... ... ... ... ...
7300 1988 12 27 11528 362 2
7301 1988 12 28 11847 363 3
7302 1988 12 29 11704 364 4
7303 1988 12 30 11837 365 5
7304 1988 12 31 9133 366 6
[7305 rows x 6 columns]
<class 'numpy.int64'> <class 'numpy.int64'> <class 'numpy.int64'>
year month day births day_of_year day_of_week decade
1969-01-01 1969 1 1 8486 1 3 1960
1969-01-02 1969 1 2 9002 2 4 1960
1969-01-03 1969 1 3 9542 3 5 1960
1969-01-04 1969 1 4 8960 4 6 1960
1969-01-05 1969 1 5 8390 5 7 1960
... ... ... ... ... ... ... ...
1988-12-27 1988 12 27 11528 362 2 1980
1988-12-28 1988 12 28 11847 363 3 1980
1988-12-29 1988 12 29 11704 364 4 1980
1988-12-30 1988 12 30 11837 365 5 1980
1988-12-31 1988 12 31 9133 366 6 1980
[7305 rows x 7 columns]
births
1 1 8018.45
2 8494.80
3 9001.80
4 9142.70
5 9207.25
... ...
12 27 9700.30
28 10088.40
29 10240.30
30 10344.70
31 9718.40
[366 rows x 1 columns
最后绘图即可
print(birthday)
print(type(birthday.iloc[0,1]),'\t',type(birthday.iloc[0,2]),type(birthday.iloc[0,3]))
birthday['decade']=birthday['year']//10*10
birthday.index=pd.to_datetime(10000*birthday['year']+100*birthday['month']+birthday['day'],format='%Y%m%d')
print(birthday)
birthday_by_date=birthday.pivot_table('births',index=[birthday.index.month,birthday.index.day],aggfunc='mean')
print(birthday_by_date)
fig,ax=plt.subplots(figsize=(12,4))
birthday_by_date.plot(ax=ax)
plt.show()
>>>
year month day births day_of_year day_of_week
0 1969 1 1 8486 1 3
1 1969 1 2 9002 2 4
2 1969 1 3 9542 3 5
3 1969 1 4 8960 4 6
4 1969 1 5 8390 5 7
... ... ... ... ... ... ...
7300 1988 12 27 11528 362 2
7301 1988 12 28 11847 363 3
7302 1988 12 29 11704 364 4
7303 1988 12 30 11837 365 5
7304 1988 12 31 9133 366 6
[7305 rows x 6 columns]
<class 'numpy.int64'> <class 'numpy.int64'> <class 'numpy.int64'>
year month day births day_of_year day_of_week decade
1969-01-01 1969 1 1 8486 1 3 1960
1969-01-02 1969 1 2 9002 2 4 1960
1969-01-03 1969 1 3 9542 3 5 1960
1969-01-04 1969 1 4 8960 4 6 1960
1969-01-05 1969 1 5 8390 5 7 1960
... ... ... ... ... ... ... ...
1988-12-27 1988 12 27 11528 362 2 1980
1988-12-28 1988 12 28 11847 363 3 1980
1988-12-29 1988 12 29 11704 364 4 1980
1988-12-30 1988 12 30 11837 365 5 1980
1988-12-31 1988 12 31 9133 366 6 1980
[7305 rows x 7 columns]
births
1 1 8018.45
2 8494.80
3 9001.80
4 9142.70
5 9207.25
... ...
12 27 9700.30
28 10088.40
29 10240.30
30 10344.70
31 9718.40
[366 rows x 1 columns]
我们发现,每逢美国的节假日出生率就会降低,例如:独立日,劳动节,圣诞节和新年
这种现象可能是由于医院放假到底到医院接生的人变少所导致的