Python for Data Analysis (4)

Chapter 10: Data Aggregation and Group Operations

  • Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)
  • Compute group summary statistics, like count, mean, or standard deviation, or a user-defined function
  • Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
  • Compute pivot tables and cross-tabulations
  • Perform quantile analysis and other statistical group analyses

10.1 GroupBy Mechanics

import numpy as np
import pandas as pd
df = pd.DataFrame({'data1':np.random.randn(5),
                  'data2':np.random.randn(5),
                  'key1':['a','a','b','b','a'],
                  'key2':['one','two','one','two','one']})
df
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
1 -0.419340  0.207110    a  two
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
4  0.498925  0.502884    a  one
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
key1
a    0.161569
b    0.280326
Name: data1, dtype: float64
means = df['data1'].groupby([df['key1'],df['key2']]).mean()
means
key1  key2
a     one     0.452023
      two    -0.419340
b     one     0.871916
      two    -0.311263
Name: data1, dtype: float64
means.unstack()
key2       one       two
key1                    
a     0.452023 -0.419340
b     0.871916 -0.311263
states = np.array(['ohio','california','california','ohio','ohio'])
years = np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
california  2005   -0.419340
            2006    0.871916
ohio        2005    0.046929
            2006    0.498925
Name: data1, dtype: float64
#You can pass column names (whether those are strings, numbers, or other Python objects) as the group keys
df.groupby('key1').mean()
         data1     data2
key1                    
a     0.161569  0.314499
b     0.280326  0.068706
df.groupby(['key1','key2']).mean()
              data1     data2
key1 key2                    
a    one   0.452023  0.368194
     two  -0.419340  0.207110
b    one   0.871916 -0.476933
     two  -0.311263  0.614344
#size is a generally useful GroupBy method; it returns a Series containing the group sizes
#Note that any missing values in a group key are excluded from the result
#(the sketch after the output below shows how newer pandas can keep them)
df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
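If you want to keep missing keys as their own group, newer pandas (1.1+) accepts a dropna flag on groupby. A minimal sketch, assuming such a version is available:

df2 = df.copy()
df2.loc[0,'key1'] = np.nan                  # make one group key missing
df2.groupby('key1',dropna=False).size()     # the NaN key now appears as its own group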

10.1.1 Iterating over groups

  • The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.
for name,group in df.groupby('key1'):
    print(name)
    print(group)
a
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
1 -0.419340  0.207110    a  two
4  0.498925  0.502884    a  one
b
      data1     data2 key1 key2
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
df
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
1 -0.419340  0.207110    a  two
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
4  0.498925  0.502884    a  one
for (k1,k2),group in df.groupby(['key1','key2']):
    print((k1,k2))
    print(group)
('a', 'one')
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
4  0.498925  0.502884    a  one
('a', 'two')
     data1    data2 key1 key2
1 -0.41934  0.20711    a  two
('b', 'one')
      data1     data2 key1 key2
2  0.871916 -0.476933    b  one
('b', 'two')
      data1     data2 key1 key2
3 -0.311263  0.614344    b  two
pieces = dict(list(df.groupby('key1')))
pieces['b']
      data1     data2 key1 key2
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
#groupby groups on axis=0 by default, but you can group on any of the other axes;
#for example, we could group the columns of df here by dtype
#(note: the loop below reuses the grouped object created earlier from df['data1'].groupby(df['key1']);
#a sketch of the column-wise grouping follows the loop output below)
df.dtypes
data1    float64
data2    float64
key1      object
key2      object
dtype: object
for dtype,group in grouped:
    print(dtype)
    print(group)
a
0    0.405121
1   -0.419340
4    0.498925
Name: data1, dtype: float64
b
2    0.871916
3   -0.311263
Name: data1, dtype: float64
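The axis grouping that the comment describes is never actually constructed in these notes. A minimal sketch of it, assuming the pre-2.0 axis=1 keyword (deprecated in pandas 2.x):

for dtype,group in df.groupby(df.dtypes,axis=1):
    print(dtype)
    print(group)   # float64 -> data1/data2 columns, object -> key1/key2 columns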

10.1.2 Selecting a column or subset of columns

  • Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of subsetting those columns for aggregation.
  • "Syntactic sugar is a term coined by the British computer scientist Peter J. Landin for syntax added to a programming language that does not change its functionality but is more convenient for programmers to use. Syntactic sugar generally makes programs more readable and thus less error-prone."
df.groupby('key1')['data1']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000259B9B6D760>
df.groupby('key1')[['data2']]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B9B70370>
df['data1'].groupby(df['key1'])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000259B9B70610>
df[['data2']].groupby(df['key1'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B9B70FA0>
df.groupby(['key1','key2'])[['data2']].mean()
              data2
key1 key2          
a    one   0.368194
     two   0.207110
b    one  -0.476933
     two   0.614344
s_grouped = df.groupby(['key1','key2'])['data2']
s_grouped.mean()
key1  key2
a     one     0.368194
      two     0.207110
b     one    -0.476933
      two     0.614344
Name: data2, dtype: float64

10.1.3 Grouping with dicts and Series

people = pd.DataFrame(np.random.randn(5,5),
                     columns = ['a','b','c','d','e'],
                     index = ['joe','steve','wes','jim','travis'])
people
               a         b         c         d         e
joe     1.176984 -0.779043  1.232603  0.811508 -0.114978
steve  -0.162964  0.936905 -0.671911 -0.660697  0.297713
wes    -1.154320 -0.351915 -2.286200 -0.200603  0.990375
jim    -0.496740  0.474151  0.058887  2.775079 -0.615440
travis -0.737061  1.470497 -0.173087  0.766692 -0.234068
people.iloc[2:3,[1,4]] = np.nan
people
               a         b         c         d         e
joe     1.176984 -0.779043  1.232603  0.811508 -0.114978
steve  -0.162964  0.936905 -0.671911 -0.660697  0.297713
wes    -1.154320       NaN -2.286200 -0.200603       NaN
jim    -0.496740  0.474151  0.058887  2.775079 -0.615440
travis -0.737061  1.470497 -0.173087  0.766692 -0.234068
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
by_column = people.groupby(mapping,axis=1)
by_column.sum()
            blue       red
joe     2.044111  0.282963
steve  -1.332608  1.071654
wes    -2.486803 -1.154320
jim     2.833967 -0.638030
travis  0.593605  0.499368
map_series = pd.Series(mapping)
map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
people.groupby(map_series,axis=1).count()
        blue  red
joe        2    3
steve      2    3
wes        2    1
jim        2    3
travis     2    3

10.1.4 Grouping with functions

people
               a         b         c         d         e
joe     1.176984 -0.779043  1.232603  0.811508 -0.114978
steve  -0.162964  0.936905 -0.671911 -0.660697  0.297713
wes    -1.154320       NaN -2.286200 -0.200603       NaN
jim    -0.496740  0.474151  0.058887  2.775079 -0.615440
travis -0.737061  1.470497 -0.173087  0.766692 -0.234068
people.groupby(len).sum()
          a         b         c         d         e
3 -0.474076 -0.304892 -0.994710  3.385984 -0.730418
5 -0.162964  0.936905 -0.671911 -0.660697  0.297713
6 -0.737061  1.470497 -0.173087  0.766692 -0.234068
key_list = ['one','one','one','two','two']
people.groupby([len,key_list]).min()
              a         b         c         d         e
3 one -1.154320 -0.779043 -2.286200 -0.200603 -0.114978
  two -0.496740  0.474151  0.058887  2.775079 -0.615440
5 one -0.162964  0.936905 -0.671911 -0.660697  0.297713
6 two -0.737061  1.470497 -0.173087  0.766692 -0.234068

10.1.5 Grouping by index levels

hire_df = pd.DataFrame(np.random.randn(4,5),
                       columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
                                                          [1,3,5,1,3]],
                                                         names=['cty','tenor']))
hire_df
cty          US                            JP          
tenor         1         3         5         1         3
0      0.840245  2.187988 -0.393045 -0.150897  1.573875
1      1.329124  0.070597 -0.246468 -0.147710 -0.397490
2      0.362198  0.404290 -0.233373 -0.406261  0.812492
3      1.104010 -0.485481  0.439878 -1.948048  0.784266
hire_df.groupby(level='cty',axis=1).count()
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
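Index levels can also be referred to by number rather than name; a minimal sketch grouping on the inner tenor level instead:

hire_df.groupby(level='tenor',axis=1).count()   # or equivalently level=1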

10.2 Data Aggregation

  • Optimized groupby methods:
Function     Description
count        Number of non-NA values in the group
sum          Sum of non-NA values
mean         Mean of non-NA values
median       Arithmetic median of non-NA values
std, var     Unbiased (n-1 denominator) standard deviation and variance
min, max     Minimum and maximum of non-NA values
prod         Product of non-NA values
first, last  First and last non-NA values
df
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
1 -0.419340  0.207110    a  two
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
4  0.498925  0.502884    a  one
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
key1
a    0.480164
b    0.753598
Name: data1, dtype: float64
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
         data1     data2
key1                    
a     0.918265  0.295774
b     1.183179  1.091277
grouped.describe()
     data1                                                                          data2                                                                        
     count      mean       std       min       25%       50%       75%       max   count      mean       std       min       25%       50%       75%       max
key1                                                                                                                                                           
a      3.0  0.161569  0.505263 -0.419340 -0.007109  0.405121  0.452023  0.498925     3.0  0.314499  0.163679  0.207110  0.220307  0.233504  0.368194  0.502884
b      2.0  0.280326  0.836634 -0.311263 -0.015468  0.280326  0.576121  0.871916     2.0  0.068706  0.771649 -0.476933 -0.204113  0.068706  0.341525  0.614344

10.2.1 Column-wise and multiple function application

tips = pd.read_csv('examples/tips.csv')
tips.head()
   total_bill   tip smoker  day    time  size
0       16.99  1.01     No  Sun  Dinner     2
1       10.34  1.66     No  Sun  Dinner     3
2       21.01  3.50     No  Sun  Dinner     3
3       23.68  3.31     No  Sun  Dinner     2
4       24.59  3.61     No  Sun  Dinner     4
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips[:8]
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
5       25.29  4.71     No  Sun  Dinner     4  0.186240
6        8.77  2.00     No  Sun  Dinner     2  0.228050
7       26.88  3.12     No  Sun  Dinner     4  0.116071
grouped = tips.groupby(['day','smoker'])
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B96733A0>
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
#If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions
grouped_pct.agg(['mean','std',peak_to_peak])
                 mean       std  peak_to_peak
day  smoker                                  
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240
grouped_pct.agg([('foo','mean'),('bar',np.std)])
                  foo       bar
day  smoker                    
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
functions = ['count','mean','max']
result = grouped['tip_pct','total_bill'].agg(functions)
result
<ipython-input-74-05231bb63e58>:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  result = grouped['tip_pct','total_bill'].agg(functions)
            tip_pct                      total_bill                  
              count      mean       max       count       mean    max
day  smoker                                                          
Fri  No           4  0.151650  0.187735           4  18.420000  22.75
     Yes         15  0.174783  0.263480          15  16.813333  40.17
Sat  No          45  0.158048  0.291990          45  19.661778  48.33
     Yes         42  0.147906  0.325733          42  21.276667  50.81
Sun  No          57  0.160113  0.252672          57  20.506667  48.17
     Yes         19  0.187250  0.710345          19  24.120000  45.35
Thur No          45  0.160298  0.266312          45  17.113111  41.19
     Yes         17  0.163863  0.241255          17  19.190588  43.11
result['tip_pct']
             count      mean       max
day  smoker                           
Fri  No          4  0.151650  0.187735
     Yes        15  0.174783  0.263480
Sat  No         45  0.158048  0.291990
     Yes        42  0.147906  0.325733
Sun  No         57  0.160113  0.252672
     Yes        19  0.187250  0.710345
Thur No         45  0.160298  0.266312
     Yes        17  0.163863  0.241255
ftuples = [('Durch','mean'),('abwe',np.var)]
grouped['tip_pct','total_bill'].agg(ftuples)
<ipython-input-76-ce86ef210451>:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  grouped['tip_pct','total_bill'].agg(ftuples)
              tip_pct            total_bill            
                Durch      abwe       Durch        abwe
day  smoker                                            
Fri  No      0.151650  0.000791   18.420000   25.596333
     Yes     0.174783  0.002631   16.813333   82.562438
Sat  No      0.158048  0.001581   19.661778   79.908965
     Yes     0.147906  0.003767   21.276667  101.387535
Sun  No      0.160113  0.001793   20.506667   66.099980
     Yes     0.187250  0.023757   24.120000  109.046044
Thur No      0.160298  0.001503   17.113111   59.625081
     Yes     0.163863  0.001551   19.190588   69.808518
#The DataFrame will have hierarchical columns only if multiple functions are applied to at least one column
grouped.agg({'tip':np.max,'size':'sum'})
               tip  size
day  smoker             
Fri  No       3.50     9
     Yes      4.73    31
Sat  No       9.00   115
     Yes     10.00   104
Sun  No       6.00   167
     Yes      6.50    49
Thur No       6.70   112
     Yes      5.00    40

10.2.2 Returning aggregated data without row indexes

  • You can obtain the same result by calling reset_index on the result (a sketch follows the output below); using as_index=False avoids some unnecessary computations.
tips.groupby(['day','smoker'],as_index=False).mean()
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
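A minimal sketch of the reset_index route mentioned in the bullet above, which produces the same frame by aggregating first and then moving the group keys back into columns:

tips.groupby(['day','smoker']).mean().reset_index()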

10.3 Apply: General split-apply-combine

import pandas as pd
tips = pd.read_csv('examples/tips.csv')
tips.head()
   total_bill   tip smoker  day    time  size
0       16.99  1.01     No  Sun  Dinner     2
1       10.34  1.66     No  Sun  Dinner     3
2       21.01  3.50     No  Sun  Dinner     3
3       23.68  3.31     No  Sun  Dinner     2
4       24.59  3.61     No  Sun  Dinner     4
tips['tip_pct'] = tips['tip']/(tips['total_bill']-tips['tip'])
#Suppose you wanted to select the top five tip_pct values by group.
#First, write a function that selects the rows with the largest values in a particular column
def top(df,n=5,column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips,n=6)
     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.387973
183       23.17  6.50    Yes  Sun  Dinner     4  0.389922
232       11.61  3.39     No  Sat  Dinner     2  0.412409
67         3.07  1.00    Yes  Sat  Dinner     1  0.483092
178        9.60  4.00    Yes  Sun  Dinner     2  0.714286
172        7.25  5.15    Yes  Sun  Dinner     2  2.452381
#The top function is called on each group of the DataFrame, and then the results
#are glued together using pandas.concat, labeling the pieces with the group names.
#The result therefore has a hierarchical index whose inner level contains index
#values from the original DataFrame
tips.groupby('smoker').apply(top)
            total_bill   tip smoker   day    time  size   tip_pct
smoker                                                           
No     88        24.71  5.85     No  Thur   Lunch     2  0.310180
       185       20.69  5.00     No   Sun  Dinner     5  0.318674
       51        10.29  2.60     No   Sun  Dinner     2  0.338101
       149        7.51  2.00     No  Thur   Lunch     2  0.362976
       232       11.61  3.39     No   Sat  Dinner     2  0.412409
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.387973
       183       23.17  6.50    Yes   Sun  Dinner     4  0.389922
       67         3.07  1.00    Yes   Sat  Dinner     1  0.483092
       178        9.60  4.00    Yes   Sun  Dinner     2  0.714286
       172        7.25  5.15    Yes   Sun  Dinner     2  2.452381
#If you pass other arguments or keywords to apply besides the function itself, you can pass them after the function
tips.groupby(['smoker','day']).apply(top,n=1,column='total_bill')
                 total_bill    tip smoker   day    time  size   tip_pct
smoker day                                                             
No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.166667
       Sat  212       48.33   9.00     No   Sat  Dinner     4  0.228833
       Sun  156       48.17   5.00     No   Sun  Dinner     6  0.115821
       Thur 142       41.19   5.00     No  Thur   Lunch     5  0.138160
Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.133465
       Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.245038
       Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.083632
       Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.131199
result = tips.groupby('smoker')['tip_pct'].describe()
result
        count      mean       std       min       25%       50%       75%       max
smoker                                                                             
No      151.0  0.192237  0.057665  0.060217  0.158622  0.184308  0.227015  0.412409
Yes      93.0  0.218176  0.254295  0.036955  0.119534  0.181818  0.242326  2.452381
result.unstack('smoker')
       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.192237
       Yes         0.218176
std    No          0.057665
       Yes         0.254295
min    No          0.060217
       Yes         0.036955
25%    No          0.158622
       Yes         0.119534
50%    No          0.184308
       Yes         0.181818
75%    No          0.227015
       Yes         0.242326
max    No          0.412409
       Yes         2.452381
dtype: float64
#Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for the following
f = lambda x :x.describe()
grouped.apply(f)
                   total_bill       tip  size   tip_pct
day  smoker                                            
Fri  No     count    4.000000  4.000000  4.00  4.000000
            mean    18.420000  2.812500  2.25  0.151650
            std      5.059282  0.898494  0.50  0.028123
            min     12.460000  1.500000  2.00  0.120385
            25%     15.100000  2.625000  2.00  0.137239
...                       ...       ...   ...       ...
Thur Yes    min     10.340000  2.000000  2.00  0.090014
            25%     13.510000  2.000000  2.00  0.148038
            50%     16.470000  2.560000  2.00  0.153846
            75%     19.810000  4.000000  2.00  0.194837
            max     43.110000  5.000000  4.00  0.241255

64 rows × 4 columns

10.3.1 Suppressing the group keys

  • You can see that the resulting object has a hierarchical index formed from the group keys along with the index of each piece of the original object.
  • You can disable this by passing group_keys=False to groupby
tips.groupby('smoker').apply(top)
            total_bill   tip smoker   day    time  size   tip_pct
smoker                                                           
No     88        24.71  5.85     No  Thur   Lunch     2  0.310180
       185       20.69  5.00     No   Sun  Dinner     5  0.318674
       51        10.29  2.60     No   Sun  Dinner     2  0.338101
       149        7.51  2.00     No  Thur   Lunch     2  0.362976
       232       11.61  3.39     No   Sat  Dinner     2  0.412409
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.387973
       183       23.17  6.50    Yes   Sun  Dinner     4  0.389922
       67         3.07  1.00    Yes   Sat  Dinner     1  0.483092
       178        9.60  4.00    Yes   Sun  Dinner     2  0.714286
       172        7.25  5.15    Yes   Sun  Dinner     2  2.452381
tips.groupby('smoker',group_keys=False).apply(top)
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.310180
185       20.69  5.00     No   Sun  Dinner     5  0.318674
51        10.29  2.60     No   Sun  Dinner     2  0.338101
149        7.51  2.00     No  Thur   Lunch     2  0.362976
232       11.61  3.39     No   Sat  Dinner     2  0.412409
109       14.31  4.00    Yes   Sat  Dinner     2  0.387973
183       23.17  6.50    Yes   Sun  Dinner     4  0.389922
67         3.07  1.00    Yes   Sat  Dinner     1  0.483092
178        9.60  4.00    Yes   Sun  Dinner     2  0.714286
172        7.25  5.15    Yes   Sun  Dinner     2  2.452381

10.3.2 Quantile and bucket analysis

  • pandas has some tools, in particular cut and qcut, for slicing data up into buckets with bins of your choosing or by sample quantiles.
  • Combining these functions with groupby makes it convenient to perform bucket or quantile analysis on a dataset (a qcut sketch follows at the end of this section)
frame = pd.DataFrame({'data1':np.random.randn(1000),
                     'data2':np.random.randn(1000)})
frame.head()
      data1     data2
0  1.038225 -0.225594
1 -1.142048  0.762311
2 -0.383751  0.319326
3  0.122939 -1.205274
4 -1.557777 -0.495632
quartiles = pd.cut(frame.data1,4)
quartiles[:10]
0     (-0.175, 1.544]
1    (-1.894, -0.175]
2    (-1.894, -0.175]
3     (-0.175, 1.544]
4    (-1.894, -0.175]
5    (-1.894, -0.175]
6    (-1.894, -0.175]
7     (-3.62, -1.894]
8    (-1.894, -0.175]
9     (-0.175, 1.544]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.62, -1.894] < (-1.894, -0.175] < (-0.175, 1.544] < (1.544, 3.263]]
def get_stats(group):
    return {'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}
grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats).unstack()
                        min       max  count      mean
data1                                                 
(-3.62, -1.894]   -1.687871  1.887558   30.0 -0.210904
(-1.894, -0.175]  -2.898182  3.015219  412.0  0.043604
(-0.175, 1.544]   -2.829533  2.779859  495.0 -0.040127
(1.544, 3.263]    -1.966623  2.059859   63.0  0.051666
#Passing labels=False yields just the integer bucket indices (here with 10 equal-length bins) rather than interval labels
grouping = pd.cut(frame.data1,10,labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
            min       max  count      mean
data1                                     
0     -1.351679  0.479551    3.0 -0.162551
1     -1.396584  1.887558   11.0  0.134897
2     -2.122874  1.734941   47.0 -0.178326
3     -2.020175  3.015219  136.0 -0.004133
4     -2.898182  2.468129  245.0  0.079938
5     -2.758906  2.779859  253.0 -0.035156
6     -2.829533  2.439201  187.0 -0.045802
7     -2.250796  2.128096   86.0 -0.033331
8     -1.605125  2.059859   29.0  0.083334
9     -0.029929  0.692138    3.0  0.433789
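The examples above use cut (equal-length bins); a minimal sketch of the qcut variant mentioned at the start of this section, which forms equal-size buckets from sample quantiles (deciles is a hypothetical name):

deciles = pd.qcut(frame.data1,10,labels=False)   # 10 buckets of (roughly) equal size
frame.data2.groupby(deciles).apply(get_stats).unstack()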

10.3.3 Example: Filling missing values with group-specific values

  • fillna is the right tool to use here
s = pd.Series(np.random.randn(6))
s
0   -1.825708
1   -0.406503
2    0.319659
3    0.077313
4   -0.304391
5    0.369055
dtype: float64
s[::2] = np.nan
s
0         NaN
1   -0.406503
2         NaN
3    0.077313
4         NaN
5    0.369055
dtype: float64
s.fillna(s.mean())
0    0.013288
1   -0.406503
2    0.013288
3    0.077313
4    0.013288
5    0.369055
dtype: float64
#Suppose you need the fill value to vary by group.
#One way to do this is to group the data and use apply with a function that calls fillna on each data chunk.
states = ['ohio','new york','vermont','florida','oregon','nevada','california','idaho']
states
['ohio',
 'new york',
 'vermont',
 'florida',
 'oregon',
 'nevada',
 'california',
 'idaho']
group_key = ['east']*4+['west']*4
group_key
['east', 'east', 'east', 'east', 'west', 'west', 'west', 'west']
data = pd.Series(np.random.randn(8),index=states)
data
ohio         -1.196387
new york      1.116637
vermont      -2.096959
florida      -0.482581
oregon       -0.426209
nevada        0.494309
california   -1.444239
idaho         2.388043
dtype: float64
#Set some values in the data to be missing
data[['vermont','nevada','idaho']] = np.nan
data
ohio         -1.196387
new york      1.116637
vermont            NaN
florida      -0.482581
oregon       -0.426209
nevada             NaN
california   -1.444239
idaho              NaN
dtype: float64
data.groupby(group_key).mean()
east   -0.187444
west   -0.935224
dtype: float64
#Fill the NA values using the group means
fill_mean = lambda g :g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
ohio         -1.196387
new york      1.116637
vermont      -0.187444
florida      -0.482581
oregon       -0.426209
nevada       -0.935224
california   -1.444239
idaho        -0.935224
dtype: float64
fill_values = {'east':0.5,'west':-1}
fill_func = lambda g :g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
ohio         -1.196387
new york      1.116637
vermont       0.500000
florida      -0.482581
oregon       -0.426209
nevada       -1.000000
california   -1.444239
idaho        -1.000000
dtype: float64
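A minimal sketch of an alternative that avoids apply: groupby's transform broadcasts each group's mean back onto the original index, so the result can feed fillna directly:

data.fillna(data.groupby(group_key).transform('mean'))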

10.3.4 Example: Random sampling and permutation

#Hearts, Spades, Clubs, Diamonds
suits = ['H','S','C','D']
card_val = (list(range(1,11))+[10]*3)*4
card_val[-5:]
[9, 10, 10, 10, 10]
base_names = ['A']+list(range(2,11))+['J','K','Q']
base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num)+suit for num in base_names)
    
deck = pd.Series(card_val,index=cards)
deck.head()
AH    1
2H    2
3H    3
4H    4
5H    5
dtype: int64
deck[:13]
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64
def draw(deck,n=5):
    return deck.sample(n)   # the original notes had deck.sample(5) here, ignoring n,
                            # which is why the outputs below show five cards per suit
draw(deck)
9S    9
8S    8
3S    3
2D    2
9H    9
dtype: int64
#Suppose you wanted two random cards from each suit
get_suit = lambda card:card[-1]
deck.groupby(get_suit).apply(draw,n=2)
C  JC     10
   3C      3
   9C      9
   QC     10
   8C      8
D  6D      6
   QD     10
   7D      7
   9D      9
   4D      4
H  4H      4
   6H      6
   2H      2
   10H    10
   9H      9
S  10S    10
   QS     10
   8S      8
   KS     10
   5S      5
dtype: int64
deck.groupby(get_suit,group_keys=False).apply(draw,n=2)
QC    10
AC     1
JC    10
KC    10
4C     4
JD    10
AD     1
6D     6
5D     5
7D     7
KH    10
2H     2
6H     6
5H     5
8H     8
2S     2
5S     5
QS    10
4S     4
JS    10
dtype: int64
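In newer pandas (1.1+), GroupBy objects have their own sample method, so the same draw can be written without apply. A minimal sketch, assuming such a version:

deck.groupby(get_suit).sample(n=2)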

10.3.5 Example: Group weighted average and correlation

df = pd.DataFrame({'category':['a','a','a','a','b','b','b','b'],
                  'data':np.random.randn(8),
                  'weights':np.random.rand(8)})
df
  category      data   weights
0        a -0.538213  0.890005
1        a  1.036831  0.119786
2        a -1.706140  0.086685
3        a -1.687288  0.418367
4        b  0.574443  0.372295
5        b -1.791063  0.994798
6        b  1.419661  0.715527
7        b  0.674622  0.338912
#The group weighted average by category is then
grouped = df.groupby('category')
get_wavg = lambda g :np.average(g['data'],weights=g['weights'])
grouped.apply(get_wavg)
category
a   -0.797849
b   -0.133568
dtype: float64
close_px = pd.read_csv(r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\examples\stock_px_2.csv',
                      parse_dates=True,
                      index_col=0)
close_px.head()
            AAPL   MSFT    XOM     SPX
2003-01-02  7.40  21.11  29.22  909.03
2003-01-03  7.45  21.14  29.24  908.59
2003-01-06  7.45  21.52  29.96  929.01
2003-01-07  7.43  21.93  28.95  922.93
2003-01-08  7.28  21.31  28.83  909.93
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px[-4:]
              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58
spx_corr = lambda x :x.corrwith(x['SPX'])
rets = close_px.pct_change().dropna()
get_year = lambda x :x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)
          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0
by_year.apply(lambda g:g['AAPL'].corr(g['MSFT']))
2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

10.3.6 Example: Group-wise linear regression

import statsmodels.api as sm
def regress(data,yvar,xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y,X).fit()
    return result.params
by_year.apply(regress,'AAPL',['SPX'])
           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514
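An equivalent sketch that uses statsmodels' own helper for the intercept instead of assigning a column by hand (it also sidesteps the SettingWithCopy warning that X['intercept'] = 1. can trigger); regress2 is a hypothetical name:

def regress2(data,yvar,xvars):
    X = sm.add_constant(data[xvars])   # adds a 'const' column for the intercept
    return sm.OLS(data[yvar],X).fit().params
by_year.apply(regress2,'AAPL',['SPX'])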

10.4 Pivot Tables and Cross-Tabulation

  • Pivot tables in pandas are made possible through the groupby facility described in this chapter, combined with reshape operations utilizing hierarchical indexing.
  • DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function.
  • In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.
  • pivot_table options (partial):
Option  Description
values  Column name(s) to aggregate; by default, aggregates all numeric columns
index   Column names or other group keys to group on the rows of the resulting pivot table
#Suppose you wanted to compute a table of group means (the default pivot_table aggregation type) arranged by day and smoker on the rows
tips.pivot_table(index=['day','smoker'])
                 size       tip   tip_pct  total_bill
day  smoker                                          
Fri  No      2.250000  2.812500  0.179740   18.420000
     Yes     2.066667  2.714000  0.216293   16.813333
Sat  No      2.555556  3.102889  0.190412   19.661778
     Yes     2.476190  2.875476  0.179833   21.276667
Sun  No      2.929825  3.167895  0.193617   20.506667
     Yes     2.578947  3.516842  0.322021   24.120000
Thur No      2.488889  2.673778  0.193424   17.113111
     Yes     2.352941  3.030000  0.198508   19.190588
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker')
                 size             tip_pct          
smoker             No       Yes        No       Yes
time   day                                         
Dinner Fri   2.000000  2.222222  0.162612  0.202545
       Sat   2.555556  2.476190  0.190412  0.179833
       Sun   2.929825  2.578947  0.193617  0.322021
       Thur  2.000000       NaN  0.190114       NaN
Lunch  Fri   3.000000  1.833333  0.231125  0.236915
       Thur  2.500000  2.352941  0.193499  0.198508
#This table can be augmented to include partial totals by passing margins=True.
#The All values are means without taking into account smoker versus non-smoker (the All columns) or any of the two levels of grouping on the rows (the All row).
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker',margins=True)
                 size                       tip_pct                    
smoker             No       Yes       All        No       Yes       All
time   day                                                             
Dinner Fri   2.000000  2.222222  2.166667  0.162612  0.202545  0.192562
       Sat   2.555556  2.476190  2.517241  0.190412  0.179833  0.185305
       Sun   2.929825  2.578947  2.842105  0.193617  0.322021  0.225718
       Thur  2.000000       NaN  2.000000  0.190114       NaN  0.190114
Lunch  Fri   3.000000  1.833333  2.000000  0.231125  0.236915  0.236088
       Thur  2.500000  2.352941  2.459016  0.193499  0.198508  0.194895
All          2.668874  2.408602  2.569672  0.192237  0.218176  0.202123
tips.pivot_table('tip_pct',index=['time','smoker'],aggfunc=len,margins=True)
               tip_pct
time   smoker         
Dinner No        106.0
       Yes        70.0
Lunch  No         45.0
       Yes        23.0
All              244.0
#If some combinations are empty (or otherwise NA), you may wish to pass a fill_value
tips.pivot_table('tip_pct',index=['time','size','smoker'],aggfunc='mean',fill_value=0)
                      tip_pct
time   size smoker           
Dinner 1    No       0.160000
            Yes      0.483092
       2    No       0.200221
            Yes      0.239672
       3    No       0.183436
            Yes      0.179271
       4    No       0.176033
            Yes      0.169406
       5    No       0.263344
            Yes      0.094779
       6    No       0.115821
Lunch  1    No       0.222087
            Yes      0.288288
       2    No       0.201503
            Yes      0.201043
       3    No       0.138483
            Yes      0.257941
       4    No       0.161573
            Yes      0.186592
       5    No       0.138160
       6    No       0.211191

10.4.1 Cross-tabulations: crosstab

  • A cross-tabulation (crosstab for short) is a special case of a pivot table that computes group frequencies

10.5 Chapter Summary

data = pd.DataFrame({'Sample':range(1,11),
                    'nationality':['USA','Japan','USA','Japan','Japan','Japan','USA','USA','Japan','USA'],
                    'handedness':['right','left','right','right','left','right','right','left','right','right']})
data
   Sample nationality handedness
0       1         USA      right
1       2       Japan       left
2       3         USA      right
3       4       Japan      right
4       5       Japan       left
5       6       Japan      right
6       7         USA      right
7       8         USA       left
8       9       Japan      right
9      10         USA      right
data.pivot_table(index = 'nationality',columns='handedness',margins=True,aggfunc='count')
            Sample           
handedness    left right  All
nationality                  
Japan            2     3    5
USA              1     4    5
All              3     7   10
pd.crosstab(data.nationality,data.handedness,margins=True)
handedness   left  right  All
nationality                  
Japan           2      3    5
USA             1      4    5
All             3      7   10
#The first two arguments to crosstab can each be an array, Series, or list of arrays
pd.crosstab([tips.time,tips.day],tips.smoker,margins=True)
smoker        No  Yes  All
time   day                
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244

Chapter 11: Time Series

11.1 Date and Time Data Types and Tools

  • The Python standard library includes data types for date and time data, as well as calendar-related functionality.
  • The datetime, time, and calendar modules are the main places to start when handling time data.
  • The datetime.datetime type, or simply datetime, is widely used.
  • Types in the datetime module:
Type       Description
date       Stores a calendar date (year, month, day) using the Gregorian calendar
time       Stores the time of day as hours, minutes, seconds, and microseconds
datetime   Stores both date and time
timedelta  Represents the difference between two datetime values (as days, seconds, and microseconds)
tzinfo     Base type for storing time zone information
from datetime import datetime
#datetime stores both the date and the time down to the microsecond
now = datetime.now()
now
datetime.datetime(2021, 12, 28, 14, 25, 24, 8294)
now.year,now.month,now.day
(2021, 12, 28)
#timedelta represents the temporal difference between two datetime objects
delta = datetime(2011,1,7) - datetime(2008,6,24,8,15)
delta
datetime.timedelta(days=926, seconds=56700)
delta.days,delta.seconds
(926, 56700)
#You can add (or subtract) a timedelta or a multiple of one to a datetime object to produce a new shifted datetime object
from datetime import timedelta
start = datetime(2011,1,7)
start+timedelta(12)
datetime.datetime(2011, 1, 19, 0, 0)
start-2*timedelta(12)
datetime.datetime(2010, 12, 14, 0, 0)

11.1.1 Converting between string and datetime

  • You can format datetime objects and pandas Timestamp objects as strings using str or by passing a format specification to the strftime method
  • datetime format specification (ISO C89 compatible):
Type  Description
%Y    Four-digit year
%y    Two-digit year
%m    Two-digit month [01, 12]
%d    Two-digit day [01, 31]
%H    Hour (24-hour clock) [00, 23]
%I    Hour (12-hour clock) [01, 12]
%M    Two-digit minute [00, 59]
%S    Second [00, 61] (seconds 60 and 61 account for leap seconds)
%w    Weekday as an integer [0 (Sunday), 6]
%U    Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are "week 0"
%W    Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are "week 0"
%z    UTC time zone offset as +HHMM or -HHMM; empty if the time is time zone naive
%F    Shortcut for %Y-%m-%d (e.g., 2012-04-18)
%D    Shortcut for %m/%d/%y (e.g., 04/18/12)
stamp = datetime(2011,1,3)
stamp
datetime.datetime(2011, 1, 3, 0, 0)
str(stamp)
'2011-01-03 00:00:00'
stamp.strftime('%Y-%m-%d')
'2011-01-03'
#You can use datetime.strptime with these format codes to convert strings to dates
value = '2011-01-03'
datetime.strptime(value,'%Y-%m-%d')
datetime.datetime(2011, 1, 3, 0, 0)
datestrs = ['7/6/2011','8/6/2011']
[datetime.strptime(x,'%m/%d/%Y') for x in datestrs]
[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
from dateutil.parser import parse
parse('2011-01-03')
datetime.datetime(2011, 1, 3, 0, 0)
#dateutil is capable of parsing most human-intelligible date representations
parse('jan 31,1997 10:45 am')
datetime.datetime(1997, 1, 31, 10, 45)
#In international locales, day appearing before month is very common, so you can pass dayfirst=True to indicate this
parse('6/12/2011',dayfirst=True)
datetime.datetime(2011, 12, 6, 0, 0)
#The to_datetime method can convert many different date representations
datestrs = ['2011-07-06 12:00:00','2011-08-06 00:00:00']
pd.to_datetime(datestrs)
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
#to_datetime also handles values that should be considered missing (None, empty string, etc.)
idx = pd.to_datetime(datestrs+[None])
idx
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
#NaT (Not a Time) is pandas's null value for timestamp data
idx[2]
NaT
pd.isnull(idx)
array([False, False,  True])
  • Locale-specific date formatting options:
Type  Description
%a    Abbreviated weekday name
%A    Full weekday name
%b    Abbreviated month name
%B    Full month name
%c    Full date and time (e.g., 'Tue 01 May 2012 04:20:57 PM')
%p    Locale equivalent of AM or PM
%x    Locale-appropriate formatted date (e.g., in the US, May 1, 2012 yields '05/01/2012')
%X    Locale-appropriate time (e.g., '04:24:12 PM')

11.2 Time Series Basics

  • The most basic kind of time series in pandas is a Series indexed by timestamps, which are often represented external to pandas as Python strings or datetime objects
from datetime import datetime
dates = [datetime(2011,1,2),datetime(2011,1,5),datetime(2011,1,7),
        datetime(2011,1,8),datetime(2011,1,10),datetime(2011,1,12)]
dates
[datetime.datetime(2011, 1, 2, 0, 0),
 datetime.datetime(2011, 1, 5, 0, 0),
 datetime.datetime(2011, 1, 7, 0, 0),
 datetime.datetime(2011, 1, 8, 0, 0),
 datetime.datetime(2011, 1, 10, 0, 0),
 datetime.datetime(2011, 1, 12, 0, 0)]
ts = pd.Series(np.random.randn(6),index=dates)
ts
2011-01-02   -0.388904
2011-01-05    1.029932
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
2011-01-12   -0.053242
dtype: float64
#These datetime objects have been put in a DatetimeIndex
ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
ts[::2]+ts
2011-01-02   -0.777807
2011-01-05         NaN
2011-01-07   -0.373041
2011-01-08         NaN
2011-01-10    1.420241
2011-01-12         NaN
dtype: float64
#pandas stores timestamps using NumPy's datetime64 dtype at nanosecond resolution
ts.index.dtype
dtype('<M8[ns]')
#Scalar values from a DatetimeIndex are pandas Timestamp objects
stamp = ts.index[0]
stamp
Timestamp('2011-01-02 00:00:00')
#A Timestamp can be substituted anywhere you would use a datetime object.
#Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations

11.2.1 Indexing, selection, subsetting

  • Time series behave like any other pandas.Series when you are indexing and selecting data based on the label
stamp = ts.index[2]
ts[stamp]
-0.18652074734702304
#As a convenience, you can also pass a string that is interpretable as a date
ts['1/10/2011']
0.7101205694131453
ts['20110110']
0.7101205694131453
#For longer time series, a year or only a year and month can be passed to easily select slices of data
longer_ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
longer_ts
2000-01-01    1.399510
2000-01-02    1.167288
2000-01-03   -1.557553
2000-01-04    0.905812
2000-01-05   -1.204531
                ...   
2002-09-22   -0.055722
2002-09-23    0.913953
2002-09-24    3.299072
2002-09-25   -1.384872
2002-09-26    0.034663
Freq: D, Length: 1000, dtype: float64
longer_ts['2001']
2001-01-01   -0.098773
2001-01-02    0.299508
2001-01-03   -0.111169
2001-01-04    0.545366
2001-01-05    0.559263
                ...   
2001-12-27    0.383230
2001-12-28    0.274157
2001-12-29   -1.571527
2001-12-30   -0.066029
2001-12-31    0.860201
Freq: D, Length: 365, dtype: float64
longer_ts['2001-05']
2001-05-01    0.952182
2001-05-02   -1.637248
2001-05-03   -1.673597
2001-05-04   -1.257016
2001-05-05   -0.992064
2001-05-06    0.874060
2001-05-07   -0.491860
2001-05-08    0.479921
2001-05-09    0.286766
2001-05-10    2.026310
2001-05-11   -0.547724
2001-05-12    1.313638
2001-05-13   -0.624659
2001-05-14    0.003374
2001-05-15   -0.097401
2001-05-16   -0.733808
2001-05-17   -0.557165
2001-05-18    1.592470
2001-05-19   -1.030133
2001-05-20    1.789641
2001-05-21    0.821066
2001-05-22    0.954955
2001-05-23    0.537921
2001-05-24   -0.016609
2001-05-25    1.806582
2001-05-26    0.323395
2001-05-27   -0.144737
2001-05-28    0.783111
2001-05-29    0.673603
2001-05-30   -1.775568
2001-05-31   -0.553773
Freq: D, dtype: float64
ts[datetime(2011,1,7):]
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
2011-01-12   -0.053242
dtype: float64
ts
2011-01-02   -0.388904
2011-01-05    1.029932
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
2011-01-12   -0.053242
dtype: float64
ts['1/6/2011':'1/11/2011']
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
dtype: float64
#Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays.
#This means that no data is copied, and modifications on the slice will be reflected in the original data.
#There is an equivalent instance method, truncate, that slices a Series between two dates
ts.truncate(after='1/9/2011')
2011-01-02   -0.388904
2011-01-05    1.029932
2011-01-07   -0.186521
2011-01-08   -0.310343
dtype: float64
dates = pd.date_range('1/1/2000',periods=100,freq='W-WED')
dates[:5]
DatetimeIndex(['2000-01-05', '2000-01-12', '2000-01-19', '2000-01-26',
               '2000-02-02'],
              dtype='datetime64[ns]', freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100,4),index=dates,columns = ['colorado','texas','new york','ohio'])
long_df
            colorado     texas  new york      ohio
2000-01-05 -0.401749 -0.717368 -1.281874  1.274557
2000-01-12  1.038078  0.915171 -0.080030 -0.280146
2000-01-19 -1.080790  0.255669 -0.987855 -0.662729
2000-01-26 -0.360513  1.133701  0.678780  0.421880
2000-02-02 -0.636087  1.338383 -0.247395  1.089950
...              ...       ...       ...       ...
2001-10-31  1.464741 -0.390381  2.867647 -0.299401
2001-11-07 -0.341461 -0.358244 -0.766587  0.728394
2001-11-14 -1.010205 -0.055786 -0.574341 -0.652109
2001-11-21 -0.130517  0.929190  0.102579 -0.677333
2001-11-28 -0.349123  1.238194 -0.058024 -3.170347

100 rows × 4 columns

long_df.loc['5-2001']
            colorado     texas  new york      ohio
2001-05-02  0.054055 -0.491146  0.241249  0.310428
2001-05-09 -1.556219 -0.716196  0.514238  1.694993
2001-05-16 -1.382658  0.459362  0.732855  0.644502
2001-05-23 -0.342885  0.544961 -1.133561  1.351170
2001-05-30 -2.051397  0.595807  0.536894  0.584369

11.2.2 Time series with duplicate indices

dates = pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000','1/2/2000','1/3/2000'])
dates
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
               '2000-01-03'],
              dtype='datetime64[ns]', freq=None)
dup_ts = pd.Series(np.arange(5),index=dates)
dup_ts
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32
#We can tell that the index is not unique by checking its is_unique property
dup_ts.index.is_unique
False
#Indexing into this Series now either produces scalar values or slices, depending on whether a timestamp is duplicated
dup_ts['1/3/2000']
4
dup_ts['1/2/2000']
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32
#Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0
grouped = dup_ts.groupby(level=0)
grouped.mean()
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
dup_ts.groupby(level=-1).mean()
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
grouped.count()
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64

11.3 Date Ranges, Frequencies, and Shifting

  • pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.
ts
2011-01-02   -0.388904
2011-01-05    1.029932
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
2011-01-12   -0.053242
dtype: float64
#The string 'D' is interpreted as daily frequency
resampler = ts.resample('D')

11.3.1 Generating date ranges

  • pandas.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency
#By default, date_range generates daily timestamps.
index = pd.date_range('2012-04-01','2012-06-01')
index
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
#If you pass only a start or end date, you must pass a number of periods to generate
pd.date_range(start='2012-04-01',periods=200)
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10',
               ...
               '2012-10-08', '2012-10-09', '2012-10-10', '2012-10-11',
               '2012-10-12', '2012-10-13', '2012-10-14', '2012-10-15',
               '2012-10-16', '2012-10-17'],
              dtype='datetime64[ns]', length=200, freq='D')
#You can pass the 'BM' frequency (business end of month; see the frequency table below)
pd.date_range('2000-01-01','2000-12-01',freq='BM')
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')
  • Base time series frequencies (not comprehensive):
Alias                    Offset type           Description
D                        Day                   Calendar daily
B                        BusinessDay           Business daily
H                        Hour                  Hourly
T or min                 Minute                Once a minute
S                        Second                Once a second
L or ms                  Milli                 Millisecond (1/1,000 of a second)
U                        Micro                 Microsecond (1/1,000,000 of a second)
M                        MonthEnd              Last calendar day of the month
BM                       BusinessMonthEnd      Last business day of the month
MS                       MonthBegin            First calendar day of the month
BMS                      BusinessMonthBegin    First business day of the month
W-MON, W-TUE, ...        Week                  Weekly on the given day of the week (MON, TUE, WED, THU, FRI, SAT, or SUN)
WOM-1MON, WOM-2MON, ...  WeekOfMonth           Generates weekly dates in the first, second, third, or fourth week of the month (e.g., WOM-3FRI for the third Friday of each month)
Q-JAN, Q-FEB, ...        QuarterEnd            Quarterly dates anchored on the last calendar day of each month, for a year ending in the indicated month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, or DEC)
BQ-JAN, BQ-FEB, ...      BusinessQuarterEnd    Quarterly dates anchored on the last business day of each month, for a year ending in the indicated month
QS-JAN, QS-FEB, ...      QuarterBegin          Quarterly dates anchored on the first calendar day of each month, for a year ending in the indicated month
BQS-JAN, BQS-FEB, ...    BusinessQuarterBegin  Quarterly dates anchored on the first business day of each month, for a year ending in the indicated month
A-JAN, A-FEB, ...        YearEnd               Annual dates anchored on the last calendar day of the given month
BA-JAN, BA-FEB, ...      BusinessYearEnd       Annual dates anchored on the last business day of the given month
AS-JAN, AS-FEB, ...      YearBegin             Annual dates anchored on the first calendar day of the given month
BAS-JAN, BAS-FEB, ...    BusinessYearBegin     Annual dates anchored on the first business day of the given month
#By default, date_range preserves the time (if any) of the start or end timestamp
pd.date_range('2012-05-02 12:56:31',periods=5)
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')
pd.date_range('2012-05-02 12:56:31',periods=5,normalize=True)
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

11.3.2 Frequencies and date offsets

  • Frequencies in pandas are composed of a base frequency and a multiplier.
  • Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly. For each base frequency, there is an object that can be used to define a date offset.
from pandas.tseries.offsets import Hour,Minute
hour = Hour()
hour
<Hour>
#You can define a multiple of an offset by passing an integer
four_hours = Hour(4)
four_hours
<4 * Hours>
pd.date_range('2000-01-01','2000-01-03 23:59',freq='4h')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')
#Multiple offsets can be combined by addition
Hour(2)+Minute(30)
<150 * Minutes>
pd.date_range('2000-01-01','2000-01-03 23:59',freq='1h30min')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00',
               '2000-01-01 15:00:00', '2000-01-01 16:30:00',
               '2000-01-01 18:00:00', '2000-01-01 19:30:00',
               '2000-01-01 21:00:00', '2000-01-01 22:30:00',
               '2000-01-02 00:00:00', '2000-01-02 01:30:00',
               '2000-01-02 03:00:00', '2000-01-02 04:30:00',
               '2000-01-02 06:00:00', '2000-01-02 07:30:00',
               '2000-01-02 09:00:00', '2000-01-02 10:30:00',
               '2000-01-02 12:00:00', '2000-01-02 13:30:00',
               '2000-01-02 15:00:00', '2000-01-02 16:30:00',
               '2000-01-02 18:00:00', '2000-01-02 19:30:00',
               '2000-01-02 21:00:00', '2000-01-02 22:30:00',
               '2000-01-03 00:00:00', '2000-01-03 01:30:00',
               '2000-01-03 03:00:00', '2000-01-03 04:30:00',
               '2000-01-03 06:00:00', '2000-01-03 07:30:00',
               '2000-01-03 09:00:00', '2000-01-03 10:30:00',
               '2000-01-03 12:00:00', '2000-01-03 13:30:00',
               '2000-01-03 15:00:00', '2000-01-03 16:30:00',
               '2000-01-03 18:00:00', '2000-01-03 19:30:00',
               '2000-01-03 21:00:00', '2000-01-03 22:30:00'],
              dtype='datetime64[ns]', freq='90T')
#Some frequencies describe points in time that are not evenly spaced.
#For example, 'M' (calendar month end) and 'BM' (last business day of month) depend on
#the number of days in the month and, in the latter case, whether the month ends on a weekend.
#We refer to these as anchored offsets.
11.3.2.1 Week of month dates
#These enable you to get dates like the third Friday of each month
rng = pd.date_range('2012-01-01','2012-09-01',freq='WOM-3FRI')
rng
DatetimeIndex(['2012-01-20', '2012-02-17', '2012-03-16', '2012-04-20',
               '2012-05-18', '2012-06-15', '2012-07-20', '2012-08-17'],
              dtype='datetime64[ns]', freq='WOM-3FRI')

11.3.3 Shifting (leading and lagging) dates

  • "移位"是指将日期按时间向前移动或向后移动。
  • Series和DataFrame都有一个shift方法用于进行简单的前向或后向移位,而不改变索引
ts = pd.Series(np.random.randn(4),index=pd.date_range('1/1/2000',freq='M',periods=4))
ts
2000-01-31   -0.209127
2000-02-29    0.998652
2000-03-31    0.006782
2000-04-30    0.558995
Freq: M, dtype: float64
ts.shift(2)
2000-01-31         NaN
2000-02-29         NaN
2000-03-31   -0.209127
2000-04-30    0.998652
Freq: M, dtype: float64
ts.shift(-2)
2000-01-31    0.006782
2000-02-29    0.558995
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64
#A common use of shift is computing percent changes in a time series or in multiple time series as DataFrame columns
ts/ts.shift(1)-1
2000-01-31          NaN
2000-02-29    -5.775325
2000-03-31    -0.993209
2000-04-30    81.428620
Freq: M, dtype: float64
#If the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data
ts.shift(2,freq='M')
2000-03-31   -0.209127
2000-04-30    0.998652
2000-05-31    0.006782
2000-06-30    0.558995
Freq: M, dtype: float64
#Other frequencies can be passed too, giving you flexibility in how to lead and lag the data
ts.shift(3,freq='D')
2000-02-03   -0.209127
2000-03-03    0.998652
2000-04-03    0.006782
2000-05-03    0.558995
dtype: float64
#The T here stands for minutes
ts.shift(1,freq='90T')
2000-01-31 01:30:00   -0.209127
2000-02-29 01:30:00    0.998652
2000-03-31 01:30:00    0.006782
2000-04-30 01:30:00    0.558995
dtype: float64
11.3.3.1 Shifting dates with offsets
  • The pandas date offsets can also be used with datetime or Timestamp objects
from pandas.tseries.offsets import Day,MonthEnd
now = datetime(2011,11,17)
now+3*Day()
Timestamp('2011-11-20 00:00:00')
now+MonthEnd()
Timestamp('2011-11-30 00:00:00')
now+MonthEnd(2)
Timestamp('2011-12-31 00:00:00')
#Anchored offsets can explicitly "roll" dates forward or backward using their rollforward and rollback methods, respectively
offset = MonthEnd()
offset.rollforward(now)
Timestamp('2011-11-30 00:00:00')
offset.rollback(now)
Timestamp('2011-10-31 00:00:00')
ts = pd.Series(np.random.randn(20),index=pd.date_range('1/15/2000',periods=20,freq='4d'))
ts
2000-01-15    0.165827
2000-01-19   -0.564001
2000-01-23   -1.297098
2000-01-27    0.461758
2000-01-31    0.104934
2000-02-04   -0.197272
2000-02-08   -0.238585
2000-02-12    1.822792
2000-02-16    1.053136
2000-02-20    0.114542
2000-02-24   -0.719873
2000-02-28    0.938727
2000-03-03    0.250604
2000-03-07    1.692985
2000-03-11    0.249253
2000-03-15   -2.015027
2000-03-19    1.274159
2000-03-23    0.593201
2000-03-27   -0.340764
2000-03-31   -0.806581
Freq: 4D, dtype: float64
ts.resample('M').mean()
2000-01-31   -0.225716
2000-02-29    0.396209
2000-03-31    0.112229
Freq: M, dtype: float64
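Because rollforward maps every date to its month end, the same monthly means can be computed by grouping on the anchored offset directly; a minimal sketch:

ts.groupby(MonthEnd().rollforward).mean()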

11.4 Time Zone Handling

  • In Python, time zone information comes from the third-party pytz library (installable with pip or conda), which exposes the Olson database, a compilation of world time zone information.
import pytz
pytz.common_timezones[-5:]
['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']
#To get a time zone object from pytz, use pytz.timezone
tz = pytz.timezone('America/New_York')
tz
<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>

11.4.1 Time zone localization and conversion

rng = pd.date_range('3/9/2012 9:30',periods=6,freq='D')
rng
DatetimeIndex(['2012-03-09 09:30:00', '2012-03-10 09:30:00',
               '2012-03-11 09:30:00', '2012-03-12 09:30:00',
               '2012-03-13 09:30:00', '2012-03-14 09:30:00'],
              dtype='datetime64[ns]', freq='D')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts 
2012-03-09 09:30:00   -0.804325
2012-03-10 09:30:00    0.060626
2012-03-11 09:30:00   -1.531345
2012-03-12 09:30:00    0.133690
2012-03-13 09:30:00    0.173941
2012-03-14 09:30:00   -1.416870
Freq: D, dtype: float64
#The index's tz attribute is None
print(ts.index.tz)
None
pd.date_range('3/9/2012 9:30',periods=10,freq='D',tz='UTC')
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
               '2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
#Conversion from naive to localized is handled by the tz_localize method
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-09 09:30:00+00:00   -0.804325
2012-03-10 09:30:00+00:00    0.060626
2012-03-11 09:30:00+00:00   -1.531345
2012-03-12 09:30:00+00:00    0.133690
2012-03-13 09:30:00+00:00    0.173941
2012-03-14 09:30:00+00:00   -1.416870
Freq: D, dtype: float64
ts_utc.index
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
#Once a time series has been localized to a particular time zone, it can be converted to another time zone with tz_convert
ts_utc.tz_convert('America/New_York')
2012-03-09 04:30:00-05:00   -0.804325
2012-03-10 04:30:00-05:00    0.060626
2012-03-11 05:30:00-04:00   -1.531345
2012-03-12 05:30:00-04:00    0.133690
2012-03-13 05:30:00-04:00    0.173941
2012-03-14 05:30:00-04:00   -1.416870
Freq: D, dtype: float64
ts_eastern = ts.tz_localize('America/New_York')
ts_eastern.tz_convert('UTC')
2012-03-09 14:30:00+00:00   -0.804325
2012-03-10 14:30:00+00:00    0.060626
2012-03-11 13:30:00+00:00   -1.531345
2012-03-12 13:30:00+00:00    0.133690
2012-03-13 13:30:00+00:00    0.173941
2012-03-14 13:30:00+00:00   -1.416870
dtype: float64
ts_eastern.tz_convert('Europe/Berlin')
2012-03-09 15:30:00+01:00   -0.804325
2012-03-10 15:30:00+01:00    0.060626
2012-03-11 14:30:00+01:00   -1.531345
2012-03-12 14:30:00+01:00    0.133690
2012-03-13 14:30:00+01:00    0.173941
2012-03-14 14:30:00+01:00   -1.416870
dtype: float64
#tz_localize and tz_convert are also instance methods on DatetimeIndex
ts.index.tz_localize('Asia/Shanghai')
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
               '2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
               '2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
              dtype='datetime64[ns, Asia/Shanghai]', freq=None)

11.4.2 Operations with time zone-aware Timestamp objects

stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')
Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')
stamp_moscow = pd.Timestamp('2011-03-12 04:00',tz='Europe/Moscow')
stamp_moscow
Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')
stamp_utc.value
1299902400000000000
stamp_utc.tz_convert('America/New_York').value
1299902400000000000
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-12 01:30',tz='US/Eastern')
stamp
Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')
stamp+Hour()
Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')
stamp = pd.Timestamp('2012-11-04 00:30',tz='US/Eastern')
stamp
Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
stamp+2*Hour()
Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')

11.4.3 Operations between different time zones

rng = pd.date_range('3/7/2012 9:30',periods=10,freq='B')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2012-03-07 09:30:00    0.386445
2012-03-08 09:30:00   -0.300690
2012-03-09 09:30:00    0.759709
2012-03-12 09:30:00    1.257558
2012-03-13 09:30:00   -0.505203
2012-03-14 09:30:00   -0.002910
2012-03-15 09:30:00   -0.196304
2012-03-16 09:30:00    0.440398
2012-03-19 09:30:00    0.218168
2012-03-20 09:30:00    1.265953
Freq: B, dtype: float64
ts1 = ts[:7].tz_localize('Europe/London')
ts1
2012-03-07 09:30:00+00:00    0.386445
2012-03-08 09:30:00+00:00   -0.300690
2012-03-09 09:30:00+00:00    0.759709
2012-03-12 09:30:00+00:00    1.257558
2012-03-13 09:30:00+00:00   -0.505203
2012-03-14 09:30:00+00:00   -0.002910
2012-03-15 09:30:00+00:00   -0.196304
dtype: float64
ts2 = ts[:7].tz_localize('Europe/Moscow')
ts2
2012-03-07 09:30:00+04:00    0.386445
2012-03-08 09:30:00+04:00   -0.300690
2012-03-09 09:30:00+04:00    0.759709
2012-03-12 09:30:00+04:00    1.257558
2012-03-13 09:30:00+04:00   -0.505203
2012-03-14 09:30:00+04:00   -0.002910
2012-03-15 09:30:00+04:00   -0.196304
dtype: float64
#When two time series with different time zones are combined, the result is UTC
result = ts1 + ts2
result.index
DatetimeIndex(['2012-03-07 05:30:00+00:00', '2012-03-07 09:30:00+00:00',
               '2012-03-08 05:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 05:30:00+00:00', '2012-03-09 09:30:00+00:00',
               '2012-03-12 05:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 05:30:00+00:00', '2012-03-13 09:30:00+00:00',
               '2012-03-14 05:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 05:30:00+00:00', '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

11.5 Periods and Period Arithmetic

p = pd.Period(2007,freq='A-DEC')
p
Period('2007', 'A-DEC')
p+5
Period('2012', 'A-DEC')
p-2
Period('2005', 'A-DEC')
#If two periods have the same frequency, their difference is the number of units between them
pd.Period(2014,freq='A-DEC')-p
<7 * YearEnds: month=12>
#Regular ranges of periods can be constructed with the period_range function
rng = pd.period_range('2000-01-01','2000-06-30',freq='M')
rng
PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')
#The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure
pd.Series(np.random.randn(6),index=rng)
2000-01   -1.474403
2000-02    0.146328
2000-03    1.247877
2000-04    0.797235
2000-05   -1.583312
2000-06    1.525540
Freq: M, dtype: float64
#If you have an array of strings, you can also use the PeriodIndex class directly
values = ['2001Q3','2002Q2','2003Q1']
index = pd.PeriodIndex(values,freq='Q-DEC')
index
PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

11.5.1 Period frequency conversion

  • Periods and PeriodIndex objects can be converted to another frequency with their asfreq method.
p = pd.Period(2007,freq='A-DEC')
p
Period('2007', 'A-DEC')
p.asfreq('M',how='start')
Period('2007-01', 'M')
p.asfreq('M',how='end')
Period('2007-12', 'M')
p = pd.Period(2007,freq='A-JUN')
p
Period('2007', 'A-JUN')
p.asfreq('M',how='start')
Period('2006-07', 'M')
p.asfreq('M',how='end')
Period('2007-06', 'M')
p = pd.Period('Aug-2007','M')
p
Period('2007-08', 'M')
p.asfreq('A-JUN')
Period('2008', 'A-JUN')
rng = pd.period_range('2006','2009',freq='A-DEC')
rng
PeriodIndex(['2006', '2007', '2008', '2009'], dtype='period[A-DEC]', freq='A-DEC')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2006   -0.784284
2007   -1.058489
2008   -0.151931
2009    0.443975
Freq: A-DEC, dtype: float64
ts.asfreq('M',how='start')
2006-01   -0.784284
2007-01   -1.058489
2008-01   -0.151931
2009-01    0.443975
Freq: M, dtype: float64
ts.asfreq('B',how='end')
2006-12-29   -0.784284
2007-12-31   -1.058489
2008-12-31   -0.151931
2009-12-31    0.443975
Freq: B, dtype: float64

11.5.2 Quarterly period frequencies

p = pd.Period(2007,freq='Q-JAN')
p
Period('2007Q4', 'Q-JAN')
p.asfreq('D','start')
Period('2006-11-01', 'D')
p.asfreq('D','end')
Period('2007-01-31', 'D')
#Get the timestamp at 4 PM on the second-to-last business day of the quarter
p4pm = (p.asfreq('B','e')-1).asfreq('T','S')+16*60
p4pm
Period('2007-01-30 16:00', 'T')
p4pm.to_timestamp()
Timestamp('2007-01-30 16:00:00')
rng = pd.period_range('2011Q3','2012Q4',freq='Q-JAN')
ts = pd.Series(np.arange(len(rng)),index=rng)
ts
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32
new_rng = (rng.asfreq('B','e')-1).asfreq('T','S')+16*60
ts.index = new_rng.to_timestamp()
ts
2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32

11.5.3 Converting timestamps to periods (and back)

rng = pd.date_range('2000-01-01',periods=3,freq='M')
ts = pd.Series(np.random.randn(3),index=rng)
ts
2000-01-31   -0.622352
2000-02-29   -1.830166
2000-03-31   -0.601708
Freq: M, dtype: float64
#Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method
pts = ts.to_period()
pts
2000-01   -0.622352
2000-02   -1.830166
2000-03   -0.601708
Freq: M, dtype: float64
rng = pd.date_range('1/29/2000',periods=6,freq='D')
rng
DatetimeIndex(['2000-01-29', '2000-01-30', '2000-01-31', '2000-02-01',
               '2000-02-02', '2000-02-03'],
              dtype='datetime64[ns]', freq='D')
ts2 = pd.Series(np.random.randn(6),index=rng)
ts2
2000-01-29    0.056749
2000-01-30   -0.310291
2000-01-31    1.369577
2000-02-01   -0.361215
2000-02-02    0.426119
2000-02-03   -0.010565
Freq: D, dtype: float64
ts2.to_period('M')
2000-01    0.056749
2000-01   -0.310291
2000-01    1.369577
2000-02   -0.361215
2000-02    0.426119
2000-02   -0.010565
Freq: M, dtype: float64
#To convert back to timestamps, use to_timestamp
pts = ts2.to_period()
pts 
2000-01-29    0.056749
2000-01-30   -0.310291
2000-01-31    1.369577
2000-02-01   -0.361215
2000-02-02    0.426119
2000-02-03   -0.010565
Freq: D, dtype: float64
pts.to_timestamp(how='end')
2000-01-29 23:59:59.999999999    0.056749
2000-01-30 23:59:59.999999999   -0.310291
2000-01-31 23:59:59.999999999    1.369577
2000-02-01 23:59:59.999999999   -0.361215
2000-02-02 23:59:59.999999999    0.426119
2000-02-03 23:59:59.999999999   -0.010565
Freq: D, dtype: float64

11.5.4 Creating a PeriodIndex from arrays

data = pd.read_csv('examples/macrodata.csv')
data.head(5)
     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19
data.year
0      1959.0
1      1959.0
2      1959.0
3      1959.0
4      1960.0
        ...  
198    2008.0
199    2008.0
200    2009.0
201    2009.0
202    2009.0
Name: year, Length: 203, dtype: float64
data.quarter
0      1.0
1      2.0
2      3.0
3      4.0
4      1.0
      ... 
198    3.0
199    4.0
200    1.0
201    2.0
202    3.0
Name: quarter, Length: 203, dtype: float64
#By passing these arrays along with a frequency to PeriodIndex, you can combine them to form an index for the DataFrame
index = pd.PeriodIndex(year=data.year,quarter=data.quarter,freq='Q-DEC')
index
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')
data.index = index
data.head()
          year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
1959Q1  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1959Q2  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
1959Q3  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
1959Q4  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
1960Q1  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19
data.infl
1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
          ... 
2008Q3   -3.16
2008Q4   -8.79
2009Q1    0.94
2009Q2    3.37
2009Q3    3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64

11.6 Resampling and Frequency Conversion

  • Resampling refers to the process of converting a time series from one frequency to another.
  • Aggregating higher frequency data to a lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling.
  • Not all resampling falls into either of these categories.
  • All pandas objects are equipped with a resample method, which is the workhorse function for all frequency conversion.
  • resample has a similar API to groupby; you call resample to group the data, then call an aggregation function
rng = pd.date_range('2000-01-01',periods=100,freq='D')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2000-01-01    0.037518
2000-01-02   -0.067833
2000-01-03   -0.704365
2000-01-04   -0.368819
2000-01-05   -1.208229
                ...   
2000-04-05    1.085903
2000-04-06    0.746400
2000-04-07    0.517810
2000-04-08   -0.363853
2000-04-09    0.057838
Freq: D, Length: 100, dtype: float64
ts.resample('M').mean()
2000-01-31   -0.108144
2000-02-29   -0.055158
2000-03-31    0.358646
2000-04-30    0.773431
Freq: M, dtype: float64
ts.resample('M',kind='period').mean()
2000-01   -0.108144
2000-02   -0.055158
2000-03    0.358646
2000-04    0.773431
Freq: M, dtype: float64
  • resample method arguments:
Argument     Description
freq         String or DateOffset indicating the desired resampling frequency (e.g., 'M', '5min', or Second(1))
axis         Axis to resample on; default is axis=0
fill_method  How to interpolate when upsampling, as in 'ffill' or 'bfill'; by default does no interpolation
closed       In downsampling, which end of each interval is closed (inclusive), 'right' or 'left'
label        In downsampling, whether to label the aggregated result with the 'right' or 'left' bin edge (e.g., the 9:30-to-9:35 five-minute interval could be labeled 9:30 or 9:35)
loffset      Time adjustment to the bin labels, such as '-1s' / Second(-1) to shift the aggregate labels one second earlier
limit        When forward or backward filling, the maximum number of periods to fill
kind         Aggregate to periods ('period') or timestamps ('timestamp'); defaults to the type of index the time series has
convention   When resampling periods, the convention ('start' or 'end') for converting the low-frequency period to high frequency; defaults to 'end'

11.6.1 Downsampling

  • There are a couple of things to think about when using resample to downsample data:
    • Which side of each interval is closed (inclusive)
    • Whether to label each aggregated bin with the start of the interval or the end
rng = pd.date_range('2000-01-01',periods=12,freq='T')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2000-01-01 00:00:00    0.579555
2000-01-01 00:01:00    0.210504
2000-01-01 00:02:00    0.135227
2000-01-01 00:03:00   -0.125649
2000-01-01 00:04:00    1.511186
2000-01-01 00:05:00   -1.384490
2000-01-01 00:06:00   -0.938106
2000-01-01 00:07:00    2.057013
2000-01-01 00:08:00   -0.737751
2000-01-01 00:09:00    1.469319
2000-01-01 00:10:00   -0.257609
2000-01-01 00:11:00   -0.975761
Freq: T, dtype: float64
ts.resample('5min',closed='right',label='right').sum()
2000-01-01 00:00:00    0.579555
2000-01-01 00:05:00    0.346778
2000-01-01 00:10:00    1.592866
2000-01-01 00:15:00   -0.975761
Freq: 5T, dtype: float64
ts.resample('5min',closed='left').sum()
2000-01-01 00:00:00    2.310823
2000-01-01 00:05:00    0.465985
2000-01-01 00:10:00   -1.233370
Freq: 5T, dtype: float64
ts.resample('5min').sum()
2000-01-01 00:00:00    2.310823
2000-01-01 00:05:00    0.465985
2000-01-01 00:10:00   -1.233370
Freq: 5T, dtype: float64
#You may want to shift the result index by some amount, say subtracting one second from the right edge, to make it clearer which interval the timestamp refers to.
#To do this, pass a string or date offset to loffset
ts.resample('5min',closed='right',label='right',loffset='-1s').sum()
#You can also achieve the effect of loffset by calling the shift method on the result (see the sketch after the warning output below)
<ipython-input-407-0b956efa5584>:3: FutureWarning: 'loffset' in .resample() and in Grouper() is deprecated.

>>> df.resample(freq="3s", loffset="8H")

becomes:

>>> from pandas.tseries.frequencies import to_offset
>>> df = df.resample(freq="3s").mean()
>>> df.index = df.index.to_timestamp() + to_offset("8H")

  ts.resample('5min',closed='right',label='right',loffset='-1s').sum()

1999-12-31 23:59:59    0.579555
2000-01-01 00:04:59    0.346778
2000-01-01 00:09:59    1.592866
2000-01-01 00:14:59   -0.975761
Freq: 5T, dtype: float64
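Since loffset is deprecated, here is a minimal sketch of the shift-based alternative mentioned above, reusing the ts defined earlier; it reproduces the '-1s' label adjustment without the deprecated argument:
#Aggregate first, then move the bin labels back one second,
#mirroring what loffset='-1s' used to do
shifted = ts.resample('5min',closed='right',label='right').sum()
shifted.index = shifted.index + pd.Timedelta('-1s')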
11.6.1.1 Open-High-Low-Close (OHLC) resampling
  • In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values. Using the ohlc aggregate function gives you a DataFrame whose columns contain these four aggregates, which are computed efficiently in a single sweep of the data.
ts.resample('5min').ohlc()
                         open      high       low     close
2000-01-01 00:00:00  0.579555  1.511186 -0.125649  1.511186
2000-01-01 00:05:00 -1.384490  2.057013 -1.384490  1.469319
2000-01-01 00:10:00 -0.257609 -0.257609 -0.975761 -0.975761

11.6.2 Upsampling and Interpolation

  • When converting from a low frequency to a higher frequency, no aggregation is needed.
  • Since periods refer to time spans, upsampling and downsampling with periods are stricter:
    • In downsampling, the target frequency must be a subperiod of the source frequency.
    • In upsampling, the target frequency must be a superperiod of the source frequency.
frame = pd.DataFrame(np.random.randn(2,4),
                     index=pd.date_range('1/1/2000',periods=2,freq='W-WED'),
                     columns=['colorado','texas','new york','ohio'])
frame
            colorado     texas  new york      ohio
2000-01-05 -1.050656  0.063228 -1.179070  1.070880
2000-01-12  0.160683  1.629834 -0.340546 -0.767973
#Use the asfreq method to convert to the higher frequency without any aggregation
df_daily = frame.resample('D').asfreq()
df_daily
            colorado     texas  new york      ohio
2000-01-05 -1.050656  0.063228 -1.179070  1.070880
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  0.160683  1.629834 -0.340546 -0.767973
#The same filling or interpolation methods available in fillna and reindex can be used when resampling
frame.resample('D').ffill()
            colorado     texas  new york      ohio
2000-01-05 -1.050656  0.063228 -1.179070  1.070880
2000-01-06 -1.050656  0.063228 -1.179070  1.070880
2000-01-07 -1.050656  0.063228 -1.179070  1.070880
2000-01-08 -1.050656  0.063228 -1.179070  1.070880
2000-01-09 -1.050656  0.063228 -1.179070  1.070880
2000-01-10 -1.050656  0.063228 -1.179070  1.070880
2000-01-11 -1.050656  0.063228 -1.179070  1.070880
2000-01-12  0.160683  1.629834 -0.340546 -0.767973
#You can likewise choose to fill only a certain number of periods forward, to limit how far an observed value is carried
frame.resample('D').ffill(limit=2)
            colorado     texas  new york      ohio
2000-01-05 -1.050656  0.063228 -1.179070  1.070880
2000-01-06 -1.050656  0.063228 -1.179070  1.070880
2000-01-07 -1.050656  0.063228 -1.179070  1.070880
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  0.160683  1.629834 -0.340546 -0.767973
#Notice that the new date index need not overlap with the old one at all
frame.resample('W-THU').ffill()
            colorado     texas  new york      ohio
2000-01-06 -1.050656  0.063228 -1.179070  1.070880
2000-01-13  0.160683  1.629834 -0.340546 -0.767973

11.6.3 Resampling with Periods

frame = pd.DataFrame(np.random.randn(24,4),
                    index=pd.period_range('1-2000','12-2001',freq='M'),
                    columns=['colorado','texas','new york','ohio'])
frame.head()
         colorado     texas  new york      ohio
2000-01  0.827955  0.212008  0.136026 -1.278525
2000-02  0.745499  1.909823  0.819960 -0.569550
2000-03  0.893031  0.512417  1.035980 -0.923079
2000-04 -0.223343  0.536235 -1.872214 -0.210844
2000-05 -2.059027 -2.296580 -1.082294 -0.352297
annual_frame = frame.resample('A-DEC').mean()
annual_frame
      colorado     texas  new york      ohio
2000  0.062947  0.204868 -0.149046  0.072700
2001  0.433417  0.014037 -0.483594  0.242176
annual_frame.resample('Q-DEC').ffill()
        colorado     texas  new york      ohio
2000Q1  0.062947  0.204868 -0.149046  0.072700
2000Q2  0.062947  0.204868 -0.149046  0.072700
2000Q3  0.062947  0.204868 -0.149046  0.072700
2000Q4  0.062947  0.204868 -0.149046  0.072700
2001Q1  0.433417  0.014037 -0.483594  0.242176
2001Q2  0.433417  0.014037 -0.483594  0.242176
2001Q3  0.433417  0.014037 -0.483594  0.242176
2001Q4  0.433417  0.014037 -0.483594  0.242176
annual_frame.resample('Q-MAR').ffill()
        colorado     texas  new york      ohio
2000Q4  0.062947  0.204868 -0.149046  0.072700
2001Q1  0.062947  0.204868 -0.149046  0.072700
2001Q2  0.062947  0.204868 -0.149046  0.072700
2001Q3  0.062947  0.204868 -0.149046  0.072700
2001Q4  0.433417  0.014037 -0.483594  0.242176
2002Q1  0.433417  0.014037 -0.483594  0.242176
2002Q2  0.433417  0.014037 -0.483594  0.242176
2002Q3  0.433417  0.014037 -0.483594  0.242176

11.7 Moving Window Functions

close_px_all = pd.read_csv('examples/stock_px_2.csv',parse_dates=True,index_col=0)
close_px_all
              AAPL   MSFT    XOM      SPX
2003-01-02    7.40  21.11  29.22   909.03
2003-01-03    7.45  21.14  29.24   908.59
2003-01-06    7.45  21.52  29.96   929.01
2003-01-07    7.43  21.93  28.95   922.93
2003-01-08    7.28  21.31  28.83   909.93
...            ...    ...    ...      ...
2011-10-10  388.81  26.94  76.28  1194.89
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58

2214 rows × 4 columns

close_px = close_px_all[['AAPL','MSFT','XOM']]
close_px
              AAPL   MSFT    XOM
2003-01-02    7.40  21.11  29.22
2003-01-03    7.45  21.14  29.24
2003-01-06    7.45  21.52  29.96
2003-01-07    7.43  21.93  28.95
2003-01-08    7.28  21.31  28.83
...            ...    ...    ...
2011-10-10  388.81  26.94  76.28
2011-10-11  400.29  27.00  76.27
2011-10-12  402.19  26.96  77.16
2011-10-13  408.43  27.18  76.37
2011-10-14  422.00  27.27  78.11

2214 rows × 3 columns

close_px = close_px.resample('B').ffill()
close_px
              AAPL   MSFT    XOM
2003-01-02    7.40  21.11  29.22
2003-01-03    7.45  21.14  29.24
2003-01-06    7.45  21.52  29.96
2003-01-07    7.43  21.93  28.95
2003-01-08    7.28  21.31  28.83
...            ...    ...    ...
2011-10-10  388.81  26.94  76.28
2011-10-11  400.29  27.00  76.27
2011-10-12  402.19  26.96  77.16
2011-10-13  408.43  27.18  76.37
2011-10-14  422.00  27.27  78.11

2292 rows × 3 columns

close_px.AAPL.plot()
close_px.AAPL.rolling(250).mean().plot()
appl_std250 = close_px.AAPL.rolling(250,min_periods=10).std()
appl_std250[5:12]
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.074760
2003-01-17    0.112368
Freq: B, Name: AAPL, dtype: float64
appl_std250.plot()
expanding_mean = appl_std250.expanding().mean()
close_px.rolling(60).mean().plot(logy=True)
close_px.rolling('20D').mean()
                  AAPL       MSFT        XOM
2003-01-02    7.400000  21.110000  29.220000
2003-01-03    7.425000  21.125000  29.230000
2003-01-06    7.433333  21.256667  29.473333
2003-01-07    7.432500  21.425000  29.342500
2003-01-08    7.402000  21.402000  29.240000
...                ...        ...        ...
2011-10-10  389.351429  25.602143  72.527857
2011-10-11  388.505000  25.674286  72.835000
2011-10-12  388.531429  25.810000  73.400714
2011-10-13  388.826429  25.961429  73.905000
2011-10-14  391.038000  26.048667  74.185333

2292 rows × 3 columns

11.7.1 Exponentially Weighted Functions

  • An alternative to using a static window size with equally weighted observations is to specify a constant decay factor, which gives more weight to more recent observations.
appl_px = close_px.AAPL['2006':'2007']
ma60 = appl_px.rolling(30,min_periods=20).mean()
ewma60 = appl_px.ewm(span=30).mean()
import matplotlib.pyplot as plt
ma60.plot(style='k--',label='simple ma')
ewma60.plot(style='k-',label='ew ma')
plt.legend()
<matplotlib.legend.Legend at 0x259c45d36d0>


[Figure: simple moving average ('simple ma', dashed) vs. exponentially weighted moving average ('ew ma', solid) of the AAPL price]

11.7.2 Binary Moving Window Functions

spx_px = close_px_all['SPX']
spx_rets = spx_px.pct_change()
returns = close_px.pct_change()
corr = returns.AAPL.rolling(125,min_periods=100).corr(spx_rets)
corr.plot()

11.7.3 User-Defined Moving Window Functions

from scipy.stats import percentileofscore
#percentileofscore(x, 0.2) returns the percentile rank of the value 0.2 within the sample x
score_at_2percent = lambda x:percentileofscore(x,0.2)
result = returns.AAPL.rolling(250).apply(score_at_2percent)
result.plot()

Chapter 12: Advanced pandas

12.1 Categorical Data

12.1.1 Background and Motivation

import numpy as np
import pandas as pd
values = pd.Series(['apple','orange','apple','apple']*2)
values
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object
pd.unique(values)
array(['apple', 'orange'], dtype=object)
pd.value_counts(values)
apple     6
orange    2
dtype: int64
values = pd.Series([0,1,0,0]*2)
values
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64
dim = pd.Series(['apple', 'orange'])
dim
0     apple
1    orange
dtype: object
#You can use the take method to restore the original Series of strings
dim.take(values)
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
#This representation as integers is called the categorical or dictionary-encoded representation

12.1.2 Categorical Type in pandas

  • pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding.
fruits = ['apple','orange','apple','apple']*2
fruits
['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
N = len(fruits)
N
8
df = pd.DataFrame({'fruit':fruits,
                  'basket_id':np.arange(N),
                  'count':np.random.randint(3,15,size=N),
                  'weight':np.random.uniform(0,4,size=N)})
df
    fruit  basket_id  count    weight
0   apple          0     14  3.004288
1  orange          1      9  1.432892
2   apple          2     10  2.859906
3   apple          3      6  2.328174
4   apple          4     12  3.558704
5  orange          5     13  1.199922
6   apple          6      5  0.090920
7   apple          7      8  2.336066
#df['fruit'] is an array of Python string objects.
#We can convert it to a Categorical object by calling astype
fruit_cat = df['fruit'].astype('category')
fruit_cat
#The values of fruit_cat are not a NumPy array but an instance of pandas.Categorical
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
c = fruit_cat.values
c
['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']
type(c)
pandas.core.arrays.categorical.Categorical
#Categorical objects have categories and codes attributes
c.categories
Index(['apple', 'orange'], dtype='object')
c.codes
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
#You can convert a DataFrame column to Categorical by assigning the converted result
df['fruit'] = df['fruit'].astype('category')
df.fruit
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
#You can also create pandas.Categorical directly from other Python sequence types
my_categories = pd.Categorical(['foo','bar','baz','foo','bar'])
my_categories
['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']
#If you have obtained categorical encoded data from another source, you can use the from_codes constructor
categories = ['foo','bar','baz']
codes = [0,1,2,0,0,1]
my_cats_2 = pd.Categorical.from_codes(codes,categories)
my_cats_2
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']
#When using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering
ordered_cat = pd.Categorical.from_codes(codes,categories,ordered=True)
ordered_cat
#The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering, and so on
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
#An unordered categorical instance can be made ordered with as_ordered
my_cats_2.as_ordered()
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

12.1.3 Computations with Categoricals

np.random.seed(12345)
draws = np.random.randn(1000)
draws[:5]
array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])
#Let's compute a quartile binning of this data and extract some statistics
bins = pd.cut(draws,4)
bins
[(-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (0.489, 2.208], ..., (-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (0.489, 2.208], (0.489, 2.208]]
Length: 1000
Categories (4, interval[float64]): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]
bins = pd.qcut(draws,4,labels=['Q1','Q2','Q3','Q4'])
bins
['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
bins = pd.Series(bins,name='quartile')
results = (pd.Series(draws)
          .groupby(bins)
          .agg(['count','min','max'])
          .reset_index())
results
  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528
results['quartile']
0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
12.1.3.1 Better performance with categoricals
N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo','bar','baz','qux']*(N//4))
categories = labels.astype('category')
#labels uses significantly more memory than categories
labels.memory_usage()
80000128
categories.memory_usage()
10000332
#The conversion to category is not free, of course, but it is a one-time cost
%timeit labels.astype('category')
299 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
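The one-time cost also pays off in repeated operations: GroupBy among other operations tends to run faster on categoricals, since it can work on the integer codes rather than on an array of strings. A quick, illustrative check reusing draws, labels, and categories from above (timings are machine-dependent):
%timeit draws.groupby(labels).mean()
%timeit draws.groupby(categories).mean()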

12.1.4 Categorical Methods

  • Categorical methods for Series in pandas:
Method                    Description
add_categories            Append new (unused) categories at end of existing categories
as_ordered                Make the categories ordered
as_unordered              Make the categories unordered
remove_categories         Remove categories, setting any removed values to null
remove_unused_categories  Remove any category values that do not appear in the data
rename_categories         Replace the categories with the indicated set of new category names; cannot change the number of categories
reorder_categories        Behaves like rename_categories, but the result has ordered categories
set_categories            Replace the categories with the indicated set of new categories; can add or remove categories
s = pd.Series(['a','b','c','d']*2)
cat_s = s.astype('category')
cat_s
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
#The special attribute cat provides access to categorical methods
cat_s.cat.codes
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8
cat_s.cat.categories
Index(['a', 'b', 'c', 'd'], dtype='object')
#You can use the set_categories method to change the categories
actual_categories = ['a','b','c','d','e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
cat_s.value_counts()
a    2
b    2
c    2
d    2
dtype: int64
cat_s2.value_counts()
a    2
b    2
c    2
d    2
e    0
dtype: int64
#You can use the remove_unused_categories method to trim unobserved categories
cat_s3 = cat_s[cat_s2.isin(['a','b'])]
cat_s3
0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
cat_s3.cat.remove_unused_categories()
0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']
12.1.4.1 Creating dummy variables for modeling
  • When you're using statistics or machine learning tools, you'll often transform categorical data into dummy variables, also known as one-hot encoding.
  • This produces a DataFrame with a column for each distinct category; these columns contain 1 where the given category appears and 0 otherwise.
cat_s = pd.Series(['a','b','c','d']*2,dtype='category')
cat_s
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
#The pandas.get_dummies function converts one-dimensional categorical data into a DataFrame of dummy variables
pd.get_dummies(cat_s)
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1

12.2 Advanced GroupBy Use

12.2.1 Group Transforms and "Unwrapped" GroupBys

  • The built-in transform method:
    • transform can produce a scalar value to be broadcast to the shape of the group
    • transform can produce an object of the same shape as the input group
    • transform must not mutate its input
df = pd.DataFrame({'key':['a','b','c']*4,
                  'value':np.arange(12.)})
df
   key  value
0    a    0.0
1    b    1.0
2    c    2.0
3    a    3.0
4    b    4.0
5    c    5.0
6    a    6.0
7    b    7.0
8    c    8.0
9    a    9.0
10   b   10.0
11   c   11.0
g = df.groupby('key').value
g.mean()
key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64
g.transform(lambda x:x.mean())
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
#For built-in aggregation functions, we can pass a string alias as with the GroupBy agg method
g.transform('mean')
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
g.transform(lambda x:x*2)
0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64
#We can compute the ranks in descending order for each group
g.transform(lambda x:x.rank(ascending=False))
0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64
#Consider a group transformation function composed from simple aggregations
def normalize(x):
    return (x-x.mean())/x.std()
g.transform(normalize)
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
g.apply(normalize)
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
g.transform('mean')
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
normalized = (df['value']-g.transform('mean'))/g.transform('std')
normalized
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

12.2.2 Grouped Time Resampling

  • For time series data, the resample method is semantically a group operation based on a partitioning of time.
N = 15
times = pd.date_range('2017-05-20 00:00',freq='1min',periods=N)
times
DatetimeIndex(['2017-05-20 00:00:00', '2017-05-20 00:01:00',
               '2017-05-20 00:02:00', '2017-05-20 00:03:00',
               '2017-05-20 00:04:00', '2017-05-20 00:05:00',
               '2017-05-20 00:06:00', '2017-05-20 00:07:00',
               '2017-05-20 00:08:00', '2017-05-20 00:09:00',
               '2017-05-20 00:10:00', '2017-05-20 00:11:00',
               '2017-05-20 00:12:00', '2017-05-20 00:13:00',
               '2017-05-20 00:14:00'],
              dtype='datetime64[ns]', freq='T')
df = pd.DataFrame({'time':times,
                  'value':np.arange(N)})
df
                  time  value
0  2017-05-20 00:00:00      0
1  2017-05-20 00:01:00      1
2  2017-05-20 00:02:00      2
3  2017-05-20 00:03:00      3
4  2017-05-20 00:04:00      4
5  2017-05-20 00:05:00      5
6  2017-05-20 00:06:00      6
7  2017-05-20 00:07:00      7
8  2017-05-20 00:08:00      8
9  2017-05-20 00:09:00      9
10 2017-05-20 00:10:00     10
11 2017-05-20 00:11:00     11
12 2017-05-20 00:12:00     12
13 2017-05-20 00:13:00     13
14 2017-05-20 00:14:00     14
df.set_index('time').resample('5min').count()
                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5
df2 = pd.DataFrame({'time':times.repeat(3),
                   'key':np.tile(['a','b','c'],N),
                   'value':np.arange(N*3.0)})
df2[:7]
                 time key  value
0 2017-05-20 00:00:00   a    0.0
1 2017-05-20 00:00:00   b    1.0
2 2017-05-20 00:00:00   c    2.0
3 2017-05-20 00:01:00   a    3.0
4 2017-05-20 00:01:00   b    4.0
5 2017-05-20 00:01:00   c    5.0
6 2017-05-20 00:02:00   a    6.0
  • pd.TimeGrouper can no longer be called directly in recent pandas versions
  • https://blog.csdn.net/weixin_51701683/article/details/110710765
  • One workaround is importing resample from pandas.core, as done below; see also the pd.Grouper sketch at the end of this example
from pandas.core import resample
#To do the same resampling for each value of 'key', we can use the pandas TimeGrouper object
#One constraint with using TimeGrouper is that the time must be the index of the Series or DataFrame
time_key = resample.TimeGrouper('5min')
time_key
TimeGrouper(freq=<5 * Minutes>, axis=0, sort=True, closed='left', label='left', how='mean', convention='e', origin='start_day')
resampled = (df2.set_index('time')
             .groupby(['key',time_key])
             .sum())
resampled
                         value
key time
a   2017-05-20 00:00:00   30.0
    2017-05-20 00:05:00  105.0
    2017-05-20 00:10:00  180.0
b   2017-05-20 00:00:00   35.0
    2017-05-20 00:05:00  110.0
    2017-05-20 00:10:00  185.0
c   2017-05-20 00:00:00   40.0
    2017-05-20 00:05:00  115.0
    2017-05-20 00:10:00  190.0
resampled.reset_index()
  key                time  value
0   a 2017-05-20 00:00:00   30.0
1   a 2017-05-20 00:05:00  105.0
2   a 2017-05-20 00:10:00  180.0
3   b 2017-05-20 00:00:00   35.0
4   b 2017-05-20 00:05:00  110.0
5   b 2017-05-20 00:10:00  185.0
6   c 2017-05-20 00:00:00   40.0
7   c 2017-05-20 00:05:00  115.0
8   c 2017-05-20 00:10:00  190.0
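In current pandas releases, where TimeGrouper has been removed, pd.Grouper covers the same use case and is the supported spelling. A minimal sketch reproducing the aggregation above:
#pd.Grouper replaces the removed pd.TimeGrouper; the time values must
#either be the index or be named via the key argument
time_key = pd.Grouper(freq='5min')
resampled = (df2.set_index('time')
             .groupby(['key',time_key])
             .sum())
resampled.reset_index()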

12.3 Techniques for Method Chaining

  • In-place assignment may be faster than using assign, but assign enables more convenient method chaining; a minimal sketch of the contrast follows below.
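A minimal sketch of the contrast, assuming a DataFrame df with a numeric 'value' column like the one used in 12.2.1:
#In-place assignment: fast, but each step needs its own statement
df2 = df.copy()
df2['demeaned'] = df2['value'] - df2['value'].mean()

#assign returns a new DataFrame, so it can sit inside a method chain
result = (df.assign(demeaned=df['value'] - df['value'].mean())
            .query('demeaned > 0'))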

12.3.1 The pipe Method

  • The expressions f(df) and df.pipe(f) are equivalent, but pipe makes chained invocation easier; see the sketch below.
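A minimal sketch: group_demean below is a hypothetical helper, not a pandas function, but any function that expects a DataFrame as its first argument can be chained this way:
def group_demean(frame, by, col):
    #subtract the group-wise mean of col within groups defined by by
    return frame.assign(**{col: frame[col] - frame.groupby(by)[col].transform('mean')})

#f(df) and df.pipe(f) are equivalent, but pipe keeps the chain readable
result = (df[df['value'] >= 0]
          .pipe(group_demean, 'key', 'value'))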

12.4 Conclusion
