第十章数据聚合与分组操作
使用一个或多个键(以函数、数组或DataFrame列名的形式)将pandas对象拆分为多块 计算组汇总统计信息,如计数、平均值或标准偏差或用户定义的函数 应用组内变换或其他操作,如标准化、线性回归、排位或子集选择 计算数据透视表和交叉表 执行分位数分析和其他统计组分析
10.1 GroupBy机制
import numpy as np
import pandas as pd
df = pd. DataFrame( { 'data1' : np. random. randn( 5 ) ,
'data2' : np. random. randn( 5 ) ,
'key1' : [ 'a' , 'a' , 'b' , 'b' , 'a' ] ,
'key2' : [ 'one' , 'two' , 'one' , 'two' , 'one' ] } )
df
data1 data2 key1 key2 0 0.405121 0.233504 a one 1 -0.419340 0.207110 a two 2 0.871916 -0.476933 b one 3 -0.311263 0.614344 b two 4 0.498925 0.502884 a one
grouped = df[ 'data1' ] . groupby( df[ 'key1' ] )
grouped. mean( )
key1
a 0.161569
b 0.280326
Name: data1, dtype: float64
means = df[ 'data1' ] . groupby( [ df[ 'key1' ] , df[ 'key2' ] ] ) . mean( )
means
key1 key2
a one 0.452023
two -0.419340
b one 0.871916
two -0.311263
Name: data1, dtype: float64
means. unstack( )
key2 one two key1 a 0.452023 -0.419340 b 0.871916 -0.311263
states = np. array( [ 'ohio' , 'california' , 'california' , 'ohio' , 'ohio' ] )
years = np. array( [ 2005 , 2005 , 2006 , 2005 , 2006 ] )
df[ 'data1' ] . groupby( [ states, years] ) . mean( )
california 2005 -0.419340
2006 0.871916
ohio 2005 0.046929
2006 0.498925
Name: data1, dtype: float64
df. groupby( 'key1' ) . mean( )
data1 data2 key1 a 0.161569 0.314499 b 0.280326 0.068706
df. groupby( [ 'key1' , 'key2' ] ) . mean( )
data1 data2 key1 key2 a one 0.452023 0.368194 two -0.419340 0.207110 b one 0.871916 -0.476933 two -0.311263 0.614344
df. groupby( [ 'key1' , 'key2' ] ) . size( )
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
10.1.1 遍历各分组
GroupBy对象支持迭代,会生成一个包含组名和数据块的2维元组序列。
for name, group in df. groupby( 'key1' ) :
print ( name)
print ( group)
a
data1 data2 key1 key2
0 0.405121 0.233504 a one
1 -0.419340 0.207110 a two
4 0.498925 0.502884 a one
b
data1 data2 key1 key2
2 0.871916 -0.476933 b one
3 -0.311263 0.614344 b two
df
data1 data2 key1 key2 0 0.405121 0.233504 a one 1 -0.419340 0.207110 a two 2 0.871916 -0.476933 b one 3 -0.311263 0.614344 b two 4 0.498925 0.502884 a one
for ( k1, k2) , group in df. groupby( [ 'key1' , 'key2' ] ) :
print ( ( k1, k2) )
print ( group)
('a', 'one')
data1 data2 key1 key2
0 0.405121 0.233504 a one
4 0.498925 0.502884 a one
('a', 'two')
data1 data2 key1 key2
1 -0.41934 0.20711 a two
('b', 'one')
data1 data2 key1 key2
2 0.871916 -0.476933 b one
('b', 'two')
data1 data2 key1 key2
3 -0.311263 0.614344 b two
pieces = dict ( list ( df. groupby( 'key1' ) ) )
pieces[ 'b' ]
data1 data2 key1 key2 2 0.871916 -0.476933 b one 3 -0.311263 0.614344 b two
df. dtypes
data1 float64
data2 float64
key1 object
key2 object
dtype: object
for dtype, group in grouped:
print ( dtype)
print ( group)
a
0 0.405121
1 -0.419340
4 0.498925
Name: data1, dtype: float64
b
2 0.871916
3 -0.311263
Name: data1, dtype: float64
10.1.2 选择一列或所有列的子集
将从DataFrame创建的GroupBy对象用列名称或列名称数组进行索引时,会产生用于聚合的列子集的效果。 “语法糖(Syntactic sugar),也译为糖衣语法,是由英国计算机科学家彼得·约翰·兰达(Peter J. Landin)发明的一个术语,指计算机语言中添加的某种语法,这种语法对语言的功能并没有影响,但是更方便程序员使用。通常来说使用语法糖能够增加程序的可读性,从而减少程序代码出错的机会。”
df. groupby( 'key1' ) [ 'data1' ]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000259B9B6D760>
df. groupby( 'key1' ) [ [ 'data2' ] ]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B9B70370>
df[ 'data1' ] . groupby( df[ 'key1' ] )
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000259B9B70610>
df[ [ 'data2' ] ] . groupby( df[ 'key1' ] )
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B9B70FA0>
df. groupby( [ 'key1' , 'key2' ] ) [ [ 'data2' ] ] . mean( )
data2 key1 key2 a one 0.368194 two 0.207110 b one -0.476933 two 0.614344
s_grouped = df. groupby( [ 'key1' , 'key2' ] ) [ 'data2' ]
s_grouped. mean( )
key1 key2
a one 0.368194
two 0.207110
b one -0.476933
two 0.614344
Name: data2, dtype: float64
10.1.3 使用字典和Series分组
people = pd. DataFrame( np. random. randn( 5 , 5 ) ,
columns = [ 'a' , 'b' , 'c' , 'd' , 'e' ] ,
index = [ 'joe' , 'steve' , 'wes' , 'jim' , 'travis' ] )
people
a b c d e joe 1.176984 -0.779043 1.232603 0.811508 -0.114978 steve -0.162964 0.936905 -0.671911 -0.660697 0.297713 wes -1.154320 -0.351915 -2.286200 -0.200603 0.990375 jim -0.496740 0.474151 0.058887 2.775079 -0.615440 travis -0.737061 1.470497 -0.173087 0.766692 -0.234068
people. iloc[ 2 : 3 , [ 1 , 4 ] ] = np. nan
people
a b c d e joe 1.176984 -0.779043 1.232603 0.811508 -0.114978 steve -0.162964 0.936905 -0.671911 -0.660697 0.297713 wes -1.154320 NaN -2.286200 -0.200603 NaN jim -0.496740 0.474151 0.058887 2.775079 -0.615440 travis -0.737061 1.470497 -0.173087 0.766692 -0.234068
mapping = { 'a' : 'red' , 'b' : 'red' , 'c' : 'blue' , 'd' : 'blue' , 'e' : 'red' , 'f' : 'orange' }
by_column = people. groupby( mapping, axis= 1 )
by_column. sum ( )
blue red joe 2.044111 0.282963 steve -1.332608 1.071654 wes -2.486803 -1.154320 jim 2.833967 -0.638030 travis 0.593605 0.499368
map_series = pd. Series( mapping)
map_series
a red
b red
c blue
d blue
e red
f orange
dtype: object
people. groupby( map_series, axis= 1 ) . count( )
blue red joe 2 3 steve 2 3 wes 2 1 jim 2 3 travis 2 3
10.1.4 使用函数分组
people
a b c d e joe 1.176984 -0.779043 1.232603 0.811508 -0.114978 steve -0.162964 0.936905 -0.671911 -0.660697 0.297713 wes -1.154320 NaN -2.286200 -0.200603 NaN jim -0.496740 0.474151 0.058887 2.775079 -0.615440 travis -0.737061 1.470497 -0.173087 0.766692 -0.234068
people. groupby( len ) . sum ( )
a b c d e 3 -0.474076 -0.304892 -0.994710 3.385984 -0.730418 5 -0.162964 0.936905 -0.671911 -0.660697 0.297713 6 -0.737061 1.470497 -0.173087 0.766692 -0.234068
key_list = [ 'one' , 'one' , 'one' , 'two' , 'two' ]
people. groupby( [ len , key_list] ) . min ( )
a b c d e 3 one -1.154320 -0.779043 -2.286200 -0.200603 -0.114978 two -0.496740 0.474151 0.058887 2.775079 -0.615440 5 one -0.162964 0.936905 -0.671911 -0.660697 0.297713 6 two -0.737061 1.470497 -0.173087 0.766692 -0.234068
10.1.5 根据索引层级分组
hire_df = pd. DataFrame( np. random. randn( 4 , 5 ) ,
columns = pd. MultiIndex. from_arrays( [ [ 'US' , 'US' , 'US' , 'JP' , 'JP' ]
, [ 1 , 3 , 5 , 1 , 3 ] ] , names = [ 'cty' , 'tenor' ] ) )
hire_df
cty US JP tenor 1 3 5 1 3 0 0.840245 2.187988 -0.393045 -0.150897 1.573875 1 1.329124 0.070597 -0.246468 -0.147710 -0.397490 2 0.362198 0.404290 -0.233373 -0.406261 0.812492 3 1.104010 -0.485481 0.439878 -1.948048 0.784266
hire_df. groupby( level= 'cty' , axis= 1 ) . count( )
10.2 数据聚合
函数名 描述 count 分组中的非NA值数量 sum 非NA值的累和 mean 非NA值的均值 median 非NA值的算术中位数 std.var 无偏的(n-1 分母)标准差和方差 min,max 非NA值的最小值.最大值 prod 非NA值的乘积 first、last 非NA值的第-个和最后-个值
df
data1 data2 key1 key2 0 0.405121 0.233504 a one 1 -0.419340 0.207110 a two 2 0.871916 -0.476933 b one 3 -0.311263 0.614344 b two 4 0.498925 0.502884 a one
grouped = df. groupby( 'key1' )
grouped[ 'data1' ] . quantile( 0.9 )
key1
a 0.480164
b 0.753598
Name: data1, dtype: float64
def peak_to_peak ( arr) :
return arr. max ( ) - arr. min ( )
grouped. agg( peak_to_peak)
data1 data2 key1 a 0.918265 0.295774 b 1.183179 1.091277
grouped. describe( )
data1 data2 count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max key1 a 3.0 0.161569 0.505263 -0.419340 -0.007109 0.405121 0.452023 0.498925 3.0 0.314499 0.163679 0.207110 0.220307 0.233504 0.368194 0.502884 b 2.0 0.280326 0.836634 -0.311263 -0.015468 0.280326 0.576121 0.871916 2.0 0.068706 0.771649 -0.476933 -0.204113 0.068706 0.341525 0.614344
10.2.1 逐列及多函数应用
tips = pd. read_csv( 'examples/tips.csv' )
tips. head( )
total_bill tip smoker day time size 0 16.99 1.01 No Sun Dinner 2 1 10.34 1.66 No Sun Dinner 3 2 21.01 3.50 No Sun Dinner 3 3 23.68 3.31 No Sun Dinner 2 4 24.59 3.61 No Sun Dinner 4
tips[ 'tip_pct' ] = tips[ 'tip' ] / tips[ 'total_bill' ]
tips[ : 8 ]
total_bill tip smoker day time size tip_pct 0 16.99 1.01 No Sun Dinner 2 0.059447 1 10.34 1.66 No Sun Dinner 3 0.160542 2 21.01 3.50 No Sun Dinner 3 0.166587 3 23.68 3.31 No Sun Dinner 2 0.139780 4 24.59 3.61 No Sun Dinner 4 0.146808 5 25.29 4.71 No Sun Dinner 4 0.186240 6 8.77 2.00 No Sun Dinner 2 0.228050 7 26.88 3.12 No Sun Dinner 4 0.116071
grouped = tips. groupby( [ 'day' , 'smoker' ] )
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B96733A0>
grouped_pct = grouped[ 'tip_pct' ]
grouped_pct. agg( 'mean' )
day smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
grouped_pct. agg( [ 'mean' , 'std' , peak_to_peak] )
mean std peak_to_peak day smoker Fri No 0.151650 0.028123 0.067349 Yes 0.174783 0.051293 0.159925 Sat No 0.158048 0.039767 0.235193 Yes 0.147906 0.061375 0.290095 Sun No 0.160113 0.042347 0.193226 Yes 0.187250 0.154134 0.644685 Thur No 0.160298 0.038774 0.193350 Yes 0.163863 0.039389 0.151240
grouped_pct. agg( [ ( 'foo' , 'mean' ) , ( 'bar' , np. std) ] )
foo bar day smoker Fri No 0.151650 0.028123 Yes 0.174783 0.051293 Sat No 0.158048 0.039767 Yes 0.147906 0.061375 Sun No 0.160113 0.042347 Yes 0.187250 0.154134 Thur No 0.160298 0.038774 Yes 0.163863 0.039389
functions = [ 'count' , 'mean' , 'max' ]
result = grouped[ 'tip_pct' , 'total_bill' ] . agg( functions)
result
<ipython-input-74-05231bb63e58>:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
result = grouped['tip_pct','total_bill'].agg(functions)
tip_pct total_bill count mean max count mean max day smoker Fri No 4 0.151650 0.187735 4 18.420000 22.75 Yes 15 0.174783 0.263480 15 16.813333 40.17 Sat No 45 0.158048 0.291990 45 19.661778 48.33 Yes 42 0.147906 0.325733 42 21.276667 50.81 Sun No 57 0.160113 0.252672 57 20.506667 48.17 Yes 19 0.187250 0.710345 19 24.120000 45.35 Thur No 45 0.160298 0.266312 45 17.113111 41.19 Yes 17 0.163863 0.241255 17 19.190588 43.11
result[ 'tip_pct' ]
count mean max day smoker Fri No 4 0.151650 0.187735 Yes 15 0.174783 0.263480 Sat No 45 0.158048 0.291990 Yes 42 0.147906 0.325733 Sun No 57 0.160113 0.252672 Yes 19 0.187250 0.710345 Thur No 45 0.160298 0.266312 Yes 17 0.163863 0.241255
ftuples = [ ( 'Durch' , 'mean' ) , ( 'abwe' , np. var) ]
grouped[ 'tip_pct' , 'total_bill' ] . agg( ftuples)
<ipython-input-76-ce86ef210451>:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
grouped['tip_pct','total_bill'].agg(ftuples)
tip_pct total_bill Durch abwe Durch abwe day smoker Fri No 0.151650 0.000791 18.420000 25.596333 Yes 0.174783 0.002631 16.813333 82.562438 Sat No 0.158048 0.001581 19.661778 79.908965 Yes 0.147906 0.003767 21.276667 101.387535 Sun No 0.160113 0.001793 20.506667 66.099980 Yes 0.187250 0.023757 24.120000 109.046044 Thur No 0.160298 0.001503 17.113111 59.625081 Yes 0.163863 0.001551 19.190588 69.808518
grouped. agg( { 'tip' : np. max , 'size' : 'sum' } )
tip size day smoker Fri No 3.50 9 Yes 4.73 31 Sat No 9.00 115 Yes 10.00 104 Sun No 6.00 167 Yes 6.50 49 Thur No 6.70 112 Yes 5.00 40
10.2.2 返回不含行索引的聚合数据
通过在结果上调用reset_index也可以获得同样的结果。使用as_index=False可以避免一些不必要的计算。
tips. groupby( [ 'day' , 'smoker' ] , as_index= False ) . mean( )
day smoker total_bill tip size tip_pct 0 Fri No 18.420000 2.812500 2.250000 0.151650 1 Fri Yes 16.813333 2.714000 2.066667 0.174783 2 Sat No 19.661778 3.102889 2.555556 0.158048 3 Sat Yes 21.276667 2.875476 2.476190 0.147906 4 Sun No 20.506667 3.167895 2.929825 0.160113 5 Sun Yes 24.120000 3.516842 2.578947 0.187250 6 Thur No 17.113111 2.673778 2.488889 0.160298 7 Thur Yes 19.190588 3.030000 2.352941 0.163863
10.3 应用:通用拆分-应用-联合
import pandas as pd
tips = pd. read_csv( 'examples/tips.csv' )
tips. head( )
total_bill tip smoker day time size 0 16.99 1.01 No Sun Dinner 2 1 10.34 1.66 No Sun Dinner 3 2 21.01 3.50 No Sun Dinner 3 3 23.68 3.31 No Sun Dinner 2 4 24.59 3.61 No Sun Dinner 4
tips[ 'tip_pct' ] = tips[ 'tip' ] / ( tips[ 'total_bill' ] - tips[ 'tip' ] )
def top ( df, n= 5 , column= 'tip_pct' ) :
return df. sort_values( by= column) [ - n: ]
top( tips, n= 6 )
total_bill tip smoker day time size tip_pct 109 14.31 4.00 Yes Sat Dinner 2 0.387973 183 23.17 6.50 Yes Sun Dinner 4 0.389922 232 11.61 3.39 No Sat Dinner 2 0.412409 67 3.07 1.00 Yes Sat Dinner 1 0.483092 178 9.60 4.00 Yes Sun Dinner 2 0.714286 172 7.25 5.15 Yes Sun Dinner 2 2.452381
tips. groupby( 'smoker' ) . apply ( top)
total_bill tip smoker day time size tip_pct smoker No 88 24.71 5.85 No Thur Lunch 2 0.310180 185 20.69 5.00 No Sun Dinner 5 0.318674 51 10.29 2.60 No Sun Dinner 2 0.338101 149 7.51 2.00 No Thur Lunch 2 0.362976 232 11.61 3.39 No Sat Dinner 2 0.412409 Yes 109 14.31 4.00 Yes Sat Dinner 2 0.387973 183 23.17 6.50 Yes Sun Dinner 4 0.389922 67 3.07 1.00 Yes Sat Dinner 1 0.483092 178 9.60 4.00 Yes Sun Dinner 2 0.714286 172 7.25 5.15 Yes Sun Dinner 2 2.452381
tips. groupby( [ 'smoker' , 'day' ] ) . apply ( top, n= 1 , column= 'total_bill' )
total_bill tip smoker day time size tip_pct smoker day No Fri 94 22.75 3.25 No Fri Dinner 2 0.166667 Sat 212 48.33 9.00 No Sat Dinner 4 0.228833 Sun 156 48.17 5.00 No Sun Dinner 6 0.115821 Thur 142 41.19 5.00 No Thur Lunch 5 0.138160 Yes Fri 95 40.17 4.73 Yes Fri Dinner 4 0.133465 Sat 170 50.81 10.00 Yes Sat Dinner 3 0.245038 Sun 182 45.35 3.50 Yes Sun Dinner 3 0.083632 Thur 197 43.11 5.00 Yes Thur Lunch 4 0.131199
result = tips. groupby( 'smoker' ) [ 'tip_pct' ] . describe( )
result
count mean std min 25% 50% 75% max smoker No 151.0 0.192237 0.057665 0.060217 0.158622 0.184308 0.227015 0.412409 Yes 93.0 0.218176 0.254295 0.036955 0.119534 0.181818 0.242326 2.452381
result. unstack( 'smoker' )
smoker
count No 151.000000
Yes 93.000000
mean No 0.192237
Yes 0.218176
std No 0.057665
Yes 0.254295
min No 0.060217
Yes 0.036955
25% No 0.158622
Yes 0.119534
50% No 0.184308
Yes 0.181818
75% No 0.227015
Yes 0.242326
max No 0.412409
Yes 2.452381
dtype: float64
f = lambda x : x. describe( )
grouped. apply ( f)
total_bill tip size tip_pct day smoker Fri No count 4.000000 4.000000 4.00 4.000000 mean 18.420000 2.812500 2.25 0.151650 std 5.059282 0.898494 0.50 0.028123 min 12.460000 1.500000 2.00 0.120385 25% 15.100000 2.625000 2.00 0.137239 ... ... ... ... ... ... ... Thur Yes min 10.340000 2.000000 2.00 0.090014 25% 13.510000 2.000000 2.00 0.148038 50% 16.470000 2.560000 2.00 0.153846 75% 19.810000 4.000000 2.00 0.194837 max 43.110000 5.000000 4.00 0.241255
64 rows × 4 columns
10.3.1 压缩分组键
你可以看到所得到的对象具有分组键所形成的分层索引以及每个原始对象的索引。 你可以通过向groupby传递group_keys=False来禁用这个功能
tips. groupby( 'smoker' ) . apply ( top)
total_bill tip smoker day time size tip_pct smoker No 88 24.71 5.85 No Thur Lunch 2 0.310180 185 20.69 5.00 No Sun Dinner 5 0.318674 51 10.29 2.60 No Sun Dinner 2 0.338101 149 7.51 2.00 No Thur Lunch 2 0.362976 232 11.61 3.39 No Sat Dinner 2 0.412409 Yes 109 14.31 4.00 Yes Sat Dinner 2 0.387973 183 23.17 6.50 Yes Sun Dinner 4 0.389922 67 3.07 1.00 Yes Sat Dinner 1 0.483092 178 9.60 4.00 Yes Sun Dinner 2 0.714286 172 7.25 5.15 Yes Sun Dinner 2 2.452381
tips. groupby( 'smoker' , group_keys= False ) . apply ( top)
total_bill tip smoker day time size tip_pct 88 24.71 5.85 No Thur Lunch 2 0.310180 185 20.69 5.00 No Sun Dinner 5 0.318674 51 10.29 2.60 No Sun Dinner 2 0.338101 149 7.51 2.00 No Thur Lunch 2 0.362976 232 11.61 3.39 No Sat Dinner 2 0.412409 109 14.31 4.00 Yes Sat Dinner 2 0.387973 183 23.17 6.50 Yes Sun Dinner 4 0.389922 67 3.07 1.00 Yes Sat Dinner 1 0.483092 178 9.60 4.00 Yes Sun Dinner 2 0.714286 172 7.25 5.15 Yes Sun Dinner 2 2.452381
10.3.2 分位数与桶分析
pandas有一些工具,尤其是cut和qcut,用于将数据按照你选择的箱位或样本分位数进行分桶。 与groupby方法一起使用这些函数可以对数据集更方便地进行分桶或分位分析
frame = pd. DataFrame( { 'data1' : np. random. randn( 1000 ) ,
'data2' : np. random. randn( 1000 ) } )
frame. head( )
data1 data2 0 1.038225 -0.225594 1 -1.142048 0.762311 2 -0.383751 0.319326 3 0.122939 -1.205274 4 -1.557777 -0.495632
quertitles = pd. cut( frame. data1, 4 )
quertitles[ : 10 ]
0 (-0.175, 1.544]
1 (-1.894, -0.175]
2 (-1.894, -0.175]
3 (-0.175, 1.544]
4 (-1.894, -0.175]
5 (-1.894, -0.175]
6 (-1.894, -0.175]
7 (-3.62, -1.894]
8 (-1.894, -0.175]
9 (-0.175, 1.544]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.62, -1.894] < (-1.894, -0.175] < (-0.175, 1.544] < (1.544, 3.263]]
def get_stats ( group) :
return { 'min' : group. min ( ) , 'max' : group. max ( ) , 'count' : group. count( ) , 'mean' : group. mean( ) }
grouped = frame. data2. groupby( quertitles)
grouped. apply ( get_stats) . unstack( )
min max count mean data1 (-3.62, -1.894] -1.687871 1.887558 30.0 -0.210904 (-1.894, -0.175] -2.898182 3.015219 412.0 0.043604 (-0.175, 1.544] -2.829533 2.779859 495.0 -0.040127 (1.544, 3.263] -1.966623 2.059859 63.0 0.051666
grouping = pd. cut( frame. data1, 10 , labels= False )
grouped = frame. data2. groupby( grouping)
grouped. apply ( get_stats) . unstack( )
min max count mean data1 0 -1.351679 0.479551 3.0 -0.162551 1 -1.396584 1.887558 11.0 0.134897 2 -2.122874 1.734941 47.0 -0.178326 3 -2.020175 3.015219 136.0 -0.004133 4 -2.898182 2.468129 245.0 0.079938 5 -2.758906 2.779859 253.0 -0.035156 6 -2.829533 2.439201 187.0 -0.045802 7 -2.250796 2.128096 86.0 -0.033331 8 -1.605125 2.059859 29.0 0.083334 9 -0.029929 0.692138 3.0 0.433789
10.3.3 示例:使用指定分组值填充缺失值
s = pd. Series( np. random. randn( 6 ) )
s
0 -1.825708
1 -0.406503
2 0.319659
3 0.077313
4 -0.304391
5 0.369055
dtype: float64
s[ : : 2 ] = np. nan
s
0 NaN
1 -0.406503
2 NaN
3 0.077313
4 NaN
5 0.369055
dtype: float64
s. fillna( s. mean( ) )
0 0.013288
1 -0.406503
2 0.013288
3 0.077313
4 0.013288
5 0.369055
dtype: float64
states = [ 'ohio' , 'new york' , 'vermont' , 'florida' , 'oregon' , 'nevada' , 'california' , 'idaho' ]
states
['ohio',
'new york',
'vermont',
'florida',
'oregon',
'nevada',
'california',
'idaho']
group_key = [ 'east' ] * 4 + [ 'west' ] * 4
group_key
['east', 'east', 'east', 'east', 'west', 'west', 'west', 'west']
data = pd. Series( np. random. randn( 8 ) , index= states)
data
ohio -1.196387
new york 1.116637
vermont -2.096959
florida -0.482581
oregon -0.426209
nevada 0.494309
california -1.444239
idaho 2.388043
dtype: float64
data[ [ 'vermont' , 'nevada' , 'idaho' ] ] = np. nan
data
ohio -1.196387
new york 1.116637
vermont NaN
florida -0.482581
oregon -0.426209
nevada NaN
california -1.444239
idaho NaN
dtype: float64
data. groupby( group_key) . mean( )
east -0.187444
west -0.935224
dtype: float64
fill_mean = lambda g : g. fillna( g. mean( ) )
data. groupby( group_key) . apply ( fill_mean)
ohio -1.196387
new york 1.116637
vermont -0.187444
florida -0.482581
oregon -0.426209
nevada -0.935224
california -1.444239
idaho -0.935224
dtype: float64
fill_values = { 'east' : 0.5 , 'west' : - 1 }
fill_func = lambda g : g. fillna( fill_values[ g. name] )
data. groupby( group_key) . apply ( fill_func)
ohio -1.196387
new york 1.116637
vermont 0.500000
florida -0.482581
oregon -0.426209
nevada -1.000000
california -1.444239
idaho -1.000000
dtype: float64
10.3.4 示例:随机采样与排列
suits = [ 'H' , 'S' , 'C' , 'D' ]
card_val = ( list ( range ( 1 , 11 ) ) + [ 10 ] * 3 ) * 4
card_val[ - 5 : ]
[9, 10, 10, 10, 10]
base_names = [ 'A' ] + list ( range ( 2 , 11 ) ) + [ 'J' , 'K' , 'Q' ]
base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
cards = [ ]
for suit in suits:
cards. extend( str ( num) + suit for num in base_names)
deck = pd. Series( card_val, index= cards)
deck. head( )
AH 1
2H 2
3H 3
4H 4
5H 5
dtype: int64
deck[ : 13 ]
AH 1
2H 2
3H 3
4H 4
5H 5
6H 6
7H 7
8H 8
9H 9
10H 10
JH 10
KH 10
QH 10
dtype: int64
def draw ( deck, n= 5 ) :
return deck. sample( 5 )
draw( deck)
9S 9
8S 8
3S 3
2D 2
9H 9
dtype: int64
get_suit = lambda card: card[ - 1 ]
deck. groupby( get_suit) . apply ( draw, n= 2 )
C JC 10
3C 3
9C 9
QC 10
8C 8
D 6D 6
QD 10
7D 7
9D 9
4D 4
H 4H 4
6H 6
2H 2
10H 10
9H 9
S 10S 10
QS 10
8S 8
KS 10
5S 5
dtype: int64
deck. groupby( get_suit, group_keys= False ) . apply ( draw, n= 2 )
QC 10
AC 1
JC 10
KC 10
4C 4
JD 10
AD 1
6D 6
5D 5
7D 7
KH 10
2H 2
6H 6
5H 5
8H 8
2S 2
5S 5
QS 10
4S 4
JS 10
dtype: int64
10.3.5 示例:分组加权平均和相关性
df = pd. DataFrame( { 'category' : [ 'a' , 'a' , 'a' , 'a' , 'b' , 'b' , 'b' , 'b' ] ,
'data' : np. random. randn( 8 ) ,
'weights' : np. random. rand( 8 ) } )
df
category data weights 0 a -0.538213 0.890005 1 a 1.036831 0.119786 2 a -1.706140 0.086685 3 a -1.687288 0.418367 4 b 0.574443 0.372295 5 b -1.791063 0.994798 6 b 1.419661 0.715527 7 b 0.674622 0.338912
grouped = df. groupby( 'category' )
get_wavg = lambda g : np. average( g[ 'data' ] , weights= g[ 'weights' ] )
grouped. apply ( get_wavg)
category
a -0.797849
b -0.133568
dtype: float64
close_px = pd. read_csv( r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\examples\stock_px_2.csv' ,
parse_dates= True ,
index_col= 0 )
close_px. head( )
AAPL MSFT XOM SPX 2003-01-02 7.40 21.11 29.22 909.03 2003-01-03 7.45 21.14 29.24 908.59 2003-01-06 7.45 21.52 29.96 929.01 2003-01-07 7.43 21.93 28.95 922.93 2003-01-08 7.28 21.31 28.83 909.93
close_px. info( )
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AAPL 2214 non-null float64
1 MSFT 2214 non-null float64
2 XOM 2214 non-null float64
3 SPX 2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px[ - 4 : ]
AAPL MSFT XOM SPX 2011-10-11 400.29 27.00 76.27 1195.54 2011-10-12 402.19 26.96 77.16 1207.25 2011-10-13 408.43 27.18 76.37 1203.66 2011-10-14 422.00 27.27 78.11 1224.58
spx_corr = lambda x : x. corrwith( x[ 'SPX' ] )
rets = close_px. pct_change( ) . dropna( )
get_year = lambda x : x. year
by_year = rets. groupby( get_year)
by_year. apply ( spx_corr)
AAPL MSFT XOM SPX 2003 0.541124 0.745174 0.661265 1.0 2004 0.374283 0.588531 0.557742 1.0 2005 0.467540 0.562374 0.631010 1.0 2006 0.428267 0.406126 0.518514 1.0 2007 0.508118 0.658770 0.786264 1.0 2008 0.681434 0.804626 0.828303 1.0 2009 0.707103 0.654902 0.797921 1.0 2010 0.710105 0.730118 0.839057 1.0 2011 0.691931 0.800996 0.859975 1.0
by_year. apply ( lambda g: g[ 'AAPL' ] . corr( g[ 'MSFT' ] ) )
2003 0.480868
2004 0.259024
2005 0.300093
2006 0.161735
2007 0.417738
2008 0.611901
2009 0.432738
2010 0.571946
2011 0.581987
dtype: float64
10.3.6 示例:逐组线性回归
import statsmodels. api as sm
def regress ( data, yvar, xvars) :
Y = data[ yvar]
X = data[ xvars]
X[ 'intercept' ] = 1 .
result = sm. OLS( Y, X) . fit( )
return result. params
by_year. apply ( regress, 'AAPL' , [ 'SPX' ] )
SPX intercept 2003 1.195406 0.000710 2004 1.363463 0.004201 2005 1.766415 0.003246 2006 1.645496 0.000080 2007 1.198761 0.003438 2008 0.968016 -0.001110 2009 0.879103 0.002954 2010 1.052608 0.001261 2011 0.806605 0.001514
10.4 数据透视表与交叉表
Python中的pandas透视表是通过本章所介绍的groupby工具以及使用分层索引的重塑操作实现的。 DataFrame拥有一个pivot_table方法,并且还有一个顶层的pandas.pivot_table函数。 除了为groupby提供一个方便接口,pivot_table还可以添加部分总计,也称作边距。 pivot_table选项
选项名 描述 values 需要聚合的列名:默认情况下聚合所有数值型的列 index 在结果透视表的行.上进行分组的列名或其他分组键
tips. pivot_table( index= [ 'day' , 'smoker' ] )
size tip tip_pct total_bill day smoker Fri No 2.250000 2.812500 0.179740 18.420000 Yes 2.066667 2.714000 0.216293 16.813333 Sat No 2.555556 3.102889 0.190412 19.661778 Yes 2.476190 2.875476 0.179833 21.276667 Sun No 2.929825 3.167895 0.193617 20.506667 Yes 2.578947 3.516842 0.322021 24.120000 Thur No 2.488889 2.673778 0.193424 17.113111 Yes 2.352941 3.030000 0.198508 19.190588
tips. pivot_table( [ 'tip_pct' , 'size' ] , index= [ 'time' , 'day' ] , columns= 'smoker' )
size tip_pct smoker No Yes No Yes time day Dinner Fri 2.000000 2.222222 0.162612 0.202545 Sat 2.555556 2.476190 0.190412 0.179833 Sun 2.929825 2.578947 0.193617 0.322021 Thur 2.000000 NaN 0.190114 NaN Lunch Fri 3.000000 1.833333 0.231125 0.236915 Thur 2.500000 2.352941 0.193499 0.198508
tips. pivot_table( [ 'tip_pct' , 'size' ] , index= [ 'time' , 'day' ] , columns= 'smoker' , margins= True )
size tip_pct smoker No Yes All No Yes All time day Dinner Fri 2.000000 2.222222 2.166667 0.162612 0.202545 0.192562 Sat 2.555556 2.476190 2.517241 0.190412 0.179833 0.185305 Sun 2.929825 2.578947 2.842105 0.193617 0.322021 0.225718 Thur 2.000000 NaN 2.000000 0.190114 NaN 0.190114 Lunch Fri 3.000000 1.833333 2.000000 0.231125 0.236915 0.236088 Thur 2.500000 2.352941 2.459016 0.193499 0.198508 0.194895 All 2.668874 2.408602 2.569672 0.192237 0.218176 0.202123
tips. pivot_table( 'tip_pct' , index= [ 'time' , 'smoker' ] , aggfunc= len , margins= True )
tip_pct time smoker Dinner No 106.0 Yes 70.0 Lunch No 45.0 Yes 23.0 All 244.0
tips. pivot_table( 'tip_pct' , index= [ 'time' , 'size' , 'smoker' ] , aggfunc= 'mean' , fill_value= 0 )
tip_pct time size smoker Dinner 1 No 0.160000 Yes 0.483092 2 No 0.200221 Yes 0.239672 3 No 0.183436 Yes 0.179271 4 No 0.176033 Yes 0.169406 5 No 0.263344 Yes 0.094779 6 No 0.115821 Lunch 1 No 0.222087 Yes 0.288288 2 No 0.201503 Yes 0.201043 3 No 0.138483 Yes 0.257941 4 No 0.161573 Yes 0.186592 5 No 0.138160 6 No 0.211191
10.4.1 交叉表:crosstab
交叉表(简写为crosstab)是数据透视表的一个特殊情况,计算的是分组中的频率
10.5 本章小结
data = pd. DataFrame( { 'Sample' : range ( 1 , 11 ) ,
'nationality' : [ 'USA' , 'Japan' , 'USA' , 'Japan' , 'Japan' , 'Japan' , 'USA' , 'USA' , 'Japan' , 'USA' ] ,
'handedness' : [ 'right' , 'left' , 'right' , 'right' , 'left' , 'right' , 'right' , 'left' , 'right' , 'right' ] } )
data
Sample nationality handedness 0 1 USA right 1 2 Japan left 2 3 USA right 3 4 Japan right 4 5 Japan left 5 6 Japan right 6 7 USA right 7 8 USA left 8 9 Japan right 9 10 USA right
data. pivot_table( index = 'nationality' , columns= 'handedness' , margins= True , aggfunc= 'count' )
Sample handedness left right All nationality Japan 2 3 5 USA 1 4 5 All 3 7 10
pd. crosstab( data. nationality, data. handedness, margins= True )
handedness left right All nationality Japan 2 3 5 USA 1 4 5 All 3 7 10
pd. crosstab( [ tips. time, tips. day] , tips. smoker, margins= True )
smoker No Yes All time day Dinner Fri 3 9 12 Sat 45 42 87 Sun 57 19 76 Thur 1 0 1 Lunch Fri 1 6 7 Thur 44 17 61 All 151 93 244
第十一章时间序列
11.1 日期和时间数据的类型及工具
Python标准库包含了日期和时间数据的类型,也包括日历相关的功能。 datetime、time和calendar模块是开始处理时间数据的主要内容。 datetime.datetime类型,或简写为datetime,是广泛使用的 datetime模块中的类型
类型 描述 date 使用公历日历存储日历日期(年, 月,日) time 将时间存储为小时,分钟,秒和微秒 datetime 存储日期和时间 timedelta 表示两个datetime值之间的差(如日,秒和微秒) tzinfo 用于存储时区信息的基本类型
from datetime import datetime
now = datetime. now( )
now
datetime.datetime(2021, 12, 28, 14, 25, 24, 8294)
now. year, now. month, now. day
(2021, 12, 28)
delta = datetime( 2011 , 1 , 7 ) - datetime( 2008 , 6 , 24 , 8 , 15 )
delta
datetime.timedelta(days=926, seconds=56700)
delta. days, delta. seconds
(926, 56700)
from datetime import timedelta
start = datetime( 2011 , 1 , 7 )
start+ timedelta( 12 )
datetime.datetime(2011, 1, 19, 0, 0)
start- 2 * timedelta( 12 )
datetime.datetime(2010, 12, 14, 0, 0)
11.1.1 字符串与datetime互相转换
可以使用str方法或传递一个指定的格式给strftime方法来对datetime对象和pandas的Timestamp对象进行格式化 datetime格式说明(兼容ISO C89)
类型 描述 %Y 四位的年份 %y 两位的年份 %m 两位的月份[01, 12] %d 两位的日期号[01,31] %H 小时,24小时制[00, 23] %I 小时,12小时制[01, 12] %M 两位的分钟[00, 59] %S 秒[00,61] (60、61 是闰秒) %w 星期日期[0 (星期天),6] %U 一年中的星期数[00, 53]。以星期天为每周的第一天,-年中第一个星期天前的日期作为"第0周" %W 一年中的星期數[00, 53]。以星期一为每周的第一-天,-年中第一个星期一前的日期作为"第0周" %z 格式为+HHMM或-HHMM的UTC时区偏移;如果没有时区则为空 %F %Y-%m-%d的简写(例如,2012-4-18) %D %m/%d/%y的简写(例如,04/18/12)
stamp = datetime( 2011 , 1 , 3 )
stamp
datetime.datetime(2011, 1, 3, 0, 0)
str ( stamp)
'2011-01-03 00:00:00'
stamp. strftime( '%Y-%m-%d' )
'2011-01-03'
value = '2011-01-03'
datetime. strptime( value, '%Y-%m-%d' )
datetime.datetime(2011, 1, 3, 0, 0)
datestrs = [ '7/6/2011' , '8/6/2011' ]
[ datetime. strptime( x, '%m/%d/%Y' ) for x in datestrs]
[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
from dateutil. parser import parse
parse( '2011-01-03' )
datetime.datetime(2011, 1, 3, 0, 0)
parse( 'jan 31,1997 10:45 am' )
datetime.datetime(2021, 1, 31, 10, 45)
parse( '6/12/2011' , dayfirst= True )
datetime.datetime(2011, 12, 6, 0, 0)
datestrs = [ '2011-07-06 12:00:00' , '2011-08-06 00:00:00' ]
pd. to_datetime( datestrs)
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
idx = pd. to_datetime( datestrs+ [ None ] )
idx
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
idx[ 2 ]
NaT
pd. isnull( idx)
array([False, False, True])
类型 描述 %a 缩写的工作日名称 %A 全写的工作日名称 %b 简写的月份名称 %B 全写的月份名称 %c 完整的日期和时间(例如,‘ Tue 01 May 2012 04:20:57 PM’) %p AM或PM的地区等效 %x 适合地区的格式化日期(例如,在美国,May 1, 2012会生成’05 101/2012) %X 适合地区的时间(例如,’ 04:24:12 PM’ )
11.2 时间序列基础
pandas中的基础时间序列种类是由时间戳索引的Series,在pandas外部则通常表示为Python字符串或datetime对象
from datetime import datetime
dates = [ datetime( 2011 , 1 , 2 ) , datetime( 2011 , 1 , 5 ) , datetime( 2011 , 1 , 7 ) ,
datetime( 2011 , 1 , 8 ) , datetime( 2011 , 1 , 10 ) , datetime( 2011 , 1 , 12 ) ]
dates
[datetime.datetime(2011, 1, 2, 0, 0),
datetime.datetime(2011, 1, 5, 0, 0),
datetime.datetime(2011, 1, 7, 0, 0),
datetime.datetime(2011, 1, 8, 0, 0),
datetime.datetime(2011, 1, 10, 0, 0),
datetime.datetime(2011, 1, 12, 0, 0)]
ts = pd. Series( np. random. randn( 6 ) , index= dates)
ts
2011-01-02 -0.388904
2011-01-05 1.029932
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
2011-01-12 -0.053242
dtype: float64
ts. index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
'2011-01-10', '2011-01-12'],
dtype='datetime64[ns]', freq=None)
ts[ : : 2 ] + ts
2011-01-02 -0.777807
2011-01-05 NaN
2011-01-07 -0.373041
2011-01-08 NaN
2011-01-10 1.420241
2011-01-12 NaN
dtype: float64
ts. index. dtype
dtype('<M8[ns]')
stamp = ts. index[ 0 ]
stamp
Timestamp('2011-01-02 00:00:00')
11.2.1 索引、选择、子集
当你基于标签进行索引和选择时,时间序列的行为和其他的pandas.Series类似
stamp = ts. index[ 2 ]
ts[ stamp]
-0.18652074734702304
ts[ '1/10/2011' ]
0.7101205694131453
ts[ '20110110' ]
0.7101205694131453
longer_ts = pd. Series( np. random. randn( 1000 ) , index= pd. date_range( '1/1/2000' , periods= 1000 ) )
longer_ts
2000-01-01 1.399510
2000-01-02 1.167288
2000-01-03 -1.557553
2000-01-04 0.905812
2000-01-05 -1.204531
...
2002-09-22 -0.055722
2002-09-23 0.913953
2002-09-24 3.299072
2002-09-25 -1.384872
2002-09-26 0.034663
Freq: D, Length: 1000, dtype: float64
longer_ts[ '2001' ]
2001-01-01 -0.098773
2001-01-02 0.299508
2001-01-03 -0.111169
2001-01-04 0.545366
2001-01-05 0.559263
...
2001-12-27 0.383230
2001-12-28 0.274157
2001-12-29 -1.571527
2001-12-30 -0.066029
2001-12-31 0.860201
Freq: D, Length: 365, dtype: float64
longer_ts[ '2001-05' ]
2001-05-01 0.952182
2001-05-02 -1.637248
2001-05-03 -1.673597
2001-05-04 -1.257016
2001-05-05 -0.992064
2001-05-06 0.874060
2001-05-07 -0.491860
2001-05-08 0.479921
2001-05-09 0.286766
2001-05-10 2.026310
2001-05-11 -0.547724
2001-05-12 1.313638
2001-05-13 -0.624659
2001-05-14 0.003374
2001-05-15 -0.097401
2001-05-16 -0.733808
2001-05-17 -0.557165
2001-05-18 1.592470
2001-05-19 -1.030133
2001-05-20 1.789641
2001-05-21 0.821066
2001-05-22 0.954955
2001-05-23 0.537921
2001-05-24 -0.016609
2001-05-25 1.806582
2001-05-26 0.323395
2001-05-27 -0.144737
2001-05-28 0.783111
2001-05-29 0.673603
2001-05-30 -1.775568
2001-05-31 -0.553773
Freq: D, dtype: float64
ts[ datetime( 2011 , 1 , 7 ) : ]
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
2011-01-12 -0.053242
dtype: float64
ts
2011-01-02 -0.388904
2011-01-05 1.029932
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
2011-01-12 -0.053242
dtype: float64
ts[ '1/6/2011' : '1/11/2011' ]
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
dtype: float64
ts. truncate( after= '1/9/2011' )
2011-01-02 -0.388904
2011-01-05 1.029932
2011-01-07 -0.186521
2011-01-08 -0.310343
dtype: float64
dates = pd. date_range( '1/1/2000' , periods= 100 , freq= 'W-WED' )
dates[ : 5 ]
DatetimeIndex(['2000-01-05', '2000-01-12', '2000-01-19', '2000-01-26',
'2000-02-02'],
dtype='datetime64[ns]', freq='W-WED')
long_df = pd. DataFrame( np. random. randn( 100 , 4 ) , index= dates, columns = [ 'colorado' , 'texas' , 'new york' , 'ohio' ] )
long_df
colorado texas new york ohio 2000-01-05 -0.401749 -0.717368 -1.281874 1.274557 2000-01-12 1.038078 0.915171 -0.080030 -0.280146 2000-01-19 -1.080790 0.255669 -0.987855 -0.662729 2000-01-26 -0.360513 1.133701 0.678780 0.421880 2000-02-02 -0.636087 1.338383 -0.247395 1.089950 ... ... ... ... ... 2001-10-31 1.464741 -0.390381 2.867647 -0.299401 2001-11-07 -0.341461 -0.358244 -0.766587 0.728394 2001-11-14 -1.010205 -0.055786 -0.574341 -0.652109 2001-11-21 -0.130517 0.929190 0.102579 -0.677333 2001-11-28 -0.349123 1.238194 -0.058024 -3.170347
100 rows × 4 columns
long_df. loc[ '5-2001' ]
colorado texas new york ohio 2001-05-02 0.054055 -0.491146 0.241249 0.310428 2001-05-09 -1.556219 -0.716196 0.514238 1.694993 2001-05-16 -1.382658 0.459362 0.732855 0.644502 2001-05-23 -0.342885 0.544961 -1.133561 1.351170 2001-05-30 -2.051397 0.595807 0.536894 0.584369
11.2.2 含有重复索引的时间序列
dates = pd. DatetimeIndex( [ '1/1/2000' , '1/2/2000' , '1/2/2000' , '1/2/2000' , '1/3/2000' ] )
dates
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
'2000-01-03'],
dtype='datetime64[ns]', freq=None)
dup_ts = pd. Series( np. arange( 5 ) , index= dates)
dup_ts
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int32
dup_ts. index. is_unique
False
dup_ts[ '1/3/2000' ]
4
dup_ts[ '1/2/2000' ]
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int32
grouped = dup_ts. groupby( level= 0 )
grouped. mean( )
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int32
dup_ts. groupby( level= - 1 ) . mean( )
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int32
grouped. count( )
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
11.3 日期范围、频率和移位
pandas拥有一整套标准的时间序列频率和工具用于重新采样、推断频率以及生成固定频率的数据范围。
ts
2011-01-02 -0.388904
2011-01-05 1.029932
2011-01-07 -0.186521
2011-01-08 -0.310343
2011-01-10 0.710121
2011-01-12 -0.053242
dtype: float64
resampler = ts. resample( 'D' )
11.3.1 生成日期范围
pandas.date_range是用于根据特定频率生成指定长度的DatetimeIndex
index = pd. date_range( '2012-04-01' , '2012-06-01' )
index
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
'2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
'2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
'2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
'2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
'2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
'2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
'2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
'2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
'2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
'2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
'2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
pd. date_range( start= '2012-04-01' , periods= 200 )
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10',
...
'2012-10-08', '2012-10-09', '2012-10-10', '2012-10-11',
'2012-10-12', '2012-10-13', '2012-10-14', '2012-10-15',
'2012-10-16', '2012-10-17'],
dtype='datetime64[ns]', length=200, freq='D')
pd. date_range( '2000-01-01' , '2000-12-01' , freq= 'BM' )
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
'2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
'2000-09-29', '2000-10-31', '2000-11-30'],
dtype='datetime64[ns]', freq='BM')
别名 偏置类型 描述 D Day 日历日的每天 B BusinessDay 工作日的每天 H Hour 每小时 T或min Minute 每分钟 S Second 每秒 L或ms Milli 每毫秒(1/1,000 秒) U Micro 每微秒(1/1,00,000秒) M MonthEnd 日历日的月底日期 BM Busines sMonthEnd 工作日的月底日期 MS MonthBegin 日历日的月初日期 BMS BusinessMonthBegin 工作日的月初日期 W-MON, W-TUE, … Week 按照给定的星期日期按每周取日期(MON, TUE, WED, THU,FRI, SAT或SUN) WOM- 1MON , WOM- 2MON,… Wee kofMonth 在本月的第一,二,三或四周创建按周分隔的日期(例如WOM-3FRI代表每月的第三个星期五) Q-JAN,Q-FEB, … QuarterEnd 每个月最后一个日历日的季度日期,以表示月份结束的年份(JAN, FEB,MAR, APR, MAY,JUN, JUL, AUG, SEP, OCT, NOV或DEC) BQ-JAN,BQ-FEB,… BusinessQuarterEnd 每个月最后一个工作日对应的季度日期,以表示月份结束的年份 QS-JAN, QS-FEB, … QuarterBegin 每个月第一个日历日对应的季度日期,以表示月份结束的年份 BQS-JAN,BQS-FEB, … YearEnd 给定月份所在的月的最后一个日历日所对应的年度日期 A-JAN,A-FEB, … BusinessYearEnd 给定月份所在的月的最后一个.工作日所对应的年度日期(JAN,FEB, MAR, APR, MAY, JUN,JUL,AUG, SEP, OCT,NOV或DEC) BA-JAN,BA-FEB, … YearBegin 给定月份所在的月的第一个日历日所对应的年度日期 AS-JAN,AS-FEB, … BusinessYearBegin 给定月份所在的月的第一个工作日所对应的年度日期 BAS-JAN, BAS-FEB, … usinessYearBegin 给定月份所在月的第一个工作日所对应的年度日期
pd. date_range( '2012-05-02 12:56:31' , periods= 5 )
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
'2012-05-04 12:56:31', '2012-05-05 12:56:31',
'2012-05-06 12:56:31'],
dtype='datetime64[ns]', freq='D')
pd. date_range( '2012-05-02 12:56:31' , periods= 5 , normalize= True )
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
'2012-05-06'],
dtype='datetime64[ns]', freq='D')
11.3.2 频率和日期偏置
pandas中的频率是由基础频率和倍数组成的。 基础频率通常会有字符串别名,例如’M’代表每月,'H’代表每小时。对于每个基础频率,都有一个对象可以被用于定义日期偏置。
from pandas. tseries. offsets import Hour, Minute
hour = Hour( )
hour
<Hour>
four_hours = Hour( 4 )
four_hours
<4 * Hours>
pd. date_range( '2000-01-01' , '2000-01-03 23:59' , freq= '4h' )
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
'2000-01-01 08:00:00', '2000-01-01 12:00:00',
'2000-01-01 16:00:00', '2000-01-01 20:00:00',
'2000-01-02 00:00:00', '2000-01-02 04:00:00',
'2000-01-02 08:00:00', '2000-01-02 12:00:00',
'2000-01-02 16:00:00', '2000-01-02 20:00:00',
'2000-01-03 00:00:00', '2000-01-03 04:00:00',
'2000-01-03 08:00:00', '2000-01-03 12:00:00',
'2000-01-03 16:00:00', '2000-01-03 20:00:00'],
dtype='datetime64[ns]', freq='4H')
Hour( 2 ) + Minute( 30 )
<150 * Minutes>
pd. date_range( '2000-01-01' , '2000-01-03 23:59' , freq= '1h30min' )
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
'2000-01-01 03:00:00', '2000-01-01 04:30:00',
'2000-01-01 06:00:00', '2000-01-01 07:30:00',
'2000-01-01 09:00:00', '2000-01-01 10:30:00',
'2000-01-01 12:00:00', '2000-01-01 13:30:00',
'2000-01-01 15:00:00', '2000-01-01 16:30:00',
'2000-01-01 18:00:00', '2000-01-01 19:30:00',
'2000-01-01 21:00:00', '2000-01-01 22:30:00',
'2000-01-02 00:00:00', '2000-01-02 01:30:00',
'2000-01-02 03:00:00', '2000-01-02 04:30:00',
'2000-01-02 06:00:00', '2000-01-02 07:30:00',
'2000-01-02 09:00:00', '2000-01-02 10:30:00',
'2000-01-02 12:00:00', '2000-01-02 13:30:00',
'2000-01-02 15:00:00', '2000-01-02 16:30:00',
'2000-01-02 18:00:00', '2000-01-02 19:30:00',
'2000-01-02 21:00:00', '2000-01-02 22:30:00',
'2000-01-03 00:00:00', '2000-01-03 01:30:00',
'2000-01-03 03:00:00', '2000-01-03 04:30:00',
'2000-01-03 06:00:00', '2000-01-03 07:30:00',
'2000-01-03 09:00:00', '2000-01-03 10:30:00',
'2000-01-03 12:00:00', '2000-01-03 13:30:00',
'2000-01-03 15:00:00', '2000-01-03 16:30:00',
'2000-01-03 18:00:00', '2000-01-03 19:30:00',
'2000-01-03 21:00:00', '2000-01-03 22:30:00'],
dtype='datetime64[ns]', freq='90T')
11.3.2.1 月中某星期的日期
rng = pd. date_range( '2012-01-01' , '2012-09-01' , freq= 'WOM-3FRI' )
rng
DatetimeIndex(['2012-01-20', '2012-02-17', '2012-03-16', '2012-04-20',
'2012-05-18', '2012-06-15', '2012-07-20', '2012-08-17'],
dtype='datetime64[ns]', freq='WOM-3FRI')
11.3.3 移位(前向和后向)日期
"移位"是指将日期按时间向前移动或向后移动。 Series和DataFrame都有一个shift方法用于进行简单的前向或后向移位,而不改变索引
ts = pd. Series( np. random. randn( 4 ) , index= pd. date_range( '1/1/2000' , freq= 'M' , periods= 4 ) )
ts
2000-01-31 -0.209127
2000-02-29 0.998652
2000-03-31 0.006782
2000-04-30 0.558995
Freq: M, dtype: float64
ts. shift( 2 )
2000-01-31 NaN
2000-02-29 NaN
2000-03-31 -0.209127
2000-04-30 0.998652
Freq: M, dtype: float64
ts. shift( - 2 )
2000-01-31 0.006782
2000-02-29 0.558995
2000-03-31 NaN
2000-04-30 NaN
Freq: M, dtype: float64
ts/ ts. shift( 1 ) - 1
2000-01-31 NaN
2000-02-29 -5.775325
2000-03-31 -0.993209
2000-04-30 81.428620
Freq: M, dtype: float64
ts. shift( 2 , freq= 'M' )
2000-03-31 -0.209127
2000-04-30 0.998652
2000-05-31 0.006782
2000-06-30 0.558995
Freq: M, dtype: float64
ts. shift( 3 , freq= 'D' )
2000-02-03 -0.209127
2000-03-03 0.998652
2000-04-03 0.006782
2000-05-03 0.558995
dtype: float64
ts. shift( 1 , freq= '90T' )
2000-01-31 01:30:00 -0.209127
2000-02-29 01:30:00 0.998652
2000-03-31 01:30:00 0.006782
2000-04-30 01:30:00 0.558995
dtype: float64
11.3.3.1 使用偏置进行移位日期
pandas日期偏置也可以使用datetime或Timestamp对象完成
from pandas. tseries. offsets import Day, MonthEnd
now = datetime( 2011 , 11 , 17 )
now+ 3 * Day( )
Timestamp('2011-11-20 00:00:00')
now+ MonthEnd( )
Timestamp('2011-11-30 00:00:00')
now+ MonthEnd( 2 )
Timestamp('2011-12-31 00:00:00')
offset = MonthEnd( )
offset. rollforward( now)
Timestamp('2011-11-30 00:00:00')
offset. rollback( now)
Timestamp('2011-10-31 00:00:00')
ts = pd. Series( np. random. randn( 20 ) , index= pd. date_range( '1/15/2000' , periods= 20 , freq= '4d' ) )
ts
2000-01-15 0.165827
2000-01-19 -0.564001
2000-01-23 -1.297098
2000-01-27 0.461758
2000-01-31 0.104934
2000-02-04 -0.197272
2000-02-08 -0.238585
2000-02-12 1.822792
2000-02-16 1.053136
2000-02-20 0.114542
2000-02-24 -0.719873
2000-02-28 0.938727
2000-03-03 0.250604
2000-03-07 1.692985
2000-03-11 0.249253
2000-03-15 -2.015027
2000-03-19 1.274159
2000-03-23 0.593201
2000-03-27 -0.340764
2000-03-31 -0.806581
Freq: 4D, dtype: float64
ts. resample( 'M' ) . mean( )
2000-01-31 -0.225716
2000-02-29 0.396209
2000-03-31 0.112229
Freq: M, dtype: float64
11.4 时区处理
在Python语言中,时区信息来源于第三方库pytz(可以使用pip或conda安装),其中公开了Olson数据库,这是世界时区信息的汇编。
import pytz
pytz. common_timezones[ - 5 : ]
['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']
tz = pytz. timezone( 'America/New_York' )
tz
<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>
11.4.1 时区的本地化和转换
rng = pd. date_range( '3/9/2012 9:30' , periods= 6 , freq= 'D' )
rng
DatetimeIndex(['2012-03-09 09:30:00', '2012-03-10 09:30:00',
'2012-03-11 09:30:00', '2012-03-12 09:30:00',
'2012-03-13 09:30:00', '2012-03-14 09:30:00'],
dtype='datetime64[ns]', freq='D')
ts = pd. Series( np. random. randn( len ( rng) ) , index= rng)
ts
2012-03-09 09:30:00 -0.804325
2012-03-10 09:30:00 0.060626
2012-03-11 09:30:00 -1.531345
2012-03-12 09:30:00 0.133690
2012-03-13 09:30:00 0.173941
2012-03-14 09:30:00 -1.416870
Freq: D, dtype: float64
print ( ts. index. tz)
None
pd. date_range( '3/9/2012 9:30' , periods= 10 , freq= 'D' , tz= 'UTC' )
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
'2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
'2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
'2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
ts_utc = ts. tz_localize( 'UTC' )
ts_utc
2012-03-09 09:30:00+00:00 -0.804325
2012-03-10 09:30:00+00:00 0.060626
2012-03-11 09:30:00+00:00 -1.531345
2012-03-12 09:30:00+00:00 0.133690
2012-03-13 09:30:00+00:00 0.173941
2012-03-14 09:30:00+00:00 -1.416870
Freq: D, dtype: float64
ts_utc. index
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
'2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
ts_utc. tz_convert( 'America/New_York' )
2012-03-09 04:30:00-05:00 -0.804325
2012-03-10 04:30:00-05:00 0.060626
2012-03-11 05:30:00-04:00 -1.531345
2012-03-12 05:30:00-04:00 0.133690
2012-03-13 05:30:00-04:00 0.173941
2012-03-14 05:30:00-04:00 -1.416870
Freq: D, dtype: float64
ts_eastern = ts. tz_localize( 'America/New_York' )
ts_eastern. tz_convert( 'UTC' )
2012-03-09 14:30:00+00:00 -0.804325
2012-03-10 14:30:00+00:00 0.060626
2012-03-11 13:30:00+00:00 -1.531345
2012-03-12 13:30:00+00:00 0.133690
2012-03-13 13:30:00+00:00 0.173941
2012-03-14 13:30:00+00:00 -1.416870
dtype: float64
ts_eastern. tz_convert( 'Europe/Berlin' )
2012-03-09 15:30:00+01:00 -0.804325
2012-03-10 15:30:00+01:00 0.060626
2012-03-11 14:30:00+01:00 -1.531345
2012-03-12 14:30:00+01:00 0.133690
2012-03-13 14:30:00+01:00 0.173941
2012-03-14 14:30:00+01:00 -1.416870
dtype: float64
ts. index. tz_localize( 'Asia/Shanghai' )
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
'2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
'2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
dtype='datetime64[ns, Asia/Shanghai]', freq=None)
11.4.2 时区感知时间戳对象的操作
stamp = pd. Timestamp( '2011-03-12 04:00' )
stamp_utc = stamp. tz_localize( 'utc' )
stamp_utc. tz_convert( 'America/New_York' )
Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')
stamp_moscow = pd. Timestamp( '2011-03-12 04:00' , tz= 'Europe/Moscow' )
stamp_moscow
Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')
stamp_utc. value
1299902400000000000
stamp_utc. tz_convert( 'America/New_York' ) . value
1299902400000000000
from pandas. tseries. offsets import Hour
stamp = pd. Timestamp( '2012-03-12 01:30' , tz= 'US/Eastern' )
stamp
Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')
stamp+ Hour( )
Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')
stamp = pd. Timestamp( '2012-11-04 00:30' , tz= 'US/Eastern' )
stamp
Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
stamp+ 2 * Hour( )
Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')
11.4.3 不同时区间的操作
rng = pd. date_range( '3/7/2012 9:30' , periods= 10 , freq= 'B' )
ts = pd. Series( np. random. randn( len ( rng) ) , index= rng)
ts
2012-03-07 09:30:00 0.386445
2012-03-08 09:30:00 -0.300690
2012-03-09 09:30:00 0.759709
2012-03-12 09:30:00 1.257558
2012-03-13 09:30:00 -0.505203
2012-03-14 09:30:00 -0.002910
2012-03-15 09:30:00 -0.196304
2012-03-16 09:30:00 0.440398
2012-03-19 09:30:00 0.218168
2012-03-20 09:30:00 1.265953
Freq: B, dtype: float64
ts1 = ts[ : 7 ] . tz_localize( 'Europe/London' )
ts1
2012-03-07 09:30:00+00:00 0.386445
2012-03-08 09:30:00+00:00 -0.300690
2012-03-09 09:30:00+00:00 0.759709
2012-03-12 09:30:00+00:00 1.257558
2012-03-13 09:30:00+00:00 -0.505203
2012-03-14 09:30:00+00:00 -0.002910
2012-03-15 09:30:00+00:00 -0.196304
dtype: float64
ts2 = ts[ : 7 ] . tz_localize( 'Europe/Moscow' )
ts2
2012-03-07 09:30:00+04:00 0.386445
2012-03-08 09:30:00+04:00 -0.300690
2012-03-09 09:30:00+04:00 0.759709
2012-03-12 09:30:00+04:00 1.257558
2012-03-13 09:30:00+04:00 -0.505203
2012-03-14 09:30:00+04:00 -0.002910
2012-03-15 09:30:00+04:00 -0.196304
dtype: float64
result = ts1 + ts2
result. index
DatetimeIndex(['2012-03-07 05:30:00+00:00', '2012-03-07 09:30:00+00:00',
'2012-03-08 05:30:00+00:00', '2012-03-08 09:30:00+00:00',
'2012-03-09 05:30:00+00:00', '2012-03-09 09:30:00+00:00',
'2012-03-12 05:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 05:30:00+00:00', '2012-03-13 09:30:00+00:00',
'2012-03-14 05:30:00+00:00', '2012-03-14 09:30:00+00:00',
'2012-03-15 05:30:00+00:00', '2012-03-15 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
11.5 时间区间和区间算术
p = pd. Period( 2007 , freq= 'A-DEC' )
p
Period('2007', 'A-DEC')
p+ 5
Period('2012', 'A-DEC')
p- 2
Period('2005', 'A-DEC')
pd. Period( 2014 , freq= 'A-DEC' ) - p
<7 * YearEnds: month=12>
rng = pd. period_range( '2000-01-01' , '2000-06-30' , freq= 'M' )
rng
PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')
pd. Series( np. random. randn( 6 ) , index= rng)
2000-01 -1.474403
2000-02 0.146328
2000-03 1.247877
2000-04 0.797235
2000-05 -1.583312
2000-06 1.525540
Freq: M, dtype: float64
values = [ '2001Q3' , '2002Q2' , '2003Q1' ]
index = pd. PeriodIndex( values, freq= 'Q-DEC' )
index
PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
11.5.1 区间频率转换
使用asfreq可以将区间和PeriodIndex对象转换为其他的频率。
p = pd. Period( 2007 , freq= 'A-DEC' )
p
Period('2007', 'A-DEC')
p. asfreq( 'M' , how= 'start' )
Period('2007-01', 'M')
p. asfreq( 'M' , how= 'end' )
Period('2007-12', 'M')
p = pd. Period( 2007 , freq= 'A-JUN' )
p
Period('2007', 'A-JUN')
p. asfreq( 'M' , how= 'start' )
Period('2006-07', 'M')
p. asfreq( 'M' , how= 'end' )
Period('2007-06', 'M')
p = pd. Period( 'Aug-2007' , 'M' )
p
Period('2007-08', 'M')
p. asfreq( 'A-JUN' )
Period('2008', 'A-JUN')
rng = pd. period_range( '2006' , '2009' , freq= 'A-DEC' )
rng
PeriodIndex(['2006', '2007', '2008', '2009'], dtype='period[A-DEC]', freq='A-DEC')
ts = pd. Series( np. random. randn( len ( rng) ) , index= rng)
ts
2006 -0.784284
2007 -1.058489
2008 -0.151931
2009 0.443975
Freq: A-DEC, dtype: float64
ts. asfreq( 'M' , how= 'start' )
2006-01 -0.784284
2007-01 -1.058489
2008-01 -0.151931
2009-01 0.443975
Freq: M, dtype: float64
ts. asfreq( 'B' , how= 'end' )
2006-12-29 -0.784284
2007-12-31 -1.058489
2008-12-31 -0.151931
2009-12-31 0.443975
Freq: B, dtype: float64
11.5.2 季度区间频率
p = pd. Period( 2007 , freq= 'Q-JAN' )
p
Period('2007Q4', 'Q-JAN')
p. asfreq( 'D' , 'start' )
Period('2006-11-01', 'D')
p. asfreq( 'D' , 'end' )
Period('2007-01-31', 'D')
p4pm = ( p. asfreq( 'B' , 'e' ) - 1 ) . asfreq( 'T' , 'S' ) + 16 * 60
p4pm
Period('2007-01-30 16:00', 'T')
p4pm. to_timestamp( )
Timestamp('2007-01-30 16:00:00')
rng = pd. period_range( '2011Q3' , '2012Q4' , freq= 'Q-JAN' )
ts = pd. Series( np. arange( len ( rng) ) , index= rng)
ts
2011Q3 0
2011Q4 1
2012Q1 2
2012Q2 3
2012Q3 4
2012Q4 5
Freq: Q-JAN, dtype: int32
new_rng = ( rng. asfreq( 'B' , 'e' ) - 1 ) . asfreq( 'T' , 'S' ) + 16 * 60
ts. index = new_rng. to_timestamp( )
ts
2010-10-28 16:00:00 0
2011-01-28 16:00:00 1
2011-04-28 16:00:00 2
2011-07-28 16:00:00 3
2011-10-28 16:00:00 4
2012-01-30 16:00:00 5
dtype: int32
11.5.3 将时间戳转换为区间(以及逆转换)
rng = pd. date_range( '2000-01-01' , periods= 3 , freq= 'M' )
ts = pd. Series( np. random. randn( 3 ) , index= rng)
ts
2000-01-31 -0.622352
2000-02-29 -1.830166
2000-03-31 -0.601708
Freq: M, dtype: float64
pts = ts. to_period( )
pts
2000-01 -0.622352
2000-02 -1.830166
2000-03 -0.601708
Freq: M, dtype: float64
rng = pd. date_range( '1/29/2000' , periods= 6 , freq= 'D' )
rng
DatetimeIndex(['2000-01-29', '2000-01-30', '2000-01-31', '2000-02-01',
'2000-02-02', '2000-02-03'],
dtype='datetime64[ns]', freq='D')
ts2 = pd. Series( np. random. randn( 6 ) , index= rng)
ts2
2000-01-29 0.056749
2000-01-30 -0.310291
2000-01-31 1.369577
2000-02-01 -0.361215
2000-02-02 0.426119
2000-02-03 -0.010565
Freq: D, dtype: float64
ts2. to_period( 'M' )
2000-01 0.056749
2000-01 -0.310291
2000-01 1.369577
2000-02 -0.361215
2000-02 0.426119
2000-02 -0.010565
Freq: M, dtype: float64
pts = ts2. to_period( )
pts
2000-01-29 0.056749
2000-01-30 -0.310291
2000-01-31 1.369577
2000-02-01 -0.361215
2000-02-02 0.426119
2000-02-03 -0.010565
Freq: D, dtype: float64
pts. to_timestamp( how= 'end' )
2000-01-29 23:59:59.999999999 0.056749
2000-01-30 23:59:59.999999999 -0.310291
2000-01-31 23:59:59.999999999 1.369577
2000-02-01 23:59:59.999999999 -0.361215
2000-02-02 23:59:59.999999999 0.426119
2000-02-03 23:59:59.999999999 -0.010565
Freq: D, dtype: float64
11.5.4 从数组生成PeriodIndex
data = pd. read_csv( 'examples/macrodata.csv' )
data. head( 5 )
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint 0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00 1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74 2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09 3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06 4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
data. year
0 1959.0
1 1959.0
2 1959.0
3 1959.0
4 1960.0
...
198 2008.0
199 2008.0
200 2009.0
201 2009.0
202 2009.0
Name: year, Length: 203, dtype: float64
data. quarter
0 1.0
1 2.0
2 3.0
3 4.0
4 1.0
...
198 3.0
199 4.0
200 1.0
201 2.0
202 3.0
Name: quarter, Length: 203, dtype: float64
index = pd. PeriodIndex( year= data. year, quarter= data. quarter, freq= 'Q-DEC' )
index
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
'1960Q3', '1960Q4', '1961Q1', '1961Q2',
...
'2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
'2008Q4', '2009Q1', '2009Q2', '2009Q3'],
dtype='period[Q-DEC]', length=203, freq='Q-DEC')
data. index = index
data. head( )
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint 1959Q1 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00 1959Q2 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74 1959Q3 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09 1959Q4 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06 1960Q1 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
data. infl
1959Q1 0.00
1959Q2 2.34
1959Q3 2.74
1959Q4 0.27
1960Q1 2.31
...
2008Q3 -3.16
2008Q4 -8.79
2009Q1 0.94
2009Q2 3.37
2009Q3 3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64
11.6 重新采样与频率转换
重新采样是指将时间序列从一个频率转换为另一个频率的过程。 将更高频率的数据聚合到低频率被称为向下采样,而从低频率转换到高频率称为向上采样。 并不是所有的重新采样都属于上面说的两类 pandas对象都配有resample方法,该方法是所有频率转换的工具函数。 resample拥有类似于groupby的API;你调用resample对数据分组,之后再调用聚合函数
rng = pd. date_range( '2000-01-01' , periods= 100 , freq= 'D' )
ts = pd. Series( np. random. randn( len ( rng) ) , index= rng)
ts
2000-01-01 0.037518
2000-01-02 -0.067833
2000-01-03 -0.704365
2000-01-04 -0.368819
2000-01-05 -1.208229
...
2000-04-05 1.085903
2000-04-06 0.746400
2000-04-07 0.517810
2000-04-08 -0.363853
2000-04-09 0.057838
Freq: D, Length: 100, dtype: float64
ts. resample( 'M' ) . mean( )
2000-01-31 -0.108144
2000-02-29 -0.055158
2000-03-31 0.358646
2000-04-30 0.773431
Freq: M, dtype: float64
ts. resample( 'M' , kind= 'period' ) . mean( )
2000-01 -0.108144
2000-02 -0.055158
2000-03 0.358646
2000-04 0.773431
Freq: M, dtype: float64
参数 描述 freq 表明所需采样频率的字符串或DateOffset对象(例如,‘M’. ‘5min’ 或Second(1)) axis 需要采样的轴向;默认是axis=0 fill_method 向上采样时的插值方式,‘ffill’ 或’bf11’;默认是不插值的 closed 向下采样中,每段间隔的哪-段是封闭的(包含的),’ right’或’1eft’ label 向下采样中,如何用’right’或’left’的箱标签标记聚合结果(例如,9:30到9:35的五分钟间隔可以被标记为9:30或9:35) loffset 对箱标签进行时间调校,例如’-1s’ /Second (-1) 可以将聚合标签向前移动一秒 limit 在前向或后向填充时,填充区间的最大值 kind 对区间(’ period’)或时间戳(‘timestamp’) 的聚合:默认为时间序列索引的类型 convention 在对区间重新采样时,用于将低频周期转换为高频的约定(‘start’ 或’end’);默认为’end’
11.6.1 向下采样
在使用resample进行向下采样数据时有些事情需要考虑:
每段间隔的哪一边是闭合的 如何在间隔的起始或结束位置标记每个已聚合的箱体
rng = pd. date_range( '2000-01-01' , periods= 12 , freq= 'T' )
ts = pd. Series( np. random. randn( len ( rng) ) , index= rng)
ts
2000-01-01 00:00:00 0.579555
2000-01-01 00:01:00 0.210504
2000-01-01 00:02:00 0.135227
2000-01-01 00:03:00 -0.125649
2000-01-01 00:04:00 1.511186
2000-01-01 00:05:00 -1.384490
2000-01-01 00:06:00 -0.938106
2000-01-01 00:07:00 2.057013
2000-01-01 00:08:00 -0.737751
2000-01-01 00:09:00 1.469319
2000-01-01 00:10:00 -0.257609
2000-01-01 00:11:00 -0.975761
Freq: T, dtype: float64
ts. resample( '5min' , closed= 'right' , label= 'right' ) . sum ( )
2000-01-01 00:00:00 0.579555
2000-01-01 00:05:00 0.346778
2000-01-01 00:10:00 1.592866
2000-01-01 00:15:00 -0.975761
Freq: 5T, dtype: float64
ts. resample( '5min' , closed= 'left' ) . sum ( )
2000-01-01 00:00:00 2.310823
2000-01-01 00:05:00 0.465985
2000-01-01 00:10:00 -1.233370
Freq: 5T, dtype: float64
ts. resample( '5min' ) . sum ( )
2000-01-01 00:00:00 2.310823
2000-01-01 00:05:00 0.465985
2000-01-01 00:10:00 -1.233370
Freq: 5T, dtype: float64
ts. resample( '5min' , closed= 'right' , label= 'right' , loffset= '-1s' ) . sum ( )
<ipython-input-407-0b956efa5584>:3: FutureWarning: 'loffset' in .resample() and in Grouper() is deprecated.
>>> df.resample(freq="3s", loffset="8H")
becomes:
>>> from pandas.tseries.frequencies import to_offset
>>> df = df.resample(freq="3s").mean()
>>> df.index = df.index.to_timestamp() + to_offset("8H")
ts.resample('5min',closed='right',label='right',loffset='-1s').sum()
1999-12-31 23:59:59 0.579555
2000-01-01 00:04:59 0.346778
2000-01-01 00:09:59 1.592866
2000-01-01 00:14:59 -0.975761
Freq: 5T, dtype: float64
11.6.1.1 开端-峰值-谷值-结束(OHLC)重新采样
在金融中,为每个数据桶计算四个值是一种流行的时间序列聚合方法:第一个值(开端)、最后一个值(结束)、最大值(峰值)和最小值(谷值)。通过使用ohlc聚合函数你将会获得包含四种聚合值列的DataFrame,这些值在数据的单次扫描中被高效计算
ts. resample( '5min' ) . ohlc( )
open high low close 2000-01-01 00:00:00 0.579555 1.511186 -0.125649 1.511186 2000-01-01 00:05:00 -1.384490 2.057013 -1.384490 1.469319 2000-01-01 00:10:00 -0.257609 -0.257609 -0.975761 -0.975761
11.6.2 向上采样与插值
当从低频率转换为高频率时,并不需要任何聚合 由于区间涉及时间范围,向上采样和向下采样就更为严格:
在向下采样中,目标频率必须是原频率的子区间。 在向上采样中,目标频率必须是原频率的父区间。
frame = pd. DataFrame( np. random. randn( 2 , 4 ) ,
index= pd. date_range( '1/1/2000' , periods= 2 , freq= 'W-WED' ) ,
columns= [ 'colorado' , 'texas' , 'new york' , 'ohio' ] )
frame
colorado texas new york ohio 2000-01-05 -1.050656 0.063228 -1.179070 1.070880 2000-01-12 0.160683 1.629834 -0.340546 -0.767973
df_daily = frame. resample( 'D' ) . asfreq( )
df_daily
colorado texas new york ohio 2000-01-05 -1.050656 0.063228 -1.179070 1.070880 2000-01-06 NaN NaN NaN NaN 2000-01-07 NaN NaN NaN NaN 2000-01-08 NaN NaN NaN NaN 2000-01-09 NaN NaN NaN NaN 2000-01-10 NaN NaN NaN NaN 2000-01-11 NaN NaN NaN NaN 2000-01-12 0.160683 1.629834 -0.340546 -0.767973
frame. resample( 'D' ) . ffill( )
colorado texas new york ohio 2000-01-05 -1.050656 0.063228 -1.179070 1.070880 2000-01-06 -1.050656 0.063228 -1.179070 1.070880 2000-01-07 -1.050656 0.063228 -1.179070 1.070880 2000-01-08 -1.050656 0.063228 -1.179070 1.070880 2000-01-09 -1.050656 0.063228 -1.179070 1.070880 2000-01-10 -1.050656 0.063228 -1.179070 1.070880 2000-01-11 -1.050656 0.063228 -1.179070 1.070880 2000-01-12 0.160683 1.629834 -0.340546 -0.767973
frame. resample( 'D' ) . ffill( limit= 2 )
colorado texas new york ohio 2000-01-05 -1.050656 0.063228 -1.179070 1.070880 2000-01-06 -1.050656 0.063228 -1.179070 1.070880 2000-01-07 -1.050656 0.063228 -1.179070 1.070880 2000-01-08 NaN NaN NaN NaN 2000-01-09 NaN NaN NaN NaN 2000-01-10 NaN NaN NaN NaN 2000-01-11 NaN NaN NaN NaN 2000-01-12 0.160683 1.629834 -0.340546 -0.767973
frame. resample( 'W-THU' ) . ffill( )
colorado texas new york ohio 2000-01-06 -1.050656 0.063228 -1.179070 1.070880 2000-01-13 0.160683 1.629834 -0.340546 -0.767973
11.6.3 使用区间进行重新采样
frame = pd. DataFrame( np. random. randn( 24 , 4 ) ,
index= pd. period_range( '1-2000' , '12-2001' , freq= 'M' ) ,
columns= [ 'colorado' , 'texas' , 'new york' , 'ohio' ] )
frame. head( )
colorado texas new york ohio 2000-01 0.827955 0.212008 0.136026 -1.278525 2000-02 0.745499 1.909823 0.819960 -0.569550 2000-03 0.893031 0.512417 1.035980 -0.923079 2000-04 -0.223343 0.536235 -1.872214 -0.210844 2000-05 -2.059027 -2.296580 -1.082294 -0.352297
annual_frame = frame. resample( 'A-DEC' ) . mean( )
annual_frame
colorado texas new york ohio 2000 0.062947 0.204868 -0.149046 0.072700 2001 0.433417 0.014037 -0.483594 0.242176
annual_frame. resample( 'Q-DEC' ) . ffill( )
colorado texas new york ohio 2000Q1 0.062947 0.204868 -0.149046 0.072700 2000Q2 0.062947 0.204868 -0.149046 0.072700 2000Q3 0.062947 0.204868 -0.149046 0.072700 2000Q4 0.062947 0.204868 -0.149046 0.072700 2001Q1 0.433417 0.014037 -0.483594 0.242176 2001Q2 0.433417 0.014037 -0.483594 0.242176 2001Q3 0.433417 0.014037 -0.483594 0.242176 2001Q4 0.433417 0.014037 -0.483594 0.242176
annual_frame. resample( 'Q-MAR' ) . ffill( )
colorado texas new york ohio 2000Q4 0.062947 0.204868 -0.149046 0.072700 2001Q1 0.062947 0.204868 -0.149046 0.072700 2001Q2 0.062947 0.204868 -0.149046 0.072700 2001Q3 0.062947 0.204868 -0.149046 0.072700 2001Q4 0.433417 0.014037 -0.483594 0.242176 2002Q1 0.433417 0.014037 -0.483594 0.242176 2002Q2 0.433417 0.014037 -0.483594 0.242176 2002Q3 0.433417 0.014037 -0.483594 0.242176
11.7 移动窗口函数
close_px_all = pd. read_csv( 'examples/stock_px_2.csv' , parse_dates= True , index_col= 0 )
close_px_all
AAPL MSFT XOM SPX 2003-01-02 7.40 21.11 29.22 909.03 2003-01-03 7.45 21.14 29.24 908.59 2003-01-06 7.45 21.52 29.96 929.01 2003-01-07 7.43 21.93 28.95 922.93 2003-01-08 7.28 21.31 28.83 909.93 ... ... ... ... ... 2011-10-10 388.81 26.94 76.28 1194.89 2011-10-11 400.29 27.00 76.27 1195.54 2011-10-12 402.19 26.96 77.16 1207.25 2011-10-13 408.43 27.18 76.37 1203.66 2011-10-14 422.00 27.27 78.11 1224.58
2214 rows × 4 columns
close_px = close_px_all[ [ 'AAPL' , 'MSFT' , 'XOM' ] ]
close_px
AAPL MSFT XOM 2003-01-02 7.40 21.11 29.22 2003-01-03 7.45 21.14 29.24 2003-01-06 7.45 21.52 29.96 2003-01-07 7.43 21.93 28.95 2003-01-08 7.28 21.31 28.83 ... ... ... ... 2011-10-10 388.81 26.94 76.28 2011-10-11 400.29 27.00 76.27 2011-10-12 402.19 26.96 77.16 2011-10-13 408.43 27.18 76.37 2011-10-14 422.00 27.27 78.11
2214 rows × 3 columns
close_px = close_px. resample( 'B' ) . ffill( )
close_px
AAPL MSFT XOM 2003-01-02 7.40 21.11 29.22 2003-01-03 7.45 21.14 29.24 2003-01-06 7.45 21.52 29.96 2003-01-07 7.43 21.93 28.95 2003-01-08 7.28 21.31 28.83 ... ... ... ... 2011-10-10 388.81 26.94 76.28 2011-10-11 400.29 27.00 76.27 2011-10-12 402.19 26.96 77.16 2011-10-13 408.43 27.18 76.37 2011-10-14 422.00 27.27 78.11
2292 rows × 3 columns
close_px. AAPL. plot( )
close_px. AAPL. rolling( 250 ) . mean( ) . plot( )
app1_std250 = close_px. AAPL. rolling( 250 , min_periods= 10 ) . std( )
app1_std250[ 5 : 12 ]
2003-01-09 NaN
2003-01-10 NaN
2003-01-13 NaN
2003-01-14 NaN
2003-01-15 0.077496
2003-01-16 0.074760
2003-01-17 0.112368
Freq: B, Name: AAPL, dtype: float64
app1_std250. plot( )
expanding_mean = app1_std250. expanding( ) . mean( )
close_px. rolling( 60 ) . mean( ) . plot( logy= True )
close_px. rolling( '20D' ) . mean( )
AAPL MSFT XOM 2003-01-02 7.400000 21.110000 29.220000 2003-01-03 7.425000 21.125000 29.230000 2003-01-06 7.433333 21.256667 29.473333 2003-01-07 7.432500 21.425000 29.342500 2003-01-08 7.402000 21.402000 29.240000 ... ... ... ... 2011-10-10 389.351429 25.602143 72.527857 2011-10-11 388.505000 25.674286 72.835000 2011-10-12 388.531429 25.810000 73.400714 2011-10-13 388.826429 25.961429 73.905000 2011-10-14 391.038000 26.048667 74.185333
2292 rows × 3 columns
11.7.1 指数加权函数
指定一个常数衰减因子以向更多近期观测值提供更多权重,可以替代使用具有相等加权观察值的静态窗口尺寸的方法
appl_px = close_px. AAPL[ '2006' : '2007' ]
ma60 = appl_px. rolling( 30 , min_periods= 20 ) . mean( )
ewma60 = appl_px. ewm( span= 30 ) . mean( )
import matplotlib. pyplot as plt
ma60. plot( style= 'k--' , label= 'simple ma' )
ewma60. plot( style= 'k-' , label= 'ew ma' )
plt. legend( )
<matplotlib.legend.Legend at 0x259c45d36d0>
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-N4fO6y4C-1640758272713)(output_335_1.png)]
11.7.2 二元移动窗口函数
spx_px = close_px_all[ 'SPX' ]
spx_rets = close_px. pct_change( )
returns = close_px. pct_change( )
corr = returns. AAPL. rolling( 125 , min_periods= 100 ) . corr( spx_rets)
corr. plot( )
11.7.3 用户自定义的移动窗口函数
from scipy. stats import percentileofscore
score_at_2percent = lambda x: percentileofscore( x, 0.2 )
result = returns. AAPL. rolling( 250 ) . apply ( score_at_2percent)
result. plot( )
第十二章 高阶pandas
12.1 分类数据
12.1.1 背景和目标
import numpy as np
import pandas as pd
values = pd. Series( [ 'apple' , 'orange' , 'apple' , 'apple' ] * 2 )
values
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
dtype: object
pd. unique( values)
array(['apple', 'orange'], dtype=object)
pd. value_counts( values)
apple 6
orange 2
dtype: int64
values = pd. Series( [ 0 , 1 , 0 , 0 ] * 2 )
values
0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
dtype: int64
dim = pd. Series( [ 'apple' , 'orange' ] )
dim
0 apple
1 orange
dtype: object
dim. take( values)
0 apple
1 orange
0 apple
0 apple
0 apple
1 orange
0 apple
0 apple
dtype: object
12.1.2 pandas中的Categorical类型
pandas拥有特殊的Categorical类型,用于承载基于整数的类别展示或编码的数据。
fruits = [ 'apple' , 'orange' , 'apple' , 'apple' ] * 2
fruits
['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
N = len ( fruits)
N
8
df = pd. DataFrame( { 'fruit' : fruits,
'basker_id' : np. arange( N) ,
'count' : np. random. randint( 3 , 15 , size= N) ,
'weight' : np. random. uniform( 0 , 4 , size= N) } )
df
fruit basker_id count weight 0 apple 0 14 3.004288 1 orange 1 9 1.432892 2 apple 2 10 2.859906 3 apple 3 6 2.328174 4 apple 4 12 3.558704 5 orange 5 13 1.199922 6 apple 6 5 0.090920 7 apple 7 8 2.336066
fruit_cat = df[ 'fruit' ] . astype( 'category' )
fruit_cat
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
c = fruit_cat. values
c
['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']
type ( c)
pandas.core.arrays.categorical.Categorical
c. categories
Index(['apple', 'orange'], dtype='object')
c. codes
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
df[ 'fruit' ] = df[ 'fruit' ] . astype( 'category' )
df. fruit
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
my_categories = pd. Categorical( [ 'foo' , 'bar' , 'baz' , 'foo' , 'bar' ] )
my_categories
['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']
categories = [ 'foo' , 'bar' , 'baz' ]
codes = [ 0 , 1 , 2 , 0 , 0 , 1 ]
my_cats_2 = pd. Categorical. from_codes( codes, categories)
my_cats_2
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']
ordered_cat = pd. Categorical. from_codes( codes, categories, ordered= True )
ordered_cat
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
my_cats_2. as_ordered( )
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
12.1.3 使用Categorical对象进行计算
np. random. seed( 12345 )
draws = np. random. randn( 1000 )
draws[ : 5 ]
array([-0.20470766, 0.47894334, -0.51943872, -0.5557303 , 1.96578057])
bins = pd. cut( draws, 4 )
bins
[(-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (0.489, 2.208], ..., (-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (0.489, 2.208], (0.489, 2.208]]
Length: 1000
Categories (4, interval[float64]): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]
bins = pd. qcut( draws, 4 , labels= [ 'Q1' , 'Q2' , 'Q3' , 'Q4' ] )
bins
['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
bins = pd. Series( bins, name= 'quartile' )
results = ( pd. Series( draws)
. groupby( bins)
. agg( [ 'count' , 'min' , 'max' ] )
. reset_index( ) )
results
quartile count min max 0 Q1 250 -2.949343 -0.685484 1 Q2 250 -0.683066 -0.010115 2 Q3 250 -0.010032 0.628894 3 Q4 250 0.634238 3.927528
results[ 'quartile' ]
0 Q1
1 Q2
2 Q3
3 Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
12.1.3.1 使用分类获得更高性能
N = 10000000
draws = pd. Series( np. random. randn( N) )
labels = pd. Series( [ 'foo' , 'bar' , 'baz' , 'qux' ] * ( N// 4 ) )
categories = labels. astype( 'category' )
labels. memory_usage( )
80000128
categories. memory_usage( )
10000332
% timeit labels. astype( 'category' )
299 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
12.1.4 分类方法
方法 描述 add_ categories 将新的类别(未使用过的)添加到已有类别的尾部 as_ordered 对类别排序 as_unordered 使类别无序 remove_categories 去除类别,将被移除的值置为null remove_unused_categories 去除所有没有出现在数据中的类别 rename_categories 使用新的类别名称替代现有的类别,不会改变类别的数量 reorder_categories 与rename_categories 类似,但结果是经过排序的类别 set_categories 用指定的一组新类别替换现有类别,可以添加或删除类别
s = pd. Series( [ 'a' , 'b' , 'c' , 'd' ] * 2 )
cat_s = s. astype( 'category' )
cat_s
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
cat_s. cat. codes
0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 3
dtype: int8
cat_s. cat. categories
Index(['a', 'b', 'c', 'd'], dtype='object')
actual_categories = [ 'a' , 'b' , 'c' , 'd' , 'e' ]
cat_s2 = cat_s. cat. set_categories( actual_categories)
cat_s2
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
cat_s. value_counts( )
a 2
b 2
c 2
d 2
dtype: int64
cat_s2. value_counts( )
a 2
b 2
c 2
d 2
e 0
dtype: int64
cat_s3 = cat_s[ cat_s2. isin( [ 'a' , 'b' ] ) ]
cat_s3
0 a
1 b
4 a
5 b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
cat_s3. cat. remove_unused_categories( )
0 a
1 b
4 a
5 b
dtype: category
Categories (2, object): ['a', 'b']
12.1.4.1 创建用于建模的虚拟变量
当你使用统计数据或机器学习工具时,通常会将分类数据转换为虚拟变量,也称为one-hot编码。 这会产生一个DataFrame,每个不同的类别都是它的一列。这些列包含一个特定类别的出现次数,否则为0。
cat_s = pd. Series( [ 'a' , 'b' , 'c' , 'd' ] * 2 , dtype= 'category' )
cat_s
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
pd. get_dummies( cat_s)
a b c d 0 1 0 0 0 1 0 1 0 0 2 0 0 1 0 3 0 0 0 1 4 1 0 0 0 5 0 1 0 0 6 0 0 1 0 7 0 0 0 1
12.2 高阶GroupBy应用
12.2.1 分组转换和“展开”GroupBy
内建方法transform
transform可以产生一个标量值,并广播到各分组的尺寸数据中 transform可以产生一个与输入分组尺寸相同的对象 transform不可改变它的输入
df = pd. DataFrame( { 'key' : [ 'a' , 'b' , 'c' ] * 4 ,
'value' : np. arange( 12 . ) } )
df
key value 0 a 0.0 1 b 1.0 2 c 2.0 3 a 3.0 4 b 4.0 5 c 5.0 6 a 6.0 7 b 7.0 8 c 8.0 9 a 9.0 10 b 10.0 11 c 11.0
g = df. groupby( 'key' ) . value
g. mean( )
key
a 4.5
b 5.5
c 6.5
Name: value, dtype: float64
g. transform( lambda x: x. mean( ) )
0 4.5
1 5.5
2 6.5
3 4.5
4 5.5
5 6.5
6 4.5
7 5.5
8 6.5
9 4.5
10 5.5
11 6.5
Name: value, dtype: float64
g. transform( 'mean' )
0 4.5
1 5.5
2 6.5
3 4.5
4 5.5
5 6.5
6 4.5
7 5.5
8 6.5
9 4.5
10 5.5
11 6.5
Name: value, dtype: float64
g. transform( lambda x: x* 2 )
0 0.0
1 2.0
2 4.0
3 6.0
4 8.0
5 10.0
6 12.0
7 14.0
8 16.0
9 18.0
10 20.0
11 22.0
Name: value, dtype: float64
g. transform( lambda x: x. rank( ascending= False ) )
0 4.0
1 4.0
2 4.0
3 3.0
4 3.0
5 3.0
6 2.0
7 2.0
8 2.0
9 1.0
10 1.0
11 1.0
Name: value, dtype: float64
def normalize ( x) :
return ( x- x. mean( ) ) / x. std( )
g. transform( normalize)
0 -1.161895
1 -1.161895
2 -1.161895
3 -0.387298
4 -0.387298
5 -0.387298
6 0.387298
7 0.387298
8 0.387298
9 1.161895
10 1.161895
11 1.161895
Name: value, dtype: float64
g. apply ( normalize)
0 -1.161895
1 -1.161895
2 -1.161895
3 -0.387298
4 -0.387298
5 -0.387298
6 0.387298
7 0.387298
8 0.387298
9 1.161895
10 1.161895
11 1.161895
Name: value, dtype: float64
g. transform( 'mean' )
0 4.5
1 5.5
2 6.5
3 4.5
4 5.5
5 6.5
6 4.5
7 5.5
8 6.5
9 4.5
10 5.5
11 6.5
Name: value, dtype: float64
normalized = ( df[ 'value' ] - g. transform( 'mean' ) ) / g. transform( 'std' )
normalized
0 -1.161895
1 -1.161895
2 -1.161895
3 -0.387298
4 -0.387298
5 -0.387298
6 0.387298
7 0.387298
8 0.387298
9 1.161895
10 1.161895
11 1.161895
Name: value, dtype: float64
12.2.2 分组的时间重新采样
对于时间序列数据,resample方法在语义上是一种基于时间分段的分组操作
N = 15
times = pd. date_range( '2017-05-20 00:00' , freq= '1min' , periods= N)
times
DatetimeIndex(['2017-05-20 00:00:00', '2017-05-20 00:01:00',
'2017-05-20 00:02:00', '2017-05-20 00:03:00',
'2017-05-20 00:04:00', '2017-05-20 00:05:00',
'2017-05-20 00:06:00', '2017-05-20 00:07:00',
'2017-05-20 00:08:00', '2017-05-20 00:09:00',
'2017-05-20 00:10:00', '2017-05-20 00:11:00',
'2017-05-20 00:12:00', '2017-05-20 00:13:00',
'2017-05-20 00:14:00'],
dtype='datetime64[ns]', freq='T')
df = pd. DataFrame( { 'time' : times,
'value' : np. arange( N) } )
df
time value 0 2017-05-20 00:00:00 0 1 2017-05-20 00:01:00 1 2 2017-05-20 00:02:00 2 3 2017-05-20 00:03:00 3 4 2017-05-20 00:04:00 4 5 2017-05-20 00:05:00 5 6 2017-05-20 00:06:00 6 7 2017-05-20 00:07:00 7 8 2017-05-20 00:08:00 8 9 2017-05-20 00:09:00 9 10 2017-05-20 00:10:00 10 11 2017-05-20 00:11:00 11 12 2017-05-20 00:12:00 12 13 2017-05-20 00:13:00 13 14 2017-05-20 00:14:00 14
df. set_index( 'time' ) . resample( '5min' ) . count( )
value time 2017-05-20 00:00:00 5 2017-05-20 00:05:00 5 2017-05-20 00:10:00 5
df2 = pd. DataFrame( { 'time' : times. repeat( 3 ) ,
'key' : np. tile( [ 'a' , 'b' , 'c' ] , N) ,
'value' : np. arange( N* 3.0 ) } )
df2[ : 7 ]
time key value 0 2017-05-20 00:00:00 a 0.0 1 2017-05-20 00:00:00 b 1.0 2 2017-05-20 00:00:00 c 2.0 3 2017-05-20 00:01:00 a 3.0 4 2017-05-20 00:01:00 b 4.0 5 2017-05-20 00:01:00 c 5.0 6 2017-05-20 00:02:00 a 6.0
TimeGrouper调用不了 https://blog.csdn.net/weixin_51701683/article/details/110710765 from pandas.core import resample
from pandas. core import resample
time_key = resample. TimeGrouper( '5min' )
time_key
TimeGrouper(freq=<5 * Minutes>, axis=0, sort=True, closed='left', label='left', how='mean', convention='e', origin='start_day')
resampled = ( df2. set_index( 'time' )
. groupby( [ 'key' , time_key] )
. sum ( ) )
resampled
value key time a 2017-05-20 00:00:00 30.0 2017-05-20 00:05:00 105.0 2017-05-20 00:10:00 180.0 b 2017-05-20 00:00:00 35.0 2017-05-20 00:05:00 110.0 2017-05-20 00:10:00 185.0 c 2017-05-20 00:00:00 40.0 2017-05-20 00:05:00 115.0 2017-05-20 00:10:00 190.0
resampled. reset_index( )
key time value 0 a 2017-05-20 00:00:00 30.0 1 a 2017-05-20 00:05:00 105.0 2 a 2017-05-20 00:10:00 180.0 3 b 2017-05-20 00:00:00 35.0 4 b 2017-05-20 00:05:00 110.0 5 b 2017-05-20 00:10:00 185.0 6 c 2017-05-20 00:00:00 40.0 7 c 2017-05-20 00:05:00 115.0 8 c 2017-05-20 00:10:00 190.0
12.3 方法链技术
原位赋值可能比使用assign更为快速,但assign可以实现更方便的方法链
12.3.1 pipe方法
表达式f(df)和df.pipe(f)是等价的,但是pipe使得链式调用更为方便
12.4 本章小结