Python for Data Analysis (4)

Chapter 10: Data Aggregation and Group Operations

  • Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)
  • Compute group summary statistics, like count, mean, or standard deviation, or a user-defined function
  • Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
  • Compute pivot tables and cross-tabulations
  • Perform quantile analysis and other statistical group analyses

10.1 GroupBy Mechanics

import numpy as np
import pandas as pd
df = pd.DataFrame({'data1':np.random.randn(5),
                  'data2':np.random.randn(5),
                  'key1':['a','a','b','b','a'],
                  'key2':['one','two','one','two','one']})
df
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
1 -0.419340  0.207110    a  two
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
4  0.498925  0.502884    a  one
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
key1
a    0.161569
b    0.280326
Name: data1, dtype: float64
means = df['data1'].groupby([df['key1'],df['key2']]).mean()
means
key1  key2
a     one     0.452023
      two    -0.419340
b     one     0.871916
      two    -0.311263
Name: data1, dtype: float64
means.unstack()
key2       one       two
key1                    
a     0.452023 -0.419340
b     0.871916 -0.311263
states = np.array(['ohio','california','california','ohio','ohio'])
years = np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
california  2005   -0.419340
            2006    0.871916
ohio        2005    0.046929
            2006    0.498925
Name: data1, dtype: float64
#You can pass column names (whether those are strings, numbers, or other Python objects) as the group keys
df.groupby('key1').mean()
         data1     data2
key1                    
a     0.161569  0.314499
b     0.280326  0.068706
df.groupby(['key1','key2']).mean()
              data1     data2
key1 key2                    
a    one   0.452023  0.368194
     two  -0.419340  0.207110
b    one   0.871916 -0.476933
     two  -0.311263  0.614344
#size is a generally useful GroupBy method; it returns a Series containing the group sizes
#Note that any missing values in a group key are excluded from the result
#(the sketch after the output below shows how newer pandas can keep them)
df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
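If you want to keep missing keys as their own group, newer pandas (1.1+) accepts a dropna flag on groupby. A minimal sketch, assuming such a version is available:

df2 = df.copy()
df2.loc[0,'key1'] = np.nan                  # make one group key missing
df2.groupby('key1',dropna=False).size()     # the NaN key now appears as its own group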

10.1.1 Iterating over groups

  • The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.
for name,group in df.groupby('key1'):
    print(name)
    print(group)
a
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
1 -0.419340  0.207110    a  two
4  0.498925  0.502884    a  one
b
      data1     data2 key1 key2
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
df
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
1 -0.419340  0.207110    a  two
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
4  0.498925  0.502884    a  one
for (k1,k2),group in df.groupby(['key1','key2']):
    print((k1,k2))
    print(group)
('a', 'one')
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
4  0.498925  0.502884    a  one
('a', 'two')
     data1    data2 key1 key2
1 -0.41934  0.20711    a  two
('b', 'one')
      data1     data2 key1 key2
2  0.871916 -0.476933    b  one
('b', 'two')
      data1     data2 key1 key2
3 -0.311263  0.614344    b  two
pieces = dict(list(df.groupby('key1')))
pieces['b']
      data1     data2 key1 key2
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
#groupby groups on axis=0 by default, but you can group on any of the other axes;
#for example, we could group the columns of df here by dtype
#(note: the loop below reuses the grouped object created earlier from df['data1'].groupby(df['key1']);
#a sketch of the column-wise grouping follows the loop output below)
df.dtypes
data1    float64
data2    float64
key1      object
key2      object
dtype: object
for dtype,group in grouped:
    print(dtype)
    print(group)
a
0    0.405121
1   -0.419340
4    0.498925
Name: data1, dtype: float64
b
2    0.871916
3   -0.311263
Name: data1, dtype: float64
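The axis grouping that the comment describes is never actually constructed in these notes. A minimal sketch of it, assuming the pre-2.0 axis=1 keyword (deprecated in pandas 2.x):

for dtype,group in df.groupby(df.dtypes,axis=1):
    print(dtype)
    print(group)   # float64 -> data1/data2 columns, object -> key1/key2 columns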

10.1.2 Selecting a column or subset of columns

  • Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of subsetting those columns for aggregation.
  • "Syntactic sugar is a term coined by the British computer scientist Peter J. Landin for syntax added to a programming language that does not change its functionality but is more convenient for programmers to use. Syntactic sugar generally makes programs more readable and thus less error-prone."
df.groupby('key1')['data1']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000259B9B6D760>
df.groupby('key1')[['data2']]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B9B70370>
df['data1'].groupby(df['key1'])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000259B9B70610>
df[['data2']].groupby(df['key1'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B9B70FA0>
df.groupby(['key1','key2'])[['data2']].mean()
              data2
key1 key2          
a    one   0.368194
     two   0.207110
b    one  -0.476933
     two   0.614344
s_grouped = df.groupby(['key1','key2'])['data2']
s_grouped.mean()
key1  key2
a     one     0.368194
      two     0.207110
b     one    -0.476933
      two     0.614344
Name: data2, dtype: float64

10.1.3 Grouping with dicts and Series

people = pd.DataFrame(np.random.randn(5,5),
                     columns = ['a','b','c','d','e'],
                     index = ['joe','steve','wes','jim','travis'])
people
               a         b         c         d         e
joe     1.176984 -0.779043  1.232603  0.811508 -0.114978
steve  -0.162964  0.936905 -0.671911 -0.660697  0.297713
wes    -1.154320 -0.351915 -2.286200 -0.200603  0.990375
jim    -0.496740  0.474151  0.058887  2.775079 -0.615440
travis -0.737061  1.470497 -0.173087  0.766692 -0.234068
people.iloc[2:3,[1,4]] = np.nan
people
               a         b         c         d         e
joe     1.176984 -0.779043  1.232603  0.811508 -0.114978
steve  -0.162964  0.936905 -0.671911 -0.660697  0.297713
wes    -1.154320       NaN -2.286200 -0.200603       NaN
jim    -0.496740  0.474151  0.058887  2.775079 -0.615440
travis -0.737061  1.470497 -0.173087  0.766692 -0.234068
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
by_column = people.groupby(mapping,axis=1)
by_column.sum()
            blue       red
joe     2.044111  0.282963
steve  -1.332608  1.071654
wes    -2.486803 -1.154320
jim     2.833967 -0.638030
travis  0.593605  0.499368
map_series = pd.Series(mapping)
map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
people.groupby(map_series,axis=1).count()
        blue  red
joe        2    3
steve      2    3
wes        2    1
jim        2    3
travis     2    3

10.1.4 Grouping with functions

people
               a         b         c         d         e
joe     1.176984 -0.779043  1.232603  0.811508 -0.114978
steve  -0.162964  0.936905 -0.671911 -0.660697  0.297713
wes    -1.154320       NaN -2.286200 -0.200603       NaN
jim    -0.496740  0.474151  0.058887  2.775079 -0.615440
travis -0.737061  1.470497 -0.173087  0.766692 -0.234068
people.groupby(len).sum()
          a         b         c         d         e
3 -0.474076 -0.304892 -0.994710  3.385984 -0.730418
5 -0.162964  0.936905 -0.671911 -0.660697  0.297713
6 -0.737061  1.470497 -0.173087  0.766692 -0.234068
key_list = ['one','one','one','two','two']
people.groupby([len,key_list]).min()
              a         b         c         d         e
3 one -1.154320 -0.779043 -2.286200 -0.200603 -0.114978
  two -0.496740  0.474151  0.058887  2.775079 -0.615440
5 one -0.162964  0.936905 -0.671911 -0.660697  0.297713
6 two -0.737061  1.470497 -0.173087  0.766692 -0.234068

10.1.5 Grouping by index levels

hire_df = pd.DataFrame(np.random.randn(4,5),
                       columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
                                                          [1,3,5,1,3]],
                                                         names=['cty','tenor']))
hire_df
cty          US                            JP          
tenor         1         3         5         1         3
0      0.840245  2.187988 -0.393045 -0.150897  1.573875
1      1.329124  0.070597 -0.246468 -0.147710 -0.397490
2      0.362198  0.404290 -0.233373 -0.406261  0.812492
3      1.104010 -0.485481  0.439878 -1.948048  0.784266
hire_df.groupby(level='cty',axis=1).count()
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
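Index levels can also be referred to by number rather than name; a minimal sketch grouping on the inner tenor level instead:

hire_df.groupby(level='tenor',axis=1).count()   # or equivalently level=1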

10.2 Data Aggregation

  • Optimized groupby methods:
Function     Description
count        Number of non-NA values in the group
sum          Sum of non-NA values
mean         Mean of non-NA values
median       Arithmetic median of non-NA values
std, var     Unbiased (n-1 denominator) standard deviation and variance
min, max     Minimum and maximum of non-NA values
prod         Product of non-NA values
first, last  First and last non-NA values
df
      data1     data2 key1 key2
0  0.405121  0.233504    a  one
1 -0.419340  0.207110    a  two
2  0.871916 -0.476933    b  one
3 -0.311263  0.614344    b  two
4  0.498925  0.502884    a  one
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
key1
a    0.480164
b    0.753598
Name: data1, dtype: float64
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
         data1     data2
key1                    
a     0.918265  0.295774
b     1.183179  1.091277
grouped.describe()
     data1                                                                          data2                                                                        
     count      mean       std       min       25%       50%       75%       max   count      mean       std       min       25%       50%       75%       max
key1                                                                                                                                                           
a      3.0  0.161569  0.505263 -0.419340 -0.007109  0.405121  0.452023  0.498925     3.0  0.314499  0.163679  0.207110  0.220307  0.233504  0.368194  0.502884
b      2.0  0.280326  0.836634 -0.311263 -0.015468  0.280326  0.576121  0.871916     2.0  0.068706  0.771649 -0.476933 -0.204113  0.068706  0.341525  0.614344

10.2.1 Column-wise and multiple function application

tips = pd.read_csv('examples/tips.csv')
tips.head()
   total_bill   tip smoker  day    time  size
0       16.99  1.01     No  Sun  Dinner     2
1       10.34  1.66     No  Sun  Dinner     3
2       21.01  3.50     No  Sun  Dinner     3
3       23.68  3.31     No  Sun  Dinner     2
4       24.59  3.61     No  Sun  Dinner     4
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips[:8]
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
5       25.29  4.71     No  Sun  Dinner     4  0.186240
6        8.77  2.00     No  Sun  Dinner     2  0.228050
7       26.88  3.12     No  Sun  Dinner     4  0.116071
grouped = tips.groupby(['day','smoker'])
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000259B96733A0>
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
#If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions
grouped_pct.agg(['mean','std',peak_to_peak])
                 mean       std  peak_to_peak
day  smoker                                  
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240
grouped_pct.agg([('foo','mean'),('bar',np.std)])
                  foo       bar
day  smoker                    
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
functions = ['count','mean','max']
result = grouped['tip_pct','total_bill'].agg(functions)
result
<ipython-input-74-05231bb63e58>:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  result = grouped['tip_pct','total_bill'].agg(functions)
            tip_pct                      total_bill                  
              count      mean       max       count       mean    max
day  smoker                                                          
Fri  No           4  0.151650  0.187735           4  18.420000  22.75
     Yes         15  0.174783  0.263480          15  16.813333  40.17
Sat  No          45  0.158048  0.291990          45  19.661778  48.33
     Yes         42  0.147906  0.325733          42  21.276667  50.81
Sun  No          57  0.160113  0.252672          57  20.506667  48.17
     Yes         19  0.187250  0.710345          19  24.120000  45.35
Thur No          45  0.160298  0.266312          45  17.113111  41.19
     Yes         17  0.163863  0.241255          17  19.190588  43.11
result['tip_pct']
             count      mean       max
day  smoker                           
Fri  No          4  0.151650  0.187735
     Yes        15  0.174783  0.263480
Sat  No         45  0.158048  0.291990
     Yes        42  0.147906  0.325733
Sun  No         57  0.160113  0.252672
     Yes        19  0.187250  0.710345
Thur No         45  0.160298  0.266312
     Yes        17  0.163863  0.241255
ftuples = [('Durch','mean'),('abwe',np.var)]
grouped['tip_pct','total_bill'].agg(ftuples)
<ipython-input-76-ce86ef210451>:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  grouped['tip_pct','total_bill'].agg(ftuples)
              tip_pct            total_bill            
                Durch      abwe       Durch        abwe
day  smoker                                            
Fri  No      0.151650  0.000791   18.420000   25.596333
     Yes     0.174783  0.002631   16.813333   82.562438
Sat  No      0.158048  0.001581   19.661778   79.908965
     Yes     0.147906  0.003767   21.276667  101.387535
Sun  No      0.160113  0.001793   20.506667   66.099980
     Yes     0.187250  0.023757   24.120000  109.046044
Thur No      0.160298  0.001503   17.113111   59.625081
     Yes     0.163863  0.001551   19.190588   69.808518
#The DataFrame will have hierarchical columns only if multiple functions are applied to at least one column
grouped.agg({'tip':np.max,'size':'sum'})
               tip  size
day  smoker             
Fri  No       3.50     9
     Yes      4.73    31
Sat  No       9.00   115
     Yes     10.00   104
Sun  No       6.00   167
     Yes      6.50    49
Thur No       6.70   112
     Yes      5.00    40

10.2.2 Returning aggregated data without row indexes

  • You can obtain the same result by calling reset_index on the result (a sketch follows the output below); using as_index=False avoids some unnecessary computations.
tips.groupby(['day','smoker'],as_index=False).mean()
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
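A minimal sketch of the reset_index route mentioned in the bullet above, which produces the same frame by aggregating first and then moving the group keys back into columns:

tips.groupby(['day','smoker']).mean().reset_index()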

10.3 Apply: General split-apply-combine

import pandas as pd
tips = pd.read_csv('examples/tips.csv')
tips.head()
   total_bill   tip smoker  day    time  size
0       16.99  1.01     No  Sun  Dinner     2
1       10.34  1.66     No  Sun  Dinner     3
2       21.01  3.50     No  Sun  Dinner     3
3       23.68  3.31     No  Sun  Dinner     2
4       24.59  3.61     No  Sun  Dinner     4
tips['tip_pct'] = tips['tip']/(tips['total_bill']-tips['tip'])
#Suppose you wanted to select the top five tip_pct values by group.
#First, write a function that selects the rows with the largest values in a particular column
def top(df,n=5,column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips,n=6)
     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.387973
183       23.17  6.50    Yes  Sun  Dinner     4  0.389922
232       11.61  3.39     No  Sat  Dinner     2  0.412409
67         3.07  1.00    Yes  Sat  Dinner     1  0.483092
178        9.60  4.00    Yes  Sun  Dinner     2  0.714286
172        7.25  5.15    Yes  Sun  Dinner     2  2.452381
#The top function is called on each group of the DataFrame, and then the results
#are glued together using pandas.concat, labeling the pieces with the group names.
#The result therefore has a hierarchical index whose inner level contains index
#values from the original DataFrame
tips.groupby('smoker').apply(top)
            total_bill   tip smoker   day    time  size   tip_pct
smoker                                                           
No     88        24.71  5.85     No  Thur   Lunch     2  0.310180
       185       20.69  5.00     No   Sun  Dinner     5  0.318674
       51        10.29  2.60     No   Sun  Dinner     2  0.338101
       149        7.51  2.00     No  Thur   Lunch     2  0.362976
       232       11.61  3.39     No   Sat  Dinner     2  0.412409
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.387973
       183       23.17  6.50    Yes   Sun  Dinner     4  0.389922
       67         3.07  1.00    Yes   Sat  Dinner     1  0.483092
       178        9.60  4.00    Yes   Sun  Dinner     2  0.714286
       172        7.25  5.15    Yes   Sun  Dinner     2  2.452381
#If you pass other arguments or keywords to apply besides the function itself, you can pass them after the function
tips.groupby(['smoker','day']).apply(top,n=1,column='total_bill')
                 total_bill    tip smoker   day    time  size   tip_pct
smoker day                                                             
No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.166667
       Sat  212       48.33   9.00     No   Sat  Dinner     4  0.228833
       Sun  156       48.17   5.00     No   Sun  Dinner     6  0.115821
       Thur 142       41.19   5.00     No  Thur   Lunch     5  0.138160
Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.133465
       Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.245038
       Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.083632
       Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.131199
result = tips.groupby('smoker')['tip_pct'].describe()
result
        count      mean       std       min       25%       50%       75%       max
smoker                                                                             
No      151.0  0.192237  0.057665  0.060217  0.158622  0.184308  0.227015  0.412409
Yes      93.0  0.218176  0.254295  0.036955  0.119534  0.181818  0.242326  2.452381
result.unstack('smoker')
       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.192237
       Yes         0.218176
std    No          0.057665
       Yes         0.254295
min    No          0.060217
       Yes         0.036955
25%    No          0.158622
       Yes         0.119534
50%    No          0.184308
       Yes         0.181818
75%    No          0.227015
       Yes         0.242326
max    No          0.412409
       Yes         2.452381
dtype: float64
#Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for the following
f = lambda x :x.describe()
grouped.apply(f)
                   total_bill       tip  size   tip_pct
day  smoker                                            
Fri  No     count    4.000000  4.000000  4.00  4.000000
            mean    18.420000  2.812500  2.25  0.151650
            std      5.059282  0.898494  0.50  0.028123
            min     12.460000  1.500000  2.00  0.120385
            25%     15.100000  2.625000  2.00  0.137239
...                       ...       ...   ...       ...
Thur Yes    min     10.340000  2.000000  2.00  0.090014
            25%     13.510000  2.000000  2.00  0.148038
            50%     16.470000  2.560000  2.00  0.153846
            75%     19.810000  4.000000  2.00  0.194837
            max     43.110000  5.000000  4.00  0.241255

64 rows × 4 columns

10.3.1 Suppressing the group keys

  • You can see that the resulting object has a hierarchical index formed from the group keys along with the index of each piece of the original object.
  • You can disable this by passing group_keys=False to groupby
tips.groupby('smoker').apply(top)
            total_bill   tip smoker   day    time  size   tip_pct
smoker                                                           
No     88        24.71  5.85     No  Thur   Lunch     2  0.310180
       185       20.69  5.00     No   Sun  Dinner     5  0.318674
       51        10.29  2.60     No   Sun  Dinner     2  0.338101
       149        7.51  2.00     No  Thur   Lunch     2  0.362976
       232       11.61  3.39     No   Sat  Dinner     2  0.412409
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.387973
       183       23.17  6.50    Yes   Sun  Dinner     4  0.389922
       67         3.07  1.00    Yes   Sat  Dinner     1  0.483092
       178        9.60  4.00    Yes   Sun  Dinner     2  0.714286
       172        7.25  5.15    Yes   Sun  Dinner     2  2.452381
tips.groupby('smoker',group_keys=False).apply(top)
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.310180
185       20.69  5.00     No   Sun  Dinner     5  0.318674
51        10.29  2.60     No   Sun  Dinner     2  0.338101
149        7.51  2.00     No  Thur   Lunch     2  0.362976
232       11.61  3.39     No   Sat  Dinner     2  0.412409
109       14.31  4.00    Yes   Sat  Dinner     2  0.387973
183       23.17  6.50    Yes   Sun  Dinner     4  0.389922
67         3.07  1.00    Yes   Sat  Dinner     1  0.483092
178        9.60  4.00    Yes   Sun  Dinner     2  0.714286
172        7.25  5.15    Yes   Sun  Dinner     2  2.452381

10.3.2 Quantile and bucket analysis

  • pandas has some tools, in particular cut and qcut, for slicing data up into buckets with bins of your choosing or by sample quantiles.
  • Combining these functions with groupby makes it convenient to perform bucket or quantile analysis on a dataset (a qcut sketch follows at the end of this section)
frame = pd.DataFrame({'data1':np.random.randn(1000),
                     'data2':np.random.randn(1000)})
frame.head()
      data1     data2
0  1.038225 -0.225594
1 -1.142048  0.762311
2 -0.383751  0.319326
3  0.122939 -1.205274
4 -1.557777 -0.495632
quartiles = pd.cut(frame.data1,4)
quartiles[:10]
0     (-0.175, 1.544]
1    (-1.894, -0.175]
2    (-1.894, -0.175]
3     (-0.175, 1.544]
4    (-1.894, -0.175]
5    (-1.894, -0.175]
6    (-1.894, -0.175]
7     (-3.62, -1.894]
8    (-1.894, -0.175]
9     (-0.175, 1.544]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.62, -1.894] < (-1.894, -0.175] < (-0.175, 1.544] < (1.544, 3.263]]
def get_stats(group):
    return {'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}
grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats).unstack()
                        min       max  count      mean
data1                                                 
(-3.62, -1.894]   -1.687871  1.887558   30.0 -0.210904
(-1.894, -0.175]  -2.898182  3.015219  412.0  0.043604
(-0.175, 1.544]   -2.829533  2.779859  495.0 -0.040127
(1.544, 3.263]    -1.966623  2.059859   63.0  0.051666
#Passing labels=False yields just the integer bucket indices (here with 10 equal-length bins) rather than interval labels
grouping = pd.cut(frame.data1,10,labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
            min       max  count      mean
data1                                     
0     -1.351679  0.479551    3.0 -0.162551
1     -1.396584  1.887558   11.0  0.134897
2     -2.122874  1.734941   47.0 -0.178326
3     -2.020175  3.015219  136.0 -0.004133
4     -2.898182  2.468129  245.0  0.079938
5     -2.758906  2.779859  253.0 -0.035156
6     -2.829533  2.439201  187.0 -0.045802
7     -2.250796  2.128096   86.0 -0.033331
8     -1.605125  2.059859   29.0  0.083334
9     -0.029929  0.692138    3.0  0.433789
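The examples above use cut (equal-length bins); a minimal sketch of the qcut variant mentioned at the start of this section, which forms equal-size buckets from sample quantiles (deciles is a hypothetical name):

deciles = pd.qcut(frame.data1,10,labels=False)   # 10 buckets of (roughly) equal size
frame.data2.groupby(deciles).apply(get_stats).unstack()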

10.3.3 Example: Filling missing values with group-specific values

  • fillna is the right tool to use here
s = pd.Series(np.random.randn(6))
s
0   -1.825708
1   -0.406503
2    0.319659
3    0.077313
4   -0.304391
5    0.369055
dtype: float64
s[::2] = np.nan
s
0         NaN
1   -0.406503
2         NaN
3    0.077313
4         NaN
5    0.369055
dtype: float64
s.fillna(s.mean())
0    0.013288
1   -0.406503
2    0.013288
3    0.077313
4    0.013288
5    0.369055
dtype: float64
#Suppose you need the fill value to vary by group.
#One way to do this is to group the data and use apply with a function that calls fillna on each data chunk.
states = ['ohio','new york','vermont','florida','oregon','nevada','california','idaho']
states
['ohio',
 'new york',
 'vermont',
 'florida',
 'oregon',
 'nevada',
 'california',
 'idaho']
group_key = ['east']*4+['west']*4
group_key
['east', 'east', 'east', 'east', 'west', 'west', 'west', 'west']
data = pd.Series(np.random.randn(8),index=states)
data
ohio         -1.196387
new york      1.116637
vermont      -2.096959
florida      -0.482581
oregon       -0.426209
nevada        0.494309
california   -1.444239
idaho         2.388043
dtype: float64
#Set some values in the data to be missing
data[['vermont','nevada','idaho']] = np.nan
data
ohio         -1.196387
new york      1.116637
vermont            NaN
florida      -0.482581
oregon       -0.426209
nevada             NaN
california   -1.444239
idaho              NaN
dtype: float64
data.groupby(group_key).mean()
east   -0.187444
west   -0.935224
dtype: float64
#Fill the NA values using the group means
fill_mean = lambda g :g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
ohio         -1.196387
new york      1.116637
vermont      -0.187444
florida      -0.482581
oregon       -0.426209
nevada       -0.935224
california   -1.444239
idaho        -0.935224
dtype: float64
fill_values = {'east':0.5,'west':-1}
fill_func = lambda g :g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
ohio         -1.196387
new york      1.116637
vermont       0.500000
florida      -0.482581
oregon       -0.426209
nevada       -1.000000
california   -1.444239
idaho        -1.000000
dtype: float64
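A minimal sketch of an alternative that avoids apply: groupby's transform broadcasts each group's mean back onto the original index, so the result can feed fillna directly:

data.fillna(data.groupby(group_key).transform('mean'))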

10.3.4 Example: Random sampling and permutation

#Hearts, Spades, Clubs, Diamonds
suits = ['H','S','C','D']
card_val = (list(range(1,11))+[10]*3)*4
card_val[-5:]
[9, 10, 10, 10, 10]
base_names = ['A']+list(range(2,11))+['J','K','Q']
base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num)+suit for num in base_names)
    
deck = pd.Series(card_val,index=cards)
deck.head()
AH    1
2H    2
3H    3
4H    4
5H    5
dtype: int64
deck[:13]
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64
def draw(deck,n=5):
    return deck.sample(n)   # the original notes had deck.sample(5) here, ignoring n,
                            # which is why the outputs below show five cards per suit
draw(deck)
9S    9
8S    8
3S    3
2D    2
9H    9
dtype: int64
#Suppose you wanted two random cards from each suit
get_suit = lambda card:card[-1]
deck.groupby(get_suit).apply(draw,n=2)
C  JC     10
   3C      3
   9C      9
   QC     10
   8C      8
D  6D      6
   QD     10
   7D      7
   9D      9
   4D      4
H  4H      4
   6H      6
   2H      2
   10H    10
   9H      9
S  10S    10
   QS     10
   8S      8
   KS     10
   5S      5
dtype: int64
deck.groupby(get_suit,group_keys=False).apply(draw,n=2)
QC    10
AC     1
JC    10
KC    10
4C     4
JD    10
AD     1
6D     6
5D     5
7D     7
KH    10
2H     2
6H     6
5H     5
8H     8
2S     2
5S     5
QS    10
4S     4
JS    10
dtype: int64
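In newer pandas (1.1+), GroupBy objects have their own sample method, so the same draw can be written without apply. A minimal sketch, assuming such a version:

deck.groupby(get_suit).sample(n=2)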

10.3.5 Example: Group weighted average and correlation

df = pd.DataFrame({'category':['a','a','a','a','b','b','b','b'],
                  'data':np.random.randn(8),
                  'weights':np.random.rand(8)})
df
  category      data   weights
0        a -0.538213  0.890005
1        a  1.036831  0.119786
2        a -1.706140  0.086685
3        a -1.687288  0.418367
4        b  0.574443  0.372295
5        b -1.791063  0.994798
6        b  1.419661  0.715527
7        b  0.674622  0.338912
#The group weighted average by category is then
grouped = df.groupby('category')
get_wavg = lambda g :np.average(g['data'],weights=g['weights'])
grouped.apply(get_wavg)
category
a   -0.797849
b   -0.133568
dtype: float64
close_px = pd.read_csv(r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\examples\stock_px_2.csv',
                      parse_dates=True,
                      index_col=0)
close_px.head()
            AAPL   MSFT    XOM     SPX
2003-01-02  7.40  21.11  29.22  909.03
2003-01-03  7.45  21.14  29.24  908.59
2003-01-06  7.45  21.52  29.96  929.01
2003-01-07  7.43  21.93  28.95  922.93
2003-01-08  7.28  21.31  28.83  909.93
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px[-4:]
              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58
spx_corr = lambda x :x.corrwith(x['SPX'])
rets = close_px.pct_change().dropna()
get_year = lambda x :x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)
          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0
by_year.apply(lambda g:g['AAPL'].corr(g['MSFT']))
2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

10.3.6 Example: Group-wise linear regression

import statsmodels.api as sm
def regress(data,yvar,xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y,X).fit()
    return result.params
by_year.apply(regress,'AAPL',['SPX'])
           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514
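An equivalent sketch that uses statsmodels' own helper for the intercept instead of assigning a column by hand (it also sidesteps the SettingWithCopy warning that X['intercept'] = 1. can trigger); regress2 is a hypothetical name:

def regress2(data,yvar,xvars):
    X = sm.add_constant(data[xvars])   # adds a 'const' column for the intercept
    return sm.OLS(data[yvar],X).fit().params
by_year.apply(regress2,'AAPL',['SPX'])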

10.4 Pivot Tables and Cross-Tabulation

  • Pivot tables in pandas are made possible through the groupby facility described in this chapter, combined with reshape operations utilizing hierarchical indexing.
  • DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function.
  • In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.
  • pivot_table options (partial):
Option  Description
values  Column name(s) to aggregate; by default, aggregates all numeric columns
index   Column names or other group keys to group on the rows of the resulting pivot table
#Suppose you wanted to compute a table of group means (the default pivot_table aggregation type) arranged by day and smoker on the rows
tips.pivot_table(index=['day','smoker'])
                 size       tip   tip_pct  total_bill
day  smoker                                          
Fri  No      2.250000  2.812500  0.179740   18.420000
     Yes     2.066667  2.714000  0.216293   16.813333
Sat  No      2.555556  3.102889  0.190412   19.661778
     Yes     2.476190  2.875476  0.179833   21.276667
Sun  No      2.929825  3.167895  0.193617   20.506667
     Yes     2.578947  3.516842  0.322021   24.120000
Thur No      2.488889  2.673778  0.193424   17.113111
     Yes     2.352941  3.030000  0.198508   19.190588
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker')
                 size             tip_pct          
smoker             No       Yes        No       Yes
time   day                                         
Dinner Fri   2.000000  2.222222  0.162612  0.202545
       Sat   2.555556  2.476190  0.190412  0.179833
       Sun   2.929825  2.578947  0.193617  0.322021
       Thur  2.000000       NaN  0.190114       NaN
Lunch  Fri   3.000000  1.833333  0.231125  0.236915
       Thur  2.500000  2.352941  0.193499  0.198508
#This table can be augmented to include partial totals by passing margins=True.
#The All values are means without taking into account smoker versus non-smoker (the All columns) or any of the two levels of grouping on the rows (the All row).
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker',margins=True)
                 size                       tip_pct                    
smoker             No       Yes       All        No       Yes       All
time   day                                                             
Dinner Fri   2.000000  2.222222  2.166667  0.162612  0.202545  0.192562
       Sat   2.555556  2.476190  2.517241  0.190412  0.179833  0.185305
       Sun   2.929825  2.578947  2.842105  0.193617  0.322021  0.225718
       Thur  2.000000       NaN  2.000000  0.190114       NaN  0.190114
Lunch  Fri   3.000000  1.833333  2.000000  0.231125  0.236915  0.236088
       Thur  2.500000  2.352941  2.459016  0.193499  0.198508  0.194895
All          2.668874  2.408602  2.569672  0.192237  0.218176  0.202123
tips.pivot_table('tip_pct',index=['time','smoker'],aggfunc=len,margins=True)
               tip_pct
time   smoker         
Dinner No        106.0
       Yes        70.0
Lunch  No         45.0
       Yes        23.0
All              244.0
#If some combinations are empty (or otherwise NA), you may wish to pass a fill_value
tips.pivot_table('tip_pct',index=['time','size','smoker'],aggfunc='mean',fill_value=0)
                      tip_pct
time   size smoker           
Dinner 1    No       0.160000
            Yes      0.483092
       2    No       0.200221
            Yes      0.239672
       3    No       0.183436
            Yes      0.179271
       4    No       0.176033
            Yes      0.169406
       5    No       0.263344
            Yes      0.094779
       6    No       0.115821
Lunch  1    No       0.222087
            Yes      0.288288
       2    No       0.201503
            Yes      0.201043
       3    No       0.138483
            Yes      0.257941
       4    No       0.161573
            Yes      0.186592
       5    No       0.138160
       6    No       0.211191

10.4.1 Cross-tabulations: crosstab

  • A cross-tabulation (crosstab for short) is a special case of a pivot table that computes group frequencies

10.5 Chapter Summary

data = pd.DataFrame({'Sample':range(1,11),
                    'nationality':['USA','Japan','USA','Japan','Japan','Japan','USA','USA','Japan','USA'],
                    'handedness':['right','left','right','right','left','right','right','left','right','right']})
data
   Sample nationality handedness
0       1         USA      right
1       2       Japan       left
2       3         USA      right
3       4       Japan      right
4       5       Japan       left
5       6       Japan      right
6       7         USA      right
7       8         USA       left
8       9       Japan      right
9      10         USA      right
data.pivot_table(index = 'nationality',columns='handedness',margins=True,aggfunc='count')
            Sample           
handedness    left right  All
nationality                  
Japan            2     3    5
USA              1     4    5
All              3     7   10
pd.crosstab(data.nationality,data.handedness,margins=True)
handedness   left  right  All
nationality                  
Japan           2      3    5
USA             1      4    5
All             3      7   10
#The first two arguments to crosstab can each be an array, Series, or list of arrays
pd.crosstab([tips.time,tips.day],tips.smoker,margins=True)
smoker        No  Yes  All
time   day                
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244

Chapter 11: Time Series

11.1 Date and Time Data Types and Tools

  • The Python standard library includes data types for date and time data, as well as calendar-related functionality.
  • The datetime, time, and calendar modules are the main places to start when handling time data.
  • The datetime.datetime type, or simply datetime, is widely used.
  • Types in the datetime module:
Type       Description
date       Stores a calendar date (year, month, day) using the Gregorian calendar
time       Stores the time of day as hours, minutes, seconds, and microseconds
datetime   Stores both date and time
timedelta  Represents the difference between two datetime values (as days, seconds, and microseconds)
tzinfo     Base type for storing time zone information
from datetime import datetime
#datetime stores both the date and the time down to the microsecond
now = datetime.now()
now
datetime.datetime(2021, 12, 28, 14, 25, 24, 8294)
now.year,now.month,now.day
(2021, 12, 28)
#timedelta represents the temporal difference between two datetime objects
delta = datetime(2011,1,7) - datetime(2008,6,24,8,15)
delta
datetime.timedelta(days=926, seconds=56700)
delta.days,delta.seconds
(926, 56700)
#You can add (or subtract) a timedelta or a multiple of one to a datetime object to produce a new shifted datetime object
from datetime import timedelta
start = datetime(2011,1,7)
start+timedelta(12)
datetime.datetime(2011, 1, 19, 0, 0)
start-2*timedelta(12)
datetime.datetime(2010, 12, 14, 0, 0)

11.1.1 Converting between string and datetime

  • You can format datetime objects and pandas Timestamp objects as strings using str or by passing a format specification to the strftime method
  • datetime format specification (ISO C89 compatible):
Type  Description
%Y    Four-digit year
%y    Two-digit year
%m    Two-digit month [01, 12]
%d    Two-digit day [01, 31]
%H    Hour (24-hour clock) [00, 23]
%I    Hour (12-hour clock) [01, 12]
%M    Two-digit minute [00, 59]
%S    Second [00, 61] (seconds 60 and 61 account for leap seconds)
%w    Weekday as an integer [0 (Sunday), 6]
%U    Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are "week 0"
%W    Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are "week 0"
%z    UTC time zone offset as +HHMM or -HHMM; empty if the time is time zone naive
%F    Shortcut for %Y-%m-%d (e.g., 2012-04-18)
%D    Shortcut for %m/%d/%y (e.g., 04/18/12)
stamp = datetime(2011,1,3)
stamp
datetime.datetime(2011, 1, 3, 0, 0)
str(stamp)
'2011-01-03 00:00:00'
stamp.strftime('%Y-%m-%d')
'2011-01-03'
#You can use datetime.strptime with these format codes to convert strings to dates
value = '2011-01-03'
datetime.strptime(value,'%Y-%m-%d')
datetime.datetime(2011, 1, 3, 0, 0)
datestrs = ['7/6/2011','8/6/2011']
[datetime.strptime(x,'%m/%d/%Y') for x in datestrs]
[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
from dateutil.parser import parse
parse('2011-01-03')
datetime.datetime(2011, 1, 3, 0, 0)
#dateutil is capable of parsing most human-intelligible date representations
parse('jan 31,1997 10:45 am')
datetime.datetime(1997, 1, 31, 10, 45)
#In international locales, day appearing before month is very common, so you can pass dayfirst=True to indicate this
parse('6/12/2011',dayfirst=True)
datetime.datetime(2011, 12, 6, 0, 0)
#The to_datetime method can convert many different date representations
datestrs = ['2011-07-06 12:00:00','2011-08-06 00:00:00']
pd.to_datetime(datestrs)
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
#to_datetime also handles values that should be considered missing (None, empty string, etc.)
idx = pd.to_datetime(datestrs+[None])
idx
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
#NaT (Not a Time) is pandas's null value for timestamp data
idx[2]
NaT
pd.isnull(idx)
array([False, False,  True])
  • Locale-specific date formatting options:
Type  Description
%a    Abbreviated weekday name
%A    Full weekday name
%b    Abbreviated month name
%B    Full month name
%c    Full date and time (e.g., 'Tue 01 May 2012 04:20:57 PM')
%p    Locale equivalent of AM or PM
%x    Locale-appropriate formatted date (e.g., in the US, May 1, 2012 yields '05/01/2012')
%X    Locale-appropriate time (e.g., '04:24:12 PM')

11.2 Time Series Basics

  • The most basic kind of time series in pandas is a Series indexed by timestamps, which are often represented external to pandas as Python strings or datetime objects
from datetime import datetime
dates = [datetime(2011,1,2),datetime(2011,1,5),datetime(2011,1,7),
        datetime(2011,1,8),datetime(2011,1,10),datetime(2011,1,12)]
dates
[datetime.datetime(2011, 1, 2, 0, 0),
 datetime.datetime(2011, 1, 5, 0, 0),
 datetime.datetime(2011, 1, 7, 0, 0),
 datetime.datetime(2011, 1, 8, 0, 0),
 datetime.datetime(2011, 1, 10, 0, 0),
 datetime.datetime(2011, 1, 12, 0, 0)]
ts = pd.Series(np.random.randn(6),index=dates)
ts
2011-01-02   -0.388904
2011-01-05    1.029932
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
2011-01-12   -0.053242
dtype: float64
#These datetime objects have been put in a DatetimeIndex
ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
ts[::2]+ts
2011-01-02   -0.777807
2011-01-05         NaN
2011-01-07   -0.373041
2011-01-08         NaN
2011-01-10    1.420241
2011-01-12         NaN
dtype: float64
#pandas stores timestamps using NumPy's datetime64 dtype at nanosecond resolution
ts.index.dtype
dtype('<M8[ns]')
#Scalar values from a DatetimeIndex are pandas Timestamp objects
stamp = ts.index[0]
stamp
Timestamp('2011-01-02 00:00:00')
#A Timestamp can be substituted anywhere you would use a datetime object.
#Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations

11.2.1 Indexing, selection, subsetting

  • Time series behave like any other pandas.Series when you are indexing and selecting data based on the label
stamp = ts.index[2]
ts[stamp]
-0.18652074734702304
#As a convenience, you can also pass a string that is interpretable as a date
ts['1/10/2011']
0.7101205694131453
ts['20110110']
0.7101205694131453
#For longer time series, a year or only a year and month can be passed to easily select slices of data
longer_ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
longer_ts
2000-01-01    1.399510
2000-01-02    1.167288
2000-01-03   -1.557553
2000-01-04    0.905812
2000-01-05   -1.204531
                ...   
2002-09-22   -0.055722
2002-09-23    0.913953
2002-09-24    3.299072
2002-09-25   -1.384872
2002-09-26    0.034663
Freq: D, Length: 1000, dtype: float64
longer_ts['2001']
2001-01-01   -0.098773
2001-01-02    0.299508
2001-01-03   -0.111169
2001-01-04    0.545366
2001-01-05    0.559263
                ...   
2001-12-27    0.383230
2001-12-28    0.274157
2001-12-29   -1.571527
2001-12-30   -0.066029
2001-12-31    0.860201
Freq: D, Length: 365, dtype: float64
longer_ts['2001-05']
2001-05-01    0.952182
2001-05-02   -1.637248
2001-05-03   -1.673597
2001-05-04   -1.257016
2001-05-05   -0.992064
2001-05-06    0.874060
2001-05-07   -0.491860
2001-05-08    0.479921
2001-05-09    0.286766
2001-05-10    2.026310
2001-05-11   -0.547724
2001-05-12    1.313638
2001-05-13   -0.624659
2001-05-14    0.003374
2001-05-15   -0.097401
2001-05-16   -0.733808
2001-05-17   -0.557165
2001-05-18    1.592470
2001-05-19   -1.030133
2001-05-20    1.789641
2001-05-21    0.821066
2001-05-22    0.954955
2001-05-23    0.537921
2001-05-24   -0.016609
2001-05-25    1.806582
2001-05-26    0.323395
2001-05-27   -0.144737
2001-05-28    0.783111
2001-05-29    0.673603
2001-05-30   -1.775568
2001-05-31   -0.553773
Freq: D, dtype: float64
ts[datetime(2011,1,7):]
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
2011-01-12   -0.053242
dtype: float64
ts
2011-01-02   -0.388904
2011-01-05    1.029932
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
2011-01-12   -0.053242
dtype: float64
ts['1/6/2011':'1/11/2011']
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
dtype: float64
#Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays.
#This means that no data is copied, and modifications on the slice will be reflected in the original data.
#There is an equivalent instance method, truncate, that slices a Series between two dates
ts.truncate(after='1/9/2011')
2011-01-02   -0.388904
2011-01-05    1.029932
2011-01-07   -0.186521
2011-01-08   -0.310343
dtype: float64
dates = pd.date_range('1/1/2000',periods=100,freq='W-WED')
dates[:5]
DatetimeIndex(['2000-01-05', '2000-01-12', '2000-01-19', '2000-01-26',
               '2000-02-02'],
              dtype='datetime64[ns]', freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100,4),index=dates,columns = ['colorado','texas','new york','ohio'])
long_df
            colorado     texas  new york      ohio
2000-01-05 -0.401749 -0.717368 -1.281874  1.274557
2000-01-12  1.038078  0.915171 -0.080030 -0.280146
2000-01-19 -1.080790  0.255669 -0.987855 -0.662729
2000-01-26 -0.360513  1.133701  0.678780  0.421880
2000-02-02 -0.636087  1.338383 -0.247395  1.089950
...              ...       ...       ...       ...
2001-10-31  1.464741 -0.390381  2.867647 -0.299401
2001-11-07 -0.341461 -0.358244 -0.766587  0.728394
2001-11-14 -1.010205 -0.055786 -0.574341 -0.652109
2001-11-21 -0.130517  0.929190  0.102579 -0.677333
2001-11-28 -0.349123  1.238194 -0.058024 -3.170347

100 rows × 4 columns

long_df.loc['5-2001']
            colorado     texas  new york      ohio
2001-05-02  0.054055 -0.491146  0.241249  0.310428
2001-05-09 -1.556219 -0.716196  0.514238  1.694993
2001-05-16 -1.382658  0.459362  0.732855  0.644502
2001-05-23 -0.342885  0.544961 -1.133561  1.351170
2001-05-30 -2.051397  0.595807  0.536894  0.584369

11.2.2 Time series with duplicate indices

dates = pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000','1/2/2000','1/3/2000'])
dates
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
               '2000-01-03'],
              dtype='datetime64[ns]', freq=None)
dup_ts = pd.Series(np.arange(5),index=dates)
dup_ts
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32
#We can tell that the index is not unique by checking its is_unique property
dup_ts.index.is_unique
False
#Indexing into this Series now either produces scalar values or slices, depending on whether a timestamp is duplicated
dup_ts['1/3/2000']
4
dup_ts['1/2/2000']
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32
#Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0
grouped = dup_ts.groupby(level=0)
grouped.mean()
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
dup_ts.groupby(level=-1).mean()
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
grouped.count()
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64

11.3 Date Ranges, Frequencies, and Shifting

  • pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.
ts
2011-01-02   -0.388904
2011-01-05    1.029932
2011-01-07   -0.186521
2011-01-08   -0.310343
2011-01-10    0.710121
2011-01-12   -0.053242
dtype: float64
#The string 'D' is interpreted as daily frequency
resampler = ts.resample('D')

11.3.1 Generating date ranges

  • pandas.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency
#By default, date_range generates daily timestamps.
index = pd.date_range('2012-04-01','2012-06-01')
index
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
#If you pass only a start or end date, you must pass a number of periods to generate
pd.date_range(start='2012-04-01',periods=200)
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10',
               ...
               '2012-10-08', '2012-10-09', '2012-10-10', '2012-10-11',
               '2012-10-12', '2012-10-13', '2012-10-14', '2012-10-15',
               '2012-10-16', '2012-10-17'],
              dtype='datetime64[ns]', length=200, freq='D')
#You can pass the 'BM' frequency (business end of month; see the frequency table below)
pd.date_range('2000-01-01','2000-12-01',freq='BM')
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')
  • Base time series frequencies (not comprehensive):
Alias                    Offset type           Description
D                        Day                   Calendar daily
B                        BusinessDay           Business daily
H                        Hour                  Hourly
T or min                 Minute                Once a minute
S                        Second                Once a second
L or ms                  Milli                 Millisecond (1/1,000 of a second)
U                        Micro                 Microsecond (1/1,000,000 of a second)
M                        MonthEnd              Last calendar day of the month
BM                       BusinessMonthEnd      Last business day of the month
MS                       MonthBegin            First calendar day of the month
BMS                      BusinessMonthBegin    First business day of the month
W-MON, W-TUE, ...        Week                  Weekly on the given day of the week (MON, TUE, WED, THU, FRI, SAT, or SUN)
WOM-1MON, WOM-2MON, ...  WeekOfMonth           Generates weekly dates in the first, second, third, or fourth week of the month (e.g., WOM-3FRI for the third Friday of each month)
Q-JAN, Q-FEB, ...        QuarterEnd            Quarterly dates anchored on the last calendar day of each month, for a year ending in the indicated month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, or DEC)
BQ-JAN, BQ-FEB, ...      BusinessQuarterEnd    Quarterly dates anchored on the last business day of each month, for a year ending in the indicated month
QS-JAN, QS-FEB, ...      QuarterBegin          Quarterly dates anchored on the first calendar day of each month, for a year ending in the indicated month
BQS-JAN, BQS-FEB, ...    BusinessQuarterBegin  Quarterly dates anchored on the first business day of each month, for a year ending in the indicated month
A-JAN, A-FEB, ...        YearEnd               Annual dates anchored on the last calendar day of the given month
BA-JAN, BA-FEB, ...      BusinessYearEnd       Annual dates anchored on the last business day of the given month
AS-JAN, AS-FEB, ...      YearBegin             Annual dates anchored on the first calendar day of the given month
BAS-JAN, BAS-FEB, ...    BusinessYearBegin     Annual dates anchored on the first business day of the given month
#By default, date_range preserves the time (if any) of the start or end timestamp
pd.date_range('2012-05-02 12:56:31',periods=5)
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')
pd.date_range('2012-05-02 12:56:31',periods=5,normalize=True)
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

11.3.2 Frequencies and date offsets

  • Frequencies in pandas are composed of a base frequency and a multiplier.
  • Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly. For each base frequency, there is an object that can be used to define a date offset.
from pandas.tseries.offsets import Hour,Minute
hour = Hour()
hour
<Hour>
#You can define a multiple of an offset by passing an integer
four_hours = Hour(4)
four_hours
<4 * Hours>
pd.date_range('2000-01-01','2000-01-03 23:59',freq='4h')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')
#Multiple offsets can be combined by addition
Hour(2)+Minute(30)
<150 * Minutes>
pd.date_range('2000-01-01','2000-01-03 23:59',freq='1h30min')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00',
               '2000-01-01 15:00:00', '2000-01-01 16:30:00',
               '2000-01-01 18:00:00', '2000-01-01 19:30:00',
               '2000-01-01 21:00:00', '2000-01-01 22:30:00',
               '2000-01-02 00:00:00', '2000-01-02 01:30:00',
               '2000-01-02 03:00:00', '2000-01-02 04:30:00',
               '2000-01-02 06:00:00', '2000-01-02 07:30:00',
               '2000-01-02 09:00:00', '2000-01-02 10:30:00',
               '2000-01-02 12:00:00', '2000-01-02 13:30:00',
               '2000-01-02 15:00:00', '2000-01-02 16:30:00',
               '2000-01-02 18:00:00', '2000-01-02 19:30:00',
               '2000-01-02 21:00:00', '2000-01-02 22:30:00',
               '2000-01-03 00:00:00', '2000-01-03 01:30:00',
               '2000-01-03 03:00:00', '2000-01-03 04:30:00',
               '2000-01-03 06:00:00', '2000-01-03 07:30:00',
               '2000-01-03 09:00:00', '2000-01-03 10:30:00',
               '2000-01-03 12:00:00', '2000-01-03 13:30:00',
               '2000-01-03 15:00:00', '2000-01-03 16:30:00',
               '2000-01-03 18:00:00', '2000-01-03 19:30:00',
               '2000-01-03 21:00:00', '2000-01-03 22:30:00'],
              dtype='datetime64[ns]', freq='90T')
#Some frequencies describe points in time that are not evenly spaced.
#For example, 'M' (calendar month end) and 'BM' (last business day of month) depend on
#the number of days in the month and, in the latter case, whether the month ends on a weekend.
#We refer to these as anchored offsets.
11.3.2.1 Week of month dates
#These enable you to get dates like the third Friday of each month
rng = pd.date_range('2012-01-01','2012-09-01',freq='WOM-3FRI')
rng
DatetimeIndex(['2012-01-20', '2012-02-17', '2012-03-16', '2012-04-20',
               '2012-05-18', '2012-06-15', '2012-07-20', '2012-08-17'],
              dtype='datetime64[ns]', freq='WOM-3FRI')

11.3.3 Shifting (leading and lagging) dates

  • "移位"是指将日期按时间向前移动或向后移动。
  • Series和DataFrame都有一个shift方法用于进行简单的前向或后向移位,而不改变索引
ts = pd.Series(np.random.randn(4),index=pd.date_range('1/1/2000',freq='M',periods=4))
ts
2000-01-31   -0.209127
2000-02-29    0.998652
2000-03-31    0.006782
2000-04-30    0.558995
Freq: M, dtype: float64
ts.shift(2)
2000-01-31         NaN
2000-02-29         NaN
2000-03-31   -0.209127
2000-04-30    0.998652
Freq: M, dtype: float64
ts.shift(-2)
2000-01-31    0.006782
2000-02-29    0.558995
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64
#A common use of shift is computing percent changes in a time series or in multiple time series as DataFrame columns
ts/ts.shift(1)-1
2000-01-31          NaN
2000-02-29    -5.775325
2000-03-31    -0.993209
2000-04-30    81.428620
Freq: M, dtype: float64
#If the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data
ts.shift(2,freq='M')
2000-03-31   -0.209127
2000-04-30    0.998652
2000-05-31    0.006782
2000-06-30    0.558995
Freq: M, dtype: float64
#Other frequencies can be passed too, giving you flexibility in how to lead and lag the data
ts.shift(3,freq='D')
2000-02-03   -0.209127
2000-03-03    0.998652
2000-04-03    0.006782
2000-05-03    0.558995
dtype: float64
#The T here stands for minutes
ts.shift(1,freq='90T')
2000-01-31 01:30:00   -0.209127
2000-02-29 01:30:00    0.998652
2000-03-31 01:30:00    0.006782
2000-04-30 01:30:00    0.558995
dtype: float64
11.3.3.1 Shifting dates with offsets
  • The pandas date offsets can also be used with datetime or Timestamp objects
from pandas.tseries.offsets import Day,MonthEnd
now = datetime(2011,11,17)
now+3*Day()
Timestamp('2011-11-20 00:00:00')
now+MonthEnd()
Timestamp('2011-11-30 00:00:00')
now+MonthEnd(2)
Timestamp('2011-12-31 00:00:00')
#Anchored offsets can explicitly "roll" dates forward or backward using their rollforward and rollback methods, respectively
offset = MonthEnd()
offset.rollforward(now)
Timestamp('2011-11-30 00:00:00')
offset.rollback(now)
Timestamp('2011-10-31 00:00:00')
ts = pd.Series(np.random.randn(20),index=pd.date_range('1/15/2000',periods=20,freq='4d'))
ts
2000-01-15    0.165827
2000-01-19   -0.564001
2000-01-23   -1.297098
2000-01-27    0.461758
2000-01-31    0.104934
2000-02-04   -0.197272
2000-02-08   -0.238585
2000-02-12    1.822792
2000-02-16    1.053136
2000-02-20    0.114542
2000-02-24   -0.719873
2000-02-28    0.938727
2000-03-03    0.250604
2000-03-07    1.692985
2000-03-11    0.249253
2000-03-15   -2.015027
2000-03-19    1.274159
2000-03-23    0.593201
2000-03-27   -0.340764
2000-03-31   -0.806581
Freq: 4D, dtype: float64
ts.resample('M').mean()
2000-01-31   -0.225716
2000-02-29    0.396209
2000-03-31    0.112229
Freq: M, dtype: float64
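Because rollforward maps every date to its month end, the same monthly means can be computed by grouping on the anchored offset directly; a minimal sketch:

ts.groupby(MonthEnd().rollforward).mean()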

11.4 Time Zone Handling

  • In Python, time zone information comes from the third-party pytz library (installable with pip or conda), which exposes the Olson database, a compilation of world time zone information.
import pytz
pytz.common_timezones[-5:]
['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']
#To get a time zone object from pytz, use pytz.timezone
tz = pytz.timezone('America/New_York')
tz
<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>

11.4.1 Time zone localization and conversion

rng = pd.date_range('3/9/2012 9:30',periods=6,freq='D')
rng
DatetimeIndex(['2012-03-09 09:30:00', '2012-03-10 09:30:00',
               '2012-03-11 09:30:00', '2012-03-12 09:30:00',
               '2012-03-13 09:30:00', '2012-03-14 09:30:00'],
              dtype='datetime64[ns]', freq='D')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts 
2012-03-09 09:30:00   -0.804325
2012-03-10 09:30:00    0.060626
2012-03-11 09:30:00   -1.531345
2012-03-12 09:30:00    0.133690
2012-03-13 09:30:00    0.173941
2012-03-14 09:30:00   -1.416870
Freq: D, dtype: float64
#The index's tz attribute is None
print(ts.index.tz)
None
pd.date_range('3/9/2012 9:30',periods=10,freq='D',tz='UTC')
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
               '2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
#Conversion from naive to localized is handled by the tz_localize method
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-09 09:30:00+00:00   -0.804325
2012-03-10 09:30:00+00:00    0.060626
2012-03-11 09:30:00+00:00   -1.531345
2012-03-12 09:30:00+00:00    0.133690
2012-03-13 09:30:00+00:00    0.173941
2012-03-14 09:30:00+00:00   -1.416870
Freq: D, dtype: float64
ts_utc.index
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
#Once a time series has been localized to a particular time zone, it can be converted to another time zone with tz_convert
ts_utc.tz_convert('America/New_York')
2012-03-09 04:30:00-05:00   -0.804325
2012-03-10 04:30:00-05:00    0.060626
2012-03-11 05:30:00-04:00   -1.531345
2012-03-12 05:30:00-04:00    0.133690
2012-03-13 05:30:00-04:00    0.173941
2012-03-14 05:30:00-04:00   -1.416870
Freq: D, dtype: float64
ts_eastern = ts.tz_localize('America/New_York')
ts_eastern.tz_convert('UTC')
2012-03-09 14:30:00+00:00   -0.804325
2012-03-10 14:30:00+00:00    0.060626
2012-03-11 13:30:00+00:00   -1.531345
2012-03-12 13:30:00+00:00    0.133690
2012-03-13 13:30:00+00:00    0.173941
2012-03-14 13:30:00+00:00   -1.416870
dtype: float64
ts_eastern.tz_convert('Europe/Berlin')
2012-03-09 15:30:00+01:00   -0.804325
2012-03-10 15:30:00+01:00    0.060626
2012-03-11 14:30:00+01:00   -1.531345
2012-03-12 14:30:00+01:00    0.133690
2012-03-13 14:30:00+01:00    0.173941
2012-03-14 14:30:00+01:00   -1.416870
dtype: float64
#tz_localize and tz_convert are also instance methods on DatetimeIndex
ts.index.tz_localize('Asia/Shanghai')
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
               '2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
               '2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
              dtype='datetime64[ns, Asia/Shanghai]', freq=None)

11.4.2 Operations with time zone-aware Timestamp objects

stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')
Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')
stamp_moscow = pd.Timestamp('2011-03-12 04:00',tz='Europe/Moscow')
stamp_moscow
Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')
stamp_utc.value
1299902400000000000
stamp_utc.tz_convert('America/New_York').value
1299902400000000000
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-12 01:30',tz='US/Eastern')
stamp
Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')
stamp+Hour()
Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')
stamp = pd.Timestamp('2012-11-04 00:30',tz='US/Eastern')
stamp
Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
stamp+2*Hour()
Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')

11.4.3 Operations between different time zones

rng = pd.date_range('3/7/2012 9:30',periods=10,freq='B')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2012-03-07 09:30:00    0.386445
2012-03-08 09:30:00   -0.300690
2012-03-09 09:30:00    0.759709
2012-03-12 09:30:00    1.257558
2012-03-13 09:30:00   -0.505203
2012-03-14 09:30:00   -0.002910
2012-03-15 09:30:00   -0.196304
2012-03-16 09:30:00    0.440398
2012-03-19 09:30:00    0.218168
2012-03-20 09:30:00    1.265953
Freq: B, dtype: float64
ts1 = ts[:7].tz_localize('Europe/London')
ts1
2012-03-07 09:30:00+00:00    0.386445
2012-03-08 09:30:00+00:00   -0.300690
2012-03-09 09:30:00+00:00    0.759709
2012-03-12 09:30:00+00:00    1.257558
2012-03-13 09:30:00+00:00   -0.505203
2012-03-14 09:30:00+00:00   -0.002910
2012-03-15 09:30:00+00:00   -0.196304
dtype: float64
ts2 = ts[:7].tz_localize('Europe/Moscow')
ts2
2012-03-07 09:30:00+04:00    0.386445
2012-03-08 09:30:00+04:00   -0.300690
2012-03-09 09:30:00+04:00    0.759709
2012-03-12 09:30:00+04:00    1.257558
2012-03-13 09:30:00+04:00   -0.505203
2012-03-14 09:30:00+04:00   -0.002910
2012-03-15 09:30:00+04:00   -0.196304
dtype: float64
#When two time series with different time zones are combined, the result is UTC
result = ts1 + ts2
result.index
DatetimeIndex(['2012-03-07 05:30:00+00:00', '2012-03-07 09:30:00+00:00',
               '2012-03-08 05:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 05:30:00+00:00', '2012-03-09 09:30:00+00:00',
               '2012-03-12 05:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 05:30:00+00:00', '2012-03-13 09:30:00+00:00',
               '2012-03-14 05:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 05:30:00+00:00', '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

11.5 Periods and Period Arithmetic

p = pd.Period(2007,freq='A-DEC')
p
Period('2007', 'A-DEC')
p+5
Period('2012', 'A-DEC')
p-2
Period('2005', 'A-DEC')
#If two periods have the same frequency, their difference is the number of units between them
pd.Period(2014,freq='A-DEC')-p
<7 * YearEnds: month=12>
#Regular ranges of periods can be constructed with the period_range function
rng = pd.period_range('2000-01-01','2000-06-30',freq='M')
rng
PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')
#The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure
pd.Series(np.random.randn(6),index=rng)
2000-01   -1.474403
2000-02    0.146328
2000-03    1.247877
2000-04    0.797235
2000-05   -1.583312
2000-06    1.525540
Freq: M, dtype: float64
#If you have an array of strings, you can also use the PeriodIndex class directly
values = ['2001Q3','2002Q2','2003Q1']
index = pd.PeriodIndex(values,freq='Q-DEC')
index
PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

11.5.1 Period frequency conversion

  • Periods and PeriodIndex objects can be converted to another frequency with their asfreq method.
p = pd.Period(2007,freq='A-DEC')
p
Period('2007', 'A-DEC')
p.asfreq('M',how='start')
Period('2007-01', 'M')
p.asfreq('M',how='end')
Period('2007-12', 'M')
p = pd.Period(2007,freq='A-JUN')
p
Period('2007', 'A-JUN')
p.asfreq('M',how='start')
Period('2006-07', 'M')
p.asfreq('M',how='end')
Period('2007-06', 'M')
p = pd.Period('Aug-2007','M')
p
Period('2007-08', 'M')
p.asfreq('A-JUN')
Period('2008', 'A-JUN')
rng = pd.period_range('2006','2009',freq='A-DEC')
rng
PeriodIndex(['2006', '2007', '2008', '2009'], dtype='period[A-DEC]', freq='A-DEC')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2006   -0.784284
2007   -1.058489
2008   -0.151931
2009    0.443975
Freq: A-DEC, dtype: float64
ts.asfreq('M',how='start')
2006-01   -0.784284
2007-01   -1.058489
2008-01   -0.151931
2009-01    0.443975
Freq: M, dtype: float64
ts.asfreq('B',how='end')
2006-12-29   -0.784284
2007-12-31   -1.058489
2008-12-31   -0.151931
2009-12-31    0.443975
Freq: B, dtype: float64

11.5.2 Quarterly period frequencies

p = pd.Period(2007,freq='Q-JAN')
p
Period('2007Q4', 'Q-JAN')
p.asfreq('D','start')
Period('2006-11-01', 'D')
p.asfreq('D','end')
Period('2007-01-31', 'D')
#Get the timestamp at 4 PM on the second-to-last business day of the quarter
p4pm = (p.asfreq('B','e')-1).asfreq('T','S')+16*60
p4pm
Period('2007-01-30 16:00', 'T')
p4pm.to_timestamp()
Timestamp('2007-01-30 16:00:00')
rng = pd.period_range('2011Q3','2012Q4',freq='Q-JAN')
ts = pd.Series(np.arange(len(rng)),index=rng)
ts
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32
new_rng = (rng.asfreq('B','e')-1).asfreq('T','S')+16*60
ts.index = new_rng.to_timestamp()
ts
2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32

11.5.3 Converting timestamps to periods (and back)

rng = pd.date_range('2000-01-01',periods=3,freq='M')
ts = pd.Series(np.random.randn(3),index=rng)
ts
2000-01-31   -0.622352
2000-02-29   -1.830166
2000-03-31   -0.601708
Freq: M, dtype: float64
#Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method
pts = ts.to_period()
pts
2000-01   -0.622352
2000-02   -1.830166
2000-03   -0.601708
Freq: M, dtype: float64
rng = pd.date_range('1/29/2000',periods=6,freq='D')
rng
DatetimeIndex(['2000-01-29', '2000-01-30', '2000-01-31', '2000-02-01',
               '2000-02-02', '2000-02-03'],
              dtype='datetime64[ns]', freq='D')
ts2 = pd.Series(np.random.randn(6),index=rng)
ts2
2000-01-29    0.056749
2000-01-30   -0.310291
2000-01-31    1.369577
2000-02-01   -0.361215
2000-02-02    0.426119
2000-02-03   -0.010565
Freq: D, dtype: float64
ts2.to_period('M')
2000-01    0.056749
2000-01   -0.310291
2000-01    1.369577
2000-02   -0.361215
2000-02    0.426119
2000-02   -0.010565
Freq: M, dtype: float64
#To convert back to timestamps, use to_timestamp
pts = ts2.to_period()
pts 
2000-01-29    0.056749
2000-01-30   -0.310291
2000-01-31    1.369577
2000-02-01   -0.361215
2000-02-02    0.426119
2000-02-03   -0.010565
Freq: D, dtype: float64
pts.to_timestamp(how='end')
2000-01-29 23:59:59.999999999    0.056749
2000-01-30 23:59:59.999999999   -0.310291
2000-01-31 23:59:59.999999999    1.369577
2000-02-01 23:59:59.999999999   -0.361215
2000-02-02 23:59:59.999999999    0.426119
2000-02-03 23:59:59.999999999   -0.010565
Freq: D, dtype: float64

11.5.4 Creating a PeriodIndex from arrays

data = pd.read_csv('examples/macrodata.csv')
data.head(5)
     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19
data.year
0      1959.0
1      1959.0
2      1959.0
3      1959.0
4      1960.0
        ...  
198    2008.0
199    2008.0
200    2009.0
201    2009.0
202    2009.0
Name: year, Length: 203, dtype: float64
data.quarter
0      1.0
1      2.0
2      3.0
3      4.0
4      1.0
      ... 
198    3.0
199    4.0
200    1.0
201    2.0
202    3.0
Name: quarter, Length: 203, dtype: float64
#By passing these arrays along with a frequency to PeriodIndex, you can combine them to form an index for the DataFrame
index = pd.PeriodIndex(year=data.year,quarter=data.quarter,freq='Q-DEC')
index
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')
data.index = index
data.head()
          year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
1959Q1  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1959Q2  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
1959Q3  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
1959Q4  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
1960Q1  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19
data.infl
1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
          ... 
2008Q3   -3.16
2008Q4   -8.79
2009Q1    0.94
2009Q2    3.37
2009Q3    3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64

11.6 Resampling and Frequency Conversion

  • Resampling refers to the process of converting a time series from one frequency to another.
  • Aggregating higher frequency data to a lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling.
  • Not all resampling falls into either of these categories.
  • All pandas objects are equipped with a resample method, which is the workhorse function for all frequency conversion.
  • resample has a similar API to groupby; you call resample to group the data, then call an aggregation function
rng = pd.date_range('2000-01-01',periods=100,freq='D')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2000-01-01    0.037518
2000-01-02   -0.067833
2000-01-03   -0.704365
2000-01-04   -0.368819
2000-01-05   -1.208229
                ...   
2000-04-05    1.085903
2000-04-06    0.746400
2000-04-07    0.517810
2000-04-08   -0.363853
2000-04-09    0.057838
Freq: D, Length: 100, dtype: float64
ts.resample('M').mean()
2000-01-31   -0.108144
2000-02-29   -0.055158
2000-03-31    0.358646
2000-04-30    0.773431
Freq: M, dtype: float64
ts.resample('M',kind='period').mean()
2000-01   -0.108144
2000-02   -0.055158
2000-03    0.358646
2000-04    0.773431
Freq: M, dtype: float64
  • resample method arguments:
Argument     Description
freq         String or DateOffset indicating the desired resampling frequency (e.g., 'M', '5min', or Second(1))
axis         Axis to resample on; default is axis=0
fill_method  How to interpolate when upsampling, as in 'ffill' or 'bfill'; by default does no interpolation
closed       In downsampling, which end of each interval is closed (inclusive), 'right' or 'left'
label        In downsampling, whether to label the aggregated result with the 'right' or 'left' bin edge (e.g., the 9:30-to-9:35 five-minute interval could be labeled 9:30 or 9:35)
loffset      Time adjustment to the bin labels, such as '-1s' / Second(-1) to shift the aggregate labels one second earlier
limit        When forward or backward filling, the maximum number of periods to fill
kind         Aggregate to periods ('period') or timestamps ('timestamp'); defaults to the type of index the time series has
convention   When resampling periods, the convention ('start' or 'end') for converting the low-frequency period to high frequency; defaults to 'end'

11.6.1 Downsampling

  • There are a couple of things to think about when using resample to downsample data:
    • Which side of each interval is closed (inclusive)
    • Whether to label each aggregated bin with the start of the interval or the end
rng = pd.date_range('2000-01-01',periods=12,freq='T')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2000-01-01 00:00:00    0.579555
2000-01-01 00:01:00    0.210504
2000-01-01 00:02:00    0.135227
2000-01-01 00:03:00   -0.125649
2000-01-01 00:04:00    1.511186
2000-01-01 00:05:00   -1.384490
2000-01-01 00:06:00   -0.938106
2000-01-01 00:07:00    2.057013
2000-01-01 00:08:00   -0.737751
2000-01-01 00:09:00    1.469319
2000-01-01 00:10:00   -0.257609
2000-01-01 00:11:00   -0.975761
Freq: T, dtype: float64
ts.resample('5min',closed='right',label='right').sum()
2000-01-01 00:00:00    0.579555
2000-01-01 00:05:00    0.346778
2000-01-01 00:10:00    1.592866
2000-01-01 00:15:00   -0.975761
Freq: 5T, dtype: float64
ts.resample('5min',closed='left').sum()
2000-01-01 00:00:00    2.310823
2000-01-01 00:05:00    0.465985
2000-01-01 00:10:00   -1.233370
Freq: 5T, dtype: float64
ts.resample('5min').sum()
2000-01-01 00:00:00    2.310823
2000-01-01 00:05:00    0.465985
2000-01-01 00:10:00   -1.233370
Freq: 5T, dtype: float64
#You may want to shift the result index by some amount, say subtracting one second from the right edge, to make it clearer which interval the timestamp refers to.
#To do this, pass a string or date offset to loffset
ts.resample('5min',closed='right',label='right',loffset='-1s').sum()
#You can also achieve the effect of loffset by calling the shift method on the result (see the sketch after the warning output below)
<ipython-input-407-0b956efa5584>:3: FutureWarning: 'loffset' in .resample() and in Grouper() is deprecated.

>>> df.resample(freq="3s", loffset="8H")

becomes:

>>> from pandas.tseries.frequencies import to_offset
>>> df = df.resample(freq="3s").mean()
>>> df.index = df.index.to_timestamp() + to_offset("8H")

  ts.resample('5min',closed='right',label='right',loffset='-1s').sum()

1999-12-31 23:59:59    0.579555
2000-01-01 00:04:59    0.346778
2000-01-01 00:09:59    1.592866
2000-01-01 00:14:59   -0.975761
Freq: 5T, dtype: float64
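Since loffset is deprecated, here is a minimal sketch of the shift-based alternative mentioned above, reusing the ts defined earlier; it reproduces the '-1s' label adjustment without the deprecated argument:
#Aggregate first, then move the bin labels back one second,
#mirroring what loffset='-1s' used to do
shifted = ts.resample('5min',closed='right',label='right').sum()
shifted.index = shifted.index + pd.Timedelta('-1s')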
11.6.1.1 Open-High-Low-Close (OHLC) resampling
  • In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values. Using the ohlc aggregate function gives you a DataFrame whose columns contain these four aggregates, which are computed efficiently in a single sweep of the data.
ts.resample('5min').ohlc()
                         open      high       low     close
2000-01-01 00:00:00  0.579555  1.511186 -0.125649  1.511186
2000-01-01 00:05:00 -1.384490  2.057013 -1.384490  1.469319
2000-01-01 00:10:00 -0.257609 -0.257609 -0.975761 -0.975761

11.6.2 Upsampling and Interpolation

  • When converting from a low frequency to a higher frequency, no aggregation is needed.
  • Since periods refer to time spans, upsampling and downsampling with periods are stricter:
    • In downsampling, the target frequency must be a subperiod of the source frequency.
    • In upsampling, the target frequency must be a superperiod of the source frequency.
frame = pd.DataFrame(np.random.randn(2,4),
                     index=pd.date_range('1/1/2000',periods=2,freq='W-WED'),
                     columns=['colorado','texas','new york','ohio'])
frame
            colorado     texas  new york      ohio
2000-01-05 -1.050656  0.063228 -1.179070  1.070880
2000-01-12  0.160683  1.629834 -0.340546 -0.767973
#Use the asfreq method to convert to the higher frequency without any aggregation
df_daily = frame.resample('D').asfreq()
df_daily
            colorado     texas  new york      ohio
2000-01-05 -1.050656  0.063228 -1.179070  1.070880
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  0.160683  1.629834 -0.340546 -0.767973
#The same filling or interpolation methods available in fillna and reindex can be used when resampling
frame.resample('D').ffill()
            colorado     texas  new york      ohio
2000-01-05 -1.050656  0.063228 -1.179070  1.070880
2000-01-06 -1.050656  0.063228 -1.179070  1.070880
2000-01-07 -1.050656  0.063228 -1.179070  1.070880
2000-01-08 -1.050656  0.063228 -1.179070  1.070880
2000-01-09 -1.050656  0.063228 -1.179070  1.070880
2000-01-10 -1.050656  0.063228 -1.179070  1.070880
2000-01-11 -1.050656  0.063228 -1.179070  1.070880
2000-01-12  0.160683  1.629834 -0.340546 -0.767973
#You can likewise choose to fill only a certain number of periods forward, to limit how far an observed value is carried
frame.resample('D').ffill(limit=2)
            colorado     texas  new york      ohio
2000-01-05 -1.050656  0.063228 -1.179070  1.070880
2000-01-06 -1.050656  0.063228 -1.179070  1.070880
2000-01-07 -1.050656  0.063228 -1.179070  1.070880
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  0.160683  1.629834 -0.340546 -0.767973
#Notice that the new date index need not overlap with the old one at all
frame.resample('W-THU').ffill()
            colorado     texas  new york      ohio
2000-01-06 -1.050656  0.063228 -1.179070  1.070880
2000-01-13  0.160683  1.629834 -0.340546 -0.767973

11.6.3 Resampling with Periods

frame = pd.DataFrame(np.random.randn(24,4),
                    index=pd.period_range('1-2000','12-2001',freq='M'),
                    columns=['colorado','texas','new york','ohio'])
frame.head()
         colorado     texas  new york      ohio
2000-01  0.827955  0.212008  0.136026 -1.278525
2000-02  0.745499  1.909823  0.819960 -0.569550
2000-03  0.893031  0.512417  1.035980 -0.923079
2000-04 -0.223343  0.536235 -1.872214 -0.210844
2000-05 -2.059027 -2.296580 -1.082294 -0.352297
annual_frame = frame.resample('A-DEC').mean()
annual_frame
      colorado     texas  new york      ohio
2000  0.062947  0.204868 -0.149046  0.072700
2001  0.433417  0.014037 -0.483594  0.242176
annual_frame.resample('Q-DEC').ffill()
        colorado     texas  new york      ohio
2000Q1  0.062947  0.204868 -0.149046  0.072700
2000Q2  0.062947  0.204868 -0.149046  0.072700
2000Q3  0.062947  0.204868 -0.149046  0.072700
2000Q4  0.062947  0.204868 -0.149046  0.072700
2001Q1  0.433417  0.014037 -0.483594  0.242176
2001Q2  0.433417  0.014037 -0.483594  0.242176
2001Q3  0.433417  0.014037 -0.483594  0.242176
2001Q4  0.433417  0.014037 -0.483594  0.242176
annual_frame.resample('Q-MAR').ffill()
        colorado     texas  new york      ohio
2000Q4  0.062947  0.204868 -0.149046  0.072700
2001Q1  0.062947  0.204868 -0.149046  0.072700
2001Q2  0.062947  0.204868 -0.149046  0.072700
2001Q3  0.062947  0.204868 -0.149046  0.072700
2001Q4  0.433417  0.014037 -0.483594  0.242176
2002Q1  0.433417  0.014037 -0.483594  0.242176
2002Q2  0.433417  0.014037 -0.483594  0.242176
2002Q3  0.433417  0.014037 -0.483594  0.242176

11.7 Moving Window Functions

close_px_all = pd.read_csv('examples/stock_px_2.csv',parse_dates=True,index_col=0)
close_px_all
              AAPL   MSFT    XOM      SPX
2003-01-02    7.40  21.11  29.22   909.03
2003-01-03    7.45  21.14  29.24   908.59
2003-01-06    7.45  21.52  29.96   929.01
2003-01-07    7.43  21.93  28.95   922.93
2003-01-08    7.28  21.31  28.83   909.93
...            ...    ...    ...      ...
2011-10-10  388.81  26.94  76.28  1194.89
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58

2214 rows × 4 columns

close_px = close_px_all[['AAPL','MSFT','XOM']]
close_px
              AAPL   MSFT    XOM
2003-01-02    7.40  21.11  29.22
2003-01-03    7.45  21.14  29.24
2003-01-06    7.45  21.52  29.96
2003-01-07    7.43  21.93  28.95
2003-01-08    7.28  21.31  28.83
...            ...    ...    ...
2011-10-10  388.81  26.94  76.28
2011-10-11  400.29  27.00  76.27
2011-10-12  402.19  26.96  77.16
2011-10-13  408.43  27.18  76.37
2011-10-14  422.00  27.27  78.11

2214 rows × 3 columns

close_px = close_px.resample('B').ffill()
close_px
              AAPL   MSFT    XOM
2003-01-02    7.40  21.11  29.22
2003-01-03    7.45  21.14  29.24
2003-01-06    7.45  21.52  29.96
2003-01-07    7.43  21.93  28.95
2003-01-08    7.28  21.31  28.83
...            ...    ...    ...
2011-10-10  388.81  26.94  76.28
2011-10-11  400.29  27.00  76.27
2011-10-12  402.19  26.96  77.16
2011-10-13  408.43  27.18  76.37
2011-10-14  422.00  27.27  78.11

2292 rows × 3 columns

close_px.AAPL.plot()
close_px.AAPL.rolling(250).mean().plot()
appl_std250 = close_px.AAPL.rolling(250,min_periods=10).std()
appl_std250[5:12]
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.074760
2003-01-17    0.112368
Freq: B, Name: AAPL, dtype: float64
appl_std250.plot()
expanding_mean = appl_std250.expanding().mean()
close_px.rolling(60).mean().plot(logy=True)
close_px.rolling('20D').mean()
                  AAPL       MSFT        XOM
2003-01-02    7.400000  21.110000  29.220000
2003-01-03    7.425000  21.125000  29.230000
2003-01-06    7.433333  21.256667  29.473333
2003-01-07    7.432500  21.425000  29.342500
2003-01-08    7.402000  21.402000  29.240000
...                ...        ...        ...
2011-10-10  389.351429  25.602143  72.527857
2011-10-11  388.505000  25.674286  72.835000
2011-10-12  388.531429  25.810000  73.400714
2011-10-13  388.826429  25.961429  73.905000
2011-10-14  391.038000  26.048667  74.185333

2292 rows × 3 columns

11.7.1 Exponentially Weighted Functions

  • An alternative to using a static window size with equally weighted observations is to specify a constant decay factor, which gives more weight to more recent observations.
appl_px = close_px.AAPL['2006':'2007']
ma60 = appl_px.rolling(30,min_periods=20).mean()
ewma60 = appl_px.ewm(span=30).mean()
import matplotlib.pyplot as plt
ma60.plot(style='k--',label='simple ma')
ewma60.plot(style='k-',label='ew ma')
plt.legend()
<matplotlib.legend.Legend at 0x259c45d36d0>


[Figure: simple moving average ('simple ma', dashed) vs. exponentially weighted moving average ('ew ma', solid) of the AAPL price]

11.7.2 Binary Moving Window Functions

spx_px = close_px_all['SPX']
spx_rets = spx_px.pct_change()
returns = close_px.pct_change()
corr = returns.AAPL.rolling(125,min_periods=100).corr(spx_rets)
corr.plot()

11.7.3 User-Defined Moving Window Functions

from scipy.stats import percentileofscore
#percentileofscore(x, 0.2) returns the percentile rank of the value 0.2 within the sample x
score_at_2percent = lambda x:percentileofscore(x,0.2)
result = returns.AAPL.rolling(250).apply(score_at_2percent)
result.plot()

Chapter 12: Advanced pandas

12.1 Categorical Data

12.1.1 Background and Motivation

import numpy as np
import pandas as pd
values = pd.Series(['apple','orange','apple','apple']*2)
values
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object
pd.unique(values)
array(['apple', 'orange'], dtype=object)
pd.value_counts(values)
apple     6
orange    2
dtype: int64
values = pd.Series([0,1,0,0]*2)
values
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64
dim = pd.Series(['apple', 'orange'])
dim
0     apple
1    orange
dtype: object
#You can use the take method to restore the original Series of strings
dim.take(values)
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
#This representation as integers is called the categorical or dictionary-encoded representation

12.1.2 Categorical Type in pandas

  • pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding.
fruits = ['apple','orange','apple','apple']*2
fruits
['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
N = len(fruits)
N
8
df = pd.DataFrame({'fruit':fruits,
                  'basket_id':np.arange(N),
                  'count':np.random.randint(3,15,size=N),
                  'weight':np.random.uniform(0,4,size=N)})
df
    fruit  basket_id  count    weight
0   apple          0     14  3.004288
1  orange          1      9  1.432892
2   apple          2     10  2.859906
3   apple          3      6  2.328174
4   apple          4     12  3.558704
5  orange          5     13  1.199922
6   apple          6      5  0.090920
7   apple          7      8  2.336066
#df['fruit'] is an array of Python string objects.
#We can convert it to a Categorical object by calling astype
fruit_cat = df['fruit'].astype('category')
fruit_cat
#The values of fruit_cat are not a NumPy array but an instance of pandas.Categorical
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
c = fruit_cat.values
c
['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']
type(c)
pandas.core.arrays.categorical.Categorical
#Categorical objects have categories and codes attributes
c.categories
Index(['apple', 'orange'], dtype='object')
c.codes
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
#You can convert a DataFrame column to Categorical by assigning the converted result
df['fruit'] = df['fruit'].astype('category')
df.fruit
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
#You can also create pandas.Categorical directly from other Python sequence types
my_categories = pd.Categorical(['foo','bar','baz','foo','bar'])
my_categories
['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']
#If you have obtained categorical encoded data from another source, you can use the from_codes constructor
categories = ['foo','bar','baz']
codes = [0,1,2,0,0,1]
my_cats_2 = pd.Categorical.from_codes(codes,categories)
my_cats_2
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']
#When using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering
ordered_cat = pd.Categorical.from_codes(codes,categories,ordered=True)
ordered_cat
#The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering, and so on
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
#An unordered categorical instance can be made ordered with as_ordered
my_cats_2.as_ordered()
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

12.1.3 Computations with Categoricals

np.random.seed(12345)
draws = np.random.randn(1000)
draws[:5]
array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])
#Let's compute a quartile binning of this data and extract some statistics
bins = pd.cut(draws,4)
bins
[(-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (0.489, 2.208], ..., (-1.23, 0.489], (-1.23, 0.489], (-1.23, 0.489], (0.489, 2.208], (0.489, 2.208]]
Length: 1000
Categories (4, interval[float64]): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]
bins = pd.qcut(draws,4,labels=['Q1','Q2','Q3','Q4'])
bins
['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
bins = pd.Series(bins,name='quartile')
results = (pd.Series(draws)
          .groupby(bins)
          .agg(['count','min','max'])
          .reset_index())
results
  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528
results['quartile']
0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
12.1.3.1 Better performance with categoricals
N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo','bar','baz','qux']*(N//4))
categories = labels.astype('category')
#labels uses significantly more memory than categories
labels.memory_usage()
80000128
categories.memory_usage()
10000332
#The conversion to category is not free, of course, but it is a one-time cost
%timeit labels.astype('category')
299 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
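The one-time cost also pays off in repeated operations: GroupBy among other operations tends to run faster on categoricals, since it can work on the integer codes rather than on an array of strings. A quick, illustrative check reusing draws, labels, and categories from above (timings are machine-dependent):
%timeit draws.groupby(labels).mean()
%timeit draws.groupby(categories).mean()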

12.1.4 Categorical Methods

  • Categorical methods for Series in pandas:
Method                    Description
add_categories            Append new (unused) categories at end of existing categories
as_ordered                Make the categories ordered
as_unordered              Make the categories unordered
remove_categories         Remove categories, setting any removed values to null
remove_unused_categories  Remove any category values that do not appear in the data
rename_categories         Replace the categories with the indicated set of new category names; cannot change the number of categories
reorder_categories        Behaves like rename_categories, but the result has ordered categories
set_categories            Replace the categories with the indicated set of new categories; can add or remove categories
s = pd.Series(['a','b','c','d']*2)
cat_s = s.astype('category')
cat_s
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
#The special attribute cat provides access to categorical methods
cat_s.cat.codes
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8
cat_s.cat.categories
Index(['a', 'b', 'c', 'd'], dtype='object')
#You can use the set_categories method to change the categories
actual_categories = ['a','b','c','d','e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
cat_s.value_counts()
a    2
b    2
c    2
d    2
dtype: int64
cat_s2.value_counts()
a    2
b    2
c    2
d    2
e    0
dtype: int64
#You can use the remove_unused_categories method to trim unobserved categories
cat_s3 = cat_s[cat_s2.isin(['a','b'])]
cat_s3
0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
cat_s3.cat.remove_unused_categories()
0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']
12.1.4.1 Creating dummy variables for modeling
  • When you're using statistics or machine learning tools, you'll often transform categorical data into dummy variables, also known as one-hot encoding.
  • This produces a DataFrame with a column for each distinct category; these columns contain 1 where the given category appears and 0 otherwise.
cat_s = pd.Series(['a','b','c','d']*2,dtype='category')
cat_s
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
#The pandas.get_dummies function converts one-dimensional categorical data into a DataFrame of dummy variables
pd.get_dummies(cat_s)
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1

12.2 Advanced GroupBy Use

12.2.1 Group Transforms and "Unwrapped" GroupBys

  • The built-in transform method:
    • transform can produce a scalar value to be broadcast to the shape of the group
    • transform can produce an object of the same shape as the input group
    • transform must not mutate its input
df = pd.DataFrame({'key':['a','b','c']*4,
                  'value':np.arange(12.)})
df
   key  value
0    a    0.0
1    b    1.0
2    c    2.0
3    a    3.0
4    b    4.0
5    c    5.0
6    a    6.0
7    b    7.0
8    c    8.0
9    a    9.0
10   b   10.0
11   c   11.0
g = df.groupby('key').value
g.mean()
key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64
g.transform(lambda x:x.mean())
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
#For built-in aggregation functions, we can pass a string alias as with the GroupBy agg method
g.transform('mean')
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
g.transform(lambda x:x*2)
0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64
#We can compute the ranks in descending order for each group
g.transform(lambda x:x.rank(ascending=False))
0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64
#Consider a group transformation function composed from simple aggregations
def normalize(x):
    return (x-x.mean())/x.std()
g.transform(normalize)
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
g.apply(normalize)
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
g.transform('mean')
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
normalized = (df['value']-g.transform('mean'))/g.transform('std')
normalized
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

12.2.2 Grouped Time Resampling

  • For time series data, the resample method is semantically a group operation based on a partitioning of time.
N = 15
times = pd.date_range('2017-05-20 00:00',freq='1min',periods=N)
times
DatetimeIndex(['2017-05-20 00:00:00', '2017-05-20 00:01:00',
               '2017-05-20 00:02:00', '2017-05-20 00:03:00',
               '2017-05-20 00:04:00', '2017-05-20 00:05:00',
               '2017-05-20 00:06:00', '2017-05-20 00:07:00',
               '2017-05-20 00:08:00', '2017-05-20 00:09:00',
               '2017-05-20 00:10:00', '2017-05-20 00:11:00',
               '2017-05-20 00:12:00', '2017-05-20 00:13:00',
               '2017-05-20 00:14:00'],
              dtype='datetime64[ns]', freq='T')
df = pd.DataFrame({'time':times,
                  'value':np.arange(N)})
df
                  time  value
0  2017-05-20 00:00:00      0
1  2017-05-20 00:01:00      1
2  2017-05-20 00:02:00      2
3  2017-05-20 00:03:00      3
4  2017-05-20 00:04:00      4
5  2017-05-20 00:05:00      5
6  2017-05-20 00:06:00      6
7  2017-05-20 00:07:00      7
8  2017-05-20 00:08:00      8
9  2017-05-20 00:09:00      9
10 2017-05-20 00:10:00     10
11 2017-05-20 00:11:00     11
12 2017-05-20 00:12:00     12
13 2017-05-20 00:13:00     13
14 2017-05-20 00:14:00     14
df.set_index('time').resample('5min').count()
                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5
df2 = pd.DataFrame({'time':times.repeat(3),
                   'key':np.tile(['a','b','c'],N),
                   'value':np.arange(N*3.0)})
df2[:7]
                 time key  value
0 2017-05-20 00:00:00   a    0.0
1 2017-05-20 00:00:00   b    1.0
2 2017-05-20 00:00:00   c    2.0
3 2017-05-20 00:01:00   a    3.0
4 2017-05-20 00:01:00   b    4.0
5 2017-05-20 00:01:00   c    5.0
6 2017-05-20 00:02:00   a    6.0
  • pd.TimeGrouper can no longer be called directly in recent pandas versions
  • https://blog.csdn.net/weixin_51701683/article/details/110710765
  • One workaround is importing resample from pandas.core, as done below; see also the pd.Grouper sketch at the end of this example
from pandas.core import resample
#To do the same resampling for each value of 'key', we can use the pandas TimeGrouper object
#One constraint with using TimeGrouper is that the time must be the index of the Series or DataFrame
time_key = resample.TimeGrouper('5min')
time_key
TimeGrouper(freq=<5 * Minutes>, axis=0, sort=True, closed='left', label='left', how='mean', convention='e', origin='start_day')
resampled = (df2.set_index('time')
             .groupby(['key',time_key])
             .sum())
resampled
                         value
key time
a   2017-05-20 00:00:00   30.0
    2017-05-20 00:05:00  105.0
    2017-05-20 00:10:00  180.0
b   2017-05-20 00:00:00   35.0
    2017-05-20 00:05:00  110.0
    2017-05-20 00:10:00  185.0
c   2017-05-20 00:00:00   40.0
    2017-05-20 00:05:00  115.0
    2017-05-20 00:10:00  190.0
resampled.reset_index()
  key                time  value
0   a 2017-05-20 00:00:00   30.0
1   a 2017-05-20 00:05:00  105.0
2   a 2017-05-20 00:10:00  180.0
3   b 2017-05-20 00:00:00   35.0
4   b 2017-05-20 00:05:00  110.0
5   b 2017-05-20 00:10:00  185.0
6   c 2017-05-20 00:00:00   40.0
7   c 2017-05-20 00:05:00  115.0
8   c 2017-05-20 00:10:00  190.0
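In current pandas releases, where TimeGrouper has been removed, pd.Grouper covers the same use case and is the supported spelling. A minimal sketch reproducing the aggregation above:
#pd.Grouper replaces the removed pd.TimeGrouper; the time values must
#either be the index or be named via the key argument
time_key = pd.Grouper(freq='5min')
resampled = (df2.set_index('time')
             .groupby(['key',time_key])
             .sum())
resampled.reset_index()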

12.3 Techniques for Method Chaining

  • In-place assignment may be faster than using assign, but assign enables more convenient method chaining; a minimal sketch of the contrast follows below.
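A minimal sketch of the contrast, assuming a DataFrame df with a numeric 'value' column like the one used in 12.2.1:
#In-place assignment: fast, but each step needs its own statement
df2 = df.copy()
df2['demeaned'] = df2['value'] - df2['value'].mean()

#assign returns a new DataFrame, so it can sit inside a method chain
result = (df.assign(demeaned=df['value'] - df['value'].mean())
            .query('demeaned > 0'))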

12.3.1 The pipe Method

  • The expressions f(df) and df.pipe(f) are equivalent, but pipe makes chained invocation easier; see the sketch below.
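A minimal sketch: group_demean below is a hypothetical helper, not a pandas function, but any function that expects a DataFrame as its first argument can be chained this way:
def group_demean(frame, by, col):
    #subtract the group-wise mean of col within groups defined by by
    return frame.assign(**{col: frame[col] - frame.groupby(by)[col].transform('mean')})

#f(df) and df.pipe(f) are equivalent, but pipe keeps the chain readable
result = (df[df['value'] >= 0]
          .pipe(group_demean, 'key', 'value'))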

12.4 Conclusion
