Learning Pandas: Time Series Data

Miss小姐姐

Category: Python Study Notes (Python学习笔记)

Article link: https://blog.csdn.net/qq_26982913/article/details/107029841

1. The to_datetime method, used to create time points (Timestamps)

import numpy as np
import pandas as pd
pd.to_datetime('2020/1/1')

Timestamp('2020-01-01 00:00:00')

# Convert a list of date strings into a DatetimeIndex
pd.Series(range(2),index=pd.to_datetime(['2020/1/1','2020/1/2']))

2020-01-01    0
2020-01-02    1
dtype: int64

# For a DataFrame whose columns are named year, month and day, to_datetime can assemble each row into a datetime automatically
df = pd.DataFrame({'year':[2020,2020],'month':[1,1],'day':[1,2]})
df

	year	month	day
0	2020	1	1
1	2020	1	2

pd.to_datetime(df)

0   2020-01-01
1   2020-01-02
dtype: datetime64[ns]
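
Real-world date strings are often messier than the ones above. The following is a minimal sketch, assuming a hypothetical Series `raw` that contains one unparseable entry (the data is made up for illustration). It uses the `format` and `errors` arguments of `pd.to_datetime` to parse with an explicit pattern and to turn bad values into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical input: two valid dates and one bad value (illustration only)
raw = pd.Series(["2020/01/01", "2020/01/02", "not a date"])

# An explicit format avoids ambiguity; errors="coerce" maps failures to NaT
parsed = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")

print(parsed)
print(parsed.isna().tolist())  # which entries failed to parse
```

Coercing to NaT keeps the Series length unchanged, so the failed rows can be inspected or dropped later.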

2. The date_range method

The key parameters of this method are start, end, periods (the number of time points), and freq (the spacing between points).

pd.date_range(start='2020/1/1',end='2020/1/10',periods=3)

DatetimeIndex(['2020-01-01 00:00:00', '2020-01-05 12:00:00',
               '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', freq=None)

pd.date_range(start='2020/1/1',end='2020/1/10',freq='D')

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

pd.date_range(start='2020/1/1',periods=3,freq='D')

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
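
The calls above suggest that start, end, periods and freq are different ways of pinning down the same range. As a small sketch (variable names are my own), the snippet below builds the same ten days in three ways and then changes only the step size.

```python
import pandas as pd

# Three ways to describe the same ten daily time points
by_end = pd.date_range(start="2020-01-01", end="2020-01-10", freq="D")
by_count = pd.date_range(start="2020-01-01", periods=10, freq="D")
back_from_end = pd.date_range(end="2020-01-10", periods=10, freq="D")
print(by_end.equals(by_count), by_end.equals(back_from_end))

# A multiple such as "2D" changes the step between points
every_other_day = pd.date_range(start="2020-01-01", end="2020-01-10", freq="2D")
print(every_other_day)
```

With a 2-day step the last point is 2020-01-09, because the end date is only included when it falls exactly on the grid.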

3. DateOffset objects

The optional parameters of DateOffset include years/months/weeks/days/hours/minutes/seconds

pd.Timestamp('2020-01-01')

Timestamp('2020-01-01 00:00:00')

pd.Timestamp('2020-01-01')+pd.DateOffset(minutes = 20)-pd.DateOffset(days = 1)

Timestamp('2019-12-31 00:20:00')

pd.Timestamp('2020-01-01') + pd.offsets.Week(2)

Timestamp('2020-01-15 00:00:00')
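
One thing worth checking is how a DateOffset differs from a fixed-length Timedelta. The sketch below (my own example, not from the referenced tutorial) shifts the same date by "one calendar month" and by "30 days"; the two are not the same thing.

```python
import pandas as pd

ts = pd.Timestamp("2020-01-31")

# DateOffset follows the calendar: one month after Jan 31 is clipped
# to the last day of February (2020 is a leap year)
calendar_shift = ts + pd.DateOffset(months=1)

# Timedelta is a fixed duration: exactly 30 * 24 hours later
fixed_shift = ts + pd.Timedelta(days=30)

print(calendar_shift)  # 2020-02-29 00:00:00
print(fixed_shift)     # 2020-03-01 00:00:00
```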

Offset operations on sequences

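Using the offset's apply method (deprecated in newer pandas releases, where adding the offset with + gives the same result)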

pd.date_range('20200101',periods=3,freq = 'Y') # 'Y' means year end

DatetimeIndex(['2020-12-31', '2021-12-31', '2022-12-31'], dtype='datetime64[ns]', freq='A-DEC')

pd.Series(pd.offsets.BYearBegin(3).apply(i) for i in pd.date_range('20200101',periods=3,freq = 'Y') )

0   2023-01-02
1   2024-01-01
2   2025-01-01
dtype: datetime64[ns]

pd.date_range('20200101',periods=3,freq='Y')+pd.offsets.BYearBegin(3)

DatetimeIndex(['2023-01-02', '2024-01-01', '2025-01-01'], dtype='datetime64[ns]', freq='A-DEC')
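
The two approaches above should agree. The following sketch checks this without relying on `apply`: it adds the offset to each Timestamp with `+` and compares the result with the vectorized addition. The year-end dates are written out explicitly so the snippet does not depend on a particular frequency alias.

```python
import pandas as pd

# The same three year-end dates as above, written out explicitly
year_ends = pd.to_datetime(["2020-12-31", "2021-12-31", "2022-12-31"])

# Move forward to the third following business-year start
offset = pd.offsets.BYearBegin(3)

# Element by element, using + on each Timestamp
one_by_one = pd.Series([ts + offset for ts in year_ends])

# Vectorized, adding the offset to the whole index at once
vectorized = year_ends + offset

print(one_by_one.tolist() == vectorized.tolist())
```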

Indexing and attributes of time series

1. Indexing and slicing

rng = pd.date_range('2020','2021',freq='W')
ts = pd.Series(np.random.rand(len(rng)),index=rng)
ts.head()

2020-01-05    0.009639
2020-01-12    0.061814
2020-01-19    0.470897
2020-01-26    0.803914
2020-02-02    0.104896
Freq: W-SUN, dtype: float64

ts['2020-01-05']

0.009639339138300618

2. Subset indexing (partial date strings)

ts['2020-7']

2020-07-05    0.073073
2020-07-12    0.593621
2020-07-19    0.028066
2020-07-26    0.537048
Freq: W-SUN, dtype: float64

ts['2011-1':'20200726'].head()

2020-01-05    0.009639
2020-01-12    0.061814
2020-01-19    0.470897
2020-01-26    0.803914
2020-02-02    0.104896
Freq: W-SUN, dtype: float64
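
Because the series above holds random values, the selections are hard to verify by eye. Here is a small sketch with deterministic values (my own toy data) that uses `.loc` with partial date strings and counts how many weekly points fall in each selection.

```python
import pandas as pd

# Weekly (Sunday) index over 2020 with simple integer values
rng = pd.date_range("2020-01-01", "2020-12-31", freq="W")
ts = pd.Series(range(len(rng)), index=rng)

# A year-month string selects every point in that month
july = ts.loc["2020-07"]

# Slicing with partial strings includes both end months in full
july_to_august = ts.loc["2020-07":"2020-08"]

print(len(july), len(july_to_august))
```

Note that a slice on a DatetimeIndex is inclusive on both ends, unlike positional slicing.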

3. Attributes of time points

Use the dt accessor to get information about the dates and times

pd.Series(ts.index).head()

0   2020-01-05
1   2020-01-12
2   2020-01-19
3   2020-01-26
4   2020-02-02
dtype: datetime64[ns]

pd.Series(ts.index).dt.month.head()

0    1
1    1
2    1
3    1
4    2
dtype: int64

pd.Series(ts.index).dt.day.head()

0     5
1    12
2    19
3    26
4     2
dtype: int64

Use strftime to reformat the dates as strings

pd.Series(ts.index).dt.strftime('%Y*%m*%d').head()

0    2020*01*05
1    2020*01*12
2    2020*01*19
3    2020*01*26
4    2020*02*02
dtype: object

For a DatetimeIndex, the same information is available directly as attributes

pd.date_range('2020','2021',freq='W')

DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26',
               '2020-02-02', '2020-02-09', '2020-02-16', '2020-02-23',
               '2020-03-01', '2020-03-08', '2020-03-15', '2020-03-22',
               '2020-03-29', '2020-04-05', '2020-04-12', '2020-04-19',
               '2020-04-26', '2020-05-03', '2020-05-10', '2020-05-17',
               '2020-05-24', '2020-05-31', '2020-06-07', '2020-06-14',
               '2020-06-21', '2020-06-28', '2020-07-05', '2020-07-12',
               '2020-07-19', '2020-07-26', '2020-08-02', '2020-08-09',
               '2020-08-16', '2020-08-23', '2020-08-30', '2020-09-06',
               '2020-09-13', '2020-09-20', '2020-09-27', '2020-10-04',
               '2020-10-11', '2020-10-18', '2020-10-25', '2020-11-01',
               '2020-11-08', '2020-11-15', '2020-11-22', '2020-11-29',
               '2020-12-06', '2020-12-13', '2020-12-20', '2020-12-27'],
              dtype='datetime64[ns]', freq='W-SUN')

pd.date_range('2020','2021',freq='W').month

Int64Index([ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  3,  4,  4,  4,  4,
             5,  5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8,
             8,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12,
            12],
           dtype='int64')
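
Besides month and day, a few other attributes are handy. The sketch below (a short example of my own) shows that a DatetimeIndex exposes them directly, while a Series of datetimes goes through the `dt` accessor.

```python
import pandas as pd

# Three consecutive Sundays
idx = pd.date_range("2020-01-05", periods=3, freq="W")

# On a DatetimeIndex, attributes and methods are available directly
print(idx.day_name().tolist())
print(idx.dayofweek.tolist())  # Monday is 0, Sunday is 6

# On a Series, the same information goes through the dt accessor
as_series = pd.Series(idx)
print(as_series.dt.year.tolist())
print(as_series.dt.day_name().tolist())
```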

Resampling

1. Resampling with the resample function

# Create a DataFrame indexed by time, with one row per second
df_r = pd.DataFrame(np.random.rand(1000,3),index=pd.date_range('1/1/2020',freq='S',periods=1000),columns=['A','B','C'])
df_r.head()

	A	B	C
2020-01-01 00:00:00	0.030495	0.346511	0.329326
2020-01-01 00:00:01	0.834827	0.302838	0.550707
2020-01-01 00:00:02	0.041700	0.200662	0.000873
2020-01-01 00:00:03	0.483835	0.211402	0.447749
2020-01-01 00:00:04	0.599844	0.850822	0.650199

# Resample; r is a Resampler object
r = df_r.resample('3min')
r

<pandas.core.resample.DatetimeIndexResampler object at 0x0000023ABC3C19E8>

# Sum over 3-minute intervals
r.sum()

	A	B	C
2020-01-01 00:00:00	92.942598	89.039870	90.154015
2020-01-01 00:03:00	89.688035	86.098906	89.821036
2020-01-01 00:06:00	90.771298	90.452599	90.166902
2020-01-01 00:09:00	93.041682	93.163446	90.303107
2020-01-01 00:12:00	92.641535	95.147615	89.392215
2020-01-01 00:15:00	55.440978	50.845603	50.553468

# Mean over 3-minute intervals
r.mean()

	A	B	C
2020-01-01 00:00:00	0.516348	0.494666	0.500856
2020-01-01 00:03:00	0.498267	0.478327	0.499006
2020-01-01 00:06:00	0.504285	0.502514	0.500927
2020-01-01 00:09:00	0.516898	0.517575	0.501684
2020-01-01 00:12:00	0.514675	0.528598	0.496623
2020-01-01 00:15:00	0.554410	0.508456	0.505535

df_r2 = pd.DataFrame(np.random.randn(200,3),index=pd.date_range('1/1/2020',freq= 'D',periods=200),columns=['A','B','C'])
df_r2.head()

	A	B	C
2020-01-01	0.917372	-0.305394	-1.163468
2020-01-02	1.027062	-0.722735	-0.390128
2020-01-03	0.902275	0.306910	-0.482234
2020-01-04	-0.362833	0.583678	0.716035
2020-01-05	-0.467158	-1.345731	-2.380988

r = df_r2.resample('CBMS')
r.sum()

	A	B	C
2020-01-01	1.068820	3.800052	-5.188159
2020-02-03	2.292636	3.275062	-7.076462
2020-03-02	8.565083	-1.366455	-0.495671
2020-04-01	6.730568	2.981884	-4.715256
2020-05-01	6.593104	4.132095	-13.792500
2020-06-01	11.134183	-1.903034	-19.861671
2020-07-01	-0.863740	-1.153881	-4.164100
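
With random data it is hard to confirm what resample actually did. The following is a tiny deterministic sketch (values chosen by me) in which the bins can be checked by hand: six one-minute points are grouped into two 3-minute bins.

```python
import pandas as pd

# Six points, one per minute, with values 1..6
idx = pd.date_range("2020-01-01", periods=6, freq="min")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Each 3-minute bin is labeled by its left edge
sums = s.resample("3min").sum()    # 1+2+3 and 4+5+6
means = s.resample("3min").mean()  # 2.0 and 5.0

print(sums)
print(means)
```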

2. Aggregating resampled data

r = df_r.resample('3T')

r['A'].mean()

2020-01-01 00:00:00    0.516348
2020-01-01 00:03:00    0.498267
2020-01-01 00:06:00    0.504285
2020-01-01 00:09:00    0.516898
2020-01-01 00:12:00    0.514675
2020-01-01 00:15:00    0.554410
Freq: 3T, Name: A, dtype: float64

r['A'].agg([np.sum,np.mean,np.std])

	sum	mean	std
2020-01-01 00:00:00	92.942598	0.516348	0.288201
2020-01-01 00:03:00	89.688035	0.498267	0.293031
2020-01-01 00:06:00	90.771298	0.504285	0.280504
2020-01-01 00:09:00	93.041682	0.516898	0.287934
2020-01-01 00:12:00	92.641535	0.514675	0.278168
2020-01-01 00:15:00	55.440978	0.554410	0.260155

# Use functions / lambda expressions
r.agg({'A':np.sum,'B':lambda x:max(x)-min(x)})

	A	B
2020-01-01 00:00:00	92.942598	0.997325
2020-01-01 00:03:00	89.688035	0.996741
2020-01-01 00:06:00	90.771298	0.993118
2020-01-01 00:09:00	93.041682	0.988965
2020-01-01 00:12:00	92.641535	0.993741
2020-01-01 00:15:00	55.440978	0.975447
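
As a variation on the aggregation above, here is a sketch with small hand-checkable data (made up for illustration). It passes the aggregation as the string "sum" rather than `np.sum`; as far as I know, recent pandas releases may warn when NumPy functions are passed to agg, so string names are the safer spelling.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=6, freq="min")
df = pd.DataFrame(
    {"A": [1, 2, 3, 4, 5, 6], "B": [10, 20, 40, 5, 5, 8]},
    index=idx,
)

# Column A: sum per bin; column B: range (max - min) per bin
result = df.resample("3min").agg(
    {"A": "sum", "B": lambda x: x.max() - x.min()}
)
print(result)
```

The first bin gives A = 6 and B = 40 - 10 = 30; the second gives A = 15 and B = 8 - 5 = 3.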

3. Iterating over resampled groups

Iterating over resampled groups works much like iterating over a groupby: each group can be processed in turn.

# Group by time period
small = pd.Series(range(6),index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00'
                                                 , '2020-01-01 00:31:00','2020-01-01 01:00:00'
                                                 ,'2020-01-01 03:00:00','2020-01-01 03:05:00']))
resampled = small.resample('H')
for name, group in resampled:
    print("Group: ", name)
    print("-" * 27)
    print(group, end="\n\n")

Group:  2020-01-01 00:00:00
---------------------------
2020-01-01 00:00:00    0
2020-01-01 00:30:00    1
2020-01-01 00:31:00    2
dtype: int64

Group:  2020-01-01 01:00:00
---------------------------
2020-01-01 01:00:00    3
dtype: int64

Group:  2020-01-01 02:00:00
---------------------------
Series([], dtype: int64)

Group:  2020-01-01 03:00:00
---------------------------
2020-01-01 03:00:00    4
2020-01-01 03:05:00    5
dtype: int64
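
The output above contains an empty group for the 02:00 hour. Below is a minimal sketch of one way to deal with that, reusing the same six timestamps: `size()` reports the number of rows per bin (including zero), and the loop simply skips empty groups. The window is written as "60min" here, which to my knowledge is accepted across pandas versions.

```python
import pandas as pd

small = pd.Series(
    range(6),
    index=pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-01 00:30:00",
        "2020-01-01 00:31:00", "2020-01-01 01:00:00",
        "2020-01-01 03:00:00", "2020-01-01 03:05:00",
    ]),
)
resampled = small.resample("60min")

# Number of rows in each hourly bin; hours with no data show up as 0
sizes = resampled.size()
print(sizes)

# Keep only the bins that actually contain data
non_empty = [name for name, group in resampled if not group.empty]
print(non_empty)
```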

Window functions

rolling/expanding

s = pd.Series(np.random.rand(1000),index=pd.date_range('1/1/2020',periods=1000))
s.head()

2020-01-01    0.842824
2020-01-02    0.826125
2020-01-03    0.860557
2020-01-04    0.511902
2020-01-05    0.144901
Freq: D, dtype: float64

s.rolling(window=50)

Rolling [window=50,center=False,axis=0]

s.rolling(window=50).mean()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04         NaN
2020-01-05         NaN
                ...   
2022-09-22    0.498686
2022-09-23    0.511147
2022-09-24    0.514356
2022-09-25    0.509788
2022-09-26    0.505171
Freq: D, Length: 1000, dtype: float64

s.rolling(window=50,min_periods=3).mean().head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03    0.843169
2020-01-04    0.760352
2020-01-05    0.637262
Freq: D, dtype: float64
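
To see exactly what window and min_periods do, here is a small sketch on five known values (my own toy data). It also tries a time-based window, written as an offset string, which on daily data covers the current and the previous day.

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2020-01-01", periods=5),
)

# By default a result needs a full window, so the first two are NaN
strict = s.rolling(window=3).mean()

# min_periods=1 produces a value as soon as one observation exists
lenient = s.rolling(window=3, min_periods=1).mean()

# A time-based window: everything within the last 2 days of each point
two_day = s.rolling("2D").sum()

print(strict)
print(lenient)
print(two_day)
```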

The plain expanding function is equivalent to rolling(window = len(s),min_periods = 1); it computes cumulative statistics over the series

s.rolling(window=len(s),min_periods=1).sum().head()

2020-01-01    0.842824
2020-01-02    1.668950
2020-01-03    2.529507
2020-01-04    3.041409
2020-01-05    3.186310
Freq: D, dtype: float64

s.expanding().sum().head()

2020-01-01    0.842824
2020-01-02    1.668950
2020-01-03    2.529507
2020-01-04    3.041409
2020-01-05    3.186310
Freq: D, dtype: float64
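
The equivalence can be checked on a short series with known values. This sketch (toy data of my own) compares expanding, the equivalent rolling call and cumsum, and also shows an expanding mean, which cumsum alone does not provide.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

via_expanding = s.expanding().sum()
via_rolling = s.rolling(window=len(s), min_periods=1).sum()
via_cumsum = s.cumsum()

# The expanding window works with other statistics too, e.g. a running mean
running_mean = s.expanding().mean()

print(via_expanding.tolist())
print(running_mean.tolist())
```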

Reference: https://github.com/datawhalechina/joyful-pandas
