Learning Pandas: Time Series Data

1. The to_datetime method, used to create timestamps
import numpy as np
import pandas as pd
pd.to_datetime('2020/1/1')
Timestamp('2020-01-01 00:00:00')
# Convert a list of date strings into a datetime index
pd.Series(range(2),index=pd.to_datetime(['2020/1/1','2020/1/2']))
2020-01-01    0
2020-01-02    1
dtype: int64
# For a DataFrame whose columns are named year, month, and day, to_datetime assembles each row into a datetime automatically
df = pd.DataFrame({'year':[2020,2020],'month':[1,1],'day':[1,2]})
df
   year  month  day
0  2020      1    1
1  2020      1    2
pd.to_datetime(df)
0   2020-01-01
1   2020-01-02
dtype: datetime64[ns]
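The calls above let pandas guess the string format. As a minimal sketch (the sample strings here are made up for illustration), passing an explicit `format` removes day-first versus month-first ambiguity, and `errors='coerce'` converts unparseable entries to `NaT` instead of raising:

```python
import pandas as pd

# An explicit format makes '01/02/2020' unambiguous: day first, then month.
ts = pd.to_datetime('01/02/2020', format='%d/%m/%Y')
print(ts)  # 2020-02-01 00:00:00

# errors='coerce' replaces entries that cannot be parsed with NaT.
raw = pd.Series(['2020-01-01', 'not a date', '2020-01-03'])  # illustrative data
parsed = pd.to_datetime(raw, errors='coerce')
print(parsed.isna().tolist())  # [False, True, False]
```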
2. The date_range method

The key parameters of this method are start, end, periods (the number of timestamps), and freq (the spacing between them).

pd.date_range(start='2020/1/1',end='2020/1/10',periods=3)
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-05 12:00:00',
               '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', freq=None)
pd.date_range(start='2020/1/1',end='2020/1/10',freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')
pd.date_range(start='2020/1/1',periods=3,freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
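Each of the three calls above fixes three of the four parameters and lets pandas derive the fourth. The sketch below, with dates chosen only for illustration, shows that different combinations can describe the same set of timestamps:

```python
import pandas as pd

# start + end + periods: evenly spaced points between the two endpoints
a = pd.date_range(start='2020-01-01', end='2020-01-10', periods=4)
# start + periods + freq: step forward from start
b = pd.date_range(start='2020-01-01', periods=4, freq='3D')
# end + periods + freq: step backward from end
c = pd.date_range(end='2020-01-10', periods=4, freq='3D')

# All three contain 2020-01-01, 01-04, 01-07 and 01-10.
print(list(a) == list(b) == list(c))  # True
```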
3. DateOffset objects

The optional parameters of DateOffset include years/months/weeks/days/hours/minutes/seconds

pd.Timestamp('2020-01-01')
Timestamp('2020-01-01 00:00:00')
pd.Timestamp('2020-01-01')+pd.DateOffset(minutes = 20)-pd.DateOffset(days = 1)
Timestamp('2019-12-31 00:20:00')
pd.Timestamp('2020-01-01') + pd.offsets.Week(2)
Timestamp('2020-01-15 00:00:00')
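One point the examples above do not show is that a DateOffset follows the calendar, while a Timedelta is a fixed duration. A small sketch of the difference, using a month-end date picked for illustration:

```python
import pandas as pd

t = pd.Timestamp('2020-01-31')

# DateOffset(months=1) moves one calendar month and clips to the last valid day.
by_month = t + pd.DateOffset(months=1)
print(by_month)  # 2020-02-29 00:00:00

# Timedelta(days=30) always adds exactly 30 * 24 hours.
by_days = t + pd.Timedelta(days=30)
print(by_days)   # 2020-03-01 00:00:00
```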
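Offset operations on a sequence
Using the apply function (the offset's apply method was removed in pandas 2.0; on newer versions write timestamp + offset instead)
pd.date_range('20200101',periods=3,freq = 'Y') # 'Y' means year end (spelled 'YE' from pandas 2.2 on)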
DatetimeIndex(['2020-12-31', '2021-12-31', '2022-12-31'], dtype='datetime64[ns]', freq='A-DEC')
pd.Series(pd.offsets.BYearBegin(3).apply(i) for i in pd.date_range('20200101',periods=3,freq = 'Y') )
0   2023-01-02
1   2024-01-01
2   2025-01-01
dtype: datetime64[ns]
pd.date_range('20200101',periods=3,freq='Y')+pd.offsets.BYearBegin(3)
DatetimeIndex(['2023-01-02', '2024-01-01', '2025-01-01'], dtype='datetime64[ns]', freq='A-DEC')
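Because the `apply` method on offset objects is not available in pandas 2.0 and later, the element-wise version can be written with plain addition. This is a sketch that builds the three year-end dates explicitly rather than through a frequency alias, so it does not depend on which alias spelling the installed version accepts:

```python
import pandas as pd

# The same three year-end dates as in the example above.
idx = pd.to_datetime(['2020-12-31', '2021-12-31', '2022-12-31'])
offset = pd.offsets.BYearBegin(3)  # roll forward three business-year starts

# Element-wise addition replaces offset.apply(ts).
shifted = pd.Series([ts + offset for ts in idx])
print(shifted.tolist())
```

The result matches the output shown above: 2023-01-02, 2024-01-01, and 2025-01-01.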

Indexing and Attributes of Time Series

1. Indexing and slicing
rng = pd.date_range('2020','2021',freq='W')
ts = pd.Series(np.random.rand(len(rng)),index=rng)
ts.head()
2020-01-05    0.009639
2020-01-12    0.061814
2020-01-19    0.470897
2020-01-26    0.803914
2020-02-02    0.104896
Freq: W-SUN, dtype: float64
ts['2020-01-05']
0.009639339138300618
2. Subset indexing
ts['2020-7']
2020-07-05    0.073073
2020-07-12    0.593621
2020-07-19    0.028066
2020-07-26    0.537048
Freq: W-SUN, dtype: float64
ts['2011-1':'20200726'].head()
2020-01-05    0.009639
2020-01-12    0.061814
2020-01-19    0.470897
2020-01-26    0.803914
2020-02-02    0.104896
Freq: W-SUN, dtype: float64
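The subset selections above rely on partial string indexing. As a small deterministic sketch (the integer values are placeholders, not the random data used above), note that a partial string selects a whole period and that string slices include both endpoints:

```python
import pandas as pd

# Eight consecutive Sundays starting 2020-01-05, with values 0..7.
idx = pd.date_range('2020-01-05', periods=8, freq='W-SUN')
ts = pd.Series(range(8), index=idx)

# A partial string selects every entry in that month.
jan = ts.loc['2020-01']
print(jan.tolist())  # [0, 1, 2, 3]

# String slices are inclusive on both ends, unlike positional slices.
sl = ts.loc['2020-01-12':'2020-02-02']
print(sl.tolist())   # [1, 2, 3, 4]
```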
3. Timestamp attributes

Use the dt accessor to get date and time information

pd.Series(ts.index).head()
0   2020-01-05
1   2020-01-12
2   2020-01-19
3   2020-01-26
4   2020-02-02
dtype: datetime64[ns]
pd.Series(ts.index).dt.month.head()
0    1
1    1
2    1
3    1
4    2
dtype: int64
pd.Series(ts.index).dt.day.head()
0     5
1    12
2    19
3    26
4     2
dtype: int64
Use strftime to change the time format
pd.Series(ts.index).dt.strftime('%Y*%m*%d').head()
0    2020*01*05
1    2020*01*12
2    2020*01*19
3    2020*01*26
4    2020*02*02
dtype: object
For a DatetimeIndex, the same information is available directly as attributes
pd.date_range('2020','2021',freq='W')
DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26',
               '2020-02-02', '2020-02-09', '2020-02-16', '2020-02-23',
               '2020-03-01', '2020-03-08', '2020-03-15', '2020-03-22',
               '2020-03-29', '2020-04-05', '2020-04-12', '2020-04-19',
               '2020-04-26', '2020-05-03', '2020-05-10', '2020-05-17',
               '2020-05-24', '2020-05-31', '2020-06-07', '2020-06-14',
               '2020-06-21', '2020-06-28', '2020-07-05', '2020-07-12',
               '2020-07-19', '2020-07-26', '2020-08-02', '2020-08-09',
               '2020-08-16', '2020-08-23', '2020-08-30', '2020-09-06',
               '2020-09-13', '2020-09-20', '2020-09-27', '2020-10-04',
               '2020-10-11', '2020-10-18', '2020-10-25', '2020-11-01',
               '2020-11-08', '2020-11-15', '2020-11-22', '2020-11-29',
               '2020-12-06', '2020-12-13', '2020-12-20', '2020-12-27'],
              dtype='datetime64[ns]', freq='W-SUN')
pd.date_range('2020','2021',freq='W').month
Int64Index([ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  3,  4,  4,  4,  4,
             5,  5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8,
             8,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12,
            12],
           dtype='int64')
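Besides month and day, the dt accessor exposes several other components. The following sketch uses three hand-picked dates from 2020 to show a few of them:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2020-01-05', '2020-02-29', '2020-12-31']))

names = dates.dt.day_name()   # weekday names (English by default)
dow = dates.dt.dayofweek      # Monday=0 ... Sunday=6
quarters = dates.dt.quarter   # calendar quarter, 1..4

print(names.tolist())     # ['Sunday', 'Saturday', 'Thursday']
print(dow.tolist())       # [6, 5, 3]
print(quarters.tolist())  # [1, 1, 4]
```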

Resampling

1. Resampling with the resample function
# Create a DataFrame indexed by time, with one index entry per second
df_r = pd.DataFrame(np.random.rand(1000,3),index=pd.date_range('1/1/2020',freq='S',periods=1000),columns=['A','B','C'])
df_r.head()
                            A         B         C
2020-01-01 00:00:00  0.030495  0.346511  0.329326
2020-01-01 00:00:01  0.834827  0.302838  0.550707
2020-01-01 00:00:02  0.041700  0.200662  0.000873
2020-01-01 00:00:03  0.483835  0.211402  0.447749
2020-01-01 00:00:04  0.599844  0.850822  0.650199
# Resample; r is a Resampler object
r = df_r.resample('3min')
r
<pandas.core.resample.DatetimeIndexResampler object at 0x0000023ABC3C19E8>
# Sum over 3-minute intervals
r.sum()
                             A          B          C
2020-01-01 00:00:00  92.942598  89.039870  90.154015
2020-01-01 00:03:00  89.688035  86.098906  89.821036
2020-01-01 00:06:00  90.771298  90.452599  90.166902
2020-01-01 00:09:00  93.041682  93.163446  90.303107
2020-01-01 00:12:00  92.641535  95.147615  89.392215
2020-01-01 00:15:00  55.440978  50.845603  50.553468
# Mean over 3-minute intervals
r.mean()
                            A         B         C
2020-01-01 00:00:00  0.516348  0.494666  0.500856
2020-01-01 00:03:00  0.498267  0.478327  0.499006
2020-01-01 00:06:00  0.504285  0.502514  0.500927
2020-01-01 00:09:00  0.516898  0.517575  0.501684
2020-01-01 00:12:00  0.514675  0.528598  0.496623
2020-01-01 00:15:00  0.554410  0.508456  0.505535
df_r2 = pd.DataFrame(np.random.randn(200,3),index=pd.date_range('1/1/2020',freq= 'D',periods=200),columns=['A','B','C'])
df_r2.head()
                   A         B         C
2020-01-01  0.917372 -0.305394 -1.163468
2020-01-02  1.027062 -0.722735 -0.390128
2020-01-03  0.902275  0.306910 -0.482234
2020-01-04 -0.362833  0.583678  0.716035
2020-01-05 -0.467158 -1.345731 -2.380988
r = df_r2.resample('CBMS')
r.sum()
                    A         B          C
2020-01-01   1.068820  3.800052  -5.188159
2020-02-03   2.292636  3.275062  -7.076462
2020-03-02   8.565083 -1.366455  -0.495671
2020-04-01   6.730568  2.981884  -4.715256
2020-05-01   6.593104  4.132095 -13.792500
2020-06-01  11.134183 -1.903034 -19.861671
2020-07-01  -0.863740 -1.153881  -4.164100
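Since the examples above use random data, the binning is hard to check by eye. Here is a minimal deterministic sketch, assuming six one-minute observations invented for illustration, that shows how resample assigns rows to left-closed 3-minute bins labelled by their start time:

```python
import pandas as pd

# Six observations, one per minute, starting at midnight.
idx = pd.date_range('2020-01-01', periods=6, freq='min')
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Bins are [00:00, 00:03) and [00:03, 00:06), each labelled by its left edge.
out = s.resample('3min').sum()
print(out.tolist())  # [6, 15]
print(out.index[1])  # 2020-01-01 00:03:00
```

The same rule explains the smaller sums in the last row of the 3-minute table above: 1000 seconds fill five bins of 180 rows each and leave only 100 rows for the final bin.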
2. Aggregating resampled data
r = df_r.resample('3T')
r['A'].mean()
2020-01-01 00:00:00    0.516348
2020-01-01 00:03:00    0.498267
2020-01-01 00:06:00    0.504285
2020-01-01 00:09:00    0.516898
2020-01-01 00:12:00    0.514675
2020-01-01 00:15:00    0.554410
Freq: 3T, Name: A, dtype: float64
r['A'].agg([np.sum,np.mean,np.std])
                           sum      mean       std
2020-01-01 00:00:00  92.942598  0.516348  0.288201
2020-01-01 00:03:00  89.688035  0.498267  0.293031
2020-01-01 00:06:00  90.771298  0.504285  0.280504
2020-01-01 00:09:00  93.041682  0.516898  0.287934
2020-01-01 00:12:00  92.641535  0.514675  0.278168
2020-01-01 00:15:00  55.440978  0.554410  0.260155
# Use functions or lambda expressions
r.agg({'A':np.sum,'B':lambda x:max(x)-min(x)})
                             A         B
2020-01-01 00:00:00  92.942598  0.997325
2020-01-01 00:03:00  89.688035  0.996741
2020-01-01 00:06:00  90.771298  0.993118
2020-01-01 00:09:00  93.041682  0.988965
2020-01-01 00:12:00  92.641535  0.993741
2020-01-01 00:15:00  55.440978  0.975447
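The per-column aggregation above can be verified on a small hand-made frame. In this sketch the column values are invented for illustration, and the string name 'sum' is used in place of np.sum, on the assumption that string names are the spelling newer pandas versions prefer:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=6, freq='min')
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [10, 40, 20, 5, 5, 9]}, index=idx)

# A: total per 3-minute bin; B: range (max minus min) per bin.
out = df.resample('3min').agg({'A': 'sum',
                               'B': lambda x: x.max() - x.min()})
print(out['A'].tolist())  # [6, 15]
print(out['B'].tolist())  # [30, 4]
```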
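3. Iterating over resampled groups
Iterating over resampled groups works much like iterating over a groupby: each group can be processed in turn.
# Group by time interval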
small = pd.Series(range(6),index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00'
                                                 , '2020-01-01 00:31:00','2020-01-01 01:00:00'
                                                 ,'2020-01-01 03:00:00','2020-01-01 03:05:00']))
resampled = small.resample('H')
for name, group in resampled:
    print("Group: ", name)
    print("-" * 27)
    print(group, end="\n\n")

Group:  2020-01-01 00:00:00
---------------------------
2020-01-01 00:00:00    0
2020-01-01 00:30:00    1
2020-01-01 00:31:00    2
dtype: int64

Group:  2020-01-01 01:00:00
---------------------------
2020-01-01 01:00:00    3
dtype: int64

Group:  2020-01-01 02:00:00
---------------------------
Series([], dtype: int64)

Group:  2020-01-01 03:00:00
---------------------------
2020-01-01 03:00:00    4
2020-01-01 03:05:00    5
dtype: int64
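As the 02:00 group above shows, iteration can yield bins that contain no rows. A short sketch of one way to skip them, using a smaller made-up series and the '60min' alias (an assumption chosen here because it is accepted across pandas versions):

```python
import pandas as pd

small = pd.Series(range(4), index=pd.to_datetime([
    '2020-01-01 00:00', '2020-01-01 00:30',
    '2020-01-01 01:00', '2020-01-01 03:05']))

# Record the size of every hourly group produced by the iteration.
sizes = {name: len(group) for name, group in small.resample('60min')}

# Keep only the bins that actually contain observations.
non_empty = [name for name, n in sizes.items() if n > 0]
print(non_empty)
```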

Window Functions

rolling/expanding
s = pd.Series(np.random.rand(1000),index=pd.date_range('1/1/2020',periods=1000))
s.head()
2020-01-01    0.842824
2020-01-02    0.826125
2020-01-03    0.860557
2020-01-04    0.511902
2020-01-05    0.144901
Freq: D, dtype: float64
s.rolling(window=50)
Rolling [window=50,center=False,axis=0]
s.rolling(window=50).mean()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04         NaN
2020-01-05         NaN
                ...   
2022-09-22    0.498686
2022-09-23    0.511147
2022-09-24    0.514356
2022-09-25    0.509788
2022-09-26    0.505171
Freq: D, Length: 1000, dtype: float64
s.rolling(window=50,min_periods=3).mean().head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03    0.843169
2020-01-04    0.760352
2020-01-05    0.637262
Freq: D, dtype: float64
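The effect of min_periods is easier to see on a short series with known values. In this sketch (values chosen for illustration), the default requires a full window before producing a result, while min_periods=1 starts averaging from the first observation:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: min_periods equals the window size, so the first two results are NaN.
full = s.rolling(window=3).mean()
print(full.tolist())     # [nan, nan, 2.0, 3.0, 4.0]

# min_periods=1: average over however many values are available so far.
partial = s.rolling(window=3, min_periods=1).mean()
print(partial.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```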

The plain expanding function is equivalent to rolling(window = len(s),min_periods = 1); it performs a cumulative computation over the series

s.rolling(window=len(s),min_periods=1).sum().head()
2020-01-01    0.842824
2020-01-02    1.668950
2020-01-03    2.529507
2020-01-04    3.041409
2020-01-05    3.186310
Freq: D, dtype: float64
s.expanding().sum().head()
2020-01-01    0.842824
2020-01-02    1.668950
2020-01-03    2.529507
2020-01-04    3.041409
2020-01-05    3.186310
Freq: D, dtype: float64
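The equivalence can also be checked programmatically rather than by comparing printed output. A minimal sketch on a short made-up series, which additionally compares against cumsum and shows a second expanding statistic:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

a = s.expanding().sum()
b = s.rolling(window=len(s), min_periods=1).sum()
c = s.cumsum()

# All three give the running total 3, 4, 8, 9, 14.
print(np.allclose(a, b) and np.allclose(a, c))  # True

# expanding works with other statistics too, for example a running maximum.
running_max = s.expanding().max()
print(running_max.tolist())  # [3.0, 3.0, 4.0, 4.0, 5.0]
```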

Reference: https://github.com/datawhalechina/joyful-pandas
