Python time series: pandas basics

木心心以向荣

Article link: https://blog.csdn.net/weixin_57194935/article/details/127359670

Contents

Preface
1. Timestamps
- Timestamp constructor
- Converting a whole column to datetimes
2. Time series
3. Time indexes

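Preface

Time series are a common data structure in many fields, such as finance, economics, and physics. A time series is formed by data observed at different points in time.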

1. Timestamps

Many files store time data as strings. pandas provides functions that convert those strings into datetime objects.

Timestamp constructor

pandas.Timestamp(ts_input, freq=None, tz=None, unit=None, year=None, month=None, day=None, hour=None, minute=None, second=None, microsecond=None, nanosecond=None, tzinfo=None, *, fold=None)

Parameter	Meaning
ts_input	The value to convert to a Timestamp
unit	Unit of ts_input when it is an int or float; one of 'D', 'h', 'm', 's', 'ms', 'us', 'ns'
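
As a minimal sketch of the parameters above, the same instant can be built from a string, from keyword components, or from a number of seconds since the Unix epoch. The variable names are illustrative only and do not come from the original article.

```python
import pandas as pd

# From a string: pandas parses the text into a Timestamp
ts_from_string = pd.Timestamp("2019-01-10 12:30:00")

# From individual components passed as keyword arguments
ts_from_parts = pd.Timestamp(year=2019, month=1, day=10, hour=12, minute=30)

# From a number: unit="s" says the integer counts seconds since 1970-01-01
ts_from_epoch = pd.Timestamp(1547123400, unit="s")

print(ts_from_string, ts_from_parts, ts_from_epoch)
```

All three calls describe the same point in time, so they compare equal.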

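Converting a whole column to datetimes

pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)

Parameter	Meaning
dayfirst / yearfirst	When the field order of a date string is ambiguous, parse the first field as the day / the year
format	strftime-style format used to parse the input strings, e.g. "%Y-%m-%d" (it describes the input, not the output)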
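
The sketch below illustrates these parameters on a few made-up strings (the sample values are assumptions, not data from the article). It shows format describing the layout of the input, errors="coerce" turning unparseable values into NaT instead of raising, and dayfirst changing how an ambiguous date is read.

```python
import pandas as pd

# A small column of strings; the last entry is deliberately invalid
raw = pd.Series(["2019-01-10 12:30:00", "2019-01-10 12:45:00", "not a date"])

# format describes how the input strings are laid out;
# errors="coerce" replaces values that cannot be parsed with NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")
print(parsed)

# "10/01/2019" is ambiguous: by default the first field is read as the month,
# with dayfirst=True it is read as the day
month_first = pd.to_datetime("10/01/2019")
day_first = pd.to_datetime("10/01/2019", dayfirst=True)
print(month_first, day_first)
```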

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the water meter readings from an Excel file and show the first 5 rows
water_data=pd.read_excel('一季度用水（小）.xlsx')
print(water_data.head(5))

         水表名           水表号      采集时间      上次读数      当前读数
0  XXX4舍热泵热水  1.836719e+09  2019/1/10 12:30:00  25483.09  25483.65
1  XXX4舍热泵热水  1.836719e+09  2019/1/10 12:45:00  25483.65  25484.23
2  XXX4舍热泵热水  1.836719e+09  2019/1/10 13:00:00  25484.23  25484.39
3  XXX4舍热泵热水  1.836719e+09  2019/1/10 13:15:00  25484.39  25484.46
4  XXX4舍热泵热水  1.836719e+09  2019/1/10 13:30:00  25484.46  25485.30

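# Water usage = current reading - previous reading
water_data['用水量']=water_data['当前读数']-water_data['上次读数']
# Convert the collection-time column from strings to datetimes
water_data['采集时间'] = pd.to_datetime(water_data['采集时间'])
print(water_data.head(5))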
          水表名           水表号     采集时间      上次读数      当前读数  用水量
0  XXX4舍热泵热水  1.836719e+09 2019-01-10 12:30:00  25483.09  25483.65  0.56
1  XXX4舍热泵热水  1.836719e+09 2019-01-10 12:45:00  25483.65  25484.23  0.58
2  XXX4舍热泵热水  1.836719e+09 2019-01-10 13:00:00  25484.23  25484.39  0.16
3  XXX4舍热泵热水  1.836719e+09 2019-01-10 13:15:00  25484.39  25484.46  0.07
4  XXX4舍热泵热水  1.836719e+09 2019-01-10 13:30:00  25484.46  25485.30  0.84
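
Because the Excel file is not included with the article, the following sketch rebuilds the first three rows shown above in memory so the conversion can be tried without it. Only the three columns needed for the calculation are included, and the rounding to two decimals is an addition to keep floating-point noise out of the result.

```python
import pandas as pd

# First three rows of the output above, typed in by hand
water_data = pd.DataFrame({
    "采集时间": ["2019/1/10 12:30:00", "2019/1/10 12:45:00", "2019/1/10 13:00:00"],
    "上次读数": [25483.09, 25483.65, 25484.23],
    "当前读数": [25483.65, 25484.23, 25484.39],
})

# Usage per interval, rounded to 2 decimals (the rounding is an assumption)
water_data["用水量"] = (water_data["当前读数"] - water_data["上次读数"]).round(2)

# The column holds strings before the conversion and datetimes after it
water_data["采集时间"] = pd.to_datetime(water_data["采集时间"])
print(water_data.dtypes)
```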

2. Time series

Besides converting strings to datetimes, pandas can also generate time series. This is done with the date_range function.
pandas.date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None, **kwargs)

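Parameter	Meaning
periods	Number of periods to generate; an integer or None
freq	Date offset between consecutive values; a string or DateOffset, default 'D'
normalize	If True, normalize start and end to midnight before generating the range
name	Name of the resulting index object; a string or None
closed	Which endpoints to keep. None (the default) keeps both start and end; 'left' keeps start and drops end (left-closed, right-open); 'right' drops start and keeps end (left-open, right-closed). Newer pandas releases replace this parameter with inclusive ('both', 'neither', 'left', 'right')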
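
A short sketch of how these parameters interact, using illustrative variable names and an illustrative index name "sample_time". It avoids closed, because that parameter differs between pandas versions.

```python
import pandas as pd

# start + end + freq: the end date is only included if it lands on a step
every_3_days = pd.date_range("2021-06-27", "2021-09-27", freq="3D")
print(len(every_3_days), every_3_days[-1])

# start + periods + freq: fixes the number of points instead of the end date
by_periods = pd.date_range(start="2021-06-27", periods=4, freq="3D", name="sample_time")
print(by_periods)

# normalize=True moves the start time to midnight before generating the range
normalized = pd.date_range(start="2021-06-27 15:30", periods=2, freq="D", normalize=True)
print(normalized)
```

With a 3-day step the first range stops at 2021-09-25, because the next step would go past the requested end date.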

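# Generate a time series from 2021-06-27 to 2021-09-27 at 3-day intervals
pd.date_range("2021-06-27", "2021-09-27",freq="3D")

# Generate a time series from 12:00 to 23:59 at 30-minute intervals
# (time-only strings take the date on which the code is run)
pd.date_range("12:00", "23:59", freq="30min")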

DatetimeIndex(['2022-10-17 12:00:00', '2022-10-17 12:30:00',
               '2022-10-17 13:00:00', '2022-10-17 13:30:00',
               '2022-10-17 14:00:00', '2022-10-17 14:30:00',
               '2022-10-17 15:00:00', '2022-10-17 15:30:00',
               '2022-10-17 16:00:00', '2022-10-17 16:30:00',
               '2022-10-17 17:00:00', '2022-10-17 17:30:00',
               '2022-10-17 18:00:00', '2022-10-17 18:30:00',
               '2022-10-17 19:00:00', '2022-10-17 19:30:00',
               '2022-10-17 20:00:00', '2022-10-17 20:30:00',
               '2022-10-17 21:00:00', '2022-10-17 21:30:00',
               '2022-10-17 22:00:00', '2022-10-17 22:30:00',
               '2022-10-17 23:00:00', '2022-10-17 23:30:00'],
              dtype='datetime64[ns]', freq='30T')

3. Time indexes

pandas can use time series data as an index. The two common index types are DatetimeIndex and PeriodIndex. In everyday use they behave similarly; the difference is that DatetimeIndex represents a sequence of points in time, whereas PeriodIndex represents a sequence of time spans.
pandas.DatetimeIndex(data=None, freq=<no_default>, tz=None, normalize=False, closed=None,
ambiguous='raise', dayfirst=False, yearfirst=False, dtype=None, copy=False, name=None)

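Parameter	Meaning
data	Array-like. The values of the DatetimeIndex. Defaults to None
freq	String. The interval between consecutive times
start	String. Start point for generating regularly spaced times. Not part of the signature above (older pandas versions only); use pandas.date_range to generate a regular range
periods	Number of periods to generate. Not part of the signature above; see pandas.date_range
end	String. End point for generating regularly spaced times. Not part of the signature above; see pandas.date_range
tz	Timezone. The time zone of the data. Defaults to None
name	Name of the DatetimeIndex. Defaults to None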
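
To make the point-versus-span distinction concrete, the sketch below converts a small DatetimeIndex to a monthly PeriodIndex with to_period. The dates and variable names are illustrative assumptions.

```python
import pandas as pd

# Each entry of a DatetimeIndex is a single point in time
points = pd.DatetimeIndex(["2018-08-01", "2018-08-03"])

# Each entry of a PeriodIndex is a span; here, the whole month of August 2018
periods = points.to_period("M")
print(periods)

# A period has a start and an end, a timestamp does not
first_span = periods[0]
print(first_span.start_time, first_span.end_time)
```

Both dates fall in the same month, so the two periods are equal even though the two timestamps are not.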

# Build a DataFrame whose index is made of plain strings
str_index=['2017/8/1','2018/8/1','2018/8/3','2018/8/4','2018/8/7']
df1=pd.DataFrame([1,2,3,4,5],index=str_index)
print(df1)
# Output
	        0
2017/8/1	1
2018/8/1	2
2018/8/3	3
2018/8/4	4
2018/8/7	5

# Build a DataFrame with the same content but a DatetimeIndex as its index
time_index=pd.DatetimeIndex(['2017/8/1','2018/8/1','2018/8/3','2018/8/4','2018/8/7'])
print(time_index)
# Output
DatetimeIndex(['2017-08-01', '2018-08-01', '2018-08-03', '2018-08-04',
               '2018-08-07'],
              dtype='datetime64[ns]', freq=None)
df2=pd.DataFrame([1,2,3,4,5],index=time_index)
print(df2)
# Output
	        0
2017-08-01	1
2018-08-01	2
2018-08-03	3
2018-08-04	4
2018-08-07	5

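# Select rows for a given year, month, or day of the month.
# df here is a DataFrame with a DatetimeIndex; it is not df2 above
# (its definition is not shown, only the output below).
print(df[df.index.year==2018])
print('*'*30)
print(df[df.index.month==7])
print('*'*30)
print(df[df.index.day==3])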
            0
2018-07-01  3
2018-07-03  4
2018-08-04  5
2018-08-07  6
******************************
            0
2018-07-01  3
2018-07-03  4
******************************
            0
2017-08-03  2
2018-07-03  4

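# Select rows whose day of the month is in [1, 5).
# This filters by day of the month across all months and years,
# not by a contiguous date range.
df[(df.index.day < 5) & (df.index.day >=1)]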
			0
2017-08-03	2
2018-07-01	3
2018-07-03	4
2018-08-04	5
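
The DataFrame used in the snippets above is not defined in the article. The sketch below rebuilds one from the five rows that appear in the printed output (any earlier rows are unknown and left out), repeats the attribute-based filters, and adds label-based selection with .loc as an alternative when a contiguous date range is wanted.

```python
import pandas as pd

# Rows reconstructed from the printed output above; this is an assumption
# about the original data, which may have contained more rows
time_index = pd.DatetimeIndex(
    ["2017-08-03", "2018-07-01", "2018-07-03", "2018-08-04", "2018-08-07"]
)
df = pd.DataFrame([2, 3, 4, 5, 6], index=time_index)

# Boolean masks built from index attributes
in_2018 = df[df.index.year == 2018]
in_july = df[df.index.month == 7]
on_day_3 = df[df.index.day == 3]

# Partial string indexing: every row in July 2018
july_2018 = df.loc["2018-07"]

# Label slicing on a sorted DatetimeIndex; both endpoints are included
date_slice = df.loc["2018-07-01":"2018-08-04"]
print(date_slice)
```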

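When working with a large amount of time data, the resample function can be used to resample the time series.
DataFrame.resample(rule)
rule: an offset string or object describing the target frequency, usually a time alias such as "M", "A", "Q", "BM", "BA", "BQ", or "W". In pandas 2.2 and later some of these aliases are renamed (for example, "M" becomes "ME").
The result of resample is a DatetimeIndexResampler object, which cannot be inspected directly; apply an aggregation such as count() to obtain a result.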

# Resampling
df=pd.read_excel('Excel数据.xlsx',sheet_name=0)
df.index=df['销售日期']

# Downsample to one row per month and count the records in each month
df.resample('M').count()
			订单号 销售日期 销售人员 地区	城市 家电品牌 单价 数量（台）	销售额
销售日期									
2009-01-31	10	10	10	10	10	10	10	10	10
2009-02-28	6	6	6	6	6	6	6	6	6
2009-03-31	8	8	8	8	8	8	8	8	8
2009-04-30	8	8	8	8	8	8	8	8	8
2009-05-31	11	11	11	11	11	11	11	11	11
2009-06-30	11	11	11	11	11	11	11	11	11
2009-07-31	9	9	9	9	9	9	9	9	9
2009-08-31	9	9	9	9	9	9	9	9	9
2009-09-30	10	10	10	10	10	10	10	10	10
2009-10-31	9	9	9	9	9	9	9	9	9
2009-11-30	4	4	4	4	4	4	4	4	4
2009-12-31	4	4	4	4	4	4	4	4	4
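
Since the Excel file is not available here, the following sketch applies the same idea to synthetic data: 14 daily rows are downsampled to weekly bins. The column name "quantity" and the values are made up for illustration, and "W" is used instead of "M" so the example does not depend on which month alias the installed pandas version accepts.

```python
import pandas as pd

# 14 consecutive days starting 2009-01-01, one made-up quantity per day
daily_index = pd.date_range("2009-01-01", periods=14, freq="D")
orders = pd.DataFrame({"quantity": range(1, 15)}, index=daily_index)

# resample returns a lazy resampler object, not a table of results
resampler = orders.resample("W")
print(type(resampler).__name__)

# An aggregation turns each weekly bin into one row
weekly_count = resampler.count()
weekly_sum = resampler.sum()
print(weekly_count)
print(weekly_sum)
```

Weekly bins end on Sunday by default, so the first and last bins here are partial weeks.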