Filling Gaps in a Time Series
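Time series are a recurring headache when working with data. Sometimes a column looks like a timestamp but is not stored in a datetime format, so it has to be converted first. The other problem, and the subject of this post, is a time series with missing entries.
- Prepare the data: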
import numpy as np
import pandas as pd
How do we fill in a time index with gaps?
# You can download the data yourself
file = r'timeseries.csv'
df = pd.read_csv(file, index_col=0)
df
| | A | B | C | D |
---|---|---|---|---|
2019/10/11 0:00 | 28 | 14 | 25 | 15 |
2019/10/11 1:00 | 11 | 11 | 14 | 26 |
2019/10/11 3:00 | 15 | 12 | 14 | 17 |
2019/10/11 4:00 | 16 | 10 | 14 | 12 |
2019/10/11 6:00 | 24 | 11 | 13 | 26 |
2019/10/11 8:00 | 21 | 29 | 29 | 22 |
2019/10/11 9:00 | 18 | 18 | 17 | 25 |
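The time index clearly has gaps. We can handle this with reindex():
- Make sure the index is in datetime format
- Build the complete time sequence
- Fill the gaps with reindex
Why insist on a datetime index? Generating a gap-free sequence is only practical with real datetime values, unless you implement a complicated algorithm by hand. reindex also matches labels between the old index and the new one, so both must hold the same type of data; otherwise every row comes back as NaN.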
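As a rough illustration of the "complete time sequence" step, the sketch below builds the full hourly range from the data's own first and last timestamps and lists which hours are absent. The names `sample`, `full_index`, and `missing` are hypothetical, and the values simply mirror a few rows of the table above.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the table above,
# already carrying a DatetimeIndex.
sample = pd.DataFrame(
    {"A": [28, 11, 15, 16], "B": [14, 11, 12, 10]},
    index=pd.to_datetime(
        [
            "2019-10-11 00:00",
            "2019-10-11 01:00",
            "2019-10-11 03:00",
            "2019-10-11 04:00",
        ]
    ),
)

# Build the complete hourly range from the first and last timestamps in the data.
full_index = pd.date_range(sample.index.min(), sample.index.max(), freq="h")

# Timestamps that are in the complete range but absent from the data.
missing = full_index.difference(sample.index)
print(missing)
```

Deriving the range from `index.min()` and `index.max()` avoids hard-coding the start and end times, at the cost of not detecting gaps before the first or after the last observation.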
df.index
Index(['2019/10/11 0:00', '2019/10/11 1:00', '2019/10/11 3:00',
'2019/10/11 4:00', '2019/10/11 6:00', '2019/10/11 8:00',
'2019/10/11 9:00'],
dtype='object')
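# Create the complete time index
# ('h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas versions)
t_index = pd.date_range('2019-10-11 00:00:00', '2019-10-11 09:00:00', freq='h')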
df.reindex(t_index)
| | A | B | C | D |
---|---|---|---|---|
2019-10-11 00:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 01:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 02:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 03:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 04:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 05:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 06:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 07:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 08:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 09:00:00 | NaN | NaN | NaN | NaN |
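Because a CSV is a plain-text file, the index was read in as strings. With the wrong data type, reindex returns the all-NaN result above.
- Reload the data, this time parsing the timestamps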
df = pd.read_csv(file, parse_dates=[0], index_col=0)
df
| | A | B | C | D |
---|---|---|---|---|
2019-10-11 00:00:00 | 28 | 14 | 25 | 15 |
2019-10-11 01:00:00 | 11 | 11 | 14 | 26 |
2019-10-11 03:00:00 | 15 | 12 | 14 | 17 |
2019-10-11 04:00:00 | 16 | 10 | 14 | 12 |
2019-10-11 06:00:00 | 24 | 11 | 13 | 26 |
2019-10-11 08:00:00 | 21 | 29 | 29 | 22 |
2019-10-11 09:00:00 | 18 | 18 | 17 | 25 |
df.index  # now parsed as datetimes
DatetimeIndex(['2019-10-11 00:00:00', '2019-10-11 01:00:00',
'2019-10-11 03:00:00', '2019-10-11 04:00:00',
'2019-10-11 06:00:00', '2019-10-11 08:00:00',
'2019-10-11 09:00:00'],
dtype='datetime64[ns]', freq=None)
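If re-reading the file is inconvenient, the string index can probably be converted in place instead. The following is a minimal sketch under that assumption; `raw`, `hours`, and `filled` are hypothetical names, and the small frame stands in for the CSV data.

```python
import pandas as pd

# Hypothetical frame whose index holds date strings, as after a plain read_csv.
raw = pd.DataFrame(
    {"A": [28, 11, 15]},
    index=["2019/10/11 0:00", "2019/10/11 1:00", "2019/10/11 3:00"],
)

# Convert the string labels to timestamps without touching the data columns.
raw.index = pd.to_datetime(raw.index)

# With a DatetimeIndex in place, reindex can align on the hourly range.
hours = pd.date_range("2019-10-11 00:00", "2019-10-11 03:00", freq="h")
filled = raw.reindex(hours)
print(filled)
```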
- Reindex with the complete time index
df.reindex(t_index)
| | A | B | C | D |
---|---|---|---|---|
2019-10-11 00:00:00 | 28.0 | 14.0 | 25.0 | 15.0 |
2019-10-11 01:00:00 | 11.0 | 11.0 | 14.0 | 26.0 |
2019-10-11 02:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 03:00:00 | 15.0 | 12.0 | 14.0 | 17.0 |
2019-10-11 04:00:00 | 16.0 | 10.0 | 14.0 | 12.0 |
2019-10-11 05:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 06:00:00 | 24.0 | 11.0 | 13.0 | 26.0 |
2019-10-11 07:00:00 | NaN | NaN | NaN | NaN |
2019-10-11 08:00:00 | 21.0 | 29.0 | 29.0 | 22.0 |
2019-10-11 09:00:00 | 18.0 | 18.0 | 17.0 | 25.0 |
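When the target is simply a regular frequency between the first and last observation, `DataFrame.asfreq` may serve as a shorter route to the same result. The sketch below compares the two on a small hypothetical frame (`sample`); it is an illustration, not a claim about which form the original data requires.

```python
import pandas as pd

# Hypothetical sample with an hourly DatetimeIndex that skips 02:00.
sample = pd.DataFrame(
    {"A": [28, 11, 15]},
    index=pd.to_datetime(
        ["2019-10-11 00:00", "2019-10-11 01:00", "2019-10-11 03:00"]
    ),
)

# asfreq conforms the data to a regular hourly grid, inserting NaN rows.
via_asfreq = sample.asfreq("h")

# The explicit version: build the range, then reindex.
hours = pd.date_range(sample.index.min(), sample.index.max(), freq="h")
via_reindex = sample.reindex(hours)

print(via_asfreq)
print(via_reindex)
```

The explicit `reindex` form remains useful when the desired range extends beyond the first or last timestamp actually present in the data.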
df.reindex(t_index, fill_value=0)  # fill missing values with 0
| | A | B | C | D |
---|---|---|---|---|
2019-10-11 00:00:00 | 28 | 14 | 25 | 15 |
2019-10-11 01:00:00 | 11 | 11 | 14 | 26 |
2019-10-11 02:00:00 | 0 | 0 | 0 | 0 |
2019-10-11 03:00:00 | 15 | 12 | 14 | 17 |
2019-10-11 04:00:00 | 16 | 10 | 14 | 12 |
2019-10-11 05:00:00 | 0 | 0 | 0 | 0 |
2019-10-11 06:00:00 | 24 | 11 | 13 | 26 |
2019-10-11 07:00:00 | 0 | 0 | 0 | 0 |
2019-10-11 08:00:00 | 21 | 29 | 29 | 22 |
2019-10-11 09:00:00 | 18 | 18 | 17 | 25 |
What if the time series to fill is not the index?
We cannot guarantee that the timestamps will always be the index. What should we do in that case?
df = pd.read_csv(file, parse_dates=[0], names=list('TABCD'), skiprows=[0])
df
| | T | A | B | C | D |
---|---|---|---|---|---|
0 | 2019-10-11 00:00:00 | 28 | 14 | 25 | 15 |
1 | 2019-10-11 01:00:00 | 11 | 11 | 14 | 26 |
2 | 2019-10-11 03:00:00 | 15 | 12 | 14 | 17 |
3 | 2019-10-11 04:00:00 | 16 | 10 | 14 | 12 |
4 | 2019-10-11 06:00:00 | 24 | 11 | 13 | 26 |
5 | 2019-10-11 08:00:00 | 21 | 29 | 29 | 22 |
6 | 2019-10-11 09:00:00 | 18 | 18 | 17 | 25 |
Here we manually named the time column "T". We can fill the gaps with merge():
# Create a DataFrame to merge with
T_df = pd.DataFrame(t_index, columns=['T'])
T_df
| | T |
---|---|
0 | 2019-10-11 00:00:00 |
1 | 2019-10-11 01:00:00 |
2 | 2019-10-11 02:00:00 |
3 | 2019-10-11 03:00:00 |
4 | 2019-10-11 04:00:00 |
5 | 2019-10-11 05:00:00 |
6 | 2019-10-11 06:00:00 |
7 | 2019-10-11 07:00:00 |
8 | 2019-10-11 08:00:00 |
9 | 2019-10-11 09:00:00 |
df.merge(T_df, how='right', sort=True)
| | T | A | B | C | D |
---|---|---|---|---|---|
0 | 2019-10-11 00:00:00 | 28.0 | 14.0 | 25.0 | 15.0 |
1 | 2019-10-11 01:00:00 | 11.0 | 11.0 | 14.0 | 26.0 |
2 | 2019-10-11 02:00:00 | NaN | NaN | NaN | NaN |
3 | 2019-10-11 03:00:00 | 15.0 | 12.0 | 14.0 | 17.0 |
4 | 2019-10-11 04:00:00 | 16.0 | 10.0 | 14.0 | 12.0 |
5 | 2019-10-11 05:00:00 | NaN | NaN | NaN | NaN |
6 | 2019-10-11 06:00:00 | 24.0 | 11.0 | 13.0 | 26.0 |
7 | 2019-10-11 07:00:00 | NaN | NaN | NaN | NaN |
8 | 2019-10-11 08:00:00 | 21.0 | 29.0 | 29.0 | 22.0 |
9 | 2019-10-11 09:00:00 | 18.0 | 18.0 | 17.0 | 25.0 |
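To see which rows were introduced by the merge rather than read from the data, the `indicator` parameter of `merge` can help. This is a small sketch on hypothetical frames (`observed`, `hours`), with the key passed explicitly through `on="T"`.

```python
import pandas as pd

# Hypothetical observed data with the timestamps in an ordinary column "T".
observed = pd.DataFrame(
    {
        "T": pd.to_datetime(
            ["2019-10-11 00:00", "2019-10-11 01:00", "2019-10-11 03:00"]
        ),
        "A": [28, 11, 15],
    }
)

# The complete hourly range, as a one-column DataFrame with the same key name.
hours = pd.DataFrame(
    {"T": pd.date_range("2019-10-11 00:00", "2019-10-11 03:00", freq="h")}
)

# indicator=True adds a "_merge" column describing where each row came from.
merged = observed.merge(hours, how="right", on="T", sort=True, indicator=True)

# "right_only" marks timestamps that exist only in the complete range.
filled_rows = merged[merged["_merge"] == "right_only"]
print(filled_rows)
```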
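Points to note when using merge:
- In merge(right), the right argument (T_df here) must be a DataFrame or a named Series, and the key column must have the same name on both sides (otherwise pass left_on/right_on)
- sort=True: keeps the time series in the correct order
- This is essentially a knockoff of VLOOKUP and a bit of a trick. If you have a more elegant way to do it, please leave a comment. Much appreciated!!!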
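One possible alternative to the merge approach, offered as a sketch rather than a definitive answer, is to promote the time column to the index temporarily, reuse reindex, and then move it back. The frame `observed` and the names `hours` and `result` are hypothetical.

```python
import pandas as pd

# Hypothetical observed data with the timestamps in an ordinary column "T".
observed = pd.DataFrame(
    {
        "T": pd.to_datetime(
            ["2019-10-11 00:00", "2019-10-11 01:00", "2019-10-11 03:00"]
        ),
        "A": [28, 11, 15],
    }
)
hours = pd.date_range("2019-10-11 00:00", "2019-10-11 03:00", freq="h")

result = (
    observed.set_index("T")  # use the time column as the index
    .reindex(hours)          # insert NaN rows for the missing hours
    .rename_axis("T")        # restore the column name lost by reindexing
    .reset_index()           # turn the index back into a regular column
)
print(result)
```

This keeps the whole operation in one chain and avoids building a helper DataFrame, though it assumes the values in "T" are unique, since reindex cannot be applied to an index with duplicate labels.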