pandas对时间序列中缺失值进行线性插值_Johngo学长 (johngo689.com)
一、问题背景
以二氧化碳浓度数据集为例,该时间序列在某些时间点上会有缺失值。
# 读取数据集
df = pd.read_csv('二氧化碳浓度.csv')
# 将字段Datetime数据类型转换为日期类型
df['datetime'] = pd.to_datetime(df['datetime'], format="%Y-%m-%d %H:%M:%S")
df
value datetime 0 370 2022-04-13 00:11:00 1 375 2022-04-13 00:42:00 2 383 2022-04-13 01:11:00 3 391 2022-04-13 01:41:00 4 397 2022-04-13 02:11:00 … … … 583 369 2022-04-23 22:12:00 584 376 2022-04-23 22:41:00 585 379 2022-04-23 23:11:00 586 379 2022-04-23 23:12:00 587 378 2022-04-23 23:41:00
588 rows × 2 columns
该数据集是没个半小时一条数据,若想要填充成10分钟一条数据,就需要进行线性插值。
二、pandas解决方案
下面我们需要生成一个辅助dataframe,命名为helper, 包含所有日期。
# pd.date_range时间序列
# 间隔十分钟
helper = pd.DataFrame({'datetime': pd.date_range(start=df['datetime'].min(), end=df['datetime'].max(),freq='600s')})
# helper = pd.DataFrame(helper).set_index('datetime')#将时间列变为索引
helper:
datetime 0 2022-04-13 00:11:00 1 2022-04-13 00:21:00 2 2022-04-13 00:31:00 3 2022-04-13 00:41:00 4 2022-04-13 00:51:00 … … 1577 2022-04-23 23:01:00 1578 2022-04-23 23:11:00 1579 2022-04-23 23:21:00 1580 2022-04-23 23:31:00 1581 2022-04-23 23:41:00
1582 rows × 1 columns
然后再将helper和d join起来
d = pd.merge(df, helper, on='datetime', how='outer').sort_values('datetime')
最后一步就是插值了,直接用interpolate方法,method选择linear,线性插值。
d['value'] = d['value'].interpolate(method='linear')
value datetime 0 370.0 2022-04-13 00:11:00 588 NaN 2022-04-13 00:21:00 589 NaN 2022-04-13 00:31:00 590 NaN 2022-04-13 00:41:00 1 375.0 2022-04-13 00:42:00 … … … 585 379.0 2022-04-23 23:11:00 586 379.0 2022-04-23 23:12:00 1658 NaN 2022-04-23 23:21:00 1659 NaN 2022-04-23 23:31:00 587 378.0 2022-04-23 23:41:00
1660 rows × 2 columns
插值选择方法不止有线性,还可以是
Original: pandas对时间序列中缺失值进行线性插值 - 王陸 - 博客园
Author: 王陸
Title: pandas对时间序列中缺失值进行线性插值
原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/8417/
转载文章受原作者版权保护。转载请注明原作者出处!
======================================================================
pandas插值--非规则时间序列
pandas插值--非规则时间序列 - Heywhale.com
问题描述:现有一观测资料,时间间隔非均匀,如何将其插值到均匀的时间序列?
这个问题可以参考:工作台 - Heywhale.com ,具体步骤如下
- 生成目标时间的list,和当前index取交集
- 将交集reindex为新的索引,填充缺测,调用插值函数
- 再次reindex到目标时间list
当然,这里也可以考虑利用xarray的插值函数
原始数据
In [1]:
import pandas as pd import xarray as xr import numpy as np
Warning: ecCodes 2.21.0 or higher is recommended. You are running version 2.14.1
In [2]:
df = pd.read_csv("/home/mw/input/pcp6426/interp2.csv", parse_dates=[0], index_col=0) df
Out[2]:
data | |
---|---|
time | |
2019-12-10 17:17:37.140624896 | 0.255970 |
2019-12-10 17:17:37.339844352 | 0.364656 |
2019-12-10 17:17:37.539062016 | 0.206504 |
2019-12-10 17:17:37.738280704 | 0.491149 |
2019-12-10 17:17:37.937499904 | 0.037144 |
... | ... |
2019-12-10 17:17:56.140624896 | 0.400654 |
2019-12-10 17:17:56.339844352 | 0.382707 |
2019-12-10 17:17:56.539062016 | 0.094809 |
2019-12-10 17:17:56.738280704 | 0.049449 |
2019-12-10 17:17:56.937499904 | 0.793193 |
100 rows × 1 columns
转到xarray
In [3]:
ds = df.to_xarray() ds
Out[3]:
xarray.Dataset
- Dimensions:
- time: 100
- Coordinates:
-
time
(time)
datetime64[ns]
2019-12-10T17:17:37.140624896 .....
-
- Data variables:
-
data
(time)
float64
0.256 0.3647 ... 0.04945 0.7932
array([0.25596959, 0.36465587, 0.20650436, 0.49114901, 0.03714372, 0.28040043, 0.70309464, 0.68119741, 0.05901183, 0.67645174, 0.28849891, 0.50132343, 0.62840763, 0.95099851, 0.02626051, 0.33692432, 0.46553236, 0.50861759, 0.74550601, 0.55250929, 0.52566986, 0.52261307, 0.37321184, 0.79995134, 0.23094678, 0.87915976, 0.70379002, 0.1751345 , 0.08580627, 0.48962096, 0.85569395, 0.20317885, 0.80113267, 0.64351059, 0.06970162, 0.3934217 , 0.44517141, 0.89346621, 0.55945748, 0.72797598, 0.02390159, 0.20475986, 0.51505429, 0.82730583, 0.77451266, 0.65776446, 0.6155744 , 0.71989889, 0.01017897, 0.38116152, 0.9602366 , 0.86313496, 0.76415434, 0.85792074, 0.84286672, 0.56732516, 0.58180219, 0.23357688, 0.77524271, 0.56301411, 0.48070764, 0.19323841, 0.37067283, 0.14380672, 0.8256961 , 0.90791718, 0.31419928, 0.42323499, 0.58264084, 0.73124201, 0.80804131, 0.47969966, 0.2311884 , 0.45588249, 0.72280821, 0.64822314, 0.84239576, 0.14129363, 0.59488152, 0.62924007, 0.93597915, 0.96099593, 0.63139743, 0.86937852, 0.59717561, 0.06693927, 0.93089973, 0.45931833, 0.51508115, 0.40343549, 0.22152076, 0.84937486, 0.30642194, 0.8782651 , 0.09930092, 0.40065394, 0.38270678, 0.09480903, 0.04944922, 0.79319296])
-
- Attributes: (0)
插值
In [4]:
idx = pd.date_range('2019-12-10 17:17:37','2019-12-10 17:17:57',freq='0.2S') ds = ds.interp(time=idx)
In [5]:
ds
Out[5]:
xarray.Dataset
- Dimensions:
- time: 101
- Coordinates:
-
time
(time)
datetime64[ns]
2019-12-10T17:17:37 ... 2019-12-...
-
- Data variables:
-
data
(time)
float64
nan 0.2884 0.3169 ... 0.2799 nan
array([ nan, 0.2883623 , 0.31690053, 0.29357285, 0.35049546, 0.11199205, 0.40889926, 0.69618861, 0.49088025, 0.25029839, 0.55474113, 0.35610155, 0.54140369, 0.7270833 , 0.66450917, 0.12372359, 0.37699039, 0.47854235, 0.58107818, 0.68571448, 0.54408908, 0.52471756, 0.47749996, 0.503745 , 0.62367034, 0.43430757, 0.82452572, 0.54415751, 0.14781035, 0.21091047, 0.60446731, 0.65241178, 0.38373663, 0.75291846, 0.46574115, 0.17126078, 0.40954361, 0.58053824, 0.79129799, 0.61166552, 0.50709006, 0.08024552, 0.29845618, 0.61056731, 0.81095019, 0.7378858 , 0.64462072, 0.64707616, 0.5028063 , 0.12511157, 0.562832 , 0.92998591, 0.83324683, 0.79283606, 0.85325691, 0.75642237, 0.57183528, 0.47665228, 0.39926427, 0.70949302, 0.53719249, 0.39115053, 0.2468164 , 0.30127792, 0.35506012, 0.85149093, 0.72295248, 0.34712363, 0.47199483, 0.62867837, 0.75533589, 0.70575096, 0.40465934, 0.29991892, 0.53857767, 0.69940899, 0.70871497, 0.63069137, 0.28003931, 0.60552601, 0.72547187, 0.94377278, 0.86147056, 0.70419224, 0.78504843, 0.43402571, 0.3244329 , 0.78850129, 0.47637533, 0.48049266, 0.34746164, 0.40864556, 0.68542511, 0.48134011, 0.63693735, 0.19202507, 0.39530499, 0.29577334, 0.08093415, 0.27986546, nan])
-
- Attributes: (0)
转回pandas
In [6]:
df = ds.to_dataframe() df
Out[6]:
data | |
---|---|
time | |
2019-12-10 17:17:37.000 | NaN |
2019-12-10 17:17:37.200 | 0.288362 |
2019-12-10 17:17:37.400 | 0.316901 |
2019-12-10 17:17:37.600 | 0.293573 |
2019-12-10 17:17:37.800 | 0.350495 |
... | ... |
2019-12-10 17:17:56.200 | 0.395305 |
2019-12-10 17:17:56.400 | 0.295773 |
2019-12-10 17:17:56.600 | 0.080934 |
2019-12-10 17:17:56.800 | 0.279865 |
2019-12-10 17:17:57.000 | NaN |
101 rows × 1 columns