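Time series datasets can contain a seasonal component: a cycle that repeats over time, such as monthly or yearly. This repeating cycle may obscure the signal we want to model when forecasting, and in turn it may itself provide a strong signal to our predictive models.
A plot of the daily minimum temperatures dataset shows a strong seasonal component.
Method 1: differencing
From each observation, subtract the observation from the same day one year earlier.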
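from pandas import read_csv
from matplotlib import pyplot
# Series.from_csv was removed from pandas; read_csv plus squeeze yields the same Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
X = series.values
diff = list()
days_in_year = 365
# subtract the value observed 365 rows (one year) earlier
for i in range(days_in_year, len(X)):
    value = X[i] - X[i - days_in_year]
    diff.append(value)
pyplot.plot(diff)
pyplot.show()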
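The result is as follows:
Our dataset contains two leap years (1984 and 1988). They are not handled explicitly, which means that the offsets for observations from March 1984 onward are off by one day, and from March 1988 onward by two days.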
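A fixed 365-row offset does not follow calendar dates across a leap year. One possible way around this is to difference against the same calendar date one year earlier instead of a fixed number of rows. The sketch below uses only the standard library; the helper names `same_day_last_year` and `calendar_difference` and the toy data are hypothetical, and dropping February 29 is an assumption rather than something the original text prescribes.

```python
from datetime import date, timedelta

def same_day_last_year(d):
    """Return the same calendar date one year earlier, or None for Feb 29."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # Feb 29 has no counterpart in the previous (non-leap) year
        return None

def calendar_difference(dates, values):
    """Difference each value against the value on the same date a year before."""
    lookup = dict(zip(dates, values))
    out = []
    for d, v in zip(dates, values):
        prev = same_day_last_year(d)
        if prev is not None and prev in lookup:
            out.append((d, v - lookup[prev]))
    return out

# Hypothetical toy data covering 1983-01-01 .. 1985-12-31 (1984 is a leap year).
# Each value is simply the year, so every calendar-aligned difference is 1.
start = date(1983, 1, 1)
dates = [start + timedelta(days=i) for i in range(3 * 365 + 1)]
values = [float(d.year) for d in dates]
adjusted = calendar_difference(dates, values)
```

Dates in the first year are skipped because they have no observation a year earlier, just as the loop in the text skips the first 365 rows.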
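Instead of differencing day by day, we can resample the data to monthly means and subtract the mean of the same month in the previous year.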
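from pandas import read_csv
from matplotlib import pyplot
# Series.from_csv was removed from pandas; read_csv plus squeeze yields the same Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# monthly means; 'M' is the month-end alias (newer pandas releases prefer 'ME')
resample = series.resample('M')
monthly_mean = resample.mean()
diff = list()
months_in_year = 12
for i in range(months_in_year, len(monthly_mean)):
    # .iloc makes the positional lookup explicit on a date-indexed Series
    value = monthly_mean.iloc[i] - monthly_mean.iloc[i - months_in_year]
    diff.append(value)
pyplot.plot(diff)
pyplot.show()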
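The plot of the monthly means shows a clear seasonal pattern.
After subtracting the value for the same month of the previous year, the series takes the following form:
Next, we can use the monthly mean minimum temperature from the same month of the previous year to adjust the daily minimum temperature dataset. Again, we simply skip the first year of data, but correcting with monthly rather than daily values may be a more stable approach.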
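from pandas import read_csv
from matplotlib import pyplot
# Series.from_csv was removed from pandas; read_csv plus squeeze yields the same Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
X = series.values
diff = list()
days_in_year = 365
for i in range(days_in_year, len(X)):
    # same calendar month in the previous year, e.g. '1981-03'
    month_str = '%d-%02d' % (series.index[i].year - 1, series.index[i].month)
    # partial-string indexing selects every day in that month
    month_mean_last_year = series.loc[month_str].mean()
    value = X[i] - month_mean_last_year
    diff.append(value)
pyplot.plot(diff)
pyplot.show()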
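The adjusted series is as follows:
A more flexible approach is to take the average over a window of about a week on either side of the same date in the previous year, which may again be a better approach. In addition, temperature data may contain seasonality at several scales, each of which can be corrected directly or indirectly, for example: the day level; the multi-day level, such as one or several weeks; the multi-week level, such as a month; and the multi-month level, such as a quarter or season.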
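The windowed idea can be sketched as follows. This is a minimal standard-library illustration, not the original author's code: the function names, the default `half_width` of 7 days, the mapping of February 29 to February 28, and the toy data are all assumptions made for the example.

```python
from datetime import date, timedelta

def window_mean_last_year(lookup, d, half_width=7):
    """Mean of observations within +/- half_width days of the same date last year."""
    try:
        center = d.replace(year=d.year - 1)
    except ValueError:
        # Feb 29 -> Feb 28 of the previous year (an assumption of this sketch)
        center = d.replace(year=d.year - 1, day=28)
    if center not in lookup:
        return None  # no data a year earlier, e.g. the first year of the series
    days = [center + timedelta(days=k) for k in range(-half_width, half_width + 1)]
    window = [lookup[day] for day in days if day in lookup]
    return sum(window) / len(window)

def window_adjust(dates, values, half_width=7):
    """Subtract last year's windowed mean from each observation."""
    lookup = dict(zip(dates, values))
    adjusted = []
    for d, v in zip(dates, values):
        ref = window_mean_last_year(lookup, d, half_width)
        if ref is not None:
            adjusted.append((d, v - ref))
    return adjusted

# Hypothetical toy data: two non-leap years with a value that grows by 1 per day
dates = [date(1981, 1, 1) + timedelta(days=i) for i in range(730)]
values = [float(i) for i in range(730)]
adjusted = window_adjust(dates, values)
```

Windows that run past the start of the data are simply truncated, so the earliest reference means are computed from fewer days. Changing `half_width` gives the multi-day and multi-week scales mentioned above.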
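Method 2: modeling the seasonal component
Here we fit a polynomial to the position of each observation within the year and treat the fitted curve as the seasonal component.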
from pandas import read_csv
from matplotlib import pyplot
from numpy import polyfit
# Series.from_csv was removed from pandas; read_csv plus squeeze yields the same Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# fit polynomial: x^4*b1 + x^3*b2 + x^2*b3 + x*b4 + b5
# x is the position of each observation within a 365-day cycle
X = [i%365 for i in range(0, len(series))]
y = series.values
degree = 4
coef = polyfit(X, y, degree)
print('Coefficients: %s' % coef)
# create curve
curve = list()
for i in range(len(X)):
    value = coef[-1]
    for d in range(degree):
        value += X[i]**(degree-d) * coef[d]
    curve.append(value)
# plot curve over original data
pyplot.plot(series.values)
pyplot.plot(curve, color='red', linewidth=3)
pyplot.show()
The fitted curve over the original data looks like this:
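The nested loop that builds the curve evaluates the fitted polynomial one term at a time, with coefficients ordered from the highest power down, which is the order `numpy.polyfit` returns. As a small illustration, the sketch below compares that term-by-term evaluation with Horner's rule, an equivalent formulation. The function names and the coefficient values are made up for the example and are not fitted to the temperature data.

```python
def eval_poly_terms(coef, x):
    """Term-by-term evaluation, mirroring the nested loop in the text."""
    degree = len(coef) - 1
    value = coef[-1]
    for d in range(degree):
        value += x ** (degree - d) * coef[d]
    return value

def eval_poly_horner(coef, x):
    """Horner's rule: same polynomial, one multiply and one add per coefficient."""
    value = 0.0
    for c in coef:
        value = value * x + c
    return value

# Hypothetical coefficients, highest power first
coef = [2.0, -3.0, 0.5, 1.0, 4.0]
# The curve repeats every 365 positions because x is taken modulo 365
curve = [eval_poly_horner(coef, i % 365) for i in range(730)]
```

In practice `numpy.polyval(coef, X)` performs the same evaluation on a whole array at once.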
Next, we subtract the fitted curve from the original series:
from pandas import read_csv
from matplotlib import pyplot
from numpy import polyfit
# Series.from_csv was removed from pandas; read_csv plus squeeze yields the same Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# fit polynomial: x^4*b1 + x^3*b2 + x^2*b3 + x*b4 + b5
# x is the position of each observation within a 365-day cycle
X = [i%365 for i in range(0, len(series))]
y = series.values
degree = 4
coef = polyfit(X, y, degree)
print('Coefficients: %s' % coef)
# create curve
curve = list()
for i in range(len(X)):
    value = coef[-1]
    for d in range(degree):
        value += X[i]**(degree-d) * coef[d]
    curve.append(value)
# create seasonally adjusted
values = series.values
diff = list()
for i in range(len(values)):
    value = values[i] - curve[i]
    diff.append(value)
pyplot.plot(diff)
pyplot.show()
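The final result is as follows:
Why remove this seasonal trend at all?
My own view is that the seasonal trend could also be used as a feature for prediction.
The adjusted data may also be more representative of the underlying signal.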
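One way to picture the "seasonal trend as a feature" idea is to estimate an average value for each position in the cycle and feed it to a model alongside the previous observation. The sketch below is only an illustration under that assumption: the function names, the choice of features, and the toy numbers (with a cycle length of 4 instead of 365) are all hypothetical.

```python
def seasonal_profile(values, period=365):
    """Average value at each position in the cycle (e.g. each day of the year)."""
    sums = [0.0] * period
    counts = [0] * period
    for i, v in enumerate(values):
        sums[i % period] += v
        counts[i % period] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def make_rows(values, profile, period=365):
    """Build (features, target) rows: previous value plus the seasonal estimate."""
    rows = []
    for i in range(1, len(values)):
        features = [values[i - 1], profile[i % period]]
        rows.append((features, values[i]))
    return rows

# Hypothetical toy series with a cycle of length 4
values = [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0]
profile = seasonal_profile(values, period=4)
rows = make_rows(values, profile, period=4)
```

For a real forecast the profile would need to be estimated from training data only, otherwise the feature leaks information about the values being predicted.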
https://machinelearningmastery.com/time-series-seasonality-with-python/