Loading the CSV file
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("/kaggle/input/delhi-weather-data/testset.csv")
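Distribution of the values in a pandas column
Here ' _conds' (note the leading space in the column name) is the column being examined. In the bar chart, each bar on the horizontal axis is one distinct value of that column, and the vertical axis shows how many times that value occurs.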
plt.figure(figsize=(15,10))
df[' _conds'].value_counts().head(15).plot(kind='bar')
plt.title('15 most common weathers in Delhi')
plt.show()
This plots the counts of the 15 most frequent weather conditions.
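To make the counting step concrete, here is a minimal sketch on a made-up series standing in for the ' _conds' column. The values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the ' _conds' column (made-up values, for illustration only).
conds = pd.Series(["Haze", "Smoke", "Haze", "Mist", "Haze", "Smoke"])

# value_counts() counts each distinct value and sorts from most to least frequent.
counts = conds.value_counts()

# head(n) keeps the n most frequent values, which is what the bar chart plots.
top2 = counts.head(2)
print(top2)
```

Calling `.plot(kind='bar')` on such a result draws one bar per index label, with the bar height equal to the count.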
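Second approach
For a continuous feature, the first approach is not suitable, since a bar for every distinct value says little about the overall shape of the distribution. Instead, group the values into bins and plot a histogram, as follows.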
plt.figure(figsize=(15, 10))
# sns.distplot is deprecated in recent seaborn releases; histplot draws the same count histogram
sns.histplot(df[' _tempm'], bins=list(range(0, 61, 5)))
plt.title("Distribution of Temperatures")
plt.grid()
plt.show()
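In the plot, the horizontal axis is temperature, a continuous variable, and the vertical axis is the number of observations that fall in each 5-degree temperature range.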
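The binning behind such a histogram can be reproduced with NumPy. This is a rough sketch that assumes the same bin edges as the plot (0 to 60 in steps of 5) and uses made-up temperatures, not values from the dataset.

```python
import numpy as np

# Made-up temperature readings, for illustration only.
temps = np.array([3.0, 7.5, 12.0, 14.9, 27.0, 31.0, 33.5])

# Same bin edges as the plot: 0, 5, 10, ..., 60 (12 bins in total).
edges = np.arange(0, 61, 5)

# counts[i] is the number of readings with edges[i] <= t < edges[i + 1]
# (the last bin also includes its right edge).
counts, _ = np.histogram(temps, bins=edges)
print(counts)
```

Each entry of `counts` corresponds to the height of one bar in the histogram.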
Handling the time variable
The pandas DataFrame contains a time variable:
df.head()
_thunder _tornado _vism _wdird _wdire _wgustm _windchillm _wspdm
0 19961101-11:00 Smoke 9.0 0 0 NaN 27.0 NaN 1010.0 0 0 30.0 0 0 5.0 280.0 West NaN NaN 7.4
1 19961101-12:00 Smoke 10.0 0 0 NaN 32.0 NaN -9999.0 0 0 28.0 0 0 NaN 0.0 North NaN NaN NaN
2 19961101-13:00 Smoke 11.0 0 0 NaN 44.0 NaN -9999.0 0 0 24.0 0 0 NaN 0.0 North NaN NaN NaN
3 19961101-14:00 Smoke 10.0 0 0 NaN 41.0 NaN 1010.0 0 0 24.0 0 0 2.0 0.0 North NaN NaN NaN
4 19961101-16:00 Smoke 11.0 0 0 NaN 47.0 NaN 1011.0 0 0 23.0 0 0 1.2 0.0 North NaN NaN 0.0
As the output above shows, the first column holds the time variable.
How can it be turned into a form that is easier to read?
Convert it to a datetime type:
df['datetime_utc'] = pd.to_datetime(df['datetime_utc'])
df['datetime_utc']
0 1996-11-01 11:00:00
1 1996-11-01 12:00:00
2 1996-11-01 13:00:00
3 1996-11-01 14:00:00
4 1996-11-01 16:00:00
...
100985 2017-04-24 06:00:00
100986 2017-04-24 09:00:00
100987 2017-04-24 12:00:00
100988 2017-04-24 15:00:00
100989 2017-04-24 18:00:00
Name: datetime_utc, Length: 100990, dtype: datetime64[ns]
The values now display cleanly, in standard datetime format.
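The conversion above lets pandas infer the layout of the strings. As a hedged alternative, the format can be spelled out explicitly. The sketch below assumes the raw strings look like `19961101-11:00`, as in the `df.head()` output, and uses a small hand-written series rather than the real column.

```python
import pandas as pd

# Two strings in the same layout as the raw datetime_utc column.
raw = pd.Series(["19961101-11:00", "19961101-12:00"])

# %Y%m%d-%H:%M = four-digit year, month, day, a dash, then hour:minute.
parsed = pd.to_datetime(raw, format="%Y%m%d-%H:%M")
print(parsed)
```

An explicit format avoids ambiguous guesses and raises an error if a string does not match the expected layout.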
- Sometimes the year and month need to be extracted from the time variable. The following helper functions do that.
# a function to extract year part from the whole date
def get_year(x):
    return x[0:4]

# a function to extract month part from the whole date
def get_month(x):
    return x[5:7]
Add the resulting features as new columns.
# making two new features year and month
df['year'] = df['datetime_utc'].apply(lambda x: get_year(str(x)))
df['month'] = df['datetime_utc'].apply(lambda x: get_month(str(x)))
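The string slicing above relies on the text form `YYYY-MM-DD HH:MM:SS` of each timestamp. A possible alternative, sketched here on a small made-up series, is the `.dt` accessor, which is available once the column has a datetime dtype. `strftime` keeps the zero-padded string labels ("01" to "12") that the string-slicing approach produces.

```python
import pandas as pd

# Small made-up datetime series standing in for df['datetime_utc'].
ts = pd.Series(pd.to_datetime(["1996-11-01 11:00", "2017-04-24 18:00"]))

# Zero-padded strings, matching what get_year / get_month return.
year = ts.dt.strftime("%Y")
month = ts.dt.strftime("%m")

# Integer versions are also available if string labels are not needed.
year_int = ts.dt.year
month_int = ts.dt.month
print(year.tolist(), month.tolist())
```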
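- How can we build a cross-tabulation that shows the temperature at different times more clearly? The temperature changes every day, which gives too many values to display, so we average the temperature for each month.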
temp_year = pd.crosstab(df['year'], df['month'], values=df[' _tempm'], aggfunc='mean')
temp_year
month 01 02 03 04 05 06 07 08 09 10 11 12
year
1996 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 19.488145 14.026110
1997 13.063475 15.830687 21.184446 26.304116 29.909869 31.745214 31.087132 29.467082 29.710198 24.143141 19.982586 13.737461
1998 13.155684 16.522225 19.893251 28.698630 33.994089 32.450593 29.930593 29.189565 29.122685 25.284581 20.312433 14.697139
1999 12.542130 16.717543 22.452373 31.398329 34.263086 34.273255 31.992749 32.892425 30.710109 27.691203 22.657713 15.260374
2000 13.815132 15.125176 NaN 35.000000 26.000000 32.449895 30.249515 31.887647 30.132663 28.199244 22.119181 16.165175
2001 12.657776 18.286759 23.560144 29.663181 32.987991 30.980764 30.911467 31.112365 31.173711 27.830927 21.616841 15.958141
2002 14.260843 17.462396 24.526204 32.056928 35.370900 34.215493 35.349460 30.679024 28.412801 27.146409 21.366164 16.896483
2003 11.925121 17.956166 23.532102 31.715728 34.945450 35.449781 30.287187 29.962944 28.945838 26.448644 20.367266 15.206563
2004 13.226186 18.730414 26.631785 31.856787 34.312103 33.068457 32.652663 29.600181 29.948052 24.257797 19.474359 15.530214
2005 13.471463 16.841903 23.682927 28.606047 32.276423 33.983122 30.176230 31.415323 28.786611 25.071230 19.516949 13.682441
2006 13.987500 21.123727 22.526440 29.582979 33.137097 32.070606 30.977536 30.555584 29.118231 26.059227 20.211207 15.464964
2007 13.394309 17.625687 21.741404 30.457983 31.766536 32.881445 31.134146 30.606893 29.179916 24.457164 19.460446 14.555584
2008 12.937759 15.565611 25.061475 28.560000 30.258333 30.246862 30.721992 29.591837 28.508403 26.453441 19.598291 16.210341
2009 14.837915 18.292793 23.450820 29.571019 32.356944 34.567568 32.032922 30.914286 28.886037 25.073171 19.283898 14.995781
2010 12.861224 18.134092 25.935482 32.769231 34.521186 34.273099 31.038069 29.585593 27.668718 26.013333 20.392330 14.008584
2011 12.320175 17.025000 22.810185 27.775246 32.952586 32.014131 30.631579 29.899563 29.079812 25.607759 20.815166 14.486364
2012 12.776256 16.130000 22.683962 28.222222 33.425339 36.037276 31.875358 29.182256 29.207921 24.701920 18.783784 15.157609
2013 12.259297 16.923023 23.030457 29.000000 33.676657 32.351032 30.701923 29.466947 29.843658 26.035176 18.895371 15.297752
2014 13.552879 15.807601 21.594262 28.067227 31.412955 34.770833 32.233983 31.038938 29.712777 26.489089 20.104167 14.780488
2015 12.763916 18.795746 21.570368 27.993141 33.344130 32.737500 30.366199 30.235507 30.665254 26.800630 20.754167 14.975709
2016 15.007752 19.554459 25.696391 32.527021 34.677354 34.898909 30.878223 30.985955 31.490694 28.951992 23.042437 17.769184
2017 15.791917 18.414062 23.553459 30.775120 NaN NaN NaN NaN NaN NaN NaN NaN
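This gives the table above, which makes the change in temperature easy to see.
The first months of 1996 are NaN because the data contains no temperature readings until November of that year. Likewise, 2017 has no readings after April.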
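The way `pd.crosstab` produces both the averages and the NaN cells can be seen on a tiny hand-made frame. The numbers below are invented for illustration; only the call pattern mirrors the one used on the real data.

```python
import pandas as pd

# Made-up readings: 1996 only has data for November, 1997 has January and November.
toy = pd.DataFrame({
    "year":  ["1996", "1996", "1997", "1997", "1997"],
    "month": ["11",   "11",   "01",   "11",   "11"],
    "temp":  [20.0,   18.0,   13.0,   19.0,   21.0],
})

# Rows are years, columns are months, each cell is the mean temp for that pair.
table = pd.crosstab(toy["year"], toy["month"], values=toy["temp"], aggfunc="mean")
print(table)
```

The cell for 1996 and month "01" comes out as NaN because no row in the toy data has that combination, which is the same reason the real table has NaN for early 1996 and late 2017.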
How can the temperature trend be shown attractively as a figure?
plt.figure(figsize=(15, 10))
sns.heatmap(temp_year, cmap='coolwarm', annot=True)
plt.title("Average Temperature in Delhi from 1996 to 2017")
plt.show()
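This produces a heatmap. With the coolwarm colormap, higher temperatures appear in deeper red and lower temperatures in blue. The result is shown below.
Isn't that direct, attractive, and elegant [doge]!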
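Time-series temperature forecasting
The warm-up above was about inspecting the features in the data and processing it,
all in preparation for the time-series temperature forecasting that follows.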
# taking only temperature feature as values and datetime feature as index in the dataframe for time series forecasting of temperature
data = pd.DataFrame(list(df[' _tempm']), index=df['datetime_utc'], columns=['temp'])
data
The result is shown below:
Next, downsample the time series by averaging the readings for each day.
# resampling data with date frequency for time series forecasting
data = data.resample('D').mean()
Fill the NaN values with the overall mean temperature
data.fillna(data['temp'].mean(), inplace=True)
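The two steps above (daily resampling, then filling gaps with the mean) can be checked on a small made-up series. The timestamps and temperatures here are invented. The point is only to show where the NaN values come from and what they are replaced with.

```python
import pandas as pd

# Made-up readings: two on Jan 1, none on Jan 2, one on Jan 3.
idx = pd.to_datetime(["2017-01-01 06:00", "2017-01-01 18:00", "2017-01-03 12:00"])
toy = pd.DataFrame({"temp": [10.0, 14.0, 20.0]}, index=idx)

# 'D' groups the readings by calendar day; a day with no readings becomes NaN.
daily = toy.resample("D").mean()

# Replace the NaN days with the mean of the daily values (NaN is skipped by mean()).
filled = daily.fillna(daily["temp"].mean())
print(filled)
```

Filling with a single global mean is simple but flattens seasonal structure. Time-based interpolation (`daily.interpolate(method="time")`) is one alternative worth considering, though it is not what the original code does.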
Plot the series to observe the temperature trend
plt.figure(figsize=(25, 7))
plt.plot(data, linewidth=.5)
plt.grid()
plt.title("Time Series (Years vs Temp.)")
plt.show()