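1 Dataset overview
1.1 Training set
The training set contains 116,368 samples (the row count reported by `weather.info()` in section 2.2.1). Each sample has 23 columns: 22 features plus the label RainTomorrow. The columns are described below: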
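Column | Description |
---|---|
Date | Date of the observation |
Location | Location of the weather station |
MinTemp | Minimum temperature |
MaxTemp | Maximum temperature |
Rainfall | Amount of rainfall |
Evaporation | Amount of evaporation |
Sunshine | Hours of bright sunshine during the day |
WindGustDir | Direction of the strongest wind gust |
WindGustSpeed | Speed of the strongest wind gust |
WindDir9am | Wind direction at 9 am |
WindDir3pm | Wind direction at 3 pm |
WindSpeed9am | Wind speed at 9 am |
WindSpeed3pm | Wind speed at 3 pm |
Humidity9am | Humidity at 9 am |
Humidity3pm | Humidity at 3 pm |
Pressure9am | Atmospheric pressure at 9 am |
Pressure3pm | Atmospheric pressure at 3 pm |
Cloud9am | Fraction of the sky covered by cloud at 9 am |
Cloud3pm | Fraction of the sky covered by cloud at 3 pm |
Temp9am | Temperature at 9 am |
Temp3pm | Temperature at 3 pm |
RainToday | Whether it rained today |
RainTomorrow | Whether it rains tomorrow (the prediction target) |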
1.2 Test set
The test set contains 29093 samples. Each sample has 22 columns: the same as the training set, but without the RainTomorrow column.
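Column | Description |
---|---|
Date | Date of the observation |
Location | Location of the weather station |
MinTemp | Minimum temperature |
MaxTemp | Maximum temperature |
Rainfall | Amount of rainfall |
Evaporation | Amount of evaporation |
Sunshine | Hours of bright sunshine during the day |
WindGustDir | Direction of the strongest wind gust |
WindGustSpeed | Speed of the strongest wind gust |
WindDir9am | Wind direction at 9 am |
WindDir3pm | Wind direction at 3 pm |
WindSpeed9am | Wind speed at 9 am |
WindSpeed3pm | Wind speed at 3 pm |
Humidity9am | Humidity at 9 am |
Humidity3pm | Humidity at 3 pm |
Pressure9am | Atmospheric pressure at 9 am |
Pressure3pm | Atmospheric pressure at 3 pm |
Cloud9am | Fraction of the sky covered by cloud at 9 am |
Cloud3pm | Fraction of the sky covered by cloud at 3 pm |
Temp9am | Temperature at 9 am |
Temp3pm | Temperature at 3 pm |
RainToday | Whether it rained today |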
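The claim that the two sets differ only in the RainTomorrow column can be checked by comparing column indexes. The following is a minimal sketch: `compare_columns` is a hypothetical helper, and the two tiny frames are stand-ins with only a few of the real column names, not the actual files.

```python
import pandas as pd


def compare_columns(train: pd.DataFrame, test: pd.DataFrame):
    """Return (columns only in train, columns only in test) as sorted lists."""
    only_in_train = train.columns.difference(test.columns).tolist()
    only_in_test = test.columns.difference(train.columns).tolist()
    return only_in_train, only_in_test


# Stand-in frames with a handful of the documented columns (assumption:
# the real frames would come from pd.read_csv on the train and test files).
train_demo = pd.DataFrame(columns=["Date", "Location", "RainToday", "RainTomorrow"])
test_demo = pd.DataFrame(columns=["Date", "Location", "RainToday"])

only_in_train, only_in_test = compare_columns(train_demo, test_demo)
print("only in train:", only_in_train)
print("only in test:", only_in_test)
```

Running the same helper on the loaded training and test frames should report RainTomorrow as the only column missing from the test set, if the description above holds.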
2 Importing and analyzing the data
2.1 Browsing the data
#%%
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Load the training data
weather = pd.read_csv(r"./work/train.csv",index_col=False)
# Look at the first five rows
print(weather.head(5))
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine \
0 2012-03-07 Dartmoor 10.1 24.6 1.2 2.6 11.3
1 2014-12-21 Newcastle 17.0 28.7 0.0 NaN NaN
2 2011-01-14 Albany 17.9 20.8 0.1 9.6 12.1
3 2011-10-19 Ballarat 8.9 25.5 0.0 NaN NaN
4 2013-11-04 Uluru 21.3 38.3 0.0 NaN NaN
WindGustDir WindGustSpeed WindDir9am ... Humidity9am \
0 ESE 54.0 SE ... 86.0
1 NaN NaN NE ... 63.0
2 NaN NaN NE ... 61.0
3 NNE 54.0 N ... 56.0
4 ENE 57.0 E ... 15.0
Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am \
0 41.0 1028.6 1025.7 NaN NaN 13.9
1 58.0 NaN NaN 1.0 1.0 24.0
2 67.0 1005.1 1007.6 5.0 4.0 19.8
3 44.0 1027.1 1022.9 0.0 NaN 16.7
4 9.0 1018.4 1013.9 NaN NaN 28.8
Temp3pm RainToday RainTomorrow
0 23.0 Yes No
1 28.0 No No
2 20.0 No No
3 25.0 No No
4 36.9 No No
[5 rows x 23 columns]
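Even a quick look at the data shows that there is plenty of work to do, such as handling NaN values and string-typed variables. Both are among the harder parts of feature engineering.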
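Both issues can be quantified directly. The sketch below rebuilds a few columns of the five rows printed above as an in-memory frame (a stand-in for `weather`, so that the snippet runs without the CSV file), then counts the NaN values per column and lists the non-numeric columns.

```python
import numpy as np
import pandas as pd

# A subset of the columns from the five rows shown in the head() output.
sample = pd.DataFrame({
    "Date": ["2012-03-07", "2014-12-21", "2011-01-14", "2011-10-19", "2013-11-04"],
    "Location": ["Dartmoor", "Newcastle", "Albany", "Ballarat", "Uluru"],
    "MinTemp": [10.1, 17.0, 17.9, 8.9, 21.3],
    "Evaporation": [2.6, np.nan, 9.6, np.nan, np.nan],
    "Sunshine": [11.3, np.nan, 12.1, np.nan, np.nan],
    "WindGustDir": ["ESE", np.nan, np.nan, "NNE", "ENE"],
    "RainToday": ["Yes", "No", "No", "No", "No"],
})

# Number of missing values in each column.
nan_counts = sample.isna().sum()

# Columns that are not numeric and therefore need encoding later.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(nan_counts)
print(text_columns)
```

Replacing `sample` with `weather` applies the same two checks to the full training set.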
2.2 Exploring the data
2.2.1 Checking data types
#%%
# Check the data type and non-null count of each column
weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116368 entries, 0 to 116367
Data columns (total 23 columns):
Date 116368 non-null object
Location 116368 non-null object
MinTemp 115160 non-null float64
MaxTemp 115354 non-null float64
Rainfall 113762 non-null float64
Evaporation 66053 non-null float64
Sunshine 60402 non-null float64
WindGustDir 108111 non-null object
WindGustSpeed 108158 non-null float64
WindDir9am 107925 non-null object
WindDir3pm 112986 non-null object
WindSpeed9am 114940 non-null float64
WindSpeed3pm 113920 non-null float64
Humidity9am 114227 non-null float64
Humidity3pm 112736 non-null float64
Pressure9am 104345 non-null float64
Pressure3pm 104377 non-null float64
Cloud9am 71571 non-null float64
Cloud3pm 68773 non-null float64
Temp9am 114947 non-null float64
Temp3pm 113466 non-null float64
RainToday 113762 non-null object
RainTomorrow 113776 non-null object
dtypes: float64(16), object(7)
memory usage: 20.4+ MB
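The non-null counts above translate directly into missing-value ratios. As a small worked example, the sketch below copies a few counts from the `info()` output and computes the fraction of missing values per column. The selection of columns is only illustrative.

```python
import pandas as pd

# Total number of rows, from "RangeIndex: 116368 entries".
n_rows = 116368

# Non-null counts copied from the info() output above.
non_null = pd.Series({
    "Evaporation": 66053,
    "Sunshine": 60402,
    "Cloud9am": 71571,
    "Cloud3pm": 68773,
    "Pressure9am": 104345,
    "RainTomorrow": 113776,
})

# Fraction of missing values per column, largest first.
missing_ratio = (1 - non_null / n_rows).sort_values(ascending=False)
print(missing_ratio.round(4))
```

With the data loaded, `weather.isna().mean()` gives the same ratios for every column at once. Note that the label RainTomorrow also has missing values (113776 non-null out of 116368 rows).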