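Background and task
- Implement logistic regression with Python and Scikit-Learn, and build a classifier that predicts whether it will rain tomorrow in Australia.
- Train a binary classification model using logistic regression.
- The project uses the Rain in Australia dataset (weatherAUS.csv).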
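Dataset column descriptions
- Date: date of the observation
- Location: common name of the weather station's location
- MinTemp: minimum temperature (degrees Celsius)
- MaxTemp: maximum temperature (degrees Celsius)
- Rainfall: rainfall recorded for the day (mm)
- Evaporation: the so-called Class A pan evaporation (mm) in the 24 hours to 9am
- Sunshine: number of hours of bright sunshine in the day
- WindGustDir: direction of the strongest wind gust in the 24 hours to midnight
- WindGustSpeed: speed (km/h) of the strongest wind gust in the 24 hours to midnight
- WindDir9am: wind direction at 9am
- WindDir3pm: wind direction at 3pm
- WindSpeed9am: wind speed (km/h) averaged over the 10 minutes before 9am
- WindSpeed3pm: wind speed (km/h) averaged over the 10 minutes before 3pm
- Humidity9am: humidity (percent) at 9am
- Humidity3pm: humidity (percent) at 3pm
- Pressure9am: atmospheric pressure (hPa) reduced to mean sea level at 9am
- Pressure3pm: atmospheric pressure (hPa) reduced to mean sea level at 3pm
- Cloud9am: fraction of the sky obscured by cloud at 9am, measured in oktas (eighths of the sky)
- Cloud3pm: fraction of the sky obscured by cloud at 3pm, measured in oktas (eighths of the sky)
- Temp9am: temperature (degrees Celsius) at 9am
- Temp3pm: temperature (degrees Celsius) at 3pm
- RainToday: Boolean: 1 if the precipitation (mm) in the 24 hours to 9am exceeds 1 mm, otherwise 0 (stored as Yes/No strings in the CSV)
- RainTomorrow: whether it rains the next day; this is the target variable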
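The RainToday rule in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the dataset's own preprocessing: the rainfall values and the column name `RainTodayFlag` are hypothetical, and the Yes/No mapping assumes the CSV uses exactly those two strings for non-missing values.

```python
import pandas as pd

# Hypothetical rainfall readings in mm (illustrative, not taken from the dataset)
sample = pd.DataFrame({"Rainfall": [0.0, 0.6, 1.0, 1.2, 15.4]})

# The flag is 1 only when rainfall strictly exceeds 1 mm, as described above
sample["RainTodayFlag"] = (sample["Rainfall"] > 1.0).astype(int)

# Assumed string encoding of the flag in the CSV, mapped to 1/0
labels = pd.Series(["No", "No", "No", "Yes", "Yes"])
encoded = labels.map({"No": 0, "Yes": 1})

print(sample)
print(encoded.tolist())
```

Note that a reading of exactly 1.0 mm does not count as rain under this rule, because the comparison is strict.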
Reading the data
import pandas as pd
import numpy as np
raw_df = pd.read_csv('./weather-dataset-rattle-package/weatherAUS.csv')
raw_df.head()
|  | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |
5 rows × 23 columns
raw_df.shape
(145460, 23)
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 145460 non-null object
1 Location 145460 non-null object
2 MinTemp 143975 non-null float64
3 MaxTemp 144199 non-null float64
4 Rainfall 142199 non-null float64
5 Evaporation 82670 non-null float64
6 Sunshine 75625 non-null float64
7 WindGustDir 135134 non-null object
8 WindGustSpeed 135197 non-null float64
9 WindDir9am 134894 non-null object
10 WindDir3pm 141232 non-null object
11 WindSpeed9am 143693 non-null float64
12 WindSpeed3pm 142398 non-null float64
13 Humidity9am 142806 non-null float64
14 Humidity3pm 140953 non-null float64
15 Pressure9am 130395 non-null float64
16 Pressure3pm 130432 non-null float64
17 Cloud9am 89572 non-null float64
18 Cloud3pm 86102 non-null float64
19 Temp9am 143693 non-null float64
20 Temp3pm 141851 non-null float64
21 RainToday 142199 non-null object
22 RainTomorrow 142193 non-null object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
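The non-null counts above translate directly into missing-value percentages. The sketch below works only from numbers already printed in the `raw_df.shape` and `raw_df.info()` output, for a handful of columns; the variable names are illustrative.

```python
import pandas as pd

total_rows = 145460  # row count from raw_df.shape

# Non-null counts copied from the raw_df.info() output above
non_null = pd.Series({
    "Evaporation": 82670,
    "Sunshine": 75625,
    "Cloud9am": 89572,
    "Cloud3pm": 86102,
    "RainTomorrow": 142193,
})

# Share of rows with a missing value, largest first
missing_pct = ((1 - non_null / total_rows) * 100).sort_values(ascending=False)
print(missing_pct.round(2))
```

On the loaded frame, `raw_df.isna().mean() * 100` should give the same figures for every column at once. Sunshine, Evaporation and the two cloud columns are each missing in roughly 40 to 50 percent of rows, which matters when deciding how to impute or drop them later.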
raw_df.describe()
|  | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 143975.000000 | 144199.000000 | 142199.000000 | 82670.000000 | 75625.000000 | 135197.000000 | 143693.000000 | 142398.000000 | 142806.000000 | 140953.000000 | 130395.00000 | 130432.000000 | 89572.000000 | 86102.000000 | 143693.000000 | 141851.00000 |
| mean | 12.194034 | 23.221348 | 2.360918 | 5.468232 | 7.611178 | 40.035230 | 14.043426 | 18.662657 | 68.880831 | 51.539116 | 1017.64994 | 1015.255889 | 4.447461 | 4.509930 | 16.990631 | 21.68339 |
| std | 6.398495 | 7.119049 | 8.478060 | 4.193704 | 3.785483 | 13.607062 | 8.915375 | 8.809800 | 19.029164 | 20.795902 | 7.10653 | 7.037414 | 2.887159 | 2.720357 | 6.488753 | 6.93665 |
| min | -8.500000 | -4.800000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 980.50000 | 977.100000 | 0.000000 | 0.000000 | -7.200000 | -5.40000 |
| 25% | 7.600000 | 17.900000 | 0.000000 | 2.600000 | 4.800000 | 31.000000 | 7.000000 | 13.000000 | 57.000000 | 37.000000 | 1012.90000 | 1010.400000 | 1.000000 | 2.000000 | 12.300000 | 16.60000 |
| 50% | 12.000000 | 22.600000 | 0.000000 | 4.800000 | 8.400000 | 39.000000 | 13.000000 | 19.000000 | 70.000000 | 52.000000 | 1017.60000 | 1015.200000 | 5.000000 | 5.000000 | 16.700000 | 21.10000 |
| 75% | 16.900000 | 28.200000 | 0.800000 | 7.400000 | 10.600000 | 48.000000 | 19.000000 | 24.000000 | 83.000000 | 66.000000 | 1022.40000 | 1020.000000 | 7.000000 | 7.000000 | 21.600000 | 26.40000 |
| max | 33.900000 | 48.100000 | 371.000000 | 145.000000 | 14.500000 | 135.000000 | 130.000000 | 87.000000 | 100.000000 | 100.000000 | 1041.00000 | 1039.600000 | 9.000000 | 9.000000 | 40.200000 | 46.70000 |
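The Rainfall quartiles in the summary above already hint at heavy right skew: the median is 0 mm and the 75th percentile is 0.8 mm, yet the maximum is 371 mm. As a rough sketch, assuming the common 1.5 × IQR convention for outlier fences (also the default whisker rule in a seaborn box plot), the limits can be worked out from those printed values alone:

```python
# Rainfall quartiles and maximum (mm), copied from the describe() output above
q1, q3 = 0.0, 0.8
max_rainfall = 371.0

# Interquartile range and the 1.5 * IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr:.1f} mm")
print(f"Fences: [{lower_fence:.1f}, {upper_fence:.1f}] mm")
print(f"Maximum lies above the upper fence: {max_rainfall > upper_fence}")
```

Under this convention any day with more than about 2 mm of rain would be drawn as an outlier, so a box plot of Rainfall is expected to show a very compressed box with a long tail of points above it.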
Exploratory data analysis
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
sns.boxplot(y = raw_df.Rainfall);
sns.distplot(a = raw_df.Rainfall, label = 'Rainfall'