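Background and task
- Implemented logistic regression with Python and Scikit-Learn and built a classifier that predicts whether it will rain tomorrow in Australia.
- Trained a binary classification model using logistic regression.
- The project uses the Australian Rain dataset.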
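As a rough illustration of what a logistic regression classifier computes, the sketch below turns a weighted sum of features into a probability with the sigmoid function and then applies a 0.5 threshold. The weights, bias, and feature values are hypothetical numbers chosen for illustration, not values learned from the dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_rain(features, weights, bias, threshold=0.5):
    """Return the predicted probability of rain and the 0/1 label."""
    prob = sigmoid(features @ weights + bias)
    return prob, (prob >= threshold).astype(int)

# Hypothetical parameters: one weight per feature plus a bias term.
weights = np.array([0.05, -0.02])
bias = -1.0

# Two made-up observations with two features each.
features = np.array([[80.0, 10.0],
                     [20.0, 30.0]])

probabilities, labels = predict_rain(features, weights, bias)
print(probabilities, labels)
```

Training amounts to choosing the weights and bias so that these probabilities match the observed labels as closely as possible; Scikit-Learn's `LogisticRegression` handles that step.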
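Dataset columns
- Date: date of the observation
- Location: common name of the weather station's location
- MinTemp: minimum temperature (°C)
- MaxTemp: maximum temperature (°C)
- Rainfall: rainfall recorded for the day (mm)
- Evaporation: the so-called Class A pan evaporation (mm) in the 24 hours to 9 am
- Sunshine: number of hours of bright sunshine in the day
- WindGustDir: direction of the strongest wind gust in the 24 hours to midnight
- WindGustSpeed: speed (km/h) of the strongest wind gust in the 24 hours to midnight
- WindDir9am: wind direction at 9 am
- WindDir3pm: wind direction at 3 pm
- WindSpeed9am: wind speed (km/h) averaged over the 10 minutes before 9 am
- WindSpeed3pm: wind speed (km/h) averaged over the 10 minutes before 3 pm
- Humidity9am: humidity (percent) at 9 am
- Humidity3pm: humidity (percent) at 3 pm
- Pressure9am: atmospheric pressure (hPa) reduced to mean sea level at 9 am
- Pressure3pm: atmospheric pressure (hPa) reduced to mean sea level at 3 pm
- Cloud9am: fraction of the sky obscured by cloud at 9 am, measured in oktas, i.e. how many eighths of the sky are covered
- Cloud3pm: fraction of the sky obscured by cloud at 3 pm, measured in oktas (eighths)
- Temp9am: temperature (°C) at 9 am
- Temp3pm: temperature (°C) at 3 pm
- RainToday: Boolean flag: 1 if precipitation (mm) in the 24 hours to 9 am exceeds 1 mm, otherwise 0 (stored as Yes/No in the CSV)
- RainTomorrow: whether it rains the next day; this is the target variable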
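The RainToday rule and the Yes/No labels described above can be sketched as follows. This is a minimal example on a toy frame with made-up values; it only assumes the "exceeds 1 mm" rule stated in the column description and the Yes/No encoding seen in the CSV.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the real data comes from weatherAUS.csv.
toy = pd.DataFrame({
    "Rainfall": [0.6, 0.0, 1.0, 3.2, np.nan],
    "RainTomorrow": ["No", "No", "Yes", "Yes", np.nan],
})

# "Yes" only when rainfall exceeds 1 mm; rows with unknown rainfall stay missing.
rain_today = (
    (toy["Rainfall"] > 1)
    .map({True: "Yes", False: "No"})
    .where(toy["Rainfall"].notna())
)

# Encode the target as 0/1 for modelling; missing labels remain NaN.
encoded_target = toy["RainTomorrow"].map({"No": 0, "Yes": 1})

print(rain_today.tolist())
print(encoded_target.tolist())
```

Note that exactly 1.0 mm does not count as rain under this rule, because the threshold is strict.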
Reading the data
import pandas as pd
import numpy as np
raw_df = pd.read_csv('./weather-dataset-rattle-package/weatherAUS.csv')
raw_df.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |
5 rows × 23 columns
raw_df.shape
(145460, 23)
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 145460 non-null object
1 Location 145460 non-null object
2 MinTemp 143975 non-null float64
3 MaxTemp 144199 non-null float64
4 Rainfall 142199 non-null float64
5 Evaporation 82670 non-null float64
6 Sunshine 75625 non-null float64
7 WindGustDir 135134 non-null object
8 WindGustSpeed 135197 non-null float64
9 WindDir9am 134894 non-null object
10 WindDir3pm 141232 non-null object
11 WindSpeed9am 143693 non-null float64
12 WindSpeed3pm 142398 non-null float64
13 Humidity9am 142806 non-null float64
14 Humidity3pm 140953 non-null float64
15 Pressure9am 130395 non-null float64
16 Pressure3pm 130432 non-null float64
17 Cloud9am 89572 non-null float64
18 Cloud3pm 86102 non-null float64
19 Temp9am 143693 non-null float64
20 Temp3pm 141851 non-null float64
21 RainToday 142199 non-null object
22 RainTomorrow 142193 non-null object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
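The non-null counts above show that several columns are far from complete: Sunshine, for instance, has 75625 non-null values out of 145460, so roughly 48% of its entries are missing, and RainTomorrow itself is missing in some rows. A minimal sketch for quantifying this is shown below; `missing_share` is a hypothetical helper name and the toy frame holds made-up values.

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy frame with hypothetical values standing in for raw_df.
toy = pd.DataFrame({
    "Sunshine": [np.nan, np.nan, 8.4, np.nan],
    "MinTemp": [13.4, 7.4, np.nan, 9.2],
    "RainTomorrow": ["No", "No", "Yes", np.nan],
})

share = missing_share(toy)
print(share)

# Rows without a target label cannot be used for supervised training.
labelled = toy.dropna(subset=["RainTomorrow"])
print(len(labelled))
```

On the real data the same call would be `missing_share(raw_df)`. Dropping rows with a missing target is one common choice; how to treat missing feature values is a separate decision.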
raw_df.describe()
| | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 143975.000000 | 144199.000000 | 142199.000000 | 82670.000000 | 75625.000000 | 135197.000000 | 143693.000000 | 142398.000000 | 142806.000000 | 140953.000000 | 130395.00000 | 130432.000000 | 89572.000000 | 86102.000000 | 143693.000000 | 141851.00000 |
| mean | 12.194034 | 23.221348 | 2.360918 | 5.468232 | 7.611178 | 40.035230 | 14.043426 | 18.662657 | 68.880831 | 51.539116 | 1017.64994 | 1015.255889 | 4.447461 | 4.509930 | 16.990631 | 21.68339 |
| std | 6.398495 | 7.119049 | 8.478060 | 4.193704 | 3.785483 | 13.607062 | 8.915375 | 8.809800 | 19.029164 | 20.795902 | 7.10653 | 7.037414 | 2.887159 | 2.720357 | 6.488753 | 6.93665 |
| min | -8.500000 | -4.800000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 980.50000 | 977.100000 | 0.000000 | 0.000000 | -7.200000 | -5.40000 |
| 25% | 7.600000 | 17.900000 | 0.000000 | 2.600000 | 4.800000 | 31.000000 | 7.000000 | 13.000000 | 57.000000 | 37.000000 | 1012.90000 | 1010.400000 | 1.000000 | 2.000000 | 12.300000 | 16.60000 |
| 50% | 12.000000 | 22.600000 | 0.000000 | 4.800000 | 8.400000 | 39.000000 | 13.000000 | 19.000000 | 70.000000 | 52.000000 | 1017.60000 | 1015.200000 | 5.000000 | 5.000000 | 16.700000 | 21.10000 |
| 75% | 16.900000 | 28.200000 | 0.800000 | 7.400000 | 10.600000 | 48.000000 | 19.000000 | 24.000000 | 83.000000 | 66.000000 | 1022.40000 | 1020.000000 | 7.000000 | 7.000000 | 21.600000 | 26.40000 |
| max | 33.900000 | 48.100000 | 371.000000 | 145.000000 | 14.500000 | 135.000000 | 130.000000 | 87.000000 | 100.000000 | 100.000000 | 1041.00000 | 1039.600000 | 9.000000 | 9.000000 | 40.200000 | 46.70000 |
Exploratory data analysis
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
# matplotlib.rcParams['font.size'] = 14
# matplotlib.rcParams['figure.figsize'] = (14, 6)
# matplotlib.rcParams['figure.facecolor'] = '#00000000'
sns.boxplot(y = raw_df.Rainfall);

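# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(x = raw_df.Rainfall, kde = True, stat = 'density', label = 'Rainfall');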
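The box plot of Rainfall is dominated by extreme values: in the summary statistics the 75th percentile is 0.8 mm while the maximum is 371 mm. One common way to put a number on this is the 1.5 × IQR rule that box plots use for their whiskers; with Q1 = 0 and Q3 = 0.8 the upper fence would sit at 2.0 mm. The sketch below applies the rule to a toy series of made-up values; `iqr_fences` is a hypothetical helper, and this is not necessarily how the original analysis treated outliers.

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Lower and upper fences at k times the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up rainfall values (mm) with one extreme day.
toy_rainfall = pd.Series([0.0, 0.0, 0.0, 0.2, 0.6, 0.8, 1.0, 3.2, 371.0])

lower, upper = iqr_fences(toy_rainfall)
outliers = toy_rainfall[(toy_rainfall < lower) | (toy_rainfall > upper)]
print(lower, upper)
print(outliers.tolist())
```

For heavily skewed variables such as rainfall, a wider multiplier (for example `k=3`) is sometimes preferred so that ordinary rainy days are not flagged.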

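This article walks through training a logistic regression model with Python's Scikit-Learn library to predict the weather in Australia, specifically whether it will rain tomorrow. It analyzes the Rain dataset and its meteorological variables, then covers data preprocessing, model building, and performance evaluation, and finally tunes the model with GridSearchCV.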
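The workflow summarized here (preprocessing, model building, evaluation, and tuning with GridSearchCV) could look roughly like the sketch below. It runs on synthetic data that only borrows three column names from the dataset; the imputation strategy, the parameter grid, and the train/test split are illustrative assumptions rather than the choices made in the full article.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the weather data (values are randomly generated).
rng = np.random.default_rng(0)
n = 400
humidity = rng.uniform(10, 100, n)
pressure = rng.normal(1015, 7, n)
rain_today = rng.choice(["No", "Yes"], n)
logit = 0.08 * (humidity - 55) - 0.1 * (pressure - 1015) + (rain_today == "Yes")
target = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X = pd.DataFrame({"Humidity3pm": humidity,
                  "Pressure3pm": pressure,
                  "RainToday": rain_today})
X.iloc[::25, 0] = np.nan  # inject some missing humidity readings

# Numeric columns: median imputation then scaling. Categorical: one-hot.
numeric_cols = ["Humidity3pm", "Pressure3pm"]
categorical_cols = ["RainToday"]
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.25, random_state=0, stratify=target)

# Search over the inverse regularization strength C with 5-fold CV.
search = GridSearchCV(model, {"clf__C": [0.01, 0.1, 1, 10]},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are refit on each cross-validation fold, which avoids leaking information from the validation folds into training.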