Kaggle in practice: modeling rain prediction on the Rain in Australia dataset


Dataset source: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

Dataset and notebook links

Dataset details

The dataset contains one weather observation per day over a period of time; the goal is to predict whether it will rain tomorrow.

Date:The date of observation
Location:The common name of the location of the weather station
MinTemp:The minimum temperature in degrees celsius
MaxTemp:The maximum temperature in degrees celsius
Rainfall:The amount of rainfall recorded for the day in mm
Evaporation:The so-called Class A pan evaporation (mm) in the 24 hours to 9am
Sunshine:The number of hours of bright sunshine in the day.
WindGustDir:The direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed:The speed (km/h) of the strongest wind gust in the 24 hours to midnight
WindDir9am:Direction of the wind at 9am
WindDir3pm:Direction of the wind at 3pm
WindSpeed9am:Wind speed (km/hr) averaged over 10 minutes prior to 9am
WindSpeed3pm:Wind speed (km/hr) averaged over 10 minutes prior to 3pm
Humidity9am:Humidity (percent) at 9am
Humidity3pm:Humidity (percent) at 3pm
Pressure9am:Atmospheric pressure (hpa) reduced to mean sea level at 9am
Pressure3pm:Atmospheric pressure (hpa) reduced to mean sea level at 3pm
Cloud9am:Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates a completely clear sky whilst an 8 indicates that it is completely overcast.
Cloud3pm:Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values
Temp9am:Temperature (degrees C) at 9am
Temp3pm:Temperature (degrees C) at 3pm
RainToday:Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0
RISK_MM:The amount of next day rain in mm. Used to create response variable RainTomorrow. A kind of measure of the "risk".
RainTomorrow:The target variable. Did it rain tomorrow?

Feature details

Date: the day the observation was made
Location: the city where the observation was made
MinTemp: minimum temperature that day (°C)
MaxTemp: maximum temperature that day (°C)
Rainfall: rainfall that day (mm)
Evaporation: Class A pan evaporation (mm) in the 24 hours to 9am
Sunshine: hours of bright sunshine during the day
WindGustDir: direction of the strongest gust in the 24 hours to midnight
WindGustSpeed: speed (km/h) of the strongest gust in the 24 hours to midnight
WindDir9am: wind direction at 9am
WindDir3pm: wind direction at 3pm
WindSpeed9am: wind speed (km/h) averaged over the 10 minutes before 9am, i.e. 8:50–9:00
WindSpeed3pm: wind speed (km/h) averaged over the 10 minutes before 3pm, i.e. 14:50–15:00
Humidity9am: humidity at 9am (%)
Humidity3pm: humidity at 3pm (%)
Pressure9am: atmospheric pressure at 9am (hPa), reduced to mean sea level
Pressure3pm: atmospheric pressure at 3pm (hPa), reduced to mean sea level
Cloud9am: cloud cover at 9am, in oktas on [0, 8]: 0 means an almost clear sky, 8 means the sky is almost completely covered
Cloud3pm: cloud cover at 3pm
Temp9am: temperature at 9am (°C)
Temp3pm: temperature at 3pm (°C)
RainToday: whether it rained today
RISK_MM: next-day rainfall in mm, a "risk" measure created by the dataset provider
Reminder from the dataset provider: "Note: You should exclude the variable Risk-MM when training a binary classification model. Not excluding it will leak the answers to your model and reduce its predictability." In other words, drop this feature before modeling.
RainTomorrow: the label
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('./weatherAUS.csv')
dataset.shape
(142193, 24)
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 24 columns):
Date             142193 non-null object
Location         142193 non-null object
MinTemp          141556 non-null float64
MaxTemp          141871 non-null float64
Rainfall         140787 non-null float64
Evaporation      81350 non-null float64
Sunshine         74377 non-null float64
WindGustDir      132863 non-null object
WindGustSpeed    132923 non-null float64
WindDir9am       132180 non-null object
WindDir3pm       138415 non-null object
WindSpeed9am     140845 non-null float64
WindSpeed3pm     139563 non-null float64
Humidity9am      140419 non-null float64
Humidity3pm      138583 non-null float64
Pressure9am      128179 non-null float64
Pressure3pm      128212 non-null float64
Cloud9am         88536 non-null float64
Cloud3pm         85099 non-null float64
Temp9am          141289 non-null float64
Temp3pm          139467 non-null float64
RainToday        140787 non-null object
RISK_MM          142193 non-null float64
RainTomorrow     142193 non-null object
dtypes: float64(17), object(7)
memory usage: 26.0+ MB
dataset.isnull().sum() / len(dataset)
Date             0.000000
Location         0.000000
MinTemp          0.004480
MaxTemp          0.002265
Rainfall         0.009888
Evaporation      0.427890
Sunshine         0.476929
WindGustDir      0.065615
WindGustSpeed    0.065193
WindDir9am       0.070418
WindDir3pm       0.026570
WindSpeed9am     0.009480
WindSpeed3pm     0.018496
Humidity9am      0.012476
Humidity3pm      0.025388
Pressure9am      0.098556
Pressure3pm      0.098324
Cloud9am         0.377353
Cloud3pm         0.401525
Temp9am          0.006358
Temp3pm          0.019171
RainToday        0.009888
RISK_MM          0.000000
RainTomorrow     0.000000
dtype: float64

Univariate analysis

Categorical features
catgorical = [cat for cat in dataset.columns if dataset[cat].dtype == 'O']
catgorical
['Date',
 'Location',
 'WindGustDir',
 'WindDir9am',
 'WindDir3pm',
 'RainToday',
 'RainTomorrow']
for i in catgorical:
    print(i)
    print(len(dataset[i].unique()))
    print()
Date
3436

Location
49

WindGustDir
17

WindDir9am
17

WindDir3pm
17

RainToday
3

RainTomorrow
2

Date has a very large number of unique values, i.e. it is a high-cardinality feature. Note also that the wind-direction columns each show 17 unique values (16 compass directions plus NaN), and RainToday shows 3 (Yes, No, NaN).

Extract the year, month and day from it separately.

dataset['Date'] = pd.to_datetime(dataset['Date'])
dataset['Date']
0        2008-12-01
1        2008-12-02
2        2008-12-03
3        2008-12-04
4        2008-12-05
            ...    
142188   2017-06-20
142189   2017-06-21
142190   2017-06-22
142191   2017-06-23
142192   2017-06-24
Name: Date, Length: 142193, dtype: datetime64[ns]
dataset['Date'].dt.year.unique()
array([2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2007],
      dtype=int64)
year = dataset['Date'].dt.year
month = dataset['Date'].dt.month
day = dataset['Date'].dt.day
dataset.drop(labels=['Date'], axis=1, inplace=True)
dataset['year'] = year
dataset['month'] = month
dataset['day'] = day
dataset.shape
(142193, 26)
# 'Date' was already dropped above, so indexing with the stale catgorical list
# triggers the FutureWarning below and reports Date as 100% missing
dataset.loc[:, catgorical].isnull().sum() / len(dataset)
d:\anaconda_file\lib\site-packages\pandas\core\indexing.py:1404: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)





Date            1.000000
Location        0.000000
WindGustDir     0.065615
WindDir9am      0.070418
WindDir3pm      0.026570
RainToday       0.009888
RainTomorrow    0.000000
dtype: float64
# the missing fractions for these categorical features are all small,
# so filling with the mode is good enough

dataset_ = dataset.copy()

fill_list = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
fill_dict = {key: dataset_[key].mode().values[0] for key in fill_list}
fill_dict


# The loop below does NOT work as intended: Series.fillna aligns a Series
# argument on index, and mode() returns a Series indexed from 0, so almost
# nothing actually gets filled
# for j in fill_list:
#     dataset_[j].fillna(dataset_[j].mode(), inplace=True)


dataset_.fillna(value=fill_dict, inplace=True)
cat = [_ for _ in dataset_.columns if dataset_[_].dtype == 'O']
dataset_.loc[:, cat].isnull().sum()
Location        0
WindGustDir     0
WindDir9am      0
WindDir3pm      0
RainToday       0
RainTomorrow    0
dtype: int64
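The reason the commented-out `fillna(mode())` loop fails can be shown on a tiny example: `mode()` returns a Series indexed from 0, and `fillna` with a Series argument aligns on index, so only rows whose index happens to match get a value. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'a', np.nan, 'b'])

# mode() returns a Series (index 0 -> 'a'), not a scalar
m = s.mode()

# filling with the Series aligns on index: indices 1 and 3 stay NaN
aligned = s.fillna(m)

# filling with the scalar m[0] fills every missing entry
scalar = s.fillna(m[0])
print(aligned.isna().sum(), scalar.isna().sum())
```

This is why the notebook builds `fill_dict` of scalar modes and passes it to `DataFrame.fillna` instead.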
Numerical features
numerical = [_ for _ in dataset_.columns if dataset_[_].dtype != 'O']
# drop the year, month and day columns created above, and RISK_MM along with
# them, since the dataset provider warned that it leaks the answer to the model
numerical = numerical[:-4]
numerical
['MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Cloud9am',
 'Cloud3pm',
 'Temp9am',
 'Temp3pm']
numerical_features = dataset_.loc[:, numerical]

numerical_features.isnull().sum() / numerical_features.shape[0]
MinTemp          0.004480
MaxTemp          0.002265
Rainfall         0.009888
Evaporation      0.427890
Sunshine         0.476929
WindGustSpeed    0.065193
WindSpeed9am     0.009480
WindSpeed3pm     0.018496
Humidity9am      0.012476
Humidity3pm      0.025388
Pressure9am      0.098556
Pressure3pm      0.098324
Cloud9am         0.377353
Cloud3pm         0.401525
Temp9am          0.006358
Temp3pm          0.019171
dtype: float64
numerical_features.describe()
(describe() output, transposed for readability; values rounded to two decimals)

                 count     mean     std     min     25%     50%     75%     max
MinTemp         141556    12.19    6.40    -8.5     7.6    12.0    16.8    33.9
MaxTemp         141871    23.23    7.12    -4.8    17.9    22.6    28.2    48.1
Rainfall        140787     2.35    8.47     0.0     0.0     0.0     0.8   371.0
Evaporation      81350     5.47    4.19     0.0     2.6     4.8     7.4   145.0
Sunshine         74377     7.62    3.78     0.0     4.9     8.5    10.6    14.5
WindGustSpeed   132923    39.98   13.59     6.0    31.0    39.0    48.0   135.0
WindSpeed9am    140845    14.00    8.89     0.0     7.0    13.0    19.0   130.0
WindSpeed3pm    139563    18.64    8.80     0.0    13.0    19.0    24.0    87.0
Humidity9am     140419    68.84   19.05     0.0    57.0    70.0    83.0   100.0
Humidity3pm     138583    51.48   20.80     0.0    37.0    52.0    66.0   100.0
Pressure9am     128179  1017.65    7.11   980.5  1012.9  1017.6  1022.4  1041.0
Pressure3pm     128212  1015.26    7.04   977.1  1010.4  1015.2  1020.0  1039.6
Cloud9am         88536     4.44    2.89     0.0     1.0     5.0     7.0     9.0
Cloud3pm         85099     4.50    2.72     0.0     2.0     5.0     7.0     9.0
Temp9am         141289    16.99    6.49    -7.2    12.3    16.7    21.6    40.2
Temp3pm         139467    21.69    6.94    -5.4    16.6    21.1    26.4    46.7

Comparing the 75th percentile with the maximum suggests heavy right tails (potential outliers) in Rainfall, Evaporation, WindGustSpeed, WindSpeed9am and WindSpeed3pm.

figure, axes = plt.subplots(2, 3, figsize=(15, 15))

sns.boxplot(
    x='RainTomorrow', y='Rainfall',
    data=dataset_, ax=axes[0, 0], palette="Set3"
)

sns.boxplot(
    x='RainTomorrow', y='Evaporation',
    data=dataset_, ax=axes[0, 1], palette="Set3"
)

sns.boxplot(
    x='RainTomorrow', y='WindGustSpeed',
    data=dataset_, ax=axes[1, 0], palette="Set3"
)

sns.boxplot(
    x='RainTomorrow', y='WindSpeed9am',
    data=dataset_, ax=axes[1, 1], palette="Set3"
)

sns.boxplot(
    x='RainTomorrow', y='WindSpeed3pm',
    data=dataset_, ax=axes[1, 2], palette="Set3"
)

plt.show()

(figure: boxplots of Rainfall, Evaporation, WindGustSpeed, WindSpeed9am and WindSpeed3pm grouped by RainTomorrow)

The boxplots show that these numerical features contain quite a few outliers.

# look at the distributions of these numerical features

figure, axes = plt.subplots(2, 3, figsize=(30, 15))

sns.distplot(
    a=dataset_['Rainfall'].dropna(),
    ax=axes[0, 0]
)

sns.distplot(
    a=dataset_['Evaporation'].dropna(),
    ax=axes[0, 1]
)

sns.distplot(
    a=dataset_['WindGustSpeed'].dropna(),
    ax=axes[1, 0]
)

sns.distplot(
    a=dataset_['WindSpeed9am'].dropna(),
    ax=axes[1, 1]
)

sns.distplot(
    a=dataset_['WindSpeed3pm'].dropna(),
    ax=axes[1, 2]
)

plt.show()

(figure: distribution plots of Rainfall, Evaporation, WindGustSpeed, WindSpeed9am and WindSpeed3pm)

All of these numerical features show clear right skew; use the interquartile range (IQR) to locate the outliers.

_list = ['Rainfall','Evaporation','WindGustSpeed','WindSpeed9am','WindSpeed3pm']

def find_outliers(df, feature):
    # flag values beyond an extreme fence of 3 * IQR (the conventional fence is 1.5 * IQR)
    IQR = df[feature].quantile(0.75) - df[feature].quantile(0.25)
    Lower_fence = df[feature].quantile(0.25) - (IQR * 3)
    Upper_fence = df[feature].quantile(0.75) + (IQR * 3)
    print('{feature} outliers are values < {lowerboundary} or > {upperboundary}'\
          .format(feature=feature, lowerboundary=Lower_fence, upperboundary=Upper_fence))
    out_of_bottom = (df[feature] < Lower_fence).sum()
    out_of_top = (df[feature] > Upper_fence).sum()
    print(f'the number of upper outlier {out_of_top}')
    print(f'the number of lower outlier {out_of_bottom}')
    
    
for feature in _list:
    find_outliers(dataset_, feature)
    print()
Rainfall outliers are values < -2.4000000000000004 or > 3.2
the number of upper outlier 20462
the number of lower outlier 0

Evaporation outliers are values < -11.800000000000002 or > 21.800000000000004
the number of upper outlier 471
the number of lower outlier 0

WindGustSpeed outliers are values < -20.0 or > 99.0
the number of upper outlier 150
the number of lower outlier 0

WindSpeed9am outliers are values < -29.0 or > 55.0
the number of upper outlier 107
the number of lower outlier 0

WindSpeed3pm outliers are values < -20.0 or > 57.0
the number of upper outlier 81
the number of lower outlier 0

For Rainfall the Australian pattern seems to be that when it rains at all, it rains heavily, so the values above Q3 + 3*IQR are kept; for the other four features those extreme values will be capped.

# first fill the missing numerical values with the median, then handle the outliers

# when a feature contains outliers, the median is the statistic to impute with
'''
Key point when imputing missing values:
the fill statistics must be computed on the training set only, and then applied
to both the training set and the test set; computing them on the full data
leaks test information into preprocessing.
'''
def fill_with_train_median(train_df, test_df, features):
    # helper sketch of the rule above (the cells below do the same inline)
    for f in features:
        med = train_df[f].median()
        train_df[f] = train_df[f].fillna(med)
        test_df[f] = test_df[f].fillna(med)
dataset_.drop(columns=['RISK_MM'], inplace=True)
dataset_.shape
(142193, 25)
X = dataset_.drop(columns=['RainTomorrow'])
y = dataset_['RainTomorrow']
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# after train_test_split, X_train etc. are still DataFrames
X_train.shape, X_test.shape
((113754, 24), (28439, 24))
# compute each feature's median on the training set and fill both sets with it

for df1 in (X_train, X_test):
    for j in numerical:
        col_median = X_train[j].median()
        df1[j].fillna(col_median, inplace=True)
d:\anaconda_file\lib\site-packages\pandas\core\generic.py:6288: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
X_train.isnull().sum(), X_test.isnull().sum()
(Location         0
 MinTemp          0
 MaxTemp          0
 Rainfall         0
 Evaporation      0
 Sunshine         0
 WindGustDir      0
 WindGustSpeed    0
 WindDir9am       0
 WindDir3pm       0
 WindSpeed9am     0
 WindSpeed3pm     0
 Humidity9am      0
 Humidity3pm      0
 Pressure9am      0
 Pressure3pm      0
 Cloud9am         0
 Cloud3pm         0
 Temp9am          0
 Temp3pm          0
 RainToday        0
 year             0
 month            0
 day              0
 dtype: int64, Location         0
 MinTemp          0
 MaxTemp          0
 Rainfall         0
 Evaporation      0
 Sunshine         0
 WindGustDir      0
 WindGustSpeed    0
 WindDir9am       0
 WindDir3pm       0
 WindSpeed9am     0
 WindSpeed3pm     0
 Humidity9am      0
 Humidity3pm      0
 Pressure9am      0
 Pressure3pm      0
 Cloud9am         0
 Cloud3pm         0
 Temp9am          0
 Temp3pm          0
 RainToday        0
 year             0
 month            0
 day              0
 dtype: int64)
# cap values above Q3 + 3*IQR for the numerical features other than Rainfall;
# as shown above, no values fall below Q1 - 3*IQR


def process_outliers(df3, Top, feature_):
    # clip values above the threshold down to the threshold
    return np.where(df3[feature_] > Top, Top, df3[feature_])

'''
Capping thresholds (the upper fences computed above):
Evaporation:   21.8
WindGustSpeed: 99.0
WindSpeed9am:  55.0
WindSpeed3pm:  57.0
'''


threshold_dict = {'Evaporation': 21.8, 'WindGustSpeed': 99.0, 'WindSpeed9am': 55.0, 'WindSpeed3pm': 57.0}
_list = ['Evaporation', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm']

for df3 in (X_train, X_test):
    for feature in _list:
        top = threshold_dict.get(feature)
        df3[feature] = process_outliers(df3, top, feature)
d:\anaconda_file\lib\site-packages\ipykernel_launcher.py:33: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
X_train[_list].max(), X_test[_list].max()
(Evaporation      21.8
 WindGustSpeed    99.0
 WindSpeed9am     55.0
 WindSpeed3pm     57.0
 dtype: float64, Evaporation      21.8
 WindGustSpeed    99.0
 WindSpeed9am     55.0
 WindSpeed3pm     57.0
 dtype: float64)
# one-hot encode the categorical features

catgorical = [
    'Location',
 'WindGustDir',
 'WindDir9am',
 'WindDir3pm',
 'RainToday',
]

for i in catgorical:
    print(X_train.loc[:, i].value_counts())
Canberra            2717
Sydney              2671
Hobart              2593
Darwin              2560
Brisbane            2550
Perth               2548
Ballarat            2448
Adelaide            2446
MountGambier        2434
Sale                2420
Watsonia            2420
MelbourneAirport    2415
Nuriootpa           2413
Bendigo             2412
PerthAirport        2412
AliceSprings        2411
Woomera             2410
Cobar               2408
SydneyAirport       2407
Launceston          2404
WaggaWagga          2403
Tuggeranong         2402
Albury              2393
Townsville          2389
Wollongong          2385
Portland            2384
Albany              2382
BadgerysCreek       2381
NorfolkIsland       2380
Penrith             2379
Newcastle           2378
CoffsHarbour        2377
Cairns              2372
Mildura             2369
Dartmoor            2358
Witchcliffe         2356
GoldCoast           2346
NorahHead           2344
Richmond            2340
SalmonGums          2338
MountGinini         2323
Moree               2300
Walpole             2256
PearceRAAF          2225
Williamtown         2022
Melbourne           1926
Nhil                1282
Uluru               1238
Katherine           1227
Name: Location, dtype: int64
W      15350
SE      7463
E       7255
N       7211
SSE     7197
S       7172
WSW     7164
SW      7028
SSW     6888
WNW     6427
ENE     6360
NW      6351
ESE     5797
NE      5674
NNW     5251
NNE     5166
Name: WindGustDir, dtype: int64
N      17112
SE      7286
E       7213
SSE     7107
NW      6904
S       6771
SW      6608
W       6557
NNE     6326
NNW     6294
ENE     6214
NE      6073
SSW     6034
ESE     6034
WNW     5755
WSW     5466
Name: WindDir9am, dtype: int64
SE     11540
W       7967
S       7646
WSW     7469
SW      7373
SSE     7275
N       6988
WNW     6913
NW      6732
ESE     6722
E       6669
NE      6524
SSW     6371
ENE     6226
NNW     6203
NNE     5136
Name: WindDir3pm, dtype: int64
No     88530
Yes    25224
Name: RainToday, dtype: int64
X_train = X_train.replace({'No': 0, 'Yes': 1})
X_test = X_test.replace({'No': 0, 'Yes': 1})
X_train_temp = X_train.copy()
X_test_temp = X_test.copy()

X_train_temp = pd.get_dummies(X_train_temp, columns=catgorical, drop_first=True)
X_test_temp = pd.get_dummies(X_test_temp, columns=catgorical, drop_first=True)

X_train_temp.shape, X_test_temp.shape
((113754, 113), (28439, 113))
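Calling `pd.get_dummies` on the train and test sets separately only works here because this particular split happens to contain every category in both sets; if a category were missing from the test split, the two column sets would diverge. A defensive sketch (the `train`/`test` frames are hypothetical) that realigns the test columns to the training columns:

```python
import pandas as pd

train = pd.DataFrame({'dir': ['N', 'S', 'E'], 'x': [1, 2, 3]})
test = pd.DataFrame({'dir': ['N', 'N'], 'x': [4, 5]})  # 'S' and 'E' absent

train_d = pd.get_dummies(train, columns=['dir'])
test_d = pd.get_dummies(test, columns=['dir'])

# reindex the test set to the training columns, filling absent dummies with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(train_d.columns) == list(test_d.columns))
```

After the `reindex`, both frames share the same 113-column layout in the real notebook's case.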
X_train_temp
(truncated notebook display of X_train_temp: 113754 rows × 113 columns — the numerical features followed by the one-hot columns, e.g. WindDir3pm_NW … WindDir3pm_WSW, RainToday_1)

from sklearn.preprocessing import StandardScaler

X = pd.concat([X_train_temp, X_test_temp])
y = pd.concat([y_train, y_test])
print(X.shape)
print(y.shape)


scaler = StandardScaler()
X_train_temp = scaler.fit_transform(X_train_temp)
# fit the scaler on the training set only; calling fit_transform on the test
# set would let test-set statistics leak into preprocessing
X_test_temp = scaler.transform(X_test_temp)


y
(142193, 113)
(142193,)





136356    Yes
7859       No
50687      No
98843      No
5568       No
         ... 
67274      No
107403    Yes
69336      No
48522      No
4650      Yes
Name: RainTomorrow, Length: 142193, dtype: object
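The scaler must be fit on the training data only; wrapping the scaler and the model in a scikit-learn Pipeline enforces that discipline automatically inside cross-validation, because the scaler is re-fit on each training fold. A minimal sketch on synthetic data (`X_demo`/`y_demo` are illustrative, not from this dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# the scaler is re-fit on each CV training fold, never on the held-out fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs'))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```

The same pattern would also remove the mild leakage in the `cross_val_score(LR, X, y, cv=10)` call below, where `X` was preprocessed before splitting.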

Modeling

Logistic regression
1. Fit the model directly
2. Recursive feature elimination (RFE)
3. Embedded (model-based) feature selection
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(X_train_temp, y_train)
d:\anaconda_file\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
LR.score(X_test_temp, y_test)
0.8482365765322268
%%time
import warnings
from sklearn.model_selection import cross_val_score

warnings.filterwarnings('ignore')

LR = LogisticRegression(n_jobs=-1)

cross_val_score(LR, X, y, cv=10, n_jobs=-1)
Wall time: 25.1 s





array([0.84331927, 0.84472574, 0.84739803, 0.84704641, 0.84535865,
       0.84936709, 0.84893452, 0.8432269 , 0.84589956, 0.84414123])
LR = LogisticRegression()
LR.fit(X_train_temp, y_train)
LR.score(X_train_temp, y_train)
0.8483218172547778
LR.score(X_test_temp, y_test)
0.8482365765322268

The accuracy on the training set and on the test set are almost the same.

Model evaluation
from sklearn.metrics import confusion_matrix

y_pre_test = LR.predict(X_test_temp)

cm = confusion_matrix(y_test, y_pre_test)
cm
array([[20895,  1217],
       [ 3099,  3228]], dtype=int64)
# confusion_matrix rows are the actual classes ('No', 'Yes') and columns the predicted classes
cm_matrix = pd.DataFrame(cm, columns=['Predicted No', 'Predicted Yes'], index=['Actual No', 'Actual Yes'])
plt.figure(figsize=(7,7))
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
<matplotlib.axes._subplots.AxesSubplot at 0x24e0a40f240>

(figure: heatmap of the confusion matrix)

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pre_test))
              precision    recall  f1-score   support

          No       0.87      0.94      0.91     22112
         Yes       0.73      0.51      0.60      6327

    accuracy                           0.85     28439
   macro avg       0.80      0.73      0.75     28439
weighted avg       0.84      0.85      0.84     28439

from sklearn.metrics import recall_score, precision_score

y_test = np.where(y_test=='No', 0, 1)
y_pre_test = np.where(y_pre_test=='No', 0, 1)

print(precision_score(y_test, y_pre_test))  # precision_score also accepts pos_label, so the string labels 'No'/'Yes' could have been scored directly
print(recall_score(y_test, y_pre_test))
0.7262092238470191
0.510194404931247
from sklearn.metrics import f1_score

f1_score(y_test, y_pre_test)
0.5993316004455997
0.8*0.8*2/1.6  # sanity check of the F1 formula: F1 = 2*P*R / (P + R); with P = R = 0.8 this gives 0.8
0.8000000000000002
from sklearn.metrics import roc_curve

fpr, tpr, threshold = roc_curve(y_test, y_pre_test)  # note: hard 0/1 predictions yield only a single operating point on the curve
plt.plot(fpr, tpr, c='b')
plt.plot([0,1], [0,1], 'k--')
[<matplotlib.lines.Line2D at 0x24e07f3b1d0>]

(figure: ROC curve)

from sklearn.metrics import roc_auc_score

ROC_AUC = roc_auc_score(y_test, y_pre_test)
ROC_AUC
0.7275782082543355
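The `roc_curve`/`roc_auc_score` calls above were given hard 0/1 predictions, which collapse the ROC curve to a single point; ranking metrics want continuous scores. A self-contained sketch on synthetic data showing the difference when the positive-class probability is used instead:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)

clf = LogisticRegression(solver='lbfgs').fit(X, y)

# AUC from hard labels vs. AUC from the positive-class probability
auc_hard = roc_auc_score(y, clf.predict(X))
auc_soft = roc_auc_score(y, clf.predict_proba(X)[:, 1])
print(auc_hard, auc_soft)
```

In this notebook, `LR.predict_proba(X_test_temp)[:, 1]` (computed a few cells below) would be the appropriate input for the ROC curve.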
%%time
from sklearn.feature_selection import RFECV


rfecv = RFECV(estimator=LR, step=1, cv=5, scoring='accuracy')

rfecv = rfecv.fit(X_train_temp, y_train)
Wall time: 17min 18s
X_train_rfecv = rfecv.transform(X_train_temp)
print(X_train_rfecv.shape)
LR.fit(X_train_rfecv, y_train)
X_test_rfecv = rfecv.transform(X_test_temp)
y_pred_rfecv = LR.predict(X_test_rfecv)
(113754, 100)
from sklearn.metrics import accuracy_score

y_test = np.where(y_test==0, 'No', 'Yes')
LR.score(X_test_rfecv, y_test)
0.8479552726889131
%%time
from sklearn.model_selection import GridSearchCV


parameters = [
              {'C':list(range(50, 500, 25))}
               ]
LR = LogisticRegression(penalty='l1', n_jobs=-1)


grid_search = GridSearchCV(estimator = LR,  
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 5,
                           verbose=0)


grid_search.fit(X_train_temp, y_train)
Wall time: 9min 8s





GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=-1, penalty='l1',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid=[{'C': [50, 75, 100, 125, 150, 175, 200, 225, 250, 275,
                                300, 325, 350, 375, 400, 425, 450, 475]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
grid_search.best_score_, grid_search.best_params_
(0.8479174358703869, {'C': 125})
%%time
# embedded feature selection (SelectFromModel)
from sklearn.feature_selection import SelectFromModel

X_embedded = SelectFromModel(grid_search.best_estimator_, norm_order=1).fit_transform(X_train_temp, y_train)
Wall time: 3.6 s

Inspecting the model via the distribution of its predicted probabilities

# LR.predict_proba(X_test) returns, for each sample, the predicted probability of each class

LR = LogisticRegression(C=125)
LR.fit(X_train_temp, y_train)
y_predict_proba = LR.predict_proba(X_test_temp)
y_predict_proba  # column 0 and column 1 hold the probabilities of labels 0 ('No') and 1 ('Yes') respectively
array([[0.79898701, 0.20101299],
       [0.86553243, 0.13446757],
       [0.77149679, 0.22850321],
       ...,
       [0.98964495, 0.01035505],
       [0.83904762, 0.16095238],
       [0.17714232, 0.82285768]])
sns.set(style="whitegrid")
sns.distplot(a=y_predict_proba[:, 1], bins=10)
<matplotlib.axes._subplots.AxesSubplot at 0x24e42f53cc0>

(figure: distribution plot of predicted probabilities for class 1)

plt.hist(y_predict_proba[:, 1], bins=10)
(array([13451.,  4906.,  2661.,  1638.,  1341.,  1073.,   908.,   809.,
          838.,   814.]),
 array([8.40201955e-04, 1.00724269e-01, 2.00608337e-01, 3.00492404e-01,
        4.00376471e-01, 5.00260538e-01, 6.00144606e-01, 7.00028673e-01,
        7.99912740e-01, 8.99796807e-01, 9.99680875e-01]),
 <a list of 10 Patch objects>)

(figure: histogram of predicted probabilities for class 1)

sns.distplot(a=y_predict_proba[:, 0], bins=10)
<matplotlib.axes._subplots.AxesSubplot at 0x24e67ab34e0>

(figure: distribution plot of predicted probabilities for class 0)

1. The predicted-probability distribution is heavily skewed.
2. For the samples predicted as class 1, most predicted probabilities are below 0.5, so the model's confidence in the positive class looks low.

# TODO: the classes are heavily imbalanced; try upsampling
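The TODO above suggests upsampling; a lighter alternative sketched here is `class_weight='balanced'`, which reweights the loss instead of resampling. On synthetic imbalanced data (roughly 20% positives, like RainTomorrow), reweighting trades some precision for minority-class recall:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)
n = 1000
y = (rng.rand(n) < 0.2).astype(int)             # ~20% positives
X = rng.normal(size=(n, 3)) + y[:, None] * 1.0  # positives shifted by +1

plain = LogisticRegression(solver='lbfgs').fit(X, y)
balanced = LogisticRegression(solver='lbfgs', class_weight='balanced').fit(X, y)

# the balanced model effectively lowers the decision threshold for the minority class
r_plain = recall_score(y, plain.predict(X))
r_balanced = recall_score(y, balanced.predict(X))
print(r_plain, r_balanced)
```

Random upsampling of the minority class (e.g. `sklearn.utils.resample` on the training split only) is the resampling counterpart of the same idea.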

Random forest
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train_temp, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
rfc.score(X_test_temp, y_test)  # lesson learned: instead of tuning LR for so long, it would have been better to run every model once first
0.856886669714125
Random forest tuning
%%time

m_list = []

for depth in range(5, 100, 4):
    rfc = RandomForestClassifier(n_estimators=150, max_depth=depth)
    rfc.fit(X_train_temp, y_train)
    score = rfc.score(X_test_temp, y_test)
    m_list.append(score)
    
plt.plot(m_list)
plt.show()

(figure: test accuracy vs. max_depth)

Wall time: 17min 9s
m_list.index(max(m_list))  # note: index 14 corresponds to max_depth = 5 + 4*14 = 61 in the scan; the cells below pass 14/19, mixing up the index and the depth
14
m_list[14]
0.858328351911108
score_list = list()

for num in range(500, 1500, 100):
    rfc = RandomForestClassifier(n_estimators=200, max_depth=14, max_leaf_nodes=num, n_jobs=-1)
    rfc.fit(X_train_temp, y_train)
    score = rfc.score(X_test_temp, y_test)
    score_list.append(score)
    
plt.plot(list(range(500, 1500, 100)),score_list)
plt.show()

(figure: test accuracy vs. max_leaf_nodes, 500–1400)

%matplotlib inline
# the score is clearly still being capped; max_leaf_nodes needs to go higher
del score_list
score_list = []

for k in range(1500, 2500, 100):
    rfc = RandomForestClassifier(n_estimators=200, max_depth=14, max_leaf_nodes=k, n_jobs=-1)
    rfc.fit(X_train_temp, y_train)
    score = rfc.score(X_test_temp, y_test)
    score_list.append(score)
    
    
plt.plot(range(1500, 2500, 100), score_list)
plt.show()

(figure: test accuracy vs. max_leaf_nodes, 1500–2400)

%%time
rfc = RandomForestClassifier(n_estimators=200, max_depth=19, max_features=17, max_leaf_nodes=1100, n_jobs=-1)
rfc.fit(X_train_temp, y_train)
rfc.score(X_test_temp, y_test)
Wall time: 24.9 s





0.8518583635148915
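The loops above tune one hyperparameter at a time and score directly on the test set, which both misses parameter interactions and overfits the test split. A hedged sketch of the standard alternative, a joint grid search with cross-validation on the training data (shown on small synthetic data so it runs quickly; the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# joint grid over depth and forest size, scored by cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'max_depth': [5, 10], 'n_estimators': [50, 100]},
    cv=3, n_jobs=-1,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)
```

The test set is then touched only once, to score `grid.best_estimator_`.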
Naive Bayes
# Gaussian naive Bayes (MultinomialNB / ComplementNB are omitted: they require
# non-negative features, and the standardized data contains negative values)
from sklearn.naive_bayes import GaussianNB

gaussian_ = GaussianNB()
gaussian_.fit(X_train_temp, y_train)
gaussian_.score(X_test_temp, y_test)
0.6483701958578009
Artificial neural network
%%time

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(50, 50, 50), random_state=1)  # three hidden layers, matching the fitted model printed below

clf.fit(X_train_temp, y_train)
Wall time: 2min 44s





MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(50, 50, 50), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)
clf.score(X_test_temp, y_test)
0.8431731073525792