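1. Problem Description
For a given region, the known environmental features include elevation, aspect (the compass direction the slope faces, with 0° being north), slope, the horizontal and vertical distances to the nearest water source, the horizontal distances to the nearest road and to the nearest fire point, hillshade, soil type, and so on. The task is to train a classifier on the training set and use it to predict the vegetation cover type of unseen areas.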
2. Solution Approach
Neither the training set nor the test set of this competition contains NaN values, so no missing-data handling is needed. If data were missing: (1) a feature with many missing values can simply be dropped; (2) if only a small share of values is missing, they can be filled with the mean, or a regressor/classifier trained on the known values can be used to predict them.
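The two missing-data options above can be sketched roughly as follows. This is a minimal illustration, not code from the original notebook: the `handle_missing` helper, its `drop_threshold` cut-off, and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def handle_missing(df, drop_threshold=0.5):
    """Drop mostly-missing columns, then mean-fill the remaining numeric gaps.

    drop_threshold is a hypothetical cut-off chosen for illustration.
    """
    missing_ratio = df.isna().mean()  # fraction of NaN values per column
    keep = missing_ratio[missing_ratio <= drop_threshold].index
    out = df[keep].copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # Fill each numeric column with its own mean.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    return out


# Toy rows for illustration only; not taken from the competition data.
toy = pd.DataFrame({
    "Elevation": [2500.0, np.nan, 2700.0, 2600.0],  # 25% missing: mean-filled
    "Slope": [np.nan, np.nan, np.nan, 12.0],        # 75% missing: dropped
})
cleaned = handle_missing(toy)
print(cleaned)
```

Mean filling is the simplest choice; training a model on the known values, as mentioned above, would replace the `fillna` step.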
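Training a classifier directly on the raw training set may not give good results. Examining what each feature means and how the features relate to one another, then combining them into new features, can noticeably improve the classifier's performance.
Among the features above, the horizontal and vertical distances to water can be combined into a new feature, the Euclidean distance; aspect, hillshade, and slope may be related in some way; and the distances to water, fire points, and roads (a proxy for human activity) may influence the cover type. In addition, a heatmap can be used to inspect how strongly the attributes correlate with one another.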
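As a rough sketch of the heatmap idea, the correlation matrix can be computed with pandas and passed to `seaborn.heatmap`. The helper names `correlation_matrix` and `plot_heatmap` and the toy rows are assumptions for illustration; the column names follow the competition data used later in the article.

```python
import pandas as pd


def correlation_matrix(df, columns):
    """Pearson correlation between the selected numeric columns."""
    return df[columns].corr()


def plot_heatmap(corr):
    # Imported lazily so the correlation step works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.tight_layout()
    return ax


# Toy rows for illustration only; not taken from the competition data.
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [0.0, 30.0, 60.0, 90.0],
    "Vertical_Distance_To_Hydrology": [0.0, 10.0, 20.0, 30.0],
    "Slope": [20.0, 5.0, 15.0, 10.0],
})
corr = correlation_matrix(toy, list(toy.columns))
print(corr.round(2))
# plot_heatmap(corr)  # uncomment in a notebook to draw the heatmap
```

Strongly correlated pairs in the matrix are natural candidates for combined features such as the Euclidean distance described above.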
3. Implementation
The code was written in a Kaggle kernel; below is the exported ipynb code together with its output.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Any results you write to the current directory are saved as output.
/kaggle/input/learn-together/train.csv
/kaggle/input/learn-together/sample_submission.csv
/kaggle/input/learn-together/test.csv
train=pd.read_csv('/kaggle/input/learn-together/train.csv')
test=pd.read_csv('/kaggle/input/learn-together/test.csv')
train_target=train['Cover_Type']
train.drop(['Id','Cover_Type'], axis=1, inplace=True)
test_ids=test['Id'].values
test.drop(['Id'], axis=1, inplace=True)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Distance analysis
Try different combinations of the distance features and plot their distributions.
def eculid_dis_hydrology(df):
    df['Eculid_Dis_Hydrology']=(df['Horizontal_Distance_To_Hydrology']**2+df['Vertical_Distance_To_Hydrology']**2)**0.5
    return df
train=eculid_dis_hydrology(train)
test=eculid_dis_hydrology(test)
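As a quick sanity check of this feature, the same formula can be compared against `np.hypot` on a few toy rows. The values below are made up for illustration and are not from the competition data. The example includes a negative vertical distance to show that the sign is lost after squaring, so the new feature measures only how far the water is, not whether it lies above or below.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the values are not from the competition data.
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3.0, 6.0, 0.0],
    "Vertical_Distance_To_Hydrology": [4.0, -8.0, 5.0],
})

# Same formula as eculid_dis_hydrology above.
manual = (toy["Horizontal_Distance_To_Hydrology"] ** 2
          + toy["Vertical_Distance_To_Hydrology"] ** 2) ** 0.5

# np.hypot computes sqrt(x**2 + y**2) element-wise.
with_hypot = np.hypot(toy["Horizontal_Distance_To_Hydrology"],
                      toy["Vertical_Distance_To_Hydrology"])

print(manual.tolist())
print(np.allclose(manual, with_hypot))
```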
# plot cover type against the Euclidean distance to hydrology
# (x and y are passed by keyword; newer seaborn versions reject positional x/y)
sns.scatterplot(x=train['Eculid_Dis_Hydrology'], y=train_target.values)
<matplotlib.axes._subplots.AxesSubplot at 0x7fd50413d080>
Try different combinations of the horizontal distance features.
def horizontal_fire_hydrology(df):
    df['Horizontal_Fire_Hydrology_Gap']=np.absolute(df['Horizontal_Distance_To_Hydrology']-df['Horizontal_Distance_To_Fire_Points'])
    df['Horizontal_Fire_Hydrology_Sum']=df['Horizontal_Distance_To_Hydrology']+