Kaggle for Beginners: Titanic (SVM)

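This article walks through the basic framework of machine learning: defining the problem, getting the data, preparing the data, exploratory analysis, choosing a model, validating the model, and optimizing parameters. It focuses on data preprocessing in Python, including filling missing values, data conversion, and feature engineering, as well as feature encoding and splitting the dataset.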

Parts of this article are drawn from an excellent guide on the official website.

The machine learning framework:

1、 Define the problem
2、 Get the data
3、 Prepare the data
4、 Perform exploratory analysis
5、 Choose an appropriate model
6、 Validate the model
7、 Optimize the parameters

Data preparation

The data can be processed with either MATLAB or Python.
Here I use Python.

Tips for data analysis:

1、 Correcting. Some columns can hold values that cannot be right, such as an age of 180, so check the data for them.
2、 Completing. Some algorithms cannot handle NULL values, so check for missing values as well.
3、 Creating. Create new columns or parameters that are more meaningful.
4、 Converting.

Format conversion is the most important step at this stage.
After pandas reads the data, the result is a DataFrame.
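
The first two tips can be tried out on a small made-up table before touching the real data. The following is only a sketch: the column values and the plausibility bounds (age between 0 and 120, non-negative fare) are assumptions chosen for illustration, not taken from the Titanic files.

```python
import numpy as np
import pandas as pd

# Toy passenger table with made-up values (not the Titanic data).
toy = pd.DataFrame({
    "Age": [22.0, 180.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 8.05, -1.0],
    "Embarked": ["S", "C", None, "S"],
})

# Correcting: flag rows whose values fall outside an assumed plausible range.
bad_age = toy[(toy["Age"] < 0) | (toy["Age"] > 120)]
bad_fare = toy[toy["Fare"] < 0]

# Completing: count the missing values in each column.
missing = toy.isnull().sum()

print(type(toy))                  # read_csv would return the same DataFrame type
print(bad_age.index.tolist())     # rows with an implausible age
print(bad_fare.index.tolist())    # rows with a negative fare
print(missing.to_dict())          # missing values per column
```

Rows flagged this way still need a decision (fix, drop, or keep), which depends on the dataset.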

Data processing

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
path_train = 'D:\\AboutWork\\机器学习\\Example1_titanic\\train.csv'
path_test = 'D:\\AboutWork\\机器学习\\Example1_titanic\\test.csv'
raw_data = pd.read_csv(path_train)
test_data = pd.read_csv(path_test)
data_cleaner = [raw_data, test_data]
print(raw_data.sample(10))
print(raw_data.isnull().sum())
     PassengerId  Survived  Pclass  \
817          818         0       2   
54            55         0       1   
566          567         0       3   
769          770         0       3   
186          187         1       3   
208          209         1       3   
559          560         1       3   
340          341         1       2   
358          359         1       3   
483          484         1       3   

                                                Name     Sex   Age  SibSp  \
817                               Mallet, Mr. Albert    male  31.0      1   
54                    Ostby, Mr. Engelhart Cornelius    male  65.0      0   
566                             Stoytcheff, Mr. Ilia    male  19.0      0   
769                 Gronnestad, Mr. Daniel Danielsen    male  32.0      0   
186  O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey)  female   NaN      1   
208                        Carr, Miss. Helen "Ellen"  female  16.0      0   
559     de Messemaeker, Mrs. Guillaume Joseph (Emma)  female  36.0      1   
340                   Navratil, Master. Edmond Roger    male   2.0      1   
358                             McGovern, Miss. Mary  female   NaN      0   
483                           Turkula, Mrs. (Hedwig)  female  63.0      0   

     Parch           Ticket     Fare Cabin Embarked  
817      1  S.C./PARIS 2079  37.0042   NaN        C  
54       1           113509  61.9792   B30        C  
566      0           349205   7.8958   NaN        S  
769      0             8471   8.3625   NaN        S  
186      0           370365  15.5000   NaN        Q  
208      0           367231   7.7500   NaN        Q  
559      0           345572  17.4000   NaN        S  
340      1           230080  26.0000    F2        S  
358      0           330931   7.8792   NaN        Q  
483      0             4134   9.5875   NaN        S  
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
raw_data = pd.read_csv(path_train)
test_data = pd.read_csv(path_test)
data_cleaner = [raw_data, test_data]
for data in data_cleaner:
    data['Age'] = data['Age'].fillna(data['Age'].median())  # fill missing ages with the median
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])  # fill with the most frequent port
    data.drop(['Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True)  # drop columns
print(raw_data.sample(10))
print(raw_data.isnull().sum())
     Survived  Pclass                                              Name  \
620         0       3                               Yasbeck, Mr. Antoni   
176         0       3                     Lefebre, Master. Henry Forbes   
568         0       3                               Doharr, Mr. Tannous   
863         0       3                 Sage, Miss. Dorothy Edith "Dolly"   
742         1       1             Ryerson, Miss. Susan Parker "Suzette"   
748         0       1                         Marvin, Mr. Daniel Warner   
279         1       3                  Abbott, Mrs. Stanton (Rosa Hunt)   
206         0       3                        Backstrom, Mr. Karl Alfred   
871         1       1  Beckwith, Mrs. Richard Leonard (Sallie Monypeny)   
317         0       2                              Moraweck, Dr. Ernest   

        Sex   Age  SibSp  Parch      Fare Embarked  
620    male  27.0      1      0   14.4542        C  
176    male  28.0      3      1   25.4667        S  
568    male  28.0      0      0    7.2292        C  
863  female  28.0      8      2   69.5500        S  
742  female  21.0      2      2  262.3750        C  
748    male  19.0      1      0   53.1000        S  
279  female  35.0      1      1   20.2500        S  
206    male  32.0      1      0   15.8500        S  
871  female  47.0      1      1   52.5542        S  
317    male  54.0      0      0   14.0000        S  
Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
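
A minimal sketch of the same median and mode imputation on a small made-up table, where the result is easy to check by hand. It assigns the filled column back to the frame, which behaves the same across pandas versions. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value in each column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: fill with the median of the known values (22, 30, 40).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the most frequent value; mode() returns a
# Series because there can be ties, so take its first entry.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
print(toy.isnull().sum())
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it for a skewed column.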
for data in data_cleaner:
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
    data['IsAlone'] = 1
    data.loc[data['FamilySize'] > 1, 'IsAlone'] = 0  # assign through DataFrame.loc to avoid chained assignment
    data['FareBin'] = pd.qcut(data['Fare'], 4)  # qcut: 4 quantile-based bins holding roughly equal numbers of rows
    data['AgeBin'] = pd.cut(data['Age'].astype(float), 5)  # cut: 5 equal-width bins, each spanning (max - min) / n
print(raw_data.info())
print(raw_data.isnull().sum())

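pandas.qcut() splits the data into intervals at quantile boundaries, so the edges are computed from the data and each interval holds roughly the same number of samples.
pandas.cut() also splits the data into intervals, but of equal width: the interval length is (max - min) / n.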

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Survived    891 non-null    int64   
 1   Pclass      891 non-null    int64   
 2   Name        891 non-null    object  
 3   Sex         891 non-null    object  
 4   Age         891 non-null    float64 
 5   SibSp       891 non-null    int64   
 6   Parch       891 non-null    int64   
 7   Fare        891 non-null    float64 
 8   Embarked    891 non-null    object  
 9   FamilySize  891 non-null    int64   
 10  IsAlone     891 non-null    int64   
 11  FareBin     891 non-null    category
 12  AgeBin      891 non-null    category
dtypes: category(2), float64(2), int64(6), object(3)
memory usage: 78.9+ KB
None
Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Fare          0
Embarked      0
FamilySize    0
IsAlone       0
FareBin       0
AgeBin        0
dtype: int64
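
To make the difference between pandas.qcut() and pandas.cut() concrete, here is a small sketch on made-up numbers that include one outlier. The series is an assumption for illustration only.

```python
import pandas as pd

# Seven small values and one outlier (made-up data).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# qcut: quantile-based edges, so each of the 4 bins gets the same share of values.
by_quantile = pd.qcut(values, 4)

# cut: 4 bins of equal width (100 - 1) / 4, so the outlier stretches the range
# and almost everything lands in the first bin.
by_width = pd.cut(values, 4)

print(by_quantile.value_counts(sort=False))
print(by_width.value_counts(sort=False))
```

This is why a skewed column such as a fare tends to suit quantile bins, while equal-width bins stay readable for a column with a bounded range such as age.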
label = LabelEncoder()
for data in data_cleaner:  # encode categorical columns as integer codes
    data['Sex_Code'] = label.fit_transform(data['Sex'])
    data['Embarked_Code'] = label.fit_transform(data['Embarked'])
    data['AgeBin_Code'] = label.fit_transform(data['AgeBin'])
    # data['FareBin_Code'] = label.fit_transform(data['FareBin'])
raw_data_x = ['Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
raw_data_calc = ['Sex_Code', 'Pclass', 'Embarked_Code', 'SibSp', 'Parch', 'Age', 'Fare']
raw_data_xy = ['Survived'] + raw_data_x
print(raw_data_xy)
['Survived', 'Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
raw_data['FareBin_Code'] = label.fit_transform(raw_data['FareBin'])  # note: the test data is not transformed here
raw_data_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
raw_data_bin = ['Survived'] + raw_data_x_bin
raw_data_dummy = pd.get_dummies(raw_data[raw_data_x])
raw_data_x_dummy = raw_data_dummy.columns.tolist()
raw_data_xy_dummy = ['Survived'] + raw_data_x_dummy
print('Dummy X Y: ', raw_data_xy_dummy, '\n')
Dummy X Y:  ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S'] 
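
The two encodings used above behave differently, and label encoding has one pitfall worth seeing on a toy example: calling fit_transform separately on each frame numbers the categories independently, so the codes only line up if both frames contain the same set of values. The sketch below uses made-up frames; fitting once and reusing the encoder is shown as one possible way to keep the codes consistent, not as the article's method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "S", "Q"]})  # made-up test set without "C"

# Label encoding: classes are sorted, then numbered 0..n-1 (C=0, Q=1, S=2).
encoder = LabelEncoder()
train["Embarked_Code"] = encoder.fit_transform(train["Embarked"])

# Reusing the fitted encoder keeps the same code for the same category.
test["Embarked_Code"] = encoder.transform(test["Embarked"])

# Refitting on the test set alone renumbers the categories (Q=0, S=1).
refit_codes = LabelEncoder().fit_transform(test["Embarked"])

# One-hot (dummy) encoding: one indicator column per category.
dummies = pd.get_dummies(train[["Embarked"]])

print(train["Embarked_Code"].tolist())
print(test["Embarked_Code"].tolist())
print(refit_codes.tolist())
print(dummies.columns.tolist())
```

Note that transform raises an error for a category the encoder has not seen, so this approach assumes the training data covers every value.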
raw_data.describe(include = 'all')
          Survived      Pclass                         Name   Sex         Age       SibSp  \
count   891.000000  891.000000                          891   891  891.000000  891.000000
unique         NaN         NaN                          891     2         NaN         NaN
top            NaN         NaN  Lundahl, Mr. Johan Svensson  male         NaN         NaN
freq           NaN         NaN                            1   577         NaN         NaN
mean      0.383838    2.308642                          NaN   NaN   29.361582    0.523008
std       0.486592    0.836071                          NaN   NaN   13.019697    1.102743
min       0.000000    1.000000                          NaN   NaN    0.420000    0.000000
25%       0.000000    2.000000                          NaN   NaN   22.000000    0.000000
50%       0.000000    3.000000                          NaN   NaN   28.000000    0.000000
75%       1.000000    3.000000                          NaN   NaN   35.000000    1.000000
max       1.000000    3.000000                          NaN   NaN   80.000000    8.000000

             Parch        Fare  Embarked  FamilySize     IsAlone         FareBin  \
count   891.000000  891.000000       891  891.000000  891.000000             891
unique         NaN         NaN         3         NaN         NaN               4
top            NaN         NaN         S         NaN         NaN  (7.91, 14.454]
freq           NaN         NaN       646         NaN         NaN             224
mean      0.381594   32.204208       NaN    1.904602    0.602694             NaN
std       0.806057   49.693429       NaN    1.613459    0.489615             NaN
min       0.000000    0.000000       NaN    1.000000    0.000000             NaN
25%       0.000000    7.910400       NaN    1.000000    0.000000             NaN
50%       0.000000   14.454200       NaN    1.000000    1.000000             NaN
75%       0.000000   31.000000       NaN    2.000000    1.000000             NaN
max       6.000000  512.329200       NaN   11.000000    1.000000             NaN

                  AgeBin    Sex_Code  Embarked_Code  AgeBin_Code  FareBin_Code
count                891  891.000000     891.000000   891.000000    891.000000
unique                 5         NaN            NaN          NaN           NaN
top     (16.336, 32.252]         NaN            NaN          NaN           NaN
freq                 523         NaN            NaN          NaN           NaN
mean                 NaN    0.647587       1.536476     1.290685      1.497194
std                  NaN    0.477990       0.791503     0.812620      1.118156
min                  NaN    0.000000       0.000000     0.000000      0.000000
25%                  NaN    0.000000       1.000000     1.000000      0.500000
50%                  NaN    1.000000       2.000000     1.000000      1.000000
75%                  NaN    1.000000       2.000000     2.000000      2.000000
max                  NaN    1.000000       2.000000     4.000000      3.000000
train_x, test_x, train_y, test_y = model_selection.train_test_split(raw_data[raw_data_calc], raw_data['Survived'], random_state=0)
train_x_bin, test_x_bin, train_y_bin, test_y_bin = model_selection.train_test_split(raw_data[raw_data_x_bin], raw_data['Survived'], random_state=0)
train_x_dummy, test_x_dummy, train_y_dummy, test_y_dummy = model_selection.train_test_split(raw_data_dummy[raw_data_x_dummy], raw_data['Survived'], random_state = 0)
train_x_bin.head()
     Sex_Code  Pclass  Embarked_Code  FamilySize  AgeBin_Code  FareBin_Code
105         1       3              2           1            1             0
68          0       3              2           7            1             1
253         1       3              2           2            1             2
320         1       3              2           1            1             0
706         0       2              2           1            2             1
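
Two properties of train_test_split matter for the three splits above: with no test_size given it holds out 25% of the rows, and with the same random_state and the same number of rows it selects the same rows each time, so the three feature sets stay aligned. A small sketch on a made-up frame (the column names a, b and y are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: two alternative feature sets for the same 8 rows.
toy = pd.DataFrame({
    "a": range(8),
    "b": range(10, 18),
    "y": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Default split: 75% of the rows for training, 25% held out.
xa_train, xa_test, y_train, y_test = train_test_split(toy[["a"]], toy["y"], random_state=0)
xb_train, xb_test, _, _ = train_test_split(toy[["b"]], toy["y"], random_state=0)

print(len(xa_train), len(xa_test))

# Same random_state and same row count: both calls pick the same rows.
print(xa_train.index.equals(xb_train.index))
```

Because the row selection matches, scores from models trained on the different feature sets can be compared on the same held-out passengers.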

Data visualization
