Getting Started with Kaggle

weixin_30685047

Original article: http://www.cnblogs.com/leizongfei/p/7507055.html

I. An introductory course on Dataquest

This short course uses the Titanic data set as its running example.

0. Import the commonly used libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

1. Read the CSV data

Use Pandas' read_csv() function to read data in CSV format:

titanic = pd.read_csv("titanic_train.csv")

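The head() method shows the first 5 rows of the data set, which gives a first impression of which features the data has and what we need to predict.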

titanic.head()

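The result is a table of the first five rows.

2. Understand the data with `describe()`

Pandas' describe() method shows summary statistics for each numeric column: count, mean, standard deviation, minimum, quartiles, and maximum.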
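As a small illustration of what describe() reports, the sketch below builds a hypothetical four-row frame (toy values, not the Titanic file) with one missing Age. Only the numeric columns should appear in the summary, and the count row leaves out missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for titanic_train.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": pd.Series(["male", "female", "female", "female"], dtype=object),
})

# describe() summarizes numeric columns only; "Sex" is skipped
summary = toy.describe()
print(summary)

# The count row ignores NaN, so Age has one fewer entry than Fare
print(summary.loc["count"])
```

Comparing the count row across columns is a quick way to spot which columns have missing values.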

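3. Clean up missing data

Look closely at the describe() output: the Age column has a count of 714, while the other columns have 891. This means some Age values are missing (null, NA, or not a number).
Deleting every row that has a missing value would be a waste, because more data helps train a better model.

The simplest fix is to use .fillna() to replace the missing Age values with the column median.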

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
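To make the median fill concrete, here is a minimal sketch on a hypothetical Series of ages (the values are made up). median() skips NaN entries, so the fill value comes only from the known ages.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])

# median() ignores NaN: the median of 22, 26, 35 is 26
median_age = ages.median()

# Replace every NaN with that median
filled = ages.fillna(median_age)
print(filled.tolist())
```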

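4. Handle non-numeric features

When we used describe() earlier, not every column was shown: only the columns with numeric content appeared. The non-numeric columns cannot be fed to the model as they are, and for this simple model we do not expect most of them to contribute much. There are two common ways to handle them:

- Drop the non-numeric features we consider uninformative, such as Ticket, Cabin, and Name. This course treats them as carrying little useful information.
- Convert the useful non-numeric features into numeric ones.

4.1 Convert the Sex column to numbers

First check which values the Sex column contains: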

print(titanic.Sex.value_counts())

The data looks clean: no stray values such as man or woman are mixed in.
We encode male as 0 and female as 1.

titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

Note: this uses Pandas label-based indexing (.loc[]) together with a boolean mask.
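The mask-plus-.loc[] pattern can be seen in isolation on a hypothetical three-row frame. The comparison produces a boolean Series, and .loc[] assigns only to the rows where it is True. The final astype(int) is an extra step not in the original code; it is added here on the assumption that an integer column is more convenient downstream.

```python
import pandas as pd

# Hypothetical toy column; dtype=object so it can hold strings and ints
toy = pd.DataFrame({"Sex": pd.Series(["male", "female", "female"], dtype=object)})

# The comparison builds a boolean mask, one entry per row
mask = toy["Sex"] == "male"
print(mask.tolist())

# .loc[row_mask, column] assigns only where the mask is True
toy.loc[mask, "Sex"] = 0
toy.loc[toy["Sex"] == "female", "Sex"] = 1

# Extra step (assumption): convert the object column to a real integer dtype
toy["Sex"] = toy["Sex"].astype(int)
print(toy["Sex"].tolist())
```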

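4.2 Convert the Embarked column to numbers

The Embarked column contains S, C, Q, and missing values (NaN).
Because S occurs most often, we fill the missing values with S, then encode S, C, and Q as 0, 1, and 2.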

titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2
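The choice of S as the fill value can be checked rather than assumed. The sketch below uses a hypothetical Series (made-up values) and value_counts() to find the most frequent category; Series.map is used for the encoding as a compact alternative to the three .loc[] assignments, not as the original author's method.

```python
import numpy as np
import pandas as pd

# Hypothetical embarkation codes with one missing entry
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], dtype=object)

# value_counts() excludes NaN by default; idxmax() returns the top label
counts = embarked.value_counts()
most_common = counts.idxmax()

# Fill the gaps with the most frequent value, then encode with a mapping
filled = embarked.fillna(most_common)
codes = filled.map({"S": 0, "C": 1, "Q": 2})
print(most_common, codes.tolist())
```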

5. Make predictions with a machine learning algorithm

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross-validation folds for the titanic data set
# split() returns the row indices corresponding to train and test
# Without shuffling, KFold yields the same contiguous splits on every run;
# random_state is only valid together with shuffle=True
kf = KFold(n_splits=3)

predictions = []
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm
    # Note how we only take the rows in the train folds
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
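To see what the fold loop iterates over, here is a minimal sketch of KFold on a hypothetical six-row array. Without shuffling, each test fold is a contiguous block of row indices, and the test folds taken in order cover every row exactly once. That ordering is what lets the per-fold predictions be joined back in the original row order.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: six rows, one feature
X = np.arange(6).reshape(6, 1)

kf = KFold(n_splits=3)

# split() yields (train_indices, test_indices) pairs of integer positions
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]
for train, test in folds:
    print("train:", train, "test:", test)
```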

6. Evaluate the prediction error

The predictions are now spread across three arrays, one per fold. NumPy's concatenate() function joins several arrays into one.

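predictions = np.concatenate(predictions, axis=0)
# Map the regression outputs to class labels: above 0.5 means survived (1)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Accuracy is the fraction of predictions that match the true labels
accuracy = (predictions == titanic["Survived"].values).mean()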
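The thresholding and accuracy steps can be worked through by hand on a few made-up numbers. In this sketch the raw outputs and true labels are hypothetical; note that a value of exactly 0.5 falls into class 0, matching the rule used above.

```python
import numpy as np

# Hypothetical regression outputs and true labels
raw = np.array([0.9, 0.2, 0.5, 0.7])
truth = np.array([1, 0, 1, 1])

# Values above 0.5 become 1, everything else (including 0.5) becomes 0
labels = (raw > 0.5).astype(int)

# Three of the four labels match the truth
accuracy = (labels == truth).mean()
print(labels.tolist(), accuracy)
```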

7. Switch to Logistic Regression

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

alg = LogisticRegression(random_state=1)

scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

print(scores.mean())
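cross_val_score replaces the manual fold loop: it fits the model once per fold and returns one score per fold. The sketch below runs it on synthetic data from make_classification, which is an assumption made only so the example is self-contained; the numbers it prints say nothing about the Titanic data. With an integer cv and a classifier, scikit-learn uses stratified folds, and the default score for a classifier is accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 90 samples, 4 features, binary target
X, y = make_classification(n_samples=90, n_features=4, random_state=1)

alg = LogisticRegression(random_state=1)

# One accuracy score per fold
scores = cross_val_score(alg, X, y, cv=3)
print(scores, scores.mean())
```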

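8. Preprocess the test data set

The test set is processed the same way as the training set. Missing Age values are filled with the training-set median, and Fare is filled as well, using the median Fare of the test set.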

titanic_test = pd.read_csv("test.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0 
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2
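Since the same steps are applied to both data sets, one option is to collect them in a function. The helper below, preprocess, is hypothetical and not part of the original course; it takes the fill values as parameters so the caller decides which medians to use. The toy frame and the medians 28.0 and 14.0 are made up for illustration.

```python
import numpy as np
import pandas as pd

def preprocess(df, age_median, fare_median):
    """Hypothetical helper: fill gaps and encode Sex and Embarked as integers."""
    out = df.copy()  # leave the caller's frame untouched
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

# Made-up two-row test frame with one gap in each of three columns
toy_test = pd.DataFrame({
    "Age": [np.nan, 40.0],
    "Fare": [8.0, np.nan],
    "Sex": ["male", "female"],
    "Embarked": [np.nan, "Q"],
})

clean = preprocess(toy_test, age_median=28.0, fare_median=14.0)
print(clean)
```

Keeping the steps in one place reduces the chance that the training and test sets end up encoded differently.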

9. Predict and generate the submission file

alg = LogisticRegression(random_state=1)
alg.fit(titanic[predictors], titanic["Survived"])
predictions = alg.predict(titanic_test[predictors])

# Create a new DataFrame with only the columns Kaggle wants from the data set
submission = pd.DataFrame({
    "PassengerId": titanic_test["PassengerId"],
    "Survived": predictions
})
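The code above builds the submission DataFrame but does not write it to disk. A minimal sketch of that last step follows, using DataFrame.to_csv with index=False so that only the two columns are written. The file name submission.csv, the temporary directory, and the three rows are all assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the real submission frame
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# Hypothetical output location; a temp directory keeps the example self-contained
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")

# index=False keeps the row index out of the file
submission.to_csv(out_path, index=False)

# Read the file back to confirm its shape
check = pd.read_csv(out_path)
print(check)
```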

Reposted from: https://www.cnblogs.com/leizongfei/p/7507055.html