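Getting Started with Kaggle
I. An introductory course on dataquest
This short course uses the Titanic dataset as its example.
0. Import the commonly used libraries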
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
1. Read the CSV data
Use Pandas' read_csv() function to read data in CSV format:
titanic = pd.read_csv("titanic_train.csv")
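The head() function shows the first 5 rows of the dataset, which gives a first impression of which features the data contains and what we need to predict.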
titanic.head()
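This gives a table like the following:
2. Use describe() to understand the data
Pandas' describe() method reports summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for each numeric feature in the dataset.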
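As a small illustration of what describe() reports, the sketch below builds a toy DataFrame with one missing Age entry. The values are invented for the example and are not taken from the Titanic data. The count row is the one that reveals missing values.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Name": ["A", "B", "C", "D"],  # non-numeric, so describe() skips it by default
})

summary = toy.describe()
print(summary)

# The "count" row holds the number of non-missing values per numeric column
age_count = summary.loc["count", "Age"]
fare_count = summary.loc["count", "Fare"]
print(age_count, fare_count)
```

A count that is lower than the number of rows is the signal that a column has missing entries.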
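3. Clean up missing data
Looking closely at the describe() output, the Age column has 714 values while every other column has 891. This means Age has missing data (null, NA, or not a number).
Dropping every row that has a missing value would be a waste, because more data helps train a better model.
The simplest approach is to use .fillna() to replace the missing values in Age with the column median.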
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
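The same fill can be checked on a tiny Series. This is a minimal sketch with made-up ages, showing that median() skips missing values and that fillna() leaves the existing values untouched.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries
ages = pd.Series([20.0, np.nan, 30.0, 40.0, np.nan])

# median() ignores NaN, so this is the median of 20, 30 and 40
median_age = ages.median()

# Replace only the missing entries with the median
filled = ages.fillna(median_age)
print(filled.tolist())
```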
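4. Handle non-numeric features
When we used describe() earlier, not every column was shown: only the columns with numeric content appeared. In most cases, we do not expect these non-numeric features to contribute as they stand. There are two common approaches:
- Drop the non-numeric features that carry little value, such as Ticket, Cabin, and Name. They do not provide much useful information here.
- Convert the valuable non-numeric features into numeric ones.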
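One way to see which columns describe() left out is to list the non-numeric columns explicitly. The sketch below uses a toy DataFrame with invented values; the column names merely mimic the Titanic data.

```python
import pandas as pd

# Toy frame with two numeric and two non-numeric columns (invented values)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.28, 7.92],
})

# Columns that are not numeric, in their original order
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Dropping a column judged to carry little value
reduced = toy.drop(columns=["Name"])
print(reduced.columns.tolist())
```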
4.1 Convert the Sex column to numeric values
First, check which values the Sex column contains:
print(titanic.Sex.value_counts())
The data looks clean: no stray values such as man or woman have slipped in.
We encode male as 0 and female as 1:
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
Note: this uses Pandas' label-based indexer (.loc[]) together with a boolean mask.
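To make the mask idea concrete, the sketch below applies it to a toy column. The SexCode column name is hypothetical and only used here. It also shows Series.map() as an alternative that does the whole encoding in one step; this is a swapped-in technique, not what the course code uses.

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# A boolean mask: True where the condition holds
is_male = toy["Sex"] == "male"
print(is_male.tolist())

# .loc[mask, column] selects the masked rows of one column for assignment.
# "SexCode" is a hypothetical new column, so the original strings are kept.
toy["SexCode"] = 1
toy.loc[is_male, "SexCode"] = 0

# Alternative: map every value through a dictionary in one step
mapped = toy["Sex"].map({"male": 0, "female": 1})
print(toy["SexCode"].tolist(), mapped.tolist())
```

Writing the codes into a separate column keeps the original labels available for checking the result.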
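4.2 Convert the Embarked column to numeric values
The Embarked column contains S, C, Q, and missing values (NaN).
Because S occurs most often, we fill the missing values with S: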
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2
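Rather than hard-coding S, the most frequent value can be looked up with mode(). The following is a minimal sketch on an invented Series, again using map() for the encoding instead of the three .loc assignments above.

```python
import numpy as np
import pandas as pd

# Invented sample with one missing port
embarked = pd.Series(["S", "C", np.nan, "S", "Q"])

# mode() ignores NaN by default and returns the most frequent value(s)
most_common = embarked.mode()[0]

# Fill the gap, then encode with the same codes as in the text
codes = embarked.fillna(most_common).map({"S": 0, "C": 1, "Q": 2})
print(most_common, codes.tolist())
```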
5. Make predictions with a machine learning algorithm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize our algorithm class
alg = LinearRegression()
# Generate cross-validation folds for the titanic data set
# It returns the row indices corresponding to train and test
# Shuffling is left off, so the splits are the same on every run and the folds
# follow the original row order (recent scikit-learn versions raise an error
# if random_state is set while shuffle=False)
kf = KFold(n_splits=3)
predictions = []
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm
    # Note how we only take the rows in the train folds
    train_predictors = titanic[predictors].iloc[train, :]
    # The target we're using to train the algorithm
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)
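To see what the fold loop iterates over, the sketch below prints the index arrays that KFold.split() yields for six toy samples. Without shuffling, the test folds are contiguous blocks in row order, which is why per-fold predictions can later be joined end to end.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six toy samples with two features each
X = np.arange(12).reshape(6, 2)

kf = KFold(n_splits=3)  # shuffle defaults to False

# Collect (train indices, test indices) for each fold
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```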
6. Evaluate the prediction error
The predictions are currently split across three arrays; np.concatenate() joins several arrays into one.
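# Join the three per-fold arrays into one array, in row order
predictions = np.concatenate(predictions, axis=0)
# Map the regression outputs to class labels using a 0.5 threshold
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Accuracy is the fraction of predictions that match the true labels
accuracy = (predictions == titanic["Survived"].values).mean()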
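The thresholding and accuracy steps can be verified by hand on a few numbers. This is a minimal sketch with invented regression outputs and labels.

```python
import numpy as np

# Invented regression outputs and true labels
raw = np.array([0.1, 0.7, 0.5, 0.9])
truth = np.array([0, 1, 1, 1])

# Outputs above 0.5 become 1; everything else (including exactly 0.5) becomes 0
labels = (raw > 0.5).astype(int)

# Fraction of positions where prediction and truth agree
accuracy = (labels == truth).mean()
print(labels.tolist(), accuracy)
```

Three of the four predictions match here, so the accuracy is 0.75.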
7. Switch to Logistic Regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
alg = LogisticRegression(random_state=1)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())
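cross_val_score() wraps the split, fit, predict, and score loop in a single call and returns one score per fold. The sketch below runs it on synthetic data from make_classification, which stands in for the Titanic features. The max_iter value is an assumption added to avoid convergence warnings and is not part of the course code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for titanic[predictors] and titanic["Survived"]
X, y = make_classification(n_samples=150, n_features=5, random_state=1)

# max_iter is raised only to avoid convergence warnings (an assumption)
alg = LogisticRegression(random_state=1, max_iter=1000)

# cv=3 gives three folds, so three accuracy scores come back
scores = cross_val_score(alg, X, y, cv=3)
print(scores)
print(scores.mean())
```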
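8. Preprocess the test dataset
The test set is processed the same way as the training set. Age is filled with the median from the training set, and Fare, which also has missing data in the test set, is filled with its own median.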
titanic_test = pd.read_csv("test.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2
9. Make predictions and generate the submission file
alg = LogisticRegression(random_state=1)
alg.fit(titanic[predictors], titanic["Survived"])
predictions = alg.predict(titanic_test[predictors])
# Create a new dataframe with only the columns Kaggle wants from the data set
submission = pd.DataFrame({
    "PassengerId": titanic_test["PassengerId"],
    "Survived": predictions
})
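The DataFrame still has to be written to disk before it can be uploaded. A minimal sketch of that last step follows, using to_csv() with index=False so that only the two columns are written. The file name submission.csv, the temporary directory, and the toy values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real submission frame (invented values)
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# "submission.csv" is a hypothetical file name; a temp directory keeps the demo tidy
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")

# index=False leaves out the row index, so the file has exactly two columns
submission.to_csv(out_path, index=False)

# Read the file back to confirm its layout
check = pd.read_csv(out_path)
print(check)
```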