Kaggle Beginner Tutorial (1)

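  While working through the Kaggle tutorial on DATAQUEST, I found some of the data preprocessing code very practical. It is also written with pandas, which I had not used before, so I am recording it here. Original link: https://www.dataquest.io/mission/74/getting-started-with-kaggle
  The problem tackled in this tutorial is the Titanic competition (https://www.kaggle.com/c/titanic). It is a fairly simple problem, and I may come back to it later to learn more code.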

  For basic pandas usage, see http://pandas.pydata.org/pandas-docs/stable/10min.html

  First, read the .csv file, then use .describe() to compute some basic statistics.

import pandas

# We can use the pandas library in python to read in the csv file.
# This creates a pandas dataframe and assigns it to the titanic variable.
titanic = pandas.read_csv("titanic_train.csv")

# Print the first 5 rows of the dataframe.
print(titanic.head(5))
print(titanic.describe())
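  The statistics show that some data is missing (the count for a column such as Age is lower than the number of rows) and that some columns are of little use. At this point we need to decide which data to use, which to fill in, and which to discard, based on our common-sense understanding of the problem. For example, here name has very little effect on survival and is hard to process, so we discard it.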
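One way to see which columns have gaps is to count the missing values per column. The sketch below uses a small hand-made DataFrame as a stand-in for the Titanic data (the values are invented for illustration) and relies only on `isnull()` and `sum()`.

```python
import numpy as np
import pandas

# Toy stand-in for the training data; these values are made up.
toy = pandas.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

The same call on the real `titanic` dataframe would list the columns that need filling.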

  Age, on the other hand, is missing some values, which we need to fill in. The fill uses the median of the column.

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
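To make the median fill concrete, here is a minimal sketch on a made-up Series (the ages are assumptions for illustration only). It also computes the mean for comparison, since a single large value pulls the mean up more than the median.

```python
import numpy as np
import pandas

# Hypothetical ages with one missing entry.
ages = pandas.Series([20.0, 30.0, np.nan, 40.0, 80.0])

# median() and mean() both skip NaN by default.
median_age = ages.median()
mean_age = ages.mean()

# Replace the missing entry with the median.
filled = ages.fillna(median_age)
print(median_age, mean_age)
print(filled.tolist())
```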

  

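  Next, the non-numeric data must be converted to numbers so that it can be used for machine learning. Printing .unique() shows which distinct text values a column contains, so that none is overlooked. Here, male is mapped to 0 and female to 1.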

# Find all the unique genders -- the column appears to contain only male and female.
print(titanic["Sex"].unique())

# Replace all the occurrences of male with the number 0.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
# Replace all the occurrences of female with the number 1.
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1


  Embarked gets similar treatment.

# Find all the unique values for "Embarked".
print(titanic["Embarked"].unique())
# Fill the missing values with "S", then replace each port with a number.
titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
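An alternative way to write the same encoding is `Series.map` with a dictionary, which also leaves the column with an integer dtype. This is only a sketch on a made-up dataframe, not the method used in the original tutorial; the names `sex_codes` and `embarked_codes` are introduced here for illustration.

```python
import pandas

# Toy rows standing in for the Titanic data; values are made up.
toy = pandas.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
})

# Hypothetical lookup tables using the same codes as the text above.
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

toy["Sex"] = toy["Sex"].map(sex_codes)
# Fill the missing port first, then map every value to its code.
toy["Embarked"] = toy["Embarked"].fillna("S").map(embarked_codes)
print(toy)
```

Note that `map` turns any value missing from the dictionary into NaN, so checking `.unique()` first is still worthwhile.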

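  At this point the preprocessing is essentially done, and the algorithm part begins. The original article covers some basics of linear regression and cross-validation, which I will not repeat here. We use the scikit-learn library to make predictions and generate the prediction file.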
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation.
# (The old sklearn.cross_validation module was removed in scikit-learn 0.20; KFold now lives in sklearn.model_selection.)
from sklearn.model_selection import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  split() returns the row indices corresponding to train and test.
# The folds are not shuffled, so we get the same splits every time we run this.
kf = KFold(n_splits=3)

predictions = []
for train, test in kf.split(titanic):
    # The predictors we're using the train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
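To show what the fold loop iterates over, the sketch below splits six dummy rows into three folds and prints the row indices. It assumes a scikit-learn version that provides `sklearn.model_selection`; the array `X` is a placeholder, not Titanic data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six dummy rows; only the number of rows matters for the split.
X = np.arange(6).reshape(6, 1)

# Without shuffling, each fold's test set is a contiguous block of rows.
kf = KFold(n_splits=3)
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Because the test blocks come out in row order, concatenating the per-fold predictions gives an array aligned with the original rows.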

  Next, compute the error.

import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0

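# Count the predictions that match the actual outcomes, then divide by the total.
result = sum(np.array(titanic["Survived"] == predictions))
accuracy = result / len(predictions)
print(accuracy)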
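The thresholding and accuracy steps can be checked by hand on a tiny example. The numbers below are made up; note that a raw prediction of exactly 0.5 maps to 0, matching the `<=` rule above.

```python
import numpy as np

# Hypothetical raw regression outputs and true outcomes.
raw = np.array([0.2, 0.7, 0.5, 0.9])
actual = np.array([0, 1, 1, 1])

# Values above 0.5 become 1, everything else becomes 0.
labels = (raw > 0.5).astype(int)

# Accuracy is the fraction of labels that match the true outcomes.
accuracy = (labels == actual).mean()
print(labels.tolist(), accuracy)
```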


  The prediction accuracy is 78.3%, which is rather low. Next we try logistic regression; its accuracy turns out to be about the same.
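# cross_val_score lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)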
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
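For readers without the Titanic files at hand, the following sketch runs the same kind of cross-validation on synthetic data from `make_classification`. The dataset and its parameters are assumptions chosen only to make the example self-contained. With an integer `cv` and a classifier, scikit-learn uses stratified folds, which differs slightly from the plain `KFold` loop used for linear regression.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data; not related to the Titanic set.
X, y = make_classification(n_samples=90, n_features=4, random_state=0)

alg = LogisticRegression(random_state=1)
# One accuracy score is returned per fold.
scores = cross_val_score(alg, X, y, cv=3)
print(scores)
print(scores.mean())
```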


Finally, preprocess the test set data in the same way as above.

titanic_test = pandas.read_csv("titanic_test.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test.loc[titanic_test["Sex"]=="male","Sex"]=0
titanic_test.loc[titanic_test["Sex"]=="female","Sex"]=1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"]=="S","Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"]=="C","Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"]=="Q","Embarked"] = 2
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic["Fare"].median())

Generate our submission file:

# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })
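The dataframe still has to be written out as a CSV file before it can be uploaded. A minimal sketch, using a two-row stand-in for `submission` (the IDs and predictions are made up) and an in-memory buffer so the output can be inspected; the file name in the final comment is a hypothetical choice.

```python
import io
import pandas

# Stand-in for the submission dataframe built above; values are made up.
submission = pandas.DataFrame({
    "PassengerId": [892, 893],
    "Survived": [0, 1],
})

# index=False keeps the row index out of the file, leaving only the two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# To write a real file instead (hypothetical file name):
# submission.to_csv("kaggle.csv", index=False)
```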


