Data Mining Notes

Handling missing data:
There are currently three broad approaches:

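  1. Replace the missing values with the mean, median, a quantile, the mode, a random value, and so on. The results are mediocre, because this amounts to adding noise to the data by hand.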

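  2. Build a predictive model from the other variables and use it to estimate the missing variable. This works slightly better than method 1, but it has a fundamental flaw: if the other variables are unrelated to the missing one, the predictions are meaningless, and if the predictions are very accurate, the variable did not need to be in the model in the first place. Most real cases fall somewhere between these two extremes.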

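  3. The most accurate approach is to map the variable into a higher-dimensional space. Take gender: it has three states (male, female, missing), so it is mapped to three variables: is-male, is-female, is-missing. Continuous variables can be handled the same way. The CTR prediction models at Google and Baidu, for example, preprocess every variable like this and end up with several hundred million dimensions. The advantages are that all the information in the original data is preserved, missing values need no special handling, and problems such as linear inseparability no longer have to be considered. The drawback is a large increase in computation.
    It also works well only when the sample size is very large; otherwise the data become too sparse and the results are poor.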
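
The three approaches above can be sketched in pandas as follows. This is only a minimal illustration under stated assumptions: the toy data and column names are made up, the straight-line fit with `np.polyfit` is a stand-in for whatever predictive model one would actually use in method 2, and binning the continuous variable with `pd.cut` is one assumed way of turning it into indicator columns for method 3.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: "sex" and "age" each have one gap.
df = pd.DataFrame({
    "sex": ["male", "female", None, "male"],
    "age": [22.0, 38.0, np.nan, 30.0],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# Method 1: replace missing values with a summary statistic.
age_median_filled = df["age"].fillna(df["age"].median())
sex_mode_filled = df["sex"].fillna(df["sex"].mode()[0])

# Method 2: predict the missing variable from another one.
# A straight-line fit of age on fare stands in for a real model here.
known = df["age"].notna()
slope, intercept = np.polyfit(df.loc[known, "fare"], df.loc[known, "age"], deg=1)
age_model_filled = df["age"].copy()
age_model_filled.loc[~known] = slope * df.loc[~known, "fare"] + intercept

# Method 3: map the variable to indicator columns, with an explicit
# column for "missing" (dummy_na=True appends it as the last column).
sex_indicators = pd.get_dummies(df["sex"], prefix="sex", dummy_na=True)

# A continuous variable can be binned first (the bin edges are arbitrary)
# and then expanded in the same way.
age_band = pd.cut(
    df["age"], bins=[0, 25, 35, 120], labels=["young", "middle", "older"]
).astype(object)
age_indicators = pd.get_dummies(age_band, prefix="age", dummy_na=True)
```

Note that in method 3 the missing row is not filled at all: it simply gets a 1 in its own "missing" column, which is why no information is lost.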

One step in any data analysis is data cleaning. Thankfully, pandas makes it easy to filter, manipulate, drop, fill in, transform and replace values inside the dataframe.
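
As a rough sketch of what these operations look like, here is one line for each. The small Titanic-style frame and its column names are assumptions made for the example, not data from the tutorial.

```python
import numpy as np
import pandas as pd

# Made-up, Titanic-style rows used only to demonstrate the calls.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2", "113803"],
})

adults = df[df["Age"] > 25]                  # filter rows by a condition
no_ticket = df.drop(columns=["Ticket"])      # drop a column
complete = df.dropna(subset=["Embarked"])    # drop rows with a missing value

df["Age"] = df["Age"].fillna(df["Age"].median())        # fill in gaps
df["Gender"] = df["Sex"].map({"female": 0, "male": 1})  # transform to numbers
df["Embarked"] = df["Embarked"].replace(                # replace codes
    {"S": "Southampton", "C": "Cherbourg"}
)
```

The first three calls return new objects and leave `df` unchanged; the last three assign the result back to a column.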

A single column is neither a NumPy array nor a pandas DataFrame, but rather a pandas-specific object called a Series.
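
A quick way to see the difference, using a made-up two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})

col = df["Age"]        # single brackets: a pandas Series
sub = df[["Age"]]      # double brackets: a one-column DataFrame
arr = col.to_numpy()   # the values as a plain NumPy array

print(type(col).__name__, type(sub).__name__, type(arr).__name__)
```

The Series keeps the row index and the column name alongside the values, which is what the plain array loses.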

It appears that I have not improved my score! This seems strange, as your initial thoughts are “This is more complicated, therefore should be better!” This gives us three lessons to bear in mind:

1. A simple model is not always a bad model. Sometimes, concise, simple views of data reveal their true patterns and nature.

2. This is not the final score for my new submission! I have not done as well on the public leaderboard, but who knows what the private score may hold? I made my previous model on the assumptions of the training data: we still don’t know how these will hold up in the private leaderboard.

3. Because the data set is very small, the differences in scores can be just one or two flips in decisions between survived or not survived. This means it will be very hard to determine the quality of the model from this data set. The aim of our Titanic Tutorial was to show you an easy way into more difficult problems, so don’t be too disheartened if your super-complicated random forest doesn’t beat the gender-based model!
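
To make the third point concrete, here is a small arithmetic sketch. The number of scored rows and the counts of correct predictions are hypothetical, chosen only to show the size of the effect.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of rows predicted correctly."""
    return correct / total

n_scored = 200  # hypothetical number of rows used for the score

simple_model = accuracy(156, n_scored)    # hypothetical count
complex_model = accuracy(158, n_scored)   # two more correct decisions

# Each flipped decision moves the score by 1 / n_scored.
step = 1 / n_scored
print(f"one flip = {step:.3f}, gap = {complex_model - simple_model:.3f}")
```

With so few rows, a gap of a point or so between two models is within what a couple of borderline passengers can produce.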

In terms of improving your model from here, you could consider any of these paths to try on your own:

1. Revisit your assumptions about how you cleaned and filled the data.

2. Be creative with additional feature engineering, so that your chosen model has more columns to train on.

3. Use the sklearn documentation to experiment with different parameters for your random forest.

4. Consider a different model approach. For example, a logistic regression model is often used to predict binary outcomes like 0/1.
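
Paths 3 and 4 might look something like the sketch below. It assumes scikit-learn is installed and uses synthetic data from `make_classification` in place of the cleaned Titanic features. The parameter values are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the real features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

candidates = {
    # Path 3: the same random forest with different parameters.
    "rf_default": RandomForestClassifier(n_estimators=100, random_state=0),
    "rf_shallow": RandomForestClassifier(
        n_estimators=100, max_depth=3, min_samples_leaf=5, random_state=0
    ),
    # Path 4: a different kind of model for a 0/1 outcome.
    "logreg": LogisticRegression(max_iter=1000),
}

mean_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy on the training data.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing cross-validated scores rather than a single leaderboard number also helps with the small-data problem noted earlier, since the spread across folds shows how much of a difference is noise.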
