Data Mining Notes

he_world

Category: Machine Learning

Article link: https://blog.csdn.net/he_world/article/details/54647565


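For missing data, there are currently three kinds of approaches:

1. Replace the missing value with the mean, median, a quantile, the mode, a random value, and so on. The results are mediocre, because this amounts to adding noise by hand.

2. Build a predictive model from the other variables to estimate the missing variable. This works slightly better than method 1, but it has a fundamental flaw: if the other variables are unrelated to the missing one, the prediction is meaningless; and if the prediction is quite accurate, the variable is redundant given the others and did not need to be in the model in the first place. In practice, the situation usually falls somewhere in between.

3. The most accurate approach is to map the variable into a higher-dimensional space. Take sex, which has three cases (male, female, missing): map it to three variables, is-male, is-female, and is-missing. Continuous variables can be handled the same way. For example, the CTR prediction models at Google and Baidu preprocess every variable like this, reaching hundreds of millions of dimensions. The advantages are that all the information in the original data is preserved, missing values need no special treatment, and problems such as linear inseparability no longer need to be considered. The drawback is that the computational cost rises sharply. It also works well only when the sample size is very large; otherwise the data becomes too sparse and the results are poor.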
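The three approaches above can be sketched with pandas roughly as follows. This is a minimal illustration, not a recommended pipeline: the toy table and its column names (`age`, `fare`, `sex`) are made up, and a least-squares line fitted with `np.polyfit` stands in for whatever predictive model method 2 would really use.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): age and sex both contain missing values.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "sex": ["male", "female", None, "female", "male", None],
})

# Method 1: replace missing values with a summary statistic (here the median).
age_median_filled = df["age"].fillna(df["age"].median())

# Method 2: predict the missing variable from another variable.
# Assumption: a straight line age ~ fare is only a stand-in for a real model.
known = df["age"].notna()
slope, intercept = np.polyfit(df.loc[known, "fare"], df.loc[known, "age"], deg=1)
age_model_filled = df["age"].copy()
age_model_filled[~known] = slope * df.loc[~known, "fare"] + intercept

# Method 3: map the variable to indicator columns,
# with "missing" treated as a level of its own (dummy_na=True).
sex_indicators = pd.get_dummies(df["sex"], prefix="sex", dummy_na=True)
print(sex_indicators)
```

With method 3 every row has exactly one indicator set, so the model never sees a missing value; the price is one extra column per level, which is where the dimensionality blow-up mentioned above comes from.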

One step in any data analysis is data cleaning. Thankfully, pandas makes it easy to filter, manipulate, drop, fill in, transform, and replace values inside a dataframe.
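As a rough sketch of what those operations look like in practice, here is one pandas call for each. The small passenger table and its column names are invented for the example, and filling `embarked` with `"S"` is an arbitrary choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up passenger records, only for illustration.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [22.0, np.nan, 35.0, 4.0],
    "embarked": ["S", "C", None, "Q"],
    "ticket": ["x1", "x2", "x3", "x4"],
})

# Filter: keep rows that satisfy a condition (a missing age compares as False).
adults = df[df["age"] >= 18]

# Drop: remove a column, or remove rows that contain any missing value.
no_ticket = df.drop(columns=["ticket"])
complete = df.dropna()

# Fill in: a per-column fill value, here the median age and an arbitrary port.
filled = df.fillna({"age": df["age"].median(), "embarked": "S"})

# Replace: map coded values to readable ones.
ports = filled["embarked"].replace(
    {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
)

# Transform: derive a new value from each entry of a column.
age_decade = filled["age"].apply(lambda a: int(a // 10))
```

None of these calls modify `df` in place; each returns a new object, which makes it easy to compare the cleaned data against the original.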

A single column is neither a NumPy array nor a pandas dataframe, but rather a pandas-specific object called a Series.
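A quick way to see the distinction is to check the types directly. The two-column frame below is made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up data, only for illustration.
df = pd.DataFrame({"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.28, 7.92]})

single = df["age"]             # single brackets: a Series
frame = df[["age"]]            # a list of column names: a one-column DataFrame
values = df["age"].to_numpy()  # explicit conversion to a NumPy array

print(type(single).__name__, type(frame).__name__, type(values).__name__)
```

The Series keeps the row index of the dataframe it came from, which is what lets pandas align values by label rather than by position.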

It appears that I have not improved my score! This seems strange, since the natural first thought is “This is more complicated, therefore it should be better!” This gives us three lessons to bear in mind:

1. A simple model is not always a bad model. Sometimes a concise, simple view of the data reveals its true patterns and nature.

2. This is not the final score for my new submission! I have not done as well on the public leaderboard, but who knows what the private score may hold? I built my previous model on assumptions drawn from the training data; we still don’t know how these will hold up on the private leaderboard.

3. Because the data set is very small, the differences in scores can come down to just one or two decisions flipped between survived and not survived. This means it will be very hard to determine the quality of the model from this data set. The aim of our Titanic Tutorial was to show you an easy way into more difficult problems, so don’t be too disheartened if your super-complicated random forest doesn’t beat the gender-based model!

To improve your model from here, you could try any of these paths on your own:

1. Revisit your assumptions about how you cleaned and filled the data.

2. Be creative with additional feature engineering, so that your chosen model has more columns to train on.

3. Use the sklearn documentation to experiment with different parameters for your random forest.

4. Consider a different modeling approach. For example, a logistic regression model is often used to predict binary outcomes like 0/1.
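Paths 3 and 4 can be tried side by side with a few lines of scikit-learn. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the Titanic data, and the parameter values (`n_estimators=100`, `max_depth=3`, `max_iter=1000`) are arbitrary starting points, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for a cleaned training set.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Candidate models: two random forest settings and a logistic regression.
candidates = {
    "random forest (default depth)": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    "random forest (max_depth=3)": RandomForestClassifier(
        n_estimators=100, max_depth=3, random_state=0
    ),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Comparing cross-validated scores rather than a single leaderboard number also softens the small-data problem from lesson 3, since each model is judged on several held-out folds instead of one.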
