sklearn: Basic Usage

ivysister

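Category: Machine Learning Fundamentals. Tags: sklearn, kaggle

Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/ivysister/article/details/50904028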

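Data cleaning can be done with pandas, but once you get to prediction you will want the well-known sklearn (scikit-learn). It ships with many basic algorithms and helps data scientists solve a wide range of problems.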

(a) Data normalization

from sklearn import preprocessing

# normalize each sample (row) to unit norm; L2 by default

normalized_X = preprocessing.normalize(X)  # note: this is NOT a 0-to-1 range scaling

# standardize the data attributes

standardized_X = preprocessing.scale(X)  # each feature (column) gets zero mean and unit variance
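To make the difference between the two calls concrete, here is a minimal sketch on made-up data. The toy array is hypothetical, and `MinMaxScaler` is not part of the original post; I include it only because it is the scikit-learn class that actually maps each feature into the 0-to-1 range.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples (rows) x 2 features (columns).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# normalize: rescales each ROW to unit L2 norm (the default is norm="l2").
normalized_X = preprocessing.normalize(X)
row_norms = np.linalg.norm(normalized_X, axis=1)

# scale: shifts each COLUMN to mean 0 and rescales it to unit variance.
standardized_X = preprocessing.scale(X)
col_means = standardized_X.mean(axis=0)
col_stds = standardized_X.std(axis=0)

# MinMaxScaler: maps each COLUMN linearly onto the [0, 1] range.
minmax_X = preprocessing.MinMaxScaler().fit_transform(X)

print("row norms after normalize:", row_norms)
print("column means / stds after scale:", col_means, col_stds)
print("column min / max after MinMaxScaler:", minmax_X.min(axis=0), minmax_X.max(axis=0))
```

Which one to use depends on the model: distance-based and linear models usually want per-feature scaling (`scale` or `MinMaxScaler`), while per-sample normalization is more common for things like text vectors.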

(b) Feature selection

from sklearn.ensemble import ExtraTreesClassifier

# an ensemble of randomized decision trees; importances are impurity-based,
# so features that the trees split on near the root tend to score higher

model = ExtraTreesClassifier()

model.fit(X, y)

# display the relative importance of each attribute

print(model.feature_importances_)
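The snippet above assumes `X` and `y` already exist. As a self-contained sketch, the version below builds a synthetic dataset with `make_classification` (my assumption, not something the original post uses) in which only the first two columns carry signal, then ranks the features by importance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (an assumption for illustration): 6 features, 2 informative.
# With shuffle=False the informative features are columns 0 and 1,
# and the remaining columns are pure noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, n_repeated=0,
                           shuffle=False, random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

importances = model.feature_importances_

# Feature indices sorted from most to least important.
ranking = importances.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

The importances are normalized to sum to 1, so they are best read as relative shares rather than absolute effect sizes.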

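from sklearn.feature_selection import RFE

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# create the RFE model and select 3 attributes
# RFE is not an exhaustive search over all 3-feature subsets: it repeatedly fits the model
# and drops the weakest feature until 3 remain

rfe = RFE(model, n_features_to_select=3)  # recent scikit-learn versions require the keyword form

rfe = rfe.fit(X, y)

# summarize the selection of the attributes

print(rfe.support_)  # boolean mask of the selected features

print(rfe.ranking_)  # rank 1 marks a selected feature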
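Here is a runnable sketch of recursive feature elimination on synthetic data. The dataset and the `max_iter=1000` setting are my own assumptions for illustration. With the default `step=1`, RFE fits the estimator, removes the single feature with the smallest coefficient magnitude, and repeats until the requested number of features is left.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption): 8 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Keep 3 features; one feature is eliminated per round (step=1 is the default).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)   # True for the 3 features that survived
print(rfe.ranking_)   # 1 for survivors; larger numbers were dropped earlier

# Reduce the data to the selected columns.
X_reduced = rfe.transform(X)
print(X_reduced.shape)
```

Because every round refits the model, RFE costs far less than trying every subset, but it is greedy and can miss the best combination.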

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

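The feature_selection module

Univariate feature selection

Univariate feature selection computes a statistic for each variable on its own and uses that statistic to judge which variables matter. The unimportant ones are dropped.

The sklearn.feature_selection module mainly provides the following methods:

SelectKBest and SelectPercentile are similar: the former keeps the top k ranked variables, the latter keeps the top n% of them. What do they rank the variables by? That has to be specified separately, as a score function.

For regression problems, use f_regression. For classification problems, use chi2 or f_classif as the score function (chi2 requires non-negative feature values).

Example: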

from sklearn.feature_selection import SelectPercentile, f_classif

selector = SelectPercentile(f_classif, percentile=10)
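The selector above still has to be fitted before it does anything. A minimal end-to-end sketch, assuming the bundled iris dataset as stand-in data (my choice, not the original post's), shows both `SelectPercentile` and `SelectKBest` with two different score functions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif

# Iris: 150 samples, 4 features, all non-negative (so chi2 is applicable).
X, y = load_iris(return_X_y=True)

# Keep the top 50% of features, ranked by the ANOVA F-score.
percentile_selector = SelectPercentile(f_classif, percentile=50)
X_top_half = percentile_selector.fit_transform(X, y)

# Keep the 2 best features, ranked by the chi-squared statistic.
kbest_selector = SelectKBest(chi2, k=2)
X_top_two = kbest_selector.fit_transform(X, y)

print(percentile_selector.scores_)        # one score per original feature
print(percentile_selector.get_support())  # boolean mask of the kept features
print(X_top_half.shape, X_top_two.shape)
```

After fitting, `scores_` and `pvalues_` hold the per-feature statistics, which is handy for checking why a feature was kept or dropped.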

There are several other selectors that apply common univariate statistical tests to each feature and keep those whose p-values pass a threshold alpha: SelectFpr (false positive rate), SelectFdr (false discovery rate), and SelectFwe (family-wise error).
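As a rough sketch of how these three selectors compare, the code below runs them with the same `alpha` on synthetic data. The dataset and `alpha=0.05` are assumptions for illustration. All three work from the p-values of the score function and differ only in how strictly they threshold them.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, SelectFpr, SelectFwe, f_classif

# Synthetic data (an assumption): 20 features, only 3 informative.
X, y = make_classification(n_samples=400, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

selectors = {
    "SelectFpr": SelectFpr(f_classif, alpha=0.05),  # keeps p-values below alpha
    "SelectFdr": SelectFdr(f_classif, alpha=0.05),  # Benjamini-Hochberg procedure
    "SelectFwe": SelectFwe(f_classif, alpha=0.05),  # Bonferroni-style: alpha / n_features
}

kept = {}
for name, selector in selectors.items():
    selector.fit(X, y)
    # Indices of the features each selector keeps.
    kept[name] = selector.get_support(indices=True)
    print(name, kept[name])
```

SelectFwe is the strictest of the three, so whatever it keeps is also kept by the other two at the same alpha.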

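The documentation says that with sparse matrices only chi2 can be used, and that for every other score function the data has to be converted to a dense matrix first. In practice, though, I found that f_classif also works on sparse matrices (this may depend on the scikit-learn version).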
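A quick way to check this on your own installation is sketched below. The random count data is hypothetical, and the result may differ between scikit-learn versions, so treat it as an experiment rather than a guarantee.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2, f_classif

rng = np.random.RandomState(0)

# Hypothetical sparse count data: 100 samples x 30 non-negative features.
dense = rng.poisson(0.3, size=(100, 30)).astype(float)
y = rng.randint(0, 2, size=100)
X_sparse = sparse.csr_matrix(dense)

# Run both score functions directly on the sparse matrix.
for score_func in (chi2, f_classif):
    selector = SelectKBest(score_func, k=5)
    X_new = selector.fit_transform(X_sparse, y)
    print(score_func.__name__, type(X_new).__name__, X_new.shape)
```

If a score function did need dense input, `X_sparse.toarray()` would be the conversion, at the cost of the memory savings.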

Recursive featu
