sklearn kfold: Sklearn Quick Start


Feature engineering:

Single-machine feature engineering with sklearn

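  • Standardization (needed when similarity is measured by distance or when reducing dimensionality with PCA):
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training set only, then reuse its statistics on the test set
scaler = StandardScaler()
data_train = scaler.fit_transform(data_train)
data_test = scaler.transform(data_test)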
  • Min-max (interval) scaling:
from sklearn.preprocessing import MinMaxScaler
data = MinMaxScaler().fit_transform(data)
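  • Normalization (helps gradient descent and removes the effect of units):
from sklearn.preprocessing import Normalizer
# Normalizer rescales each sample (row) to unit norm
data = Normalizer().fit_transform(data)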
  • Binarizing quantitative features (values greater than epsilon become 1, values less than or equal to epsilon become 0):
from sklearn.preprocessing import Binarizer
data = Binarizer(threshold=epsilon).fit_transform(data)
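  • Converting categorical features to numeric features:

In practice this keeps the numeric features as they are and turns each category into a dummy variable (one-hot encoding). See: "Using DictVectorizer in Python".

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
# orient='records' yields one dict per row, which is the input DictVectorizer expects
X_train = vec.fit_transform(X_train.to_dict(orient='records'))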
  • Chi-squared test:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Select the K best features and return the data restricted to those features
skb = SelectKBest(chi2, k=10).fit(X_train, y_train)
X_train = skb.transform(X_train)
X_test = skb.transform(X_test)
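  • Mutual information:
import numpy as np
from sklearn.feature_selection import SelectKBest
from minepy import MINE

# MINE is not designed as a function, so wrap it: mic() returns a pair
# whose second item is a fixed p-value of 0.5
def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)

# SelectKBest expects score_func to return a (scores, pvalues) pair of arrays
def mic_scores(X, y):
    scores, pvalues = zip(*(mic(x, y) for x in X.T))
    return np.array(scores), np.array(pvalues)

# Select the K best features and return the data after feature selection
SelectKBest(mic_scores, k=2).fit_transform(iris.data, iris.target)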
  • Principal component analysis (PCA):
from sklearn.decomposition import PCA
estimator = PCA(n_components=2)  # number of principal components to keep
X_pca = estimator.fit_transform(X_data)
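The preprocessing steps in this list are usually chained, and the order in which they are fitted matters. The following is a minimal sketch, assuming the iris dataset as stand-in data (it is not the data used in the post): the scaler and PCA are fitted on the training split only and then applied unchanged to the test split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): iris, 150 samples with 4 numeric features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Learn mean and variance from the training split only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Learn the principal axes from the standardized training split only
pca = PCA(n_components=2).fit(X_train_std)
X_train_pca = pca.transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

print(X_train_pca.shape, X_test_pca.shape)
```

Fitting on the training split alone keeps information about the test split out of the learned statistics, which is the usual reason for preferring `transform` over a second `fit_transform`.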

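Learning algorithms:

Splitting into training and test sets:

# sklearn.cross_validation was removed in newer releases; use sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)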

Training:

# LearnAlgorithm is a placeholder: import the estimator class you need from its sklearn submodule
from sklearn import LearnAlgorithm
la = LearnAlgorithm()
la.fit(X_train, y_train)
y_predict = la.predict(X_test)
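To make the fit/predict pattern above concrete, here is a small sketch that fills in the placeholder with `DecisionTreeClassifier`. The choice of estimator and the iris data are assumptions made for illustration; any sklearn estimator follows the same two calls.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Every sklearn estimator exposes the same fit / predict interface
la = DecisionTreeClassifier(max_depth=3, random_state=0)
la.fit(X_train, y_train)
y_predict = la.predict(X_test)

print(y_predict[:5])
```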

Stochastic gradient descent (SGD):

from sklearn.linear_model import SGDClassifier, SGDRegressor
sgdc = SGDClassifier()
# 'squared_loss' was renamed to 'squared_error' in newer sklearn releases
sgdr = SGDRegressor(loss='squared_error', penalty=None, random_state=42)

Support vector machines (SVM):

Support vector classification (SVC):

from sklearn.svm import SVC
svc_linear = SVC(kernel='linear')  # linear kernel; other kernels can be chosen

Support vector regression (SVR):

from sklearn.svm import SVR
svr_linear = SVR(kernel='linear')  # linear kernel; other kernels such as poly or rbf can be chosen

Naive Bayes:

from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()

Decision tree (DecisionTreeClassifier):

from sklearn.tree import DecisionTreeClassifier
# Maximum depth and minimum samples per leaf are used to prevent overfitting
dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)

Random forest (RandomForestClassifier):

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=3, min_samples_leaf=5)

Gradient boosted trees (GBDT):

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(max_depth=3, min_samples_leaf=5)

Extremely randomized trees regression (ExtraTreesRegressor):

from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor()

Evaluation:

from sklearn import metrics
accuracy_rate = metrics.accuracy_score(y_test, y_predict)
# The report gives precision, recall and similar metrics for each class
print(metrics.classification_report(y_test, y_predict, target_names=data.target_names))
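A small worked example may help in reading these metrics. The labels below are made up for illustration (they are not results from the post), and the class names `a`, `b`, `c` are hypothetical. Five of the six predictions match, so the accuracy is 5/6.

```python
from sklearn import metrics

# Hand-made labels (assumption): one sample of class 1 is misclassified as class 2
y_test = [0, 0, 1, 1, 2, 2]
y_predict = [0, 0, 1, 2, 2, 2]

# Fraction of predictions that match the true labels: 5 out of 6
accuracy_rate = metrics.accuracy_score(y_test, y_predict)

# Text table with per-class precision, recall, F1 and support
report = metrics.classification_report(
    y_test, y_predict, target_names=['a', 'b', 'c'])

print(accuracy_rate)
print(report)
```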

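K-fold cross-validation:

from sklearn.model_selection import cross_val_score, KFold
# Newer sklearn takes the number of folds as n_splits instead of the sample count
cv = KFold(n_splits=K, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)

# Passing an integer as cv lets sklearn build the folds itself
from sklearn.model_selection import cross_val_score
scores = cross_val_score(dt, X_train, y_train, cv=K)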

Note that X and y should be ndarrays here; if they are DataFrames, convert them with df.values and df.values.flatten().
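To see what K-fold cross-validation does with the data, the sketch below splits iris (an assumed stand-in dataset) into 5 shuffled folds, records the size of each held-out fold, and scores a decision tree on them. The estimator and its settings are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris, already ndarrays
X, y = load_iris(return_X_y=True)

K = 5
cv = KFold(n_splits=K, shuffle=True, random_state=0)

# Each split yields (train indices, test indices); every sample is held out exactly once
fold_sizes = [len(test_idx) for _, test_idx in cv.split(X)]

dt = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(dt, X, y, cv=cv)  # one score per fold

print(fold_sizes)
print(scores.mean())
```

The mean of `scores` is the usual single-number summary; the spread across folds gives a rough idea of how stable that estimate is.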

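The Pipeline mechanism:

A pipeline wraps and manages all processing steps as a single stream, so the same set of parameters can be applied repeatedly to different datasets. A Pipeline object takes a list of 2-tuples: the first element is a name of your choice, and the second is an sklearn transformer or estimator, that is, a feature-processing step or a learning method. Taking naive Bayes as an example, different ways of processing the features give the following code:

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer

clf_1 = Pipeline([('count_vec', CountVectorizer()), ('mnb', MultinomialNB())])
# non_negative=True was removed in newer sklearn; alternate_sign=False keeps the hashed counts non-negative
clf_2 = Pipeline([('hash_vec', HashingVectorizer(alternate_sign=False)), ('mnb', MultinomialNB())])
clf_3 = Pipeline([('tfidf_vec', TfidfVectorizer()), ('mnb', MultinomialNB())])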
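As a runnable illustration of the pipeline idea, the sketch below trains the count-vector plus naive Bayes pipeline on a tiny toy corpus. The four documents and their labels are invented for this example; a real task would use a proper text dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (assumption): label 1 = advertisement, label 0 = work message
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday project meeting notes",
]
labels = [1, 1, 0, 0]

clf = Pipeline([('count_vec', CountVectorizer()), ('mnb', MultinomialNB())])

# fit() runs fit_transform on the vectorizer, then fit on the classifier
clf.fit(docs, labels)

# predict() reuses the fitted vocabulary before calling the classifier
pred = clf.predict(["buy cheap pills", "project meeting on monday"])
print(pred)
```

Because raw text goes in and labels come out, the whole pipeline can be passed to `cross_val_score` or `GridSearchCV` like a single estimator.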

Feature selection:

from sklearn import feature_selection
# per is the percentage of features to keep
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=per)
X_train_fs = fs.fit_transform(X_train, y_train)

Taking feature selection and 5-fold cross-validation as an example, here is a complete parameter selection procedure:

import numpy as np
from sklearn import feature_selection
from sklearn.model_selection import cross_val_score

percentiles = range(1, 100)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_val_score(dt, X_train_fs, y_train, cv=5)
    results.append(scores.mean())

# Map the index of the best mean score back to its percentile value
opt = percentiles[int(np.argmax(results))]
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=opt)
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
# Apply the same feature selection to the test set before predicting
y_predict = dt.predict(fs.transform(X_test))
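The same percentile search can also be written with a Pipeline and `GridSearchCV`, so that the feature selector is refitted inside each fold. This is an alternative sketch, not the procedure from the post; the iris data, the step names `fs` and `dt`, and the four candidate percentiles are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris; chi2 requires non-negative features, which holds here
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('fs', SelectPercentile(chi2)),
    ('dt', DecisionTreeClassifier(random_state=0)),
])

# '<step name>__<parameter>' addresses a parameter of one pipeline step
param_grid = {'fs__percentile': [25, 50, 75, 100]}

gs = GridSearchCV(pipe, param_grid, cv=5)
gs.fit(X, y)

best_percentile = gs.best_params_['fs__percentile']
print(best_percentile, gs.best_score_)
```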

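Hyperparameters:

Hyperparameters are the framework-level parameters of a machine learning model: they are set before training rather than learned from the data. They matter a great deal in both competitions and engineering work.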
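The distinction between hyperparameters and learned parameters can be seen directly on an estimator. In this sketch (iris is an assumed stand-in dataset), `get_params()` lists the hyperparameters chosen up front, while the depth of the fitted tree is a result of training that the `max_depth` hyperparameter only bounds.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters are passed to the constructor, before any data is seen
dtc = DecisionTreeClassifier(
    criterion='entropy', max_depth=3, min_samples_leaf=5, random_state=0)
hyperparams = dtc.get_params()

# The tree structure itself is learned during fit()
dtc.fit(X, y)
learned_depth = dtc.get_depth()

print(hyperparams['max_depth'], learned_depth)
```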

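Ensemble learning:

Ensemble learning combines several models to improve overall performance; random forests and XGBoost are examples. See the following article:

"Ensemble Learning: Model Fusion, a Python Implementation"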
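One simple form of model fusion is majority voting. The sketch below uses sklearn's `VotingClassifier` to combine three classifiers on iris. The choice of base models, the step names, and the dataset are assumptions for illustration and are not taken from the referenced article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hard voting: each base model casts one vote and the majority class wins
voter = VotingClassifier(
    estimators=[
        ('dtc', DecisionTreeClassifier(max_depth=3, random_state=0)),
        ('rfc', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('gnb', GaussianNB()),
    ],
    voting='hard',
)

# Default integer cv uses stratified folds for classifiers
scores = cross_val_score(voter, X, y, cv=5)
print(scores.mean())
```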

Multi-threaded grid search:

Grid search is used to find the best parameters. See the following article:

"Sklearn GridSearchCV Grid Search"

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# news is assumed to be an already loaded text dataset with .data and .target
X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)

clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])
# Step name and parameter name are joined with a double underscore
parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}
# n_jobs=-1 runs the search on all available cores
gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3, n_jobs=-1)
gs.fit(X_train, y_train)  # in IPython, prefix with %time to measure the search
print(gs.best_params_, gs.best_score_)
print(gs.score(X_test, y_test))