Udacity Machine Learning: Cross-Validation

灵魂画手-编程如画

Article link: https://blog.csdn.net/grape875499765/article/details/78613394

Exercise: Train/Test Split in sklearn

#!/usr/bin/python

""" 
PLEASE NOTE:
The api of train_test_split changed and moved from sklearn.cross_validation to
sklearn.model_selection(version update from 0.17 to 0.18)

The correct documentation for this quiz is here: 
http://scikit-learn.org/0.17/modules/cross_validation.html
"""

from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
features = iris.data
labels = iris.target

###############################################################
### YOUR CODE HERE
###############################################################

### import the relevant code and make your train/test split
### name the output datasets features_train, features_test,
### labels_train, and labels_test
# PLEASE NOTE: The import here changes depending on your version of sklearn
# For version 0.18 and later (sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split
# For version 0.17
# from sklearn.cross_validation import train_test_split


### set the random_state to 0 and the test_size to 0.4 so
### we can exactly check your result
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.4, random_state=0)

###############################################################
# DONT CHANGE ANYTHING HERE
clf = SVC(kernel="linear", C=1.)
clf.fit(features_train, labels_train)

print(clf.score(features_test, labels_test))
##############################################################
def submitAcc():
    return clf.score(features_test, labels_test)
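
As a quick sanity check on the split in this exercise, the sketch below prints the sizes of the resulting arrays. It assumes sklearn 0.18 or later (for `sklearn.model_selection`) and is only an illustration, not part of the graded quiz code.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Same arguments as the exercise: 40% held out, fixed seed for repeatability.
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# iris has 150 samples with 4 features each,
# so 90 samples go to training and 60 to testing.
print(features_train.shape)  # (90, 4)
print(features_test.shape)   # (60, 4)
```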

K-Fold Cross-Validation

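K-fold splitting can run into the problem that every sample in a fold belongs to the same class, for example when the data is sorted by label and the folds are taken in order.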
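
A minimal sketch of that problem, assuming sklearn 0.18 or later. The toy labels are hypothetical and deliberately sorted by class; `classes_per_test_fold` is a helper made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy labels, sorted by class: 6 samples of class 0, then 6 of class 1.
labels = np.array([0] * 6 + [1] * 6)
features = np.arange(12).reshape(-1, 1)

def classes_per_test_fold(splitter):
    """Return the set of classes that appears in each test fold."""
    return [set(labels[test_idx].tolist())
            for _, test_idx in splitter.split(features)]

# Without shuffling, folds are consecutive slices, so each test fold holds one class.
ordered = classes_per_test_fold(KFold(n_splits=2))
print(ordered)

# Shuffling before splitting mixes the samples across folds.
shuffled = classes_per_test_fold(KFold(n_splits=2, shuffle=True, random_state=0))
print(shuffled)
```

Passing `shuffle=True` is one way around the problem; `StratifiedKFold` is another option, since it keeps the class proportions roughly equal in every fold.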
GridSearchCV chooses its parameters by cross-validation.
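
A short sketch of what that looks like in practice, assuming sklearn 0.18 or later. The parameter grid is a hypothetical example, not one taken from the course.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = datasets.load_iris()

# Hypothetical candidate values; every combination is scored by 5-fold cross-validation.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)  # the combination with the best mean CV score
print(search.best_score_)   # that mean cross-validated accuracy
```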

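Note: what targetFeatureSplit, a helper function written by Udacity, does

labels, features = targetFeatureSplit(data)
data is a two-dimensional array, for example
[
    [1,12.1],
    [0,14.1],
    [1,13.1],
    [1,15.2]
]
The first return value is the first column, which is used as the labels:
labels = [1,0,1,1]
The second return value is the remaining columns (here just the second column), which are used as the training features:
features=
[
    [12.1],
    [14.1],
    [13.1],
    [15.2]
]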
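
The behavior described above can be sketched as follows. `target_feature_split_sketch` is a hypothetical stand-in written for this note, not Udacity's actual source; it only reproduces the input/output relationship shown in the example.

```python
def target_feature_split_sketch(data):
    """Split rows into (labels, features): column 0 is the label, the rest are features."""
    labels = []
    features = []
    for row in data:
        labels.append(row[0])
        features.append(list(row[1:]))
    return labels, features

data = [
    [1, 12.1],
    [0, 14.1],
    [1, 13.1],
    [1, 15.2],
]
labels, features = target_feature_split_sketch(data)
print(labels)    # [1, 0, 1, 1]
print(features)  # [[12.1], [14.1], [13.1], [15.2]]
```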

Exercise: Your First (Overfit) POI Identifier

Answer: 0.989473684211

validate_poi.py

#!/usr/bin/python


"""
    Starter code for the validation mini-project.
    The first step toward building your POI identifier!

    Start by loading/formatting the data

    After that, it's not our code anymore--it's yours!
"""

import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "rb") )

### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

### it's all yours from here forward!  
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(features,labels)
print(clf.score(features,labels))
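
The score above is computed on the same data the tree was trained on, which is why it comes out so high. The sketch below illustrates the effect on hypothetical synthetic data (not the Enron dataset), assuming sklearn 0.18 or later: an unpruned decision tree memorizes noisy training labels, so its training accuracy says little about held-out accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data with 20% of the labels assigned at random (noise).
X, y = make_classification(n_samples=300, n_features=5, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # scored on data the tree has already seen
test_acc = clf.score(X_test, y_test)     # scored on held-out data
print(train_acc)
print(test_acc)
```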

Exercise: Deploying a Train/Test Regime

Answer: 0.724137931034

validate_poi.py

#!/usr/bin/python


"""
    Starter code for the validation mini-project.
    The first step toward building your POI identifier!

    Start by loading/formatting the data

    After that, it's not our code anymore--it's yours!
"""

import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "rb") )

### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

### it's all yours from here forward!  

from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(features, labels,test_size=0.3,random_state=42)

from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(features_train,labels_train)

result = clf.predict(features_test)

from sklearn.metrics import accuracy_score

print(accuracy_score(labels_test,result))

#print(clf.score(features_test,labels_test))
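
The commented-out line is an alternative way to get the same number: for a classifier, `clf.score` reports the mean accuracy on the data it is given. A small check of that, using the iris data as a stand-in because the Enron pickle is not assumed to be available here (sklearn 0.18 or later assumed):

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Two routes to the accuracy on the held-out set.
via_metric = accuracy_score(y_test, clf.predict(X_test))
via_score = clf.score(X_test, y_test)
print(via_metric == via_score)
```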