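Problems the powerful sklearn library can solve:
train_test_split returns the dataset split into train and test subsets:
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
*arrays: the data to split (list / np.array / pd.DataFrame / scipy sparse matrices)
test_size and train_size are complementary: when only one is given as a fraction, the other defaults to its complement so that the two sum to 1. If both are left as None, test_size falls back to 0.25.
shuffle: whether to shuffle the data before splitting. stratify: whether to split by stratified sampling; pass the class labels to stratify on. (If shuffle=False then stratify must be None.)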
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=666,shuffle=True)
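# Parameters:
# *arrays: the arrays to split, here the features X and the target y
# (the target is passed as one of *arrays; there is no separate "target" parameter)
# test_size: fraction of the whole dataset assigned to the test set
# train_size: fraction assigned to the training set; defaults to 1 - test_size when unset
# random_state: random seed; fix it to get a reproducible split
# shuffle: whether to shuffle the data before splitting
# Returns X_train, X_test, y_train, y_test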
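import numpy as np
# Toy data: 5 samples with 2 features each, plus 5 targets
x = np.arange(10).reshape([5, 2])
y = np.arange(5)
# test_size=0.3 of 5 samples rounds up to 2 test samples, leaving 3 for training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
print(x_train)
print(y_train)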
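The stratify option described above is easiest to see on imbalanced labels. The following is a minimal sketch, assuming a made-up toy dataset (X_demo and y_demo are hypothetical names, not part of the original notes) with 8 samples of class 0 and 4 of class 1:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 8 samples of class 0, 4 samples of class 1
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0] * 8 + [1] * 4)

# Passing the labels to stratify keeps the 2:1 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(np.bincount(y_tr))  # class counts in the training set
print(np.bincount(y_te))  # class counts in the test set
```

Without stratify, a small test set like this one could easily end up with no samples of the minority class at all.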
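Cross-validation
cross_val_score
Runs cross-validation on the dataset for the specified number of folds and returns one score per fold.
By default scoring=None, so each fold is evaluated with the estimator's own score method (accuracy for classifiers, R² for regressors); scoring='f1_macro' is one alternative. Further predefined scorers are available for classification, clustering, and regression.
The metric is chosen through the scoring parameter of cross_val_score, either as a scorer name string or as a callable built from sklearn.metrics (from sklearn import metrics), for example with metrics.make_scorer.
When cv is an int, StratifiedKFold is used if the estimator is a classifier and the target is binary or multiclass, and KFold otherwise; neither shuffles the data by default.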
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split
datas = datasets.load_iris()
print(datas.keys())
x_train, x_test, y_train, y_test = train_test_split(
datas['data'], datas['target'], test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
print(clf.score(x_test, y_test))
# 5-fold cross-validation
scores = cross_val_score(clf, datas['data'], datas['target'], cv=5)
print(scores.mean())
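The scoring and cv parameters of cross_val_score can be illustrated as follows. This is a minimal sketch rather than the only way to set them; the variable names (iris, model, splitter) and the choice of random_state=0 are assumptions made for the example:

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
model = svm.SVC(kernel="linear", C=1)

# Integer cv with a classifier: unshuffled StratifiedKFold; metric chosen by name
f1_scores = cross_val_score(model, iris.data, iris.target, cv=5, scoring="f1_macro")

# Explicit splitter object: shuffling has to be requested
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc_scores = cross_val_score(model, iris.data, iris.target, cv=splitter)

print(f1_scores.mean())   # mean macro-averaged F1 over the 5 folds
print(acc_scores.mean())  # mean accuracy (the default SVC score) over the 5 folds
```

Passing a splitter object instead of an int is the usual route when the rows are ordered by class or by time and the default unshuffled folds would be unrepresentative.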
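3. cross_val_predict
cross_val_predict closely resembles cross_val_score, but instead of scores it returns the estimator's predictions (class labels or regression values), each made while the sample was in the held-out fold. This matters for later model improvement: comparing the predictions with the true targets pinpoints exactly where the model goes wrong, which helps with both parameter tuning and troubleshooting.
It returns the predictions: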
from sklearn import datasets, metrics, svm
from sklearn.model_selection import cross_val_predict, train_test_split
datas = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(datas["data"], datas['target'], test_size=0.3)
clf = svm.SVC(kernel='linear', C=2).fit(x_train, y_train)
print(clf.score(x_test, y_test))
# One out-of-fold prediction per sample, from 10-fold cross-validation
predicteds = cross_val_predict(clf, datas["data"], datas["target"], cv=10)
print(predicteds)
print(metrics.accuracy_score(datas['target'], predicteds))
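To locate the wrong predictions mentioned above, the output of cross_val_predict can be compared with the true targets element by element. The following is one possible sketch; the names y_pred and wrong_idx are hypothetical, and the confusion matrix is an addition that the original notes do not use:

```python
import numpy as np
from sklearn import datasets, metrics, svm
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
model = svm.SVC(kernel="linear", C=2)

# One out-of-fold prediction for every sample in the dataset
y_pred = cross_val_predict(model, iris.data, iris.target, cv=10)

# Indices of the samples whose prediction differs from the true label
wrong_idx = np.flatnonzero(y_pred != iris.target)
print(wrong_idx)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(iris.target, y_pred)
print(cm)
```

The indices in wrong_idx can then be used to look up the corresponding rows of iris.data and inspect which samples the model confuses.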