Stratified sampling in sklearn
Purpose
Compare sklearn's StratifiedShuffleSplit with train_test_split and check whether the two produce different splits.
StratifiedShuffleSplit
# Stratified sampling: train/test split
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=randoms)  # randoms: seed defined elsewhere
X = X_new3.copy()
y = y_df.copy()
for train_index, test_index in sss.split(X, y):  # the loop runs n_splits times (once here, since n_splits=1)
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index, :], y.iloc[test_index, :]
# Stratified sampling: train/validation split
# test_size=0.25 of the remaining 80% leaves 60% train, 20% validation, 20% test overall
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=randoms)
X = X_train.copy()
y = y_train.copy()
for train_index, test_index in sss.split(X, y):  # the loop runs n_splits times (once here, since n_splits=1)
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_validate = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_validate = y.iloc[train_index, :], y.iloc[test_index, :]
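To sanity-check that the two-stage split above keeps the class ratio in every subset, something like the following can be run. This is only a sketch on made-up data: X_demo, y_demo, the helper stratified_split and the seed 0 are hypothetical stand-ins, not the X_new3, y_df and randoms used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in data: 100 rows, 80 of class 0 and 20 of class 1.
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.DataFrame({"label": [0] * 80 + [1] * 20})


def stratified_split(X, y, test_size, seed):
    """Return (X_a, X_b, y_a, y_b) from a single stratified shuffle split."""
    sss = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    a_idx, b_idx = next(sss.split(X, y))  # n_splits=1, so one pair of index arrays
    return X.iloc[a_idx], X.iloc[b_idx], y.iloc[a_idx], y.iloc[b_idx]


# First 80/20 (rest vs. test), then 75/25 of the rest (train vs. validate).
X_rest, X_te, y_rest, y_te = stratified_split(X_demo, y_demo, 0.2, seed=0)
X_tr, X_va, y_tr, y_va = stratified_split(X_rest, y_rest, 0.25, seed=0)

split_sizes = {"train": len(y_tr), "validate": len(y_va), "test": len(y_te)}
positive_rates = {
    "train": y_tr["label"].mean(),
    "validate": y_va["label"].mean(),
    "test": y_te["label"].mean(),
}
print(split_sizes)
print(positive_rates)
```

On this toy data the class counts divide evenly, so each subset should keep the 20% share of class 1. With real data the shares will only be approximately equal.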
train_test_split
from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_new3, y_df, test_size=0.2, random_state=8, stratify=y_df)
Intersection of the two splits
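The two methods produce exactly the same samples, provided they are given the same test_size and random_state (here that means randoms must equal 8). This is expected: when stratify is passed, train_test_split performs a single StratifiedShuffleSplit internally.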
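One way to check the overlap is to compare the row indices that each method selects. The snippet below is a minimal sketch on synthetic data: X_demo, y_demo and seed are hypothetical stand-ins for X_new3, y_df and the shared random state.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical stand-in data: 200 rows, 150 of class 0 and 50 of class 1.
X_demo = pd.DataFrame({"feature": np.arange(200)})
y_demo = pd.DataFrame({"label": [0] * 150 + [1] * 50})
seed = 8  # both methods must receive the same seed

# Split A: StratifiedShuffleSplit used directly.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
train_index, test_index = next(sss.split(X_demo, y_demo))
X_train_a, X_test_a = X_demo.iloc[train_index], X_demo.iloc[test_index]

# Split B: train_test_split with stratification on the same labels.
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
)

# Compare the selected rows through their index labels.
shared_test = X_test_a.index.intersection(X_test_b.index)
same_test_rows = len(shared_test) == len(X_test_a) == len(X_test_b)
same_train_rows = set(X_train_a.index) == set(X_train_b.index)
same_order = (
    X_test_a.index.equals(X_test_b.index)
    and X_train_a.index.equals(X_train_b.index)
)
print(same_test_rows, same_train_rows, same_order)
```

If the seeds or test sizes differ, the intersection will generally be only partial, so len(shared_test) is a quick way to see how far two splits diverge.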