I. K-Fold Cross-Validation
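- Principle
The dataset is split into k equal-sized folds. Each fold is used once as the test set while the model is trained on the remaining k-1 folds, and the k test scores are then averaged.
- Implementation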
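Data:

| User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|
| 15624510 | Male | 19.0 | 19000.0 | 0 |
| 15810944 | Male | 35.0 | 20000.0 | 0 |
| 15668575 | Female | 26.0 | 43000.0 | 0 |
| 15603246 | Female | 27.0 | 57000.0 | 0 |
| ... | | | | |

Each row holds one user's information and records whether that user clicked on the advertisement shown to them (the Purchased column).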
Code:

```python
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
import pandas as pd

# Import the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values  # Age, EstimatedSalary
y = dataset.iloc[:, 4].values       # Purchased

# Split the dataset into the training set and the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit a kernel SVM to the training set
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

# Build the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Apply k-fold cross-validation
# cv: the number of folds the training data is split into
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
print(accuracies.mean())
print(accuracies.std())
```
Result:
accuracies.mean(): 0.9005302187615868
accuracies.std(): 0.06388957356626285
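
To make the splitting step concrete, here is a minimal pure-Python sketch of what a k-fold procedure does with the sample indices. It is not scikit-learn's implementation: `k_fold_indices`, `cross_validate`, and the `score_fold` callback are hypothetical names introduced only for this illustration, and the folds are contiguous and unshuffled.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one more sample so every sample is used.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(score_fold, n_samples, k):
    """Score every fold with the (hypothetical) callback and summarize.

    score_fold(train_idx, test_idx) should train on train_idx and return
    the score measured on test_idx.
    """
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    # pstdev is the population standard deviation, like NumPy's default std().
    return mean(scores), pstdev(scores)


# 10 samples in 3 folds: every sample lands in exactly one test fold.
fold_sizes = [len(test) for _, test in k_fold_indices(10, 3)]
print(fold_sizes)  # [4, 3, 3]
```

The mean summarizes the expected accuracy, while the standard deviation shows how much that accuracy varies from fold to fold.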
II. Grid Search
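- Principle
When a model class takes many parameters, changing them by hand to find the best combination is tedious. Grid search automates this: it evaluates every combination from a predefined set of candidate values and returns the one with the best score, which is far more efficient than manual tuning.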
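
The idea can be sketched in a few lines of pure Python: expand the grid into every parameter combination, score each one, and keep the best. This is only an illustration under stated assumptions, not how scikit-learn implements it. `expand_grid`, `exhaustive_search`, `toy_score`, and `toy_grid` are hypothetical names, and the toy score stands in for a real cross-validated accuracy.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield every combination from a list of {name: [candidate values]} dicts."""
    for grid in param_grid:
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))


def exhaustive_search(param_grid, score):
    """Return (best_params, best_score) over all combinations in the grid."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(param_grid):
        current = score(params)
        if current > best_score:  # ties keep the earlier combination
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that peaks at C=2, gamma=0.5 (stand-in for CV accuracy)."""
    return -abs(params["C"] - 2) - abs(params["gamma"] - 0.5)


toy_grid = [{"C": [1, 2, 3], "gamma": [0.1, 0.5, 0.9]}]
best_params, best_score = exhaustive_search(toy_grid, toy_score)
print(best_params)  # {'C': 2, 'gamma': 0.5}
```

In a real search the score of each combination would itself be the mean of a k-fold cross-validation, which is why the cost grows quickly with the size of the grid.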
- Implementation
Data:
The data is the same as in the k-fold cross-validation section above.
Code:

```python
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
import pandas as pd

# Import the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values  # Age, EstimatedSalary
y = dataset.iloc[:, 4].values       # Purchased

# Split the dataset into the training set and the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit a kernel SVM to the training set
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

# Build the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Apply k-fold cross-validation
# cv: the number of folds the training data is split into
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
print(accuracies.mean())
print(accuracies.std())

# Grid search
parameters = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 2, 3, 4], 'kernel': ['rbf'],
     'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1]},
]
grid_search = GridSearchCV(
    estimator=classifier,
    param_grid=parameters,
    scoring="accuracy",  # metric used to score and rank each parameter combination
    cv=10                # number of cross-validation folds
    # n_jobs=-1
)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_    # mean cross-validated score of the best combination
best_parameter = grid_search.best_params_  # the parameter combination that achieved it
print(best_accuracy)
print(best_parameter)
```
Output:
best_accuracy: 0.9033333333333333
best_parameter: {'C': 1, 'gamma': 0.7, 'kernel': 'rbf'}
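
As a rough guide to what this search costs, the short sketch below counts the candidate combinations in the grid used above and the resulting number of model fits. The variable names `n_candidates` and `n_fits` are introduced here for illustration only.

```python
from math import prod

# The same grid and fold count as in the script above.
parameters = [
    {"C": [1, 10, 100, 1000], "kernel": ["linear"]},
    {"C": [1, 2, 3, 4], "kernel": ["rbf"],
     "gamma": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1]},
]
cv = 10

# Each dict contributes the product of its candidate-list lengths.
n_candidates = sum(prod(len(values) for values in grid.values()) for grid in parameters)
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 40 400
```

So the search trains 400 models during cross-validation, plus one final refit of the best combination on the full training set, since `refit` defaults to `True` in `GridSearchCV`. This is where uncommenting `n_jobs=-1` to run the fits in parallel can help.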