Table of Contents
1.1 Linear Regression (Classification to Binomial Categories)
1.2 Evaluate - Confusion Matrices
2. Classification & Regression Trees (CART) [S]
4. Rock & Mine Example - Classification
5. Wine Quality Example - Regression
1. Regression [S]
1.1 Linear Regression (Classification to Binomial Categories)
1. Read data
import pandas as pd
from pandas import DataFrame
url="https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url,header=None)
df.describe()
2. Train & Fit & Predict
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame
%matplotlib inline
# convert the R/M labels to 0/1
df[60]=np.where(df[60]=='R',0,1)
# split into training and test sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.3) # a random 30% of df as test, 70% as train
x_train = train.iloc[0:,0:60]
y_train = train[60] # column 60, i.e. the R/M label column
x_test = test.iloc[0:,0:60]
y_test = test[60]
# build and fit the model
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(x_train,y_train)
# predict on the test set
testing_predictions = model.predict(x_test)
# set a classification threshold
def get_classification(predictions, threshold):
    classes = np.zeros_like(predictions)
    for i in range(len(classes)):
        if predictions[i] > threshold:
            classes[i] = 1
    return classes

get_classification(testing_predictions, 0.5)
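The loop above can also be written as a single vectorized expression. A minimal sketch, assuming the same continuous predictions; the name `get_classification_vectorized` is my own:

```python
import numpy as np

def get_classification_vectorized(predictions, threshold):
    # boolean mask (prediction strictly above threshold) converted to 0/1
    return (np.asarray(predictions) > threshold).astype(float)

# toy check, mirroring the loop-based logic
preds = np.array([0.2, 0.7, 0.5, 0.9])
print(get_classification_vectorized(preds, 0.5))  # [0. 1. 0. 1.]
```

Note that a prediction exactly equal to the threshold maps to 0, matching the strict `>` in the loop version.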
1.2 Evaluate - Confusion Matrices
from sklearn.metrics import confusion_matrix
# args: true test labels, predicted test labels; returns matrix [[tn, fp], [fn, tp]]
confusion_matrix(y_test,get_classification(testing_predictions,0.5))
# unpack the individual counts
tn, fp, fn, tp = confusion_matrix(y_test,get_classification(testing_predictions,0.5)).ravel()
- True Positive Rate / Sensitivity / Recall
How well we find positives: the fraction of actual positives that were predicted positive. A value of 1 means every positive case was found.
tpr = tp/(tp+fn)
- Precision
How well we discriminate positives: the fraction of predicted positives that are actually positive. A value of 1 means everything we predict as positive really is positive.
precision = tp/(tp+fp)
- F-Score
Combines precision and recall into a single score (their harmonic mean).
f = 2*precision*tpr/(precision+tpr)
- True Negative Rate / Specificity
How well we identify negatives: the fraction of actual negatives that were predicted negative.
tnr = tn/(tn+fp)
- False Positive Rate / Fall-out
Sums with the True Negative Rate to 1.
fpr = fp/(fp+tn)
- Accuracy
The fraction of all cases, positive and negative, that were classified correctly; a value of 1 means every prediction was right.
accuracy = (tp+tn)/(tp+tn+fp+fn)
- Misclassification Rate
Sums with Accuracy to 1.
misclassification_rate = (fp+fn)/(tp+fp+tn+fn)
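All of these metrics fall out of the four counts returned by `.ravel()`. A self-contained sketch using hypothetical counts (not the sonar results), which also checks the complementary pairs noted above:

```python
# hypothetical confusion-matrix counts, for illustration only
tn, fp, fn, tp = 40, 10, 5, 45

tpr = tp / (tp + fn)                 # recall
precision = tp / (tp + fp)
f1 = 2 * precision * tpr / (precision + tpr)
tnr = tn / (tn + fp)
fpr = fp / (fp + tn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
misclassification_rate = (fp + fn) / (tp + fp + tn + fn)

# sanity checks: the complementary pairs each sum to 1
assert abs(tnr + fpr - 1) < 1e-12
assert abs(accuracy + misclassification_rate - 1) < 1e-12
print(round(tpr, 2), round(precision, 2), round(f1, 2))  # 0.9 0.82 0.86
```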
- ROC Curve
from sklearn.metrics import roc_curve, auc
testing_predictions = model.predict(x_test)
(fpr, tpr, thresholds) = roc_curve(y_test,testing_predictions)
area = auc(fpr,tpr)
plt.clf() #Clear the current figure
plt.plot(fpr,tpr,label="Out-Sample ROC Curve with area = %1.2f"%area)
plt.plot([0, 1], [0, 1], 'k')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Out sample ROC rocks versus mines')
plt.legend(loc="lower right")
plt.show()
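Before plotting on real predictions, `roc_curve` and `auc` can be sanity-checked on a toy input. This sketch uses hypothetical labels and scores, not the sonar predictions above:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# hypothetical labels and continuous scores, just to illustrate the API
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# fpr and tpr trace one point per distinct threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(auc(fpr, tpr))  # 0.75
```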
- Precision vs. Recall
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score
precision, recall, thresholds = precision_recall_curve(y_test, testing_predictions)
average_precision = average_precision_score(y_test, testing_predictions)
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', step='post')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve: AP=%1.2f' % average_precision)
plt.show()
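As with the ROC section, `precision_recall_curve` and `average_precision_score` can be sanity-checked on a toy input first. Hypothetical labels and scores, not the sonar predictions:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# hypothetical labels and scores, just to illustrate the API
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# precision/recall have one more entry than thresholds (the final (1, 0) point)
precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(round(average_precision_score(y_true, scores), 2))  # 0.83
```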