Machine Learning Introductory Notes (Part 1)
Basic Categories
- Supervised: both input and output (label) data are used to train the model.
- Unsupervised: only input data is used; there are no output labels.
Example Algorithms
- Regression: finds the mathematical relationship between input-output pairs and uses that fitted relationship to predict output from input; supervised.
- Classification and regression trees (CART): split the inputs into different groups and observe the resulting outputs, thereby exploring the relationship between them; supervised.
- Clustering: partitions the inputs as a whole, grouping similar inputs into clusters; unsupervised.
Regression
- Linear: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$
  We want the line that best represents the relationship between the x's and y, so we look for the coefficients that minimize the error (the deviation of the predictions). Common error measures: $\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ (smaller is better) and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ (closer to 1 is better).
- Logistic: used when the dependent variable is categorical (0/1):
  if $\alpha + \beta_1 x_1 + \epsilon > 0$, predict $y = 1$;
  else predict $y = 0$.
  (Equivalently, predict $y = 1$ when the logistic probability $1 / (1 + e^{-(\alpha + \beta_1 x_1)})$ exceeds 0.5.)
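To make these definitions concrete, here is a minimal sketch on synthetic data (all names and numbers are illustrative, not from the notes) that fits a one-variable line by least squares and computes MSE and $R^2$ as defined above:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)   #y = alpha + beta*x + noise
beta, alpha = np.polyfit(x, y, 1)           #least-squares slope and intercept
y_hat = alpha + beta * x
mse = np.mean((y - y_hat) ** 2)                                    #smaller is better
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)  #closer to 1 is better
print("alpha=%.2f beta=%.2f MSE=%.2f R^2=%.2f" % (alpha, beta, mse, r2))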
Example
Distinguishing Rocks vs. Mines with linear regression
For illustration only and not recommended in practice: classification problems are better suited to logistic regression.
step 1: Load the data
The first 60 columns are the independent variables; column 61 is the dependent variable, indicating whether the sample is a Rock or a Mine.
import pandas as pd
import numpy as np  #needed below for np.where and np.zeros_like
url="https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url,header=None)
df
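As a quick sanity check, the shape can be inspected (this sonar dataset is expected to have 208 rows and 61 columns: 60 features plus the label):

df.shape  #expect (208, 61)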
We can first look at the correlations between the variables.
import matplotlib.pyplot as plt  #the plotting code further down relies on the plt alias
plt.pcolor(df.corr(), cmap='coolwarm')  #colormap reference: https://matplotlib.org/examples/color/colormaps_reference.html
plt.show()
The correlations here are fairly uniform. The red diagonal through the middle is each variable's correlation with itself, which is always 1, as expected. The plot shows that the first five and the last ten independent variables have a slightly higher correlation with the dependent variable in column 60. (Note: df.corr() only includes numeric columns, so the R/M labels must be converted to numbers, as in the next step, before column 60 appears in the heatmap.)
To run the regression, the letters R and M must be converted to numbers:
df[60] = np.where(df[60]=='R', 0, 1)  #R becomes 0, M (Mine) becomes 1
step 2: Build the train/test sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.3)
#Hold out 30% of the data for testing: the rows are split 7:3 first, then each part is separated into x and y (train_test_split accepts random_state for a reproducible split)
x_train = train.iloc[0:,0:60]
y_train = train[60]
x_test = test.iloc[0:,0:60]
y_test = test[60]
step 3: Fit the model
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(x_train,y_train)
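Optionally, the fitted parameters can be inspected; they correspond to $\alpha$ and the $\beta_i$ in the linear formula above:

model.intercept_  #the fitted alpha
model.coef_       #the 60 fitted beta coefficients, one per input column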
step 4: Make predictions
testing_predictions = model.predict(x_test)
#Returns an array of predicted values (continuous, since this is linear regression)
def get_classification(predictions, threshold):
    #Convert continuous regression outputs to 0/1 classes at the given threshold
    classes = np.zeros_like(predictions)
    for i in range(len(classes)):
        if predictions[i] > threshold:
            classes[i] = 1
    return classes
get_classification(testing_predictions,0.5)
#Predictions above 0.5 are treated as 1, otherwise 0
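As a side note, the same thresholding can be written in one vectorized line, equivalent to the loop above:

np.where(testing_predictions > 0.5, 1, 0)  #1 where the prediction exceeds 0.5, else 0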
step 5: Evaluate the model
Confusion matrix
| | Predicted − | Predicted + |
|---|---|---|
| Actual − | tn: true negative | fp: false positive |
| Actual + | fn: false negative | tp: true positive |
Metrics:
- true positive rate (sensitivity/recall) = tpr = tp / ( tp + fn )
- true negative rate (specificity) = tnr = tn / ( tn + fp )
- precision = tp / ( tp + fp ): of all cases predicted positive, the proportion that are actually positive
Q: How do we weigh the trade-off between the different metrics?
- F-score: $F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
- Accuracy = ( tp + tn ) / ( tp + tn + fp + fn ): the proportion of all predictions that are correct
- Misclassification rate = (fp + fn ) / ( tp + tn + fp + fn )
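A quick worked example with made-up counts, to make the formulas concrete (tp = 40, tn = 30, fp = 10, fn = 20; 100 samples in total):
- precision = 40 / (40 + 10) = 0.80
- recall = tpr = 40 / (40 + 20) ≈ 0.67
- tnr = 30 / (30 + 10) = 0.75
- F = 2(0.80 × 0.67) / (0.80 + 0.67) ≈ 0.73
- Accuracy = (40 + 30) / 100 = 0.70, Misclassification rate = 0.30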
from sklearn.metrics import confusion_matrix

def c_m_analysis(true, pred, threshold):
    #Report the main confusion-matrix metrics at a given threshold
    tn, fp, fn, tp = confusion_matrix(true, get_classification(pred, threshold)).ravel()
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)  #recall and tpr are the same quantity
    tpr = recall
    fpr = fp/(fp+tn)
    f_score = 2*precision*recall/(precision+recall)
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    print("Precision:\t\t\t%1.2f of samples identified as mines are mines"%(precision))
    print("Recall/TPR:\t\t\t%1.2f proportion of actual mines identified"%(recall))
    print("False Positive Rate:\t\t%1.2f proportion of rocks identified as mines"%fpr)
    print("f-score:\t\t\t%1.2f tradeoff between precision and recall"%(f_score))
    print("Accuracy:\t\t\t%1.2f how well the model has classified"%(accuracy))

c_m_analysis(y_test, testing_predictions, 0.5)
ROC (Receiver Operating Characteristic)
Plots tpr (vertical axis) against fpr (horizontal axis).
A good ROC curve bows toward the upper-left corner: the larger the area between the curve and the y = x diagonal (equivalently, the larger the area under the curve, AUC), the better the model.
from sklearn.metrics import roc_curve, auc
(fpr, tpr, thresholds) = roc_curve(y_test,testing_predictions)
area = auc(fpr,tpr)
plt.clf() #Clear the current figure
plt.plot(fpr,tpr,label="Out-Sample ROC Curve with area = %1.2f"%area)
plt.plot([0, 1], [0, 1], 'k')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Out sample ROC rocks versus mines')
plt.legend(loc="lower right")
plt.show()
Precision & Recall
We want both metrics to be as close to 1 as possible,
so the larger the area between the precision-recall curve and the x-axis, the better.
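No precision-recall plot appears above yet; here is a minimal sketch using sklearn's precision_recall_curve, reusing the same predictions (labels and title are illustrative):

from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, testing_predictions)
plt.clf()
plt.plot(recall, precision, label="Out-Sample Precision-Recall Curve")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Out sample precision versus recall, rocks versus mines')
plt.legend(loc="lower left")
plt.show()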
Both curves above also show how sensitive these metrics are to the threshold.
How do we decide the threshold value?
Based on the real-world costs involved: choose the threshold that minimizes total cost.
For example:
- Everything classified as a rock needs to be checked with a hand scanner at $200/scan.
- Everything classified as a mine needs to be defused at $1000 if it is a real mine or $300 if it turns out to be a rock.
#Compare the three costs: which threshold value minimizes the total cost?
tn, fp, fn, tp = confusion_matrix(y_test,get_classification(testing_predictions,.1)).ravel()
cost1 = (tn+fn) * 200 + 1000 * tp + 300 * fp
tn, fp, fn, tp = confusion_matrix(y_test,get_classification(testing_predictions,.5)).ravel()
cost2 = (tn+fn) * 200 + 1000 * tp + 300 * fp
tn, fp, fn, tp = confusion_matrix(y_test,get_classification(testing_predictions,.9)).ravel()
cost3 = (tn+fn) * 200 + 1000 * tp + 300 * fp
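The three costs above are computed but never printed or compared. A sketch (not in the original notes) that sweeps a grid of thresholds and reports the cheapest; passing labels=[0,1] keeps the matrix 2x2 even if some threshold predicts only one class:

costs = {}
for t in np.arange(0.05, 1.0, 0.05):
    tn, fp, fn, tp = confusion_matrix(y_test, get_classification(testing_predictions, t), labels=[0, 1]).ravel()
    costs[round(t, 2)] = (tn + fn) * 200 + 1000 * tp + 300 * fp  #same cost model as above
best = min(costs, key=costs.get)
print("Cheapest threshold: %.2f at $%d" % (best, costs[best]))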