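This is a translation of Day 7 through Day 11 of the 100 Days of ML Code challenge. I may pause updates for a few days after this post, because on several days the original author wrote only a single sentence, and I also want to understand the theory in more depth. I have not yet reached the corresponding part of Professor Hung-yi Lee's course either. I have accumulated too much froth, so it is time to let things settle. I will be back soon. Watching videos at 2.0x speed will not make you miss anything; it only makes you concentrate harder.
The original project is here, and the author did a very good job.
Please credit the source when reposting.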
Day 7: K-Nearest Neighbors
What is k-NN
The K-Nearest Neighbors (k-NN) algorithm is a simple yet widely used classification algorithm.
It can also be used for regression.
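As a rough illustration of the regression case, the sketch below predicts a number by averaging the targets of the k nearest stored points. The function name `knn_regress` and the toy data are made up for this example and are not from the original project; averaging is one common choice, not the only one.

```python
import math

def knn_regress(train_points, train_targets, query, k):
    """Predict a numeric target as the mean target of the k nearest training points."""
    # Pair each training target with the distance from its point to the query.
    distances = [
        (math.dist(point, query), target)
        for point, target in zip(train_points, train_targets)
    ]
    # Sort by distance only and keep the k closest entries.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Made-up 1-D data where the target is twice the input.
points = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
targets = [2.0, 4.0, 6.0, 8.0, 10.0]

# The two nearest points to 2.5 are 2.0 and 3.0, so the targets 4.0 and 6.0 are averaged.
prediction = knn_regress(points, targets, (2.5,), k=2)
print(prediction)
```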
k-NN is non-parametric (it makes no assumptions about the underlying data distribution), instance-based (the algorithm does not explicitly learn a model; instead, it memorizes the training instances), and used in a supervised learning setting.
We want to classify the grey point into one of three classes: light green, green, or red.
Start by calculating the distance between the grey point and every labeled point, then pick the K nearest ones.
Making predictions
To classify an unlabeled object, the distance from this object to each labeled object is computed, its K nearest neighbors are identified, and the class label held by the majority of those neighbors is assigned to the object.
For real-valued input variables, the most popular distance measure is Euclidean distance.
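The prediction procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the original author's code: the function name `knn_classify` and the coordinates are invented, and only the three class names come from the text.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    # Rank every labeled point by its Euclidean distance to the query.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Count the labels of the k nearest points and take the most common one.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up coordinates for the three classes.
train_points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),   # light green
    (5.0, 5.0), (6.0, 5.0), (5.0, 6.0),   # green
    (9.0, 1.0), (9.0, 2.0), (8.0, 1.0),   # red
]
train_labels = ["light green"] * 3 + ["green"] * 3 + ["red"] * 3

grey_point = (5.5, 4.5)
predicted = knn_classify(train_points, train_labels, grey_point, k=3)
print(predicted)
```

Sorting every point is the simplest way to find the neighbors; it is fine for small data sets, though libraries typically use faster search structures.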
Value of k
Finding the right value of k is not easy.
A small value of k means that noise will have a higher influence on the result, while a large value makes prediction more computationally expensive and can blur the boundaries between classes.
It depends a lot on your individual case; sometimes it is best to run through each candidate value of k and decide for yourself.
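One way to "run through" candidate values is to score each k on held-out data and compare. The sketch below uses leave-one-out accuracy on a tiny made-up data set; the scoring scheme, the helper names, and the data are all assumptions made for illustration, not something the original text prescribes.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Majority vote among the k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    """Classify each point using all the other points and report the hit rate."""
    correct = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_classify(rest_points, rest_labels, points[i], k) == labels[i]:
            correct += 1
    return correct / len(points)

# Two made-up clusters; the last "a" point is deliberate noise inside the "b" cluster.
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (6.0, 6.0),
    (6.0, 5.0), (5.0, 6.0), (7.0, 6.0), (6.0, 7.0), (5.0, 5.0),
]
labels = ["a"] * 5 + ["b"] * 5

# Score a few odd values of k and keep the one with the highest accuracy.
scores = {k: leave_one_out_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, score)
```

Odd values of k are used here only to reduce the chance of tied votes between two classes.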
k-NN is also called a lazy algorithm because it is instance-based: it defers all the work until a prediction is requested.
How does the k-NN algorithm work
When k-NN is used for classification, the output is a class membership (it predicts a class, which is a discrete value).
There are three key elements of this approach: a set of labeled objects (e.g., a set of stored records), a distance measure between objects, and the value of k, the number of nearest neighbors.
The distance
Euclidean distance is calculated as the square root of the sum of the squared differences between a new point and an existing point across all input attributes.
Other popular distance measures include:
Hamming distance
Manhattan distance
Minkowski distance
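For reference, the four distance measures named above can be written out directly. This is a plain-Python sketch with invented function names. Note that Minkowski distance with p = 1 is Manhattan distance and with p = 2 is Euclidean distance, which is why a k-NN classifier configured with the Minkowski metric and p = 2 behaves as a Euclidean one.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Generalization: p=1 gives Manhattan distance, p=2 gives Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

origin, point = (0.0, 0.0), (3.0, 4.0)
print(euclidean(origin, point))      # 3-4-5 right triangle
print(manhattan(origin, point))      # 3 + 4
print(minkowski(origin, point, 2))   # same as the Euclidean distance
print(hamming("karolin", "kathrin")) # counts differing characters
```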
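Day 8: The mathematics of logistic regression
/* Translator's note:
Today the author only gave a URL and said he had dug into the mathematics behind logistic regression. However, the site would not open for me, thanks to China's great project, the Golden Shield Project. This is why I am pausing updates. I want to learn the theory too, but I could not read the article, so I can only watch videos. */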
#100DaysOfMLCode To sharpen my understanding of logistic regression I searched the internet for a resource or article and came across this article (https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc) by Saishruthi Swaminathan.
It gives a detailed description of logistic regression. Do check it out.
Day 9: Support Vector Machines
Got an intuition for what an SVM is and how it is used to solve classification problems.
/* Again just one sentence, so studying on your own really matters. */
Day 10: SVM and KNN
Learned more about how SVM works and implemented the k-NN algorithm.
/* You wrote just one sentence; how much am I supposed to read? */
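Day 11: Implementing KNN
/* The dataset is the same one used yesterday, so I will not go over it again; see the previous post for details. Remember: the dataset must be placed in the same directory as the KNN implementation code. */
Code
Importing the libraries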
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
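Feature scaling matters for k-NN because a feature with a large numeric range would otherwise dominate the distance. The sketch below reproduces by hand what standardization does to a single column: the mean and (population) standard deviation are computed from the training data only and then reused on the test data. The helper names and the numbers are made up for illustration.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(column), statistics.pstdev(column)

def apply_scaler(column, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(value - mean) / std for value in column]

# Made-up single-feature columns.
train_column = [20.0, 30.0, 40.0, 50.0]
test_column = [35.0, 60.0]

# Fit on the training column only, then apply the same statistics to both sets.
mean, std = fit_scaler(train_column)
train_scaled = apply_scaler(train_column, mean, std)
test_scaled = apply_scaler(test_column, mean, std)

print(mean, std)
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test set mirrors the `fit_transform` / `transform` split in the listing and keeps information about the test data from leaking into the preprocessing step.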
Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
Predicting the Test set results
y_pred = classifier.predict(X_test)
Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Miscellaneous notes
As before, the code above does not display any result on its own, so add one line:
print(cm)
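To make the printed matrix easier to read, here is a small hand-rolled sketch of how such a matrix is built and how accuracy falls out of it. It follows the usual convention that rows are true classes and columns are predicted classes; the function name and the label lists are invented and are not the real test-set results.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a matrix whose entry [i][j] counts true class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

# Made-up labels, not the output of the classifier above.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

matrix = confusion_counts(y_true, y_pred, labels=[0, 1])

# Correct predictions sit on the diagonal, so accuracy is diagonal sum / total.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(y_true)
print(matrix)
print(accuracy)
```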
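If you have questions about the output, see the previous post. I have been writing for a long time, it is mealtime and I am starving, so I will stop rambling and go eat.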