K-Nearest Neighbors
- Data with n features can be conceptualized as points in n-dimensional space.
- A distance formula can be used to compare data points; similar data points are a small distance apart.
- A point with an unknown class can be classified by finding its k nearest neighbors.
- To validate the classifier's effectiveness, the data with known classes can be split into a training set and a validation set; the validation error can then be computed.
- A classifier has parameters that can be tuned to improve its effectiveness. In the case of K-Nearest Neighbors, k can be varied.
- A classifier can be trained improperly and suffer from overfitting or underfitting. In the case of K-Nearest Neighbors, a low k often leads to overfitting, while a large k often leads to underfitting.
- Python's sklearn library can be used for many classification and machine-learning algorithms.
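The distance formula mentioned above is typically the Euclidean distance. A minimal sketch (the helper name `euclidean_distance` is illustrative):

```python
def euclidean_distance(p1, p2):
    # Sum the squared differences across every feature, then take the square root
    squared_difference = 0
    for a, b in zip(p1, p2):
        squared_difference += (a - b) ** 2
    return squared_difference ** 0.5

# Two 2-feature points that are 5 apart (a 3-4-5 right triangle)
print(euclidean_distance([0, 0], [3, 4]))  # -> 5.0
```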
The Scikit-Learn Library
classifier = KNeighborsClassifier(n_neighbors = 3)  # create the classifier object
classifier.fit(training_points, training_labels)  # train it on labeled points
The arguments look roughly like this:
training_points = [
[0.5, 0.2, 0.1],
[0.9, 0.7, 0.3],
[0.4, 0.5, 0.7]
]
training_labels = [0, 1, 1]
Predictions for unknown points come from classifier.predict():
unknown_points = [
[0.2, 0.1, 0.7],
[0.4, 0.7, 0.6],
[0.5, 0.8, 0.1]
]
guesses = classifier.predict(unknown_points)
Basic Usage
from movies import movie_dataset, labels
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(movie_dataset, labels)
unknown_points = [[.45, .2, .5], [.25, .8, .9], [.1, .1, .9]]
print(classifier.predict(unknown_points))
Hands-On
Working with the breast cancer dataset
import codecademylib3_seaborn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
breast_cancer_data = load_breast_cancer()
#print(breast_cancer_data.feature_names)
#print(breast_cancer_data.target)
#print(breast_cancer_data.target_names)
training_data, validation_data, training_labels, validation_labels = train_test_split(breast_cancer_data.data, breast_cancer_data.target, test_size=0.2, random_state=20)
#print(len(training_data))
#print(len(training_labels))
accuracies = []
for k in range(1, 101):
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(training_data, training_labels)
    accuracies.append(classifier.score(validation_data, validation_labels))
k_list = range(1, 101)
plt.plot(k_list, accuracies)
plt.xlabel("k")
plt.ylabel("Validation Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
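Rather than reading the peak off the plot, the best k can also be pulled straight from the accuracies list. A sketch with made-up accuracy values:

```python
# Hypothetical validation accuracies for k = 1..5
accuracies = [0.89, 0.92, 0.96, 0.94, 0.93]
k_list = range(1, 6)

# Pair each k with its accuracy and take the k with the highest score
best_k, best_accuracy = max(zip(k_list, accuracies), key=lambda pair: pair[1])
print(best_k, best_accuracy)  # -> 3 0.96
```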
How It Works
Data normalization: rescaling features so that a feature with a vastly different scale does not dominate the other features. Without it, one especially large feature would dominate the whole algorithm.
def min_max_normalize(lst):
    minimum = min(lst)
    maximum = max(lst)
    normalized = []
    for i in range(len(lst)):
        normalized.append((lst[i] - minimum) / (maximum - minimum))
    return normalized
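Min-max normalization maps the smallest value in a list to 0 and the largest to 1. A quick self-contained example (restating the function so the sketch runs on its own; the release-date values are made up):

```python
def min_max_normalize(lst):
    # Rescale every value into [0, 1] relative to the list's min and max
    minimum = min(lst)
    maximum = max(lst)
    return [(value - minimum) / (maximum - minimum) for value in lst]

release_dates = [1980, 1990, 2000, 2020]
print(min_max_normalize(release_dates))  # -> [0.0, 0.25, 0.5, 1.0]
```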
A Classification Algorithm
from movies import movie_dataset, movie_labels
def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance
def classify(unknown, dataset, labels, k):
    distances = []
    #Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        #Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    #Taking only the k closest points
    neighbors = distances[0:k]
    num_good = 0
    num_bad = 0
    for neighbor in neighbors:
        title = neighbor[1]
        #Use the labels parameter, not the global movie_labels
        if labels[title] == 0:
            num_bad += 1
        else:
            num_good += 1
    if num_good > num_bad:
        return 1
    else:
        return 0
print(classify([.4, .2, .9], movie_dataset, movie_labels, 5))
Once the classifier works, test it on data that is not in movie_dataset, treating that data as a validation set.
Normalize the test data first, then run it through the algorithm.
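One subtlety when normalizing a new point: use the minimum and maximum of the training data, not of the new point itself, so both end up on the same scale. A minimal sketch (the helper name `normalize_point` and the feature values are invented for illustration):

```python
def normalize_point(point, training_columns):
    # Rescale each feature of the point using the training data's min and max
    # for that feature, so the point lands on the same [0, 1] scale
    normalized = []
    for value, column in zip(point, training_columns):
        minimum = min(column)
        maximum = max(column)
        normalized.append((value - minimum) / (maximum - minimum))
    return normalized

# Hypothetical training data: one list per feature (budget, runtime)
budgets = [10, 50, 100]
runtimes = [80, 100, 120]

print(normalize_point([50, 90], [budgets, runtimes]))
```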
from movies import training_set, training_labels, validation_set, validation_labels
def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance
def classify(unknown, dataset, labels, k):
    distances = []
    #Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        #Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    #Taking only the k closest points
    neighbors = distances[0:k]
    num_good = 0
    num_bad = 0
    for neighbor in neighbors:
        title = neighbor[1]
        if labels[title] == 0:
            num_bad += 1
        elif labels[title] == 1:
            num_good += 1
    if num_good > num_bad:
        return 1
    else:
        return 0
print(validation_set["Bee Movie"])
guess = classify(validation_set["Bee Movie"], training_set, training_labels, 5)
print(guess)
if guess == validation_labels["Bee Movie"]:
    print("Correct!")
else:
    print("Wrong!")
Choosing k and Measuring Accuracy
When k is small, overfitting occurs and accuracy is relatively low. On the other hand, when k becomes too large, underfitting occurs and accuracy starts to drop.
Overfitting happens when k is too small, so outliers dominate the result.
Underfitting happens when k is too big, so larger trends in the dataset aren't represented.
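The effect of k can be seen on a tiny made-up 1-D dataset containing one mislabeled outlier (all titles and values below are invented for illustration, and `knn_classify` is a compact restatement of the classify function above):

```python
def knn_classify(unknown, dataset, labels, k):
    # Sort points by distance to the unknown point and take the k closest
    distances = sorted(
        (abs(dataset[title][0] - unknown[0]), title) for title in dataset
    )
    neighbors = distances[:k]
    votes = [labels[title] for _, title in neighbors]
    # Majority vote between class 1 and class 0
    return 1 if votes.count(1) > votes.count(0) else 0

dataset = {"a": [0.0], "b": [0.2], "outlier": [0.45], "d": [1.0], "e": [1.2]}
labels = {"a": 0, "b": 0, "outlier": 1, "d": 1, "e": 1}

# With k=1 the single nearest neighbor is the outlier, so the guess flips to 1
print(knn_classify([0.4], dataset, labels, k=1))  # -> 1
# With k=3 the two nearby class-0 points outvote the outlier
print(knn_classify([0.4], dataset, labels, k=3))  # -> 0
```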
from movies import training_set, training_labels, validation_set, validation_labels
def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance
def classify(unknown, dataset, labels, k):
    distances = []
    #Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        #Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    #Taking only the k closest points
    neighbors = distances[0:k]
    num_good = 0
    num_bad = 0
    for neighbor in neighbors:
        title = neighbor[1]
        if labels[title] == 0:
            num_bad += 1
        elif labels[title] == 1:
            num_good += 1
    if num_good > num_bad:
        return 1
    else:
        return 0
def find_validation_accuracy(training_set, training_labels, validation_set, validation_labels, k):
    num_correct = 0.0
    for title in validation_set:
        guess = classify(validation_set[title], training_set, training_labels, k)
        if guess == validation_labels[title]:
            num_correct += 1
    return num_correct / len(validation_set)

print(find_validation_accuracy(training_set, training_labels, validation_set, validation_labels, 3))