[Machine Learning] K-Nearest Neighbors

K-Nearest Neighbors

  • Data with n features can be thought of as points in n-dimensional space.
  • A distance formula can be used to compare data points; similar points are a small distance apart.
  • A point with an unknown class can be classified by finding its k nearest neighbors.
  • To validate the classifier, data with known classes can be split into a training set and a validation set, and the validation error can then be computed.
  • A classifier has parameters that can be tuned to improve its performance. For K-Nearest Neighbors, k can be varied.
  • A classifier can be trained improperly and suffer from overfitting or underfitting. For K-Nearest Neighbors, a low k often leads to overfitting, while a large k often leads to underfitting.
  • Python's sklearn library provides implementations of many classification and machine learning algorithms.
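The distance mentioned above is typically the Euclidean distance: square the difference along each feature dimension, sum the squares, and take the square root. A minimal sketch in plain Python (the function name euclidean_distance is my own):

```python
def euclidean_distance(p1, p2):
  # Sum of squared differences across every feature dimension,
  # then the square root of that sum.
  squared_difference = 0
  for a, b in zip(p1, p2):
    squared_difference += (a - b) ** 2
  return squared_difference ** 0.5

print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
```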

 

The Scikit-Learn Library

classifier = KNeighborsClassifier(n_neighbors = 3)  # create the classifier object
classifier.fit(training_points, training_labels)  # fit it to the training points and labels

The arguments look roughly like this:

training_points = [
  [0.5, 0.2, 0.1],
  [0.9, 0.7, 0.3],
  [0.4, 0.5, 0.7]
]

training_labels = [0, 1, 1]

Predictions are made with predict; the unknown points have the same shape as the training points:

unknown_points = [
  [0.2, 0.1, 0.7],
  [0.4, 0.7, 0.6],
  [0.5, 0.8, 0.1]
]

guesses = classifier.predict(unknown_points)
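The snippets above can be combined into one self-contained, runnable script. Note that with only three training points and k = 3, the three nearest neighbors are always the entire training set, so every prediction is simply the overall majority label:

```python
from sklearn.neighbors import KNeighborsClassifier

training_points = [
  [0.5, 0.2, 0.1],
  [0.9, 0.7, 0.3],
  [0.4, 0.5, 0.7],
]
training_labels = [0, 1, 1]

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(training_points, training_labels)

unknown_points = [
  [0.2, 0.1, 0.7],
  [0.4, 0.7, 0.6],
  [0.5, 0.8, 0.1],
]
guesses = classifier.predict(unknown_points)
print(list(guesses))  # → [1, 1, 1], the majority label of the training set
```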

 

Basic Usage

from movies import movie_dataset, labels
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(movie_dataset, labels)
unknown_points = [[.45, .2, .5], [.25, .8, .9], [.1, .1, .9]]
print(classifier.predict(unknown_points))

 

In Practice

Working with the breast cancer dataset

import codecademylib3_seaborn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

breast_cancer_data = load_breast_cancer()
#print(breast_cancer_data.feature_names)
#print(breast_cancer_data.target)
#print(breast_cancer_data.target_names)

training_data, validation_data, training_labels, validation_labels = train_test_split(
  breast_cancer_data.data, breast_cancer_data.target, test_size=0.2, random_state=20)

#print(len(training_data))
#print(len(training_labels))

accuracies = []
for k in range(1, 101):
  classifier = KNeighborsClassifier(n_neighbors=k)
  classifier.fit(training_data, training_labels)
  accuracies.append(classifier.score(validation_data, validation_labels))
k_list = range(1, 101)
plt.plot(k_list,accuracies)
plt.xlabel("k")
plt.ylabel("Validation Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
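Rather than reading the best k off the plot, it can also be selected programmatically. A sketch using a made-up accuracies list as a stand-in for the values the loop above computes on the real breast cancer split:

```python
# Hypothetical validation accuracies for k = 1..5, standing in for the
# values computed by sweeping k on the real data.
accuracies = [0.89, 0.92, 0.95, 0.94, 0.93]
k_list = range(1, 6)

# Take the k whose validation accuracy is highest.
best_k = max(k_list, key=lambda k: accuracies[k - 1])
print(best_k)  # → 3
```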

 

How It Works

Data normalization: rescale the features so that a feature with a vastly different scale does not dominate the other features, and so a single large-valued feature does not dominate the whole algorithm.

def min_max_normalize(lst):
  minimum = min(lst)
  maximum = max(lst)
  normalized = []
  for i in range(len(lst)):
    normalized.append((lst[i] - minimum) / (maximum - minimum))
  return normalized
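A quick sanity check of min_max_normalize (the function is repeated here so the snippet runs on its own): the smallest value maps to 0, the largest to 1, and everything else falls in between.

```python
def min_max_normalize(lst):
  minimum = min(lst)
  maximum = max(lst)
  # Rescale every value into the range [0, 1].
  return [(value - minimum) / (maximum - minimum) for value in lst]

print(min_max_normalize([1, 2, 3]))  # → [0.0, 0.5, 1.0]
```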

A hand-written classification algorithm:

from movies import movie_dataset, movie_labels

def distance(movie1, movie2):
  squared_difference = 0
  for i in range(len(movie1)):
    squared_difference += (movie1[i] - movie2[i]) ** 2
  final_distance = squared_difference ** 0.5
  return final_distance

def classify(unknown, dataset, labels, k):
  distances = []
  num_good = 0
  num_bad = 0
  #Looping through all points in the dataset
  for title in dataset:
    movie = dataset[title]
    distance_to_point = distance(movie, unknown)
    #Adding the distance and point associated with that distance
    distances.append([distance_to_point, title])
  distances.sort()
  #Taking only the k closest points
  neighbors = distances[0:k]
  for movie in neighbors:
    title = movie[1]
    if labels[title] == 0:
      num_bad += 1
    else:
      num_good += 1
  if(num_good>num_bad):
    return 1
  else:
    return 0
  
print(classify([.4, .2, .9], movie_dataset, movie_labels, 5))

Once the classifier works, test it on data that is not in movie_dataset; treat that data as a validation set.

Normalize the test data first, then feed it into the classifier.
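One subtlety when normalizing test data: the minimum and maximum should come from the training set, not from the test point itself, so training and test features stay on the same scale. A sketch under that assumption (the helper name normalize_point is my own):

```python
def normalize_point(point, training_columns):
  # Min-max normalize one point, one feature at a time, using the
  # training data's per-feature minimum and maximum.
  normalized = []
  for value, column in zip(point, training_columns):
    minimum, maximum = min(column), max(column)
    normalized.append((value - minimum) / (maximum - minimum))
  return normalized

# Each inner list is one feature column from a made-up training set.
training_columns = [[0.0, 2.0, 4.0], [10.0, 20.0, 30.0]]
print(normalize_point([1.0, 25.0], training_columns))  # → [0.25, 0.75]
```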

from movies import training_set, training_labels, validation_set, validation_labels

def distance(movie1, movie2):
  squared_difference = 0
  for i in range(len(movie1)):
    squared_difference += (movie1[i] - movie2[i]) ** 2
  final_distance = squared_difference ** 0.5
  return final_distance

def classify(unknown, dataset, labels, k):
  distances = []
  #Looping through all points in the dataset
  for title in dataset:
    movie = dataset[title]
    distance_to_point = distance(movie, unknown)
    #Adding the distance and point associated with that distance
    distances.append([distance_to_point, title])
  distances.sort()
  #Taking only the k closest points
  neighbors = distances[0:k]
  num_good = 0
  num_bad = 0
  for neighbor in neighbors:
    title = neighbor[1]
    if labels[title] == 0:
      num_bad += 1
    elif labels[title] == 1:
      num_good += 1
  if num_good > num_bad:
    return 1
  else:
    return 0
  
  
print(validation_set["Bee Movie"])
guess = classify(validation_set["Bee Movie"], training_set, training_labels, 5)
print(guess)
if guess == validation_labels["Bee Movie"]:
  print("Correct!")
else:
  print("Wrong!")

Choosing k and measuring accuracy

When k is small, overfitting occurs and accuracy is relatively low. When k grows too large, underfitting occurs and accuracy starts to drop.

Overfitting happens when k is too small, so outliers dominate the result.

Underfitting happens when k is too big, so larger trends in the dataset are not represented.

from movies import training_set, training_labels, validation_set, validation_labels

def distance(movie1, movie2):
  squared_difference = 0
  for i in range(len(movie1)):
    squared_difference += (movie1[i] - movie2[i]) ** 2
  final_distance = squared_difference ** 0.5
  return final_distance

def classify(unknown, dataset, labels, k):
  distances = []
  #Looping through all points in the dataset
  for title in dataset:
    movie = dataset[title]
    distance_to_point = distance(movie, unknown)
    #Adding the distance and point associated with that distance
    distances.append([distance_to_point, title])
  distances.sort()
  #Taking only the k closest points
  neighbors = distances[0:k]
  num_good = 0
  num_bad = 0
  for neighbor in neighbors:
    title = neighbor[1]
    if labels[title] == 0:
      num_bad += 1
    elif labels[title] == 1:
      num_good += 1
  if num_good > num_bad:
    return 1
  else:
    return 0
  
  
def find_validation_accuracy(training_set, training_labels, validation_set, validation_labels, k):
  num_correct = 0.0
  for title in validation_set:
    guess = classify(validation_set[title], training_set, training_labels, k)
    if guess == validation_labels[title]:
      num_correct += 1
  return num_correct / len(validation_set)

print(find_validation_accuracy(training_set, training_labels, validation_set, validation_labels, 3))

 
