Machine Learning Project 1: KNN

Header

  • Name: Shusen Wu
  • OS: Windows 10
  • CPU: Intel i7-7700
  • Language: Python 3.6
  • Environment: Jupyter Notebook
  • Libraries: 
    numpy
    matplotlib.pyplot
    collections 
    time
    operator

Reference

Machine Learning in Action, Peter Harrington, ISBN 9781617290183

Datasets

The project explores two datasets: the famous MNIST dataset of small pictures of handwritten digits, and a dataset on the prevalence of diabetes among the Pima, a Native American tribe. You can access the datasets here:
1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
2. https://www.kaggle.com/c/digit-recognizer/data

Part 1: Pima

Dataset details

Here, I use 80% of the data as training data, 10% as validation data, and 10% as test data.
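A minimal sketch of this 80/10/10 split (all names here are my own, not from the project code):

```python
import numpy as np

def split_data(X, y, seed=0):
    """Shuffle, then split into 80% train / 10% validation / 10% test."""
    rng = np.random.RandomState(seed)      # fixed seed so the split is reproducible
    idx = rng.permutation(len(X))
    n_train = int(0.8 * len(X))            # first 80% of shuffled rows -> training
    n_val = int(0.9 * len(X))              # next 10% -> validation, remainder -> test
    train, val, test = idx[:n_train], idx[n_train:n_val], idx[n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```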

Besides, since the scales of the features are quite different, I normalize them to the range 0–1.

Algorithm Description

Following the book Machine Learning in Action, I adapted the KNN and normalize functions.

  • KNN 

  • Normalize 
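The two functions can be sketched roughly as follows (a sketch in the spirit of the book's `classify0`/`autoNorm`; names and details here are my own):

```python
import numpy as np
from collections import Counter

def normalize(X):
    """Min-max scale each feature column into the 0-1 range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def knn_classify(x, X_train, y_train, k):
    """Label a single sample x by majority vote among its k nearest
    training samples under Euclidean distance."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # indices of the k closest training points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```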

Algorithm Results

  

|           | Predicted 1                 | Predicted 0                 | Total                    |
| --------- | --------------------------- | --------------------------- | ------------------------ |
| Reality 1 | True Positive (TP)          | False Negative (FN)         | Actual Positive (TP+FN)  |
| Reality 0 | False Positive (FP)         | True Negative (TN)          | Actual Negative (FP+TN)  |
| Total     | Predicted Positive (TP+FP)  | Predicted Negative (FN+TN)  | TP+FP+FN+TN              |

First, we need to predict on the validation set with different values of K. Here, I set K from 3 to 9.

Here are the results:

Comparing the accuracies for K from 3 to 9, we find that K = 7 works best.

So we choose K = 7 to run on the test set.
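The validation sweep over K can be sketched like this (illustrative names; `knn_classify` is a minimal single-sample KNN as described above):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k):
    """Majority vote among the k nearest training samples."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    return Counter(y_train[np.argsort(dists)[:k]]).most_common(1)[0][0]

def best_k(X_train, y_train, X_val, y_val, ks=range(3, 10)):
    """Try each candidate K on the validation set; return the K with the
    highest accuracy together with all the accuracies."""
    accs = {}
    for k in ks:
        preds = np.array([knn_classify(x, X_train, y_train, k) for x in X_val])
        accs[k] = np.mean(preds == y_val)
    return max(accs, key=accs.get), accs
```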

Runtime

The runtime on the test set is about 0.007 s.

Other running times are shown in the pictures above.

Part 2: Recognize Digits

Dataset details

Here, again, I use 90% of the training data to train and the remaining 10% as a validation set. Since a separate test set is already provided, I do not need to split one off from the training data. The following picture shows the shapes (rows × columns) of these data sets.

 

The distribution of the digits 0–9:

A random sample of images:

Besides, let's take a quick look at one image:

Algorithm Description

  • KNN: since (a − b)² = a² − 2ab + b², the pairwise squared distances between the test and training matrices can be expanded into ‖a‖² − 2a·b + ‖b‖² and computed with a single matrix product, which saves a lot of time.
  • Besides, we also need to compute the accuracy for every K. This can be done very quickly with np.sum(y == y') / len(y).
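These two tricks can be sketched as follows (illustrative names, not the project's actual code):

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distance between every row of A and every row of B,
    expanded as ||a||^2 - 2*a.b + ||b||^2 so the bulk of the work is one
    matrix product."""
    return ((A ** 2).sum(axis=1)[:, None]
            - 2 * A @ B.T
            + (B ** 2).sum(axis=1)[None, :])

def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    return np.sum(y_pred == y_true) / len(y_true)
```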

 

Algorithm Results & Runtime

KNN does cost a lot of time, and it uses a huge amount of memory. When running this algorithm in Jupyter Notebook, there is not enough memory unless the test set is split into small batches.

Thus, I moved the code back to PyCharm and ran it there.
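A sketch of the batched prediction just described (batch_size and all names are my own guesses, not the project's actual code):

```python
import numpy as np

def predict_in_batches(X_test, X_train, y_train, k, batch_size=500):
    """Classify the test set chunk by chunk so the full test-by-train
    distance matrix never has to sit in memory at once."""
    preds = []
    sq_train = (X_train ** 2).sum(axis=1)   # reused for every batch
    for start in range(0, len(X_test), batch_size):
        batch = X_test[start:start + batch_size]
        # squared distances via ||a||^2 - 2*a.b + ||b||^2
        d = (batch ** 2).sum(axis=1)[:, None] - 2 * batch @ X_train.T + sq_train
        nearest = np.argsort(d, axis=1)[:, :k]
        for labels in y_train[nearest]:      # majority vote per test sample
            vals, counts = np.unique(labels, return_counts=True)
            preds.append(vals[np.argmax(counts)])
    return np.array(preds)
```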

From this picture, we can see that K = 5 does best on the validation set. So we set K = 5 to run on the test set. Here is the result:

Top 100 results:
