Machine Learning Project 1: KNN

Header

  • Name: Shusen Wu
  • OS: Windows 10
  • CPU: Intel i7-7700
  • Language: Python 3.6
  • Environment: Jupyter Notebook
  • Libraries: 
    numpy
    matplotlib.pyplot
    collections 
    time
    operator

Reference

Machine Learning in Action, Peter Harrington, ISBN 9781617290183

Datasets

The project explores two datasets: the famous MNIST dataset of small pictures of handwritten digits, and a dataset on the prevalence of diabetes among the Pima, a Native American tribe. You can access the datasets here:
1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
2. https://www.kaggle.com/c/digit-recognizer/data

Part 1: Pima

Dataset details

Here, I use 80% of the data as training data, 10% as validation data, and 10% as test data.
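A minimal sketch of this 80/10/10 split (all names here are my own, not from the project code):

```python
import numpy as np

def split_data(X, y, seed=0):
    """Shuffle, then split into 80% train / 10% validation / 10% test."""
    rng = np.random.RandomState(seed)      # fixed seed so the split is reproducible
    idx = rng.permutation(len(X))
    n_train = int(0.8 * len(X))            # first 80% of shuffled rows -> training
    n_val = int(0.9 * len(X))              # next 10% -> validation, remainder -> test
    train, val, test = idx[:n_train], idx[n_train:n_val], idx[n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```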

Besides, since the scales of the features are quite different, I normalize them to the range 0–1.

Algorithm Description

Following the book Machine Learning in Action, I adapted the KNN and normalize functions.

  • KNN 

  • Normalize 
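The two functions can be sketched roughly as follows (a sketch in the spirit of the book's `classify0`/`autoNorm`; names and details here are my own):

```python
import numpy as np
from collections import Counter

def normalize(X):
    """Min-max scale each feature column into the 0-1 range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def knn_classify(x, X_train, y_train, k):
    """Label a single sample x by majority vote among its k nearest
    training samples under Euclidean distance."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # indices of the k closest training points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```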

Algorithm Results

  

|           | Predicted 1                 | Predicted 0                 | Total                    |
| --------- | --------------------------- | --------------------------- | ------------------------ |
| Reality 1 | True Positive (TP)          | False Negative (FN)         | Actual Positive (TP+FN)  |
| Reality 0 | False Positive (FP)         | True Negative (TN)          | Actual Negative (FP+TN)  |
| Total     | Predicted Positive (TP+FP)  | Predicted Negative (FN+TN)  | TP+FP+FN+TN              |

First, we need to predict on the validation set with different values of K. Here, I set K from 3 to 9.

Here are the results:

Comparing the accuracies for K from 3 to 9, we find that K = 7 works best.

So we choose K = 7 to run on the test set.
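The validation sweep over K can be sketched like this (illustrative names; `knn_classify` is a minimal single-sample KNN as described above):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k):
    """Majority vote among the k nearest training samples."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    return Counter(y_train[np.argsort(dists)[:k]]).most_common(1)[0][0]

def best_k(X_train, y_train, X_val, y_val, ks=range(3, 10)):
    """Try each candidate K on the validation set; return the K with the
    highest accuracy together with all the accuracies."""
    accs = {}
    for k in ks:
        preds = np.array([knn_classify(x, X_train, y_train, k) for x in X_val])
        accs[k] = np.mean(preds == y_val)
    return max(accs, key=accs.get), accs
```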

Runtime

The runtime on the test set is about 0.007 s.

Other running times are shown in the pictures above.

Part 2: Recognize Digits

Dataset details

Here, again, I use 90% of the training data to train and the remaining 10% as a validation set. Since a separate test set is already provided, I do not need to split one off from the training data. The following picture shows the shapes (rows × columns) of these data sets.

 

The distribution of the digits 0–9:

A random sample of images:

Besides, let's take a quick look at one image:

Algorithm Description

  • KNN: since (a − b)² = a² − 2ab + b², the pairwise squared distances between the test and training matrices can be expanded into ‖a‖² − 2a·b + ‖b‖² and computed with a single matrix product, which saves a lot of time.
  • Besides, we also need to compute the accuracy for every K. This can be done very quickly with np.sum(y == y') / len(y).
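These two tricks can be sketched as follows (illustrative names, not the project's actual code):

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distance between every row of A and every row of B,
    expanded as ||a||^2 - 2*a.b + ||b||^2 so the bulk of the work is one
    matrix product."""
    return ((A ** 2).sum(axis=1)[:, None]
            - 2 * A @ B.T
            + (B ** 2).sum(axis=1)[None, :])

def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    return np.sum(y_pred == y_true) / len(y_true)
```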

 

Algorithm Results & Runtime

KNN does cost a lot of time, and it uses a huge amount of memory. When running this algorithm in Jupyter Notebook, there is not enough memory unless the test set is split into small batches.

Thus, I moved the code back to PyCharm and ran it there.
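A sketch of the batched prediction just described (batch_size and all names are my own guesses, not the project's actual code):

```python
import numpy as np

def predict_in_batches(X_test, X_train, y_train, k, batch_size=500):
    """Classify the test set chunk by chunk so the full test-by-train
    distance matrix never has to sit in memory at once."""
    preds = []
    sq_train = (X_train ** 2).sum(axis=1)   # reused for every batch
    for start in range(0, len(X_test), batch_size):
        batch = X_test[start:start + batch_size]
        # squared distances via ||a||^2 - 2*a.b + ||b||^2
        d = (batch ** 2).sum(axis=1)[:, None] - 2 * batch @ X_train.T + sq_train
        nearest = np.argsort(d, axis=1)[:, :k]
        for labels in y_train[nearest]:      # majority vote per test sample
            vals, counts = np.unique(labels, return_counts=True)
            preds.append(vals[np.argmax(counts)])
    return np.array(preds)
```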

From this picture, we can see that K = 5 does best on the validation set. So we set K = 5 to run on the test set. Here is the result:

Top 100 results:
