DataCamp – Machine Learning with R
Rather than covering the inner mechanics of kNN, these notes focus on how to apply the knn() function from the class package in R to perform classification.
These are my study notes from DataCamp. Link: https://campus.datacamp.com
Chapter 1: k-Nearest Neighbors (kNN)
Recognizing a road sign with kNN
After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.
As it begins to drive away, its camera captures the following image:
Apply a kNN classifier to help the car recognize this sign.
kNN classification can be performed with the knn() function from the class package.
Codes:
# Load the 'class' package
library(class)
# Create a vector of labels
sign_types <- signs$sign_type
# Classify the next sign observed (signs[-1] drops the sign_type label column)
knn(train = signs[-1], test = next_sign, cl = sign_types)
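The signs and next_sign objects live in the DataCamp session, so the call above cannot be run locally as-is. As a self-contained sketch of the same pattern, knn() can be tried on a tiny made-up dataset (the color values and labels below are invented purely for illustration):

```r
# Minimal, self-contained sketch of knn(): invented RGB features and labels
library(class)

train <- data.frame(
  r = c(200, 210, 30, 40),
  g = c( 30,  40, 30, 50),
  b = c( 20,  30, 200, 210)
)
labels <- factor(c("stop", "stop", "speed", "speed"))

# A new reddish observation lands nearest the "stop" examples
new_obs <- data.frame(r = 205, g = 35, b = 25)
knn(train = train, test = new_obs, cl = labels)  # predicts "stop"
```

With the default k = 1, the prediction is simply the label of the single closest training row.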
How did the knn() function correctly classify the stop sign?
----The sign was in some way similar to another stop sign
Exploring the traffic sign dataset
To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.
Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue levels for each of the 16 center pixels were recorded.
The result is a dataset that records the sign_type as well as 16 x 3 = 48 color properties of each sign.
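Under the hood, "similar" means a small Euclidean distance across these 48 color features. A sketch with two invented three-feature vectors (values chosen for illustration only):

```r
# kNN measures similarity as Euclidean distance over the color features.
# Two invented pixel-color vectors:
sign_a <- c(r1 = 155, g1 = 228, b1 = 120)
sign_b <- c(r1 = 142, g1 = 217, b1 = 100)

# Euclidean distance: square root of the sum of squared feature differences
sqrt(sum((sign_a - sign_b)^2))  # ~26.27
```

The full dataset does the same computation over all 48 columns; the training sign with the smallest distance to the new observation supplies the predicted label when k = 1.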
codes:
# Examine the structure of the signs dataset
str(signs)
## 'data.frame': 146 obs. of 49 variables:
##  $ sign_type: chr  "pedestrian" "pedestrian" "pedestrian" "pedestrian" ...
##  $ r1       : int  155 142 57 22 169 75 136 149 13 123 ...
##  $ g1       : int  228 217 54 35 179 67 149 225 34 124 ...
##  ...
##  $ r16      : int  22 164 58 19 160 180 188 237 83 43 ...
##  $ g16      : int  52 227 60 27 183 107 211 254 125 29 ...
##  $ b16      : int  53 237 60 29 187 26 227 53 19 11 ...
# Count the number of signs of each type
table(signs$sign_type)
## pedestrian      speed       stop
##         46         49         51
# Check r10's average red level by sign type
aggregate(r10 ~ sign_type, data = signs, mean)
##    sign_type       r10
## 1 pedestrian 113.71739
## 2      speed  80.63265
## 3       stop 132.39216
Classifying a collection of road signs
Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.
The test course includes 59 additional road signs of the same three types.
At the conclusion of the trial, you are asked to measure the car’s overall performance at recognizing these signs.
# Use kNN to identify the test road signs
sign_types <- signs$sign_type
signs_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)
# Create a confusion matrix of the predicted versus actual values
signs_actual <- test_signs$sign_type
table(signs_pred, signs_actual)
##             signs_actual
## signs_pred   pedestrian speed stop
##   pedestrian         19     2    0
##   speed               0    17    0
##   stop                0     2   19
# Compute the accuracy
mean(signs_pred == signs_actual)
## [1] 0.9322034  -- the accuracy rate
How to choose ‘k’? — Try it
Testing other ‘k’ values
By default, the knn() function in the class package uses only the single nearest neighbor.
Setting the k parameter allows the algorithm to consider additional nearby neighbors, enlarging the collection of neighbors that vote on the predicted class.
Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.
# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types)
mean(k_1 == signs_actual)
## [1] 0.9322034

# Modify the above to set k = 7
k_7 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 7)
mean(k_7 == signs_actual)
## [1] 0.9491525

# Set k = 15 and compare to the above
k_15 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 15)
mean(k_15 == signs_actual)
## [1] 0.8813559
Thus, k = 7 gives the best accuracy of the three values tried.
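Rather than copying the call once per k, the sweep can be scripted. A self-contained sketch using the built-in iris data in place of the signs (which exist only on DataCamp's servers); the split and the candidate k values are illustrative:

```r
library(class)

# Split the built-in iris data into train and test halves
set.seed(42)
idx   <- sample(nrow(iris), 75)
train <- iris[idx, -5]
test  <- iris[-idx, -5]
train_labels <- iris$Species[idx]
test_labels  <- iris$Species[-idx]

# Compute test-set accuracy for each candidate k
ks  <- c(1, 7, 15)
acc <- sapply(ks, function(k) {
  pred <- knn(train = train, test = test, cl = train_labels, k = k)
  mean(pred == test_labels)
})
setNames(acc, paste0("k=", ks))
```

The same sapply() pattern would apply directly to the signs data by swapping in signs[-1], signs_test[-1], and sign_types.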
Seeing how the neighbors voted
When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.
For example, knowing more about the voters' confidence in the classification could allow an autonomous vehicle to use caution if there is any chance at all that a stop sign is ahead.
In this exercise, you will learn how to obtain the voting results from the knn() function.
# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 7, prob = TRUE)
sign_pred
# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")
sign_prob
# Examine the first several predictions
head(sign_pred)
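A self-contained sketch of the same prob mechanism, again on invented data: with k = 3 neighbors voting, the "prob" attribute carries the winning class's share of the vote for each test row.

```r
library(class)

# Invented two-feature training data: three "a" points, three "b" points
train  <- data.frame(x = c(1, 1.2, 0.8, 5, 5.2, 4.8),
                     y = c(1, 0.9, 1.1, 5, 5.1, 4.9))
labels <- factor(c("a", "a", "a", "b", "b", "b"))

# With k = 3 and prob = TRUE, the winning class's vote share is attached
pred <- knn(train = train, test = data.frame(x = 1, y = 1),
            cl = labels, k = 3, prob = TRUE)
attr(pred, "prob")  # 1: all three neighbors voted "a", a unanimous vote
```

A value near 1 means the neighbors were unanimous; a value near 1/k means the vote was only narrowly won.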
Why normalize data?
Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization.
What is the purpose of this step?
To ensure all features contribute an equal share to the distance calculation; otherwise, features measured on larger scales would dominate the distance. kNN therefore benefits from normalized data.
Codes:
# Min-max normalization: rescale a numeric vector to the [0, 1] range
normalization <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
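The function can be applied column-wise with lapply(). A sketch on two invented color columns (the data frame and its values are illustrative, not the course data):

```r
# Min-max normalization: rescale a numeric vector to the [0, 1] range
normalization <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

# Invented color columns on different scales
signs_demo <- data.frame(r1 = c(155, 142, 57, 22),
                         g1 = c(228, 217, 54, 35))

# Rescale every column to [0, 1] before running knn()
signs_norm <- as.data.frame(lapply(signs_demo, normalization))
range(signs_norm$r1)  # 0 1
```

After this step, every column spans exactly [0, 1], so no single color channel dominates the Euclidean distance.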
That's all, thank you very much.