KNN, Naive Bayes and Logistic regression
Dataset
https://github.com/zalandoresearch/fashion-mnist
There are 10 classes in total:
0 T-shirt/Top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot
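For reading predictions later on, it can help to map the integer labels back to class names. A minimal sketch, copying the names from the list above; `CLASS_NAMES` and `label_to_name` are hypothetical helpers, not part of the dataset's API:

```python
# Class names in label order, copied from the list above
CLASS_NAMES = ["T-shirt/Top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

def label_to_name(label):
    """Return the class name for an integer label in 0..9 (hypothetical helper)."""
    return CLASS_NAMES[int(label)]

print(label_to_name(9))
```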
Note: If you use the content or code from this blog, please cite it!
1. Import libraries
code:
import h5py
import numpy as np
import matplotlib.pyplot as plt
import os
from tqdm import tqdm
2. Prepare and load the data
code:
with h5py.File('Input/train/images_training.h5','r') as d:
    orig_train_data = np.copy(d['datatrain'])
with h5py.File('Input/train/labels_training.h5','r') as d:
    orig_train_label = np.copy(d['labeltrain'])
with h5py.File('Input/test/images_testing.h5','r') as d:
    orig_test_data = np.copy(d['datatest'])
with h5py.File('Input/test/labels_testing_2000.h5','r') as d:
    orig_test_label = np.copy(d['labeltest'])
# Only the first 2000 test images have labels, so only those are merged with the training set
whole_data = np.concatenate((orig_train_data,orig_test_data[:2000]), axis = 0)
whole_label = np.concatenate((orig_train_label,orig_test_label), axis = 0)
print(whole_data.shape)
print(whole_label.shape)
(32000, 784)
(32000,)
Check the shape of the data and look for missing values
print("\n","train data shape:",orig_train_data.shape,"\n",
      "train label shape:",orig_train_label.shape,"\n",
      "test data shape:",orig_test_data.shape,"\n",
      "test label shape:",orig_test_label.shape)
print("Is there any missing data?",
      np.isnan(orig_train_data).any(),
      np.isnan(orig_train_label).any(),
      np.isnan(orig_test_data).any(),
      np.isnan(orig_test_label).any())
train data shape: (30000, 784)
train label shape: (30000,)
test data shape: (5000, 784)
test label shape: (2000,)
Is there any missing data? False False False False
Check how often each label occurs
for i in np.unique(orig_train_label):
    occur = np.count_nonzero(i == orig_train_label)
    print("class:",i,"occurs",occur, "times")
class: 0 occurs 3041 times
class: 1 occurs 2972 times
class: 2 occurs 2936 times
class: 3 occurs 3008 times
class: 4 occurs 2954 times
class: 5 occurs 3029 times
class: 6 occurs 3023 times
class: 7 occurs 3013 times
class: 8 occurs 3040 times
class: 9 occurs 2984 times
3. Preprocessing data
Split the data set into training and validation sets
train_d = []
train_l = []
valid_d = []
valid_l = []
data_index = list(range(whole_data.shape[0]))
np.random.shuffle(data_index) # Randomize the index order of the data set
train_index = data_index[:int(whole_data.shape[0]*0.85)] # 85% of the data as training data
valid_index = data_index[int(whole_data.shape[0]*0.85):] # 15% of the data as validation data
for i in train_index:
    train_d.append(whole_data[i])
    train_l.append(whole_label[i])
for i in valid_index:
    valid_d.append(whole_data[i])
    valid_l.append(whole_label[i])
train_data = np.array(train_d)
train_label = np.array(train_l)
valid_data = np.array(valid_d)
valid_label = np.array(valid_l)
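The same 85/15 split can also be written with NumPy index arrays instead of Python lists. This is only a sketch: the function name `shuffle_split` and the fixed `seed` (added for reproducibility) are assumptions, not part of the original code.

```python
import numpy as np

def shuffle_split(data, labels, train_fraction=0.85, seed=0):
    """Shuffle the rows and split them into training and validation parts."""
    rng = np.random.default_rng(seed)          # seeded generator (assumption)
    order = rng.permutation(data.shape[0])     # random order of row indices
    cut = int(data.shape[0] * train_fraction)  # number of training rows
    train_idx, valid_idx = order[:cut], order[cut:]
    return data[train_idx], labels[train_idx], data[valid_idx], labels[valid_idx]

# Toy usage: 100 samples with 2 features each
toy_labels = np.arange(100)
toy_data = np.stack([toy_labels, 2 * toy_labels], axis=1)
tr_d, tr_l, va_d, va_l = shuffle_split(toy_data, toy_labels)
print(tr_d.shape, va_d.shape)
```

Indexing with an integer array keeps each image aligned with its label, because the same index array is applied to both.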
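3.1 Preprocessing method: image standardization
def stand_data(input_data):
    # Note: this modifies input_data in place and returns the same array,
    # so any array passed in stays standardized afterwards
    for i in range(input_data.shape[0]):
        input_data[i] = input_data[i] - np.mean(input_data[i]) # zero mean per image
        input_data[i] = input_data[i] / np.std(input_data[i]) # unit standard deviation per image
    return input_data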
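For comparison, here is a vectorized sketch of the same per-image standardization that returns a new array instead of overwriting its input. The name `standardize_rows` and the guard for constant images (standard deviation of zero) are assumptions added for illustration.

```python
import numpy as np

def standardize_rows(images):
    """Give every row (image) zero mean and unit standard deviation."""
    images = np.asarray(images, dtype=np.float64)
    mean = images.mean(axis=1, keepdims=True)
    std = images.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # assumption: constant images become all zeros instead of NaN
    return (images - mean) / std

toy = np.array([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
print(standardize_rows(toy))
```

Because this version does not touch its argument, the caller has to keep the returned array explicitly, unlike `stand_data`.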
3.2 Implementation of PCA
def PCA(input_data, n_component):
    data_mean = np.mean(input_data, axis = 0) # Calculate the mean of each feature
    n_samples,n_features = np.shape(input_data)
    averg_matrix = np.tile(data_mean, (n_samples, 1))
    data_cent = input_data - averg_matrix # Center the data
    cov_matrix = np.dot(np.transpose(data_cent),data_cent) # Scatter matrix (covariance up to a constant factor)
    # The matrix is symmetric, so eigh is used; it returns real eigenvalues and eigenvectors
    eig_values, eig_vectors = np.linalg.eigh(cov_matrix)
    value_index = np.argsort(eig_values) # Sort eigenvalues in ascending order
    vect_index = value_index[:-(n_component+1):-1] # Indices of the n_component largest eigenvalues
    feature = eig_vectors[:,vect_index]
    low_dim_Data = np.dot(data_cent,feature)
    reconData = np.dot(low_dim_Data,np.transpose(feature)) + averg_matrix # Map the reduced data back to the original space
    return low_dim_Data , reconData
# First return value: low-dimensional data; second: reconstructed data
train_data_pca, train_data_pca1 = PCA(stand_data(train_data),180)
valid_data_pca, valid_data_pca1 = PCA(stand_data(valid_data),180)
train_data_pca2 = PCA(stand_data(orig_train_data),80)[1]
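The calls above run PCA separately on the training and validation sets, so each set is projected onto its own principal axes. A common alternative, sketched here as an assumption rather than the method used in this post, is to fit the mean and components on the training data only and reuse them for any other data. The names `pca_fit`, `pca_transform` and `pca_reconstruct` are hypothetical.

```python
import numpy as np

def pca_fit(train, n_component):
    """Return the training mean and the top n_component principal axes."""
    mean = train.mean(axis=0)
    centered = train - mean
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eig_values, eig_vectors = np.linalg.eigh(centered.T @ centered)
    components = eig_vectors[:, ::-1][:, :n_component]  # largest eigenvalues first
    return mean, components

def pca_transform(data, mean, components):
    """Project data onto axes that were fitted elsewhere."""
    return (data - mean) @ components

def pca_reconstruct(low_dim, mean, components):
    """Map low-dimensional data back to the original feature space."""
    return low_dim @ components.T + mean

rng = np.random.default_rng(0)
toy_train = rng.normal(size=(50, 5))
toy_valid = rng.normal(size=(10, 5))
mean, components = pca_fit(toy_train, 3)
toy_valid_low = pca_transform(toy_valid, mean, components)
print(toy_valid_low.shape)
```

With this layout, training and validation samples share one coordinate system, which matters when distances between them are compared, as in KNN.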
3.3 Implementation of SVD
def SVDdecomposition(input_matrix, n_eig_value):
    U, s, Vt = np.linalg.svd(input_matrix, full_matrices = False)
    S = np.diag(s)
    # Keep only the n_eig_value largest singular values (low-rank approximation)
    new_matrix = U[:,0:n_eig_value].dot(S[0:n_eig_value, 0:n_eig_value]).dot(Vt[0:n_eig_value,:])
    return new_matrix
def svd_data(input_data):
    # Reshape every 784-element row into a 28x28 image and approximate it with rank 7
    data_res = input_data.reshape((input_data.shape[0], 28, 28))
    svd_result = []
    for i in range(data_res.shape[0]):
        data_svd = SVDdecomposition(data_res[i],7)
        svd_result.append(data_svd)
    svd_result = np.array(svd_result)
    return svd_result
train_data_res = svd_data(train_data)
valid_data_res = svd_data(valid_data)
sample2 = svd_data(orig_train_data)
# Flatten the 28x28 approximations back into 784-element rows
train_data_svd = train_data_res.reshape((train_data_res.shape[0], 784))
valid_data_svd = valid_data_res.reshape((valid_data_res.shape[0], 784))
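To get a feeling for how much information a rank-7 approximation discards, the reconstruction error can be read directly from the singular values that were dropped: the Frobenius norm of the difference equals the square root of the sum of their squares. A small sketch on a random 28x28 matrix standing in for an image (the helper `rank_k_approx` and the random data are assumptions for illustration):

```python
import numpy as np

def rank_k_approx(matrix, k):
    """Return the rank-k truncated SVD approximation and all singular values."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]  # scale columns of U by the kept singular values
    return approx, s

rng = np.random.default_rng(0)
image = rng.random((28, 28))                     # stand-in for one 28x28 image
approx, s = rank_k_approx(image, 7)
error = np.linalg.norm(image - approx)           # Frobenius norm of the difference
discarded = np.sqrt((s[7:] ** 2).sum())          # norm of the dropped singular values
print(error, discarded)
```

The two printed numbers agree up to rounding, so comparing `discarded` for different values of k is a cheap way to choose how many singular values to keep.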
3.4 Define the accuracy function
def accuracy(pred_label, true_label):
    # Percentage of predictions that match the true labels
    a = (np.array(pred_label) == np.array(true_label)).mean()*100
    return a
4. Define the KNN algorithm function
# KNN classification
def KNN_classifier(train_data, train_label, test_data, k):
    result = []
    for n in tqdm(range(test_data.shape[0])):
        test = test_data[n]
        dist_list = ((train_data - test)**2).sum(axis =1) # Squared Euclidean distance to every training sample
        sort_idx = np.argsort(dist_list)
        tar_label = np.zeros(10)
        for nearest_num in range(k):
            label = train_label[sort_idx[nearest_num]] # Count the labels of the k nearest neighbours
            tar_label[label] += 1
        nearest_label = np.argmax(tar_label) # Majority vote among the k nearest neighbours
        result.append(nearest_label)
    return result
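The per-sample loop above can be replaced by one matrix product using the identity ||a - b||² = ||a||² + ||b||² - 2a·b. The following is a sketch under assumptions: the name `knn_predict` is hypothetical, labels are assumed to be non-negative integers, and the full distance matrix is assumed to fit in memory (for large sets the test data would need to be processed in chunks).

```python
import numpy as np

def knn_predict(train_data, train_label, test_data, k, n_classes=10):
    """Vectorized k-nearest-neighbour majority vote (squared Euclidean distance)."""
    train_sq = (train_data ** 2).sum(axis=1)
    test_sq = (test_data ** 2).sum(axis=1)
    # dist[i, j] = squared distance between test sample i and training sample j
    dist = test_sq[:, None] + train_sq[None, :] - 2.0 * test_data @ train_data.T
    # argpartition places the k smallest distances in the first k columns (unordered)
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    votes = train_label[nearest]
    # bincount tallies the votes; argmax picks the lowest label on ties
    return np.array([np.bincount(v, minlength=n_classes).argmax() for v in votes])

toy_train = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_label = np.array([0, 0, 1, 1])
toy_test = np.array([[0.0, 0.5], [10.0, 10.5]])
print(knn_predict(toy_train, toy_label, toy_test, k=3))
```

`np.argpartition` avoids a full sort of every row, which is the main cost once the distances themselves are computed with a matrix product.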
4.1 Fine-tune hyper-parameters for KNN Classifier
# Training on the original data
k_list = []
acc_result = []
for k in range(1,15):
    knn_result = KNN_classifier(train_data, train_label, valid_data, k)
    k_list.append(k)
    acc_result.append((np.array(knn_result) == valid_label).mean())
result = np.array(acc_result)*100
plt.plot( k_list, result, 'go:', linewidth=2)
plt.xlabel('k')
plt.ylabel('accuracy')
plt.title('KNN select hyperparameter (original data)')
plt.axvline(x=8, color='#d46061', linewidth=1)
plt.show()
Reduce the dimension of the training and validation data with PCA and use the low-dimensional result to fine-tune the KNN hyper-parameter k.
k_list_pca = []
acc_result_pca = []
for k in range(1,10):
    knn_result = KNN_classifier(train_data_pca, train_label, valid_data_pca, k)
    k_list_pca.append(k)
    acc_result_pca.append((np.array(knn_result) == valid_label).mean())
result = np.array(acc_result_pca)*100
plt.plot( k_list_pca, result) # plot the accuracy in percent
plt.xlabel('k')
plt.ylabel('accuracy')
plt.title('KNN accuracy with training set 0.85, validation set 0.15')
#plt.axvline(x=6, color='#d46061', linewidth=1)
plt.show()
Reduce the dimension of the training and validation data with PCA and keep the reconstructed data, this time to fine-tune PCA itself: choose the number of components and plot the classification accuracy.
ncomponent_list_pca = []
acc_pcaN = []
for i in range(80,780,100):
    train_data_pca = PCA(stand_data(train_data),i)[1]
    valid_data_pca = PCA(stand_data(valid_data),i)[1]
    knn_result_pca= KNN_classifier(train_data_pca, train_label, valid_data_pca, 8)
    ncomponent_list_pca.append(i)
    acc_pcaN.append((np.array(knn_result_pca) == valid_label).mean())
acc_pcaN = np.array(acc_pcaN)*100
plt.plot( ncomponent_list_pca, acc_pcaN )
plt.xlabel('num_component')
plt.ylabel('accuracy')
plt.title('KNN accuracy (PCA)')
plt.show()
# Repeat with a finer grid below 80 components
ncomponent_list_pca = []
acc_pcaN = []
for i in range(10,80,10):
    train_data_pca = PCA(stand_data(train_data),i)[1]
    valid_data_pca = PCA(stand_data(valid_data),i)[1]
    knn_result_pca= KNN_classifier(train_data_pca, train_label, valid_data_pca, 8)
    ncomponent_list_pca.append(i)
    acc_pcaN.append((np.array(knn_result_pca) == valid_label).mean())
acc_pcaN = np.array(acc_pcaN)*100
plt.plot( ncomponent_list_pca, acc_pcaN ,'go:', linewidth=2)
plt.xlabel('num_component')
plt.ylabel('accuracy')
plt.title('KNN accuracy (PCA)')
plt.show()
After fine-tuning the PCA hyper-parameter for the KNN algorithm, set the number of components to 60 and keep the reconstructed data for training.
Choose k for the KNN algorithm (from 2 to 9)
train_data_pca60 = PCA(stand_data(train_data),60)[1]
valid_data_pca60 = PCA(stand_data(valid_data),60)[1]
valid_data_pca60.shape
k_opt_pca = []
acc_opt_pca = []
for k in range(2,10):
    knn_result = KNN_classifier(train_data_pca60, train_label, valid_data_pca60, k)
    k_opt_pca.append(k)
    acc_opt_pca.append((np.array(knn_result) == valid_label).mean())
acc_opt_pca = np.array(acc_opt_pca)*100
plt.plot( k_opt_pca, acc_opt_pca )
plt.xlabel('k')
plt.ylabel('accuracy')
plt.title('KNN select k hyperparameter (PCA)')
plt.show()
4.2 End of tuning
Dimension reduction: PCA with number of components = 60. KNN classifier: hyper-parameter k = 6.
traindata_pca_opt = PCA(stand_data(train_data),60)[1]
testdata_pca_opt = PCA(stand_data(orig_test_data[:2000]),60)[1]
opt_knn = KNN_classifier(traindata_pca_opt, train_label, testdata_pca_opt, 6)
opt_knn_result = accuracy(opt_knn, orig_test_label)
print("KNN classification accuracy = %0.2f" % opt_knn_result,"%")
KNN classification accuracy = 87.80 %
5. Define the Naive Bayes classification algorithm function
# Naive Bayes classification
def NB_classifier(train_data, train_label, test_data):
    # Number of training samples
    num = train_data.shape[0]
    # Group the training samples by label
    sp_data = []
    for label in np.unique(train_label):
        same_label = []
        for x,y in zip(train_data, train_label):
            if y == label:
                same_label.append(x)
        sp_data.append(same_label)
    # Log prior probability of each class
    log_prior = []
    for i in sp_data:
        log_prior.append(np.log((len(i) / num)))
    # Sum the feature values per class
    count_data = []
    for i in sp_data:
        i = np.array(i)
        count_data.append(i.sum(axis = 0))
    count_data = np.array(count_data)
    count_data[count_data < 0] = 0 # Clip negative sums (possible after standardization)
    count_data += 1.0 # Add-one smoothing
    # Log conditional probability of each feature given the class
    feature_post = np.log(count_data / count_data.sum(axis = 1)[np.newaxis].T)
    # Score = log likelihood + log prior; predict the class with the highest score
    result = np.argmax([(feature_post*s).sum(axis = 1) + log_prior for s in test_data], axis = 1)
    return result
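Written as a formula, the decision rule implemented by `NB_classifier` is a multinomial Naive Bayes rule with add-one smoothing. The notation is introduced here for illustration only: N is the number of training samples, N_c the number with label c, d the number of features, x_{ij} feature j of training sample i, and θ_{cj} the smoothed share of feature j in class c.

```latex
\hat{y}(x) = \arg\max_{c}\left[\log\frac{N_c}{N} + \sum_{j=1}^{d} x_j \log\theta_{cj}\right],
\qquad
\theta_{cj} = \frac{1 + \max\!\left(0,\ \sum_{i:\,y_i=c} x_{ij}\right)}
                   {\sum_{j'=1}^{d}\left[1 + \max\!\left(0,\ \sum_{i:\,y_i=c} x_{ij'}\right)\right]}
```

The first term corresponds to `log_prior`, the sum to `(feature_post*s).sum(axis = 1)`, and the max(0, ·) to the line that clips negative entries of `count_data`.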
5.1 Classifying data with the Naive Bayes classifier
N_svd = NB_classifier(train_data_svd, train_label, valid_data_svd)
NaiveBayes_accuracy = (N_svd == valid_label).mean()*100
print("NaiveBayes_accuracy = ", NaiveBayes_accuracy,"%")
NaiveBayes_accuracy = 62.0 %
Preprocessing method | Original data (standardized) | PCA | SVD |
---|---|---|---|
Naïve Bayes accuracy | 54.15% | 54.1% | 66.91% |
6. Define the Logistic regression classification algorithm function
# Logistic regression classification
# Find w by gradient descent
def LR_classifier(train_data, train_label, max_i , learning_rate):
    # Initialize w with small random values
    w = np.random.random((train_data.shape[1],1))/train_data.shape[1]
    # Gradient descent: stop after max_i iterations or when w changes by less than e
    e = 1e-10
    itres = 0
    dif = float("Inf")
    while itres < max_i and dif > e:
        residual = np.zeros((train_data.shape[0],1))
        for i in range(train_data.shape[0]):
            sigma = 1.0 / (1.0 + np.exp(-np.dot(train_data[i], w))) # sigmoid function
            # residual = predicted probability - true binary label
            if train_label[i] == 1:
                residual[i] = sigma -1
            else:
                residual[i] = sigma
        gradient = np.dot(train_data.T, residual)
        w_new = w - learning_rate*gradient
        dif = np.linalg.norm(w_new - w)
        w = w_new
        itres += 1
    return w
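The inner loop over samples computes the residual sigmoid(x·w) - y one row at a time. The same gradient can be obtained with two matrix products, as in this sketch; `sigmoid` and `lr_gradient` are hypothetical names, and the labels are assumed to be 0/1 as in the function above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_gradient(data, binary_label, w):
    """Gradient of the logistic loss: X^T (sigmoid(Xw) - y)."""
    y = np.asarray(binary_label, dtype=np.float64).reshape(-1, 1)
    residual = sigmoid(data @ w) - y  # same residual as the per-sample loop
    return data.T @ residual

rng = np.random.default_rng(0)
toy_data = rng.normal(size=(6, 3))
toy_y = [1, 0, 0, 1, 0, 1]
toy_w = rng.normal(size=(3, 1))
print(lr_gradient(toy_data, toy_y, toy_w).ravel())
```

Replacing the loop this way leaves the update rule `w_new = w - learning_rate*gradient` unchanged; only the way the residual is computed differs.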
# Split the labels into 10 binary problems and obtain the predicted labels: prediction = sigmoid(test_data · w) > 0.5
def LR_data_preprocess(train_data,train_label,valid_data,valid_label, max_i , learning_rate):
    lr_train_label = [] # Split the train labels into 10 binary label lists
    for i in range(len(np.unique(train_label))):
        label_new = []
        for j in train_label:
            if j == i:
                label_new.append(1)
            else:
                label_new.append(0)
        lr_train_label.append(label_new)
    LR_result = [] # Obtain one w per label (0-9)
    for i in tqdm(range(len(np.unique(train_label)))):
        LR_re = LR_classifier(train_data, lr_train_label[i],max_i,learning_rate)
        LR_result.append(LR_re)
    lr_test_label = [] # Split the test labels into 10 binary label lists
    for i in range(len(np.unique(train_label))):
        label_test_new = []
        for j in valid_label:
            if j == i:
                label_test_new.append(1)
            else:
                label_test_new.append(0)
        lr_test_label.append(label_test_new)
    prediction = [] # Obtain the predicted binary labels for each of the 10 classifiers
    for i in range(len(np.unique(train_label))):
        pred_label = []
        for j in range(valid_data.shape[0]):
            c = 1.0 / (1.0 + np.exp(-np.dot(valid_data[j], LR_result[i])))
            if c > 0.5:
                pred_label.append(1)
            else:
                pred_label.append(0)
        prediction.append(pred_label)
    return lr_test_label ,prediction
6.1 Fine-tune hyper-parameters for Logistic regression Classifier
# Experiment 1: vary the learning rate with max iteration fixed at 100
lr_learnrate_label = []
lr_learnrate_result = []
for i in [0.1,0.01,0.001,0.0001,0.00001]:
    lr_result_w = LR_data_preprocess(orig_train_data,orig_train_label,orig_test_data[:2000],orig_test_label, 100, i)
    lr_learnrate_label.append(lr_result_w[0])
    lr_learnrate_result.append(lr_result_w[1])
learning_rate_n = [0.1,0.01,0.001,0.0001,0.00001]
lr_learning_rate_acc = []
for i in range(len(learning_rate_n)):
    lr_label_o = np.array(lr_learnrate_label)[i]
    prediction_o = np.array(lr_learnrate_result)[i]
    lr_acc_o = []
    for j in range(len(np.unique(train_label))):
        lr_acc_o.append((np.array(lr_label_o[j]) == np.array(prediction_o[j])).mean()*100)
    lr_acc_o = np.array(lr_acc_o)
    lr_learning_rate_acc.append(lr_acc_o.mean()) # average of the 10 binary accuracies
print(lr_learning_rate_acc)
plt.plot( learning_rate_n, lr_learning_rate_acc)
plt.xlabel('learning_rate')
plt.ylabel('accuracy')
plt.title('Accuracy with different learning_rate')
plt.show()
# Experiment 2: vary the maximum number of iterations with learning rate fixed at 0.01
lr_itre_label = []
lr_itre_result = []
for i in [70,100,130,160,200,250,280]:
    lr_result_w = LR_data_preprocess(orig_train_data,orig_train_label,orig_test_data[:2000],orig_test_label, i, 0.01)
    lr_itre_label.append(lr_result_w[0])
    lr_itre_result.append(lr_result_w[1])
max_iteration = [70,100,130,160,200,250,280]
lr_max_ite_acc = []
for i in range(len(max_iteration)):
    lr_label_o = np.array(lr_itre_label)[i]
    prediction_o = np.array(lr_itre_result)[i]
    lr_acc_o = []
    for j in range(len(np.unique(train_label))):
        lr_acc_o.append((np.array(lr_label_o[j]) == np.array(prediction_o[j])).mean()*100)
    lr_acc_o = np.array(lr_acc_o)
    lr_max_ite_acc.append(lr_acc_o.mean()) # average of the 10 binary accuracies
print(lr_max_ite_acc)
plt.plot( max_iteration, lr_max_ite_acc,'go:', linewidth=2)
plt.xlabel('max_iteration')
plt.ylabel('accuracy')
plt.title('LR Accuracy with different max_iteration')
plt.show()
The task is to classify images into 10 categories, but a logistic regression classifier is inherently binary. We therefore train it one-vs-rest: first, class 0 is treated as the positive class and the other nine classes together as the negative class; then the same is done for class 1, and so on, until there is one binary classifier for each of the 10 labels. For each binary classifier, the predicted labels are compared with the true labels to compute an accuracy, and the 10 accuracies are averaged to give the final result. This classifier has two hyperparameters to fine-tune: the learning rate and the maximum number of iterations.
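One way to turn the ten binary classifiers into a single 10-class prediction is to pick, for every sample, the class whose classifier outputs the highest probability. This is a sketch of that idea, not the evaluation used in this post; `ovr_predict` and `multiclass_accuracy` are hypothetical names, and `weights` is assumed to hold one weight column per class. The resulting multi-class accuracy is directly comparable with the KNN and Naive Bayes numbers, whereas the averaged binary accuracy is a different metric: with ten balanced classes, always answering "not class i" would already score about 90%.

```python
import numpy as np

def ovr_predict(data, weights):
    """Pick the class whose binary classifier gives the highest probability.

    weights has shape (n_features, n_classes), one column per one-vs-rest classifier.
    """
    scores = 1.0 / (1.0 + np.exp(-(data @ weights)))  # sigmoid of every class score
    return scores.argmax(axis=1)

def multiclass_accuracy(pred_label, true_label):
    """Percentage of samples whose predicted class equals the true class."""
    return (np.asarray(pred_label) == np.asarray(true_label)).mean() * 100

# Toy example with 2 features and 2 classes
toy_weights = np.array([[5.0, -5.0], [-5.0, 5.0]])
toy_samples = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_pred = ovr_predict(toy_samples, toy_weights)
print(toy_pred, multiclass_accuracy(toy_pred, [0, 1]))
```

With the weights returned by `LR_classifier`, the ten (n_features, 1) vectors could be stacked column-wise, for example with `np.hstack`, to build such a `weights` matrix.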
Based on the two experiments above, we choose learning rate = 0.01 and max iteration = 160. With these values fixed, logistic regression is run on the data produced by each preprocessing method; the results are shown in the following table.
For logistic regression, the data processed by PCA and SVD gives lower classification accuracy than the original data (which is itself standardized; the importance of standardization has been mentioned above). Dimensionality reduction therefore does not improve the accuracy of logistic regression on this data.
Preprocessing method | Original data (standardized) | PCA | SVD |
---|---|---|---|
LR accuracy | 95.02% | 93.63% | 93.63% |
Conclusion
From the above experimental results, the logistic regression classifier performed best on this data set: after standardization, it classifies the data with an accuracy of about 95%, taking 8 minutes on average. The KNN classifier is faster, about 3 minutes on average, but its accuracy is lower, at around 86%. The Naive Bayes classifier is the fastest, classifying more than 3000 samples in a few seconds, but its results are the least satisfactory. For future work, we need to experiment with more models and algorithms, keep optimizing the models and code, and try different data preprocessing methods, accumulating experience with different data types. Tuning the hyperparameters is the most important step, since it directly affects the quality of the model, and it takes repeated attempts. The goal is a model that uses few computing resources, runs quickly, and is robust.