Classifying Fashion-MNIST with hand-written KNN, Naive Bayes and Logistic Regression classifiers

Latest recommended article published on 2022-01-29 23:56:03

哈哈小火锅

Category: Python. Tags: machine learning, naive Bayes, algorithms, logistic regression

Article link: https://blog.csdn.net/weixin_47018261/article/details/112754005

KNN, Naive Bayes and Logistic regression

Dataset

https://github.com/zalandoresearch/fashion-mnist
There are 10 classes in total:
0 T-shirt/Top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot

Note: if you use the content or code from this blog, please cite it.

1. Import libraries

Code:

import h5py
import numpy as np
import matplotlib.pyplot as plt
import os
from tqdm import tqdm

2. Prepare and load the data

Code:

with h5py.File('Input/train/images_training.h5','r') as d:
    orig_train_data = np.copy(d['datatrain'])
    
with h5py.File('Input/train/labels_training.h5','r') as d:
    orig_train_label = np.copy(d['labeltrain'])
    
with h5py.File('Input/test/images_testing.h5','r') as d:
    orig_test_data = np.copy(d['datatest'])
    
with h5py.File('Input/test/labels_testing_2000.h5','r') as d:
    orig_test_label = np.copy(d['labeltest'])

# Only the first 2000 test images have labels, so only those are merged in
whole_data = np.concatenate((orig_train_data,orig_test_data[:2000]), axis = 0)
whole_label = np.concatenate((orig_train_label,orig_test_label), axis = 0)
print(whole_data.shape)
print(whole_label.shape)

(32000, 784)
(32000,)

Check the shape of the data and look for missing values

print("\n","train data shape:",orig_train_data.shape,"\n",
      "train label shape:",orig_train_label.shape,"\n",
      "test data shape:",orig_test_data.shape,"\n",
      "test label shape:",orig_test_label.shape)
print("Is there any missing data？",
     np.isnan(orig_train_data).any(),
     np.isnan(orig_train_label).any(),
     np.isnan(orig_test_data).any(),
     np.isnan(orig_test_label).any())

train data shape: (30000, 784) 
train label shape: (30000,) 
test data shape: (5000, 784)
test label shape: (2000,)
Is there any missing data？ False False False False

Check how often each label occurs

for i in np.unique(orig_train_label):
    occur = np.count_nonzero(i == orig_train_label)
    print("class:",i,"occurs",occur, "times")

	class: 0 occurs 3041 times
	class: 1 occurs 2972 times
	class: 2 occurs 2936 times
	class: 3 occurs 3008 times
	class: 4 occurs 2954 times
	class: 5 occurs 3029 times
	class: 6 occurs 3023 times
	class: 7 occurs 3013 times
	class: 8 occurs 3040 times
	class: 9 occurs 2984 times

3. Preprocessing the data

Split the data set into training and validation sets

train_d = []
train_l = []
valid_d = []
valid_l = []

data_index = list(range(whole_data.shape[0]))
np.random.shuffle(data_index)   ## Randomize the index of the data set

train_index = data_index[:int(whole_data.shape[0]*0.85)] # 85%data as training data
valid_index = data_index[int(whole_data.shape[0]*0.85):] # 15%data as valid data

for i in train_index:
    train_d.append(whole_data[i])
    train_l.append(whole_label[i])
for i in valid_index:
    valid_d.append(whole_data[i])
    valid_l.append(whole_label[i])
    
train_data = np.array(train_d)
train_label = np.array(train_l)
valid_data = np.array(valid_d)
valid_label = np.array(valid_l)
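
The same shuffle-and-split step can be written more compactly with NumPy index arrays. The following is only a sketch: the function name `split_train_valid`, the `seed` argument and the toy arrays are assumptions for illustration and are not part of the original code.

```python
import numpy as np

def split_train_valid(data, labels, train_fraction=0.85, seed=0):
    """Shuffle the row indices once, then cut them into two parts."""
    rng = np.random.default_rng(seed)          # seeded so the split is repeatable
    order = rng.permutation(data.shape[0])     # random order of all row indices
    cut = int(data.shape[0] * train_fraction)
    train_idx, valid_idx = order[:cut], order[cut:]
    # Indexing with an integer array selects the rows without a Python loop
    return data[train_idx], labels[train_idx], data[valid_idx], labels[valid_idx]

# Toy data: row i is [2i, 2i+1] and its label is i % 10
demo_data = np.arange(40, dtype=float).reshape(20, 2)
demo_label = np.arange(20) % 10
tr_d, tr_l, va_d, va_l = split_train_valid(demo_data, demo_label)
print(tr_d.shape, va_d.shape)
```

Fixing the seed makes the later accuracy figures reproducible from run to run, which the unseeded `np.random.shuffle` call does not guarantee.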

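3.1 Preprocessing method: image standardization

def stand_data(input_data):
    # Standardize every image (row) to zero mean and unit standard deviation.
    # Note: the rows are overwritten in place, so the array passed in is modified too.
    for i in range(input_data.shape[0]):
        input_data[i] = input_data[i] - np.mean(input_data[i])
        input_data[i] = input_data[i] / np.std(input_data[i])

    return input_data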
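
For comparison, per-image standardization can also be done without a loop and without touching the input array. This is a sketch under assumptions: the name `standardize_rows` and the small `eps` term (added to avoid dividing by zero for a constant image) are not in the original code.

```python
import numpy as np

def standardize_rows(images, eps=1e-8):
    """Return a standardized copy: every row gets zero mean and (almost) unit std."""
    images = np.asarray(images, dtype=np.float64)
    mean = images.mean(axis=1, keepdims=True)   # one mean per image
    std = images.std(axis=1, keepdims=True)     # one standard deviation per image
    return (images - mean) / (std + eps)        # eps guards against std == 0

demo = np.array([[0.0, 1.0, 2.0, 3.0],
                 [5.0, 5.0, 5.0, 5.0]])         # second row is constant
out = standardize_rows(demo)
print(out)
```

Because this version returns a new array, calling it on `orig_train_data` would leave the raw pixel values available for later experiments.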

3.2 Implementation of PCA

def PCA(input_data, n_component):

    data_mean = np.mean(input_data, axis = 0) # mean of every feature
    data_cent = input_data - data_mean  # center the data (broadcasts over the rows)

    # Scatter matrix; it equals the covariance matrix up to a constant factor,
    # which does not change the eigenvectors
    cov_matrix = np.dot(np.transpose(data_cent),data_cent)

    # The matrix is symmetric, so eigh applies and returns real eigenvalues and eigenvectors
    eig_values, eig_vectors = np.linalg.eigh(cov_matrix)

    value_index = np.argsort(eig_values) # ascending order
    vect_index = value_index[:-(n_component+1):-1] # indices of the n_component largest eigenvalues
    feature = eig_vectors[:,vect_index]

    low_dim_Data = np.dot(data_cent,feature)

    reconData = np.dot(low_dim_Data,np.transpose(feature)) + data_mean # map the reduced data back to the original space

    return  low_dim_Data , reconData

# PCA returns (low-dimensional data, reconstructed data); unpack both from one call
train_data_pca, train_data_pca1 = PCA(stand_data(train_data),180)
valid_data_pca, valid_data_pca1 = PCA(stand_data(valid_data),180)
train_data_pca2 = PCA(stand_data(orig_train_data),80)[1]
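
In the calls above, the training and validation sets each get their own PCA, so their low-dimensional coordinates refer to different axes. A common alternative is to estimate the mean and components on the training set only and reuse them for the validation set. The sketch below illustrates that idea; `pca_fit`, `pca_transform` and the random toy data are assumptions, not the author's code.

```python
import numpy as np

def pca_fit(train, n_component):
    """Estimate the mean and the top principal directions from the training data."""
    mean = train.mean(axis=0)
    centered = train - mean
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    eig_values, eig_vectors = np.linalg.eigh(centered.T @ centered)
    order = np.argsort(eig_values)[::-1][:n_component]   # largest eigenvalues first
    return mean, eig_vectors[:, order]

def pca_transform(data, mean, components):
    """Project any data set onto components that were fitted elsewhere."""
    return (data - mean) @ components

rng = np.random.default_rng(0)
scale = np.array([5.0, 3.0, 1.0, 0.5, 0.1])      # features with very different spread
train = rng.normal(size=(200, 5)) * scale
valid = rng.normal(size=(50, 5)) * scale

mean, components = pca_fit(train, 2)
train_low = pca_transform(train, mean, components)
valid_low = pca_transform(valid, mean, components)  # same axes as train_low
print(train_low.shape, valid_low.shape)
```

With shared components, distances between a validation point and a training point are measured in one common coordinate system, which is what a KNN classifier on the reduced data needs.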

3.3 Implementation of SVD

def SVDdecomposition(input_matrix, n_eig_value):

    U, s, Vt = np.linalg.svd(input_matrix, full_matrices = False)
    S = np.diag(s)
    # Keep only the n_eig_value largest singular values (low-rank approximation)
    new_matrix = U[:,0:n_eig_value].dot(S[0:n_eig_value, 0:n_eig_value]).dot(Vt[0:n_eig_value,:])
    return new_matrix

def svd_data(input_data):

    # View every 784-pixel row as a 28x28 image
    data_res = input_data.reshape((input_data.shape[0], 28, 28))
    compressed = []
    for i in range(data_res.shape[0]):
        compressed.append(SVDdecomposition(data_res[i],7))

    return np.array(compressed)

train_data_res = svd_data(train_data)
valid_data_res = svd_data(valid_data)
sample2 = svd_data(orig_train_data)
train_data_svd = train_data_res.reshape((train_data_res.shape[0], 784))
valid_data_svd = valid_data_res.reshape((valid_data_res.shape[0], 784))
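
To get a feeling for what keeping 7 singular values discards, the approximation error can be read off the singular values themselves: the Frobenius norm of the difference equals the square root of the sum of the squared discarded singular values. The sketch below checks this on a random 28x28 matrix; `truncate_rank` and the random "image" are illustrative assumptions.

```python
import numpy as np

def truncate_rank(matrix, k):
    """Rank-k approximation of a matrix; also returns all singular values."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Scaling the columns of U by s avoids building the diagonal matrix explicitly
    return (U[:, :k] * s[:k]) @ Vt[:k, :], s

rng = np.random.default_rng(0)
image = rng.random((28, 28))                 # stand-in for one 28x28 image
approx, s = truncate_rank(image, 7)

error = np.linalg.norm(image - approx)       # Frobenius norm of the residual
expected = np.sqrt(np.sum(s[7:] ** 2))       # predicted from the dropped singular values
kept = np.sum(s[:7] ** 2) / np.sum(s ** 2)   # share of squared "energy" that is kept
print(error, expected, kept)
```

The `kept` ratio gives a quick way to compare different values of `n_eig_value` before running a classifier on the compressed images.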

3.4 Define the accuracy function

def accuracy(pred_label, true_label):
    a = (np.array(pred_label) == np.array(true_label)).mean()*100
    return a

4. Define the KNN algorithm function

# KNN classification
def KNN_classifier(train_data, train_label, test_data, k):
    result = []
    for n in tqdm(range(test_data.shape[0])):
        test = test_data[n]
        # Squared Euclidean distance; it gives the same ordering as the distance itself
        dist_list = ((train_data - test)**2).sum(axis =1)
        sort_idx = np.argsort(dist_list)

        tar_label = np.zeros(10)
        for nearest_num in range(k):
            label = int(train_label[sort_idx[nearest_num]])   # label of one of the k nearest neighbours
            tar_label[label] += 1

        nearest_label = np.argmax(tar_label)  # majority vote among the k nearest labels
        result.append(nearest_label)

    return result
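
The loop above handles one test image at a time. As a rough sketch of a vectorized variant, all pairwise squared distances can be obtained from the identity ||a - b||^2 = ||a||^2 - 2 a·b + ||b||^2, and `np.argpartition` can pick the k smallest without a full sort. The name `knn_predict` and the toy points are assumptions; for the full data set the distance matrix is large, so the test set would likely need to be processed in batches.

```python
import numpy as np

def knn_predict(train_data, train_label, test_data, k, n_classes=10):
    """Predict labels by majority vote among the k nearest training points."""
    train_sq = (train_data ** 2).sum(axis=1)
    test_sq = (test_data ** 2).sum(axis=1)
    # Pairwise squared distances, shape (n_test, n_train)
    dist = test_sq[:, None] - 2.0 * (test_data @ train_data.T) + train_sq[None, :]
    # The first k columns hold the indices of the k smallest distances (unordered)
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    votes = train_label[nearest]                       # labels of those neighbours
    counts = np.array([np.bincount(row, minlength=n_classes) for row in votes])
    return counts.argmax(axis=1)                       # ties go to the smaller label

train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
test = np.array([[0.2, 0.1], [10.5, 10.2]])
pred = knn_predict(train, labels, test, k=3)
print(pred)
```

The labels must be an integer array for the indexing and `np.bincount` calls to work.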

4.1 Fine-tune hyper-parameters for KNN Classifier

## original data training
k_list = []
acc_result = []

for k in range(1,15):
    knn_result = KNN_classifier(train_data, train_label, valid_data, k)
    k_list.append(k)
    acc_result.append((np.array(knn_result) == valid_label).mean())

result = np.array(acc_result)*100
plt.plot( k_list, result, 'go:', linewidth=2)
plt.xlabel('k')
plt.ylabel('accuracy')
plt.title('KNN select hyperparameter(original data)')
plt.axvline(x=8, color='#d46061', linewidth=1)
plt.show()


Reduce the dimension of the training and validation data with PCA and keep the low-dimensional result to fine-tune the hyperparameter k for KNN.

k_list_pca = []
acc_result_pca = []

for k in range(1,10):
    knn_result = KNN_classifier(train_data_pca, train_label, valid_data_pca, k)
    k_list_pca.append(k)
    acc_result_pca.append((np.array(knn_result) == valid_label).mean())
result = np.array(acc_result_pca)*100
plt.plot( k_list_pca, result)   # plot the percentage values, as in the other figures
plt.xlabel('k')
plt.ylabel('accuracy')
plt.title('KNN accuracy with training set 0.85, validation set 0.15')
#plt.axvline(x=6, color='#d46061', linewidth=1)
plt.show()


Reduce the dimension of the training and validation data with PCA and keep the reconstructed data to fine-tune the PCA hyperparameter. Choose the number of components for PCA and plot the classification accuracy.

ncomponent_list_pca = []
acc_pcaN = []
for i in range(80,780,100):
    train_data_pca = PCA(stand_data(train_data),i)[1]
    valid_data_pca = PCA(stand_data(valid_data),i)[1]
    knn_result_pca= KNN_classifier(train_data_pca, train_label, valid_data_pca, 8)
    ncomponent_list_pca.append(i)
    acc_pcaN.append((np.array(knn_result_pca) == valid_label).mean())

acc_pcaN = np.array(acc_pcaN)*100
plt.plot( ncomponent_list_pca, acc_pcaN )
plt.xlabel('num_component')
plt.ylabel('accuracy')
plt.title('KNN accuracy (PCA)')
plt.show()


ncomponent_list_pca = []
acc_pcaN = []
for i in range(10,80,10):
    train_data_pca = PCA(stand_data(train_data),i)[1]
    valid_data_pca = PCA(stand_data(valid_data),i)[1]
    knn_result_pca= KNN_classifier(train_data_pca, train_label, valid_data_pca, 8)
    ncomponent_list_pca.append(i)
    acc_pcaN.append((np.array(knn_result_pca) == valid_label).mean())
acc_pcaN = np.array(acc_pcaN)*100
plt.plot( ncomponent_list_pca, acc_pcaN ,'go:', linewidth=2)
plt.xlabel('num_component')
plt.ylabel('accuracy')
plt.title('KNN accuracy (PCA)')
plt.show()

After fine-tuning the PCA hyperparameter for the KNN algorithm, set the number of components to 60 and keep the reconstructed data for training.
Next, choose k for the KNN algorithm (from 2 to 9).

train_data_pca60 = PCA(stand_data(train_data),60)[1]
valid_data_pca60 = PCA(stand_data(valid_data),60)[1]
valid_data_pca60.shape

k_opt_pca = []
acc_opt_pca = []

for k in range(2,10):
    knn_result = KNN_classifier(train_data_pca60, train_label, valid_data_pca60, k)
    k_opt_pca.append(k)
    acc_opt_pca.append((np.array(knn_result) == valid_label).mean())
acc_opt_pca = np.array(acc_opt_pca)*100
plt.plot( k_opt_pca, acc_opt_pca )
plt.xlabel('k')
plt.ylabel('accuracy')
plt.title('KNN select k hyperparameter(PCA)')
plt.show()


4.2 End of tuning

Dimension reduction: PCA with number of components = 60. KNN classifier hyperparameter: k = 6.

traindata_pca_opt = PCA(stand_data(train_data),60)[1]
testdata_pca_opt = PCA(stand_data(orig_test_data[:2000]),60)[1]
opt_knn = KNN_classifier(traindata_pca_opt, train_label, testdata_pca_opt, 6)
opt_knn_result = accuracy(opt_knn, orig_test_label)
print("KNN classification accuracy = %0.2f" % opt_knn_result,"%")

KNN classification accuracy = 87.80 %

5. Define the Naive Bayes classification function

## Naive Bayes classification (multinomial model)

def NB_classifier(train_data, train_label, test_data):
    
    ## number of training samples
    num = train_data.shape[0]   
    
    ## group the training samples by label
    sp_data = []
    for label in np.unique(train_label): 
        same_label = []
        for x,y in zip(train_data, train_label):
            if y == label:
                same_label.append(x)
        sp_data.append(same_label)

    ## log prior probability of each class
    log_prior = []
    for i in sp_data:
        log_prior.append(np.log((len(i) / num)))   

    ## per-class sum of every feature, clipped at zero, with add-one (Laplace) smoothing
    count_data = []
    for i in sp_data:
        i = np.array(i)
        count_data.append(i.sum(axis = 0))
    count_data = np.array(count_data) 
    count_data[count_data < 0] = 0
    count_data += 1.0

    ## log conditional probability (likelihood) of each feature given the class
    feature_post = np.log(count_data / count_data.sum(axis = 1)[np.newaxis].T)
    
    ## unnormalized log posterior of every class for each test sample; pick the largest
    result = np.argmax([(feature_post*s).sum(axis = 1) + log_prior for s in test_data], axis = 1)
    
    return result
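
The last step of `NB_classifier` scores the test samples one by one in a list comprehension. The same scores can be computed with a single matrix product, which is usually faster. The sketch below compares the two on small random inputs; `nb_scores` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def nb_scores(feature_log_prob, log_prior, test_data):
    """Unnormalized log posterior, shape (n_samples, n_classes)."""
    # Row i, column c: sum over features of x_if * log p(f | c), plus log p(c)
    return test_data @ feature_log_prob.T + log_prior

rng = np.random.default_rng(0)
# 3 classes, 6 features; every row is a probability vector
feature_log_prob = np.log(rng.dirichlet(np.ones(6), size=3))
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
test_data = rng.integers(0, 5, size=(4, 6)).astype(float)

fast = nb_scores(feature_log_prob, log_prior, test_data)
# Reference: the per-sample formulation used in NB_classifier
slow = np.array([(feature_log_prob * s).sum(axis=1) + log_prior for s in test_data])
print(np.allclose(fast, slow), fast.argmax(axis=1))
```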

5.1 Classifying the data with Naive Bayes

N_svd = NB_classifier(train_data_svd, train_label, valid_data_svd)
NaiveBayes_accuracy = (N_svd == valid_label).mean()*100
print("NaiveBayes_accuracy = ", NaiveBayes_accuracy,"%")

NaiveBayes_accuracy =  62.0 %

Preprocessing method	Original data (standardized)	PCA	SVD
Naive Bayes accuracy	54.15%	54.1%	66.91%

6. Define the logistic regression classification function

## Logistic regression classification
# Find w by gradient descent


def LR_classifier(train_data, train_label, max_i , learning_rate):
    
    # initialize w with small random values
    w = np.random.random((train_data.shape[1],1))/train_data.shape[1]
    
    # binary labels (0 or 1) as a column vector
    y = np.asarray(train_label, dtype = float).reshape(-1, 1)
    
    # gradient descent: stop after max_i iterations or once w barely changes
    e = 1e-10
    itres = 0
    dif = float("Inf")
    
    while itres < max_i and dif > e:
        
        sigma = 1.0 / (1.0 + np.exp(-np.dot(train_data, w))) ## sigmoid function, all samples at once
        
        residual = sigma - y   # sigma - 1 for positive samples, sigma otherwise
        
        gradien = np.dot(train_data.T, residual)
        
        w_new = w - learning_rate*gradien
        
        dif = np.linalg.norm(w_new - w)
        
        w = w_new
        
        itres += 1
        
    return w
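
The expression `1.0 / (1.0 + np.exp(-z))` can overflow inside `np.exp` when `z` is a large negative number, which triggers a runtime warning. One common workaround is to evaluate the sigmoid in two branches so that `np.exp` only ever receives non-positive arguments. The helper `stable_sigmoid` below is an illustrative sketch and is not part of the original code.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid for an array z that never calls exp on a positive number."""
    z = np.asarray(z, dtype=np.float64)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0: 1 / (1 + exp(-z)), where exp(-z) <= 1
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0: exp(z) / (1 + exp(z)), where exp(z) < 1
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
print(values)
```

Both branches are algebraically the same function, so swapping this in for the inline sigmoid should not change the results apart from removing the overflow.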

## Split the test label and obtain the predicted label :prediction = W.T * test_data

def LR_data_preprocess(train_data,train_label,valid_data,valid_label, max_i , learning_rate):
    
    lr_train_label = [] #Split the train label in 10 parts
    for i in range(len(np.unique(train_label))):
        label_new = []
        for j in train_label:
            if j == i:
                label_new.append(1)
            else:
                label_new.append(0)

        lr_train_label.append(label_new)


    LR_result = [] ## obtain the w (0-9 label)
    for i in tqdm(range(len(np.unique(train_label)))):
        LR_re = LR_classifier(train_data, lr_train_label[i],max_i,learning_rate)
        LR_result.append(LR_re)


    lr_test_label = [] #Split the test label in 10 parts
    for i in range(len(np.unique(train_label))):
        label_test_new = []
        for j in valid_label:
            if j == i:
                label_test_new.append(1)
            else:
                label_test_new.append(0)
        lr_test_label.append(label_test_new)


    prediction = []   ## obtain the predicted label
    for i in range(len(np.unique(train_label))):   
        pred_label = []
        for j in range(valid_data.shape[0]):
            c = 1.0 / (1.0 + np.exp(-np.dot(valid_data[j], LR_result[i])))
            if c > 0.5:
                pred_label.append(1)
            else:
                pred_label.append(0)
        prediction.append(pred_label)
    
    return lr_test_label ,prediction

6.1 Fine-tune hyper-parameters for the logistic regression classifier

lr_learnrate_label = []
lr_learnrate_result = []
for i in [0.1,0.01,0.001,0.0001,0.00001]:
    lr_result_w = LR_data_preprocess(orig_train_data,orig_train_label,orig_test_data[:2000],orig_test_label, 100, i)
    lr_learnrate_label.append(lr_result_w[0])
    lr_learnrate_result.append(lr_result_w[1])

learning_rate_n = [0.1,0.01,0.001,0.0001,0.00001]
lr_learning_rate_acc = []
for i in range(len(learning_rate_n)):
    lr_label_o = np.array(lr_learnrate_label)[i]
    prediction_o =  np.array(lr_learnrate_result)[i]
    lr_acc_o = []
    for j in range(len(np.unique(train_label))): 
        lr_acc_o.append((np.array(lr_label_o[j]) == np.array(prediction_o[j])).mean()*100)
    lr_acc_o = np.array(lr_acc_o)
    lr_learning_rate_acc.append(lr_acc_o.mean())
print(lr_learning_rate_acc)
plt.plot( learning_rate_n, lr_learning_rate_acc)
plt.xlabel('learning_rate')
plt.ylabel('accuracy')
plt.title('Accuracy with different learning_rate')
plt.show()


lr_itre_label = []
lr_itre_result = []
for i in [70,100,130,160,200,250,280]:
    lr_result_w = LR_data_preprocess(orig_train_data,orig_train_label,orig_test_data[:2000],orig_test_label, i, 0.01)
    lr_itre_label.append(lr_result_w[0])
    lr_itre_result.append(lr_result_w[1])

max_iteration = [70,100,130,160,200,250,280]
lr_max_ite_acc = []
for i in range(len(max_iteration)):
    lr_label_o = np.array(lr_itre_label)[i]
    prediction_o =  np.array(lr_itre_result)[i]
    lr_acc_o = []
    for j in range(len(np.unique(train_label))): 
        lr_acc_o.append((np.array(lr_label_o[j]) == np.array(prediction_o[j])).mean()*100)
    lr_acc_o = np.array(lr_acc_o)
    lr_max_ite_acc.append(lr_acc_o.mean())
print(lr_max_ite_acc)
plt.plot( max_iteration, lr_max_ite_acc,'go:', linewidth=2)
plt.xlabel('max_iteration')
plt.ylabel('accuracy')
plt.title('LR Accuracy with different max_iteration')
plt.show()

This task is to classify images into 10 categories, but a logistic regression classifier is normally a binary classifier. We therefore train one binary classifier per label (one-vs-rest): first the samples with label 0 form the positive class and the samples of the remaining 9 labels form the negative class; then the same is done for label 1, and so on, until all 10 labels have been separated. For every label, the predicted binary tags are compared with the true binary tags to compute an accuracy, and the 10 accuracies are averaged to give the final result. This classifier has two hyperparameters to fine-tune: the learning rate and the maximum number of iterations.
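
The averaged figure described above measures ten separate yes/no decisions. Since each class covers about one tenth of the samples, a classifier that always answers "no" would already score about 90% on each binary task, so this figure is not directly comparable to a 10-class accuracy. As a hedged sketch, the ten weight vectors could instead be combined into a single prediction per image by taking the class with the highest score. The function `ovr_predict` and the toy weights below are hypothetical.

```python
import numpy as np

def ovr_predict(data, weights):
    """Combine one-vs-rest weight vectors into one class prediction per sample.

    weights: list of (n_features, 1) arrays, one per class.
    """
    W = np.hstack(weights)          # shape (n_features, n_classes)
    scores = data @ W               # linear score of every class for every sample
    # The sigmoid is monotonic, so the largest score is also the largest probability
    return scores.argmax(axis=1)

# Toy example: 2 features, 3 classes
weights = [np.array([[1.0], [0.0]]),
           np.array([[0.0], [1.0]]),
           np.array([[-1.0], [-1.0]])]
data = np.array([[2.0, 0.5], [0.1, 3.0], [-2.0, -1.0]])
true = np.array([0, 1, 2])

pred = ovr_predict(data, weights)
multiclass_accuracy = (pred == true).mean() * 100
print(pred, multiclass_accuracy)
```

In the article's code, `weights` would correspond to the list `LR_result` built inside `LR_data_preprocess`, which that function does not currently return.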

Based on the two experiments above, we choose learning rate = 0.01 and max iteration = 160 as the optimal hyperparameter values. With these values, logistic regression is run on the data produced by the different preprocessing methods; the results are shown in the following table.

For logistic regression, the data processed by PCA and SVD is classified less accurately than the original data, although this original data has been standardized. The importance of standardizing the data has been mentioned above. Dimensionality reduction therefore does not improve the accuracy of logistic regression on this data.

Preprocessing method	Original data (standardized)	PCA	SVD
LR accuracy	95.02%	93.63%	93.63%

Conclusion

From the above experimental results, the logistic regression classifier performed best on the data set of this task. After standardization, the data is classified by logistic regression with an accuracy of 95% (the average over the 10 binary classifiers), and a run takes 8 minutes on average. The KNN classifier is faster, taking 3 minutes on average, but its accuracy is lower, at around 86%. The Naive Bayes classifier runs fastest, classifying more than 3000 samples in a few seconds, but its results are less satisfactory. For future work, we need to experiment with more models based on different algorithms, keep optimizing the algorithms and code, and try different data preprocessing methods to gain experience with different data types. Tuning the hyperparameters is the most important step because it directly affects the quality of the model, and it takes repeated attempts. The goal is a model that uses few resources, runs quickly, and is robust.