About Learning Vector Quantization (LVQ)

A limitation of k-Nearest Neighbors is that it must hold on to the entire training dataset in order to make predictions. Learning Vector Quantization addresses this by learning a much smaller set of patterns that best represent the training data.

After completing this tutorial, you will know:

  • How to learn a set of codebook vectors from a training dataset.
  • How to make predictions using learned codebook vectors.
  • How to apply Learning Vector Quantization to a real predictive modeling problem.

1.1 Description

  • The Learning Vector Quantization (LVQ) algorithm
  • The Ionosphere classification problem

1.1.1 LVQ

The LVQ algorithm is a lot like k-Nearest Neighbors. Predictions are made by finding the best match among a library of patterns. The difference is that this library of patterns is learned from the training data, rather than being the training data itself.

Each pattern in the library is called a codebook vector, and the collection of them is called a codebook.

1.1.2 Ionosphere Dataset

This dataset involves predicting whether a radar return shows evidence of structure in the ionosphere. The baseline performance on the problem is approximately 64%. The dataset is stored in the file ionosphere.csv.
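
That baseline corresponds to always predicting the majority class (the Zero Rule algorithm). As a minimal sketch of how such a baseline could be computed (zero_rule_accuracy is a hypothetical helper, not part of this tutorial):

# Hypothetical sketch: baseline accuracy from always predicting the majority class.
# Assumes each row is a list with the class label as its last element.
def zero_rule_accuracy(dataset):
    labels = [row[-1] for row in dataset]
    majority = max(set(labels), key=labels.count)
    correct = sum(1 for label in labels if label == majority)
    return correct / float(len(labels)) * 100.0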

1.2 Tutorial

This tutorial is broken down into 4 parts:

  1. Euclidean Distance
  2. Best Matching Unit
  3. Training Codebook Vectors
  4. Ionosphere Case Study

These steps will lay the foundation for implementing and applying the LVQ algorithm to your own predictive modeling problems.

1.2.1 Euclidean Distance

The foundation of LVQ is a measure of the distance between two rows of data. Here we use the Euclidean distance:

distance = \sqrt{\sum_{i=1}^{n}(x1_{i}-x2_{i})^2}

where:

  • x1 is the first row of data.
  • x2 is the second row of data.
  • i is the index of a specific column as we sum across all columns.
  • n is the number of input columns.
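
As a quick worked example, take the first two rows of the contrived dataset introduced below and ignore their class values:

distance = \sqrt{(2.7810836 - 1.465489372)^2 + (2.550537003 - 2.362125076)^2} \approx 1.329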

Below is a function named euclidean_distance() that implements this in Python.

from math import sqrt

# Calculate the Euclidean distance between two vectors.
# The last value in each row is assumed to be the class label and is skipped.
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

We will use the small contrived dataset below a few times as we construct the elements needed for the LVQ algorithm.

X1            X2             Y
2.7810836     2.550537003    0
1.465489372   2.362125076    0
3.396561688   4.400293529    0
1.38807019    1.850220317    0
3.06407232    3.005305973    0
7.627531214   2.759262235    1
5.332441248   2.088626775    1
6.922596716   1.77106367     1
8.675418651   -0.242068655   1
7.673756466   3.508563011    1

A plot of the dataset, using a different color for each of the two classes, shows that the classes form two distinct clusters.
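
If you want to reproduce such a plot, below is a minimal sketch using matplotlib; matplotlib is an assumption here and is not needed for the rest of the tutorial.

# Hypothetical plotting sketch. Assumes matplotlib is installed and that
# dataset is the list of rows defined in the example below.
import matplotlib.pyplot as plt

for x1, x2, y in dataset:
    plt.scatter(x1, x2, color='blue' if y == 0 else 'red')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()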

We can write a small example to test our distance function by printing the distance between the first row and all other rows.

# Example of Calculating Euclidean Distance on the Contrived Dataset
from math import sqrt

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i]-row2[i])**2
    return sqrt(distance)

# Test distance function
dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]

row0 = dataset[0]
for row in dataset:
    distance = euclidean_distance(row0, row)
    print(distance)

Running this example prints the distances between the first row and every row in the dataset, including itself.

1.2.2 Best Matching Unit (BMU)

The BMU is the codebook vector that is most similar to a new piece of data. To locate it, we calculate the distance between each codebook vector and the new piece of data and keep the codebook vector with the smallest distance.

Below is a function named get_best_matching_unit() that implements this.

# Locate the best matching unit
def get_best_matching_unit(codebooks, test_row):
    distances = list()
    for codebook in codebooks:
        dist = euclidean_distance(codebook, test_row)
        distances.append((codebook, dist))
    distances.sort(key=lambda tup: tup[1])
    return distances[0][0]

The euclidean_distance() function developed above is used to calculate the distance between each codebook vector and the new test_row.

The key=lambda tup: tup[1] argument sorts the list of (codebook, distance) tuples by the distance, so the closest codebook vector comes first.
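
As a tiny illustration of this sorting idiom (the pairs list is just an invented example):

# Sorting tuples by their second element
pairs = [('a', 3.2), ('b', 0.5), ('c', 1.7)]
pairs.sort(key=lambda tup: tup[1])
print(pairs)  # [('b', 0.5), ('c', 1.7), ('a', 3.2)]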

The complete example is listed below:

# Example of getting the BMU
from math import sqrt

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the best matching unit
def get_best_matching_unit(codebooks, test_row):
    distances = list()
    for codebook in codebooks:
        dist = euclidean_distance(codebook, test_row)
        distances.append((codebook, dist))
    distances.sort(key=lambda tup: tup[1])
    return distances[0][0]


# Test best matching unit function
dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]

test_row = dataset[0]
bmu = get_best_matching_unit(dataset, test_row)
print(bmu)

Running this example prints the BMU in the dataset for the first record. As expected, the first record is the most similar to itself and so sorts to the top of the list.
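
The printed BMU is the first record itself:

[2.7810836, 2.550537003, 0]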

1.2.3 Training Codebook Vectors

The first step in training a set of codebook vectors is to initialize the set. We can initialize it with patterns constructed from random features in the training dataset. Below is a function named random_codebook() that implements this. Random input and output features are selected from the training data.

# Create a random codebook vector
def random_codebook(train):
    n_records = len(train)
    n_features = len(train[0])
    # each feature is copied from a randomly chosen training record
    codebook = [train[randrange(n_records)][i] for i in range(n_features)]
    return codebook

After initialization, the codebook vectors must be adapted to best summarize the training data. This is done iteratively at three levels:

  • Epochs: the whole process is repeated for a fixed number of training epochs.
  • Training Dataset: within an epoch, each training pattern is used in turn to update the set of codebook vectors.
  • Pattern Features: for a given training pattern, each feature of the best matching codebook vector is updated, moving it closer to the pattern if their classes match and further away if they differ.

The size of each update is controlled by a learning rate that decays linearly over the epochs:

rate = learning\_rate \times \left(1 - \frac{epoch}{total\_epochs}\right)

We can test this equation by assuming a learning rate of 0.3 and 10 epochs. The effective learning rate for each epoch would be as follows:

Epoch   Effective Learning Rate
0       0.3
1       0.27
2       0.24
3       0.21
4       0.18
5       0.15
6       0.12
7       0.09
8       0.06
9       0.03
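
The schedule above can be verified directly with the same expression used in the training function below:

# Print the linearly decayed learning rate for each epoch
learn_rate = 0.3
n_epochs = 10
for epoch in range(n_epochs):
    rate = learn_rate * (1.0 - (epoch / float(n_epochs)))
    print('epoch=%d, rate=%.2f' % (epoch, rate))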

Below is a function named train_codebooks() that implements the procedure for training a set of codebook vectors given a training dataset.

It uses the random_codebook() function to initialize the codebook vectors and the get_best_matching_unit() function to find the BMU for each training pattern within an epoch.

# Train a set of codebook vectors
def train_codebooks(train, n_codebooks, lrate, epochs):
    codebooks = [random_codebook(train) for i in range(n_codebooks)]
    for epoch in range(epochs):
        # linearly decay the learning rate over the epochs
        rate = lrate * (1.0 - (epoch/float(epochs)))
        sum_error = 0.0
        for row in train:
            bmu = get_best_matching_unit(codebooks, row)
            for i in range(len(row)-1):
                error = row[i] - bmu[i]
                sum_error += error**2
                # move the BMU closer if the classes match, further away otherwise
                if bmu[-1] == row[-1]:
                    bmu[i] += rate * error
                else:
                    bmu[i] -= rate * error
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, rate, sum_error))
    return codebooks

Below is the complete example.

# Example of training a set of codebook vectors
from math import sqrt
from random import randrange
from random import seed

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the best matching unit
def get_best_matching_unit(codebooks, test_row):
    distances = list()
    for codebook in codebooks:
        dist = euclidean_distance(codebook, test_row)
        distances.append((codebook, dist))
    distances.sort(key=lambda tup: tup[1])
    return distances[0][0]

# Create a random codebook vector
def random_codebook(train):
    n_records = len(train)
    n_features = len(train[0])
    codebook = [train[randrange(n_records)][i] for i in range(n_features)]
    return codebook

# Train a set of codebook vectors
def train_codebooks(train, n_codebooks, lrate, epochs):
    codebooks = [random_codebook(train) for i in range(n_codebooks)]
    for epoch in range(epochs):
        rate = lrate * (1.0 - (epoch/float(epochs)))
        sum_error = 0.0
        for row in train:
            bmu = get_best_matching_unit(codebooks, row)
            for i in range(len(row)-1):
                error = row[i] - bmu[i]
                sum_error += error**2
                if bmu[-1] == row[-1]:
                    bmu[i] += rate * error
                else:
                    bmu[i] -= rate * error
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, rate, sum_error))
    return codebooks


# Test the training function
seed(1)
dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]
learn_rate = 0.3
n_epochs = 10
n_codebooks = 2
codebooks = train_codebooks(dataset, n_codebooks, learn_rate, n_epochs)
print('Codebooks: %s' % codebooks)

Running this example trains a set of 2 codebook vectors for 10 epochs with an initial learning rate of 0.3. The details are printed each epoch and the set of 2 codebook vectors learned from the training data is displayed.

We can see that the changes to the learning rate meet the expectations set out in the table above. We can also see that the sum of squared error continues to drop at the end of training, and that there may be an opportunity to tune the example further to achieve a lower error.
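
Once trained, the codebook vectors make predictions just like k-Nearest Neighbors with k=1: the predicted class is the class of the BMU. The predict() function below is the same one used in the full example in the next section; the check loop is only a quick sketch that reuses the dataset and codebooks variables from the example above.

# Make a prediction with codebook vectors
def predict(codebooks, test_row):
    bmu = get_best_matching_unit(codebooks, test_row)
    return bmu[-1]

# Quick check: compare the expected and predicted class for each training row
for row in dataset:
    print('Expected=%d, Got=%d' % (row[-1], predict(codebooks, row)))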

1.2.4 Ionosphere Case Study

We will use the helper functions load_csv() to load the file, str_column_to_float() to convert string numbers to floats, and str_column_to_int() to convert the class column to integer values. We will also use evaluate_algorithm() to evaluate the algorithm with k-fold cross-validation and accuracy_metric() to calculate the accuracy of predictions.

# LVQ for the Ionosphere Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
        
# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split
    
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0
    
# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the best matching unit
def get_best_matching_unit(codebooks, test_row):
    distances = list()
    for codebook in codebooks:
        dist = euclidean_distance(codebook, test_row)
        distances.append((codebook, dist))
    distances.sort(key=lambda tup: tup[1])
    return distances[0][0]

# Make a prediction with codebook vectors
def predict(codebooks, test_row):
    bmu = get_best_matching_unit(codebooks, test_row)
    return bmu[-1]

# Create a random codebook vector
def random_codebook(train):
    n_records = len(train)
    n_features = len(train[0])
    codebook = [train[randrange(n_records)][i] for i in range(n_features)]
    return codebook

# Train a set of codebook vectors
def train_codebooks(train, n_codebooks, lrate, epochs):
    codebooks = [random_codebook(train) for i in range(n_codebooks)]
    for epoch in range(epochs):
        rate = lrate * (1.0 - (epoch/float(epochs)))
        for row in train:
            bmu = get_best_matching_unit(codebooks, row)
            for i in range(len(row)-1):
                error = row[i] - bmu[i]
                if bmu[-1] == row[-1]:
                    bmu[i] += rate * error
                else:
                    bmu[i] -= rate * error
    return codebooks

# LVQ Algorithm
def learning_vector_quantization(train, test, n_codebooks, lrate, epochs):
    codebooks = train_codebooks(train, n_codebooks, lrate, epochs)
    predictions = list()
    for row in test:
        output = predict(codebooks, row)
        predictions.append(output)
    return predictions

# Test LVQ on Ionosphere dataset
seed(1)
# Load and prepare data
filename = 'ionosphere.csv'
dataset = load_csv(filename)
# convert input columns to floats
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# evaluate algorithm
n_folds = 5
learn_rate = 0.3
n_epochs = 50
n_codebooks = 20
scores = evaluate_algorithm(dataset, learning_vector_quantization, n_folds, n_codebooks, learn_rate, n_epochs)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

Running this example prints the classification accuracy on each fold and the mean classification accuracy across all folds. We can see that the mean accuracy of 87.143% is better than the baseline of 64.286%. We can also see that our library of 20 codebook vectors is far smaller than holding the entire training dataset.

Scores: [90.0, 88.57142857142857, 84.28571428571429, 87.14285714285714, 85.71428571428571]

Mean Accuracy: 87.143%
