Learning Vector Quantization (LVQ) addresses this by learning a much smaller set of patterns that best represent the training data (whereas k-Nearest Neighbors must keep the entire training dataset around to make predictions).
After completing this part, you will know:
- How to learn a set of codebook vectors from a training data set
- How to make predictions using learned codebook vectors.
- How to apply Learning Vector Quantization to a real predictive modeling problem.
1.1 Description
- LVQ(Learning Vector Quantization) algorithm
- The Ionosphere classification problem
1.1.1 LVQ
The LVQ algorithm is a lot like k-Nearest Neighbors. Predictions are made by finding the best match among a library of patterns. The difference is that this library is learned from the training data rather than being the training data itself. Each pattern in the library is called a codebook vector, and the collection of patterns is called the codebook.
1.1.2 Ionosphere Dataset
This dataset involves predicting whether radar returns show structure in the ionosphere. The baseline performance on the problem is approximately 64%, and the data is stored in the file ionosphere.csv.
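The baseline of approximately 64% corresponds to always predicting the majority class (the Zero Rule algorithm). As a minimal sketch, it could be estimated as follows, assuming the dataset has already been loaded into a list of rows with the class label in the last column:
# Estimate the Zero Rule (majority class) baseline accuracy
def zero_rule_baseline(dataset):
    classes = [row[-1] for row in dataset]
    majority = max(set(classes), key=classes.count)
    return classes.count(majority) / float(len(classes)) * 100.0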
1.2 Tutorial
This tutorial is broken down into 4 parts:
- Euclidean Distance
- Best Matching Unit
- Training Codebook Vectors
- Ionosphere Case Study
These steps will lay the foundation for implementing and applying the LVQ algorithm to your own predictive modeling problems.
1.2.1 Euclidean Distance
The first step is to calculate the distance between two rows of data. The Euclidean distance is calculated as:

distance = sqrt( sum_i (x1_i - x2_i)^2 )

where:
- x1 is the first row of data
- x2 is the second row of data
- i is the index of a specific column as we sum across all columns
Below is a function named euclidean_distance() that implements this in Python.
from math import sqrt

# calculate the Euclidean distance between two vectors
# (the last column holds the class label, so it is skipped)
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)
We will use the small contrived dataset below a few times as we construct the elements needed for the LVQ algorithm.
X1 X2 Y
2.7810836 2.550537003 0
1.465489372 2.362125076 0
3.396561688 4.400293529 0
1.38807019 1.850220317 0
3.06407232 3.005305973 0
7.627531214 2.759262235 1
5.332441248 2.088626775 1
6.922596716 1.77106367 1
8.675418651 -0.242068655 1
7.673756466 3.508563011 1
Below is a plot of the dataset using different colors to show the different classes for each point.

We can write a small example to test our distance function by printing the distance between the first row and all other rows.
# Example of Calculating Euclidean Distance on the Contrived Dataset
from math import sqrt

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Test distance function
dataset = [[2.7810836,2.550537003,0],
    [1.465489372,2.362125076,0],
    [3.396561688,4.400293529,0],
    [1.38807019,1.850220317,0],
    [3.06407232,3.005305973,0],
    [7.627531214,2.759262235,1],
    [5.332441248,2.088626775,1],
    [6.922596716,1.77106367,1],
    [8.675418651,-0.242068655,1],
    [7.673756466,3.508563011,1]]
row0 = dataset[0]
for row in dataset:
    distance = euclidean_distance(row0, row)
    print(distance)
Running this example prints the distances between the first row and every row in the dataset, including itself.
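The output should look similar to the following (distances truncated here to three decimal places):
0.0
1.329
1.949
1.559
0.536
4.851
2.593
4.214
6.522
4.986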

1.2.2 Best Matching Unit(BMU)
The BMU is the codebook vector that is most similar to a new piece of data. To locate it, we calculate the distance between each codebook vector and the new piece of data, and select the one with the smallest distance.
Below is a function named get_best_matching_unit() that implements this.
# Locate the best matching unit
def get_best_matching_unit(codebooks, test_row):
    distances = list()
    for codebook in codebooks:
        dist = euclidean_distance(codebook, test_row)
        distances.append((codebook, dist))
    distances.sort(key=lambda tup: tup[1])
    return distances[0][0]
The euclidean_distance() function developed above is used to calculate the distance between each codebook vector and the new test_row. The distance stored in tup[1] is the key for the sorting operation, so the codebook vector with the smallest distance ends up first in the list.
The complete example is listed below:
# Example of getting the BMU
from math import sqrt

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the best matching unit
def get_best_matching_unit(codebooks, test_row):
    distances = list()
    for codebook in codebooks:
        dist = euclidean_distance(codebook, test_row)
        distances.append((codebook, dist))
    distances.sort(key=lambda tup: tup[1])
    return distances[0][0]

# Test best matching unit function
dataset = [[2.7810836,2.550537003,0],
    [1.465489372,2.362125076,0],
    [3.396561688,4.400293529,0],
    [1.38807019,1.850220317,0],
    [3.06407232,3.005305973,0],
    [7.627531214,2.759262235,1],
    [5.332441248,2.088626775,1],
    [6.922596716,1.77106367,1],
    [8.675418651,-0.242068655,1],
    [7.673756466,3.508563011,1]]
test_row = dataset[0]
bmu = get_best_matching_unit(dataset, test_row)
print(bmu)
Running this example prints the BMU in the dataset for the first record. As expected, the first record is the most similar to itself and tops the list.
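The printed BMU is the first record itself:
[2.7810836, 2.550537003, 0]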

1.2.3 Training Codebook Vectors
The first step in training a set of codebook vectors is to initialize the set. We can initialize it with patterns constructed from random features in the training dataset. Below is a function named random_codebook() that implements this: for each feature position, it copies the value from a randomly chosen row of the training data, so the input and output features are each drawn from random rows.
from random import randrange

# Create a random codebook vector
def random_codebook(train):
    n_records = len(train)
    n_features = len(train[0])
    codebook = [train[randrange(n_records)][i] for i in range(n_features)]
    return codebook
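As a quick sanity check, we might initialize two codebook vectors from the contrived dataset above (a minimal sketch; it assumes random_codebook() and dataset as defined earlier, and the seed value is arbitrary):
from random import seed
seed(1)
codebooks = [random_codebook(dataset) for _ in range(2)]
print(codebooks)
Each printed vector mixes features drawn from different random rows of the training data.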
Once initialized, the codebook vectors must be adapted to best summarize the training data. This is done iteratively, in three nested loops:
- Epochs: at the top level, the process is repeated for a fixed number of epochs, or exposures of the training data.
- Training Dataset: within an epoch, each training pattern is used one at a time to update the set of codebook vectors.
- Pattern Features: for a given training pattern, each feature of the best matching codebook vector is updated, moving it closer to the pattern if the classes match and further away if they do not.

The amount of adjustment is controlled by a learning rate that decays linearly over the epochs:

rate = learning_rate * (1.0 - (epoch / total_epochs))

We can test this equation by assuming an initial learning rate of 0.3 and 10 epochs. The effective learning rate at each epoch would be as follows:
Epoch Effective Learning Rate
0 0.3
1 0.27
2 0.24
3 0.21
4 0.18
5 0.15
6 0.12
7 0.09
8 0.06
9 0.03
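We can verify this schedule with a few lines of Python that mirror the rate calculation used inside train_codebooks() below:
# print the effective learning rate for each epoch
lrate, epochs = 0.3, 10
for epoch in range(epochs):
    rate = lrate * (1.0 - (epoch / float(epochs)))
    print('epoch=%d, effective lrate=%.2f' % (epoch, rate))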

Below is a function named train_codebooks() that implements the procedure for training a set of codebook vectors given a training dataset. It uses the random_codebook() function above to initialize the codebook vectors and the get_best_matching_unit() function to find the BMU for each training pattern within an epoch.
# Train a set of codebook vectors
def train_codebooks(train, n_codebooks, lrate, epochs):
    codebooks = [random_codebook(train) for i in range(n_codebooks)]
    for epoch in range(epochs):
        # linearly decay the learning rate
        rate = lrate * (1.0 - (epoch/float(epochs)))
        sum_error = 0.0
        for row in train:
            bmu = get_best_matching_unit(codebooks, row)
            for i in range(len(row)-1):
                error = row[i] - bmu[i]
                sum_error += error**2
                # move the BMU closer if classes match, otherwise away
                if bmu[-1] == row[-1]:
                    bmu[i] += rate * error
                else:
                    bmu[i] -= rate * error
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, rate, sum_error))
    return codebooks
Below is the complete example.
# Example of training a set of codebook vectors
from math import sqrt
from random import randrange
from random import seed

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the best matching unit
def get_best_matching_unit(codebooks, test_row):
    distances = list()
    for codebook in codebooks:
        dist = euclidean_distance(codebook, test_row)
        distances.append((codebook, dist))
    distances.sort(key=lambda tup: tup[1])
    return distances[0][0]

# Create a random codebook vector
def random_codebook(train):
    n_records = len(train)
    n_features = len(train[0])
    codebook = [train[randrange(n_records)][i] for i in range(n_features)]
    return codebook

# Train a set of codebook vectors
def train_codebooks(train, n_codebooks, lrate, epochs):
    codebooks = [random_codebook(train) for i in range(n_codebooks)]
    for epoch in range(epochs):
        rate = lrate * (1.0 - (epoch/float(epochs)))
        sum_error = 0.0
        for row in train:
            bmu = get_best_matching_unit(codebooks, row)
            for i in range(len(row)-1):
                error = row[i] - bmu[i]
                sum_error += error**2
                if bmu[-1] == row[-1]:
                    bmu[i] += rate * error
                else:
                    bmu[i] -= rate * error
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, rate, sum_error))
    return codebooks

# Test the training function
seed(1)
dataset = [[2.7810836,2.550537003,0],
    [1.465489372,2.362125076,0],
    [3.396561688,4.400293529,0],
    [1.38807019,1.850220317,0],
    [3.06407232,3.005305973,0],
    [7.627531214,2.759262235,1],
    [5.332441248,2.088626775,1],
    [6.922596716,1.77106367,1],
    [8.675418651,-0.242068655,1],
    [7.673756466,3.508563011,1]]
learn_rate = 0.3
n_epochs = 10
n_codebooks = 2
codebooks = train_codebooks(dataset, n_codebooks, learn_rate, n_epochs)
print('Codebooks: %s' % codebooks)
Running this example trains a set of 2 codebook vectors for 10 epochs with an initial learning rate of 0.3. The details are printed each epoch, and the set of 2 codebook vectors learned from the training data is displayed at the end. We can see that the changes to the learning rate meet the expectations explored above for each epoch. We can also see that the sum of squared error is still dropping at the end of training, so there may be an opportunity to tune the example further to achieve a lower error.
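One simple way to explore that opportunity (the values below are hypothetical starting points, not tuned results) is to rerun the training with more epochs or more codebook vectors and watch the reported error:
# hypothetical tuning experiment: more codebooks and more epochs
codebooks = train_codebooks(dataset, 4, 0.3, 30)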

1.2.4 Ionosphere Case Study
We will use the helper functions load_csv() to load the file, str_column_to_float() to convert string numbers to floats, and str_column_to_int() to convert the class column to integer values. We will also use evaluate_algorithm() to evaluate the algorithm with k-fold cross-validation and accuracy_metric() to calculate the accuracy of predictions.
# LVQ for the Ionosphere Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset
# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup
# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores
# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the best matching unit
def get_best_matching_unit(codebooks, test_row):
    distances = list()
    for codebook in codebooks:
        dist = euclidean_distance(codebook, test_row)
        distances.append((codebook, dist))
    distances.sort(key=lambda tup: tup[1])
    return distances[0][0]

# Make a prediction with codebook vectors
def predict(codebooks, test_row):
    bmu = get_best_matching_unit(codebooks, test_row)
    return bmu[-1]

# Create a random codebook vector
def random_codebook(train):
    n_records = len(train)
    n_features = len(train[0])
    codebook = [train[randrange(n_records)][i] for i in range(n_features)]
    return codebook

# Train a set of codebook vectors
def train_codebooks(train, n_codebooks, lrate, epochs):
    codebooks = [random_codebook(train) for i in range(n_codebooks)]
    for epoch in range(epochs):
        rate = lrate * (1.0 - (epoch/float(epochs)))
        for row in train:
            bmu = get_best_matching_unit(codebooks, row)
            for i in range(len(row)-1):
                error = row[i] - bmu[i]
                if bmu[-1] == row[-1]:
                    bmu[i] += rate * error
                else:
                    bmu[i] -= rate * error
    return codebooks
# LVQ Algorithm
def learning_vector_quantization(train, test, n_codebooks, lrate, epochs):
    codebooks = train_codebooks(train, n_codebooks, lrate, epochs)
    predictions = list()
    for row in test:
        output = predict(codebooks, row)
        predictions.append(output)
    return predictions

# Test LVQ on Ionosphere dataset
seed(1)
# load and prepare data
filename = 'ionosphere.csv'
dataset = load_csv(filename)
# convert feature columns to floats
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# evaluate algorithm
n_folds = 5
learn_rate = 0.3
n_epochs = 50
n_codebooks = 20
scores = evaluate_algorithm(dataset, learning_vector_quantization, n_folds, n_codebooks, learn_rate, n_epochs)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))
Running this example prints the classification accuracy on each fold and the mean classification accuracy across all folds. We can see that the mean accuracy of 87.143% is better than the baseline of 64.286%. We can also see that our library of 20 codebook vectors is far smaller than holding the entire training dataset.
Scores: [90.0, 88.57142857142857, 84.28571428571429, 87.14285714285714, 85.71428571428571]
Mean Accuracy: 87.143%