About Naive Bayes

DB架构

Category: Python learning. Tags: probability theory, machine learning, artificial intelligence

Article link: https://blog.csdn.net/u011868279/article/details/125308529


We can use probability to make predictions. Perhaps the most widely used example is the Naive Bayes algorithm. This tutorial covers:

How to calculate the probabilities required by the Naive Bayes algorithm.
How to implement the Naive Bayes algorithm from scratch.
How to apply Naive Bayes to a real-world predictive modeling problem.

1.1 Descriptions

This section provides a brief overview of the Naive Bayes algorithm and the Iris flowers dataset that we will use.

1.1.1 Naive Bayes

Bayes’ Theorem provides a way to calculate the probability of a piece of data belonging to a given class, given our prior knowledge. Bayes’ Theorem is stated as:

$P(class|data) = \frac{P(data|class) \times P(class)}{P(data)}$

Where P(class|data) is the probability of class given the provided data. Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes or idiot Bayes because the calculations of the probabilities for each class are simplified to make them tractable.

Rather than attempting to calculate the probabilities of each attribute value, they are assumed to be conditionally independent given the class value. This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.
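The independence assumption can be written out explicitly. As a sketch, using $x_1, \dots, x_n$ as assumed names for the attribute values of one record (the notation is not from the original text), the joint likelihood factorizes into a product of per-attribute terms:

$P(x_1, x_2, \dots, x_n|class) = \prod_{i=1}^{n} P(x_i|class)$

Substituting this into Bayes’ Theorem gives a quantity proportional to the probability of the class:

$P(class|x_1, \dots, x_n) \propto P(class) \times \prod_{i=1}^{n} P(x_i|class)$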

1.1.2 Iris Flower Species Dataset

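The case study uses the Iris Flower Species Dataset, which involves predicting the species of an iris flower from measurements of the flower. The dataset has 150 records split evenly across three species, so the baseline performance of always predicting a single class is approximately 33%. Download the dataset and save it into your current working directory with the filename iris.csv.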
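One way to sanity-check a baseline figure is to compute the accuracy of always predicting the most frequent class. The sketch below assumes three equally represented species with 50 records each; the function name zero_rule_baseline() and the label strings are hypothetical and are not part of the tutorial code.

```python
from collections import Counter

# Accuracy (in percent) of always predicting the most frequent class
def zero_rule_baseline(labels):
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels) * 100.0

# Hypothetical label column: three species, 50 records each
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
print("%.1f%%" % zero_rule_baseline(labels))
```

Any classifier worth using on this dataset should score well above the value this prints.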

1.2 Tutorial

This tutorial is broken down into 6 parts:

Separate By Class
Summarize Dataset
Summarize Data By Class
Gaussian Probability Density Function
Class Probabilities
Iris Flower Species Case Study

1.2.1 Separate By Class

We will need to calculate the probability of data by the class they belong to. This means that we will first need to separate our training data by class. A relatively straightforward operation. We can create a dictionary object where each key is the class value and then add a list of all the records as the value in the dictionary. Below is a function named separate_by_class() that implements this approach. It assumes that the last column in each row is the class value.

# Function To Separate Rows By Class Value
# Split  the  dataset by class values, return a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

We can contrive a small dataset to test out this function.

X1          X2              Y
3.393533211 2.331273381     0
3.110073483 1.781539638     0
1.343808831 3.368360954     0
3.582294042 4.67917911      0
2.280362439 2.866990263     0
7.423436942 4.696522875     1
5.745051997 3.533989803     1
9.172168622 2.511101045     1
7.792783481 3.424088941     1
7.939820817 0.791637231     1

Putting this all together, we can test our separate_by_class() function on the contrived dataset.

# Example of separating data by class value

# Split the dataset by class values, return a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

# Test separating data by class
dataset = [[3.393533211,2.331273381,0],
[3.110073483,1.781539638,0],
[1.343808831,3.368360954,0],
[3.582294042,4.67917911,0],
[2.280362439,2.866990263,0],
[7.423436942,4.696522875,1],
[5.745051997,3.533989803,1],
[9.172168622,2.511101045,1],
[7.792783481,3.424088941,1],
[7.939820817,0.791637231,1]]
separated = separate_by_class(dataset)
for label in separated:
    print(label)
    for row in separated[label]:
        print(row)
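As an alternative sketch, the same grouping can be written with collections.defaultdict, which removes the explicit membership check. The name separate_by_class_dd() and the three-row sample are hypothetical and used only for illustration; like the original, the function assumes the class value is the last item in each row.

```python
from collections import defaultdict

# Group rows by their last column (assumed to be the class value)
def separate_by_class_dd(dataset):
    separated = defaultdict(list)
    for row in dataset:
        separated[row[-1]].append(row)
    # Convert back to a plain dict so missing keys raise KeyError as usual
    return dict(separated)

sample = [[1.0, 2.0, 0], [3.0, 4.0, 1], [5.0, 6.0, 0]]
groups = separate_by_class_dd(sample)
for label in groups:
    print(label, groups[label])
```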

1.2.2 Summarize Dataset

We need two statistics from a given set of data: the mean and the standard deviation. We'll see how these are used in the calculation of probabilities in a few steps. The mean is the average value and can be calculated as:

$mean = \frac{\sum_{i=1}^{n}x_{i}}{count(x)}$

Where x is the list of values or the column we are looking at and i is the index of a specific value. Below is a small function named mean() that calculates the mean of a list of numbers.

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/float(len(numbers))

The sample standard deviation measures the typical difference of the values from the mean value. It can be calculated as:

$standard\ deviation = \sqrt{\frac{\sum_{i=1}^{n}(x_{i} - mean(x))^{2}}{count(x) - 1}}$

You can see that we square the difference between the mean and a given value, calculate the average squared difference from the mean (dividing by count(x) - 1 for the sample estimate), then take the square root to return the units back to their original scale. Below is a small function named stdev() that calculates the standard deviation of a list of numbers. You will notice that it calculates the mean. It might be more efficient to calculate the mean of a list of numbers once and pass it to the stdev() function as a parameter. You can explore this optimization later if you’re interested.

# Function To Calculate the Standard Deviation of a List of Numbers
from math import sqrt

# Calculate the standard  deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)** 2 for x in numbers]) / float(len(numbers)-1)
    return sqrt(variance)
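As a quick sanity check, the two helper functions can be compared against Python's standard statistics module, whose stdev() also computes the sample standard deviation (dividing by n - 1). This is only a verification sketch; the list x1 holds the class 0 values of X1 from the contrived dataset.

```python
from math import sqrt, isclose
import statistics

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers) / float(len(numbers))

# Calculate the sample standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return sqrt(variance)

x1 = [3.393533211, 3.110073483, 1.343808831, 3.582294042, 2.280362439]
# Both comparisons should agree up to floating point rounding
print(isclose(mean(x1), statistics.mean(x1)))
print(isclose(stdev(x1), statistics.stdev(x1)))
```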

We require the mean and standard deviation statistics to be calculated for each input attribute or each column of our data. We can do that by gathering all of the values for each column into a list and calculating the mean and standard deviation on that list. Once calculated, we can gather the statistics together into a list or tuple of statistics. Then, repeat this operation for each column in the dataset and return a list of tuples of statistics.

Below is a function named summarize_dataset() that implements this approach. It uses some Python tricks to cut down on the number of lines required.

# Function To Summarize Each Column in a dataset

# Calculate the mean , stdev and count for each column in a dataset.
def summarize_dataset(dataset):
    summaries =[(mean(column),stdev(column),len(column)) for column in zip(*dataset)]
    del(summaries[-1])
    return summaries

The first trick is the use of the zip() function, which aggregates elements from each provided argument. We pass the dataset to zip() with the * operator, which unpacks the dataset so that each row becomes a separate argument. The zip() function then iterates over each element of each row and returns each column of the dataset as a sequence of numbers. A clever little trick.
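A minimal illustration of the transposition, using a made-up three-row dataset (the variable names rows and columns are hypothetical). Note that in Python 3 each column comes back as a tuple rather than a list, which is fine for mean() and stdev() since they only iterate and take the length.

```python
# Three rows of (X1, X2, class); values are invented for illustration
rows = [[1.0, 10.0, 0],
        [2.0, 20.0, 1],
        [3.0, 30.0, 0]]

# zip(*rows) is equivalent to zip(rows[0], rows[1], rows[2])
columns = list(zip(*rows))
for column in columns:
    print(column)
```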

We then calculate the mean, standard deviation and count of rows in each column. A tuple is created from these 3 numbers and a list of these tuples is stored. We then remove the statistics for the class variable, as we will not need them. Let's test all of these functions on our contrived dataset from above. Below is the complete example.

# Example of summarizing a dataset
from math import sqrt
# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/ float(len(numbers))

# Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
    return sqrt(variance)

# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
    summaries = [(mean(column),stdev(column),len(column)) for column in zip(*dataset)]
    del(summaries[-1])
    return summaries

# Test summarizing a dataset
dataset = [[3.393533211,2.331273381,0],
[3.110073483,1.781539638,0],
[1.343808831,3.368360954,0],
[3.582294042,4.67917911,0],
[2.280362439,2.866990263,0],
[7.423436942,4.696522875,1],
[5.745051997,3.533989803,1],
[9.172168622,2.511101045,1],
[7.792783481,3.424088941,1],
[7.939820817,0.791637231,1]]

summary = summarize_dataset(dataset)
print(summary)

Running the example prints out the list of tuples of statistics on each of the two input variables. Interpreting the results, we can see that the mean value of X1 is 5.178333386499999 and the standard deviation of X1 is 2.7665845055177263.

Now we are ready to use these functions on each group of rows in our dataset.

1.2.3 Summarize Data By Class

We require statistics from our training dataset organized by class. Above, we developed the separate_by_class() function to separate a dataset into rows by class, and the summarize_dataset() function to calculate summary statistics for each column.

We can put all of this together and summarize the columns in the dataset organized by class values. Below is a function named summarize_by_class() that implements this operation. The dataset is first split by class, then statistics are calculated on each subset. The results, in the form of a list of tuples of statistics, are then stored in a dictionary by their class value.

# Split dataset by class then calculate statistics for each row
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries

Again, let's test out all of these behaviours on our contrived dataset.

# Example of summarizing data by class value
from math import sqrt

# Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/float(len(numbers))

# Calculate the standard deviation of a list of numbers

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)**2 for x in numbers])/ float(len(numbers)-1)
    return sqrt(variance)

# Calculate the mean,stdev and count for each column in a dataset.
def summarize_dataset(dataset):
    summaries = [(mean(column),stdev(column),len(column)) for column in zip(*dataset)]
    del(summaries[-1])
    return summaries

# Split  dataset by class then calculate statistics for each row
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value,rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries

# Test summarizing by class
dataset = [[3.393533211,2.331273381,0],
[3.110073483,1.781539638,0],
[1.343808831,3.368360954,0],
[3.582294042,4.67917911,0],
[2.280362439,2.866990263,0],
[7.423436942,4.696522875,1],
[5.745051997,3.533989803,1],
[9.172168622,2.511101045,1],
[7.792783481,3.424088941,1],
[7.939820817,0.791637231,1]]
summary = summarize_by_class(dataset)
for label in summary:
    print(label)
    for row in summary[label]:
        print(row)

Running this example calculates the statistics for each input variable and prints them organized by class value. Interpreting the results, we can see that the X1 values for rows for class 0 have a mean value of 2.7420144012.

There is one more piece we need before we start calculating probabilities.

1.2.4 Gaussian Probability Density Function

Calculating the probability or likelihood of observing a given real-value like X1 is difficult. One way we can do this is to assume that X1 values are drawn from a distribution, such as a bell curve or Gaussian distribution.

A Gaussian distribution can be summarized using only two numbers: the mean and the standard deviation. Therefore, with a little math, we can estimate the probability of a given value. This piece of math is called the Gaussian Probability Density Function (or Gaussian PDF) and can be calculated as:

$f(x) = \frac{1}{\sqrt{2\pi} \times \sigma} \times e^{-\frac{(x - mean)^{2}}{2\sigma^{2}}}$

Where $\sigma$ is the standard deviation for x and mean is the mean for x. Below is a function that implements this. I tried to split it up to make it more readable.

# Calculate the Gaussian probability density function for x
from math import sqrt, pi, exp

def calculate_probability(x, mean, stdev):
    exponent = exp(-((x-mean)**2 / (2 * stdev**2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

Let's test it out to see how it works. Below are some worked examples.

# Example of Gaussian PDF
from math import sqrt
from math import pi
from math import exp

# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x-mean)**2 / (2 * stdev**2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Test Gaussian PDF
print(calculate_probability(1.0, 1.0, 1.0))
print(calculate_probability(2.0, 1.0, 1.0))
print(calculate_probability(0.0, 1.0, 1.0))

Running it prints the probability of some input values. You can see that when the value is 1 and the mean and standard deviation is 1 our input is the most likely (top of the bell curve) and has the probability of 0.39. We can see that when we keep the statistics the same and change the x value to 1 standard deviation either side of the mean value (2 and 0 or the same distance either side of the bell curve) the probabilities of those input values are the same at 0.24.
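As a cross-check, the hand-written function can be compared with statistics.NormalDist from the Python standard library (available from Python 3.8), whose pdf() method evaluates the same density. This is a verification sketch only; the names ours and reference are chosen here for illustration.

```python
from math import sqrt, pi, exp, isclose
from statistics import NormalDist

# Gaussian probability density function, as defined in the tutorial
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

for x in (1.0, 2.0, 0.0):
    ours = calculate_probability(x, 1.0, 1.0)
    reference = NormalDist(mu=1.0, sigma=1.0).pdf(x)
    # Print the value, our density, and whether the two implementations agree
    print(x, round(ours, 4), isclose(ours, reference))
```

Strictly speaking these values are densities rather than probabilities, which is why they can be compared across classes but do not need to sum to 1.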

Now that we have all the pieces in place, let’s see how we can calculate the probabilities we need for the Naive Bayes algorithm.

1.2.5 Class Probabilities

We now use the statistics calculated from our training data to calculate probabilities for new data. Probabilities are calculated separately for each class. This means that we first calculate the probability that a new piece of data belongs to the first class, then calculate the probability that it belongs to the second class, and so on for all the classes. The probability that a piece of data belongs to a class is calculated as follows:

$P(class|data) = P(data|class) \times P(class)$

You may note that this is different from the Bayes Theorem described above. The division has been removed to simplify the calculation. This means that the result is no longer strictly a probability of the data belonging to a class. The value is still maximized, meaning that the calculation for the class that results in the largest value is taken as the prediction. This is a common implementation simplification as we are often more interested in the class prediction rather than the probability.
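If actual probabilities are wanted, the division can be restored by normalizing the per-class scores so that they sum to 1. The sketch below is one way to do this; normalize_scores() is a hypothetical helper and the score values are invented for illustration. Dividing every score by the same total does not change which class has the largest value.

```python
# Rescale unnormalized class scores so that they sum to 1
def normalize_scores(scores):
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Invented scores for two classes, for illustration only
scores = {0: 0.05, 1: 0.0001}
probabilities = normalize_scores(scores)
print(probabilities)
# The predicted class is the same before and after normalizing
print(max(scores, key=scores.get) == max(probabilities, key=probabilities.get))
```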

The input variables are treated separately, giving the technique its name, naive. For the above example where we have 2 input variables, the probability that a row belongs to the first class 0 can be calculated as:

$P(class=0|X1,X2) = P(X1|class=0) \times P(X2|class=0) \times P(class=0)$

Now you can see why we need to separate the data by class value. The Gaussian Probability Density function in the previous step is how we calculate the probability of a real value like X1, and the statistics we prepared are used in this calculation. Below is a function named calculate_class_probabilities() that ties all of this together. It takes a set of prepared summaries and a new row as input arguments.

First, the total number of training records is calculated from the counts stored in the summary statistics. This is used in the calculation of the probability of a given class, or P(class), as the ratio of rows with a given class to all rows in the training data.

Next, probabilities are calculated for each input value in the row using the Gaussian probability density function and the statistics for that column and of that class. Probabilities are multiplied together as they are accumulated. This process is repeated for each class in the dataset. Finally, a dictionary of probabilities is returned with one entry for each class.

# Calculate the probabilities of predicting each class for a given row

def calculate_class_probabilities(summaries, row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
        for i in range(len(class_summaries)):
            mean, stdev, count = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i],mean,stdev)
    return probabilities
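Multiplying many small densities can underflow to zero when there are many input columns. A common variation, sketched below as an optional alternative and not as part of the tutorial's method, is to add log densities instead of multiplying densities. The function names log_gaussian() and calculate_class_log_scores() are hypothetical, and the summaries are rounded, illustrative statistics in the same (mean, stdev, count) layout used above.

```python
from math import log, pi

# Natural log of the Gaussian probability density at x
def log_gaussian(x, mean, stdev):
    return -log(stdev) - 0.5 * log(2 * pi) - ((x - mean) ** 2) / (2 * stdev ** 2)

# Sum of log prior and log densities for each class
def calculate_class_log_scores(summaries, row):
    total_rows = sum(class_summaries[0][2] for class_summaries in summaries.values())
    scores = dict()
    for class_value, class_summaries in summaries.items():
        scores[class_value] = log(class_summaries[0][2] / float(total_rows))
        for i, (mean, stdev, _) in enumerate(class_summaries):
            scores[class_value] += log_gaussian(row[i], mean, stdev)
    return scores

# Illustrative, rounded summaries: one (mean, stdev, count) tuple per input column
summaries = {0: [(2.74, 0.93, 5), (3.01, 1.11, 5)],
             1: [(7.61, 1.23, 5), (2.99, 1.45, 5)]}
row = [3.39, 2.33]
log_scores = calculate_class_log_scores(summaries, row)
print(log_scores)
print(max(log_scores, key=log_scores.get))
```

Because the logarithm is monotonic, the class with the largest log score is the same class that would have the largest product.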

Let's tie this together with an example on the contrived dataset. The example below first calculates the summary statistics by class for the training dataset, then uses these statistics to calculate the probability of the first record belonging to each class.

# Example of calculating class probabilities
from math import sqrt
from math import pi
from math import exp

# Split the dataset by class values,returns a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/float(len(numbers))

# Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)**2 for x in numbers])/float(len(numbers)-1)
    return sqrt(variance)

# Calculate the mean ,stdev and count for each column in a dataset
def summarize_dataset(dataset):
    summaries = [(mean(column),stdev(column),len(column)) for column in zip(*dataset)]
    del(summaries[-1])
    return summaries

# Split dataset by class then calculate statistics for each row
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value,rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries

# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
        for i in range(len(class_summaries)):
            # Unpack the count into a throwaway name so the input row is not overwritten
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

# Test calculating class probabilities
dataset = [[3.393533211,2.331273381,0],
[3.110073483,1.781539638,0],
[1.343808831,3.368360954,0],
[3.582294042,4.67917911,0],
[2.280362439,2.866990263,0],
[7.423436942,4.696522875,1],
[5.745051997,3.533989803,1],
[9.172168622,2.511101045,1],
[7.792783481,3.424088941,1],
[7.939820817,0.791637231,1]]

summaries = summarize_by_class(dataset)
probabilities = calculate_class_probabilities(summaries,dataset[0])
print(probabilities)

Running the example prints the probabilities calculated for each class. The first row has class label 0, and the score for class 0 (about 0.0503) is much higher than the score for class 1 (about 0.0001). We would therefore correctly conclude that it belongs to the 0 class.

1.2.6 Iris Flower Species Case Study

This section applies the Naive Bayes algorithm to the Iris flowers dataset in two steps:

Load the dataset and convert the loaded data to numbers that we can use with the mean and standard deviation calculations. We use load_csv() to load the file, str_column_to_float() to convert string numbers to floats, and str_column_to_int() to convert the class column to integer values.
Evaluate the algorithm using k-fold cross-validation with 5 folds. This means that 150/5 = 30 records will be in each fold. We use evaluate_algorithm() to evaluate the algorithm with cross-validation and accuracy_metric() to calculate the accuracy of predictions. predict() manages the calculation of the probabilities of a new row belonging to each class and selects the class with the largest probability value. naive_bayes() manages the application of the Naive Bayes algorithm, first learning the statistics from a training dataset and then using them to make predictions for a test dataset.

The complete example is listed below:

# Naive Bayes on The Iris Dataset
from csv import reader
from random import randrange
from math import sqrt
from math import exp
from math import pi

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename,'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
        
# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset,algorithm,n_folds,*args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual,predicted)
        scores.append(accuracy)
    return scores

# Split the dataset by class values , return a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/float(len(numbers))

# Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
    return sqrt(variance)


# Calculate the mean, stdev and count for each column in a dataset

def summarize_dataset(dataset):
    summaries = [(mean(column),stdev(column),len(column)) for column in zip(*dataset)]
    del(summaries[-1])
    return summaries

# Split dataset by class then calculate statistics for each row
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries

# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x-mean)**2 / (2 * stdev**2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries,row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2] / float(total_rows)
        for i in range(len(class_summaries)):
            # Unpack the count into a throwaway name so the input row is not overwritten
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

# Predict the class for a given row
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries,row)
    best_label,best_prob = None,-1
    for class_value,probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

# Naive Bayes Algorithm
def naive_bayes(train, test):
    summarize = summarize_by_class(train)
    predictions = list()
    for row in test:
        output = predict(summarize, row)
        predictions.append(output)
    return (predictions)

# Test Naive Bayes on Iris Dataset
filename = 'iris.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset,i)

# convert  class column to integers
str_column_to_int(dataset, len(dataset[0])-1)

# evaluate algorithm
n_folds = 5
scores = evaluate_algorithm(dataset, naive_bayes,n_folds)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

Running the example prints the classification accuracy on each cross-validation fold as well as the mean accuracy. Because the folds are drawn at random without a fixed seed, the exact scores vary from run to run. Gaussian Naive Bayes typically reaches a mean accuracy in the mid-90s percent on Iris, far above the baseline of roughly 33%. A mean accuracy near 34% is no better than the baseline and usually indicates that the input values are being overwritten when the summaries are unpacked in calculate_class_probabilities().
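To make the fold assignment, and therefore the reported scores, repeatable between runs, the random number generator can be seeded before splitting. The sketch below reuses the cross_validation_split() logic from the listing on a small made-up dataset; the seed value 1 is an arbitrary choice for illustration.

```python
from random import seed, randrange

# Split a dataset into k folds (same logic as in the listing above)
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for _ in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Made-up dataset of ten single-value rows
data = [[i] for i in range(10)]

seed(1)  # arbitrary seed, chosen only for illustration
first = cross_validation_split(data, 5)
seed(1)  # reseeding with the same value reproduces the same folds
second = cross_validation_split(data, 5)
print(first == second)
```

In the full example, a single seed() call placed before evaluate_algorithm() would have the same effect.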