1. Missing Data
Naive Bayes can handle missing data.
Attributes are handled separately by the algorithm at both model construction time and prediction time.
As such, if a data instance has a missing value for an attribute, it can be ignored while preparing the model, and ignored when a probability is calculated for a class value.
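As a minimal sketch of this idea, the snippet below estimates a conditional probability for one attribute while simply skipping rows where that attribute is missing. The data and the helper function are hypothetical, NumPy stand-ins rather than a particular library's API.

```python
import numpy as np

# Hypothetical training data; np.nan marks a missing attribute value.
X = np.array([[1.0, np.nan],
              [0.0, 1.0],
              [1.0, 0.0],
              [np.nan, 1.0]])
y = np.array([1, 0, 1, 0])

def conditional_probs(X, y, attr, cls):
    """P(attribute value | class), ignoring rows where the attribute is missing."""
    rows = (y == cls) & ~np.isnan(X[:, attr])
    values, counts = np.unique(X[rows, attr], return_counts=True)
    return dict(zip(values, counts / counts.sum()))

print(conditional_probs(X, y, attr=0, cls=1))  # {1.0: 1.0}
```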
2. Use Log Probabilities
Probabilities are often small numbers. To calculate joint probabilities, you need to multiply probabilities together. When you multiply one small number by another small number, you get a very small number.
It is possible to get into difficulty with the precision of your floating point values, such as underflow. To avoid this problem, work in the log probability space (take the logarithm of your probabilities).
This works because to make a prediction in Naive Bayes we only need to know which class has the largest probability (the rank), not the specific probability value.
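For example, the small sketch below contrasts multiplying raw probabilities, which can underflow, with summing their logarithms, which is numerically stable and preserves the ranking between classes. The probability values are hypothetical.

```python
import numpy as np

# Hypothetical per-attribute conditional probabilities for one class.
probs = np.array([1e-8, 3e-7, 5e-9, 2e-8])

product = np.prod(probs)           # risks underflow as the number of attributes grows
log_score = np.sum(np.log(probs))  # stable; preserves the ranking between classes

print(product, log_score)
```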
3. Use Other Distributions
To use Naive Bayes with categorical attributes, you calculate a frequency for each attribute value within each class.
To use Naive Bayes with real-valued attributes, you can summarize the density of the attribute using a Gaussian distribution. Alternatively you can use another functional form that better describes the distribution of the data, such as an exponential.
Don’t constrain yourself to the distributions used in examples of the Naive Bayes algorithm. Choose distributions that best characterize your data and prediction problem.
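As an illustration, the sketch below compares a Gaussian summary with an exponential fit for the same attribute; either density can supply the per-class likelihoods. It assumes SciPy, and the skewed attribute values are synthetic stand-ins.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a non-negative, skewed real-valued attribute of one class.
values = np.random.default_rng(1).exponential(scale=2.0, size=200)

# A Gaussian summary (mean, std) versus an exponential fit for the same attribute.
gaussian = stats.norm(values.mean(), values.std())
loc, scale = stats.expon.fit(values, floc=0)
exponential = stats.expon(loc, scale)

x = 0.5
print(gaussian.pdf(x), exponential.pdf(x))  # likelihoods used in place of frequencies
```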
4. Use Probabilities For Feature Selection
Feature selection is the selection of those data attributes that best characterize a predicted variable.
In Naive Bayes, the probabilities for each attribute are calculated independently from the training dataset. You can use a search algorithm to explore the combination of the probabilities of different attributes together and evaluate their performance at predicting the output variable.
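One minimal way to do this, sketched below, is an exhaustive search over small attribute subsets scored by cross-validation. The use of scikit-learn's GaussianNB and the iris dataset is an assumption for illustration, not part of the original recipe.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Exhaustive search over small attribute subsets, scored by cross-validation.
best = max(
    ((cols, cross_val_score(GaussianNB(), X[:, list(cols)], y, cv=5).mean())
     for r in (1, 2, 3)
     for cols in combinations(range(X.shape[1]), r)),
    key=lambda item: item[1],
)
print(best)  # (best attribute indices, mean accuracy)
```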
5. Segment The Data
Is there a well-defined subset of your data that responds well to the Naive Bayes probabilistic approach?
Identifying and separating out segments that are easily handled by a simple probabilistic approach like Naive Bayes can give you increased performance and let you focus on the elements of the problem that are more difficult to model.
Explore different subsets, such as the average or popular cases that are very likely handled well by Naive Bayes.
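A rough sketch of this idea, assuming scikit-learn and a purely hypothetical "average cases" segment defined by the interquartile range of one attribute:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical segment: the "average" cases, taken here as rows whose first
# attribute falls inside the interquartile range.
lo, hi = np.percentile(X[:, 0], [25, 75])
segment = (X[:, 0] >= lo) & (X[:, 0] <= hi)

print("all data:", cross_val_score(GaussianNB(), X, y, cv=5).mean())
print("segment :", cross_val_score(GaussianNB(), X[segment], y[segment], cv=5).mean())
```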
6. Re-compute Probabilities
Calculating the probabilities for each attribute is very fast.
This benefit of Naive Bayes means that you can re-calculate the probabilities as the data changes. This may be monthly, daily, or even hourly.
This is something that may be unthinkable for other algorithms, but should be tested when using Naive Bayes if there is some temporal drift in the problem being modeled.
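For example, with scikit-learn's GaussianNB (an assumed implementation choice) you can incrementally update or simply re-fit the model as each new batch of data arrives; the batches below are synthetic stand-ins for an hourly or daily feed.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
model = GaussianNB()
classes = np.array([0, 1])

# Update the per-attribute summaries as each new batch of data arrives.
for batch in range(3):
    X_new = rng.normal(loc=0.1 * batch, scale=1.0, size=(200, 4))
    y_new = rng.integers(0, 2, size=200)
    model.partial_fit(X_new, y_new, classes=classes)
```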
7. Use as a Generative Model
The Naive Bayes method characterizes the problem, which in turn can be used for making predictions about unseen data.
This probabilistic characterization can also be used to generate instances of the problem.
In the case of a numeric vector, the probability distributions can be sampled to create new fictitious vectors.
In the case of text (a very popular application of Naive Bayes), the model can be used to create fictitious input documents.
How might this be useful in your problem?
At the very least you can use the generative approach to help provide context for what the model has characterized.
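As a minimal sketch, the snippet below samples fictitious instances of one class from hypothetical per-attribute Gaussian summaries, exploiting the fact that each attribute is modeled independently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-attribute Gaussian summaries (mean, std) for one class,
# as estimated from the training data.
means = np.array([5.0, 1.2, 30.0])
stds = np.array([0.8, 0.3, 4.0])

# Sample each attribute independently to create fictitious instances of that class.
fake_instances = rng.normal(means, stds, size=(5, 3))
print(fake_instances)
```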
8. Remove Redundant Features
The performance of Naive Bayes can degrade if the data contains highly correlated features.
This is because highly correlated features are effectively counted twice in the model, over-inflating their importance.
Evaluate the correlation of attributes pairwise with each other using a correlation matrix and remove those features that are the most highly correlated.
Nevertheless, always test your problem before and after such a change and stick with the form of the problem that leads to the better results.
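A common way to do this, sketched below with a hypothetical 0.95 threshold, is to compute the absolute pairwise correlation matrix and drop one feature from every highly correlated pair. The dataset is just a convenient stand-in.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Pairwise absolute correlations; keep the upper triangle to avoid double counting.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair correlated above a (hypothetical) 0.95 threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
print(len(to_drop), "features dropped")
```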
9. Parallelize Probability Calculation
The probabilities for each attribute are calculated independently. This is the independence assumption in the approach and the reason why it has its name "naive".
You can exploit this assumption to further speed up the execution of the algorithm by calculating attribute probabilities in parallel.
Depending on the size of the dataset and your resources, you could do this using different CPUs, different machines or different clusters.
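As a rough sketch, the snippet below uses joblib to compute per-attribute Gaussian summaries in parallel across CPU cores; the data is synthetic and the summary function is hypothetical.

```python
import numpy as np
from joblib import Parallel, delayed

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))    # synthetic stand-in for training attributes
y = rng.integers(0, 2, size=100_000)

def summarize(attr):
    """Per-class Gaussian summary (mean, std) of a single attribute."""
    return {c: (X[y == c, attr].mean(), X[y == c, attr].std()) for c in (0, 1)}

# Each attribute is summarized independently, so the work parallelizes cleanly.
summaries = Parallel(n_jobs=-1)(delayed(summarize)(a) for a in range(X.shape[1]))
```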
10. Less Data Than You Think
Naive Bayes does not need a lot of data to perform well.
It needs enough data to understand the probabilistic relationship of each attribute in isolation with the output variable.
Given that interactions between attributes are ignored in the model, we do not need examples of these interactions and therefore generally need less data than other algorithms, such as logistic regression.
Further, it is less likely to overfit the training data with a smaller sample size.
Try Naive Bayes if you do not have much training data.
11. Zero Observations Problem
Naive Bayes will not be reliable if the attribute distributions at prediction time differ significantly from those seen in the training dataset.
An important example of this is the case where a categorical attribute has a value that was not observed in training. In this case, the model will assign a 0 probability and be unable to make a prediction.
These cases should be checked for and handled differently. After such cases have been resolved (an answer is known), the probabilities should be recalculated and the model updated.
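One common way to handle unseen values, sketched below on hypothetical counts, is add-one (Laplace) smoothing, which gives every possible value a small non-zero probability instead of zero.

```python
# Hypothetical counts of a categorical attribute's values within one class;
# "blue" was never observed in training.
counts = {"red": 12, "green": 7, "blue": 0}

k = 1  # add-one (Laplace) smoothing
total = sum(counts.values()) + k * len(counts)
probs = {value: (count + k) / total for value, count in counts.items()}
print(probs)  # "blue" now gets a small non-zero probability instead of 0
```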
12. It Works Anyway
An interesting point about Naive Bayes is that even when the independence assumption is violated and there are clear known relationships between attributes, it works anyway.
Importantly, this is one of the reasons why you need to spot check a variety of algorithms on a given problem, because the results may well surprise you.