1. Missing Data
Naive Bayes can handle missing data.
Attributes are handled separately by the algorithm at both model construction time and prediction time.
As such, if a data instance has a missing value for an attribute, it can be ignored while preparing the model, and ignored when a probability is calculated for a class value.
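As a minimal sketch of this idea, the snippet below estimates a conditional probability for one attribute while simply skipping rows where that attribute is missing. The data and the helper function are hypothetical, NumPy stand-ins rather than a particular library's API.

```python
import numpy as np

# Hypothetical training data; np.nan marks a missing attribute value.
X = np.array([[1.0, np.nan],
              [0.0, 1.0],
              [1.0, 0.0],
              [np.nan, 1.0]])
y = np.array([1, 0, 1, 0])

def conditional_probs(X, y, attr, cls):
    """P(attribute value | class), ignoring rows where the attribute is missing."""
    rows = (y == cls) & ~np.isnan(X[:, attr])
    values, counts = np.unique(X[rows, attr], return_counts=True)
    return dict(zip(values, counts / counts.sum()))

print(conditional_probs(X, y, attr=0, cls=1))  # {1.0: 1.0}
```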
2. Use Log Probabilities
Probabilities are often small numbers. To calculate joint probabilities, you need to multiply probabilities together. When you multiply one small number by another small number, you get a very small number.
It is possible to get into difficulty with the precision of your floating point values, such as underflow. To avoid this problem, work in the log probability space (take the logarithm of your probabilities).
This works because to make a prediction in Naive Bayes we only need to know which class has the largest probability (the rank), not the specific probability value.
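For example, the small sketch below contrasts multiplying raw probabilities, which can underflow, with summing their logarithms, which is numerically stable and preserves the ranking between classes. The probability values are hypothetical.

```python
import numpy as np

# Hypothetical per-attribute conditional probabilities for one class.
probs = np.array([1e-8, 3e-7, 5e-9, 2e-8])

product = np.prod(probs)           # risks underflow as the number of attributes grows
log_score = np.sum(np.log(probs))  # stable; preserves the ranking between classes

print(product, log_score)
```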
3. Use Other Distributions
To use Naive Bayes with categorical attributes, you calculate a frequency for each attribute value within each class.
To use Naive Bayes with real-valued attributes, you can summarize the density of the attribute using a Gaussian distribution. Alternatively you can use another functional form that better describes the distribution of the data, such as an exponential.
Don’t constrain yourself to the distributions used in examples of the Naive Bayes algorithm. Choose distributions that best characterize your data and prediction problem.
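As an illustration, the sketch below compares a Gaussian summary with an exponential fit for the same attribute; either density can supply the per-class likelihoods. It assumes SciPy, and the skewed attribute values are synthetic stand-ins.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a non-negative, skewed real-valued attribute of one class.
values = np.random.default_rng(1).exponential(scale=2.0, size=200)

# A Gaussian summary (mean, std) versus an exponential fit for the same attribute.
gaussian = stats.norm(values.mean(), values.std())
loc, scale = stats.expon.fit(values, floc=0)
exponential = stats.expon(loc, scale)

x = 0.5
print(gaussian.pdf(x), exponential.pdf(x))  # likelihoods used in place of frequencies
```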
4. Use Probabilities For Feature Selection
Feature selection is the selection of those data attributes that best characterize a predicted variable.
In Naive Bayes, the probabilities for each attribute are calculated independently from the training dataset. You can use a search algorithm to explore the combination of the probabilities of different attributes together and evaluate their performance at predicting the output variable.
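One minimal way to do this, sketched below, is an exhaustive search over small attribute subsets scored by cross-validation. The use of scikit-learn's GaussianNB and the iris dataset is an assumption for illustration, not part of the original recipe.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Exhaustive search over small attribute subsets, scored by cross-validation.
best = max(
    ((cols, cross_val_score(GaussianNB(), X[:, list(cols)], y, cv=5).mean())
     for r in (1, 2, 3)
     for cols in combinations(range(X.shape[1]), r)),
    key=lambda item: item[1],
)
print(best)  # (best attribute indices, mean accuracy)
```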
5. Segment The Data
Is there a well-defined subset of your data that responds well to the Naive Bayes probabilistic approach?
Identifying and separating out segments that are easily handled by a simple probabilistic approach like Naive Bayes can give you increased performance and let you focus on the elements of the problem that are more difficult to model.
Explore different subsets, such as the average or popular cases that are very likely handled well by Naive Bayes.
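A rough sketch of this idea, assuming scikit-learn and a purely hypothetical "average cases" segment defined by the interquartile range of one attribute:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical segment: the "average" cases, taken here as rows whose first
# attribute falls inside the interquartile range.
lo, hi = np.percentile(X[:, 0], [25, 75])
segment = (X[:, 0] >= lo) & (X[:, 0] <= hi)

print("all data:", cross_val_score(GaussianNB(), X, y, cv=5).mean())
print("segment :", cross_val_score(GaussianNB(), X[segment], y[segment], cv=5).mean())
```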
6. Re-compute Probabilities
Calculating the probabilities for each attribute is very fast.
This benefit of Naive Bayes means that you can re-calculate the probabilities as the data changes. This may be monthly, daily, or even hourly.
This is something that may be unthinkable for other algorithms, but should be tested when using Naive Bayes if there is some temporal drift in the problem being modeled.
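For example, with scikit-learn's GaussianNB (an assumed implementation choice) you can incrementally update or simply re-fit the model as each new batch of data arrives; the batches below are synthetic stand-ins for an hourly or daily feed.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
model = GaussianNB()
classes = np.array([0, 1])

# Update the per-attribute summaries as each new batch of data arrives.
for batch in range(3):
    X_new = rng.normal(loc=0.1 * batch, scale=1.0, size=(200, 4))
    y_new = rng.integers(0, 2, size=200)
    model.partial_fit(X_new, y_new, classes=classes)
```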
7. Use as a Generative Model
The Naive Bayes method characterizes the problem, which in turn can be used for making predictions about unseen data.
This probabilistic characterization can also be used to generate instances of the problem.
In the case of a numeric vector, the probability distributions can be sampled to create new fictitious vectors.
In the case of text (a very popular application of Naive Bayes), the model can be used to create fictitious input documents.
How might this be useful in your problem?
At the very least you can use the generative approach to help provide context for what the model has characterized.
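As a minimal sketch, the snippet below samples fictitious instances of one class from hypothetical per-attribute Gaussian summaries, exploiting the fact that each attribute is modeled independently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-attribute Gaussian summaries (mean, std) for one class,
# as estimated from the training data.
means = np.array([5.0, 1.2, 30.0])
stds = np.array([0.8, 0.3, 4.0])

# Sample each attribute independently to create fictitious instances of that class.
fake_instances = rng.normal(means, stds, size=(5, 3))
print(fake_instances)
```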
8. Remove Redundant Features
The performance of Naive Bayes can degrade if the data contains highly correlated features.
This is because highly correlated features are effectively counted twice in the model, over-inflating their importance.
Evaluate the correlation of attributes pairwise with each other using a correlation matrix and remove those features that are the most highly correlated.
Nevertheless, always test your problem before and after such a change and stick with the form of the problem that leads to the better results.
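A common way to do this, sketched below with a hypothetical 0.95 threshold, is to compute the absolute pairwise correlation matrix and drop one feature from every highly correlated pair. The dataset is just a convenient stand-in.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Pairwise absolute correlations; keep the upper triangle to avoid double counting.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair correlated above a (hypothetical) 0.95 threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
print(len(to_drop), "features dropped")
```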
9. Parallelize Probability Calculation
The probabilities for each attribute are calculated independently. This is the independence assumption in the approach and the reason why it has its name "naive".
You can exploit this assumption to further speed up the execution of the algorithm by calculating attribute probabilities in parallel.
Depending on the size of the dataset and your resources, you could do this using different CPUs, different machines or different clusters.
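As a rough sketch, the snippet below uses joblib to compute per-attribute Gaussian summaries in parallel across CPU cores; the data is synthetic and the summary function is hypothetical.

```python
import numpy as np
from joblib import Parallel, delayed

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))    # synthetic stand-in for training attributes
y = rng.integers(0, 2, size=100_000)

def summarize(attr):
    """Per-class Gaussian summary (mean, std) of a single attribute."""
    return {c: (X[y == c, attr].mean(), X[y == c, attr].std()) for c in (0, 1)}

# Each attribute is summarized independently, so the work parallelizes cleanly.
summaries = Parallel(n_jobs=-1)(delayed(summarize)(a) for a in range(X.shape[1]))
```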
10. Less Data Than You Think
Naive Bayes does not need a lot of data to perform well.
It needs enough data to understand the probabilistic relationship of each attribute in isolation with the output variable.
Given that interactions between attributes are ignored in the model, we do not need examples of these interactions and therefore generally need less data than other algorithms, such as logistic regression.
Further, it is less likely to overfit the training data with a smaller sample size.
Try Naive Bayes if you do not have much training data.
11. Zero Observations Problem
Naive Bayes will not be reliable if the attribute distributions at prediction time differ significantly from those seen in the training dataset.
An important example of this is the case where a categorical attribute has a value that was not observed in training. In this case, the model will assign a 0 probability and be unable to make a prediction.
These cases should be checked for and handled differently. After such cases have been resolved (an answer is known), the probabilities should be recalculated and the model updated.
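One common way to handle unseen values, sketched below on hypothetical counts, is add-one (Laplace) smoothing, which gives every possible value a small non-zero probability instead of zero.

```python
# Hypothetical counts of a categorical attribute's values within one class;
# "blue" was never observed in training.
counts = {"red": 12, "green": 7, "blue": 0}

k = 1  # add-one (Laplace) smoothing
total = sum(counts.values()) + k * len(counts)
probs = {value: (count + k) / total for value, count in counts.items()}
print(probs)  # "blue" now gets a small non-zero probability instead of 0
```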
12. It Works Anyway
An interesting point about Naive Bayes is that even when the independence assumption is violated and there are clear known relationships between attributes, it works anyway.
Importantly, this is one of the reasons why you need to spot check a variety of algorithms on a given problem, because the results may well surprise you.