This is a good question, and one that seems to get raised time and time again.
A colleague (Sven Crone of Lancaster University in the UK) and I published a paper on this issue last year in the International Journal of Forecasting: "Instance sampling in credit scoring: An empirical study of sample size and balancing." A summary of our findings can also be found in the book "Credit Scoring, Response Modeling and Insurance Rating: A Practical Guide to Forecasting Consumer Behavior."
There are also some very good, highly cited papers by G. Weiss from 2004/5, which we reference in our paper and book.
What we found was that for some methods of model construction, class imbalance was not an issue at all. For logistic regression in particular, there was no benefit whatsoever to creating a balanced sample. What mattered far more was using all the data you had available. For example, in a marketing campaign with 1,000 responses and 50,000 non-responses, you got better models by using all 51,000 cases than by sampling the non-responses down to 1,000 or by weighting up the 1,000 responses.
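As a rough illustration of the three options just described (all data, down-sampled, and weighted), here is a minimal scikit-learn sketch. The synthetic data generator and all variable names are my own, and the sketch only shows how each option is set up; it says nothing about which wins on real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-in for the marketing example: 1,000 responders,
# 50,000 non-responders, two features that differ slightly by class.
n_resp, n_non = 1_000, 50_000
X = np.vstack([rng.normal(0.5, 1.0, size=(n_resp, 2)),
               rng.normal(0.0, 1.0, size=(n_non, 2))])
y = np.concatenate([np.ones(n_resp), np.zeros(n_non)])

# Option 1: fit on all 51,000 cases exactly as they are.
full = LogisticRegression(max_iter=1000).fit(X, y)

# Option 2: under-sample the non-responders down to 1,000.
keep = np.concatenate([np.where(y == 1)[0],
                       rng.choice(np.where(y == 0)[0], n_resp, replace=False)])
under = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])

# Option 3: keep every case but weight each responder up by the
# majority/minority ratio (50 here) via sample_weight.
w = np.where(y == 1, n_non / n_resp, 1.0)
weighted = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```

All three models produce scores on the same footing via `predict_proba`, so they can be compared on a holdout sample.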
We also looked at Neural Networks, Discriminant Analysis and Decision Trees. Discriminant Analysis was somewhat sensitive to the class imbalance (balanced samples performed better than imbalanced ones), but the method that was by far the most sensitive was the Decision Tree approach (CART 4.5). We saw differences in model performance of more than 10%, with the balanced sample performing much better than the imbalanced one.
We also considered the two different ways of creating a balanced sample. The first is "under-sampling," where you throw away some of the larger class (the non-responders in the above example) to create a sample with the same number of each class. The second is "over-sampling," where you weight up the minority class; in the above example, you would treat each response as if it appeared 50 times in your sample (the ratio of non-responses to responses). Over-sampling was generally better than under-sampling, particularly when small samples were involved, which makes intuitive sense: with under-sampling you are not making full use of the data available to you. Weiss comes to the same conclusion in his paper, if memory serves me right.
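In code, the two balancing schemes look like this (a minimal numpy sketch; the 1,000/50,000 split mirrors the hypothetical marketing example above, and the over-sampling weight is simply the majority/minority ratio):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical marketing data: 1,000 responders, 50,000 non-responders.
n_resp, n_non = 1_000, 50_000
y = np.concatenate([np.ones(n_resp, dtype=int), np.zeros(n_non, dtype=int)])

# Under-sampling: keep all responders and draw an equal-sized random
# subset of non-responders, discarding the rest of the majority class.
non_idx = rng.choice(np.where(y == 0)[0], size=n_resp, replace=False)
under_idx = np.concatenate([np.where(y == 1)[0], non_idx])
y_under = y[under_idx]

# Over-sampling via case weights: keep every record, but weight each
# responder by the majority/minority ratio (50 here) so that the two
# classes have equal weighted totals. No data are thrown away.
weights = np.where(y == 1, n_non / n_resp, 1.0)

print(len(y_under))                  # sample size after under-sampling
print(weights[y == 1].sum())         # weighted total of the minority class
```

Over-sampling can equivalently be done by physically replicating each minority record 50 times, but case weights achieve the same effect without inflating the dataset.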