Lots of businesses try using regression models. The basic idea is great – use the value of one or more variables you may be able to control to predict the value of another variable [most likely a metric] that you'd like to optimize (probably get as high as possible).
In its basic form – a regression model is simply the equation for a line drawn through a scatter plot. Each dot represents one instance of a variable's response given the state of another variable (in a simple regression). For a multiple regression you can't really see a scatter plot, because there are multiple inputs to represent.
The special thing about the line is that it's the one that minimizes the total squared vertical distance between the dots and the line (hence "least squares").
Above – think of the green dots as actual points where one variable met the other (e.g. sales was $5,000 and marketing spend was $2,000 – that's one dot, with sales and spend on different axes). The maroon line minimizes the distance between the points and itself. How does it do this, you wonder?
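For a simple regression y = a + bx, setting the partial derivatives of the summed squared distances to zero gives the standard closed forms for the slope and intercept:

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the means of the observed values.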
http://mathworld.wolfram.com/LeastSquaresFitting.html
Well – you need some partial derivatives and fairly sophisticated linear algebra. I doubt most applied math professors can derive the equations without referring to them beforehand. In reality you use MS Excel, R, SAS, or some other statistical analysis software. Below is a video by Khan Academy that gives some solid grounding in the concept.
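If you'd rather not reach for Excel or SAS, a few lines of Python do the same job. Here's a minimal sketch using NumPy's `polyfit` – the spend and sales figures are made up purely for illustration:

```python
import numpy as np

# Hypothetical monthly data: marketing spend vs. sales (illustrative numbers only).
spend = np.array([1000, 1500, 2000, 2500, 3000, 3500], dtype=float)
sales = np.array([3100, 3900, 5200, 5800, 7100, 7900], dtype=float)

# Fit y = a + b*x by ordinary least squares; polyfit returns [b, a] for degree 1.
b, a = np.polyfit(spend, sales, deg=1)

# Predict sales for a $2,000 marketing spend.
predicted = a + b * 2000
```

With these toy numbers the fit suggests roughly $2 of sales per $1 of spend – but as the rest of this post argues, the line is an estimate, not a promise.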
So great – why aren't we using regressions to predict everything about a business – precisely forecast sales, budgets, resource allocation, and the whole nine yards?
Well – think about that line for a second. How many points are actually on the line? There's a good chance that in a fairly sparse plot, only a handful are. If you spend $2,000 on marketing and come up with only $4,000 in sales instead of $5,000…the model can say nothing about this shortfall. If the model is only giving you estimates – might you fare better just looking at what happened in the recent past and making an educated guess?
PROBLEMS WITH REGRESSION MODELS:
- The relationship between the variables must be linear – you're not going to find a good y = a + bx model when y actually varies with x^2, since that relationship is curved (quadratic, not linear). The scatterplot usually tells you whether the relationship is linear. Common sense prevails most of the time though.
- There's no way to truly know whether the variable you're using as a predictor actually drives the variable you're trying to predict. This is the old correlation vs. causation issue. Sometimes the relationship is obvious, but in a multiple regression model it may not be as clear-cut, and some of the predictors may be correlated with each other (multicollinearity), which throws off some key assumptions.
- The model error (the distance between the points and the line, more or less) should be normally distributed. This means outliers can ruin the model depending on their position on the plot. Are those outliers relevant? Your call.
- Regression models use historical data to predict future data – in a growing business many things are changing, including marketing spend, competition, margins, macroeconomic factors, site design, etc. Looking at a regression model to forecast sales may be less useful than looking at the same month last year and multiplying it by the % you want to grow by.
- Before statistical software packages, running a regression was a laborious process (remember that thing about partial derivatives and matrices?). People only inspected relationships that were really 'worth inspecting'. Nowadays there's quite a bit of data snooping and simply running regressions because one can. The more variables you inspect, the odds of stumbling on a reasonably good model are, well…reasonably good!
- Again – they are imprecise. While the algorithm minimizes the distance between the points and the line…the line cannot go through all the points, since it's impossible to account for every condition behind each particular observation in the data set (e.g. the site went down one day but ad spend did not).
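To see the linearity problem from the first bullet concretely, here's a tiny sketch (with made-up data) where the relationship is purely quadratic and a straight line fails completely – the fitted slope is essentially zero and the line explains none of the variance:

```python
import numpy as np

# Made-up data: x symmetric around zero, y a pure quadratic in x.
x = np.array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5], dtype=float)
y = x ** 2

# Ordinary least squares fit of y = a + b*x.
b, a = np.polyfit(x, y, deg=1)

# R^2: the fraction of the variance in y that the line explains.
residuals = y - (a + b * x)
r_squared = 1 - residuals.var() / y.var()
```

Because the data are symmetric, the best straight line is flat (b ≈ 0) and r_squared ≈ 0 – a scatterplot would have shown the U-shape immediately, which is why eyeballing the plot first is worth the ten seconds.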
Time for dinner (on average anyway),