[Repost] Programmers Need To Learn Statistics Or I Will Kill Them All

2010-02-06 20:45

I have a major pet peeve that I need to confess. I go insane when I hear programmers talking about statistics like they know shit when it’s clearly obvious they do not. I’ve been studying it for years and years and still don’t think I know anything. This article is my call for all programmers to finally learn enough about statistics to at least know they don’t know shit. I have no idea why, but their confidence in their lacking knowledge is only surpassed by their lack of confidence in their personal appearance.

A bit of background about me is in order. I got interested in statistics when I started to read about the history of mathematics and how statistics radically changed the way science was done. Before statistics the belief was that the world fit into perfectly mathematical models, and that any error we find is because we don’t have the models right. This is thanks to Descartes convincing everyone that math is reality, and that we’re just full of bullshit. Eventually, every major science adopted an empiricist view of the world. Except Computer Science of course.

My part in this little drama is that I’m a weirdo who’s studied more sociology, business, economics, and history than I’ve studied computer science. I have taken a bunch of math classes, studied statistics in grad school, learned the R language, and read tons of books on the subject. Despite all of this I’m not at all confident in my understanding of such a vast topic. What I can do is apply the techniques to common problems I encounter at work. My favorite problem to attack with the statistics wolverine is performance measurement and tuning.

All of this leads to a curse since none of my colleagues have any clue about what they don’t understand. I’ll propose a measurement technique and they’ll scoff at it. I try to show them how to properly graph a run chart and they’re indignant. I question their metrics and they try to back it up with lame attempts at statistical reasoning. I really can’t blame them since they were probably told in college that logic and reason are superior to evidence and observation. Even the great Knuth once said: “Beware of bugs in the above code; I have only proved it correct, not tried it.”

A Blind Man On A Planet With No Sight

I’m sure you’ve all thought about it at some point. “Imagine you’re on a planet where everyone was blind, and you’re the only one with sight. How would you describe the sunset?” It’s commonly something done as an exercise in high school and it’s retarded. If this planet were populated with programmers though it would be really interesting.

Zed: Wow, the sunset here is a brilliant blue.
Joe Programmer: No, you’re fucking wrong it’s red asshole.
Zed: Uh, it’s blue. Guy with vision here. Remember?
Frank Programmer: Yeah, it’s red man. You’re an idiot. See, I can hear the way it makes the air move so I know it looks red.
Zed: Look! I’m the one who can see! It’s blue.
Joe Programmer: I have written huge web applications in every language and even programmed the original VAX. I know that sunset is red.
Frank Programmer: It’s red because of the heat it generates on my arm. Yes. I’m sure that’s it.
Zed: Fuck! Fuck! I have eyes! You do not! See!? No?! Exactly! Because you can’t fucking see because you have no fucking eyes! Arrggh! I’m going to get a burrito.
Joe Programmer: That guy is such an asshole.
Frank Programmer: Yep. Still sounds red to me though.

This is how I feel when I try to explain why someone’s analysis isn’t quite right. I grab one of my many statistics books, open up my R console, start to draw up some graphs and show them how to do it. Next thing I know, I’m getting the cold shoulder or told I’m an idiot. It’s even worse when the person is a programmer and I’m showing them that they have work to do.

Another analogy is when I met a guy from Arkansas who said that perpetual motion machines could work. “Yesir, perpetual motion machines (or PMMs as Billy Bob Dunsfield down the street calls 'em) are a reality. I read it on the eenternet yesterday. Yesir.” I’m usually stuck on where to even begin. “Don’t believe what you read on the Internet…” No, this is the guy who reads Hustler, so I don’t want to keep him from reading more. “You see, there’s this law of…” No, physics won’t come into the conversation at all. I’m just stuck.

The difference though between some hill-billy from Arkansas and a clueless programmer is that the programmer should know better. He’s probably educated, smart, and hopefully both (you’d be surprised).

Oh, and you wonder why I say, “he”? I never have this problem with female programmers. Maybe it’s because I’m tall (6’2”), or nicer to them, but they always speak rationally and are really keen to learn. If they disagree, they do so rationally and back up what they say. I think women are better programmers because they have less ego and are typically more interested in the gear rather than the pissing contest.

My List of Pet Peeves

I could make this list go on and on forever, but I’ll just rant about the top things that irritate me to no end. These are things you’ll see all over the IT industry: performance measurements, capacity planning guides, product literature, and anything Microsoft writes about Linux. I’ll detail each annoyance, and then describe how you can stop doing it, and what to read to get help.

Power-of-Ten Syndrome

“All you need to do is run that test [insert power-of-ten] times and then do an average.” Usually the power-of-ten is 1000, but it will be 10 if the test takes longer than 2 minutes (which is the exact attention span of the average programmer). I’ll cover “average” later on, but there are several problems with the power-of-ten choice, which I’ll demonstrate with the usually given “1000 times” figure.

How do you know that 1000 is the correct number of iterations to improve the power of the experiment?

What’s that? You don’t know what “power” is? It’s basically the chance that your experiment is right (not quite but close enough): strictly, it’s the probability that the experiment detects a real effect of a given size when one exists. There’s some decent mathematics behind determining power, and you can even run a single function in R to find out appropriate sample sizes given your accuracy needs. Take a look at power.t.test and power.prop.test in R for information.
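As a minimal sketch of that sample-size lookup: the numbers below (a 1 ms shift worth detecting, a 5 ms standard deviation, 90% power) are assumptions picked for illustration, not recommendations. Plug in whatever your own measurements suggest.

```r
# Assumed numbers for illustration only: detect a 1 ms shift in mean
# response time when the run-to-run standard deviation is about 5 ms.
result <- power.t.test(delta = 1, sd = 5, sig.level = 0.05, power = 0.90)

# n is the estimated number of samples needed in EACH of the two groups.
samples_per_group <- ceiling(result$n)
print(samples_per_group)
```

Notice that the answer comes out of the effect size and the variability you feed in, not out of a convenient power of ten.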

How are you performing the samplings?

1000 iterations run in a massive sequential row? A set of 10 runs with 100 each? The statistics are different depending on which you do, but the 10 runs of 100 each would be a better approach. This lets you compare sample means and figure out if your repeated runs have any bias.
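Here is a rough sketch of the "10 runs of 100" layout. The timings are simulated with rnorm as a stand-in for real measurements, and the mean of 30 and standard deviation of 5 are assumed values.

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# Simulated timings standing in for real measurements:
# 10 runs of 100 samples each, stored as a 100 x 10 matrix.
runs <- replicate(10, rnorm(100, mean = 30, sd = 5))

run_means <- colMeans(runs)      # one mean per run
run_sds   <- apply(runs, 2, sd)  # one standard deviation per run

# A run whose mean sits far from the others hints that something
# changed between runs.
print(round(run_means, 2))
print(sd(run_means))             # spread of the run means
```

With one long run of 1000 you get a single mean and no way to tell whether it would come out the same next time.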

How do you know that 1000 is enough to get the process into a steady state after the ramp-up period?

A common element of process control statistics is that all processes have a period in the beginning where the process isn’t stable. This “ramp-up” period is usually discarded when doing the analysis unless your run length has to include it. Most people think that 1000 is more than enough, but it totally depends on how the system functions. Many complex interacting systems can easily need 1000 iterations to get to a steady state, especially if performing 1000 transactions is very quick. Imagine a banking system that handles 10,000 transactions a second. It could take a good minute to get this system into a steady state, but your measly little 1000 transaction test is only scratching the surface.
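Discarding the ramp-up period can be sketched like this. The data is simulated, and the warm-up length of 200 samples is an assumption. With real data you would pick the cutoff by looking at a run chart first.

```r
set.seed(1)

# Simulated response times: the first 200 samples are slow while the
# (hypothetical) system warms up, the remaining 800 are steady.
warmup_len <- 200
timings <- c(rnorm(warmup_len, mean = 60, sd = 15),
             rnorm(800, mean = 30, sd = 5))

steady <- timings[-seq_len(warmup_len)]  # drop the ramp-up period

print(mean(timings))  # pulled upward by the warm-up samples
print(mean(steady))   # closer to the steady-state behavior
```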

What will you do if the 1000 tests take 10 hours?

How does 1000 sequential requests help you determine the performance under load?

You run 1000 tests sequentially and then find out that the system blows up when you give it a parallel load. Now you’re back to square one because the performance characteristics are different under parallel load.

If all you do is run 1000 and then take an average, then how do you spot places where the system is really hurting? Read the “Averages Only” section for more on this.

A graph can really demonstrate this problem well. Using the following R code:

> a <- rnorm(100, 30, 5)
> b <- rnorm(100, 30, 20)

I construct two sets of 100 random samples from the normal distribution. Now, if I just take the average (mean or median) of these two sets they seem almost the same:

> mean(a)
[1] 30.05907
> mean(b)
[1] 30.11601
> median(a)
[1] 30.12729
> median(b)
[1] 31.06874

They’re both around 30 (which is what we requested with the second parameter). This is where most programmers would start to piss me off, because if you take a look at the following run chart of the samples you can see the difference (blue is a, orange is b):
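A run chart like that one takes only a few lines of R. This is a sketch, not the code behind the original chart: the seed is arbitrary, so the samples will differ from the numbers printed above.

```r
set.seed(2010)  # arbitrary seed; these samples differ from the ones above
a <- rnorm(100, 30, 5)
b <- rnorm(100, 30, 20)

# Plot both series on one set of axes, sized to fit the wider sample.
plot(b, type = "l", col = "orange", ylim = range(c(a, b)),
     xlab = "Sample", ylab = "Value", main = "Run chart of a and b")
lines(a, col = "blue")
abline(h = 30, lty = 2)  # the mean requested for both samples
legend("topright", legend = c("a", "b"), col = c("blue", "orange"), lty = 1)
```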

The third parameter tells R to give samples with different standard deviations. This makes the range of possible responses “wider” and gives you two wildly different charts even though they have nearly the same mean and median. Even better are the results of the summary function in R:

> summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.33   27.00   30.13   30.06   33.43   47.23
> summary(b)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -15.48   16.90   31.07   30.12   43.42   80.86

Without even doing a nice graph you can see that the ranges are totally different.

Averages Only

This one pisses me off the most as it is so obvious. It usually happens when a programmer states that his system can handle “[insert power-of-ten] requests per second”. I then see the power-of-ten and raise a red flag. A power-of-ten isn’t really bad, as long as it’s arrived at through a decent analysis. But typically it’s torn out of a monkey’s ass and thrown on a wall with a plop.

The most troubling problem with these single number “averages” is that there are two common averages (the mean and the median), and that without some measure of spread, such as a range or a standard deviation, they are useless. If you take a look at the previous graphs you can see visually why this is a problem. Two averages can be the same, but hide massive differences in behavior. Without a standard deviation it’s not possible to figure out if the two might even be close. An even better approach (with normally distributed data) is to use a Student’s t-test to see if there are differences.
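A sketch of what that comparison might look like, using freshly simulated samples (arbitrary seed, so not the same numbers as above). One thing to know: R's t.test runs Welch's variant by default, which does not assume equal variances. I've added var.test here as one way to compare the spreads directly, which is my choice and not something the t-test does for you.

```r
set.seed(7)
a <- rnorm(100, 30, 5)
b <- rnorm(100, 30, 20)

# Welch's t-test (R's default) compares the two means.
means_test <- t.test(a, b)
print(means_test$p.value)

# An F test compares the variances, which is where these samples differ.
spread_test <- var.test(a, b)
print(spread_test$p.value)
```

A test on the means alone can easily say "no difference" here, which is exactly the trap: the difference is in the spread.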

Let’s look at the standard deviation for our two samples:

> sd(a)
[1] 5.562842
> sd(b)
[1] 19.09167

Now that’s a difference! If this were a web server performance run I’d say the second server (represented by b) has a major reliability problem. No, it’s not going to crash, but its response time is so erratic that you’d never know how long a request would take. Even though the two servers perform the same on average, users will think the second one is slower because of how it seems to randomly perform. Of course, this can’t be a server measurement since it has negative timing measurements, but it is only an example to illustrate the standard deviation.

Why is this so important? Here are two more graphs to illustrate how important consistent behavior is to performance measurements, and actually to most process-oriented measurements. The first graph is the a set and the second graph is the b set:

Normally you don’t get such nice graphs from performance measurements, but the idea is still the same. The first graph fits well into a very nice normal curve. The curved line pretty closely matches the histogram it models. This is what a consistent process looks like.

This second graph though shows a total mess of a process. A run chart would give you a better view, but this illustrates an important point that badly behaving processes tend to not fit their supposed distribution. We know this is supposed to be a normal distribution, but look at it. It’s all slumped to the side and doesn’t even look at all like its distribution.
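Histograms with a fitted normal curve, like the two described above, can be drawn roughly as follows. This is a sketch on a fresh simulated sample, not a reproduction of the original graphs. The Shapiro-Wilk test at the end is an extra I've added as one numeric check to go with the picture.

```r
set.seed(3)
a <- rnorm(100, 30, 5)

# Density-scaled histogram so the curve and the bars share a y axis.
hist(a, freq = FALSE, main = "Histogram of a", xlab = "Value")
curve(dnorm(x, mean = mean(a), sd = sd(a)), add = TRUE, col = "blue")

# Shapiro-Wilk normality test: a small p-value suggests the sample
# does not look normally distributed.
normality_p <- shapiro.test(a)$p.value
print(normality_p)
```

Run the same lines on your second sample and compare the two pictures side by side.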

The moral of the story is that if you give an average without standard deviations then you’re totally missing the entire point of even trying to measure something. A major goal of measurement is to develop a succinct and accurate picture of what’s going on, but if you don’t find out the standard deviation and do at least a couple graphs then you’re screwed. Just give up man. Game over. Game over.

Confounding, Confounding, Confounding

Ah, confounding. The most difficult thing to explain to a programmer, yet the most elementary part of all scientific experimentation. It’s pretty simple: If you want to measure something, then don’t measure other shit. Wow, what a revelation. It is a lot more difficult to do in a lab and especially in agriculture where most of this crap was thought up in the first place. Programmers have no fucking excuse though since they can easily remove confounding by isolating systems.

Here’s an example of why confounding is so wrong. Imagine that someone tried to tell you that you needed to compare a bunch of flavors of ice cream for their taste, but that half of the tubs of creamy goodness were melted, and half were frozen. Do you think having to slop down a gallon of Heath Crunch flavored warm milk would skew your quality measurement? Of course it would. The temperature of the ice cream is confounding your comparison of taste quality. In order to fix the problem you need to remove this confounding element by melting all the ice cream.

Alright, well, I guess you could freeze them all too but you get the idea.

How do you fix or even detect confounding? Well, in “the real world” it’s a bitch, and sometimes it’s impossible, so some super smart motha fu.. (shut your mouth) came up with all sorts of ways to reduce the impact of confounding. One way is to randomize the confounding element so that its effect is not influencing the element under investigation. If you have to find out which fertilizer is the best, but the water and soil have different properties over your massive 20 acre field, then you need to randomize where you put what fertilizer. It gets even more complicated since you might randomly put them in a really bad formation, so these super smart people came up with ways around that too.
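The simplest version of that randomization can be sketched with sample. The layout here (20 plots, 4 fertilizers, 5 plots each) is made up for illustration. It is plain complete randomization, not one of the fancier designs that guard against a bad formation.

```r
set.seed(11)

# Hypothetical setup: 20 plots in the field, 4 fertilizers, 5 plots each.
fertilizers <- c("A", "B", "C", "D")
plots <- 1:20

# Shuffle the treatment labels so soil and water differences are spread
# across fertilizers by chance rather than by the experimenter's choice.
assignment <- data.frame(plot = plots,
                         fertilizer = sample(rep(fertilizers, each = 5)))

print(table(assignment$fertilizer))  # each fertilizer still gets 5 plots
```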

Hold on though, we’re fucking programmers not farmers. If we want to take one single line of code and test it then we can. If we want to only verify one single query on a database then what’s stopping us? Stupidity that’s what. Programmers just don’t get confounding and companies use it against them by writing “case studies” comparing wildly different systems in performance or security that are chock full of confounding elements.

The classic example of this is the Pet Store debacle where Sun put out an example application showing how to do J2EE right (which was really more wrong than a transvestite pregnant with triplets). Then Microsoft came along and re-implemented the whole thing using ASP that smoked the Java Pet Store—even though it was implemented just as wrong. The confounding in the Pet Store comparison though was so bad that it was impossible to really compare the two. They were different systems, had different URLs, different form elements, different backend databases, and the test procedure was totally bogus. The comparison claimed to measure X number of “users”, but didn’t cover single page execution, database configurations used, system level tunings. Of course none of that is useful because even that just confounds shit even further.

For the Pet Store experiment to be meaningful it should have tried to keep every damn thing it could the same except for the minimum of different elements. The same databases, operating systems, file layouts, forms, HTML tags, logic flow, everything possible. The only different element should have been the ASP vs. JSP and the Servlets vs. VB hackery. Even better would have been to not use a full application in order to avoid confounding an entire application design on the measurement of speed to render functionality.

Being able to remove confounding elements to get to the core of a problem is also a valuable analysis tool outside of measurement. I once worked for a financial services company that was rolling out a new application to its sales team. We spent the whole Sunday night getting this thing rolling and the next morning we kept running into these weird 2 or 3 minute delays on some queries. The manager in charge of the project was throwing fits and threatening to fire our entire team for incompetence. He especially picked on the database admin for whatever reason, but our direct manager was very calm and rational about it. She (remember what I said about women) asked me to look at it and see what I could figure out.

The ranting manager was claiming it was the DBA’s database and he was screwed. It was our programming that was causing the problem. It was this and it was that. I checked out the program and the queries and nothing seemed wrong. I decided to break down each element of the chain in the request processing to see what was causing the problem. JSP rendering? Nope, that was sub-second response on average with a sub-second standard deviation. Controller code? Nope. Microsoft SQL database? Nope. A small harness that ran all the queries in the application showed those to have great and consistent performance.

Then I hit the DB2 database and about crapped my pants. Almost all of the queries performed great, except one query that had sub-second response on average, but a 60 second standard deviation! This was the query. I made a chart of all the different queries, marched into a meeting, slapped it on the table and said, “It’s not the database, it’s IBM’s DB2 configuration. Here are the time measurements to prove it.”

The next day we had IBM fixing the problem (turned out to be a single update index command) and we all kept our jobs. That’s what a proper analysis method can do for you.
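The per-query breakdown in that story boils down to a mean and a standard deviation for each group of timings. Here is a minimal sketch with simulated data. These are not the measurements from the story, and the query names are made up.

```r
set.seed(5)

# Simulated timings in seconds, NOT the measurements from the story.
# "q3" is mostly fast with an occasional very long stall.
timings <- data.frame(
  query   = rep(c("q1", "q2", "q3"), each = 50),
  seconds = c(rnorm(50, 0.3, 0.05),
              rnorm(50, 0.4, 0.05),
              rnorm(48, 0.3, 0.05), 120, 180)
)

# Mean and standard deviation per query, side by side.
means <- tapply(timings$seconds, timings$query, mean)
sds   <- tapply(timings$seconds, timings$query, sd)
print(round(cbind(mean = means, sd = sds), 3))

# The query with the largest spread is the first one to investigate.
worst <- names(which.max(sds))
print(worst)
```

A table like this, one row per link in the chain, is what lets you point at a single culprit instead of arguing about whose fault it is.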

The Definition of “User”

I worked with this idiot we’ll call Mr. BJ who would constantly say that my measurements were crap. I’d compare the performance of X and show that it wouldn’t meet our coming student storm (it was a university) and he’d tell me that I’m full of it since I didn’t measure how many “users” the system could handle. I kept asking BJ, “define user.” The best he could come up with was, “You know, click around a lot and fill out some forms. Like a user!”

Before you can measure something you really need to lay down a very concrete definition of what you’re measuring. You should also try to measure the simplest thing you can and try to avoid confounding. Yet still I see software developers begging for gazillions of dollars to buy some crap tool that doesn’t even mention “standard deviation”, but throws “user” around like it’s Dr. Phil treating Robert Downey Jr. for heroin addiction.

What gets me though is most IT people can’t even grok people’s facial expressions, but they’ll trust anything that claims it measures “the average user”. I’m sorry to say boys and girls but there are entire industries and scientific disciplines trying to figure out the average user. Your piddly little JMeter analysis with its lousy graphs and even worse statistics won’t tell you crap about your users.

It especially won’t tell you the one thing you need to know for performance measurements:

How much data can go down this fucking pipe in a second?

That’s all there is to performance measurement. Sure, “how much”, “data”, and “pipe” all depend on the application, but if you need 1000 requests/second processing mojo, and you can’t get your web server to push out more than 100 requests/second, then you’ll never get your JSP+EJB+Hibernate+SOAP application anywhere near good enough. If all you can shove down your DS3 is 10k/second then you’ll never get that massive 300k flash animation to your users in time to sell them your latest Gizmodo 9000. Face it, users are both useless for fixing their computer and as a measurement of speed.

Measuring anything else first is like trying to see who has the fastest car by going to the super market to get eggs. All you get out of that exercise is broken eggs.
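The pipe arithmetic is worth writing down, since it is the first sanity check to run. This sketch just uses the two examples given above and reads "k" as kilobytes, which is my assumption.

```r
# Example 1: a 300k animation over a link that delivers 10k per second.
payload_kb      <- 300
pipe_kb_per_sec <- 10
transfer_secs   <- payload_kb / pipe_kb_per_sec
print(transfer_secs)  # 30 seconds per download

# Example 2: 1000 requests/second needed, 100 requests/second measured.
needed_rps   <- 1000
measured_rps <- 100
shortfall    <- needed_rps / measured_rps
print(shortfall)      # 10 times short of the target
```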

Where To Get Help

I’ve read a lot of books on the subject, but here are a few that you can look for on your favorite corporate book pusher site:

Statistics; by Freedman, Pisani, Purves, and Adhikari. Norton publishers.
Introductory Statistics with R; by Dalgaard. Springer publishers.
Statistical Computing: An Introduction to Data Analysis using S-Plus; by Crawley. Wiley publishers.
Statistical Process Control; by Grant, Leavenworth. McGraw-Hill publishers.
Statistical Methods for the Social Sciences; by Agresti, Finlay. Prentice-Hall publishers.
Methods of Social Research; by Bailey. Free Press publishers.
Modern Applied Statistics with S-PLUS; by Venables, Ripley. Springer publishers.

There are also several books on statistics and the software development process which you can apply to your work directly. Finally, you should check out the R Project for the programming language used in this article. It is a great language for this, with some of the best plotting abilities in the world. Learning to use R will help you also learn statistics better.