Jakob Nielsen's Alertbox, January 21, 2001:
Usability Metrics
Summary:
Although measuring usability can cost four times as much as conducting qualitative studies (which often generate better insight), metrics are sometimes worth the expense. Among other things, metrics can help managers track design progress and support decisions about when to release a product.
Usability can be measured, but it rarely is. The reason? Metrics are expensive and are a poor use of typically scarce usability resources.
Most companies still under-invest in usability. With a small budget, you're far better off passing on quantitative measures and reaching for the low-hanging fruit of qualitative methods, which provide a much better return on investment. Generally, to improve a design, insight is better than numbers.
However, the tide might be turning on usability funding. I've recently worked on several projects to establish formal usability metrics in different companies. As organizations increase their usability investments, collecting actual measurements is a natural next step and does provide benefits. In general, usability metrics let you:
- Track progress between releases. You cannot fine-tune your methodology unless you know how well you're doing.
- Assess your competitive position. Are you better or worse than other companies? Where are you better or worse?
- Make a Stop/Go decision before launch. Is the design good enough to release to an unsuspecting world?
- Create bonus plans for design managers and higher-level executives. For example, you can determine bonus amounts for development project leaders based on how many customer-support calls or emails their products generated during the year.
How to Measure
It is easy to specify usability metrics, but hard to collect them. Typically, usability is measured relative to users' performance on a given set of test tasks. The most basic measures are based on the definition of usability as a quality metric:
- success rate (whether users can perform the task at all),
- the time a task requires,
- the error rate, and
- users' subjective satisfaction.
It is also possible to collect more specific metrics, such as the percentage of time that users follow an optimal navigation path or the number of times they need to backtrack.
You can collect usability metrics for both novice users and experienced users. Few websites have truly expert users, since people rarely spend enough time on any given site to learn it in great detail. Given this, most websites gain the most from studying novice users. Exceptions are sites like Yahoo and Amazon, which have highly committed and loyal users and can benefit from studying expert users.
Intranets, extranets, and weblications are similar to traditional software design and will hopefully have skilled users; studying experienced users is thus more important than working with the novice users who typically dominate public websites.
With qualitative user testing, it is enough to test three to five users. After the fifth user, you have virtually all the insight you are likely to get, and your best bet is to go back to the drawing board and improve the design so that you can test it again. Testing more than five users wastes resources, reducing the number of design iterations and compromising the final design quality.
Unfortunately, when you're collecting usability metrics, you must test with more than five users. In order to get a reasonably tight confidence interval on the results, I usually recommend testing 20 users for each design. Thus, conducting quantitative usability studies is approximately four times as expensive as conducting qualitative ones. Considering that you can learn more from the simpler studies, I usually recommend against metrics unless the project is very well funded.
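As a minimal sketch of what the confidence-interval arithmetic looks like for a 20-user measurement (the task times below are invented purely for illustration):

```python
import math
import statistics

# Invented task times (in seconds) for 20 test users -- illustrative data only.
task_times = [48, 52, 39, 61, 45, 70, 55, 43, 58, 49,
              66, 41, 53, 47, 59, 62, 44, 51, 57, 46]

n = len(task_times)
mean = statistics.mean(task_times)
std_err = statistics.stdev(task_times) / math.sqrt(n)   # standard error of the mean

# t critical value for a two-sided 95% interval with n - 1 = 19 degrees of freedom.
t_crit = 2.093
margin = t_crit * std_err

print(f"Mean task time: {mean:.1f} s "
      f"(95% CI: {mean - margin:.1f} to {mean + margin:.1f} s)")
```

With only 5 users, the critical value grows to about 2.78 and the standard error roughly doubles (it shrinks with the square root of the sample size), which is why small qualitative samples rarely yield numbers tight enough to report.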
Comparing Two Designs
To illustrate quantitative results, we can look at those recently posted by Macromedia from its usability study of a Flash site, aimed at showing that Flash is not necessarily bad. Basically, Macromedia took a design, redesigned it according to a set of usability guidelines, and tested both versions with a group of users. Here are the results:
|                     | Original Design | Redesign |
|---------------------|-----------------|----------|
| Task 1              | 12 sec.         | 6 sec.   |
| Task 2              | 75 sec.         | 15 sec.  |
| Task 3              | 9 sec.          | 8 sec.   |
| Task 4              | 140 sec.        | 40 sec.  |
| Satisfaction score* | 44.75           | 74.50    |

*Measured on a scale ranging from …
It is very rare for usability studies to employ tasks that are so simple that users can perform them in a few seconds. Usually, it is better to have the users perform more goal-directed tasks that will take several minutes. In a project I'm working on now, the tasks often take more than half an hour (admittedly, it's a site that needs much improvement).
Given that the redesign scored better than the original design on all five measures, there is no doubt that the new design is better than the old one. The only sensible move is to go with the new design and launch it as quickly as possible. However, in many cases, results will not be so clear cut. In those cases, it's important to look in more detail at how much the design has improved.
Measuring Success
There are two ways of looking at the time-to-task measures in our example case:
- Adding the time for all four tasks produces a single number that indicates "how long it takes users to do stuff" with each design. You can then easily compute the improvement. With the original design, the set of tasks took 236 seconds. With the new design, the set of tasks took 69 seconds. The improvement is thus 242%. This approach is reasonable if site visitors typically perform all four tasks in sequence. In other words, when the test tasks are really subtasks of a single, bigger task that is the unit of interest to users.
- Even though it is simpler to add up the task times, doing so can be misleading if the tasks are not performed equally often. If, for example, users commonly perform Task 3 but rarely perform the other tasks, the new design would be only slightly better than the old one; task throughput would be nowhere near 242% higher. When tasks are unevenly performed, you should compute the improvement separately for each of the tasks:
- Task 1: relative score 200% (improvement of 100%).
- Task 2: relative score 500% (improvement of 400%).
- Task 3: relative score 113% (improvement of 13%).
- Task 4: relative score 350% (improvement of 250%).
You can then take the geometric mean of these four scores, which leads to an overall improvement in task time of 150%.
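The same calculation, as a minimal sketch using the times from the table above:

```python
import math

# Relative scores for each task: old task time divided by new task time.
old_times = [12, 75, 9, 140]
new_times = [6, 15, 8, 40]
ratios = [o / n for o, n in zip(old_times, new_times)]   # [2.0, 5.0, 1.125, 3.5]

# Geometric mean: the N'th root of the product of the N ratios.
geo_mean = math.prod(ratios) ** (1 / len(ratios))
print(f"Relative score: {geo_mean:.2f} "
      f"(improvement of about {100 * (geo_mean - 1):.0f}%)")
# Prints roughly 2.50, i.e. an overall improvement of about 150%.
```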
Why do I recommend using the geometric mean rather than the more common arithmetic mean? Two reasons: First, you don't want a single big number to skew the result. Second, the geometric mean deals fairly with cases in which some of the metrics show a regression (i.e., the second design scores less than 100% of the first design).
Consider a simple example containing two metrics: one in which the new design doubles usability and one in which the new design has half the usability of the old. If you take the arithmetic average of the two scores (200% and 50%), you would conclude that the new design scored 125%. In other words, the new design would be 25% better than the old design. Obviously, this is not a reasonable conclusion.
The geometric mean provides a better answer. In general, the geometric mean of N numbers is the N'th root of the product of the numbers. In our sample case, you would multiply 2.0 with 0.5, take the square root, and arrive at 1.0 (or 100%), indicating that the new design has the same usability as the baseline.
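The contrast is easy to verify directly:

```python
import math
import statistics

ratios = [2.0, 0.5]   # one metric doubled, the other halved

print(statistics.mean(ratios))                   # 1.25 -- misleadingly suggests "25% better"
print(math.prod(ratios) ** (1 / len(ratios)))    # 1.0  -- correctly shows no net change
```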
Although it is possible to assign different weights to the different tasks when computing the geometric mean, absent any knowledge as to the relative frequency or importance of the tasks, I've assumed equal weights here.
Summarizing Results
Once you've gathered the metrics, you can use the numbers to formulate an overall conclusion about your design's usability. However, you should first examine the relative importance of performance versus satisfaction. In the Macromedia example, users' subjective satisfaction with the new design was 66% higher than with the old design. For a business-oriented website or a website that is intended for frequent use (say, stock quotes), performance might be weighted higher than preference. For an entertainment site or a site that will only be used once, preference may get the higher weight.

Before making a general conclusion, I would also prefer to have error rates and perhaps a few additional usability attributes, but, all else being equal, I typically give the same weight to all the usability metrics. Thus, in the Macromedia example, the geometric mean averages the set of scores as sqrt(2.50 × 1.66) = 2.04. In other words, the new design scores 204% compared with the baseline score of 100% for the control condition (the old design).
The new design thus has 104% higher usability than the old one.
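Spelled out as a minimal sketch (2.50 is the geometric mean of the four task-time ratios from above; the satisfaction ratio comes straight from the table):

```python
import math

time_ratio = 2.50                      # geometric mean of the four task-time ratios
satisfaction_ratio = 74.50 / 44.75     # new vs. old satisfaction score, about 1.66

# Equal weights for the two metrics: take the geometric mean of the two ratios.
overall = math.sqrt(time_ratio * satisfaction_ratio)
print(f"Overall relative usability: {overall:.2f}")   # about 2.04, i.e. 104% higher
```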
This result does not surprise me: It is common for usability to double as a result of a redesign. In fact, whenever you redesign a website that was created without a systematic usability process, you can often improve measured usability even more. However, the first numbers you should focus on are those in your budget. Only when those figures are sufficiently large should you make metrics a part of your usability improvement strategy.
----------------------------------------------------------------------------------------------
Jakob Nielsen's Alertbox, February 18, 2001:
Success Rate: The Simplest Usability Metric
Summary:
In addition to being expensive, collecting usability metrics interferes with the goal of gathering qualitative insights to drive design decisions. As a compromise, you can measure users' ability to complete tasks. Success rates are easy to understand and represent usability's bottom line.
Numbers are powerful. They offer a simple way to communicate usability findings to a general audience. Saying, for example, that "Amazon.com complies with 72% of the e-commerce usability guidelines" is a much more specific statement than "Amazon.com has great usability, but they don't do everything right."
In a previous Alertbox, I discussed ways of measuring and comparing usability metrics like time on task. Such metrics are great for assessing long-term progress on a project: Does your site's usability improve by at least 20% per year? If not, you are falling behind relative to both the competition and the needs of the new, less technically inclined users who are coming online.
Unfortunately, there is a conflict between the need for numbers and the need for insight. Although numbers can help you communicate usability status and the need for improvements, the true purpose of usability is to set the design direction, not to generate numbers for reports and presentations. In addition, the best methods for usability testing conflict with the demands of metrics collection.
The best usability tests involve frequent small tests, rather than a few big ones. You gain maximum insight by working with 4-5 users and asking them to think out loud during the test. As soon as users identify a problem, you fix it immediately (rather than continue testing to see how bad it is). You then test again to see if the "fix" solved the problem.
Although small tests give you ample insight into how to improve design, such tests do not generate the sufficiently tight confidence intervals that traditional metrics require. Think-aloud protocols are the best way to understand users' thinking and thus how to design for them, but the extra time it takes for users to verbalize their thoughts contaminates task-time measures.
Thus, the best usability methodology is the one least suited for generating detailed numbers.
Measuring Success
To collect metrics, I recommend using a very simple usability measure: the user success rate. I define this rate as the percentage of tasks that users complete correctly. This is an admittedly coarse metric; it says nothing about why users fail or how well they perform the tasks they did complete.
Nonetheless, I like success rates because they are easy to collect and a very telling statistic. After all, if users can't accomplish their target task, all else is irrelevant. User success is the bottom line of usability.
Success rates are easy to measure, with one major exception: How do we account for cases of partial success? If users can accomplish part of a task, but fail other parts, how should we score them?
Let's say, for example, that the users' task is to order twelve yellow roses to be delivered to their mothers on their birthday. True task success would mean just that: Mom receives a dozen roses on her birthday. If a test user leaves the site in a state where this will occur, we can certainly score the task as a success. If the user fails to place any order, we can just as easily score the task as a failure.
But there are other possibilities as well. For example, a user might:
- order twelve yellow tulips, twenty-four yellow roses, or some other deviant bouquet;
- fail to specify a shipping address, and thus have the flowers delivered to their own billing address;
- specify the correct address, but the wrong date; or
- do everything perfectly except forget to specify a gift message to enclose with the shipment, so that Mom gets the flowers but has no idea who they are from.
Each of these cases constitutes some degree of failure (though if in the first instance the user openly states a desire to send, say, tulips rather than roses, you could count this as a success).
If a user does not perform a task as specified, you could be strict and score it as a failure. It's certainly a simple model: Users either do everything correctly or they fail. No middle ground. Success is success, without qualification.
However, I often grant partial credit for a partially successful task. To me, it seems unreasonable to give the same score (zero) to both users who did nothing and those who successfully completed much of the task. How to score partial success depends on the magnitude of user error.
In the flower example, we might give 80% credit for placing a correct order, but omitting the gift message; 50% credit for (unintentionally) ordering the wrong flowers or having them delivered on the wrong date; and only 25% credit for having the wrong delivery address. Of course, the precise numbers would depend on a domain analysis.
There is no firm rule for assigning credit for partial success. Partial scores are only estimates, but they still provide a more realistic impression of design quality than an absolute approach to success and failure.
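As a sketch of how such a scoring scheme might be written down (the outcome labels and exact credit values are hypothetical, loosely following the flower example, and would need to be adjusted after a domain analysis):

```python
# Hypothetical partial-credit scheme for the flower-ordering task.
# Outcome labels and credit values are illustrative, not a standard.
PARTIAL_CREDIT = {
    "complete_success": 1.00,        # Mom gets a dozen yellow roses on her birthday
    "missing_gift_message": 0.80,
    "wrong_flowers_or_date": 0.50,
    "wrong_delivery_address": 0.25,
    "no_order_placed": 0.00,
}

def score_task(outcome: str) -> float:
    """Return the credit earned for a single observed task outcome."""
    return PARTIAL_CREDIT[outcome]

# Example: three observed attempts at the task.
observed = ["complete_success", "wrong_flowers_or_date", "missing_gift_message"]
success_rate = sum(score_task(o) for o in observed) / len(observed)
print(f"Success rate: {success_rate:.0%}")   # (1.0 + 0.5 + 0.8) / 3, about 77%
```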
Case Study
The following table shows task success data from a study I recently completed. In it, we tested a fairly big content site, asking four users to perform six tasks.
|        | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 |
|--------|--------|--------|--------|--------|--------|--------|
| User 1 | F      | F      | S      | F      | F      | S      |
| User 2 | F      | F      | P      | F      | P      | F      |
| User 3 | S      | F      | S      | S      | P      | S      |
| User 4 | S      | F      | S      | F      | P      | S      |

Note: S = success, F = failure, P = partial success.
In total, we observed 24 attempts to perform the tasks. Of those attempts, 9 were successful and 4 were partially successful. For this particular site, we gave each partial success half a point. In general, 50% credit works well if you have no compelling reasons to give different types of errors especially high or low scores.
In this example, the success rate was (9+(4*0.5))/24 = 46%.
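Reproducing that calculation from the raw table, as a minimal sketch:

```python
# Results from the table above: S = success, F = failure, P = partial success.
results = [
    ["F", "F", "S", "F", "F", "S"],   # User 1
    ["F", "F", "P", "F", "P", "F"],   # User 2
    ["S", "F", "S", "S", "P", "S"],   # User 3
    ["S", "F", "S", "F", "P", "S"],   # User 4
]

CREDIT = {"S": 1.0, "P": 0.5, "F": 0.0}   # partial success counted as half a point

attempts = [CREDIT[r] for row in results for r in row]
success_rate = sum(attempts) / len(attempts)
print(f"Success rate: {success_rate:.0%}")   # (9 + 4 * 0.5) / 24, about 46%
```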
Simplified success rates are best used to provide a general picture of how your site supports users and how much improvement is needed to make the site really work. You should not get too hung up on the details of such numbers, especially if you're dealing with a small number of observations and a rough estimate of partial success scores. For example, if your site scored 46% and another site scored 47%, the other site is not necessarily better.
That a 46% success rate is not at all uncommon might provide some cold comfort. In fact, most websites score less than 50%. Given this, the average Internet user's experience is one of failure. When users try to do something on the Web for the first time, they typically fail.
Although using metrics alone will not solve this dilemma, it can give us a way to measure our progress toward better, more usable designs.
--------------------------------------------------------------------------------------
Jakob Nielsen's Alertbox, March 19, 2000:
Why You Only Need to Test With 5 Users
Summary:
Some people think that usability is very costly and complex and that user tests should be reserved for the rare web design project with a huge budget and a lavish time schedule. Not true. Elaborate usability tests are a waste of resources. The best results come from testing no more than 5 users and running as many small tests as you can afford.
In earlier research, Tom Landauer and I showed that the number of usability problems found in a usability test with n users is:
N (1 - (1 - L)^n)
where N is the total number of usability problems in the design and L is the proportion of usability problems discovered while testing a single user. The typical value of L is 31%, averaged across a large number of projects we studied. Plotting the curve for L=31% gives the following result:
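For readers who want the numbers behind the curve, here is a minimal sketch that tabulates the formula for L = 31%:

```python
# Proportion of the N total usability problems found after testing n users,
# using the formula above with L = 0.31.
L = 0.31

for n in range(0, 16):
    found = 1 - (1 - L) ** n     # fraction of all problems discovered so far
    print(f"{n:2d} users: {found:5.1%} of problems found")

# Roughly: 1 user finds about 31%, 5 users about 85%, 15 users over 99%.
```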
The most striking truth of the curve is that zero users give zero insights.
As soon as you collect data from a single test user, your insights shoot up and you have already learned almost a third of all there is to know about the usability of the design. The difference between zero and even a little bit of data is astounding.
When you test the second user, you will discover that this person does some of the same things as the first user, so there is some overlap in what you learn. People are definitely different, so there will also be something new that the second user does that you did not observe with the first user. So the second user adds some amount of new insight, but not nearly as much as the first user did.
The third user will do many things that you already observed with the first user or with the second user and even some things that you have already seen twice. Plus, of course, the third user will generate a small amount of new data, even if not as much as the first and the second user did.
As you add more and more users, you learn less and less because you will keep seeing the same things again and again. There is no real need to keep observing the same thing multiple times, and you will be very motivated to go back to the drawing board and redesign the site to eliminate the usability problems.
After the fifth user, you are wasting your time by observing the same findings repeatedly but not learning much new.
Iterative Design
The curve clearly shows that you need to test with at least 15 users to discover all the usability problems in the design. So why do I recommend testing with a much smaller number of users?
The main reason is that it is better to distribute your budget for user testing across many small tests instead of blowing everything on a single, elaborate study. Let us say that you do have the funding to recruit 15 representative customers and have them test your design. Great. Spend this budget on three tests with 5 users each!
You want to run multiple tests because the real goal of usability engineering is to improve the design and not just to document its weaknesses. After the first study with 5 users has found 85% of the usability problems, you will want to fix these problems in a redesign.
After creating the new design, you need to test again. Even though I said that the redesign should "fix" the problems found in the first study, the truth is that you think that the new design overcomes the problems. But since nobody can design the perfect user interface, there is no guarantee that the new design does in fact fix the problems. A second test will discover whether the fixes worked or whether they didn't. Also, in introducing a new design, there is always the risk of introducing a new usability problem, even if the old one did get fixed.
Also, the second test with 5 users will discover most of the remaining 15% of the original usability problems that were not found in the first test. (There will still be 2% of the original problems left - they will have to wait until the third test to be identified.)
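A quick back-of-the-envelope check of that arithmetic, assuming each 5-user test again finds roughly 85% of whatever problems remain:

```python
# Fraction of the original problems still undiscovered after each 5-user test,
# assuming every round finds about 85% of whatever remains.
per_round = 0.85
remaining = 1.0

for test in range(1, 4):
    remaining *= (1 - per_round)
    print(f"After test {test}: {remaining:.1%} of the original problems still unfound")
# Roughly 15% after the first test, about 2% after the second,
# and well under 1% after the third.
```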
Finally, the second test will be able to probe deeper into the usability of the fundamental structure of the site, assessing issues like information architecture, task flow, and match with user needs. These important issues are often obscured in initial studies where the users are stumped by stupid surface-level usability problems that prevent them from really digging into the site.
So the second test will both serve as quality assurance of the outcome of the first study and help provide deeper insights as well. The second test will always lead to a new (but smaller) list of usability problems to fix in a redesign. And the same insight applies to this redesign: not all the fixes will work; some deeper issues will be uncovered after cleaning up the interface. Thus, a third test is needed as well.
The ultimate user experience is improved much more by three tests with 5 users than by a single test with 15 users.
Why Not Test With a Single User?
You might think that fifteen tests with a single user would be even better than three tests with 5 users. The curve does show that we learn much more from the first user than from any subsequent users, so why keep going? Two reasons:
- There is always a risk of being misled by the spurious behavior of a single person who may perform certain actions by accident or in an unrepresentative manner. Even three users are enough to get an idea of the diversity in user behavior and insight into what's unique and what can be generalized.
- The cost-benefit analysis of user testing provides the optimal ratio around three or five users, depending on the style of testing. There is always a fixed initial cost associated with planning and running a test: it is better to depreciate this start-up cost across the findings from multiple users.
When To Test More Users
You need to test additional users when a website has several highly distinct groups of users. The formula only holds for comparable users who will be using the site in fairly similar ways.
If, for example, you have a site that will be used by both children and parents, then the two groups of users will have sufficiently different behavior that it becomes necessary to test with people from both groups. The same would be true for a system aimed at connecting purchasing agents with sales staff.
Even when the groups of users are very different, there will still be great similarities between the observations from the two groups. All the users are human, after all. Also, many of the usability problems are related to the fundamental way people interact with the Web and the influence from other sites on user behavior.
In testing multiple groups of disparate users, you don't need to include as many members of each group as you would in a single test of a single group of users. The overlap between observations will ensure a better outcome from testing a smaller number of people in each group. I recommend:
- 3-4 users from each category if testing two groups of users
- 3 users from each category if testing three or more groups of users (you always want at least 3 users to ensure that you have covered the diversity of behavior within the group)
Reference
Nielsen, Jakob, and Landauer, Thomas K.: "A mathematical model of the finding of usability problems," Proceedings of ACM INTERCHI'93 Conference (Amsterdam, The Netherlands, 24-29 April 1993), pp. 206-213.