What is the required sample size for quantitative usability testing?

Summary

  • Focus Groups (Qualitative study) : Test at least 15
  • 1:1 Interviews (Qualitative study) : Speak to at least 5
  • Surveys, Quantitative studies : Minimum 20, recommended 100 users to get statistically significant numbers; tight confidence intervals require even more users
  • Card sorting: Test at least 15 users
  • Eyetracking: Test 39 users if you want stable heatmaps

 

Qualitative studies are aimed at opinions and motivations, not numerical data.

Quantitative studies are aimed at numerical data, not opinions and motivations.

A common point of contention in the boardroom when it comes to usability testing is the statistical reliability of the research.

For almost all practical web development projects, a high level of accuracy is not required. Bear in mind the typical objective is to assess which design is better and NOT to find some optimum mean.

Even for quantitative studies, in practice, a confidence interval of +/- 19% is ample for most goals. To tighten it to +/- 10%, you would need nearly 4x the users (71 rather than 19), and you get much better data by testing 4 different designs with 20 users each than by blowing your entire budget on a single design in pursuit of academic precision.

Explanation

1:1 Interviews and Focus Group Research (Qualitative Methods)

Test around 20-30

1:1 interviews and focus groups are qualitative methods. Qualitative sample sizes should be large enough to keep gathering feedback until adding participants no longer yields additional perspectives or information.

For almost all website research projects, our experience is that approximately 30 participants is a comfortable place to stop; going beyond that seldom justifies the additional cost.

1:1 interviews are an alternative qualitative method. In website research, we use them primarily to get senior management's opinions and the corporate direction for the site. Because of the participants' seniority in the corporate hierarchy, focus groups are not a feasible approach: participants would be inclined to follow the opinions of the most senior member.

This recommended number is also well within established guidelines.

For an ethnography, Morse (1994) suggests approximately 30 – 50 participants.  For grounded theory, Morse (1994) has suggested 30 – 50 interviews, while Creswell (1998) suggests only 20 – 30.  And for phenomenological studies, Creswell (1998) recommends five to 25 and Morse (1994) suggests at least six.  There are no specific rules when determining an appropriate sample size in qualitative research.  Qualitative sample size may best be determined by the time allotted, resources available, and study objectives (Patton, 1990).

Reference: http://www.statisticssolutions.com/qualitative-sample-size/

http://usabilitytesting.sg/user-experience-course/lesson-3-understanding-your-user-personas/

Usability Testing (Qualitative Methods)

Test 5 Users

Most arguments for using more test participants are wrong, but some tests should be bigger and some smaller.

With 5 users, you almost always get close to user testing’s maximum benefit-cost ratio. (nngroup)

As with any human factors issue, however, there are exceptions:

  • Quantitative studies (aiming at statistics, not insights): Test at least 20 users to get statistically significant numbers; tight confidence intervals require even more users.
  • Card sorting: Test at least 15 users.
  • Eyetracking: Test 39 users if you want stable heatmaps.

However, these exceptions shouldn’t worry you much: the vast majority of your user research should be qualitative — that is, aimed at collecting insights to drive your design, not numbers to impress people in PowerPoint.

The main argument for small tests is simply return on investment: testing costs increase with each additional study participant, yet the number of findings quickly reaches the point of diminishing returns. There’s little additional benefit to running more than 5 people through the same study; ROI drops like a stone with a bigger N. (nngroup)

Reference : https://www.nngroup.com/articles/how-many-test-users/

Information Architecture Research (Quantitative Method)

Test 15 Users

The main quantitative data from a card sorting study is a set of similarity scores that measures the similarity of user ratings for various item pairs.

You must test fifteen users to reach a correlation of 0.90, which is a more comfortable place to stop. After 15 users, diminishing returns set in and correlations increase very little: testing 30 people gives a correlation of 0.95 — certainly better, but usually not worth twice the money. There are hardly any improvements from going beyond thirty users: you have to test sixty people to reach 0.98, and doing so is definitely wasteful. (nngroup)

References: https://www.nngroup.com/articles/card-sorting-how-many-users-to-test/

Quantitative studies for Web Usability Data

Test 20 to 100+ users

We know from previous analysis that user performance on websites follows a normal distribution. This makes things easy, because we need just two numbers: the mean and the standard deviation. Jakob Nielsen, a leading usability expert, shared his findings below:

“I analyzed 1,520 measures of user time-on-task performance for 70 different tasks from a broad spectrum of websites and intranets. Across these many studies, the standard deviation was 52% of the mean values. For example, if it took an average of 10 minutes to complete a certain task, then the standard deviation for that metric would be 5.2 minutes.”

“We know that for user testing of websites and intranets, the SD is 52% of the mean. In other words, if we tested ten users, then the SD of the average would be 16% of the mean, because .316 x .52 = .16. Let’s say we’re testing a task that takes five minutes to perform. So, the SD of the average is 16% of 300 seconds = 48 seconds. For a normal distribution, two-thirds of the cases fall within +/- 1 SD from the mean. Thus, our average would be within 48 seconds of the five-minute mean two-thirds of the time.”
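The arithmetic in the quote can be checked with a short sketch. It assumes only what the quote states, namely that the standard deviation is 52% of the mean, and applies the standard rule that the SD of an average of n measurements is the SD divided by the square root of n. The function name `sd_of_average` is illustrative and not taken from the source.

```python
import math

# Assumption from the quoted analysis: SD is 52% of the mean task time.
CV = 0.52

def sd_of_average(n_users: int, cv: float = CV) -> float:
    """SD of the average of n_users measurements, as a fraction of the mean."""
    if n_users < 1:
        raise ValueError("n_users must be at least 1")
    return cv / math.sqrt(n_users)

fraction = sd_of_average(10)   # 0.52 / sqrt(10)
seconds = fraction * 300       # five-minute task from the quote
print(f"{fraction:.1%} of the mean, about {seconds:.0f} s on a 300 s task")
```

Without intermediate rounding this gives about 16.4% of the mean, or roughly 49 seconds. The 48 seconds in the quote comes from rounding to 16% before multiplying.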


The following chart shows the margin of error for testing various numbers of users, assuming that you want a 90% confidence interval (blue curve). This means that 90% of the time, you hit within the interval, 5% of the time you hit too low, and 5% of the time you hit too high. For practical Web projects, you really don’t need more accurate interval than this.

The red curve shows what happens if we relax our requirements to being right half of the time. (Meaning that we’d hit too low 1/4 of the time and too high 1/4 of the time.)

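[Chart: margin of error, as a percent of the mean, versus number of users tested. Blue curve: 90% confidence interval. Red curve: 50% confidence interval.]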

In the chart, the margin of error is expressed as a percent of the mean value of your usability metric. For example, if you test 10 users, the margin of error is +/- 27% of the mean. This means that if the mean task time is 300 seconds (five minutes), then your margin of error is +/- 81 seconds. Your confidence interval thus goes from 219 seconds to 381 seconds: 90% of the time you’re inside this interval; 5% of the time you’re below 219, and 5% of the time you’re above 381.

This is a rather wide confidence interval, which is why I usually recommend testing with 20 users when collecting quantitative usability metrics. With 20 users, you’ll probably have one outlier (since 6% of users are outliers), so you’ll include data from 19 users in your average. This makes your confidence interval go from 243 to 357 seconds, since the margin of error is +/- 19% for testing 19 users.

You might say that this is still a wide confidence interval, but the truth is that it’s extremely expensive to tighten it up further. To get a margin of error of +/- 10%, you need data from 71 users, so you’d have to test 76 to account for the five likely outliers.
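As a rough cross-check of these margins, the sketch below uses a plain normal approximation: margin = z x 0.52 / sqrt(n), where z is the two-sided normal quantile for the chosen confidence level. This is an assumption about how the figures could be derived, not necessarily the method behind the original chart, and `margin_of_error` is an illustrative name.

```python
import math
from statistics import NormalDist

# Assumption from the quoted analysis: SD is 52% of the mean task time.
CV = 0.52

def margin_of_error(n_users: int, confidence: float = 0.90, cv: float = CV) -> float:
    """Two-sided margin of error as a fraction of the mean (normal approximation)."""
    if n_users < 1:
        raise ValueError("n_users must be at least 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # For 90% confidence, 5% sits in each tail, so take the 95th percentile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * cv / math.sqrt(n_users)

for n in (10, 19, 71):
    print(f"{n} users, 90% confidence: +/- {margin_of_error(n):.1%}")

# The relaxed "right half of the time" case described for the red curve.
print(f"19 users, 50% confidence: +/- {margin_of_error(19, confidence=0.50):.1%}")
```

Under this approximation, 10 users come out at about 27%, 19 users at a little under 20%, and 71 users at about 10%, with 19 users at 50% confidence near 8%. The results sit close to the quoted figures but do not match every one exactly, so the original chart presumably used slightly different rounding or assumptions.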

Testing 76 users is a complete waste of money for almost all practical development projects. You can get good-enough data on four different designs by testing each of them with 20 users, rather than blow your budget on only slightly better metrics for a single design.

In practice, a confidence interval of +/- 19% is ample for most goals. Mainly, you’re going to compare two designs to see which one measures best. And the average difference between websites is 68% — much more than the margin of error.

Also, remember that the +/- 19% is pretty much a worst-case scenario; you’ll do better 90% of the time. The red curve shows that half of the time you’ll be within +/- 8% of the mean if you test with 20 users and analyze data from 19. In other words, half the time you get great accuracy and the other half you get good accuracy. That’s all you need for non-academic projects.

Reference : https://www.nngroup.com/articles/quantitative-studies-how-many-users/
