Estimating the number of false positive tests



All tests have a reported sensitivity (the percentage of infected people who correctly test positive) and specificity (the percentage of non-infected people who correctly test negative). When we test people without knowing whether they are infected, the conditional probabilities flip: we need to calculate the chance that a person is infected (or not), given the test result. The method for flipping conditional probabilities is called Bayes’ Theorem.

If we know three things, we can easily determine the number of true positive and false positive COVID-19 tests. Those are the prevalence of COVID-19 in the community being tested and the sensitivity and specificity of the SARS-CoV-2 rRT-PCR test. Prevalence is the probability of someone being infected, Pr(infected). Sensitivity is the probability of getting a positive test for a person known to be infected, and specificity is the probability of a negative test for a person known to not be infected.

Bayes’ Theorem is used to flip the conditional probabilities of sensitivity and specificity to find the probabilities of true positives, Pr(infected | + test), and false positives, Pr(not infected | + test). (This notation is read as “the probability of being infected, given a positive test” and “the probability of not being infected, given a positive test.”)
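As a sketch in standard notation (assuming the usual statement of Bayes’ Theorem, with "+" as shorthand for "+ test"), the flip can be written as:

$$
\Pr(\text{infected} \mid +) = \frac{\Pr(+ \mid \text{infected})\,\Pr(\text{infected})}{\Pr(+)},
\qquad
\Pr(\text{not infected} \mid +) = \frac{\Pr(+ \mid \text{not infected})\,\Pr(\text{not infected})}{\Pr(+)}
$$

Here Pr(+ | infected) is the sensitivity, and Pr(+ | not infected) is one minus the specificity.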

Unfortunately, we don’t know any of those parameters. The sensitivity and specificity are reported by test manufacturers, all of which claim 100% specificity, but that’s about as reliable as every car manufacturer claiming they have the best fuel economy. Cohen & Kessel’s paper examined 8 different PCR tests that all advertise 100% specificity. When the authors reviewed real data on PCR tests for RNA viruses since 2004 (rather than manufacturer claims), they found a median specificity of 97.7%. The majority fell between 96% and 99%, with one as low as 84%. For this analysis, we’ll use 97.7% specificity and 99% sensitivity. (The number of false positives is not affected much by sensitivity.)

The prevalence is unknown and varies as the disease spreads through the population. We can impute the prevalence from the percentage of positive tests each day. This is not exact because the tests are not performed randomly on the entire population, so the percentage of positive tests is probably greater than the percentage that would be found in a random sample of the population. This causes an overestimation of prevalence, but that actually reduces the calculated number of false positive results and makes the estimate conservative.

Here’s the procedure we used to impute prevalence:

From conditional probability for sensitivity and specificity:
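The relation meant here is presumably the law of total probability, which splits the overall chance of a positive test into true positives and false positives. A sketch, with "+" as shorthand for "+ test":

$$
\Pr(+) = \Pr(+ \mid \text{infected})\,\Pr(\text{infected}) + \Pr(+ \mid \text{not infected})\,\Pr(\text{not infected})
$$

The first conditional probability is the sensitivity, and the second is one minus the specificity.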


We assumed the percentage of positive tests to equal Pr(+ test) and label it %+. We let the prevalence be Pr(infected) and label it with p (so Pr(not infected) = 1 – p). To simplify notation, let S1 be the sensitivity and S2 be the specificity.

Now we can solve for p with some algebra:
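A reconstruction of the algebra, assuming the starting point is the total probability of a positive test written in the notation above (with $S_1$ and $S_2$ standing for S1 and S2):

$$
\begin{aligned}
\%{+} &= S_1\,p + (1 - S_2)(1 - p) \\
\%{+} - (1 - S_2) &= p\,(S_1 + S_2 - 1) \\
p &= \frac{\%{+} - (1 - S_2)}{S_1 + S_2 - 1}
\end{aligned}
$$

With S1 = 0.99 and S2 = 0.977, the numerator is negative whenever the share of positive tests falls below 2.3%.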

This presents a problem because we are not guaranteed a positive numerator, especially when the percentage of positive tests in a given day is very low. To overcome this problem, we set a baseline prevalence equal to zero (or the minimum percentage of positive tests).
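The imputation with a baseline can be sketched in Python. This is a minimal illustration, not the code used for the analysis: it assumes the algebra yields p = (%+ − (1 − S2)) / (S1 + S2 − 1), and the function name `impute_prevalence` and its `floor` parameter are hypothetical names chosen for this example.

```python
def impute_prevalence(pct_positive, sensitivity=0.99, specificity=0.977, floor=0.0):
    """Impute prevalence p from the share of positive tests.

    Assumes %+ = S1*p + (1 - S2)*(1 - p), solved for p. The result is
    clamped so it never drops below `floor` (the baseline prevalence)
    and never exceeds 1.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    numerator = pct_positive - (1.0 - specificity)
    p = numerator / denominator
    return min(max(p, floor), 1.0)


# Example shares of positive tests (illustrative values, not real data).
for pct in (0.01, 0.05, 0.10):
    p = impute_prevalence(pct)
    print(f"{pct:.1%} positive tests -> imputed prevalence {p:.2%}")
```

With the default parameters, a day with 1% positive tests would produce a negative raw estimate, so the function returns the baseline instead. Passing the minimum observed share of positive tests as `floor` gives the alternative baseline mentioned above.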

Using this technique, the prevalence of the disease is assumed to fluctuate over time, mirroring the number of cases (i.e., positive tests).

We applied this technique, with 99% sensitivity and 97.7% specificity, to the Florida Department of Health testing data since April 1, 2020. Of the 2,108,674 positive tests to date, over a quarter are likely false positives. The graph below illustrates this with 7-day averages to make it easier to read.

The reason there are so many false positives is that prevalence is frequently very low (below 5%). As you can see from the graph below, the probability of a false positive moves in the opposite direction of prevalence and in much more dramatic fashion. At 10% prevalence, almost one in five positive tests is a false positive. At 5%, almost one-third of positive tests are false. At 2.5% prevalence, nearly half of positive tests are false.
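These figures can be checked with a short Python sketch. It assumes the Bayes calculation described earlier, with 99% sensitivity and 97.7% specificity; the function name `false_positive_share` is a hypothetical name used only for this illustration.

```python
def false_positive_share(prevalence, sensitivity=0.99, specificity=0.977):
    """Return Pr(not infected | + test) for a given prevalence.

    True positives occur at rate sensitivity * prevalence; false
    positives occur at rate (1 - specificity) * (1 - prevalence).
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return false_pos / (true_pos + false_pos)


# Prevalence levels discussed in the text.
for p in (0.10, 0.05, 0.025):
    share = false_positive_share(p)
    print(f"prevalence {p:.1%}: {share:.1%} of positive tests are false")
```

Under these assumptions the shares come out to roughly 17%, 31%, and 48%, in line with "almost one in five," "almost one-third," and about half.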

This is why the FDA’s SARS-CoV-2 rRT-PCR fact sheet says in big, bold letters: “This test is to be performed only using respiratory specimens collected from individuals suspected of COVID-19 by their healthcare provider” (emphasis added). When only symptomatic people are tested, the prevalence in the sample should be closer to 50%, which drops the probability that a positive test is false to about 2%. If we continue mass testing asymptomatic people when prevalence is 1%, the probability that a positive test is false is about 70%.

  • Thank you for your excellent explanation of Bayesian statistical analysis. Many physicians are surprised at the number of false positives that occur using highly sensitive and specific tests when the incidence of the disease is low and the population is randomly tested. We saw this with AIDS, and it is the reason for obtaining confirmatory second tests using a different technology.
