
Mythical Code Coverage

Introduction

What is the exact coverage of the unit tests in your project?

It does not matter. If you find that blasphemous, read on.

Code coverage, usually expressed as the percentage of the code-base that is exercised by unit tests, has become a popular "metric" in the area of quality assurance. A single, easy-to-process number is in many cases a goal in itself and sometimes even a tool for expressing the team's dedication to high quality standards. As a number, test coverage is also often (mis)used as a basis for rewarding or comparing the work of individual developers or whole teams. This is very unfortunate and I will try to explain several reasons why.

Basic problem with interpretation

The biggest problem with numbers of any kind is that we tend to treat them according to their numerical nature, often forgetting the hidden, non-numerical details associated with them.

Code coverage, if expressed as a percentage of the total code-base, does not look different from any other number expressed in the same way, which means that we implicitly give it the same properties:

- monotonicity in terms of effort: raising coverage by one percentage point should always cost roughly the same amount of work,
- monotonicity in terms of effect: higher coverage should always mean better-tested, safer code,
- linearity: 50% coverage should mean being halfway through the testing work.

None of these properties holds for code coverage.

Coverage is not monotonic in terms of effort, because different parts of the code differ in how hard they are to test - simple getters/setters are a lot easier to cover than code with complex control flow. It is not monotonic in terms of effect either, as explained below with the problem of test relevance. Similarly, coverage is not linear: having 50% coverage does not mean that we are halfway through our testing work - compare this with the problem of painting the fence.

These problems make coverage numbers difficult to interpret, and in the effort/effect aspect in particular they lead to clear injustice when coverage is used as a basis for rewarding developers for their work. It should not be surprising that without these fundamental numeric properties, code coverage can easily be abused not only by managers, but also by developers themselves, who will be prone to take the easy path and pump up their statistics by targeting the low-hanging fruit rather than what is actually needed - and what is actually needed can be expressed in terms of test relevance.

Relevance of tests or 80/20 rule striking again

The so-called "80/20 rule" is very often cited to discourage premature optimization. The gist of this rule is that only a small fraction of the program (say 20%) is executed most of the time (say 80%) and this is where the engineering effort should be focused.

This rule seems valid for the placement of optimization effort - but is it also relevant to testing? Many engineers will probably answer that testing and quality assurance are completely different, because misplaced optimization effort at worst results in a smaller optimization effect for the given investment, whereas non-tested code is a shame no matter where it is. This is certainly true, but the practice of test coverage leaves some unavoidable holes anyway. In other words - do you really have 100% coverage? I guess not.

The fact that code is usually not tested with 100% coverage means that only part of the code-base is tested - but which part, exactly? Is it the part that is used most often and is therefore most relevant to final users? Or rather the part that is used only occasionally, or maybe not at all (think of all the commands and options in your favorite text editor for ideas)?

The problem is that if coverage information is not correlated with the relevance of the covered code, then the number that expresses the coverage means nothing - it does not translate into value delivered to final users. In other words, the fact that the tests have 50% coverage says nothing about how important the covered code is - and the main observation behind the 80/20 rule is that different parts of the code have different value.

False sense of security

The false sense of security comes from the fact that the universe of all state transitions is usually much bigger than the visible code structure - in other words, some important details are hidden. Here is an easy example in C:

#include <string.h>

int main(int argc, char * argv[])
{
    char firstArg[10];
    strcpy(firstArg, argv[1]);

    // ...
}

The above program makes a copy of its first parameter (presumably in order to modify it) and goes on with some important processing. It is very easy to test, and the part shown can be covered 100% by any single test that uses a short value for the first parameter. In other words, the lines above are covered by the unit tests.

Does that mean we can feel safe about this part of the code? Or that we can declare it correct? Certainly not, because there is a deadly buffer overflow bomb ticking there, waiting for its prime time - the lines of code are easily covered, but the boundary conditions of the code can still be untested. It is deceptively easy to miss this after reaching "good" coverage values.
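
To make this concrete, here is a minimal sketch (my code, not part of the original example) in which the copying logic is moved into a hypothetical copyFirstArg() function so that a test can call it directly:

#include <cstring>

// Hypothetical helper mirroring the snippet above: the copy is unchanged,
// only moved into a function that a test can call directly.
void copyFirstArg(const char * arg)
{
    char firstArg[10];
    strcpy(firstArg, arg);

    // ...
}

int main()
{
    // This single call executes every line of copyFirstArg, so the coverage
    // report shows 100% for it - yet the buffer overflow is still there.
    copyFirstArg("short");

    // A test with a long argument would expose the problem, but nothing in
    // the coverage number says that such a test is missing:
    // copyFirstArg("an argument much longer than nine characters");

    return 0;
}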

Another interesting example of things that are easily forgotten is multi-threaded interaction. The tests might cover every single line of code in isolation (in other words, each unit test can cover some part of the code and the totality of the tests can produce nice-looking coverage), but the code can still contain multi-threading bugs related to improper synchronization, deadlocks, and so on. Again, high coverage numbers can distract developers (and their managers) from the threats that are still there in the code.
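
As an illustration, consider the following sketch (a hypothetical HitCounter class of my own, not taken from any particular code-base): every line of the class is executed by a single-threaded test, yet the increment is a classic read-modify-write race.

#include <thread>

class HitCounter
{
public:
    void hit() { ++count_; }             // not synchronized
    int value() const { return count_; }

private:
    int count_ = 0;
};

int main()
{
    // Single-threaded "unit test": every line of HitCounter is executed,
    // so the class shows up as fully covered.
    HitCounter c;
    c.hit();
    if (c.value() != 1) return 1;

    // The data race appears only when two threads call hit() concurrently -
    // a scenario that line coverage says nothing about.
    HitCounter shared;
    std::thread t1([&shared] { for (int i = 0; i < 100000; ++i) shared.hit(); });
    std::thread t2([&shared] { for (int i = 0; i < 100000; ++i) shared.hit(); });
    t1.join();
    t2.join();
    // shared.value() will frequently be less than 200000 here.

    return 0;
}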

Yet another example is the huge number of implicit control-flow paths that are not visible at the level of source code, but that can be triggered under some conditions. There is a spectacular C++ example from Guru of the Week by Herb Sutter:

String EvaluateSalaryAndReturnName(Employee e)
{
    if (e.Title() == "CEO" || e.Salary() > 100000)
    {
        cout << e.First() << " " << e.Last() << " is overpaid" << endl;
    }

    return e.First() + " " + e.Last();
}

The complexity of this code is very small and it is easy to construct a set of tests (two, perhaps?) that covers every single line of it. The problem is that, due to possible exceptions, this code has 23 (yes, twenty-three) different execution paths, even though only two are explicitly visible. Again, no matter how spectacular the code coverage is, it is deceptive, because it does not tell the whole story about what is being tested.
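
The same effect can be shown in a self-contained form. In the following sketch (my own simplified example, not Sutter's), a single happy-path test covers every line, while the exception paths introduced by std::stoi and push_back never appear as branches in the source code:

#include <string>
#include <vector>

std::vector<int> parseAll(const std::vector<std::string> & inputs)
{
    std::vector<int> result;
    for (const std::string & s : inputs)
    {
        // std::stoi can throw std::invalid_argument or std::out_of_range,
        // and push_back can throw std::bad_alloc - exit paths that are
        // invisible in the code structure and in the coverage report.
        result.push_back(std::stoi(s));
    }

    return result;
}

int main()
{
    parseAll({"1", "2", "3"});   // full line coverage, no exception path tested
    return 0;
}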

So what does it really mean to "cover" some piece of code? Unfortunately, it only means that there exists at least one set of input values that went through the code without breaking anything - it does not yet mean that the code is correct. There is a lot more to correctness than demonstrating flawless execution with a single set of input data, and unit tests are not particularly efficient at capturing this difference.
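
Here is a short hypothetical example (mine) of a "covered" function that is simply wrong: the century rule for leap years is missing, yet one test with the year 2016 executes every line and passes.

#include <cassert>

bool isLeapYear(int year)
{
    return year % 4 == 0;   // century rule missing: 1900 is wrongly accepted
}

int main()
{
    assert(isLeapYear(2016));     // passes, and the function is now 100% covered
    // assert(!isLeapYear(1900)); // the missing test that would reveal the bug

    return 0;
}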

Tuning the numbers or You Will Get What You Measure

There is an old rule in management saying that employees will always "optimize" their work according to the metrics used to judge their performance. Whether they are rewarded for the number of lines written per day, the number of bugs fixed, or the number of phone calls serviced in a call center, they will certainly invent clever methods to tune their statistics. The volume of code for a given functionality can be inflated virtually without limit, bugs can be introduced just so that they can be fixed (no, this is not a joke - it has really been observed), and lengthy customer phone calls can be cut short to get more connections during the day - just to give simple examples.

Is it possible to tune the coverage statistics as well? Sure. It is enough to inflate the code that is already covered, which increases the total coverage rate without writing a single additional test. Here is an example, this time in Java:

if (condition) {
    System.out.println("Hello world!");
} else {
    System.out.println("Bye!");
}

Suppose that there already exists a test that causes this code to execute with a true condition - the first branch will be executed (and therefore covered), whereas the other branch will not. How can the coverage be increased? By writing another test case? That is one possibility, but there is another:

if (condition) {
    String message = "Hello world!";
    System.out.print(message);
    System.out.println();
} else {
    System.out.println("Bye!");
}

(Readers will certainly invent many more ways to inflate the covered branch.)

Above, the amount of covered code increased even though no new test was written - this is exactly the same technique that is used when developers are rewarded for the number of lines of code, except that here they have to be clever enough not to inflate everything uniformly...

Remember: You Will Get What You Measure.

What to do?

Does this mean that measuring test coverage is useless? Not really - every metric has some merit and tells us something, but developers and their managers should certainly avoid turning code coverage into a fetish. The numbers in this metric carry so little meaning, are so detached from actual problems, and are so easy to fake that using any precision finer than a few distinct levels (for example low..medium..high) is beyond what this metric can really offer.

It seems that proper evaluation of the testing effort is the Art part of The Art Of Programming - do not make the mistake of trying to hammer the richness of this art into a single number. Such a number is easy to obtain and use, and therefore very attractive, but it is also misleading. After all, it is not possible to control anything - including quality - if the control scheme is based on wrong numbers.