The misunderstanding of code coverage

Code coverage is an important metric about your code and tests. And like any metric, you have to understand the underlying assumptions and computations to know what the value means and to use it effectively. So what is code coverage?

Code Coverage is the ratio of code that is covered by automatic tests.

This simple statement already contains a lot of detail. First, code coverage is a ratio - meaning its value is always between 0% and 100% (or 0 and 1 if you prefer the simpler mathematical notation of a ratio). It is a ratio of an absolute amount of code, which is a measure in itself and can be computed in different ways. Since the amount of code is the underlying metric, it determines the semantics of code coverage, as I will discuss in this article. Then there is the term covered, which is also part of the metric's name. Code that is covered is simply code that is executed by tests - nothing more. And the last part of the above definition limits the applicability of the metric to automated tests - be it unit tests, integration tests or system tests. One could imagine that this metric could also characterize manual tests, but there is no convenient way of computing it in such a scenario.
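To make the ratio concrete, here is a tiny Java sketch; the numbers are made up purely for illustration and are not taken from any real tool or project:

// Illustration only: a coverage tool counts how many units of code (lines,
// branches, ...) were executed at least once during the test run and divides
// that count by the total number of such units.
double lineCoverage(int coveredLines, int totalLines) {
    return (double) coveredLines / totalLines;
}

// With made-up numbers: 160 of 200 lines executed by tests
// gives lineCoverage(160, 200) == 0.8, i.e. 80% line coverage.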

How to determine the amount of code that could be covered?

This question seems simple at first, but the answer should be chosen carefully, since it has implications for the meaning of the derived coverage value. When talking about code coverage, there are two commonly used approaches that are supported by most mature tools: line coverage - based on the lines of code - and branch coverage - based on the different execution branches of the code. Other approaches are function coverage, edge coverage or condition coverage, but they are seldom reported, either because of their limited usefulness or because they are very close to the two metrics mentioned above. So in the following paragraphs I will focus on the two common coverage metrics.

Line coverage describes the proportion of lines of code that are covered by tests. Branch coverage describes the proportion of possible execution branches covered by tests. Both metrics tell you a similar story, since every branch is made up of lines of code. However, branch coverage is the one you should care about more. Imagine the following simple one-liner, which translates into several possible branches:

if (index % 3 == 0 && index % 5 == 0) return "FizzBuzz";

The execution branches here are: index % 3 == 0 is false; index % 3 == 0 is true but index % 5 == 0 is false; and both conditions are true. You want all of them covered, because each branch can determine the behavior of your code - and it is just one line. Now imagine a single branch with many lines of code that gets tested, and an alternate, much shorter branch that is not tested - usually this happens when the testability of the alternate branch is not as good. In this case line coverage is still very high, while branch coverage is at 50%. On the other hand, lines of code are not necessarily an indicator of how important the implemented behavior is - the missing branch could still be far more valuable for your software.
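To make the difference tangible, here is a small hypothetical sketch - the function and its values are invented for illustration, not taken from a real code base:

String describe(int value) {
    if (value >= 0) {
        // Long branch: several lines, all executed by the test below.
        int squared = value * value;
        int doubled = value * 2;
        String label = "positive or zero";
        return label + ": squared=" + squared + ", doubled=" + doubled;
    }
    // Short branch: a single line, never executed by any test.
    return "negative";
}

void testDescribeNonNegative() {
    // Only the non-negative branch is exercised.
    assert describe(3).startsWith("positive");
}

This single test executes every line of describe except the final return, so line coverage looks very good, while branch coverage of the if statement is only 50%.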

So my take here is that branch coverage is more sensitive to untested behavior, and whenever you can choose between the two, go for branch coverage - it does not hurt to look at both when you have them.

An impressive example

If you want to delve deeper into the effect of the different coverage metrics, I recommend reading a bit about the testing philosophy of SQLite, which became the most widely used database system in the world after they pushed their test coverage to the right level.

How high?

Typically one wants test coverage to be as high as possible, ideally 100%. However, there are some valid scenarios in which 100% coverage is not feasible.

The most common scenario is a typical legacy code base that is poorly tested or not tested at all. In such a scenario one cannot enforce 100% test coverage, because it would mean writing tests for months if not years for poorly testable code, without delivering any business value. There are a number of feasible approaches to this scenario with regard to code coverage, for example that any new code should be 100% covered by tests and should be inserted into the old code base via suitable seams. Read the book Working Effectively with Legacy Code by Michael Feathers for more ideas on how to treat and test legacy code.
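As a rough sketch of that approach - the classes and numbers below are invented for illustration, not taken from Feathers' book - the new behavior lives in its own, fully tested unit, and the legacy code only gains a single call through the seam:

// New code: small, fully covered by its own unit tests from the start.
class DiscountCalculator {
    double discountFor(double orderTotal) {
        return orderTotal > 100.0 ? orderTotal * 0.05 : 0.0;
    }
}

// Legacy code: untested and hard to test. The only change is one call
// through the seam, so the new behavior can be covered in isolation.
class LegacyCheckout {
    private final DiscountCalculator discounts = new DiscountCalculator();

    double finalPrice(double orderTotal) {
        // ... existing, untested legacy logic ...
        return orderTotal - discounts.discountFor(orderTotal);
    }
}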

Another closely related scenario is a code base that is coupled to a legacy framework which was designed without testability in mind - think tight, hard-to-resolve couplings or heavy reliance on singletons. In such a scenario you will frequently run into the boundaries of your framework and have to do heavy lifting to get your tests running - a typical symptom of such a system is complicated and expensive test fixtures. My advice for such situations is the humble object pattern (see xUnit Test Patterns by Gerard Meszaros): separate the testable logic from the untestable or hard-to-test framework code (the humble part of your code) and combine the two only at runtime. The business logic gets tested very well, while the humble code is hardly tested at all. This results in very good test coverage of the valuable code. In this scenario an overall 100% test coverage is not possible, and a general recommendation on how high it should be cannot be given, since that depends on your frameworks.
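A minimal sketch of that separation - the framework class and all names below are invented for illustration: the logic is a plain, easily tested class, while the humble object only forwards calls to it.

// Stand-in for a hypothetical legacy framework base class that is hard to test.
abstract class LegacyFrameworkScreen {
    void showPrice(double price) { /* talks to the untestable framework */ }
    abstract void onWeightEntered(double weightKg);
}

// Plain logic: no framework dependency, trivial to unit-test with full coverage.
class ShippingCostCalculator {
    double costFor(double weightKg) {
        return weightKg <= 1.0 ? 3.90 : 3.90 + (weightKg - 1.0) * 1.20;
    }
}

// Humble object: glued to the framework and kept so thin that leaving it
// untested is an acceptable risk.
class ShippingCostScreen extends LegacyFrameworkScreen {
    private final ShippingCostCalculator calculator = new ShippingCostCalculator();

    @Override
    void onWeightEntered(double weightKg) {
        showPrice(calculator.costFor(weightKg));  // framework call, not unit-tested
    }
}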

There is also another argument against 100% coverage: not all code is so complicated that it needs tests. This can be true for very trivial functions, but a test assures not only correct behavior but also long-term maintainability and extensibility. So if you have a function that seems too simple to be tested, you should at least write tests for it once change is lurking around the corner. A commonly cited example of such trivial, not test-worthy functions are getters and setters in common OOP languages like Java or C#. But getters and setters are violations of the principle of encapsulation, so I recommend avoiding them as much as possible. Nevertheless, such methods keep appearing. And often the answer to such not-test-worthy code is to lower the coverage bar.

Often all the above reasons occur together in the same code base. And typically an organization's answer to these oddities is a recommended or even mandatory code coverage threshold of 90% or 80% - not 100%. This is a reasonable compromise, but it has its shortcomings when applied wrongly - which is the topic of the following paragraphs.

The fallacy of code coverage

As stated in the beginning, code coverage describes the ratio of code that is executed by tests. It does not describe the amount of code that is actually tested by tests. Consider the following simplified example:

record Tuple(int first, int second) {}

Tuple compute(String input) {
    var a = computeA(input);
    var b = computeB(input);
    return new Tuple(a, b);
}

void testCompute() {
    Tuple result = compute("XXX");

    // Only the result of computeA is checked; computeB runs but its result is ignored.
    assert result.first() == 3;
}

The test method executes the method compute, which in turn executes computeA and computeB. However, the test is only concerned with the result of computeA and ignores the result of computeB. Test coverage is still very high here, since all lines of compute are executed, and even some of computeB's, yet computeB is not tested.
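One way to close that gap - the expected value 5 is made up here, just to illustrate the shape of the missing assertion - is to also verify the second result:

void testComputeBothResults() {
    Tuple result = compute("XXX");

    assert result.first() == 3;   // still checks computeA
    assert result.second() == 5;  // hypothetical expected value - now computeB is actually tested
}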

This is a common pattern in many big code bases: test coverage is very high, while certain results or side effects of the tested code are simply neglected. I've seen it in many code bases and have even done it myself a few times.

The above fact about coverage is often forgotten, which leads to a common misinterpretation of code coverage: that high code coverage indicates good testing. High code coverage does not automatically tell you that all parts of the code are tested. Low code coverage, on the other hand, tells you for sure that some parts of your code are not tested. Therefore code coverage should be regarded as a one-sided metric - only low coverage carries a message that can guide your decisions; high coverage means you have to look for other clues to judge the quality of your testing.

The culture of governing by metrics

The following argument applies to almost all metrics. A metric is a simplified descriptor of a complex reality. Such a metric can support decisions, but it should not be the sole basis for them. A metric can hint that something is not as it should be, but it should mostly be a trigger for further analysis rather than the sole reason for a decision. If you turn a metric into a hard decision criterion instead of treating it as a hint, people will eventually game the system and adapt their behavior to optimize the simplified metric instead of dealing with the complex reality.

Code coverage is such a simplification of your code (which is a complex world in itself). It gives you an intuition about the automated tests of your code base, but it is not the one and only truth, and it takes an expert to read and interpret it. However, some organizations make code coverage a hard rule in their build systems - for example, a pull request gets automatically rejected if it does not meet a certain threshold. The idea behind such a rule is that it enforces writing tests - and if you have to convince your developers to write tests with such a rule, I guess you have a bigger problem at hand. The bad part is: such rules only work to a certain degree. We developers are human beings, and we adapt to the constraints we are given. Faced with an arbitrary mandatory coverage threshold, some developers will eventually adapt their behavior towards getting coverage as high as possible without regard to the value of the written tests. For example, one can always increase coverage by writing tests for trivial code like getters and setters. On the other hand, once the mandatory threshold is reached, nobody is forced to write further tests to secure the behavior of the software. Of course such behavior is not professional, but organizations often create environments where unprofessional behavior is encouraged, especially if developers are managed via easily computed metrics like code coverage.
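As a caricature of that behavior, here is a sketch of the kind of test that raises coverage without securing any meaningful behavior; the Customer class is invented for illustration:

// Hypothetical trivial class with a getter and a setter.
class Customer {
    private String name;
    String getName() { return name; }
    void setName(String name) { this.name = name; }
}

// This "test" executes both methods, so coverage goes up, but it only
// confirms that a field assignment round-trips - no real behavior is secured.
void testCustomerName() {
    Customer customer = new Customer();
    customer.setName("Alice");
    assert customer.getName().equals("Alice");
}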

How to use code coverage

I highlighted a few misconceptions about code coverage and ways it should not be used. And still I am convinced that code coverage is a very important tool that helps me develop better code.

So how do I use it? First and most important: I write a lot of tests. It is part of my normal development cycle. If I encounter untested code, I write tests. If I add new behavior, I write tests. Sometimes I do strict TDD, sometimes I merely write the tests first, and when working with legacy code I have no choice but to write tests for code that is already written. Whatever works best. But I am also lazy; I don't want to write tests that add no value. So I start thinking about what should be tested, and thinking in test equivalence classes helps me a lot to determine which tests should be written and which are unnecessary. Basically, I analyze the problem space at hand (or the already present code in the legacy scenario) until I am sure what needs to be tested next. I write the tests - and when TDDing also the corresponding implementation - and go back to the beginning: it is an iterative process.
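As a small illustration of thinking in equivalence classes - reusing the FizzBuzz condition from earlier, with class boundaries chosen by me - one representative test per class is usually enough:

String fizzBuzz(int index) {
    if (index % 3 == 0 && index % 5 == 0) return "FizzBuzz";
    if (index % 3 == 0) return "Fizz";
    if (index % 5 == 0) return "Buzz";
    return Integer.toString(index);
}

// One test per equivalence class instead of one test per possible input.
void testFizzBuzzEquivalenceClasses() {
    assert fizzBuzz(15).equals("FizzBuzz");  // divisible by 3 and 5
    assert fizzBuzz(9).equals("Fizz");       // divisible by 3 only
    assert fizzBuzz(10).equals("Buzz");      // divisible by 5 only
    assert fizzBuzz(7).equals("7");          // divisible by neither
}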

And sometimes in between - mostly towards the end of a larger chunk of work - I decide to look at the code coverage. When I do, I have a very specific expectation about the coverage value; sometimes I even expect certain lines to be covered a specific number of times. And when I look at the results, I validate that my expectation holds. My mental representation of the code and the tests generates this expectation, and my mental representation should match reality. If it does not match, it means I have work to do, because clearly I haven't understood the code as well as I thought I had.

So code coverage is a useful little tool that supports my development flow. Besides that, I also sometimes look at the global coverage just to get an intuition about the overall state of the tests, but I do not rely on this global number.

Now this helps me write good tests, but I am, like anybody, flawed, and sometimes all my efforts are not enough to produce high-quality tests. So how do I ensure good test quality if not through the metric of test coverage? In my opinion the best tool is a thorough code review by another expert of the code, one that includes the test code as well. Test code has to be read differently, but one quickly gets used to it and can start asking the important questions about missing test cases, hard-to-read tests and a number of known test smells. And of course the reviewer is not perfect either, so there is a slight risk that lower-quality test code gets accepted into the main branch now and then, but chances are that the next time someone encounters these flawed tests they will be rewritten and improved. So altogether test quality will be high, even without relying on the metric of code coverage to judge it.

Update

Just by coincidence I found this little note from Martin Fowler, which is a nice short version of my text above. It seems I am not the first to notice this (I am not surprised ;) ).