How to measure code quality improvements?

When a company is aware of problems related to the quality of its software, it needs a strategy to progress on different topics: testing, deployment, code quality, etc. On this last point precisely, the action plan often includes training, technical coaching, a change in culture and team organization model, the deployment of software solutions, etc. After several months of concrete implementation, how can we measure code quality improvement? Which indicators should be implemented? Is it even possible? Here are some answers.

But first, can we define code quality?

This question comes up often. What are the criteria that define code quality? How can we measure code quality? Benoit Ganteaume talks about it regularly in his excellent podcast “Artisan Développeur” (in French), especially in this episode.

While there are many opinions and positions on this subject, let’s keep in mind that quality code is first and foremost clean code: readable and understandable by anyone other than its author, tested (and therefore testable by nature), and meeting the needs specified by the business. It is, therefore, code that is easy to maintain. For the rest of this article, we will assume that code quality is measured (in an ideal world) according to these criteria, even though others, such as performance, may be relevant in some contexts.

Here are five suggestions for indicators that can measure code quality improvement.

#1 Code quality metrics

A traditional approach uses automatic code analysis tools to identify common issues and improvement areas, which are generally called “code smells”. Linters, such as ESLint, are dedicated to that. Those tools embed a set of metrics to evaluate the maintainability, reliability, extensibility, or clarity of the code. For example:

Code coverage rate
Number of unit tests
Cyclomatic and cognitive complexity of methods
Code duplication rate
Size of classes and methods
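
To make one of these metrics concrete, here is a minimal sketch of how cyclomatic complexity might be approximated for Python functions: start at 1 and add one for each branching construct. The function name `cyclomatic_complexity`, the sample source, and the exact set of counted constructs are assumptions for illustration; real analysis tools apply more detailed rules.

```python
import ast

# Node types that each add one decision point (a simplifying assumption).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> dict[str, int]:
    """Return an approximate cyclomatic complexity per function in `source`."""
    results = {}
    tree = ast.parse(source)
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        score = 1  # one path through a function with no branches
        for node in ast.walk(func):
            if isinstance(node, _BRANCH_NODES):
                score += 1
            elif isinstance(node, ast.BoolOp):
                # `a and b and c` adds two extra short-circuit paths
                score += len(node.values) - 1
        results[func.name] = score
    return results


# Invented sample code to analyze.
SAMPLE = '''
def simple(x):
    return x + 1

def branchy(x, y):
    if x > 0 and y > 0:
        return 1
    for i in range(y):
        if i == x:
            return i
    return 0
'''

print(cyclomatic_complexity(SAMPLE))
```

A function with no branches scores 1, and every additional condition or loop raises the score, which is why long methods full of nested conditions stand out in such reports.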

We can then track how these indicators evolve over time to see the results obtained by the development team. Decreasing code complexity, reducing the number of code defects, and improving coverage all naturally suggest improvement.
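
As a rough illustration of tracking these indicators over time, the sketch below compares two metric snapshots and labels each indicator as improved, degraded, or unchanged. The `Snapshot` structure, its field names, and the sample values are all invented for the example; an actual analysis tool would export its own format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical metrics exported by an analysis tool on a given date."""
    date: str
    coverage_pct: float
    duplication_pct: float
    avg_complexity: float


# For each metric, +1 means "higher is better", -1 means "lower is better".
DIRECTION = {"coverage_pct": +1, "duplication_pct": -1, "avg_complexity": -1}


def trend(first: Snapshot, last: Snapshot) -> dict[str, str]:
    """Label each metric as improved, degraded, or unchanged."""
    labels = {}
    for name, direction in DIRECTION.items():
        delta = (getattr(last, name) - getattr(first, name)) * direction
        if delta > 0:
            labels[name] = "improved"
        elif delta < 0:
            labels[name] = "degraded"
        else:
            labels[name] = "unchanged"
    return labels


# Invented sample values.
january = Snapshot("2023-01-01", coverage_pct=52.0, duplication_pct=8.5, avg_complexity=11.0)
june = Snapshot("2023-06-01", coverage_pct=64.0, duplication_pct=9.0, avg_complexity=11.0)

print(trend(january, june))
```

Encoding the "good" direction per metric keeps the comparison honest: higher coverage counts as progress, while higher duplication does not.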

While they offer a first level of insight into quality trends, we should keep in mind that these indicators cannot, on their own, reflect a real improvement in code quality: do the refactorings that reduce complexity actually make the code more maintainable? Do the added unit tests, which have increased coverage, provide real value?

We must be careful that the team’s efforts do not solely target these indicators. Fundamental problems (why few tests? why complex code?) must be addressed first, and implementing new practices in the team should be a priority beyond the indicators themselves.

#2 Velocity

Velocity highlights the effort that a team can invest over a given period (usually a Sprint). In Agile methods, this effort is often measured in story points. The higher the velocity, the more tasks the team will be able to “accomplish.” Thus, if the state of the code has improved in the last few months, changes have likely become less costly, and we would expect to observe an increase in velocity.

However, velocity is not only related to the quality of the code. The backlog management process, the external constraints on the team, and the impact of the project’s stakeholders are all factors that can generate delay and complexity for the technical team. Also, if the way effort is estimated (story points) evolves, or if the tasks are broken down differently, velocity will change even if the state of the code remains unchanged.
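
For reference, velocity itself is simple to compute. The sketch below averages the story points completed per sprint over two periods and reports the relative change. The `velocity` helper and the sprint figures are invented for illustration.

```python
from statistics import mean


def velocity(completed_story_points: list[int]) -> float:
    """Average story points completed per sprint."""
    return mean(completed_story_points)


# Invented data: story points completed per sprint, before and after
# the code quality initiative.
before = [21, 19, 23]
after = [30, 28, 32]

change = (velocity(after) - velocity(before)) / velocity(before)
print(f"velocity before: {velocity(before):.1f}")
print(f"velocity after:  {velocity(after):.1f}")
print(f"relative change: {change:+.0%}")
```

A jump like this could reflect cleaner code, but it could just as well come from a change in how the team estimates, which is why the number is best read alongside the factors listed above.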

The time spent validating tasks is also an interesting indicator: for example, the time spent reviewing code, the number of round trips between the author and the reviewer, and the amount of feedback given in reviews. One can imagine that clean and clear code will require less time to review.

#3 The number of bugs

When we talk about quality improvement, we naturally think of a reduction in the number of bugs. In a “Quality by design” approach, bug identification is integrated early and continuously into the development phases. It is possible to measure the number of bugs reported on the application over several months using ticket management tools (such as Jira).

This value, although interesting, must be read in light of other criteria. For example, if the number of software users has quadrupled in recent months, it would not be surprising to see an increase in the number of bugs reported. It is also necessary to analyze each bug to see if it comes from a problem in the code. Sometimes a bug is due to a simple configuration error in the cloud infrastructure. A preliminary categorization of each bug reported over several months is therefore relevant.
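
The two adjustments described above (categorizing bugs and accounting for the number of users) can be sketched as follows. The `Bug` record, the category names, and all figures are hypothetical; a real ticket management tool would expose its own fields.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Bug:
    """A reported bug with a manually assigned category (hypothetical schema)."""
    month: str
    category: str  # e.g. "code", "configuration", "infrastructure"


def code_bugs_per_1000_users(bugs: list[Bug], users_by_month: dict[str, int]) -> dict[str, float]:
    """Count only code-related bugs and normalize by active users per month."""
    counts = Counter(b.month for b in bugs if b.category == "code")
    return {
        month: round(1000 * counts[month] / users, 2)
        for month, users in users_by_month.items()
    }


# Invented sample data: the user base quadruples between the two months.
bugs = (
    [Bug("2023-01", "code")] * 10
    + [Bug("2023-01", "configuration")] * 2
    + [Bug("2023-06", "code")] * 24
    + [Bug("2023-06", "infrastructure")] * 6
)
users = {"2023-01": 5_000, "2023-06": 20_000}

print(code_bugs_per_1000_users(bugs, users))
```

In this invented data set the raw number of reports rises from 12 to 30, yet the rate of code-related bugs per 1,000 users falls, which is the kind of nuance a raw count hides.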

#4 “Four Key Metrics”

DORA (DevOps Research and Assessment) produced the “State of DevOps Report” in 2019, summarizing six years of research on the operational performance of IT organizations. Part of their research results appeared in the book “Accelerate” (Nicole Forsgren, Ph.D., Jez Humble, and Gene Kim).

It presents four indicators, the Four Key Metrics, which highlight the performance of development teams:

Deployment Frequency (DF) – The frequency at which releases occur
Lead Time for Changes (LTFC) – The average time for a code change to reach production
Change Failure Rate (CFR) – The percentage of releases that cause a failure in production
Mean-Time-To-Restore (MTTR) – The time required to restore service after a problem appears in production (rolling back to the previous version if necessary)
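
As a rough sketch, the four metrics listed above could be derived from a log of production deployments. The `Deployment` record, its field names, and the sample data are assumptions made for this example, not the schema of any particular tool; lead time is taken here as the delay between commit and deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """One production release (hypothetical record, not a real tool's schema)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    restored_at: Optional[datetime] = None  # set only if the release failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def four_key_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute DF, LTFC, CFR and MTTR over a period (assumes at least one deployment)."""
    failures = [d for d in deployments if d.restored_at is not None]
    restore_times = [hours(d.restored_at - d.deployed_at) for d in failures]
    return {
        "deployments_per_week": len(deployments) / (period_days / 7),
        "lead_time_hours": mean(hours(d.deployed_at - d.committed_at) for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "time_to_restore_hours": mean(restore_times) if restore_times else 0.0,
    }


# Invented sample data covering a 14-day period; the third release failed
# and service was restored two hours later.
t0 = datetime(2023, 6, 1, 9, 0)
deployments = [
    Deployment(t0, t0 + timedelta(hours=24)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=12)),
    Deployment(
        t0 + timedelta(days=7),
        t0 + timedelta(days=7, hours=36),
        restored_at=t0 + timedelta(days=7, hours=38),
    ),
    Deployment(t0 + timedelta(days=10), t0 + timedelta(days=10, hours=24)),
]

metrics = four_key_metrics(deployments, period_days=14)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Keeping one record per release makes all four metrics fall out of the same data, which is part of what makes them cheap to follow over time.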

At first glance, these metrics seem closer to a DevOps culture than a “code quality” culture. However, it is interesting to observe that:

In a project where the code is well structured and allows for low-cost evolutions, deployments are likely to be done more frequently.

In a project where the code is automatically tested at different levels (unit/integration/functional/…), thus limiting the risk of regressions, the team will be more confident delivering frequently, and the CFR is likely to decrease.

These four indicators are, in fact, closely related. Indeed, if the continuous integration and deployment (CI/CD) process is fully automated, the DF will be relatively high and the LTFC relatively short. On the other hand, if there is no testing and continuous code improvement strategy, the CFR will probably suffer: bugfixes and hotfixes will be frequent.

These four key metrics are relevant indicators of the level of code control within an organization. Without them, the team will lack visibility into the impact and consequences of the changes made to the code: does the code do the same thing as before? Gaining control increases the team’s confidence to evolve its code base and deliver it better. “We’ll delay delivery by three days, and we still have several manual tests to perform”: if this sentence comes up often, it may indicate a lack of confidence in the team’s code, but also a testing strategy that could be improved.

#5 Team feelings & customer satisfaction

Over time, the satisfaction of the business (the customer) is measured in part by the ability of the team to adapt to changes and consistently deliver value. While task estimates are wrong by nature, the gap between estimate and reality tends to widen when the code is complex. A feature may seem easy to implement in theory, but in practice the time required varies depending on the state of the code or the presence of non-regression tests. If such discrepancies occur frequently, they may generate frustration on the business side.

On the developers’ side, long tenure in a team often indicates a healthy environment: good cohesion between people, exciting challenges, enriching work methods, and so on. Conversely, frequent turnover is a sign that the environment, including the code, may be toxic for developers. Staying too long on a project with bad code and no sign of improvement can have harmful consequences.

Ultimately, end-user satisfaction is also a good indicator. A mobile application rated 5 stars on app stores is a good thing, especially at launch. However, if this rating remains constant over 12 or 24 months, it’s even better! This means that there are regularly new features and added value, and almost no bugs or regressions (which comes back to point #4, don’t you think?).

This human, highly qualitative feedback, from both the development team and the business side, should, in our opinion, always be taken into account in this kind of continuous improvement process.

Bonus: A classic…

Proposed in 2008 by Thom Holwerda, the rather subjective “WTFs per minute” indicator measures some operational complexities in the code 😉

Do we really need indicators?

Are you familiar with “Goodhart’s Law”? It states that when a measure becomes a target, it ceases to be a good measure. In a context where we are constantly looking for Key Performance Indicators (KPIs), we have to stay vigilant about these indicators, which can easily be gamed or twisted in a counterproductive way: the best example is the addition of worthless unit tests, with the sole objective of increasing code coverage.
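
To illustrate what a worthless unit test can look like, here is a small hypothetical example. The `apply_discount` function and both tests are invented for the purpose; the point is only that the first test raises coverage without verifying anything.

```python
def apply_discount(price: float, percent: float) -> float:
    """Return `price` reduced by `percent`, rejecting invalid percentages."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)


def test_for_coverage_only():
    # Executes the function, so coverage goes up, but checks nothing:
    # this test would still pass if the formula were wrong.
    apply_discount(100, 50)


def test_with_real_value():
    # Verifies the behavior the business cares about, including the error case.
    assert apply_discount(100, 50) == 50
    assert apply_discount(100, 0) == 100
    try:
        apply_discount(100, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for percent > 100")


test_for_coverage_only()
test_with_real_value()
```

Both tests execute the nominal path of `apply_discount`, so a coverage report counts those lines as covered either way, yet only the second one would catch a regression.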

From our perspective, management by metrics is a slippery slope, which often reflects a problem of trust between management and IT teams. It would be better to study alternatives and work on this fundamental problem first, rather than having a strategy that only aims to improve these indicators.

What does this mean?

In this article, we covered several complementary approaches to evaluate improvements in code quality over time. The latest trends support the use of the “Four Key Metrics,” which is explained by the growing adoption of DevOps culture within companies. In addition to the quantitative aspect of the metrics, the qualitative approach and the human feeling are good complements to evaluate the impact of a code quality improvement approach.

“Raising the bar” in the Software Craftsmanship manifesto is a key concept: the ability of a team to continuously raise its level must be a top priority. The indicators mentioned in this article will naturally improve if this culture is there. The team must therefore be supported in bootstrapping this dynamic. Practices such as pair programming, Craft Workshops, code review, and TDD are ways to achieve this.

From your perspective, do you have any indicators to suggest? Or do you need advice on how to implement a continuous improvement strategy to optimize code quality? Let’s talk about it!