Getting a good read on the success of your QA group isn’t always easy. Any number of factors, like the size of your team, its relationship with the dev group, location, experience level, and a host of other variables, can greatly affect not only performance, but how you can monitor and gauge that performance. The scope of the project and your leadership are also massive factors, and they may be completely out of your control.
The traditional items to track tend to be very simple.
1) Number of bugs found
- Bugs found per tester, per allotment of time (i.e. Tester X has a bug find rate of 3.7 bugs per hour)
- Number of bugs found by severity (i.e. Tester X found 4 A class bugs, 9 B class bugs and 11 C class bugs in a sprint)
2) Issues entered and then flagged as Not a Bug by the Dev/Prod team
3) Issues entered and then flagged as Known Shippable by the Dev Team
4) Total number of issues addressed during a regression run
- Issues registered as Confirmed Fixed in a regression run
- Issues registered as Fix Failed in a regression run
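For illustration, these baseline counts could be tallied from a bug DB export along these lines (the record layout, status labels, and tester names here are all hypothetical; every tracker exposes this differently):

```python
from collections import Counter

# Hypothetical bug records as (tester, severity, status) tuples pulled
# from a bug DB export. Field names and statuses are illustrative only.
bugs = [
    ("Tester X", "A", "Open"),
    ("Tester X", "B", "Not a Bug"),
    ("Tester Y", "C", "Known Shippable"),
    ("Tester Y", "B", "Confirmed Fixed"),
    ("Tester X", "C", "Fix Failed"),
]

# Bugs found per tester and per severity class.
per_tester = Counter(t for t, _, _ in bugs)
per_severity = Counter(s for _, s, _ in bugs)

# Counts for the other baseline items above.
not_a_bug = sum(1 for _, _, st in bugs if st == "Not a Bug")
known_shippable = sum(1 for _, _, st in bugs if st == "Known Shippable")
regressed = sum(1 for _, _, st in bugs if st in ("Confirmed Fixed", "Fix Failed"))

print(per_tester, per_severity, not_a_bug, known_shippable, regressed)
```

A handful of fields is enough to produce every number in the list above, which is part of why these metrics are so universally tracked.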
Now, depending on your product, your timeline, your organization and a variety of other factors, there are quite a few more items you are likely tracking, but let’s use these as our baseline for items that pretty much every QA group will track. Also, terminology will be different from group to group, but the core ideas usually remain the same. Take trackable items, examine how many of each both your team and each individual addresses, and use those to gauge performance.
And it makes sense. These are absolutely items that your team should be tracking. There is valuable data there that will help you with both your current project, and to give you a historical perspective with your future projects.
But remember that this data doesn’t tell you the whole story, and that the information tracked isn’t always as valuable for your group as it is for other teams. Items may be missed if these are the only bits of data you track and analyze.
Let’s look through these.
1. Number of Bugs Found
This is the simplest and most straightforward metric you can look at: a simple total of all bugs entered for a project. You can break this down by time period (say, per week), per sprint, or per task assigned. This is simply how many issues your group entered into your bug DB. Is having a high bug count a good thing? Does this validate your group? Are the bugs you find having a positive impact on the overall development of the product?
You can then break this down by tester, by severity, or whatever other bug criteria you feel like applying. Generally, this is used to gauge how productive each tester is. Is Tester X finding way more bugs than Tester Y? Is this a consistent and ongoing item? Does this make Tester X better than Tester Y?
Tracking the number of issues you are entering into your DB is obviously important. There is some great data there. You can use this to chart how a product is functioning, where you need to focus, where your team is, and as historical data later on when you are gauging future projects. But is it necessarily a good indicator of performance?
I think it’s a yes and a no. There is a lot of context needed in order to draw any real conclusions. Raw numbers don’t tell a story. They give you an idea of the story, but they lack the ability to add the needed details that really show you what the group is about. For example:
Let’s say that we have a sprint that lasts 10 days, where the QA team is tasked with testing one specific area of a product (let’s leave the product type ambiguous). At the end of the sprint, we look at the performance of the group as a whole, as well as the performance of two of our key testers.
- The QA team logged a total of 149 bugs (initial estimate was for 185 bugs)
- Tester X logged a total of 31 bugs
- Tester Y logged a total of 13 bugs
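Reduced to the kind of raw figures a report might show (assuming, purely for illustration, an 8-hour test day), the sprint looks like this:

```python
# Sprint numbers from the example above. The 8-hour day is an
# assumption made only to turn totals into per-hour rates.
SPRINT_DAYS = 10
HOURS_PER_DAY = 8

team_total, estimate = 149, 185
tester_x, tester_y = 31, 13

# Percent shortfall against the initial estimate of 185 bugs.
shortfall_pct = round(100 * (estimate - team_total) / estimate, 1)

# Bug find rate per tester, in bugs per hour.
rate_x = tester_x / (SPRINT_DAYS * HOURS_PER_DAY)
rate_y = tester_y / (SPRINT_DAYS * HOURS_PER_DAY)

print(shortfall_pct, rate_x, rate_y)
```

Numbers like these are exactly what lands on a dashboard, stripped of any of the context discussed here.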
Raw numbers seem to indicate that Tester X is performing at a higher level than Tester Y. Looked at in a vacuum, it’s hard to ignore. That’s why context is so important. Is Tester X testing a less polished section? Is Tester Y being given specific tasks that could result in a lower bug count? Is Tester X recording superficial issues that are easy to find, while Tester Y is diving into “meatier” and more difficult to identify issues? Is Tester X putting in all of those bugs in a poorly written and difficult to understand manner, adding workload to either a supervisor that has to edit and vet the bugs or even to the developers themselves who have to follow up for more details?
Or look at the bug total. If the bug total for a sprint is lower than the estimate, what does that mean? Is the QA team slacking? Perhaps the initial estimate is off due to a change in design. Perhaps a feature was dropped or changed. Maybe specific requests from the development team took away some QA time. Perhaps there were unexpected showstopper bugs that prevented the QA team from looking into as many areas as was expected. Or maybe the development team just happened to bring its A game to the sprint, and there were just fewer issues to identify.
Context tells us so much more than simple numbers, which is why we have to be careful using the number of bugs found as a target KPI. This is data to be recorded and analyzed, absolutely, but it should never be the sole indicator of performance. Reducing contributions of either a group or an individual to a simple number doesn’t adequately reflect what they add to the project.
Also, we need to look at the overall value of what this data does provide. It can be of great service to a QA team, showing them trends in their issue tracking, where they may be lacking in attention, where they can focus more energy, or even how they can improve overall tester efforts. But it’s also a great stat for the development team and the Producer/Product Owner to look at. This type of data can help show where they may be falling short in estimates, which areas are turning out to be buggier than initially expected, how much time they should allot for regression in any given cycle, and how the development team is functioning overall, and thus identify multiple areas for potential improvement. The data is valuable to multiple groups, and depending on what development methodology your group uses, it may not be as readily available to all stakeholders as you might think. Remember to share this information, and remember that the raw data doesn’t tell the whole story. Tracking the context is just as important as tracking the numbers.
2. Issues entered and then flagged as Not a Bug by the Dev/Prod team
The terminology will change from group to group, but the idea is the same. Some of the bugs your testers enter will be deemed “not a bug”. Either the issue is intentional behavior, something that isn’t implemented the way you might expect, or simply a case of the tester not understanding how a particular feature works and/or what the expected result should be. This is a very dicey area for you and your team, for a few reasons.
Every issue that your group enters needs to be looked at and addressed by at least one member of a dev/prod team. That is potentially time wasted. If the issue is complex, it may require multiple people looking into it in order to realize that it is, in effect, a non-issue. That’s a lot of time wasted. This puts some pressure on the QA team to ensure that any bugs they enter into the DB are valid ones. This is, in and of itself, a good thing. Every QA team should be accountable for the issues they enter, both in terms of quality and validity.
This makes it an important statistic to track: how many issues, entered into the DB by the QA group as a whole and by individual testers, are returned to the team with the “not a bug” label. Over the lifecycle of a project, this can represent a lot of time lost, and the group should make a strong effort to reduce it. Tracked project to project, this data can expose potential trends. Great data to have.
The thing to remember, though, is that there are many reasons something can be labeled as “not a bug”. Poor documentation, feature changes, lack of communication, changes in personnel, and problems during regression can all lead to issues being labeled this way. Context once again becomes very important if you want to strive towards improving your group. The reality is that if this is an ongoing issue with a QA group, some examination needs to take place. The first instinct is to point fingers, generally at the opposite group (prod/dev points at QA, QA points at prod/dev). Obviously this accomplishes absolutely nothing. The key to remember here is that the goal of tracking this as a KPI is to improve. Everyone. Leadership from all groups needs to come together, see where the issue is, and make attempts to patch the hole. This isn’t about laying blame; this is about finding the issue, regardless of “where” it is, and helping one another fix it. If QA is making mistakes by not engaging properly and flooding the DB with non-issues, the production team can take the time to make sure the QA team is more involved in planning. If the documentation given to the QA team is lacking, the QA team can sit with the devs and offer them insight into what their needs are, and how accommodating them can help everyone save time. If the production team isn’t sharing changes in scope regularly, the dev team can step in and offer feedback on ways direction shifts can be communicated earlier.
This is an important metric to track. But again, context is key.
3. Issues entered and then flagged as Known Shippable by the Dev Team
This metric is most important after the fact. This gives you a great number to look at in the various post mortems each group is doing, especially tracked over multiple projects.
From the QA point of view, it presents an especially interesting challenge: Is this an issue we feel we should push back on?
Whenever a producer or product owner calls a bug Known Shippable, that individual is essentially admitting that, yes, it is a valid bug, but the team lacks either the time or the resources to address it. It is a flaw in the work that is deemed acceptable to release, possibly with a flag to fix the issue in a future update.
The QA team should be going through these with a fine-toothed comb. How serious is the issue really? How much does it detract from the enjoyment of the product? How likely are we to be called out on it in user reviews (depending on the product)? How comfortable are we letting this go out with that defect still in place?
Ultimately, the producer/product owner is the one accountable for these decisions. They own the timeline and the resources, and they have to make these calls. But the QA team owns the expertise. The QA group is closest to the issue, best understands its impact, and is best qualified to offer real context on it. QA may not have the ultimate accountability, but QA has the expertise, and should absolutely make their voice heard on these issues.
Every issue that gets flagged as Known Shippable needs to be examined. The responsibility here for QA is to know each issue, inside and out, and flag risk. To give the producer/product owner that added level of context that they may not have from looking at a DB. Occasionally some hard push back may be needed when there are issues that QA feels strongly about, and any organization that wants to succeed will listen. At the end of the day, the producer/product owner will make the call, but the advice of QA needs to be clearly established, risks need to be flagged, and if possible, a few alternate solutions should be discussed.
What makes this such an interesting metric to track is that it has little to no direct correlation to overall QA performance. These decisions are almost always driven by business concerns. The dev/prod teams acknowledge that these are legitimate issues (validating the work that QA has done), but it is because of resource limitations (time, human effort, budget…) that these issues get “waived”. The value in tracking this for QA is an internal one. How can QA better focus their testing? Can they better aim their efforts at identifying more critical bugs (A and B severity, for example) rather than lower-class bugs that may or may not be looked into, and may or may not even be worth regressing if they are claimed to be fixed? There are useful trends to be examined here, and the results could give the group a better look into how they go about testing a product.
4. Total number of issues addressed during a regression run
This is another pretty straightforward metric to look into: how many bugs the team addresses in a regression sweep. The complexities add up, though, when you look at the timeline as a whole and the amount of effort put into the regression.
Generally speaking, the dev team will flag items in the DB that are waiting to be regressed as “Claimed Fixed”, or something similar. They are saying that they have looked into the issue, made changes, and have rebuilt with the issue solved (in theory). QA teams tend to have different methodologies on how they tackle regression. Some use a rolling regression model (essentially having people dedicated to regressing issues on an ongoing basis) whilst others use a sweep method (essentially setting aside a dedicated period of time for people to do bug regression). The principle is the same, but for the moment I am moving on with the notion of the sweep method of handling issue checks.
The method for tracking regression is generally simple: number of bugs regressed in a time period. This can be further divided amongst the testers performing the regression. For example:
Let’s say we have a regression sweep scheduled, with 53 issues that need to be addressed. The QA team lead assigns two testers (out of a team of 4) to perform this regression. This decision was made based on an initial look at the build, and the potential for there to be some fairly serious issues still to be found. By the end of the regression sweep (a set period of time), the results are:
- Total number of issues addressed: 47
- Tester 1 addressed 31 issues
- Tester 2 addressed 16 issues
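In raw-data form, the sweep reduces to a few derived figures (the numbers are from the example above):

```python
# Regression-sweep numbers from the example above. Note that the raw
# data says nothing about WHY 6 issues went unaddressed; that context
# lives outside the DB.
assigned = 53
addressed = {"Tester 1": 31, "Tester 2": 16}

total_addressed = sum(addressed.values())
unreached = assigned - total_addressed

# Each tester's share of the addressed issues, as a whole percentage.
share = {t: round(100 * n / total_addressed) for t, n in addressed.items()}

print(total_addressed, unreached, share)
```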
Looking just at the numbers, with no context, somebody could draw the following conclusions: the team failed to address all issues, the lead may have made a poor choice in not assigning more test hours to the regression, and Tester 1 contributed a lot more to the regression than Tester 2.
Again, without context the conclusions reached may not match up with the facts.
Suppose that the 6 bugs that weren’t regressed are in an area of the product not currently testable in the latest build. Or maybe that particular area is so buggy at the moment that the results would not be deemed 100% accurate. Or perhaps the dev team asked, after the fact, that those issues not be tested, as another issue had been created. What on the surface may look like a failure on the part of the QA team could be anything but. Now, the team lead should absolutely be flagging issues like these and providing context at the time, so that everyone understands what the situation is. It’s just that when you look at pure data later on, these intricacies may be lost.
And the testers? Suppose the product is a game, and the bulk of the issues that Tester 2 is regressing occur at the very end of a fairly long game. This could require a fair amount of playthrough time just to get to those bugs. Or perhaps the area of the product the tester is investigating is buggy itself and the Test Lead asks the tester to report all issues whilst regressing. Or maybe the Test Lead is triaging the issues and doling them out to the individual whose skill set best matches them, with differences in complexity and test methodology included.
Just like with the number of total bugs regressed, individual statistics only tell part of the story. A lot of times these metrics are then used to gauge overall tester performance, which, to me, isn’t their best use. This data has value in showing how testers in general can better perform, not in pinpointing the inadequacies of individual testers. This is something the Test Lead should be looking into to gain a better understanding of where time savings can be made, how tasks can be better delegated, where strengths and weaknesses lie, and how to best provide a valuable service to the dev team. The data has value, but the value is tied to performance overall, as opposed to individualized grading.
Whilst regressing these issues, QA will usually give one of two “stamps” to each issue: Confirmed Fixed or Fix Failed (again, terminology may be different, but the idea is the same). These provide the dev team with the information they need: either what they attempted to fix is indeed fixed and they can now move on to something else, or the issue is not fixed at all, and they have to go back to the drawing board.
By and large, these metrics are less about QA performance, and are of more benefit to the prod/dev teams in showing their planning and results. The benefit for QA is more along the lines of better upfront planning, and communicating with prod/dev about regression trends. If, for example, you are finding most of your issues coming back as Fix Failed, then a discussion needs to occur to see where the ultimate problem lies. Sometimes it’s a database problem, sometimes a build delivery issue, sometimes just a communication issue, or it could be a performance issue with a member of the development team. Longer term, the value of this data lies in weighing what regression costs you in time against the benefits derived. Again, this is where a great series of post mortems can really change how you work moving forward.
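One simple way to watch that Fix Failed trend across sweeps is sketched below (the per-sweep figures, and the monotonic-worsening check, are made up for illustration):

```python
# Hypothetical per-sweep regression results as (confirmed_fixed, fix_failed).
sweeps = [(40, 5), (38, 9), (30, 17)]

def fix_failed_rate(fixed: int, failed: int) -> float:
    """Fraction of regressed issues that came back as Fix Failed."""
    return failed / (fixed + failed)

rates = [round(fix_failed_rate(f, ff), 2) for f, ff in sweeps]

# A steadily rising rate is the signal to open that conversation with
# prod/dev, whatever the root cause turns out to be.
worsening = all(a < b for a, b in zip(rates, rates[1:]))

print(rates, worsening)
```

The threshold for "time to talk" is a judgment call; the point is only that the trend, not any single sweep, is what carries the signal.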
I apologize that this article went on longer than I originally imagined, and rambled off on tangents along the way. It is a *very* broad topic, with wayyyy more detail and intricacy than I can really put to words here. The value of traditional QA KPIs and the overall use of metrics was something I really wanted to dig into deeper when I was working at EA, and never really had the opportunity to.
By and large, I think that the value of these metrics differs greatly on who and what your team is:
In a large or global test group, these metrics have a little more empirical value. You tend to have larger teams, often going through transitions, with a fairly high turnover rate and heaps of data. You may work in a different location than the prod/dev teams, and you may even be working different hours or in different timezones. The more traditional metrics carry more weight simply because it gets really difficult to be more granular. That being said, I still firmly believe that context is important. These items may be the KPIs that determine group performance as a whole, but they still need that context if you want them to have value beyond being something used for finger pointing. The more formal post mortem process is what’s going to give you that real context, and hopefully data that can be used to improve overall performance.
In a smaller team, usually in a smaller company, the data still has value, absolutely, but I’d argue against calling these metrics your KPIs. They don’t *really* offer you a true indicator of performance. They give you information, information that should be used in conjunction with your prod/dev groups to see how to work better, but they often indicate trends rather than serve as true indicators. For these smaller groups, I’d peg things like “release success”, “engagement”, “sprint success”, “project-to-project improvement”, and “group cohesion” as better indicators of performance. Are they harder to quantify? Absolutely. But I think you get much more benefit from them as a team overall (prod/dev/qa) than from getting overly granular on metrics.
And again, this is more of a rambling thought than a true conclusion. Over the years I’ve had some great conversations with some talented individuals on things like KPIs, metrics, regression, and best practices. There are a lot of great ideas, and some ideas that are way out there but could actually lead to excellence if properly applied. For me personally, though, I like a good dose of context and a side order of followup when it comes to looking at metrics. I’m just crazy that way.