Everyone who has worked on a software project knows that you don’t simply push code and content out to customers, employees, or stakeholders without first testing it to make sure it isn’t broken or dead on delivery. Quality Assurance (QA) is so central to technology and business delivery that it is an essential component of every development methodology. You build. You test. You deploy. And if you’re doing this in an agile fashion, you do it in small, iterative chunks so that you can respond to the continuously changing needs of the customer. Surely AI projects are no different. There are iterative design, development, testing, and delivery phases, as we’ve discussed in our previous content on AI methodologies.
In our previous newsletter we talked about how AI operationalization is different from traditional deployment and putting things into “production”. But if AI is different in development and different in deployment, then it follows that AI projects are unlike traditional projects with regard to QA as well. Simply put, you don’t QA AI projects the way you QA other projects, because what we test, how we test, and when we test are all radically different in AI projects.
What are we testing and when? Training and Inference Phase QA
Those experienced with AI algorithms and model training know that testing is a core element of making AI projects work. You don’t simply develop an AI algorithm, throw training data at it, and call it a day. You have to verify that the trained model does a good enough job of classifying or regressing data, generalizing sufficiently without overfitting or underfitting. This is done with validation techniques, setting aside a portion of the labeled data for use during the validation phase. In essence, this is a form of QA testing in which you make sure that the algorithm, the data, the hyperparameter configuration, and the associated metadata all work together to provide the predictive results you’re looking for. If you get it wrong in the validation phase, you’re supposed to go back, change the hyperparameters, and rebuild the model, perhaps with better training data if you have it. After this is done, you use test data (a separate, held-out portion of the same well-labeled data set) to verify that the model indeed works as it is supposed to. This is all testing, but it all happens during the training phase of the AI project, before the AI model is put into operation.
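As a rough illustration of the split described above, here is a minimal sketch using scikit-learn. The synthetic data set, the 60/20/20 proportions, and the choice of logistic regression are all assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled data set (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 20% as the test set, then split the remainder 75/25 into
# training and validation sets (60/20/20 overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A large gap between training and validation accuracy hints at overfitting;
# low accuracy on both hints at underfitting.
train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"train={train_acc:.3f} validation={val_acc:.3f}")

# The test set is scored only once, after hyperparameters are settled.
test_acc = model.score(X_test, y_test)
print(f"test={test_acc:.3f}")
```

The validation score is what you look at while adjusting hyperparameters. The test score is kept aside as a final check so that it is not influenced by those adjustments.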
In fact, even in the training phase we’re testing a few different things. First, we need to make sure the AI algorithm itself works. There’s no sense in tweaking hyperparameters and training the model if the algorithm is implemented wrong. In all honesty, there’s little excuse for a poorly implemented algorithm, because most of these algorithms are already baked into the various AI libraries. If you need K-Means Clustering, different flavors of neural networks, Support Vector Machines, or K-Nearest Neighbors, you can simply call the library function in Python’s scikit-learn or whatever your tool of choice is, and it should work. The math behind each algorithm is the same wherever it is implemented, so you should not be coding these algorithms from scratch unless you have a really good reason to do so. And if you’re not coding them from scratch, there’s nothing for you to test: assume that the library implementations have already passed their tests. Therefore, in an AI project, QA will never be focused on the AI algorithm.
This leaves two things to be tested in the training phase for the AI model itself: the training data and the hyperparameter configuration data. In the latter case, we already addressed QA of hyperparameter settings through the use of validation methods, which include K-fold cross-validation and other approaches. Long story short, if you are doing any AI model training at all, then you should know how to do validation. This will help determine whether your hyperparameter settings are correct. Knock another activity off the QA task list.
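For readers who want to see what K-fold cross-validation of hyperparameter settings can look like, here is a small sketch with scikit-learn’s `GridSearchCV`. The synthetic data, the Support Vector Machine, and the values in `param_grid` are hypothetical choices for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Synthetic stand-in for labeled training data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical search space; real projects choose their own values.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

# 5-fold cross-validation: each candidate setting is trained on four folds
# and scored on the remaining fold, five times over.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Each of the six candidate configurations gets an averaged score across the folds, which is a fairer comparison than a single lucky or unlucky split.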
All that remains, then, for QA of the AI model is testing the data itself. But what does that mean? It means testing not just data quality, but also completeness. Does the training data adequately represent the reality you’re trying to generalize? Have you inadvertently included any informational or human-induced bias in your training data? Are you skipping over things that work in training but will fail during inference because the real-world data is more complex? QA for the AI model here means making sure that the training data includes a representative sample of the real world and eliminates as much human bias as possible.
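One simple, partial way to check representativeness is to compare how often each label appears in the training data against a reference sample drawn from the real world. The sketch below assumes you have such a reference sample; the function names, the example labels, and the 5% tolerance are hypothetical.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the total as a dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def underrepresented(train_labels, reference_labels, tolerance=0.05):
    """List labels whose training share falls short of the reference share
    by more than `tolerance` (absolute difference)."""
    train = label_shares(train_labels)
    ref = label_shares(reference_labels)
    return sorted(
        label for label, share in ref.items()
        if share - train.get(label, 0.0) > tolerance
    )

# Made-up label lists for illustration only.
train_labels = ["cat"] * 80 + ["dog"] * 20
reference_labels = ["cat"] * 50 + ["dog"] * 40 + ["bird"] * 10

gaps = underrepresented(train_labels, reference_labels)
print(gaps)
```

A check like this only catches gaps in label balance. Bias hidden inside the features, or in how the labels were assigned, needs its own review.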
Outside of the AI model itself, the other parts of the AI system that need testing are external to the model. You need to test the code that puts the AI model into production, the operationalization component of the AI system. This testing can happen before the AI model is put into production, but then you’re not actually testing the AI model. Instead, you’re testing the systems that use the model. If the AI model itself is failing during this testing, the fault lies not in the code that uses the model but in the training data or the configuration somewhere. You should have picked that up when you were testing the training data and doing validation as discussed above.
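To make the distinction concrete, here is a sketch of testing the code around a model rather than the model itself. `StubModel` and `score_request` are hypothetical names invented for this example. The stub always returns the same answer, so any failure points at the wrapper code and not at training.

```python
class StubModel:
    """Stands in for a trained model; always predicts the same label."""

    def predict(self, rows):
        return ["approved" for _ in rows]

def score_request(model, payload, expected_fields=("age", "income")):
    """Hypothetical serving wrapper: validate input, then call the model."""
    missing = [f for f in expected_fields if f not in payload]
    if missing:
        return {"ok": False, "error": f"missing fields: {', '.join(missing)}"}
    row = [payload[f] for f in expected_fields]
    return {"ok": True, "prediction": model.predict([row])[0]}

# A well-formed request reaches the model; a malformed one is rejected first.
good = score_request(StubModel(), {"age": 41, "income": 52000})
bad = score_request(StubModel(), {"age": 41})
print(good)
print(bad)
```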
AI QA Means Testing in Production
If you’ve followed along with what’s written above, then you know that a properly validated, well-generalizing system that uses representative training data and algorithms from an already-tested and proven source should produce the expected results. But what happens when you don’t get those expected results? Reality is obviously messy. Things happen in the real world that don’t happen in your test environment. So what does that mean from a QA perspective? We did everything we were supposed to do in the training phase and our AI model met expectations, but it’s failing in the “inference” phase, when the AI model is operationalized. This means we need a QA approach that deals with AI models in production.
Problems that arise with AI models in the inference phase are almost always issues of data. We know the algorithm works. We know that our training data and hyperparameters were configured to the best of our ability. That means that when AI models are failing, we have data problems. Is the input data bad? If the problem is bad data, fix it. Is the AI model not generalizing well? Is there some nuance of the data that needs to be added to the training data? If the answer to either of these last two questions is yes, we need to go through a whole new cycle of developing an AI model, with new training data and hyperparameter configurations, to reach the right level of fitting to that data. Regardless of the issue, organizations that operationalize AI models need a solid approach by which they can keep close tabs on how the AI models are performing and maintain version control over which ones are in operation.
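Keeping tabs on a model in operation can start very simply. The sketch below tracks rolling accuracy for one model version, assuming that the true outcome for each prediction eventually becomes available. The class name, the version string, the window size, and the accuracy threshold are all hypothetical.

```python
from collections import deque

class ModelMonitor:
    """Tracks rolling accuracy for one deployed model version."""

    def __init__(self, version, window=100, min_accuracy=0.9):
        self.version = version
        self.min_accuracy = min_accuracy
        # Keeps only the most recent `window` outcomes.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether a prediction matched the eventual true outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """True when rolling accuracy drops below the configured threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy

# Made-up stream of (predicted, actual) pairs: 7 correct, 3 wrong.
monitor = ModelMonitor(version="v3", window=10, min_accuracy=0.8)
for predicted, actual in [(1, 1)] * 7 + [(1, 0)] * 3:
    monitor.record(predicted, actual)

print(monitor.version, monitor.accuracy(), monitor.needs_review())
```

Tying each monitor to a version label makes it possible to say which model was in operation when performance slipped, which is the starting point for deciding whether to retrain.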
The key to making AI work is not only to use a methodology that’s data-centric and agile, but also to recognize that AI projects are not the same as your typical development project. AI projects are unique in that they revolve around data. Data is the one thing in testing that is guaranteed to continuously grow and change. As such, you need to treat AI projects as continuously growing and changing too. This should give you a new perspective on QA in the context of AI.