Garbage in, garbage out. No saying is truer in computer science, and it applies especially to artificial intelligence. Machine learning algorithms depend heavily on accurate, clean, and well-labeled training data to produce accurate results. If you train your machine learning models with garbage, it’s no surprise you’ll get garbage results. It’s for this reason that the vast majority of the time in AI projects is spent on the data collection, cleaning, preparation, and labeling phases.
According to a recent report from AI research and advisory firm Cognilytica, over 80% of the time in AI projects is spent dealing with and wrangling data. Even more important, and perhaps surprising, is how human-intensive much of this data preparation work is. For supervised forms of machine learning to work, especially the multi-layered deep learning neural network approaches, they must be fed large volumes of correct example data that is appropriately annotated, or “labeled”, with the desired output. For example, if you’re trying to get your machine learning algorithm to correctly identify cats in images, you need to feed that algorithm thousands of images of cats, appropriately labeled as cats, with none of the images containing extraneous or incorrect data that would throw the algorithm off as you build the model.
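To make the cat example concrete, here is a minimal sketch of the kind of check that has to run before any training happens. Everything in it is an assumption for illustration: the file names, the two-label set, and the `clean_labeled_examples` helper are hypothetical, and a real pipeline would also need humans to confirm that each image actually shows what its label claims.

```python
from collections import Counter

# Hypothetical label set for the cat example; a real project defines its own.
ALLOWED_LABELS = {"cat", "not_cat"}


def clean_labeled_examples(examples):
    """Keep (image_name, label) pairs that have a name and a recognized label.

    Labels are normalized (trimmed, lowercased) and repeated image names
    are dropped, keeping the first occurrence.
    """
    seen = set()
    kept = []
    for image_name, label in examples:
        normalized = label.strip().lower() if isinstance(label, str) else None
        if not image_name or normalized not in ALLOWED_LABELS:
            continue  # missing file name or unrecognized label: garbage in
        if image_name in seen:
            continue  # duplicate record for the same image
        seen.add(image_name)
        kept.append((image_name, normalized))
    return kept


# Made-up records showing common problems in hand-labeled data.
raw_examples = [
    ("img_001.jpg", "cat"),
    ("img_002.jpg", "Cat "),     # inconsistent casing and whitespace
    ("img_003.jpg", "dgo"),      # typo, not a recognized label
    ("img_001.jpg", "cat"),      # duplicate
    ("", "cat"),                 # missing file name
    ("img_004.jpg", "not_cat"),
]

cleaned = clean_labeled_examples(raw_examples)
label_counts = Counter(label for _, label in cleaned)
print(cleaned)
print(label_counts)
```

Of the six raw records, only three survive. A check like this catches only mechanical problems; deciding whether an image labeled “cat” really contains a cat is the human-intensive part.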