By John P. Desmond, AI Trends Editor
In the fairy tale, Goldilocks strived to get it just right, and AI model builders strive to do the same thing when it comes to specifying their models. Underspecification is when the model you build performs well on your data, but so do many other, different models, and the training process cannot tell them apart, which can lead to your model decaying over time.
The discussion of underspecification kicked off last fall when Google researchers published a paper on the subject, “Underspecification Presents Challenges for Credibility in Modern Machine Learning.”
“ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures,” stated the paper, written by a group of scientists led by Alexander D’Amour, a research scientist with Google Brain in Cambridge, Mass.
In an interpretation of the paper, Matt Brems, Senior Manager, Data Science Product & Strategy at Roboflow, a company whose software helps developers tag objects in images and video, describes how a developer builds a model with a good mean average precision (mAP) score for accuracy. “Then you deploy your model, and your model does worse than expected,” Brems states in a blog entry. What happened?
The developer followed all the right steps: split the data into training, validation, and testing sets; thoroughly cleaned the data; and made sure the engineering steps were rooted in subject-matter expertise.
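For readers who want the mechanics, the split Brems describes takes only a few lines in a library such as scikit-learn. Here is a minimal sketch on synthetic stand-in data, not code from the Roboflow post:

```python
# A minimal sketch of the train/validation/test split, using scikit-learn
# on synthetic stand-in data; any feature matrix X and label vector y
# would be handled the same way.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20 percent as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Split the remainder 75/25 into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
```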
The problem, very likely, was underspecification, meaning that many different models can achieve the same performance on the selected data set. “This is a problem with all sorts of machine learning models, including in computer vision, natural language processing, and more,” the author states.
What is the upshot? As Brems put it, “The computer doesn’t know which model is a better reflection of reality—just which model happened to be better on this specific set of data.”
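The effect is easy to reproduce. In the small experiment below, an illustration on synthetic data with scikit-learn rather than code from Brems’s post, two model fits that differ only in their random seed score almost identically on the test set while disagreeing on individual inputs:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Two pipelines that differ only in the seed used to fit the model.
model_a = RandomForestClassifier(random_state=1).fit(X_train, y_train)
model_b = RandomForestClassifier(random_state=2).fit(X_train, y_train)

# Test-set accuracy is typically near-identical...
print(model_a.score(X_test, y_test), model_b.score(X_test, y_test))

# ...yet the two models disagree on some individual inputs, a sign the
# data alone did not pin down a single model.
disagree = np.mean(model_a.predict(X_test) != model_b.predict(X_test))
print(f"fraction of test inputs where the models disagree: {disagree:.3f}")
```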
Find the Line of Best Fit, or Detect the Object?
Middle school math students are sometimes asked to find the line of best fit in a scatter plot on a two-dimensional grid. The AI engineer faces something closer to an object detection problem, where the choices that go into the model are both intentional and unintentional. Intentional choices might include the types of image preprocessing the model-builder performs, or the amount of data he or she collects. An unintentional choice might be the random seed selected when fitting the model or algorithm to the data. (A random seed is the number used to initialize a pseudorandom number generator.)
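To illustrate the parenthetical, seeding a generator fixes every “random” draw it produces. A NumPy example; any pseudorandom generator behaves the same way:

```python
import numpy as np

# The same seed puts the generator in the same starting state, so the
# "random" draws repeat exactly from run to run.
print(np.random.default_rng(42).random(3))
print(np.random.default_rng(42).random(3))  # identical to the line above
print(np.random.default_rng(7).random(3))   # different seed, different draws
```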
“At the end of the day, a model is a simplification of reality,” Brems stated. “Models are generally supposed to reflect or mimic the real world. But the way that models are fit, there’s no guarantee that your computer selects a model that reflects the logic or science of your specific application.”
The result of this can be, “When you deploy your very accurate, high-performing-on-the-test-set model, there’s a good chance your model immediately starts performing poorly in the real world, costing you time, money, and frustration,” Brems stated.
Among his suggestions for AI modelers: “Draw your testing data from somewhere other than the training distribution—ideally mirroring your deployment environment;” use stress tests to detect underspecification; and “make sure your machine learning pipeline is reproducible.”
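The reproducibility advice usually starts with pinning random seeds across every library in the pipeline. A minimal sketch of what that might look like in Python follows; the helper name and its scope are this article’s illustration, not Brems’s code:

```python
import os
import random

import numpy as np

def set_global_seeds(seed: int) -> None:
    """Pin the common sources of randomness so reruns give the same result."""
    # Affects child processes; set before launch to cover this interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)     # Python's built-in generator
    np.random.seed(seed)  # NumPy's legacy global generator
    # A deep learning framework needs its own seed as well,
    # e.g. torch.manual_seed(seed) for PyTorch.

set_global_seeds(42)
```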
Sees Challenge to Current Approach to Machine Learning Models
Similar advice was offered on the blog of Section, a company offering edge-as-a-service to help developers run workloads on a distributed edge. Referring to the paper from the Google researchers, the entry’s author, Collins Ayuya, a master’s student in computer science, stated, “The way we approach the training of machine learning models has been challenged.”
He explains a faulty approach this way: “Underspecification implies that the training phase of a model can produce a good model. It can also produce a flawed model, and it would not tell the difference. As a result, we wouldn’t either.”
Why is this a problem? “Underspecification threatens the credibility of models. The reliability of the process used to train models to perform in the real world as they do in training has been called into question,” Ayuya states.
He also had suggestions for addressing the issue, briefly summarized here: limit the model’s complexity; understand the proposed real-world application of a model (“Consulting domain experts could help do this”); and conduct three types of stress tests: stratified performance evaluations, shifted performance evaluations, and “contrastive” evaluations. The latter play a part in explainable AI, Ayuya noted. A sketch of the first two follows.
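The sketch below makes the first two stress tests concrete on synthetic data with scikit-learn; the grouping attribute and the noise shift are stand-ins for real deployment slices and real distribution shifts, not Ayuya’s code. A contrastive evaluation would instead perturb individual examples and check whether the prediction changes for a defensible reason.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Stratified evaluation: score each subgroup separately rather than
# reporting one overall number. The binary "group" here is synthetic;
# in practice it would be a real slice of the deployment population.
group = (X_test[:, 0] > 0).astype(int)
for g in (0, 1):
    mask = group == g
    print(f"group {g} accuracy: {model.score(X_test[mask], y_test[mask]):.3f}")

# Shifted evaluation: perturb the inputs toward deployment conditions.
# Additive noise stands in for a real distribution shift.
X_shifted = X_test + np.random.default_rng(0).normal(scale=0.5, size=X_test.shape)
print(f"shifted accuracy: {model.score(X_shifted, y_test):.3f}")
```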
An article entitled “How to Write the Specification for Your AI Project” from the blog of Apro Software compares an “ordinary” software development project to an AI project. While there are many similarities, “We would like to draw your attention to the most important differences,” stated the author of the post, Konstantin Voicehovsky, a technical manager at Apro, a software development service provider based in the Netherlands and Belarus.
Among the differences: “Data is an essential part of AI projects, and you have to add some information about it in the specification;” and, “AI projects are very similar to scientific projects: they both are based on hypotheses.”
He offered an example of an AI system to classify paintings by the style of art: classicism, pop art, Cubism, and so on. If the system is trained on black-and-white images, it is unlikely to work with color images. “This issue needs just one line of code to be fixed,” Voicehovsky stated. “But a lot of projects cannot be changed so simply.”
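Voicehovsky does not show the line itself; with an imaging library such as Pillow, it could plausibly look like this (“painting.jpg” is a placeholder path):

```python
from PIL import Image

# Converting an incoming color image to single-channel grayscale
# ("L" mode) matches the black-and-white distribution the model
# was trained on. "painting.jpg" is a placeholder path.
image = Image.open("painting.jpg").convert("L")
```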
Scientific projects and AI projects have much in common since both are based on hypotheses, he suggested. “AI engineers play the role of scientists who must prove the project hypothesis,” he stated.
Drawing the comparison once more, he stated, “For ordinary software, goals are always (or almost always) achievable and straightforward. No place for hypothetical stuff!”
Read the source articles and information in the paper from Google researchers, “Underspecification Presents Challenges for Credibility in Modern Machine Learning,” in a blog post from Roboflow, in a blog post from Section.io, and in a blog post from Apro Software.