The Most Important Work in AI Training Is Also the Most Overlooked
In the world of high-stakes AI, data is king. For example, when the COVID-19 pandemic hit in early 2020, the AI community had an opportunity to build tools that would help diagnose the virus and prevent its spread, and also improve medical AI in general. Using data from China and other countries that were among the first to be affected, healthcare tech companies built hundreds of tools, but very few of them worked and some were actually harmful. As in many high-stakes AI projects, the quality of the models’ output tracked the quality of the data they were trained on. In one case, a tool was fed a dataset of children’s chest scans as examples of non-Covid cases, and it learned to identify children rather than Covid cases.
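A failure like the chest-scan one can sometimes be caught before training by checking whether a metadata attribute almost fully determines the label. The following is a minimal sketch of such a check, not a description of what any of those tools did. The field names (`age_group`, `label`), the toy records, and the 0.95 threshold are all hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def label_purity_by_group(records, group_key, label_key):
    """For each group value, return the share of its records that carry
    the group's most common label (1.0 means the group has one label only)."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[group_key]][rec[label_key]] += 1
    return {
        group: max(labels.values()) / sum(labels.values())
        for group, labels in counts.items()
    }


def confounded_groups(records, group_key, label_key, threshold=0.95):
    """Return the groups whose label is (almost) fully determined by
    group membership. The threshold is an illustrative assumption."""
    purity = label_purity_by_group(records, group_key, label_key)
    return sorted(g for g, p in purity.items() if p >= threshold)


# Hypothetical toy metadata: every child scan is a non-Covid case,
# so a model could learn "child" instead of "non-Covid".
scans = [
    {"age_group": "child", "label": "non_covid"},
    {"age_group": "child", "label": "non_covid"},
    {"age_group": "child", "label": "non_covid"},
    {"age_group": "adult", "label": "covid"},
    {"age_group": "adult", "label": "covid"},
    {"age_group": "adult", "label": "non_covid"},
]

print(confounded_groups(scans, "age_group", "label"))  # ['child']
```

A flagged group is a prompt for a domain expert to look closer, not proof of a problem: very small groups will often look "pure" by chance, so a real check would also take group size into account.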
In 2021, The Alan Turing Institute, the UK’s national institute for data science and artificial intelligence, organized a series of workshops to address the challenges experienced by the data science and AI community during the pandemic: “[...] The single most consistent message across the workshops was the importance—and at times lack—of robust and timely data. Problems around data availability, access and standardization spanned the entire spectrum of data science activity during the pandemic. The message was clear: better data would enable a better response,” a report on the workshops said.
The Nitty-Gritty Grunt Work of AI
Data work is probably the least glamorous aspect of AI development, but it is one of the most important. This is especially true in high-stakes AI, where there is very little margin for error. Spotify’s predictive algorithm can probably get away with including one or two irrelevant suggestions, but the same can’t be said for algorithms with life-or-death stakes, like many healthcare and public policy tools. AI tools such as those used in judicial systems to recommend sentencing, or those used to predict tumor regrowth in cancer patients, cannot afford to be wrong. Poor data quality in those instances can have extreme and sometimes life-threatening consequences for people and communities.
Even outside of high-stakes AI, data quality directly affects the performance and effectiveness of models, since ML algorithms are only as good as the data they’ve been fed. As the saying goes, "garbage in, garbage out." Preparing, curating, and assembling quality training data for machine learning models takes a lot of time and effort, but because data preparation is often seen as non-technical work, the people doing it are often underpaid and overworked. OpenAI’s ChatGPT, for example, was built with datasets made by underpaid workers. This, in addition to the general undervaluing of data work, can lead to data cascades that have disastrous consequences in high-stakes models.
Bad Data, Worse Outcomes
Data cascades are extremely common: in one study, 92% of data professionals said that they had experienced at least one cascade, and 45% had experienced two or more cascades in a project. A Google Research paper, “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI, describes data cascades as “compounding events causing negative downstream effects from data issues that result in technical debt over time.” According to the paper, many data cascades start early in the lifecycle of an ML system, often at the stage of data definition and collection. And because there are no clear tools or metrics for detecting and measuring data cascades, small data errors can snowball into major problems for the model, with complex long-term effects. For example, environmental factors like rain or wind can cause image sensors in deployment to move, triggering cascades. Minuscule details like fingerprints, shadows, and dust on physical artifacts collected for data can also affect a model’s performance. According to the paper cited above, cascades can be triggered by:
Interacting with the physical world's brittleness
Inadequate application domain expertise
Conflicting reward systems
Poor cross-organizational documentation
High-stakes AI is the most sensitive to the consequences of unchecked data cascades, which are mostly triggered when conventional AI practices are applied to high-stakes models. Preparing, curating, and maintaining quality data is a major part of protecting high-stakes AI models from these cascades. Unfortunately, because data work is not prioritized in many projects, important things get overlooked. These include tools, resources, and even people, especially since high-stakes AI projects often require experts in the field that the model is built for. For example, doctors and researchers for healthcare models, and farmers for agricultural models, bring context to data that would very likely be missing otherwise.
In addition, if data quality and completeness are not prioritized, there is likely no training available for data collectors and other practitioners. Because data for high-stakes models is usually collected from scratch, this lack of training would most likely lead to bad data and, eventually, data cascades. With all this in mind, it is clear that data work in high-stakes AI is the most important aspect of building ML systems and should be given as much attention as other parts of the process, if not more. This applies to all aspects of data work, as well as to the people doing the work.