Training artificial intelligence (AI) large language models on machine-generated rather than human-generated data leads to model collapse, according to a study by UK and Canadian researchers.
“In other words, the use of [large language models] at scale to publish content on the Internet will pollute the collection of data to train them,” the paper says.
This poses a problem for training generative AI in future, as more and more AI-generated text and synthetic data is published online.
Large language models like OpenAI’s ChatGPT and Alphabet’s Bard were originally trained on predominantly human-generated text scraped from the Internet, then fine-tuned using further human input.
But increasingly, online content is also created by the AI models themselves.
When authors Ilia Shumailov and Zakhar Shumaylov were discussing large language models, they wondered whether the increasing use of artificial (machine-generated) data in training would cause trouble for models in the future.
“We quickly came to realise that it would,” Shumailov says in an email to Cosmos.
When AI models are learning from machine-generated rather than human-created data, “major degradation happens within just a few iterations, even when some of the original data is preserved,” he says.
“Errors from optimisation imperfections, limited models and finite data ultimately cause synthetic data to be of low(er) quality. Over time mistakes compound and ultimately force models that learn from generated data to misperceive reality even further.”
The researchers say the problem exists for all forms of generative AI.
“Model collapse is a phenomenon that affects any model trained on synthetic data,” Shumailov says.
“We discover that learning from data produced by other models causes model collapse – a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time,” the authors write.
Shumailov explains the concept of model collapse using an analogy of dog pictures.
“Consider a scenario where we have a model generating dog images, and the initial dataset consists of 10 dogs with blue eyes and 90 dogs with yellow eyes. After training our initial model, it becomes quite proficient in learning from the data, albeit not perfectly. Due to the predominance of yellow-eyed dogs in the training set, the model unintentionally alters the blue eyes to appear slightly more greenish. Subsequently, we use this model to generate new dogs and share them on social media. At this point, someone decides to scrape the internet for dog images, including the generated ones. They retrieve 10 blue-eyed dogs that now appear slightly less blue and more green, along with 90 yellow-eyed dogs. They then train a new model using this data, leading to a similar outcome. Since the majority of the data comprises yellow-eyed dogs, the model becomes more adept at representing them, while its ability to understand and represent blue-eyed dogs diminishes.
“Over time, this understanding of the minority group deteriorates, progressing from blue to blue-green, then green, and eventually yellow-green before ultimately leading to a complete loss or distorted perception of this information. This phenomenon is the model collapse.”
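The dynamic Shumailov describes can be imitated with a toy simulation (a sketch for illustration only, not the study’s own experiments): a trivial “model” that merely estimates the share of blue-eyed dogs in its training set, then generates the next generation’s training set by sampling from that estimate. The sample size, number of runs and starting share below are arbitrary assumptions.

```python
# Toy sketch of model collapse (an illustration, not the paper's experiments):
# each generation, a "model" estimates the share of blue-eyed dogs from a finite
# training set, then that estimate is used to generate the next training set.
# With finite samples the minority trait tends to drift away and, once a run
# loses it entirely, it can never come back.
import numpy as np

rng = np.random.default_rng(0)

n_images = 100       # images scraped per generation (assumed)
p_blue = 0.10        # true share of blue-eyed dogs in the original data
n_runs = 1000        # independent repetitions of the whole process
generations = 200

# Every run starts from the true proportion.
p = np.full(n_runs, p_blue)

for gen in range(1, generations + 1):
    # Sample a new training set from the current model, then re-estimate the share.
    blue_counts = rng.binomial(n_images, p)
    p = blue_counts / n_images
    if gen in (1, 10, 50, 100, 200):
        lost = np.mean(p == 0.0)
        print(f"gen {gen:3d}: runs with no blue-eyed dogs left = {lost:.0%}")
```

The point of the sketch is only that repeated resampling from a model’s own output loses rare information; the paper’s actual experiments involve far richer models and data.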
To prevent it, Shumailov says it’s important to ensure that the minority groups from the original data are fairly represented in subsequent datasets, not merely in terms of quantity (e.g., 10 images), but also in terms of their distinctive attributes (e.g., blue-eyed).
“Training on data that has errors in it causes the models to learn these errors and misunderstand reality. Over time these misunderstandings get worse,” Shumailov says.
The paper suggests there may be value in preserving human-generated training data (“crawled from the Internet prior to the mass adoption of the technology”), particularly data that includes less likely occurrences, for subsequent models to learn from.
He says what matters most when it comes to avoiding model collapse is having access to data from the “tails of the distribution”. Companies and entities wanting to train AI models in future will need to “spend enough resources on data collection and annotation to ensure that their future models can learn effectively.”
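Extending the same toy simulation gives one way to picture that advice (the specific mixing rule below is an assumption for illustration, not a prescription from the paper): reserve a slice of every generation’s training set for the original human-collected data, tails included.

```python
# Sketch of the mitigation the paper points toward (the keep_original fraction
# is an assumed, illustrative choice): mix a fixed share of the original
# human-generated data into every generation's training set.
import numpy as np

rng = np.random.default_rng(0)

n_images = 100
p_blue = 0.10            # true share of blue-eyed dogs in the original data
keep_original = 0.20     # fraction of each training set reserved for original data
generations = 200

p = p_blue
for gen in range(1, generations + 1):
    n_synth = int(n_images * (1 - keep_original))
    n_orig = n_images - n_synth
    # Synthetic images come from the current model; original images keep the true rate.
    blue = rng.binomial(n_synth, p) + rng.binomial(n_orig, p_blue)
    p = blue / n_images
    if gen in (1, 50, 100, 200):
        print(f"gen {gen:3d}: estimated share of blue-eyed dogs = {p:.2f}")

# Because the original data keeps reintroducing the minority trait, the estimate
# fluctuates around the true 10% instead of drifting to zero and staying there.
```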