The promise and perils of synthetic data

Anschutz Medical Campus

AnschutzMedicalCampus.com is an independent website not associated or affiliated with CU Anschutz Medical Campus, CU, or Fitzsimons innovation campus.

Image Credits: Hiretual

Is it possible for an AI to be trained just on data generated by another AI? It might sound like a harebrained idea. But it’s one that’s been around for quite some time, and as new, real data is increasingly hard to come by, it’s been gaining traction.

Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for the upcoming Orion.

But why does AI need data in the first place, and what kind of data does it need? And can this data really be replaced by synthetic data?

The importance of annotations

AI systems are statistical machines. Trained on a lot of examples, they learn the patterns in those examples to make predictions, like that “to whom” in an email typically precedes “it may concern.”
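To make the "statistical machine" idea concrete, here is a minimal sketch of a next-word predictor that simply counts which word follows which. It is far simpler than any production model, and the toy corpus and the function names `train_bigrams` and `predict_next` are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus standing in for "a lot of examples".
corpus = [
    "to whom it may concern",
    "to whom it may concern please find attached",
    "to whom should i send this",
]
model = train_bigrams(corpus)
print(predict_next(model, "whom"))  # the follower seen most often after "whom"
```

With this corpus, "it" follows "whom" more often than any other word, so that is what the sketch predicts. Real systems learn much richer patterns, but the principle of predicting from observed examples is the same.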

Annotations, usually text labeling the meaning or parts of the data these systems ingest, are a key piece in these examples. They serve as guideposts, “teaching” a model to distinguish among things, places, and ideas.

Consider a photo-classifying model shown lots of pictures of kitchens labeled with the word “kitchen.” As it trains, the model will begin to make associations between “kitchen” and general characteristics of kitchens (e.g. that they contain fridges and countertops). After training, given a photo of a kitchen that wasn’t included in the initial examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled “cow,” it would identify them as cows, which emphasizes the importance of good annotation.)
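The kitchen example can be sketched with a toy nearest-centroid classifier. This is only an illustration under loud assumptions: real photo classifiers learn from pixels, whereas here each "photo" is a hand-made triple of hypothetical feature scores (fridge, countertop, bed), and `train_centroids` and `classify` are made-up helper names.

```python
def train_centroids(examples):
    """Average the feature vectors seen for each label."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify(centroids, features):
    """Return the label whose average feature vector is closest to `features`."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], features))

# Hypothetical feature scores: (fridge, countertop, bed).
kitchens = [(1, 1, 0), (1, 0.8, 0), (0.9, 1, 0)]
bedrooms = [(0, 0, 1), (0.1, 0, 1), (0, 0.2, 0.9)]

# Same photos, two annotation sets: one correct, one with kitchens mislabeled "cow".
good_labels = [(f, "kitchen") for f in kitchens] + [(f, "bedroom") for f in bedrooms]
bad_labels = [(f, "cow") for f in kitchens] + [(f, "bedroom") for f in bedrooms]

unseen_kitchen = (0.8, 0.9, 0.1)  # not part of the training examples
print(classify(train_centroids(good_labels), unseen_kitchen))
print(classify(train_centroids(bad_labels), unseen_kitchen))
```

The features and the learning procedure are identical in both runs. Only the annotations differ, and the model's answer for the unseen kitchen follows the labels it was given: "kitchen" in the first run, "cow" in the second.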

The appetite for AI and the need to provide labeled data for its development have ballooned the market for annotation services. Dimension Market Research estimates that it’s worth $838.2 million today — and will be worth $10.34 billion in the next ten years. While there aren’t precise estimates of how many people engage in labeling work, a 2022 paper pegs the number […]
