October 14, 2024

Post: The promise and perils of synthetic data

Anschutz Medical Campus

AnschutzMedicalCampus.com is an independent website not associated or affiliated with CU Anschutz Medical Campus, CU, or Fitzsimons innovation campus.

Is it possible for an AI to be trained just on data generated by another AI? It might sound like a harebrained idea. But it’s one that’s been around for quite some time, and as new, real data is increasingly hard to come by, it’s been gaining traction.
Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for the upcoming Orion.
But why does AI need data in the first place, and what kind of data does it need? And can this data really be replaced by synthetic data?

The importance of annotations

AI systems are statistical machines. Trained on a lot of examples, they learn the patterns in those examples to make predictions, like that “to whom” in an email typically precedes “it may concern.”
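To make the "statistical machine" idea concrete, here is a minimal sketch in Python that counts which word follows each pair of words and predicts the most frequent one. The three-sentence corpus and the function names `train_next_word` and `predict` are invented for illustration; real systems learn from vastly more text with far richer models.

```python
from collections import Counter, defaultdict

def train_next_word(sentences):
    """Count which word follows each pair of consecutive words."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            context = (words[i], words[i + 1])
            following[context][words[i + 2]] += 1
    return following

def predict(following, first, second):
    """Return the word seen most often after (first, second), or None."""
    counts = following.get((first, second))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy corpus, made up for this example.
emails = [
    "to whom it may concern",
    "to whom it may concern please find attached",
    "to whom should I send this",
]

model = train_next_word(emails)
print(predict(model, "to", "whom"))  # it
```

The prediction is nothing more than the most common continuation in the examples, which is why the quality and coverage of those examples matter so much.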
Annotations, usually text labeling the meaning or parts of the data these systems ingest, are a key piece in these examples. They serve as guideposts, “teaching” a model to distinguish among things, places, and ideas.
Consider a photo-classifying model shown lots of pictures of kitchens labeled with the word “kitchen.” As it trains, the model will begin to make associations between “kitchen” and general characteristics of kitchens (e.g. that they contain fridges and countertops). After training, given a photo of a kitchen that wasn’t included in the initial examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled “cow,” it would identify them as cows, which emphasizes the importance of good annotation.)
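The kitchen example can be sketched as a toy classifier. This is a deliberately simplified illustration, not how production image models work: it assumes each photo has already been reduced to a set of object names (a hypothetical preprocessing step), and the data and the function names `train` and `classify` are invented for the example.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count how often each feature appears alongside each label."""
    counts = defaultdict(Counter)
    for features, label in examples:
        for feature in features:
            counts[feature][label] += 1
    return counts

def classify(counts, features):
    """Let each known feature vote for the labels it was seen with."""
    votes = Counter()
    for feature in features:
        votes.update(counts.get(feature, {}))
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical annotated photos: (objects in the photo, human-written label).
photos = [
    ({"fridge", "countertop", "sink"}, "kitchen"),
    ({"fridge", "stove"}, "kitchen"),
    ({"bed", "lamp"}, "bedroom"),
    ({"bed", "wardrobe"}, "bedroom"),
]

# A photo that was not in the training examples.
new_photo = {"countertop", "stove", "kettle"}
print(classify(train(photos), new_photo))  # kitchen

# Same photos, but every kitchen was annotated "cow" by mistake.
mislabeled = [(f, "cow" if label == "kitchen" else label) for f, label in photos]
print(classify(train(mislabeled), new_photo))  # cow
```

The model has no notion of what a kitchen is. It only reproduces whatever association the annotations encode, which is why a bad label propagates straight into the predictions.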
The appetite for AI and the need to provide labeled data for its development have ballooned the market for annotation services. Dimension Market Research estimates that it’s worth $838.2 million today — and will be worth $10.34 billion in the next ten years. While there aren’t precise estimates of how many people engage in labeling work, a 2022 paper pegs the number […]
