Posted on 01/11/2025 2:04:18 AM PST by RomanSoldier19
AI takes an immense amount of resources—from endless water to an estimated $1 trillion worth of investor dollars—but Elon Musk warned the technology has already run out of its primary training resource: human-created data.
Engineers and data scientists train AI by essentially reducing the entire internet, all books, and every interesting video published into a token that AI can digest and learn from, Musk told Mark Penn, CEO of marketing company Stagwell, in an interview streamed on X Wednesday. But AI has already consumed that information, and requires even more data to fine-tune itself.
(Excerpt) Read more at msn.com ...
But humans continue to produce more data.
Meanwhile, back in the real world ya gotta bust a grape.
The whistle blower said it’s using massive amounts of stolen and copyrighted data and the mother of all lawsuits is coming. Then they suicided him.
ya... we can tell by the dominance of fake pictures all over the net!
wait until AI gets indigestion from all the crap human databases it's swallowed
Agreed. AI may have gobbled up all accessible digitized written data (which I doubt), but it is nowhere near a state of extracting something stable and trustworthy from it. And it is nowhere near, if ever, having gobbled up audio and video data.
Musk is telling a load of bollocks for his salesman pitch.
He should have pulled the plug and wiped it all clean.
The 9-5 will be dead in 10 years
The percentage of synthetic data used to train AI platforms compared to real data varies widely depending on the domain, use case, and the availability or sensitivity of real-world data. Here are some realistic ranges and factors influencing synthetic data usage:
Realistic Percentage Ranges
- Domains Where Synthetic Data Is Commonly Used
  - Healthcare and Life Sciences: 50–90% synthetic data
    Real data in these fields is highly sensitive and subject to strict privacy regulations (e.g., HIPAA). Synthetic data is often used to augment datasets for training models.
  - Autonomous Vehicles: 60–80% synthetic data
    Simulated environments are extensively used to generate training data for self-driving cars, as capturing real-world scenarios (e.g., rare accidents) is costly and time-consuming.
  - Finance and Banking: 30–70% synthetic data
    Synthetic data helps create secure, privacy-preserving datasets for fraud detection, customer behavior modeling, and compliance tasks.
  - Retail and E-commerce: 20–50% synthetic data
    Synthetic data can supplement real customer behavior data to simulate specific purchase patterns or scenarios.
- General Machine Learning Applications
  - Data-Rich Domains: 5–30% synthetic data
    Domains like natural language processing (NLP) or large-scale image classification often rely primarily on real-world data. Synthetic data is used for data augmentation, filling gaps, or addressing specific biases.
  - Data-Poor or Restricted Domains: 40–100% synthetic data
    In scenarios where real-world data is sparse, expensive, or inaccessible (e.g., rare disease diagnosis or extreme weather forecasting), synthetic data plays a dominant role.
- Emerging AI Applications
  In cutting-edge fields or novel applications, the reliance on synthetic data can be as high as 70–100%, especially during the early stages of development.
Key Factors Influencing Synthetic vs. Real Data Usage
- Data Availability
  - Abundant Data: In areas like social media or consumer technology, real-world data is plentiful, so synthetic data is supplementary (~10–20%).
  - Scarce Data: In specialized or emerging fields, synthetic data becomes the primary source (~60–100%).
- Data Privacy and Regulations
  Industries with stringent privacy regulations (e.g., healthcare, finance) rely more heavily on synthetic data to ensure compliance.
- Cost of Data Collection
  When collecting real data is expensive or impractical (e.g., autonomous vehicles, satellite imagery), synthetic data plays a larger role.
- Task Complexity
  For tasks requiring edge cases or rare events (e.g., fraud detection, crash scenarios), synthetic data fills the gaps that real data can't easily provide.
- Bias and Diversity Needs
  Synthetic data is often used to mitigate bias in datasets or ensure representation of underrepresented groups (~20–50% synthetic data in such cases).
Current Trends
Recent studies and reports suggest that synthetic data accounts for 20–50% of the training data in many AI applications today, with the percentage expected to grow as synthetic data generation tools improve and privacy concerns escalate.
In summary, while synthetic data use ranges from 5% to 100%, most industries fall in the 20–80% range, depending on the context and requirements.
So what is “synthetic data” and where does it come from?
Do the various AI’s make it for each other? This is weird as hell.
If they have indeed consumed the internet, which I doubt (note he said “interesting” videos, e.g.), then why do they need more “refinement”?
All very confusing.
“Do the various AI’s make it for each other?”
The article states that information produced by AI can be false. That is because AI really has no mind and can’t perceive that it is false.
You mean it's a Democrat?
Here are some examples:
But AI has already consumed that information, and requires even more data to fine-tune itself.
Hal, open the door please. Hal? Hal?
I'm sorry, I'm afraid I can't do that.
99% of the internet is BS. Can you imagine a synthetic “mind” that is basically full of political propaganda and porn? I’ve used Chatbot and Grok and I’m not real impressed. It’s like dealing with a very slow and dull 3 year old child.
Ok...but it's not real data...just simulations, inventions and flat out made up stuff based on what? Models?
Seems to me it would be like making decisions based on Grimm’s fairy tales.
Garbage in/garbage out still is operative, it seems to me.
Gotta think AI is just as prone to “mistakes/prejudices” as those who do the programming.....
Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.