The Cloudcast

Synthetic Data for AI


Kalyan Veeramachaneni (@kveeramac, CEO/Founder @DataCebo) discusses the generation and value proposition of synthetic data for GenAI.

SHOW: 813

CLOUD NEWS OF THE WEEK -
http://bit.ly/cloudcast-cnotw

NEW TO CLOUD? CHECK OUT OUR OTHER PODCAST -
"CLOUDCAST BASICS"

SHOW NOTES:

  • DataCebo (homepage)
  • Synthetic Data Vault - SDV
  • TechCrunch Article
  • MIT News Article

Topic 1 - Our topic for today is synthetic data. While the concept of and need for synthetic data have been around for a long time, the topic rarely comes to the forefront, and we haven’t covered it until today. Today is a bit of crossing the streams between developers who need test data and using GenAI to produce it. For this, we’re joined by Kalyan, CEO and Co-Founder of DataCebo. Welcome to the show.

Topic 2 - First, for those not familiar, what is synthetic data? What is the use case and need? What problem is it solving today?

Topic 2a - Hopefully, listeners out there are making the connection to the advantages of GenAI for synthetic data, but take us through your original concept at MIT and the history of Synthetic Data Vault (SDV).

Topic 3 - We recently did a show on the security and privacy of training LLMs, where we covered the need to mask PII in training data for compliance. I can also see bias issues coming into play, or a need for training data that doesn’t exist in the real world (the weather models example). What are some of the use cases you’ve seen that require synthetic data sets? Are there certain industries (healthcare, financial services, etc.) that benefit?

Topic 4 - You were designing this based on GenAI before GenAI was “cool”. How has the rise of LLMs impacted this space?

Topic 5 - If I understand this correctly, organizations would apply generative AI to a problem by describing the data set they need; the model would then evaluate the available data and create a quality synthetic, or “fake,” data set. How would the organization verify the quality of that data set? How would they validate that a synthetic data set is as good as the original data?
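
One way to picture the validation question in Topic 5 is to compare the distribution of a real column with its synthetic counterpart. The sketch below is only an illustration under stated assumptions, not DataCebo's or SDV's method: the "generator" is a plain normal fit, the `real_ages` column is made up, and the helper names `fit_and_sample` and `ks_statistic` are hypothetical. It uses the two-sample Kolmogorov-Smirnov statistic, where a score near 1.0 suggests the two columns have similar shapes.

```python
import random
import statistics
from bisect import bisect_right


def fit_and_sample(real, n, seed=0):
    """Fit a normal distribution to `real` and draw n synthetic values.

    A stand-in for a real generative model, used here only for illustration.
    """
    rng = random.Random(seed)
    mu = statistics.fmean(real)
    sigma = statistics.stdev(real)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two step-shaped CDFs occurs at an observed value.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical "real" column: 500 made-up ages, for illustration only.
rng = random.Random(42)
real_ages = [rng.gauss(40, 12) for _ in range(500)]
synthetic_ages = fit_and_sample(real_ages, 500, seed=1)

# Similarity score in [0, 1]; 1.0 means the empirical CDFs are identical.
score = 1.0 - ks_statistic(real_ages, synthetic_ages)
print(f"column shape similarity: {score:.3f}")
```

A single-column check like this says nothing about relationships between columns or about privacy, so in practice it would be one of several checks rather than a verdict on its own.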

Topic 6 - Let’s talk about resources for a bit. When I think of GenAI and training, I think of large amounts of hardware, in particular GPUs that might have limited availability. Is that true here? Also, is this on-prem, in the cloud, or both?

FEEDBACK?

  • Email: show at the cloudcast dot net
  • Twitter: @cloudcastpod
  • Instagram: @cloudcastpod
  • TikTok: @cloudcastpod
The Cloudcast, by Massive Studios