Let's say you're on the verge of developing an awesome new AI language model. But here's a critical question: how do you ensure that your use of training data aligns with its licensing terms? How do you even find out what those licensing terms are? Here's another question: how do you find out where the dataset came from and what's inside it? And how do you prevent the dataset from introducing bias and toxicity into your model?
These are some of the key questions we're discussing in this week’s episode. I spoke with Robert Mahari and Shane Longpre from the Data Provenance Initiative, a research project and online tool that helps researchers, startups, legal scholars, and other interested parties track the lineage of AI fine-tuning datasets. Shane and Robert are both PhD candidates at MIT’s Media Lab, and Robert is also a J.D. candidate at Harvard Law School.