AI 'N Stuff

Tracing AI Data Origins



Let's say you're on the verge of developing an awesome new AI language model. But here's a critical question: how do you ensure that your use of training data complies with its licensing terms? How do you even find out what those licensing terms are? Here's another question: how do you find out where a dataset came from and what's inside it? And how do you keep the dataset from introducing bias and toxicity into your model?

These are some of the key questions we discuss in this week's episode. I spoke with Robert Mahari and Shayne Longpre from the Data Provenance Initiative, a research project and online tool that helps researchers, startups, legal scholars, and other interested parties trace the lineage of AI fine-tuning datasets. Shayne and Robert are both PhD candidates at MIT's Media Lab, and Robert is also a J.D. candidate at Harvard Law School.


AI 'N Stuff, by James McCammon