
Ethical concerns about the use of AI have to start with training data. Too often, the primary concern is simply generating sufficient data rather than understanding its nature. Emily Jasper and Abby Simmons are back to continue the conversation started in episode 198 with host Eric Hanselman. With generative AI, the data is the application in its most formative sense. Unlike traditional application development, where the expectation is that functionality will be expanded in later releases, GenAI applications require careful design of training data before training takes place. The perspectives contained in data age rapidly, and model training doesn't differentiate between outdated and current information. Old data can effectively poison model outputs. Businesses risk alienating customers with models trained on data that doesn't properly represent them. This is particularly true with marginalized communities, where language and context can change over shorter time frames.
While there is research work on model retraining, work in AI today has to focus on effective data quality management. DeepSeek is prompting a significant rethinking. Human data cleansing can be effective, but it can't scale to AI demands. Data workbench tools and synthetic data approaches can help, but better automation is needed to ensure that data sets are truly representative. Data collection and data sourcing need much greater attention to ensure that model results can engage the target audience rather than become a liability. It's a fundamental question of accountability that requires thinking in ways that differ from legacy development processes.