“It is a capital mistake to theorize before one has data.” –Arthur Conan Doyle (Sherlock Holmes, A Study in Scarlet)
What is the fodder that feeds generative AI? Of course, there is massive software programming, but creating useful output requires data. Tons of data.
Anne Boysen has a master’s degree in strategic foresight from the University of Houston and a graduate certificate in business analytics from Penn State University. She has worked in high tech for six years, works on foresight projects, and uses data mining and analytics in her research. She is generally recognized as one of the top data experts in the professional futurist community. In this episode she provides an overview of the state of big data and its importance in “feeding” today’s generative AI models.
You can subscribe to Seeking Delphi on Apple Podcasts, PlayerFM, MyTuner, Listen Notes, I Heart Radio, Podchaser, Blubrry Podcasts and many others. You can also follow us on Twitter @Seeking_Delphi and on Facebook.
Episode #62: AI and the Future of Big Data, with Anne Boysen
–Less public access and ethical considerations
–Better ability to combine different types of data
–Synthetic data
–More diluted data
Less public access and ethical considerations
If data is the new oil, the land grab is coming to an end. That time when anyone could grab a piece of the digital turf and put up their yard sign unsuspectingly is fading away. You still can, but you now know you may not own mining rights to the treasure beneath the soil of your homestead.
This realization has made people more cautious, and considerations around IP and privacy make important data less accessible. Tech companies are also more protective of user-generated content, both for liability reasons and to preserve their ability to capitalize on it. It wasn’t long ago that Elon Musk decided to put Twitter’s public tweets behind a paywall. I can no longer use the Application Programming Interface (API) to access tweets for sentiment analysis in my foresight research, which was vital for monitoring trends I could not access any other way. Being able to take the pulse of public opinion was a phenomenal way for futurists to gain early insight into trends that otherwise would have stayed below the radar and the big headlines. Moves like this monopolize not only the data, but also the AI models that feed on this data.
So we see an inverse curve where there is more hope tied to advanced models, but less access for these models to feed themselves.
More ability to combine different types of data
Thankfully, the way we store, extract, transform and load our data is advancing along with the models, so we can get more “bang for our buck”. Different types of data used to be stored in silos, so businesses had a hard time accessing even their own data for analysis. It took a lot of time to clean and combine. But with the arrival of data lakes, we can now store different data formats in combinable ways, which gives us better access to unstructured data and lets us query different formats together.
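To make the idea of querying different formats together concrete, here is a toy sketch in Python. It is not how a data lake engine works internally. It only shows the end result: a structured CSV extract and a semi-structured JSON feed joined on a shared key. The product IDs, field names and values are all hypothetical.

```python
import csv
import io
import json

# Hypothetical structured data (CSV export from a sales system).
sales_csv = """product_id,units_sold
A1,120
B2,45
"""

# Hypothetical semi-structured data (JSON lines from a review feed).
reviews_jsonl = """{"product_id": "A1", "text": "Works great", "stars": 5}
{"product_id": "A1", "text": "Stopped working", "stars": 2}
{"product_id": "B2", "text": "Okay for the price", "stars": 3}
"""

# Read the CSV into a lookup of units sold per product.
units = {
    row["product_id"]: int(row["units_sold"])
    for row in csv.DictReader(io.StringIO(sales_csv))
}

# Collect star ratings per product from the JSON lines.
stars = {}
for line in reviews_jsonl.splitlines():
    review = json.loads(line)
    stars.setdefault(review["product_id"], []).append(review["stars"])

# Join the two formats on the shared key.
combined = {
    pid: {"units_sold": units[pid], "avg_stars": sum(s) / len(s)}
    for pid, s in stars.items()
}
print(combined)
```

In practice a query engine sitting on top of the lake would do this join at scale; the point here is only that both formats end up answering one question.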
Synthetic data
Another way to overcome data scarcity is through creating synthetic data. This is a way to make sure the core distributions remain intact but we add some “jitters” to camouflage certain aspects of the original data or create larger quantities.
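A minimal sketch of the jitter idea, assuming a single numeric column and Gaussian noise. The `jitter` helper, the age data and the noise scale are all hypothetical choices for illustration; real synthetic-data generators typically model relationships across many columns rather than perturbing one.

```python
import random
import statistics


def jitter(values, scale, seed=0):
    """Return a copy of values with small zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]


# Hypothetical "original" data: 1,000 ages drawn around a mean of 40.
source_rng = random.Random(42)
original_ages = [source_rng.gauss(40.0, 12.0) for _ in range(1000)]

# The noise scale is an illustrative choice: small relative to the spread of the data.
synthetic_ages = jitter(original_ages, scale=2.0, seed=7)

# Compare aggregate statistics of the two datasets.
for label, data in (("original", original_ages), ("synthetic", synthetic_ages)):
    print(f"{label:9s} mean={statistics.mean(data):.2f} stdev={statistics.stdev(data):.2f}")
```

The individual values change, but the mean and spread stay close to the original, which is the property that matters for trend-level analysis.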
There are different reasons why we may want to use synthetic data. First and foremost, we may want to remove personally identifiable information (PII). Even if we remove name, address and other identifiers from an original dataset, it doesn’t take many combined data points to reconstruct a person’s identity. The beauty of synthetic data is that we can remove all this and still keep the aggregate level distributions to see the main trends.
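The re-identification risk can be illustrated with a small count of how many records become unique as fields are combined. The records, field names and the `unique_share` helper below are hypothetical; this is a sketch of the reasoning, not a privacy audit.

```python
from collections import Counter

# Hypothetical records with names and addresses already removed.
records = [
    {"zip": "78701", "birth_year": 1985, "gender": "F"},
    {"zip": "78701", "birth_year": 1985, "gender": "M"},
    {"zip": "78701", "birth_year": 1990, "gender": "F"},
    {"zip": "78702", "birth_year": 1985, "gender": "F"},
    {"zip": "78702", "birth_year": 1985, "gender": "F"},
    {"zip": "78702", "birth_year": 1972, "gender": "M"},
]


def unique_share(rows, fields):
    """Share of rows whose combination of `fields` appears exactly once."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    singled_out = sum(1 for row in rows if counts[tuple(row[f] for f in fields)] == 1)
    return singled_out / len(rows)


# Each extra field makes more rows one-of-a-kind.
for fields in (["gender"], ["zip", "gender"], ["zip", "birth_year", "gender"]):
    print(fields, round(unique_share(records, fields), 2))
```

In this toy dataset nobody is singled out by gender alone, a third of the rows are unique once zip code is added, and two thirds are unique with all three fields combined.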
We can also use synthetic data to create more data. I did this recently in a deep learning model and it worked remarkably well. I was worried that training on synthetic data would cause the model to overfit, but when I later got access to more original data from the same source, the performance stayed very close to what I had seen before.
Of course, synthetic data has a drawback. You don’t really get to discover the outliers, what we futurists call fringe or weak signals, so it will only amplify the patterns we already have.
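Here is a small sketch of how an outlier can vanish. It assumes a deliberately simple generator that fits a normal distribution to the original data and samples from it; that is a stand-in chosen for brevity, not the method described in the episode, and the numbers are hypothetical.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical original data: 999 ordinary readings plus one weak signal far from the rest.
original = [rng.gauss(50.0, 5.0) for _ in range(999)] + [500.0]

# A deliberately simple generator: fit a normal distribution, then sample from it.
mu = statistics.mean(original)
sigma = statistics.stdev(original)
synthetic = [rng.gauss(mu, sigma) for _ in range(1000)]

print(f"original  max: {max(original):.1f}")
print(f"synthetic max: {max(synthetic):.1f}")
```

The fitted generator captures the average and the spread, but the one reading at 500 has no counterpart in the synthetic sample.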
More diluted data
In this scenario we will still train large models even if data is less accessible. It may be tempting for some to train models using bad data or diluted derivative data produced by AI. This is like ingesting vomit. The “nutrients” have already been absorbed, meaning the variety and serendipity that existed in the original may be gone. This is very different from synthetic data, which keeps the properties intact. Many people mix this up.
A few words about generative AI: much ado about not a whole lot at the moment. This comes down to a mismatch between the kind of model an LLM is, the type of data it ingests and how it trains on that data on the one hand, and most real, “unsexy” business needs on the other.
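The loss of variety can be sketched with a toy loop in which each generation is trained only on the previous generation's output. The "model" here is an assumption made for illustration: it simply resamples its training data, so it can repeat what it has seen but never invent anything new. The `next_generation` helper and the numbers are hypothetical.

```python
import random


def next_generation(data, rng):
    """Toy 'model': produce new data by resampling what it was trained on."""
    return [rng.choice(data) for _ in range(len(data))]


rng = random.Random(3)
data = list(range(200))  # 200 distinct "ideas" in the original human-made data
distinct_per_generation = [len(set(data))]

for _ in range(10):
    data = next_generation(data, rng)  # train on the previous generation's output
    distinct_per_generation.append(len(set(data)))

print(distinct_per_generation)
```

The dataset stays the same size every round, yet the number of distinct items can only shrink, because anything that drops out of one generation is never available to the next.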
Generative AI such as LLMs will probably help businesses in some hybrid form, but not as the “out-of-the-box” solution we see today.
Future of data conclusion
–Synthetic data will make up for reduced access, but it will suppress important outliers and pull results toward the mean even more
–Peak access to random data is behind us
–Opt-in data will never be representative
Previous podcasts in this AI series
#59–Transitioning to AGI, Implications and Regulations with Jerome Glenn
#60–Investing in AI and AI in Investing with Jim Lee
#61–Keeping it Human, with Dennis Draeger