O'Reilly Data Show Podcast

Building a natural language processing library for Apache Spark



When I first discovered and started using Apache Spark, most of my use cases involved unstructured text. The absence of libraries meant rolling my own NLP utilities and, in many cases, implementing a machine learning library (this was pre-deep learning, and MLlib was much smaller). I’d always wondered why no one had bothered to create an NLP library for Spark when so many people were using Spark to process large amounts of text. The recent, early success of BigDL confirms that users like the option of having native libraries.
In this episode of the Data Show, I spoke with David Talby of Pacific.AI, a consulting company that specializes in data science, analytics, and big data. A couple of years ago I mentioned the need for an NLP library within Spark to Talby; he not only agreed, he rounded up collaborators to build such a library. They eventually carved out time to build the newly released Spark NLP library. Judging by the reception received by BigDL and the number of Spark users faced with large-scale text processing tasks, I suspect Spark NLP will be a standard tool among Spark users.
Talby and I also discussed his work helping companies build, deploy, and monitor machine learning models. Tools and best practices for model development and deployment are just beginning to emerge—I summarized some of them in a recent post, and, in this episode, I discussed these topics with a leading practitioner.
Here are some highlights from our conversation:
The state of NLP in Spark
Here are your two choices today. One option is to leverage all of the performance and optimization that Spark gives you, which means you want to stay basically within the JVM and use a Java-based library. In that case, you have options that include OpenNLP, which is open source, or Stanford NLP, which requires licensing in order to use in a commercial product. These are older and more academically oriented libraries, so they have limitations in performance and what they do.
Another option is to look at something like spaCy, a Python-based library that really has raised the bar in terms of usability and the trade-offs between analytical accuracy and performance. But then your challenge is that you have your text in Spark, and to call the spaCy pipeline, you basically have to move the data from the JVM to a Python process, do some processing there, and send it back. In practice, that means you take a huge performance hit, because most of the processing you do is really moving strings between operating system processes.
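The round trip described above can be sketched in a single process. This is only an illustration under stated assumptions: `pickle` stands in for whatever serialization a real JVM-to-Python bridge uses, and the function names `process_in_place` and `process_via_round_trip` are hypothetical, not part of Spark or spaCy. The point it shows is that every string is serialized and copied twice (once out, once back) before and after the actual NLP work.

```python
import pickle
from typing import Callable, List, Tuple


def process_in_place(texts: List[str], fn: Callable[[str], str]) -> List[str]:
    """Apply fn where the data already lives: no copies, no serialization."""
    return [fn(t) for t in texts]


def process_via_round_trip(
    texts: List[str], fn: Callable[[str], str]
) -> Tuple[List[str], int]:
    """Simulate shipping a batch to another process and back.

    pickle is an assumption here; it stands in for the real wire format.
    Returns the results and the number of bytes that had to be moved.
    """
    outbound = pickle.dumps(texts)        # serialize on the "JVM" side
    received = pickle.loads(outbound)     # deserialize on the "Python" side
    results = [fn(t) for t in received]   # the actual NLP work
    inbound = pickle.dumps(results)       # serialize the results
    returned = pickle.loads(inbound)      # deserialize back on the "JVM" side
    return returned, len(outbound) + len(inbound)


if __name__ == "__main__":
    batch = ["Spark handles text.", "NLP at scale"]
    same_process = process_in_place(batch, str.upper)
    round_trip, bytes_moved = process_via_round_trip(batch, str.upper)
    print(same_process == round_trip, bytes_moved)
```

Both paths produce the same results; the second one simply pays for moving the whole batch across a process boundary twice, and that cost grows with the volume of text.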
… So, really what we were looking for is a solution that works on text directly, within a data frame: a tool that takes into account everything Spark gives you in terms of caching, distributed computation, and the other optimizations. This enables users to basically run an NLP and machine learning pipeline directly on their text.
Enter Spark NLP
Spark NLP. Image by David Talby, used with permission.
The core purpose of an NLP library is the ability to take text and then apply a set of annotations to that text. So, the basic annotations we ship in this initial version of Spark NLP include things like a tokenizer, a lemmatizer, sentence boundary detection, and paragraph boundary detection. Then on top of that, we include things like sentiment analysis, a spell checker so we can auto-suggest corrections, and a dependency parser so we can know not just that we have a noun and a verb, but also that this verb talks about the specific noun, which is often semantically interesting. We also include named entity recognition algorithms.
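To make the idea of "applying annotations to text" concrete, here is a minimal sketch in plain Python. It is not Spark NLP's API: the `Annotation` class, the function names, and the regex rules are all assumptions chosen for illustration, and real sentence boundary detection and tokenization are considerably more involved. It shows two of the annotators mentioned above, with the tokenizer building on the output of the sentence detector.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """A labelled span over the original text (hypothetical structure)."""
    kind: str    # e.g. "sentence" or "token"
    begin: int   # index of the first character of the span
    end: int     # index one past the last character of the span
    result: str  # the text covered by the span


def detect_sentences(text: str) -> List[Annotation]:
    """Naive rule: a sentence runs up to and including '.', '!' or '?'."""
    pattern = r"[^.!?\s][^.!?]*[.!?]?"
    return [
        Annotation("sentence", m.start(), m.end(), m.group())
        for m in re.finditer(pattern, text)
    ]


def tokenize(sentences: List[Annotation]) -> List[Annotation]:
    """Split each sentence into word and punctuation tokens.

    Offsets are shifted by the sentence start so they point into the
    original text rather than into the sentence.
    """
    tokens = []
    for s in sentences:
        for m in re.finditer(r"\w+|[^\w\s]", s.result):
            tokens.append(
                Annotation("token", s.begin + m.start(), s.begin + m.end(), m.group())
            )
    return tokens


def annotate(text: str) -> Dict[str, List[Annotation]]:
    """Run the two annotators in order; the second consumes the first's output."""
    sentences = detect_sentences(text)
    return {"sentence": sentences, "token": tokenize(sentences)}


if __name__ == "__main__":
    for kind, spans in annotate("Spark is fast. It handles text!").items():
        print(kind, [a.result for a in spans])
```

Keeping character offsets on every annotation means later stages (a lemmatizer, a named entity recognizer, and so on) can add their own spans without modifying the original text.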
Deploying and monitoring machine learning models in production
I think what’s happening is that people expect basic model development to be very similar to software development. When we started doing software development, we started it wrong. We assumed software engineering was a lot like civil engineering or mechanical engineering. It took a good 30 years until we said no, this is actual ...

O'Reilly Data Show Podcast, by O'Reilly Media
