Building Things with Machine Learning

Ep 6: Extracting Data from Old Documents with Rosa Lin, Founder, Tolstoy



Rosa Lin is the founder of Tolstoy (www.tolstoy.ai), which specializes in extracting data from documents. As I learned, this is a much tougher problem than traditional OCR! It requires a combination of deep learning and classic computer vision (CV) methods. Rosa also talks about her fascinating background as a journalist and her experience going through Y Combinator.

For more about this podcast, visit www.yaoshiang.com/podcast.html

For the video version including visual examples of Tolstoy's work, visit https://www.youtube.com/watch?v=QtHEXvcGGRs&t=9s

00:26: The problems Tolstoy solves: extracting data from documents like emails, news articles, forms, and handwritten notes, and then running NLP algorithms to classify and summarize them.

02:54: Typical customers: tech startups, news organizations, utilities, energy companies, legal firms, and educational institutions.

05:05: First walk-through of a use case: digitizing articles for The Wall Street Journal (with images showing why off-the-shelf OCR failed).

07:19: Specifics of why OCR fails: multiple articles on a single page, columns, images, headings, and handwriting.

09:18: Training a custom model to deal with columns, with visuals showing how Tolstoy works better than Google Cloud Vision. 

11:30: A classic computer vision algorithm for identifying paragraphs.

12:30: Transfer learning with modern convolutional neural networks to distinguish images from text.

13:38: Second walk-through of a use case: a classification task for a utility company to help find lead pipes. 

15:20: Can you spot the handwritten word “lead”? 

17:50: Tips for building products around inevitably imprecise ML models. 

19:37: Rosa’s personal journey from biology and journalism to entrepreneurship and ML.

22:49: Seeing the promise of AI in 2015 while at the World Bank and starting an AI hobbyist club.

26:25: How training in journalism translated to the skills required for entrepreneurship.

28:40: Rosa’s experience with Y Combinator (YC W17).

 


Building Things with Machine Learning, by Yaoshiang Ho