October 24, 2023

Ep 6: Extracting Data from Old Documents with Rosa Lin, Founder, Tolstoy

35 minutes

Rosa Lin is the founder of Tolstoy [www.tolstoy.ai], which specializes in extracting data from documents. As I learned, this is a much tougher problem than traditional OCR! It requires a combination of deep learning and classic CV methods. Rosa also talks about her fascinating background as a journalist and her experience going through Y Combinator.

For more about this podcast, visit www.yaoshiang.com/podcast.html.

For the video version including visual examples of Tolstoy's work, visit https://www.youtube.com/watch?v=QtHEXvcGGRs&t=9s.

00:26: The problems Tolstoy solves: extracting data from documents like emails, news articles, forms, and handwritten notes and then running NLP algorithms to classify and summarize.

02:54: Typical customers: tech startups, news organizations, utilities, energy companies, legal firms, and educational institutions.

05:05: First walk-through of a use case: Digitizing articles for The Wall Street Journal (with images showing why off-the-shelf OCR failed).

07:19: Specifics of why OCR fails: multiple articles in a single page, columns, images, headings, and handwriting.

09:18: Training a custom model to deal with columns, with visuals showing how Tolstoy works better than Google Cloud Vision.

11:30: A classic computer vision algorithm for identifying paragraphs.

12:30: Transfer learning with modern Convolutional Neural Networks to identify images vs text.

13:38: Second walk-through of a use case: a classification task for a utility company to help find lead pipes.

15:20: Can you spot the handwritten word "lead"?

17:50: Tips for building products around inevitably imprecise ML models.

19:37: Rosa's personal journey from biology and journalism to entrepreneurship and ML.

22:49: Seeing the promise of AI in 2015 while at the World Bank and starting an AI hobbyist club.

26:25: How training in journalism translated to the skills required for entrepreneurship.

28:40: Rosa's experience with Y Combinator (YC W17).

By Yaoshiang Ho