The Python Podcast.__init__

Entity Extraction, Document Processing, And Knowledge Graphs For Investigative Journalists with Friedrich Lindenberg



Summary

Investigative reporters face the challenging task of identifying complex networks of people, places, and events gleaned from a mixed collection of sources. Turning those documents, electronic records, and research notes into a searchable and actionable collection of facts is an interesting and difficult technical challenge. Friedrich Lindenberg created the Aleph project to address this issue, and in this episode he explains how it works, why he built it, and how it is being used. He also discusses his hopes for the future of the project and other ways that the system could be used.

Preface
  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you’ve got everything you need to scale up. Go to podcastinit.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions, I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected].
  • To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at podcastinit.com/chat
  • Registration for PyCon US, the largest annual gathering across the community, is open now. Don’t forget to get your ticket and I’ll see you there!
  • Your host as usual is Tobias Macey, and today I’m interviewing Friedrich Lindenberg about Aleph, a tool to perform entity extraction across documents and structured data.

Interview
  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what Aleph is and how the project got started?
  • What is investigative journalism?
    • How does Aleph fit into their workflow?
    • What are some other tools that would be used alongside Aleph?
    • What are some ways that Aleph could be useful outside of investigative journalism?
  • How is Aleph architected and how has it evolved since you first started working on it?
  • What are the major components of Aleph?
    • What are the types of documents and data formats that Aleph supports?
    • Can you describe the steps involved in entity extraction?
      • What are the most challenging aspects of identifying and resolving entities in the documents stored in Aleph?
  • Can you describe the flow of data through the system from a document being uploaded through to it being displayed as part of a search query?
  • What is involved in deploying and managing an installation of Aleph?
  • What have been some of the most interesting or unexpected aspects of building Aleph?
  • Are there any particularly noteworthy uses of Aleph that you are aware of?
  • What are your plans for the future of Aleph?

Keep In Touch
  • Website
  • @pudo on Twitter
  • pudo on GitHub

Picks
  • Tobias
    • Mechanical Soup
  • Friedrich
    • phonenumbers – because it’s useful
    • pyicu – super nerdy but amazing
    • sqlalchemy – my all-time favorite Python package

Links
  • Aleph
  • Organized Crime and Corruption Reporting Project
  • OCR (Optical Character Recognition)
  • Jorge Luis Borges
  • Buenos Aires
  • Investigative Journalism
  • Azerbaijan
  • Signal
  • Open Corporates
  • Open Refine
  • Money Laundering
  • E-Discovery
  • CSV
  • SQL
  • Entity Extraction (Named Entity Recognition)
  • Apache Tika
  • Polyglot
  • SpaCy
    • Podcast.__init__ Episode
  • LibreOffice
  • Tesseract
  • followthemoney
  • Elasticsearch
  • Knowledge Graph
  • Neo4J
  • Gephi
  • Edward Snowden
  • Document Cloud
  • Overview Project
  • Veracrypt
  • Qubes OS
  • I2 Analyst Notebook

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

The Python Podcast.__init__ by Tobias Macey