Data Engineering Podcast

An Exploration Of The Data Engineering Requirements For Bioinformatics


Summary

Biology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and computational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analytical capabilities. In this episode, Jillian Rowe shares her experience of working in the field and supporting teams of scientists and analysts with the data infrastructure that they need to get their work done. This is a fascinating exploration of the collaboration between data professionals and scientists.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
  • Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 (Advanced All-In-One Virtual Reality Headset). RSVP today – you don’t want to miss it!
  • Your host is Tobias Macey and today I’m interviewing Jillian Rowe about data engineering practices for bioinformatics projects

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • How did you get into the field of bioinformatics?
  • Can you describe what is unique about data needs in bioinformatics?
  • What are some of the problems that you have found yourself regularly solving for your clients?
  • When building data engineering stacks for bioinformatics, what are the attributes that you are optimizing for? (e.g. speed, UX, scale, correctness, etc.)
  • Can you describe a typical set of technologies that you implement when working on a new project?
    • What kinds of systems do you need to integrate with?
  • What are the data formats that are widely used for bioinformatics?
    • What are some details that a data engineer would need to know to work effectively with those formats while preparing data for analysis?
  • What amount of domain expertise is necessary for a data engineer to work in life sciences?
  • What are the most interesting, innovative, or unexpected solutions that you have seen for manipulating bioinformatics data?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on bioinformatics projects?
  • What are some of the industry/academic trends or upcoming technologies that you are tracking for bioinformatics?

Contact Info
  • LinkedIn
  • jerowe on GitHub
  • Website

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
  • Bioinformatics
  • How Perl Saved The Human Genome Project
  • Neo4J
  • AWS Parallel Cluster
  • Datashader
  • R Shiny
  • Plotly Dash
  • Apache Parquet
  • Dask
    • Podcast Episode
  • HDF5
  • Spark
  • Superset
    • Data Engineering Podcast Episode
    • Podcast.__init__ Episode
  • FastQ file format
  • BAM (Binary Alignment Map) File
  • Variant Call Format (VCF)
  • HIPAA
  • DVC
    • Podcast Episode
  • LakeFS
  • BioThings API

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Engineering Podcast by Tobias Macey