the bioinformatics chat

#14 Generating functions for read mapping with Guillaume Filion


Listen Later

Guillaume Filion recently published a preprint in which he applies

generating functions, a concept from analytic combinatorics,
to estimating the optimal seed length for read mapping.

In this episode, Guillaume and I attempt to explain the core concepts from

analytic combinatorics and why they are useful in modeling sequences.

Links:

  • Guillaume’s preprint: Analytic combinatorics for bioinformatics I: seeding methods
  • Once upon a BLAST
  • Guillaume’s blog, «The Grand Locus»
  • Dan Gusfield’s home page
  • featuring the fast fourier transform lectures I mention in the podcast

    After we recorded the podcast, Guillaume wrote to me to clarify the

    relationship between read mapping and BLAST:

    I looked into my notes about BLAST. The problem that it solves is the following: “Given that a local alignment has score S, what is the probability that it does not contain a word of score T or greater”? The background work of Karlin and Altschul is used to give a statistical significance for S (what is the probability that a “Smith-Waterman random walk” starting at height 0 would reach height S, i.e. what is the probability that aligning two random proteins would yield a score S). The authors write in the original paper “Theory does not yet exist to calculate the probability q that such segment pair will contain a word pair with a score of at least T. However, one argument suggests that q should depend exponentially upon the score of the MSP”.

    This is the part that I did not remember well. MSP stands for Maximal Segment Pair, this is the “longest fragment” with “highest score” in the alignment. I thought that Karlin and Altschul solved this part as well, but the authors just go empirical and they calibrate the relationship between T and S with simulations.

    I realize a little bit better now that my work is precisely about this problem that the authors of BLAST could not solve, but as you pointed out, I am attacking only a very specific sub-case that is much easier because the models of sequencing error are much simpler than protein evolution. BLAST is concerned with local alignment, so it wants to get all the hits with an MSP score above S. Short read mapping just wants the true location of the read, which does not really have the notion of a score S. But still, mathematically, it is equivalent to the case where S is a constant that depends only on the read size and the distribution of the score T depends only on the seed length and the error rate. I have a few ideas of how to use analytic combinatorics to solve the problem for proteins, but it is mostly complicated because the variable of interest T is a fractional numbers and not an integer…

    So what is different from BLAST? The right answer (I think) is that BLAST finds all the hits with an MSP above statistical background, but it says nothing of the probability that the true location contains such an MSP, so it is hard to calibrate the heuristic for that specific problem. In reality, the parallel with BLAST is just the basic strategy: make a statistical model for your problem and use it to calibrate the heuristic.

    If you enjoyed this episode, please consider supporting the podcast on Patreon.

    ...more
    View all episodesView all episodes
    Download on the App Store

    the bioinformatics chatBy Roman Cheplyaka

    • 4.8
    • 4.8
    • 4.8
    • 4.8
    • 4.8

    4.8

    34 ratings


    More shows like the bioinformatics chat

    View all
    Radiolab by WNYC Studios

    Radiolab

    43,909 Listeners

    This American Life by This American Life

    This American Life

    90,830 Listeners

    The Bioinformatics and Beyond Podcast by Leo Elworth

    The Bioinformatics and Beyond Podcast

    10 Listeners

    If Books Could Kill by Michael Hobbes & Peter Shamshiri

    If Books Could Kill

    8,747 Listeners

    the bioinformatics lab by The Bioinformatics Lab

    the bioinformatics lab

    0 Listeners