July 26, 2020

Minimally-overlapping words for sequence similarity search

Link to bioRxiv paper:

http://biorxiv.org/cgi/content/short/2020.07.24.220616v1?rss=1

Authors: Frith, M. C., Noe, L., Kucherov, G.

Abstract:

Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Here we study a simple sparse-seeding method: using seeds at positions of certain "words" (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed "minimizer" sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.
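The word-based sparse seeding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (word_positions, sparse_seeds, seed_hits) and the seed_len parameter are hypothetical, and it uses plain exact-match seeds rather than the spaced or subset seeds the paper also discusses. The word set ac, at, gc, gt is the example from the abstract. None of these words can start where another one ends, so two adjacent positions are never both selected.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any of the given words occurs."""
    seq = seq.lower()
    wordset = {w.lower() for w in words}
    lengths = {len(w) for w in wordset}
    return [i for i in range(len(seq))
            if any(seq[i:i + n] in wordset for n in lengths)]


def sparse_seeds(seq, words, seed_len):
    """Index exact-match seeds of length seed_len, but only at word positions."""
    seeds = {}
    for i in word_positions(seq, words):
        kmer = seq[i:i + seed_len].lower()
        if len(kmer) == seed_len:  # skip seeds that would run off the end
            seeds.setdefault(kmer, []).append(i)
    return seeds


def seed_hits(a, b, words, seed_len):
    """Return sorted (position in a, position in b) pairs that share a sparse seed."""
    index_a = sparse_seeds(a, words, seed_len)
    hits = []
    for kmer, positions_b in sparse_seeds(b, words, seed_len).items():
        for i in index_a.get(kmer, []):
            for j in positions_b:
                hits.append((i, j))
    return sorted(hits)


# Example word set taken from the abstract.
WORDS = {"ac", "at", "gc", "gt"}
```

For example, word_positions("acgtac", WORDS) selects positions 0, 2 and 4, and seed_hits("ttacgtaa", "ggacgtcc", WORDS, 4) reports the single shared seed "acgt" at position 2 in both sequences.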

Copyright belongs to the original authors. Visit the link for more info.

By Multimodal LLC