PaperPlayer biorxiv bioinformatics

Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation


Listen Later

Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2020.11.14.382713v1?rss=1
Authors: Shibuya, Y., Kucherov, G.
Abstract:
Motivation: In many bioinformatics pipelines, k-mer counting is often a required step, with existing methods focusing on optimizing time or memory usage. These methods usually produce very large count tables explicitly representing k-mers themselves. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) or Count-Min sketches. The former is only applicable to static maps not subject to updates, while the latter suffers from potentially very large point-query errors, making it unsuitable when counters are required to be highly accurate. Results: We introduce Set-Min sketch, a sketching technique inspired by Count-Min sketch, for representing associative maps, more specifically, k-mer count tables. We show that Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, much lower than a Count-Min sketch of similar dimensions. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, especially for large values of k. Space-efficiency of Set-min takes advantage of the power-law distribution of k-mer counts in genomic datasets.
Copy rights belong to original authors. Visit the link for more info
...more
View all episodesView all episodes
Download on the App Store

PaperPlayer biorxiv bioinformaticsBy Multimodal LLC