A public database such as the SRA (Sequence Read Archive) has reached 30 petabases of raw sequences and doubles its nucleotide content every two years. While BLAST-like methods can routinely search for a sequence in a single genome or a small collection of genomes, making such immense resources accessible is out of reach for alignment-based strategies. In recent years, an abundant literature has tackled the fundamental task of locating a sequence in an extensive dataset collection by converting the query and the datasets to k-mer sets and computing their intersections. Among those methods, approximate membership query data structures combine the ability to query small signatures or variants with scalability to collections of thousands of sequencing experiments. However, at present, collections of more than 10,000 eukaryotic samples cannot be analyzed within reasonable time and space.
Here we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC presents several advantages over the state of the art, enabling users to scale to the next order of magnitude. PAC index construction works in a streaming fashion, without any disk footprint besides the index itself. It shows a 3- to 6-fold improvement in construction time compared to other compressed methods of comparable index size. Using inverted indexes and a novel data structure dubbed aggregative Bloom filters, a PAC query can require a single random access and be performed in constant time in favorable instances.
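To make the general k-mer-based approach concrete, the following is a minimal sketch of an approximate membership query over a small collection, using one plain Bloom filter per dataset. It is not PAC and does not implement aggregative Bloom filters or inverted indexes. All names (`kmers`, `BloomFilter`, `build_index`, `query`) and parameters (`n_bits`, `n_hashes`) are illustrative assumptions. Reverse complements are ignored for simplicity.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class BloomFilter:
    """Plain Bloom filter: no false negatives, some false positives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


def build_index(datasets, k, n_bits=1 << 16, n_hashes=3):
    """Build one Bloom filter per dataset from its k-mer set (toy sizes)."""
    index = {}
    for name, seq in datasets.items():
        bf = BloomFilter(n_bits, n_hashes)
        for km in kmers(seq, k):
            bf.add(km)
        index[name] = bf
    return index


def query(index, seq, k):
    """For each dataset, return the fraction of query k-mers reported present."""
    q = kmers(seq, k)
    if not q:  # query shorter than k: nothing to look up
        return {name: 0.0 for name in index}
    return {name: sum(km in bf for km in q) / len(q) for name, bf in index.items()}


if __name__ == "__main__":
    datasets = {"a": "ACGTACGTGGCCTTAA", "b": "TTTTGGGGCCCCAAAA"}
    index = build_index(datasets, k=4)
    print(query(index, "ACGTACGT", k=4))
```

Because a Bloom filter never yields false negatives, a dataset that contains every k-mer of the query always scores 1.0. Scores for other datasets may be inflated by false positives, which is the sense in which the membership query is approximate.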