: This module contains a series of functions to operate on q-grams as well as two q-gram index data-structures together with very high throughput parallel construction algorithms and efficient q-gram counting primitives.

Performance

: The first graph shows the throughput of NVBIO's q-gram indexing algorithms on a K40 GPU. The benchmark consists in building a q-gram index on differently sized subsets of the 8-, 16- and 20-mers obtained from a set of 1M x 150bp reads (SRR493095). Notice that the throughput increases with the size of the set, until it saturates around ~32M q-grams.

: The next graph shows the performance of NVBIO's q-gram counting queries. The benchmark consists in building a q-gram index on the 22-mers obtained sampling a set of 1M x 150bp reads (SRR493095) every 10 bases, and streaming the whole human genome hg19 against it to find all matching q-grams. Specifically, the graph shows the throughput of the following three stages:

ranking: the process of finding the range of hits matching each query q-gram in the q-gram index
locating: the process of enumerating all found hits as (read-id,text-diagonal) pairs
counting: the process of bucketing the found hits by diagonal (separately for each read) and counting the occurrences in each bucket

Q-Gram Indices

: Q-gram indices are data-structures providing fast searching of exact or approximate q-grams (or k-mers), i.e. short strings of text containing q symbols. This module provides two such data-structures:

the Q-Group Index, replicating the data-structure described in:
Massively parallel read mapping on GPUs with PEANUT
Johannes Koester and Sven Rahmann
http://arxiv.org/pdf/1403.1706v1.pdf
this data-structure requires O(A^q) bits of storage in the alphabet-size A and the q-gram length q (to be precise, 2*A^q bits + (min(A^q,|T|) + |T|) words), and provides O(1) query time;
the compact Q-Gram Index, which can be built over a string T, with memory consumption and query time proportional to O(|T|) and O(log(unique(T))) respectively, where unique(T) is the number of unique q-grams in T. This is achieved by keeping a plain sorted list of the unique q-grams in T, together with an index of their occurrences in the original string T. This data-structure offers up to 5x higher construction speed and a potentially unbounded improvement in memory consumption compared to the Q-Group Index, though the query time is asymptotically higher.

: Q-gram indices can be built both on strings and on string sets (in which case we call them set-indices). The difference relies on the format of the coordinates associated to their q-grams: for strings, the coordinates are simple linear indices, whereas for string-sets the coordinates are (string-id,string-position) index pairs.

: The following code sample shows how to build a QGramIndex over a simple string:
#include <nvbio/qgram/qgram.h>

#include <nvbio/basic/vector.h>

#include <nvbio/basic/dna.h>

...

// consider a DNA string in ASCII format

const char* a_string = "ACGTACGTACGTACGTACGTACGTACGTACGT";

const uint32 string_len = strlen( a_string );

// convert the string to a 2-bit DNA alphabet

nvbio::vector<host_tag,uint8> h_string( string_len );

string_to_dna( a_string, string_len, h_string.begin() );

// copy the string to the device

nvbio::vector<device_tag,uint8> d_string( h_string );

// build a q-gram index on the device

QGramIndexDevice qgram_index;

qgram_index.build(

20u, // q-group size

2u, // the alphabet size, in bits

string_len, // the length of the string we want to index

d_string.begin() ); // the string we want to index

: The next example shows how to do the same with a string-set:
#include <nvbio/qgram/qgram.h>

#include <nvbio/basic/vector.h>

#include <nvbio/strings/string_set.h>

#include <nvbio/basic/dna.h>

...

// consider a DNA string in ASCII format

const char* a_string = "ACGTACGTACGTACGTACGTACGTACGTTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTACGTACGTACGTACGTACGTACGTACGTACGT";

const uint32 string_len = strlen( a_string );

// convert the string to a 2-bit DNA alphabet

nvbio::vector<host_tag,uint8> h_string( string_len );

string_to_dna( a_string, string_len, h_string.begin() );

// copy the string to the device

const uint32 n_strings = 2u;

nvbio::vector<device_tag,uint8> d_string( h_string );

nvbio::vector<device_tag,uint32> d_string_offsets( n_strings+1 );

d_string_offsets[0] = 0; // offset to the first string

d_string_offsets[1] = 20; // offset to the second string

d_string_offsets[2] = 40; // end of the last string

typedef ConcatenatedStringSet<const uint8*,const uint32*> string_set_type;

const string_set_type string_set(

n_strings,

nvbio::plain_view( d_string ),

nvbio::plain_view( d_string_offsets ) );

// build a q-gram index on the device

QGramSetIndexDevice qgram_index;

const uint32 q = 20u;

qgram_index.build(

q, // q-group size

2u, // the alphabet size, in bits

string_set ); // the string-set we want to index

: Note: sometimes you might not want to index all the q-grams in your string-set, but rather extract one every N bases, or perhaps use some custom logic of your own. Through Seeding Functors, NVBIO provides a mechanism to customize the seeding logic. For example, the following code will extract a seed every 10 bases:
qgram_index.build(

q, // q-group size

2u, // the alphabet size, in bits

string_set, // the string-set we want to index

uniform_seeds_functor( q, 10u ) ); // extract a q-gram every 10 bases

Q-Gram Index Queries

: Once a q-gram index is built, it would be interesting to perform some queries on it. There's various ways to accomplish this, and here we'll show a couple. The simplest query one can do is locating for a given q-gram the range of indexed q-grams which match it exactly. This can be done as follows:
// consider a sample string - we'll want to find all matching locations between all

// q-grams in this string and all q-grams in the index

const char* a_query_string = "CGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTA"

const uint32 query_string_len = strlen( a_query_string );

// convert the string to a 2-bit DNA alphabet

nvbio::vector<host_tag,uint8> h_string( query_string_len );

string_to_dna( a_query_string, query_string_len, h_query_string.begin() );

// copy the string to the device

nvbio::vector<device_tag,uint8> d_query_string( h_query_string );

const uint32 q = 20u;

// extract a set of query q-gram indices, say one every 3 bases

nvbio::vector<device_tag,uint32> d_query_indices;

const uint32 n_query_qgrams = enumerate_string_seeds( query_string_len, uniform_seed_functor( q, 3u ), d_query_indices );

// build the set of query q-grams: this can be done using the string_qgram_functor

// to extract them; note that we need at least 20 x 2 = 40 bits per q-gram, hence we

// store them in a uint64 vector

nvbio::vector<device_tag,uint64> d_query_qgrams( n_query_qgrams );

generate_qgrams(

q, 2u, // q-gram length, alphabet size

query_string_len, nvbio::plain_view( d_query_string ), // input string

n_query_qgrams, // the number of q-grams

d_query_indices.begin(), // input q-gram coordinates

d_query_qgrams.begin() ); // output q-grams

// now sort the q-grams and their original indices, this improves coherence and throughput

thrust::sort_by_key(

d_query_qgrams.begin(),

d_query_qgrams.begin() + n_query_qgrams,

d_query_indices.begin() );

// find the above q-grams

nvbio::vector<device_tag,uint32> d_ranges( query_string_len - 20 );

// use the plain-view of the q-gram index itself as a search functor, that we "apply"

// to our query q-grams to obtain the corresponding match ranges

thrust::transform(

d_query_qgrams.begin(),

d_query_qgrams.begin() + n_query_qgrams,

d_ranges.begin(),

nvbio::plain_view( qgram_index ) );

: This of course was just a toy example; in reality, you'll want to this kind of operations with much larger q-gram indices and much larger batches of queries.

Q-Gram Counting

: The previous example was only showing how to get the ranges of matching q-grams inside an index: it didn't show how to get the actual list of hits. One way to go about it is to ask the q-gram index, which given an entry inside each non-empty range, can provide its location. This can be done using the qgram_locate_functor. However, if one is interested in getting the complete list of hits things are more difficult, as the process involves a variable rate data-expansion (as each range might expand to a variable number of hits).

: The QGramFilter provides a convenient way to do this:
#include <nvbio/qgram/filter.h>

...

// suppose we have the previous vectors d_query_qgrams and d_query_indices

...

// find all hits using a q-gram filter

typedef QGramFilterDevice<QGramSetIndexDevice, const uint64*, const uint32*> qgram_filter_type;

typedef qgram_filter_type::hit_type hit_type;

typedef qgram_filter_type::diagonal_type diagonal_type;

qgram_filter_type qgram_filter;

const uint32 n_hits = qgram_filter.rank(

qgram_index,

n_query_qgrams,

nvbio::raw_pointer( d_query_qgrams ),

nvbio::raw_pointer( d_query_indices ) );

//

// loop through large batches of hits and locate them

//

const uint32 batch_size = 16*1024*1024; // 16M hits per batch

// reserve enough storage for each batch

nvbio::vector<device_tag,hit_type> hits( batch_size );

for (uint32 hits_begin = 0; hits_begin < n_hits; hits_begin += batch_size)

{

const uint32 hits_end = nvbio::min( hits_begin + batch_size, n_hits );

// locate all hits in the range [hits_begin, hits_end)

qgram_filter.locate(

hits_begin,

hits_end,

hits.begin() );

}

Note: the hit coordinates are different according to the type of q-gram index;

for simple string indices, the coordinates are (index-pos,query-idx) uint2 pairs
for string-set indices, the coordinates are (index-id,index-pos,query-idx) tuples represented as a uint4, where index-id and index-pos are the index into the string-set used to built the q-gram index.

: Finally, the generated hits can be sorted and merged by diagonal bucket, effectively performing so called q-gram counting.
nvbio::vector<device_tag,hit_type> hits( batch_size );

nvbio::vector<device_tag,diagonal_type> merged_hits( batch_size );

nvbio::vector<device_tag,uint16> merged_counts( batch_size );

for (uint32 hits_begin = 0; hits_begin < n_hits; hits_begin += batch_size)

{

const uint32 hits_end = nvbio::min( hits_begin + batch_size, n_hits );

// locate all hits in the range [hits_begin, hits_end)

qgram_filter.locate(

hits_begin,

hits_end,

hits.begin() );

// merge all hits within the same diagonal interval

const uint32 n_merged = qgram_filter.merge(

16u, // merging interval

hits_end - hits_begin,

hits.begin(),

merged_hits.begin(),

merged_counts.begin() );

}

Technical Overview

: A complete list of the classes and functions in this module is given in the Q-Gram Module documentation.