NVBIO
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Q-Gram Module


This module contains a series of functions to operate on q-grams as well as two q-gram index data-structures together with very high throughput parallel construction algorithms and efficient q-gram counting primitives.

Performance

The first graph shows the throughput of NVBIO's q-gram indexing algorithms on a K40 GPU. The benchmark consists in building a q-gram index on differently sized subsets of the 8-, 16- and 20-mers obtained from a set of 1M x 150bp reads (SRR493095). Notice that the throughput increases with the size of the set, until it saturates around ~32M q-grams.
The next graph shows the performance of NVBIO's q-gram counting queries. The benchmark consists in building a q-gram index on the 22-mers obtained sampling a set of 1M x 150bp reads (SRR493095) every 10 bases, and streaming the whole human genome hg19 against it to find all matching q-grams. Specifically, the graph shows the throughput of the following three stages:
  • ranking: the process of finding the range of hits matching each query q-gram in the q-gram index
  • locating: the process of enumerating all found hits as (read-id,text-diagonal) pairs
  • counting: the process of bucketing the found hits by diagonal (separately for each read) and counting the occurrences in each bucket

Q-Gram Indices

Q-gram indices are data-structures providing fast searching of exact or approximate q-grams (or k-mers), i.e. short strings of text containing q symbols. This module provides two such data-structures:
  • the Q-Group Index, replicating the data-structure described in:
    Massively parallel read mapping on GPUs with PEANUT
    Johannes Koester and Sven Rahmann
    http://arxiv.org/pdf/1403.1706v1.pdf
    this data-structure requires O(A^q) bits of storage in the alphabet-size A and the q-gram length q (to be precise, 2*A^q bits + (min(A^q,|T|) + |T|) words), and provides O(1) query time;

  • the compact Q-Gram Index, which can be built over a string T, with memory consumption and query time proportional to O(|T|) and O(log(unique(T))) respectively, where unique(T) is the number of unique q-grams in T. This is achieved by keeping a plain sorted list of the unique q-grams in T, together with an index of their occurrences in the original string T. This data-structure offers up to 5x higher construction speed and a potentially unbounded improvement in memory consumption compared to the Q-Group Index, though the query time is asymptotically higher.
Q-gram indices can be built both on strings and on string sets (in which case we call them set-indices). The difference relies on the format of the coordinates associated to their q-grams: for strings, the coordinates are simple linear indices, whereas for string-sets the coordinates are (string-id,string-position) index pairs.
The following code sample shows how to build a QGramIndex over a simple string:
...
// consider a DNA string in ASCII format
const char* a_string = "ACGTACGTACGTACGTACGTACGTACGTACGT";
const uint32 string_len = strlen( a_string );
// convert the string to a 2-bit DNA alphabet
nvbio::vector<host_tag,uint8> h_string( string_len );
string_to_dna( a_string, string_len, h_string.begin() );
// copy the string to the device
nvbio::vector<device_tag,uint8> d_string( h_string );
// build a q-gram index on the device
QGramIndexDevice qgram_index;
qgram_index.build(
20u, // q-group size
2u, // the alphabet size, in bits
string_len, // the length of the string we want to index
d_string.begin() ); // the string we want to index
The next example shows how to do the same with a string-set:
...
// consider a DNA string in ASCII format
const char* a_string = "ACGTACGTACGTACGTACGTACGTACGTTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTACGTACGTACGTACGTACGTACGTACGTACGT";
const uint32 string_len = strlen( a_string );
// convert the string to a 2-bit DNA alphabet
nvbio::vector<host_tag,uint8> h_string( string_len );
string_to_dna( a_string, string_len, h_string.begin() );
// copy the string to the device
const uint32 n_strings = 2u;
nvbio::vector<device_tag,uint8> d_string( h_string );
nvbio::vector<device_tag,uint32> d_string_offsets( n_strings+1 );
d_string_offsets[0] = 0; // offset to the first string
d_string_offsets[1] = 20; // offset to the second string
d_string_offsets[2] = 40; // end of the last string
typedef ConcatenatedStringSet<const uint8*,const uint32*> string_set_type;
const string_set_type string_set(
n_strings,
nvbio::plain_view( d_string ),
nvbio::plain_view( d_string_offsets ) );
// build a q-gram index on the device
QGramSetIndexDevice qgram_index;
const uint32 q = 20u;
qgram_index.build(
q, // q-group size
2u, // the alphabet size, in bits
string_set ); // the string-set we want to index
Note: sometimes you might not want to index all the q-grams in your string-set, but rather extract one every N bases, or perhaps use some custom logic of your own. Through Seeding Functors, NVBIO provides a mechanism to customize the seeding logic. For example, the following code will extract a seed every 10 bases:
qgram_index.build(
q, // q-group size
2u, // the alphabet size, in bits
string_set, // the string-set we want to index
uniform_seeds_functor( q, 10u ) ); // extract a q-gram every 10 bases

Q-Gram Index Queries

Once a q-gram index is built, it would be interesting to perform some queries on it. There's various ways to accomplish this, and here we'll show a couple. The simplest query one can do is locating for a given q-gram the range of indexed q-grams which match it exactly. This can be done as follows:
// consider a sample string - we'll want to find all matching locations between all
// q-grams in this string and all q-grams in the index
const char* a_query_string = "CGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTA"
const uint32 query_string_len = strlen( a_query_string );
// convert the string to a 2-bit DNA alphabet
nvbio::vector<host_tag,uint8> h_string( query_string_len );
string_to_dna( a_query_string, query_string_len, h_query_string.begin() );
// copy the string to the device
nvbio::vector<device_tag,uint8> d_query_string( h_query_string );
const uint32 q = 20u;
// extract a set of query q-gram indices, say one every 3 bases
const uint32 n_query_qgrams = enumerate_string_seeds( query_string_len, uniform_seed_functor( q, 3u ), d_query_indices );
// build the set of query q-grams: this can be done using the string_qgram_functor
// to extract them; note that we need at least 20 x 2 = 40 bits per q-gram, hence we
// store them in a uint64 vector
nvbio::vector<device_tag,uint64> d_query_qgrams( n_query_qgrams );
q, 2u, // q-gram length, alphabet size
query_string_len, nvbio::plain_view( d_query_string ), // input string
n_query_qgrams, // the number of q-grams
d_query_indices.begin(), // input q-gram coordinates
d_query_qgrams.begin() ); // output q-grams
// now sort the q-grams and their original indices, this improves coherence and throughput
thrust::sort_by_key(
d_query_qgrams.begin(),
d_query_qgrams.begin() + n_query_qgrams,
d_query_indices.begin() );
// find the above q-grams
nvbio::vector<device_tag,uint32> d_ranges( query_string_len - 20 );
// use the plain-view of the q-gram index itself as a search functor, that we "apply"
// to our query q-grams to obtain the corresponding match ranges
d_query_qgrams.begin(),
d_query_qgrams.begin() + n_query_qgrams,
d_ranges.begin(),
nvbio::plain_view( qgram_index ) );
This of course was just a toy example; in reality, you'll want to this kind of operations with much larger q-gram indices and much larger batches of queries.

Q-Gram Counting

The previous example was only showing how to get the ranges of matching q-grams inside an index: it didn't show how to get the actual list of hits. One way to go about it is to ask the q-gram index, which given an entry inside each non-empty range, can provide its location. This can be done using the qgram_locate_functor. However, if one is interested in getting the complete list of hits things are more difficult, as the process involves a variable rate data-expansion (as each range might expand to a variable number of hits).
The QGramFilter provides a convenient way to do this:
...
// suppose we have the previous vectors d_query_qgrams and d_query_indices
...
// find all hits using a q-gram filter
typedef QGramFilterDevice<QGramSetIndexDevice, const uint64*, const uint32*> qgram_filter_type;
typedef qgram_filter_type::hit_type hit_type;
typedef qgram_filter_type::diagonal_type diagonal_type;
qgram_filter_type qgram_filter;
const uint32 n_hits = qgram_filter.rank(
qgram_index,
n_query_qgrams,
nvbio::raw_pointer( d_query_qgrams ),
nvbio::raw_pointer( d_query_indices ) );
//
// loop through large batches of hits and locate them
//
const uint32 batch_size = 16*1024*1024; // 16M hits per batch
// reserve enough storage for each batch
for (uint32 hits_begin = 0; hits_begin < n_hits; hits_begin += batch_size)
{
const uint32 hits_end = nvbio::min( hits_begin + batch_size, n_hits );
// locate all hits in the range [hits_begin, hits_end)
qgram_filter.locate(
hits_begin,
hits_end,
hits.begin() );
}
Note: the hit coordinates are different according to the type of q-gram index;
  • for simple string indices, the coordinates are (index-pos,query-idx) uint2 pairs
  • for string-set indices, the coordinates are (index-id,index-pos,query-idx) tuples represented as a uint4, where index-id and index-pos are the index into the string-set used to built the q-gram index.
Finally, the generated hits can be sorted and merged by diagonal bucket, effectively performing so called q-gram counting.
nvbio::vector<device_tag,diagonal_type> merged_hits( batch_size );
nvbio::vector<device_tag,uint16> merged_counts( batch_size );
for (uint32 hits_begin = 0; hits_begin < n_hits; hits_begin += batch_size)
{
const uint32 hits_end = nvbio::min( hits_begin + batch_size, n_hits );
// locate all hits in the range [hits_begin, hits_end)
qgram_filter.locate(
hits_begin,
hits_end,
hits.begin() );
// merge all hits within the same diagonal interval
const uint32 n_merged = qgram_filter.merge(
16u, // merging interval
hits_end - hits_begin,
hits.begin(),
merged_hits.begin(),
merged_counts.begin() );
}

Technical Overview

A complete list of the classes and functions in this module is given in the Q-Gram Module documentation.