: In this example, we want to show NVBIO's support for larger alphabets, particularly showing the composability of compact string representations like Wavelet Trees, and self-compressed indices like the FM-index. We are hence going to build a small app performing pattern matching on a protein string.

Building the BWT

: If we want to build an FM-index, the first thing is computing the BWT of the database string we want to index. In this example, we'll use a very naive string containing all protein characters in a row - but in a real app this string might contain millions of entries, hence we'll do this using NVBIO's very fast parallel construction algorithm running on a CUDA device.

: int main(int argc, char* argv[])

{

log_info(stderr, "waveletfm... started\n");

// this string will be our protein database we want to search in - just a short string

// with all protein characters in sequence

const char* proteins = "ACDEFGHIKLMNOPQRSTVWYBZX";

const uint32 text_len = 24;

// print the text

log_info(stderr, " text: %s\n", proteins);

// allocate a host packed vector

PackedVector<host_tag,8u,true> h_text( text_len );

// pack the string

from_string<PROTEIN>( proteins, proteins + text_len, h_text.begin() );

// copy it to the device

PackedVector<device_tag,8u,true> d_text( h_text );

// allocate a vector for the BWT

PackedVector<device_tag,8u,true> d_bwt( text_len + 1 );

BWTParams bwt_params;

// build the BWT

const uint32 primary = cuda::bwt( text_len, d_text.begin(), d_bwt.begin(), &bwt_params );

// print the BWT

{

char proteins_bwt[ 25 ];

to_string<PROTEIN>( d_bwt.begin(), d_bwt.begin() + text_len, proteins_bwt );

proteins_bwt[24] = '\0';

log_info(stderr, " bwt: %s (primary %u)\n", proteins_bwt, primary);

}

...

: Done: if everything works correctly, this should print the string "bwt: XACDEFGHIKLMNOPQRSTVWYBZ (primary 1)".

It's a large alphabet, let's use a Wavelet Tree

: Compared to DNA, the protein alphabet is a largish one, with 24 letters. If we were to use a plain BWT encoding and a standard rank dictionary based on sparse occurrence tables to build an FM-index, we would need storage proportional to the string length times the size of the alphabet. The thing is made even worse by the fact that, for the purpose of efficient string packing, NVBIO treats protein characters as 8-bit characters, as 5-bit packing would need cumbersome, slow bit-manipulation logic (because individual characters would stride word boundaries). This means an FM-index would require 128 times more storage than for the 2-bit DNA alphabet!

: But no worries: we can encode our BWT using Wavelet Trees, a wonderful succint data structure designed to encode large alphabet strings in small space (proportional to the string length times the number of bits required to represent each symbol, i.e. the logarithm of the alphabet size) while allowing O(1) ranking. Let's build a wavelet tree, then...

: ...

// define our wavelet tree storage type and its plain view

typedef WaveletTreeStorage<device_tag> wavelet_tree_type;

typedef WaveletTreeStorage<device_tag>::const_plain_view_type wavelet_tree_view_type;

// build a wavelet tree

wavelet_tree_type wavelet_bwt;

// setup the wavelet tree

setup( text_len, d_bwt.begin(), wavelet_bwt );

...

: Et voila. At this point, we could even throw away the original bwt (the vector d_bwt), as the wavelet bwt encodes all the information we need. Notice that, behind the curtains, the wavelet tree construction is parallelized, in this case on the device.

Building and querying the FM-index

: We now need to setup an FM-index on top of the wavelet bwt. Remember the basic fm_index class has three template parameters, a TRankDictionary type, a TSuffixArray type, and a TL2 type. These represent the kind of rank dictionary used to encode the bwt (in our case the wavelet tree), the sampled suffix array (if needed), and an iterator to the L2 table (a table containing the sum of all character frequencies). In our case, we want to specialize this cass to use:

the WaveletTreeStorage::const_plain_view_type class
a null_type suffix array (as the sampled suffix array is only needed for locating, which we won't do here - though it might be needed in a real app)
an L2 iterator to a device nvbio::vector of integers, i.e. nvbio::vector<device_tag,uint32>::const_iterator

: Notice that the default value for the TL2 iterator is null_type, in which case a raw pointer will be used. However, in this example we want to show that all of NVBIO's data structures, even when instanced on the device, can be accessed and used for computing from the host just as well - albeit potentially paying a very long latency tax when unified memory is not present. This can be achieved using proper iterators, which know the memory space they are pointing to, and issue cuda mem-copies when dereferenced.

: ...

typedef nvbio::vector<device_tag,uint32>::const_iterator l2_iterator;

typedef fm_index<wavelet_tree_view_type, null_type, l2_iterator> fm_index_type;

// take the const view of the wavelet tree

const wavelet_tree_view_type wavelet_bwt_view = plain_view( (const wavelet_tree_type&)wavelet_bwt );

// build the L2 vector

nvbio::vector<device_tag,uint32> L2(257);

// note we perform this on the host, even though the wavelet tree is on the device -

// gonna be slow, but it's not much stuff, and this is just an example anyway...

// and we just want to show that NVBIO is designed to make everything work!

L2[0] = 0;

for (uint32 i = 1; i <= 256; ++i)

L2[i] = L2[i-1] + rank( wavelet_bwt_view, text_len, i-1u );

// build the FM-index

const fm_index_type fmi(

text_len,

primary,

L2.begin(),

wavelet_bwt_view,

null_type() );

// do some string matching using our newly built FM-index - once again

// we are doing it on the host, though all data is on the device: the entire

// loop would be better moved to the device in a real app.

log_info(stderr, " string matching:\n");

for (uint32 i = 0; i < text_len; ++i)

{

// match the i-th suffix of the text

const uint32 pattern_len = text_len - i;

// compute the SA range containing the occurrences of the pattern we are after

const uint2 range = match( fmi, d_text.begin() + i, pattern_len );

// print the number of occurrences of our pattern, equal to the SA range size

log_info(stderr, " rank(%s): %u\n", proteins + i, 1u + range.y - range.x);

}

log_info(stderr, "waveletfm... done\n");

return 0;

}

Compile, and run. With a bit of luck, you should see the following output:

*  info    : waveletfm... started
*  info    :   text: ACDEFGHIKLMNOPQRSTVWYBZX
*  info    :   bwt: XACDEFGHIKLMNOPQRSTVWYBZ (primary 1)
*  info    :   string matching:
*  info    :     rank(ACDEFGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(CDEFGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(DEFGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(EFGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(FGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(GHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(HIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(IKLMNOPQRSTVWYBZX): 1
*  info    :     rank(KLMNOPQRSTVWYBZX): 1
*  info    :     rank(LMNOPQRSTVWYBZX): 1
*  info    :     rank(MNOPQRSTVWYBZX): 1
*  info    :     rank(NOPQRSTVWYBZX): 1
*  info    :     rank(OPQRSTVWYBZX): 1
*  info    :     rank(PQRSTVWYBZX): 1
*  info    :     rank(QRSTVWYBZX): 1
*  info    :     rank(RSTVWYBZX): 1
*  info    :     rank(STVWYBZX): 1
*  info    :     rank(TVWYBZX): 1
*  info    :     rank(VWYBZX): 1
*  info    :     rank(WYBZX): 1
*  info    :     rank(YBZX): 1
*  info    :     rank(BZX): 1
*  info    :     rank(ZX): 1
*  info    :     rank(X): 1
*  info    : waveletfm... done
*

: Showing that, indeed, there's one match for each suffix of the original string.

Next: Smith-Waterman Alignment of Protein Strings Top: NVBIO