NVBIO
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Protein Search using Wavelet Trees and an FM-index
In this example, we want to show NVBIO's support for larger alphabets, particularly showing the composability of compact string representations like Wavelet Trees, and self-compressed indices like the FM-index. We are hence going to build a small app performing pattern matching on a protein string.

Building the BWT

If we want to build an FM-index, the first thing is computing the BWT of the database string we want to index. In this example, we'll use a very naive string containing all protein characters in a row - but in a real app this string might contain millions of entries, hence we'll do this using NVBIO's very fast parallel construction algorithm running on a CUDA device.
int main(int argc, char* argv[])
{
log_info(stderr, "waveletfm... started\n");
// this string will be our protein database we want to search in - just a short string
// with all protein characters in sequence
const char* proteins = "ACDEFGHIKLMNOPQRSTVWYBZX";
const uint32 text_len = 24;
// print the text
log_info(stderr, " text: %s\n", proteins);
// allocate a host packed vector
PackedVector<host_tag,8u,true> h_text( text_len );
// pack the string
from_string<PROTEIN>( proteins, proteins + text_len, h_text.begin() );
// copy it to the device
PackedVector<device_tag,8u,true> d_text( h_text );
// allocate a vector for the BWT
PackedVector<device_tag,8u,true> d_bwt( text_len + 1 );
BWTParams bwt_params;
// build the BWT
const uint32 primary = cuda::bwt( text_len, d_text.begin(), d_bwt.begin(), &bwt_params );
// print the BWT
{
char proteins_bwt[ 25 ];
to_string<PROTEIN>( d_bwt.begin(), d_bwt.begin() + text_len, proteins_bwt );
proteins_bwt[24] = '\0';
log_info(stderr, " bwt: %s (primary %u)\n", proteins_bwt, primary);
}
...
Done: if everything works correctly, this should print the string "bwt: XACDEFGHIKLMNOPQRSTVWYBZ (primary 1)".

It's a large alphabet, let's use a Wavelet Tree

Compared to DNA, the protein alphabet is a largish one, with 24 letters. If we were to use a plain BWT encoding and a standard rank dictionary based on sparse occurrence tables to build an FM-index, we would need storage proportional to the string length times the size of the alphabet. The thing is made even worse by the fact that, for the purpose of efficient string packing, NVBIO treats protein characters as 8-bit characters, as 5-bit packing would need cumbersome, slow bit-manipulation logic (because individual characters would stride word boundaries). This means an FM-index would require 128 times more storage than for the 2-bit DNA alphabet!
But no worries: we can encode our BWT using Wavelet Trees, a wonderful succint data structure designed to encode large alphabet strings in small space (proportional to the string length times the number of bits required to represent each symbol, i.e. the logarithm of the alphabet size) while allowing O(1) ranking. Let's build a wavelet tree, then...
...
// define our wavelet tree storage type and its plain view
typedef WaveletTreeStorage<device_tag> wavelet_tree_type;
typedef WaveletTreeStorage<device_tag>::const_plain_view_type wavelet_tree_view_type;
// build a wavelet tree
wavelet_tree_type wavelet_bwt;
// setup the wavelet tree
setup( text_len, d_bwt.begin(), wavelet_bwt );
...
Et voila. At this point, we could even throw away the original bwt (the vector d_bwt), as the wavelet bwt encodes all the information we need. Notice that, behind the curtains, the wavelet tree construction is parallelized, in this case on the device.

Building and querying the FM-index

We now need to setup an FM-index on top of the wavelet bwt. Remember the basic fm_index class has three template parameters, a TRankDictionary type, a TSuffixArray type, and a TL2 type. These represent the kind of rank dictionary used to encode the bwt (in our case the wavelet tree), the sampled suffix array (if needed), and an iterator to the L2 table (a table containing the sum of all character frequencies). In our case, we want to specialize this cass to use:
  • the WaveletTreeStorage::const_plain_view_type class
  • a null_type suffix array (as the sampled suffix array is only needed for locating, which we won't do here - though it might be needed in a real app)
  • an L2 iterator to a device nvbio::vector of integers, i.e. nvbio::vector<device_tag,uint32>::const_iterator
Notice that the default value for the TL2 iterator is null_type, in which case a raw pointer will be used. However, in this example we want to show that all of NVBIO's data structures, even when instanced on the device, can be accessed and used for computing from the host just as well - albeit potentially paying a very long latency tax when unified memory is not present. This can be achieved using proper iterators, which know the memory space they are pointing to, and issue cuda mem-copies when dereferenced.
...
typedef fm_index<wavelet_tree_view_type, null_type, l2_iterator> fm_index_type;
// take the const view of the wavelet tree
const wavelet_tree_view_type wavelet_bwt_view = plain_view( (const wavelet_tree_type&)wavelet_bwt );
// build the L2 vector
// note we perform this on the host, even though the wavelet tree is on the device -
// gonna be slow, but it's not much stuff, and this is just an example anyway...
// and we just want to show that NVBIO is designed to make everything work!
L2[0] = 0;
for (uint32 i = 1; i <= 256; ++i)
L2[i] = L2[i-1] + rank( wavelet_bwt_view, text_len, i-1u );
// build the FM-index
const fm_index_type fmi(
text_len,
primary,
L2.begin(),
wavelet_bwt_view,
null_type() );
// do some string matching using our newly built FM-index - once again
// we are doing it on the host, though all data is on the device: the entire
// loop would be better moved to the device in a real app.
log_info(stderr, " string matching:\n");
for (uint32 i = 0; i < text_len; ++i)
{
// match the i-th suffix of the text
const uint32 pattern_len = text_len - i;
// compute the SA range containing the occurrences of the pattern we are after
const uint2 range = match( fmi, d_text.begin() + i, pattern_len );
// print the number of occurrences of our pattern, equal to the SA range size
log_info(stderr, " rank(%s): %u\n", proteins + i, 1u + range.y - range.x);
}
log_info(stderr, "waveletfm... done\n");
return 0;
}
Compile, and run. With a bit of luck, you should see the following output:
*  info    : waveletfm... started
*  info    :   text: ACDEFGHIKLMNOPQRSTVWYBZX
*  info    :   bwt: XACDEFGHIKLMNOPQRSTVWYBZ (primary 1)
*  info    :   string matching:
*  info    :     rank(ACDEFGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(CDEFGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(DEFGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(EFGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(FGHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(GHIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(HIKLMNOPQRSTVWYBZX): 1
*  info    :     rank(IKLMNOPQRSTVWYBZX): 1
*  info    :     rank(KLMNOPQRSTVWYBZX): 1
*  info    :     rank(LMNOPQRSTVWYBZX): 1
*  info    :     rank(MNOPQRSTVWYBZX): 1
*  info    :     rank(NOPQRSTVWYBZX): 1
*  info    :     rank(OPQRSTVWYBZX): 1
*  info    :     rank(PQRSTVWYBZX): 1
*  info    :     rank(QRSTVWYBZX): 1
*  info    :     rank(RSTVWYBZX): 1
*  info    :     rank(STVWYBZX): 1
*  info    :     rank(TVWYBZX): 1
*  info    :     rank(VWYBZX): 1
*  info    :     rank(WYBZX): 1
*  info    :     rank(YBZX): 1
*  info    :     rank(BZX): 1
*  info    :     rank(ZX): 1
*  info    :     rank(X): 1
*  info    : waveletfm... done
* 
Showing that, indeed, there's one match for each suffix of the original string.

Next: Smith-Waterman Alignment of Protein Strings Top: NVBIO