NVBIO
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Strings Module
This module provides generic constructs to work with strings and string-sets. A string-set is a collection of strings. As there's many ways to encode a string, there's even more ways to represent a string set. For example, one might want to store the strings into a single concatenated array, and store offsets to the beginning of each one. Or he might want to build a small string set out of a sparse subset of a big string (for example, to carve a few isolated regions out of a genome). And the support string might be either encoded with ascii characters, or packed using a few bits per symbol.

Alphabets

This module provides a set of operators to work with various alphabets:
The list of all available operators is available in the Alphabet Module.

String-Set Interfaces

String sets are generic, interchangeable containers that expose the same interface:
interface StringSet
{
// specify the type of the strings
typedef ... string_type;
// return the size of the set
uint32 size() const;
// return the i-th string in the set
string_type operator[] (const uint32) [const];
}
This module provides a few generic string set adaptors that can be "configured" by means of the underlying template iterators. The philosophy behind all of these containers is that they are just shallow representations, holding no storage. Hence they can be used both in host and device code.
Furthermore, the module provides efficient generic copy() (resp. cuda::copy()) implementations to copy a given host (resp. device) string set from a given layout into another with a different layout.

Seeding

Many bioinformatics applications need to extract short seeds out of strings and string-sets: such seeds are nothing but infixes. This module provides a few convenience functions to enumerate all seeds resulting by applying a Seeding Functor to a string or string-set. Internally, these simple functions employ massively parallel algorithms, which run either on the host or on the bound cuda device, depending on whether the output is a host or device vector.
The following is a sample showing how to extract uniformly sampled seeds from a host or device string-set, and construct an InfixSet to represent them:
// extract a set of uniformly spaced seeds from a string-set and return it as an InfixSet
//
template <typename system_tag, typename string_set_type>
InfixSet<string_set_type, const string_set_infix_coord_type*>
extract_seeds(
const string_set_type string_set,
const uint32 seed_len,
const uint32 seed_interval,
{
// enumerate all seeds
string_set,
uniform_seeds_functor<>( seed_len, seed_interval ),
seed_coords );
// and build the output infix-set
return InfixSet<string_set_type, const string_set_infix_coord_type*>(
n_seeds,
string_set,
nvbio::plain_view( seed_coords ) );
}
See examples/seeding/seeding.cu for more details.

Wavelet Trees

A Wavelet Tree is a data structure that can be used to encode a string T of n symbols from an alphabet of s characters in space O(n log(s)), allowing both symbol access and ranking in O(log(s)) time, i.e:
  • each character T[i] can be recovered in O(log(s)) time
  • the number of occurrences of a character c in the substring T[0,i] can be counted in O(log(s)) time
In other words, a Wavelet Tree is both an alternative string representation (often more amenable to compression), and a storage-efficient rank dictionary. For the sake of comparison, notice that the rank_dictionary class, which is based on a standard sampled occurrence table built directly on top of the original string T, needs O(s) storage - exponentially more in the number of bits per symbol b = log(s).
NVBIO provides two classes:
  • WaveletTreeStorage : a proper wavelet tree implementation, which can instanced using either host or device memory
  • WaveletTree : a shallow, storage-less representation, which can be instanced on arbitrary user supplied iterators to the actual data containers
as well as parallel host and device construction and lookup functions. See the Wavelet Trees module for further documentation.
The following is a sample showing how to build a WaveletTree and query it:
void test_wavelet_tree()
{
const char* text = "This is a test String";
const uint32 text_len = uint32( nvbio::length( text ) );
// copy the text string onto the device, reinterpreting as uint8 as the WaveletTree
// expects symbols to be encoded using unsigned integral types
nvbio::vector<device_tag,uint8> d_text( text_len );
thrust::copy( (const uint8*)text, (const uint8*)text + text_len, d_text );
// instantiate a wavelet tree
WaveletTreeStorage<device_tag> wavelet_tree;
// setup the wavelet tree
setup( text_len, d_text.begin(), wavelet_tree );
// extract the text in parallel using the WaveletTree's unary functor operator()
// to transform the sequence of indices [0,text_len) into the sequence of
// corresponding characters
nvbio::transform<device_tag>(
text_len, // number of characters to extract
thrust::make_counting_iterator<uint32>(0), // list of character indices
d_text.begin(), // output sequence
plain_view( wavelet_tree ) ); // plain_view of the WaveletTreeStorage class
// copy the results back to the host
nvbio::vector<host_tag,uint8> h_extracted_text( d_text );
// check we extracted the right string...
printf("extracted \"%s\"", (const char*)raw_pointer( h_extracted_text ));
// check that in position 0 there should be exactly one 'T'
printf("rank(0,T) = %u\n", rank( plain_view( wavelet_tree ), 0, (uint8)'T' ));
// check that the range [0,6] contains exactly two 's'
printf("rank(6,s) = %u\n", rank( plain_view( wavelet_tree ), 6, (uint8)'s' ));
}

Technical Overview

For a detailed description of all classes and functions see the Strings Module documentation.