: This module provides generic constructs to work with strings and string-sets. A string-set is a collection of strings. As there's many ways to encode a string, there's even more ways to represent a string set. For example, one might want to store the strings into a single concatenated array, and store offsets to the beginning of each one. Or he might want to build a small string set out of a sparse subset of a big string (for example, to carve a few isolated regions out of a genome). And the support string might be either encoded with ascii characters, or packed using a few bits per symbol.

Alphabets

: This module provides a set of operators to work with various alphabets:

Alphabet = { DNA, DNA_N, DNA_IUPAC, RNA, RNA_N, PROTEIN }
to_char
from_char
to_string
from_string

: The list of all available operators is available in the Alphabet Module.

String-Set Interfaces

: String sets are generic, interchangeable containers that expose the same interface:
interface StringSet

{

// specify the type of the strings

typedef ... string_type;

// return the size of the set

uint32 size() const;

// return the i-th string in the set

string_type operator[] (const uint32) [const];

}

: This module provides a few generic string set adaptors that can be "configured" by means of the underlying template iterators. The philosophy behind all of these containers is that they are just shallow representations, holding no storage. Hence they can be used both in host and device code.

: Furthermore, the module provides efficient generic copy() (resp. cuda::copy()) implementations to copy a given host (resp. device) string set from a given layout into another with a different layout.

Seeding

: Many bioinformatics applications need to extract short seeds out of strings and string-sets: such seeds are nothing but infixes. This module provides a few convenience functions to enumerate all seeds resulting by applying a Seeding Functor to a string or string-set. Internally, these simple functions employ massively parallel algorithms, which run either on the host or on the bound cuda device, depending on whether the output is a host or device vector.

: The following is a sample showing how to extract uniformly sampled seeds from a host or device string-set, and construct an InfixSet to represent them:
// extract a set of uniformly spaced seeds from a string-set and return it as an InfixSet

//

template <typename system_tag, typename string_set_type>

InfixSet<string_set_type, const string_set_infix_coord_type*>

extract_seeds(

const string_set_type string_set,

const uint32 seed_len,

const uint32 seed_interval,

nvbio::vector<system_tag,string_set_infix_coord_type>& seed_coords)

{

// enumerate all seeds

const uint32 n_seeds = enumerate_string_set_seeds(

string_set,

uniform_seeds_functor<>( seed_len, seed_interval ),

seed_coords );

// and build the output infix-set

return InfixSet<string_set_type, const string_set_infix_coord_type*>(

n_seeds,

string_set,

nvbio::plain_view( seed_coords ) );

}

See examples/seeding/seeding.cu for more details.

Wavelet Trees

: A Wavelet Tree is a data structure that can be used to encode a string T of n symbols from an alphabet of s characters in space O(n log(s)), allowing both symbol access and ranking in O(log(s)) time, i.e:

each character T[i] can be recovered in O(log(s)) time
the number of occurrences of a character c in the substring T[0,i] can be counted in O(log(s)) time

: In other words, a Wavelet Tree is both an alternative string representation (often more amenable to compression), and a storage-efficient rank dictionary. For the sake of comparison, notice that the rank_dictionary class, which is based on a standard sampled occurrence table built directly on top of the original string T, needs O(s) storage - exponentially more in the number of bits per symbol b = log(s).

: NVBIO provides two classes:

WaveletTreeStorage : a proper wavelet tree implementation, which can instanced using either host or device memory
WaveletTree : a shallow, storage-less representation, which can be instanced on arbitrary user supplied iterators to the actual data containers

: as well as parallel host and device construction and lookup functions. See the Wavelet Trees module for further documentation.

: The following is a sample showing how to build a WaveletTree and query it:
void test_wavelet_tree()

{

const char* text = "This is a test String";

const uint32 text_len = uint32( nvbio::length( text ) );

// copy the text string onto the device, reinterpreting as uint8 as the WaveletTree

// expects symbols to be encoded using unsigned integral types

nvbio::vector<device_tag,uint8> d_text( text_len );

thrust::copy( (const uint8*)text, (const uint8*)text + text_len, d_text );

// instantiate a wavelet tree

WaveletTreeStorage<device_tag> wavelet_tree;

// setup the wavelet tree

setup( text_len, d_text.begin(), wavelet_tree );

// extract the text in parallel using the WaveletTree's unary functor operator()

// to transform the sequence of indices [0,text_len) into the sequence of

// corresponding characters

nvbio::transform<device_tag>(

text_len, // number of characters to extract

thrust::make_counting_iterator<uint32>(0), // list of character indices

d_text.begin(), // output sequence

plain_view( wavelet_tree ) ); // plain_view of the WaveletTreeStorage class

// copy the results back to the host

nvbio::vector<host_tag,uint8> h_extracted_text( d_text );

// check we extracted the right string...

printf("extracted \"%s\"", (const char*)raw_pointer( h_extracted_text ));

// check that in position 0 there should be exactly one 'T'

printf("rank(0,T) = %u\n", rank( plain_view( wavelet_tree ), 0, (uint8)'T' ));

// check that the range [0,6] contains exactly two 's'

printf("rank(6,s) = %u\n", rank( plain_view( wavelet_tree ), 6, (uint8)'s' ));

}

Technical Overview

: For a detailed description of all classes and functions see the Strings Module documentation.