NVBIO
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Hello DNA! - Part 2
This page will teach you to familiarize with NVBIO's notion of string sets and with some basic parallel constructs. Assume you have the string from the previous example, but now want to extract all of its 3-mers, for example to use them in a seed & extend aligner. The collection of all 3-mers can be thought of as a set of substrings, or infixes of your original string; NVBIO represents such an object as an InfixSet. Moreover, it provides a function to extract the coordinates of such seeds in parallel.
using namespace nvbio;
void main()
{
// our hello world ASCII string
const char dna_string[] = "ACGTTGCA";
const uint32 len = uint32( strlen( dna_string ) );
// our DNA alphabet size, in bits
// instantiate a packed host vector
// and fill it in with the contents of our original string, converted
// to a 2-bit DNA alphabet (i.e. A = 0, C = 1, G = 2, T = 3)
nvbio::assign( len, nvbio::from_string<DNA>( dna_string ), h_dna.begin() );
// copy the packed vector to the device
// prepare a vector to store the coordinates of the resulting infixes
const uint32 n_seeds = enumerate_string_seeds(
len, // the string length
uniform_seeds_functor<>( 3u, 1u ), // a seeding functor, specifying to extract
// all 3-mers offset by 1 base each
d_seed_coords ); // the output infix coordinates
// define an infix-set to represent the resulting infixes
infix_set_type d_seeds(
n_seeds, // the number of infixes in the set
d_dna.begin(), // the underlying string
d_seed_coords.begin() ); // the iterator to the infix coordinates
...
At this point it would be nice to do something with the string-set we just created. For the sake of the example, we will just print all its strings. Technically, we could just loop through the strings in the set, and print them on the host. However, suppose we want to do it in the native space of the set, i.e. on the device, and suppose we want to do it in parallel: how can we do this? Well, we will do it here using a parallel primitive, and particularly the for_each() construct:
...
nvbio::for_each<device_tag>(
n_seeds, // the loop's sequence size
seeds.begin(), // the loop's begin iterator
print_strings<infix_set_type>() ); // the loop's body
}
for_each() takes a functor encoding the body of our loop, and applies it to all elements of a sequence. In this case, the sequence is the string-set, and the body of our loop will be the following:
template <typename string_set_type>
struct print_strings
{
typedef typename string_set_type::string_type string_type;
void operator() (const string_type string) const
{
// build an ASCII string
char seed[4];
nvbio::to_string<DNA>(
string, // the input string
nvbio::length( string ), // its length
seed ); // the output ASCII string
// and print it...
printf("%s\n", seed);
}
};
Notice how we marked our print_strings methods with the NVBIO_HOST_DEVICE qualifier: this tells the CUDA compiler that these methods can be compiled both for the host and the device.
It is also worth mentioning that if we kept our vectors on the host (i.e. replacing the device_tag specifiers with host_tag), everything would have still run in parallel, except it would have run on the host.

Next: All-Mapping using an FM-index Top: NVBIO