NVBIO
All-Mapping using an FM-index
NVBIO tries to make it easy to write modern bioinformatics applications exploiting parallel architectures. In this page we'll see how to apply it to build a prototype FM-index based all-mapper.

Input

The first step for a reference-based aligner is loading a reference index, and streaming a set of reads. Let's start by loading the reference sequence and its FM-index, using the corresponding Sequence and FM-Index I/O classes.
#include <stdio.h>
#include <stdlib.h>
#include "util.h"
int main(int argc, const char** argv)
{
// perform some basic option parsing
Params params( argc-2, argv );
const char* reads = argv[argc-1];
const char* index = argv[argc-2];
// load a reference sequence in RAM
io::SequenceDataHost h_ref;
if (!io::load_sequence_file( DNA, &h_ref, index ))
{
log_error(stderr, " failed loading reference \"%s\"\n", index);
return 1u;
}
// load an FM-index in RAM
io::FMIndexDataHost h_fmi;
if (!h_fmi.load( index, io::FMIndexData::FORWARD | io::FMIndexData::SA ))
{
log_error(stderr, " failed loading index \"%s\"\n", index);
return 1u;
}
...
Besides some simple option parsing, we now have an application that loads the reference genome and its FM-index from disk into system memory, including the forward BWT and the Sampled Suffix Array. If we want our mapper to run on the GPU, we can copy the FM-index to device memory with a single line, again specifying which of its components to load. Rather than keeping it on the stack, we'll store it in a Pipeline object, which we'll use to hold all the state of our mapper.
struct Pipeline
{
Params params; // the mapping params
SharedPointer<io::SequenceDataDevice> ref_data; // the device reference
SharedPointer<io::FMIndexDataDevice> fm_data; // the device FM-index
};
...
Pipeline pipeline;
// save the program options
pipeline.params = params;
// build the device reference
pipeline.ref_data = new io::SequenceDataDevice( h_ref );
// build the device index
pipeline.fm_data = new io::FMIndexDataDevice( h_fmi, io::FMIndexData::FORWARD |
io::FMIndexData::SA );
...
The second step is to start streaming reads in; NVBIO's philosophy here is to use a batched approach, where the size of the batch roughly determines the amount of available parallelism in the rest of the pipeline. In order to amortize fixed costs (e.g. grid launch / synchronization overhead) it's good to allow for as much parallelism as your memory constraints permit - but as a rough guideline, lightweight kernels should have at least ~128K items to process, possibly more (for example, sorting performance saturates at 16-32M keys).
So if some stages of the alignment pipeline will parallelize across reads, it would be good to have at least 1M or so to process. Obviously, one million reads could require plenty of memory if the reads are long, so we might want to cap the maximum number of base pairs as well:
...
const uint32 batch_reads = 1*1024*1024;
const uint32 batch_bps = 100*1024*1024;
// open a read file, storing for each read both its forward and its reverse-complemented representations
SharedPointer<io::SequenceDataStream> read_data_file(
io::open_sequence_file(
reads,
io::Phred33,
2*params.max_reads,
uint32(-1),
io::SequenceEncoding( io::FORWARD | io::REVERSE_COMPLEMENT ) ) );
// check whether the file opened correctly
if (read_data_file == NULL || read_data_file->is_ok() == false)
{
log_error(stderr, " failed opening file \"%s\"\n", reads);
return 1u;
}
io::SequenceDataHost h_read_data;
// start looping to consume the input reads
while (1)
{
// load a batch of reads
if (io::next( DNA_N, &h_read_data, read_data_file.get(), batch_reads, batch_bps ) == 0)
break;
log_info(stderr, " loading reads... started\n");
// copy it to the device
const io::SequenceDataDevice d_read_data( h_read_data );
// remember we generated two strands per input read...
const uint32 n_reads = d_read_data.size() / 2;
log_info(stderr, " loading reads... done\n");
log_info(stderr, " %u reads\n", n_reads);
// do the real mapping work
map( pipeline, d_read_data );
}
}
Now we are ready to start the real mapping work. In order to do it, we'll need to store another data structure in our pipeline, namely the FMIndexFilter. This data structure is just a context for performing FM-index based filtering, i.e. the process of finding potential matches of a set of strings using an FM-index.
struct Pipeline
{
typedef io::FMIndexDataDevice::fm_index_type fm_index_type; // the fm-index view type
typedef FMIndexFilterDevice<fm_index_type> fm_filter_type; // the fm-index filter type
Params params; // the mapping params
SharedPointer<io::SequenceDataDevice> ref_data; // the device reference
SharedPointer<io::FMIndexDataDevice> fm_data; // the device FM-index
fm_filter_type fm_filter; // the FM-index filter
};
Here you go, ready to write the bulk of the aligner.

Mapping

We'll write a simple seed & extend aligner, so the first thing we'll want to do is split our reads into equal segments. As shown in the previous example, the collection of these seeds can be seen as an InfixSet. Let's start with some boiler-plate code to make sure we have everything we need:
void map(Pipeline& pipeline, const io::SequenceDataAccess<DNA_N> reads)
{
// the genome can be accessed as a plain string, of this type
typedef io::SequenceDataAccess<DNA>::sequence_stream_type genome_string;
// the reads can be accessed as a string-set, of this type
typedef io::SequenceDataAccess<DNA_N>::sequence_string_set_type read_string_set_type;
// the seeds will be infixes of a string-set, with this coordinate type
typedef string_set_infix_coord_type infix_coord_type;
// and we'll store them in a device vector
typedef nvbio::vector<device_tag,infix_coord_type> infix_vector_type;
// finally, this will be the type of their infix-set
typedef InfixSet<read_string_set_type, const string_set_infix_coord_type*> seed_string_set_type;
// fetch the program options
const Params& params = pipeline.params;
// fetch the genome string
const io::SequenceDataAccess<DNA> genome_access( *pipeline.ref_data );
const uint32 genome_len = genome_access.bps();
const genome_string genome( genome_access.sequence_stream() );
// fetch an fm-index view
const Pipeline::fm_index_type fm_index = pipeline.fm_data->index();
// fetch the fm-index filter
Pipeline::fm_filter_type& fm_filter = pipeline.fm_filter;
// prepare a vector to store the seed coordinates
infix_vector_type seed_coords;
// extract the seeds
const read_string_set_type read_string_set = reads.sequence_string_set();
const seed_string_set_type seed_string_set = extract_seeds(
read_string_set,
params.seed_len,
params.seed_intv,
seed_coords );
...
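For reference, extract_seeds() is the same helper used in the previous example: it enumerates uniformly spaced seeds over the read string-set and wraps them in an InfixSet view. A minimal sketch might look as follows, assuming the enumerate_string_set_seeds() and uniform_seeds_functor<> seeding helpers are available:
// extract a set of uniformly spaced seeds from a string-set and return them as an InfixSet
// (a sketch: enumerate_string_set_seeds() and uniform_seeds_functor<> are assumed to be
//  provided by NVBIO's seeding utilities)
template <typename system_tag, typename string_set_type>
InfixSet<string_set_type, const string_set_infix_coord_type*>
extract_seeds(
const string_set_type string_set, // the string-set to seed
const uint32 seed_len, // the seed length
const uint32 seed_interval, // the spacing between consecutive seeds
nvbio::vector<system_tag,string_set_infix_coord_type>& seed_coords) // the output seed coordinates
{
// enumerate all seeds, storing their coordinates in seed_coords
const uint32 n_seeds = enumerate_string_set_seeds(
string_set,
uniform_seeds_functor<>( seed_len, seed_interval ),
seed_coords );
// and wrap them in an InfixSet view bound to the string-set
return InfixSet<string_set_type, const string_set_infix_coord_type*>(
n_seeds,
string_set,
nvbio::plain_view( seed_coords ) );
}
The seed_coords vector it fills is exactly what we'll use later on, in hit_to_diagonal(), to recover the read id and seed offset of each hit.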
The next step is mapping the seeds using the FM-index filter; we start by ranking their occurrences:
...
// first step: rank the query seeds
const uint64 n_hits = fm_filter.rank( fm_index, seed_string_set );
...
Now, each seed might have mapped to a whole range of suffixes in the Suffix Array of the genome, and all those locations are potential candidates for a valid alignment. We want to visit them all, which means we need to locate their coordinates.
Notice that this process might involve a severe data expansion, as some seeds might map to millions of genome locations: hence, we cannot simply locate all n_hits in one go. The FMIndexFilter interface lets us do it in batches, while internally doing the delicate job of parallelizing the work at the finest possible grain and with an evenly distributed load: i.e. spreading the variably sized suffix array ranges across multiple threads, so that each thread gets a single hit to locate. In practice you can ignore all of this, as NVBIO does it for you; the only thing you need to remember is to process your hits in large batches:
...
typedef uint2 hit_type; // each hit will be a (genome-pos,seed-id) coordinate pair
// process the hits in large batches, e.g. 16M at a time
const uint32 batch_size = 16*1024*1024;
// prepare storage for the located hits and the output scores
nvbio::vector<device_tag,hit_type> hits( batch_size );
nvbio::vector<device_tag,uint32> out_reads( batch_size );
nvbio::vector<device_tag,int16> out_scores( batch_size );
// loop through large batches of hits and locate & merge them
for (uint64 hits_begin = 0; hits_begin < n_hits; hits_begin += batch_size)
{
const uint64 hits_end = nvbio::min( hits_begin + batch_size, n_hits );
fm_filter.locate(
hits_begin, // the beginning of the range of hits to locate
hits_end, // the end of the range of hits to locate
hits.begin() ); // the output vector to be filled
...
At this point, remember that the hit coordinates refer to the seeds, not to the reads they belong to. We hence want to convert them to diagonals of the corresponding read/genome alignment matrices, which we do using the transform() parallel primitive:
...
// transform the (index-pos,seed-id) hit coordinates into diagonals
nvbio::transform<device_tag>(
hits_end - hits_begin,
hits.begin(),
hits.begin(),
hit_to_diagonal( nvbio::plain_view( seed_coords ) ) );
...
// transform an (index-pos,seed-id) hit into a diagonal (text-pos = index-pos - seed-pos, read-id)
struct hit_to_diagonal
{
typedef uint2 argument_type;
typedef uint2 result_type;
// constructor
hit_to_diagonal(const string_set_infix_coord_type* _seed_coords) : seed_coords(_seed_coords) {}
// functor operator
uint2 operator() (const uint2 hit) const
{
const uint32 index_pos = hit.x;
const uint32 seed_id = hit.y;
const string_set_infix_coord_type seed = seed_coords[ seed_id ];
const uint32 read_pos = infix_begin( seed );
const uint32 read_id = string_id( seed );
return make_uint2( index_pos - read_pos, read_id );
}
const string_set_infix_coord_type* seed_coords;
};
We are finally ready for the extension, or verification, phase: we need to check whether the candidate hits identified by the seeds are true alignments, i.e. whether they lie within a certain edit-distance of the reference genome. We'll do this using the banded Myers bit-vector aligner, applied to the batch alignment problem of aligning the string-set of reads associated with all hits against the string-set of corresponding genome windows. Both string-sets can be thought of as sparse string sets, i.e. sparse collections of substrings of (a sketch of the corresponding infix-extraction functors follows the list):
  • the contiguously packed string of input reads
  • the reference genome
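Before looking at the code, here is a rough sketch of what the read_infixes and genome_infixes functors used below might look like: each read is verified in its entirety, while the genome infix is a window of read_len + BAND_LEN base pairs around the hit diagonal. The io::ConstSequenceDataView view type and the get_range() accessor are assumptions here, used purely for illustration:
// sketch: extract the read infix of a (text-pos,read-id) hit - the whole read range
struct read_infixes
{
// constructor
read_infixes(const io::ConstSequenceDataView reads) : m_reads( reads ) {}
// functor operator
uint2 operator() (const uint2 hit) const
{
const io::SequenceDataAccess<DNA_N> reads( m_reads );
// fetch the (begin,end) coordinates of the read bound to this hit
return reads.get_range( hit.y );
}
const io::ConstSequenceDataView m_reads;
};
// sketch: extract the genome infix of a (text-pos,read-id) hit - a window of
// read_len + BAND_LEN bps around the hit diagonal, clamped to the genome
template <uint32 BAND_LEN>
struct genome_infixes
{
// constructor
genome_infixes(const uint32 genome_len, const io::ConstSequenceDataView reads) :
m_genome_len( genome_len ), m_reads( reads ) {}
// functor operator
uint2 operator() (const uint2 hit) const
{
const io::SequenceDataAccess<DNA_N> reads( m_reads );
const uint32 text_pos = hit.x;
const uint32 read_id = hit.y;
// fetch the length of the read bound to this hit
const uint2 read_range = reads.get_range( read_id );
const uint32 read_len = read_range.y - read_range.x;
// clamp a band of BAND_LEN diagonals around the hit to the genome
const uint32 begin = text_pos > BAND_LEN/2 ? text_pos - BAND_LEN/2 : 0u;
const uint32 end = nvbio::min( begin + read_len + BAND_LEN, m_genome_len );
return make_uint2( begin, end );
}
const uint32 m_genome_len;
const io::ConstSequenceDataView m_reads;
};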
...
// build the set of read infixes
nvbio::transform<device_tag>(
hits_end - hits_begin,
hits.begin(),
read_infix_coords.begin(),
read_infixes( nvbio::plain_view( reads ) ) );
// build the set of genome infixes
nvbio::transform<device_tag>(
hits_end - hits_begin,
hits.begin(),
genome_infix_coords.begin(),
genome_infixes<BAND_LEN>( genome_len, nvbio::plain_view( reads ) ) );
typedef io::SequenceDataAccess<DNA_N>::sequence_stream_type read_stream;
const SparseStringSet<read_stream,infix_iterator> read_infix_set(
hits_end - hits_begin,
reads.sequence_stream(),
read_infix_coords.begin() );
const SparseStringSet<genome_string,infix_iterator> genome_infix_set(
hits_end - hits_begin,
genome,
genome_infix_coords.begin() );
// spawn a batched banded DP alignment
typedef aln::MyersTag<5u> myers_dna5_tag;
aln::batch_banded_alignment_score<BAND_LEN>(
aln::make_edit_distance_aligner<aln::SEMI_GLOBAL, myers_dna5_tag>(),
read_infix_set,
genome_infix_set,
sinks.begin(),
reads.max_read_len(),
reads.max_read_len() + BAND_LEN );
}
}
Technically, we are almost done: we now have the mapping score for each candidate hit. In this tutorial we won't do anything with them, but the accompanying code actually shows how to keep track of the best scores for each read by using segmented reductions.
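As an illustration of that last step, here is a minimal sketch of such a segmented max-reduction using Thrust's reduce_by_key (rather than the primitives used in the accompanying code), assuming the per-hit (read-id, score) pairs have already been sorted by read id:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <nvbio/basic/types.h> // for the uint32 / int16 typedefs
using namespace nvbio;
// keep the best alignment score per read: given per-hit read-ids and scores,
// both sorted by read-id, reduce each segment of equal read-ids to its maximum score;
// returns the number of distinct reads written to out_reads / out_scores
uint32 best_scores(
const thrust::device_vector<uint32>& read_ids, // read-id of each hit (sorted)
const thrust::device_vector<int16>& scores, // alignment score of each hit
thrust::device_vector<uint32>& out_reads, // one entry per distinct read
thrust::device_vector<int16>& out_scores) // best score of each distinct read
{
const auto ends = thrust::reduce_by_key(
read_ids.begin(), read_ids.end(), // keys: the hit read-ids
scores.begin(), // values: the hit scores
out_reads.begin(), // output keys
out_scores.begin(), // output values
thrust::equal_to<uint32>(), // consecutive equal read-ids form a segment
thrust::maximum<int16>() ); // segmented reduction: take the max score
return uint32( ends.first - out_reads.begin() );
}
Each segment of equal read-ids is collapsed to a single entry holding the maximum score, so the output vectors end up with one best score per mapped read.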
This example should have given you an idea of how to build a pipeline based on some of NVBIO's parallel constructs, from FM-index filters to DP alignment. Hopefully, you are now set to read through the rest of the documentation and discover the many other tools that might come in handy!

Next: Protein Search using Wavelet Trees and an FM-index Top: NVBIO