- NVBIO tries to make it easy to write modern bioinformatics applications that exploit parallel architectures. In this page we'll see how to apply it to build a prototype FM-index based all-mapper.
Input
- The first step for a reference-based aligner is loading a reference index, and streaming a set of reads. Let's start by loading the reference sequence and its FM-index, using the corresponding Sequence and FM-Index I/O classes.
#include <stdio.h>
#include <stdlib.h>
#include "util.h"
int main(
    int          argc,
    const char** argv)
{
    // parse the command-line options
    Params params( argc-2, argv );

    const char* reads = argv[argc-1];
    const char* index = argv[argc-2];

    // load the reference sequence in host memory
    io::SequenceDataHost h_ref;
    if (!io::load_sequence_file( DNA, &h_ref, index ))
    {
        log_error(stderr,
            "    failed loading reference \"%s\"\n", index);
        return 1u;
    }

    // load the FM-index in host memory, specifying which of its
    // components we need: the forward BWT and the sampled suffix array
    io::FMIndexDataHost h_fmi;
    if (!h_fmi.load( index, io::FMIndexData::FORWARD | io::FMIndexData::SA ))
    {
        log_error(stderr,
            "    failed loading index \"%s\"\n", index);
        return 1u;
    }
...
- Besides some simple option parsing, we now have an application that loads an FM-index from disk into system memory, including the genome, its forward BWT and the sampled suffix array. If we want our mapper to run on the GPU, we can copy the FM-index to device memory with a single line, again specifying which of its components to load. Rather than keeping it on the stack, we will store it in a Pipeline object, which we'll use to keep all the state of our mapper.
struct Pipeline
{
    Params                                params;
    SharedPointer<io::SequenceDataDevice> ref_data;
    SharedPointer<io::FMIndexDataDevice>  fm_data;
};
...
Pipeline pipeline;
pipeline.params = params;

// copy the reference and the FM-index to device memory
pipeline.ref_data = new io::SequenceDataDevice( h_ref );
pipeline.fm_data  = new io::FMIndexDataDevice( h_fmi,
    io::FMIndexData::FORWARD |
    io::FMIndexData::SA );
...
- The second step is to start streaming reads in; NVBIO's philosophy here is to use a batched approach, where the size of the batch roughly determines the amount of available parallelism in the rest of the pipeline. In order to amortize fixed costs (e.g. grid launch / synchronization overhead) it's good to allow for as much parallelism as possible given your memory constraints - but as a rough guideline, suffice it to say that lightweight kernels should have at least ~128K items to process, possibly more (for example, sorting performance saturates at 16-32M keys).
So if some stages of the alignment pipeline parallelize across reads, it would be good to have at least 1M or so to process. Obviously, one million reads could require plenty of memory if the reads are long, so we might want to cap the maximum number of base pairs as well:
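To make the double cap concrete, here is a tiny CPU sketch (a hypothetical helper, not NVBIO API) of how a batch loader can decide where a batch ends, stopping at whichever of the two limits - read count or total base pairs - is hit first:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only: given per-read lengths, find the end of the batch
// starting at 'begin', honoring both a read-count cap and a base-pair cap.
// NVBIO's io::next() performs this kind of bookkeeping internally.
std::size_t batch_end(
    const std::vector<uint32_t>& read_lengths, // length of each read in bps
    std::size_t                  begin,        // first read of the batch
    std::size_t                  max_reads,    // cap on reads per batch
    std::size_t                  max_bps)      // cap on base pairs per batch
{
    std::size_t end = begin;
    std::size_t bps = 0;
    while (end < read_lengths.size() &&
           end - begin < max_reads   &&
           bps + read_lengths[end] <= max_bps)
        bps += read_lengths[end++];
    return end;
}
```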
...
const uint32 batch_reads = 1*1024*1024;
const uint32 batch_bps   = 100*1024*1024;

// open the read file, storing for each read both its forward
// and its reverse-complemented representations
SharedPointer<io::SequenceDataStream> read_data_file(
    io::open_sequence_file(
        reads,
        io::Phred33,
        2*params.max_reads,
        uint32(-1),
        io::SequenceEncoding( io::FORWARD | io::REVERSE_COMPLEMENT ) ) );

if (read_data_file == NULL || read_data_file->is_ok() == false)
{
    log_error(stderr,
        "    failed opening file \"%s\"\n", reads);
    return 1u;
}

// loop through batches of reads
while (1)
{
    // load a batch of reads in host memory
    io::SequenceDataHost h_read_data;
    if (io::next( DNA_N, &h_read_data, read_data_file.get(), batch_reads, batch_bps ) == 0)
        break;

    log_info(stderr,
        "  loading reads... started\n");

    // copy the batch to device memory
    const io::SequenceDataDevice d_read_data( h_read_data );

    const uint32 n_reads = d_read_data.size() / 2;

    log_info(stderr,
        "  loading reads... done\n");
    log_info(stderr,
        "    %u reads\n", n_reads);

    // map the batch against the genome
    map( pipeline, d_read_data );
}
}
- Now we are ready to start the real mapping work. In order to do it, we'll need to store another data structure in our pipeline, namely the FMIndexFilter. This data structure is just a context for performing FM-index based filtering, i.e. the process of finding potential matches of a set of strings using an FM-index.
struct Pipeline
{
    typedef io::FMIndexDataDevice::fm_index_type fm_index_type;
    typedef FMIndexFilterDevice<fm_index_type>   fm_filter_type;

    Params                                params;
    SharedPointer<io::SequenceDataDevice> ref_data;
    SharedPointer<io::FMIndexDataDevice>  fm_data;
    fm_filter_type                        fm_filter;
};
- Here you go, ready to write the bulk of the aligner.
Mapping
- We'll write a simple seed & extend aligner, so the first thing we'll want to do is split our reads into equal segments. As shown in the previous example, the collection of these seeds can be seen as an InfixSet. Let's start with some boiler-plate code to make sure we have everything we need:
void map(
    Pipeline&                           pipeline,
    const io::SequenceDataAccess<DNA_N> reads)
{
    typedef io::SequenceDataAccess<DNA>::sequence_stream_type       genome_string;
    typedef io::SequenceDataAccess<DNA_N>::sequence_string_set_type read_string_set_type;
    typedef InfixSet<read_string_set_type, const string_set_infix_coord_type*> seed_string_set_type;
    typedef nvbio::vector<device_tag, string_set_infix_coord_type>  infix_vector_type;

    // fetch the program options
    const Params& params = pipeline.params;

    // fetch the genome string
    const io::SequenceDataAccess<DNA> genome_access( *pipeline.ref_data );
    const uint32        genome_len = genome_access.bps();
    const genome_string genome( genome_access.sequence_stream() );

    // fetch the FM-index
    const Pipeline::fm_index_type fm_index = pipeline.fm_data->index();

    // fetch the FM-index filter
    Pipeline::fm_filter_type& fm_filter = pipeline.fm_filter;

    // prepare storage for the seed coordinates
    infix_vector_type seed_coords;

    // extract the seeds
    const read_string_set_type read_string_set = reads.sequence_string_set();
    const seed_string_set_type seed_string_set = extract_seeds(
        read_string_set,
        params.seed_len,
        params.seed_intv,
        seed_coords );
...
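To make the seeding scheme concrete, here is a small CPU sketch (plain C++, not NVBIO code, with hypothetical names) of the coordinates extract_seeds computes: one (read-id, offset) pair for each seed of length seed_len, taken every seed_intv base pairs along each read:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative analogue of extract_seeds: compute, for every read, the
// (read_id, begin offset) coordinate of each seed of length 'seed_len'
// spaced 'seed_intv' base pairs apart. The real function returns these
// coordinates as an InfixSet over the packed read string-set.
std::vector<std::pair<uint32_t,uint32_t>> extract_seed_coords(
    const std::vector<uint32_t>& read_lengths,
    const uint32_t               seed_len,
    const uint32_t               seed_intv)
{
    std::vector<std::pair<uint32_t,uint32_t>> coords;
    for (uint32_t r = 0; r < read_lengths.size(); ++r)
        for (uint32_t pos = 0; pos + seed_len <= read_lengths[r]; pos += seed_intv)
            coords.push_back( std::make_pair( r, pos ) );
    return coords;
}
```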
- The next step is mapping the seeds using the FM-index filter; we start by ranking their occurrences:
...
const uint64 n_hits = fm_filter.rank( fm_index, seed_string_set );
...
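To see what the returned hit count means, here is a toy CPU analogue in which a plain suffix array with binary search stands in for the FM-index: each seed maps to one contiguous suffix-array range of occurrences, and ranking amounts to summing the sizes of all such ranges:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Illustrative analogue of fm_filter.rank(): count the total number of
// genome occurrences of a set of seeds. Here the occurrences of a seed
// are found as a contiguous range of a sorted suffix array; an FM-index
// finds the same ranges via backward search, without storing the text.
uint64_t count_hits(const std::string& text, const std::vector<std::string>& seeds)
{
    // build a suffix array by direct sorting (fine for a sketch)
    std::vector<uint32_t> sa( text.size() );
    std::iota( sa.begin(), sa.end(), 0u );
    std::sort( sa.begin(), sa.end(), [&](uint32_t a, uint32_t b) {
        return text.compare( a, std::string::npos, text, b, std::string::npos ) < 0;
    });

    uint64_t n_hits = 0;
    for (const std::string& seed : seeds)
    {
        // all suffixes prefixed by 'seed' form one contiguous SA range
        const auto lo = std::lower_bound( sa.begin(), sa.end(), seed,
            [&](uint32_t s, const std::string& p) { return text.compare( s, p.size(), p ) < 0; });
        const auto hi = std::upper_bound( sa.begin(), sa.end(), seed,
            [&](const std::string& p, uint32_t s) { return text.compare( s, p.size(), p ) > 0; });
        n_hits += hi - lo;
    }
    return n_hits;
}
```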
- Now, each seed might have mapped to a whole range of suffixes in the suffix array of the genome, and all those locations are potential candidates for a valid alignment. We want to visit them all, which means we need to locate their coordinates.
Notice that this process can involve a severe data expansion, as some seeds might map to millions of genome locations: hence, we cannot simply locate all n_hits in one go. The FMIndexFilter interface lets us do this in batches, while taking care of the delicate job of parallelizing the work at the finest possible grain with evenly distributed load balancing: i.e. spreading the variable-sized suffix array ranges across multiple threads, so that each thread gets a single hit to locate. In practice you can ignore these details, as NVBIO will do it all for you; the only thing you need to remember is to process your hits in large batches:
...
typedef uint2 hit_type;

// prepare storage for a batch of hits
const uint64 batch_size = 16*1024*1024;     // e.g. 16M hits per batch
nvbio::vector<device_tag, hit_type> hits( batch_size );

// loop through large batches of hits and locate them
for (uint64 hits_begin = 0; hits_begin < n_hits; hits_begin += batch_size)
{
    const uint64 hits_end = nvbio::min( hits_begin + batch_size, n_hits );

    fm_filter.locate(
        hits_begin,
        hits_end,
        hits.begin() );
...
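The load-balancing trick described above can be sketched on the CPU (with hypothetical helper names): exclusive-scan the per-seed range sizes, then map each global hit index back to a (seed-id, offset-within-range) pair by binary search, so that each thread handles exactly one hit regardless of how skewed the range sizes are:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// exclusive prefix sum of the per-seed SA range sizes
std::vector<uint64_t> exclusive_scan(const std::vector<uint64_t>& sizes)
{
    std::vector<uint64_t> scan( sizes.size() );
    uint64_t sum = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) { scan[i] = sum; sum += sizes[i]; }
    return scan;
}

// map a global hit index in [0, n_hits) to (seed_id, offset within the
// seed's SA range): the fine-grained work assignment locate() performs,
// so a seed with a million occurrences is spread over a million threads.
std::pair<uint32_t,uint32_t> hit_to_seed(
    const std::vector<uint64_t>& scan,
    const uint64_t               hit)
{
    const auto it = std::upper_bound( scan.begin(), scan.end(), hit ) - 1;
    const uint32_t seed_id = uint32_t( it - scan.begin() );
    return std::make_pair( seed_id, uint32_t( hit - *it ) );
}
```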
- At this point, remember that the hit coordinates refer to the seeds, not to the reads they belong to. We hence want to convert them to diagonals of the genome/read alignment matrices, which we do employing the transform() parallel primitive:
...
// transform the (SA-pos, seed-id) hit coordinates into
// (text-pos = SA-pos - seed-pos, read-id) diagonals
nvbio::transform<device_tag>(
    hits_end - hits_begin,
    hits.begin(),
    hits.begin(),
    hit_to_diagonal( nvbio::plain_view( seed_coords ) ) );
...
// a functor mapping an (SA-pos, seed-id) hit to a genome diagonal
struct hit_to_diagonal
{
    typedef uint2 argument_type;
    typedef uint2 result_type;

    // constructor
    NVBIO_FORCEINLINE NVBIO_HOST_DEVICE
    hit_to_diagonal(const string_set_infix_coord_type* _seed_coords) :
        seed_coords( _seed_coords ) {}

    // functor operator
    NVBIO_FORCEINLINE NVBIO_HOST_DEVICE
    uint2 operator() (const uint2 hit) const
    {
        const uint32 index_pos = hit.x;
        const uint32 seed_id   = hit.y;

        // fetch the seed coordinates and rebase the hit
        // to the beginning of its read
        const string_set_infix_coord_type seed = seed_coords[ seed_id ];
        const uint32 read_pos = infix_begin( seed );
        const uint32 read_id  = string_id( seed );

        return make_uint2( index_pos - read_pos, read_id );
    }

    const string_set_infix_coord_type* seed_coords;
};
- We are finally ready for the extension or verification phase: we need to check if the candidate hits identified by the seeds are true alignments or not, i.e. whether they map to within a certain edit-distance from the reference genome. We'll do this using the banded Myers bit-vector aligner, applied to the batch alignment problem of aligning the string-set of reads associated to all hits against the string-set of corresponding genome locations. Both these string-sets can be thought of as sparse string sets, i.e. sparse collections of substrings of:
- the contiguously packed string of input reads
- the reference genome
...
// prepare storage for the infix coordinates and for the alignment sinks,
// which will collect the best score of each hit
nvbio::vector<device_tag, string_set_infix_coord_type> read_infix_coords( batch_size );
nvbio::vector<device_tag, string_set_infix_coord_type> genome_infix_coords( batch_size );
nvbio::vector<device_tag, aln::BestSink<int16> >       sinks( batch_size );

// build the set of read infix coordinates, one per hit
// (the read_infixes functor is defined in the accompanying code)
nvbio::transform<device_tag>(
    hits_end - hits_begin,
    hits.begin(),
    read_infix_coords.begin(),
    read_infixes( nvbio::plain_view( reads ) ) );

// build the set of genome infix coordinates, one per hit
// (the genome_infixes functor is defined in the accompanying code)
nvbio::transform<device_tag>(
    hits_end - hits_begin,
    hits.begin(),
    genome_infix_coords.begin(),
    genome_infixes<BAND_LEN>( genome_len, nvbio::plain_view( reads ) ) );

typedef io::SequenceDataAccess<DNA_N>::sequence_stream_type read_stream;
typedef nvbio::vector<device_tag, string_set_infix_coord_type>::iterator infix_iterator;

// assemble the sparse string-set of read infixes
const SparseStringSet<read_stream,infix_iterator> read_infix_set(
    hits_end - hits_begin,
    reads.sequence_stream(),
    read_infix_coords.begin() );

// assemble the sparse string-set of genome infixes
const SparseStringSet<genome_string,infix_iterator> genome_infix_set(
    hits_end - hits_begin,
    genome,
    genome_infix_coords.begin() );

// select a Myers bit-vector algorithm specialized for a 5-letter alphabet
typedef aln::MyersTag<5u> myers_dna5_tag;

// score the hits with a banded semi-global edit-distance alignment
aln::batch_banded_alignment_score<BAND_LEN>(
    aln::make_edit_distance_aligner<aln::SEMI_GLOBAL, myers_dna5_tag>(),
    read_infix_set,
    genome_infix_set,
    sinks.begin(),
    reads.max_read_len(),
    reads.max_read_len() + BAND_LEN );
}
}
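To illustrate what the verification kernel decides, here is a self-contained CPU stand-in for the banded Myers scorer: a plain banded DP computing the semi-global edit distance of a read against a genome window (free start and end in the text, diagonals outside the band pruned). This is only a sketch of the scoring semantics, not NVBIO's bit-vector implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <string>
#include <vector>

// Banded semi-global edit distance: align the whole pattern (read)
// against any substring of the text (genome window), visiting only the
// cells within 'k' diagonals of the main diagonal. Returns the best
// (lowest) edit distance over all end positions in the text.
int banded_semiglobal_edit_distance(
    const std::string& pattern,   // the read
    const std::string& text,      // the candidate genome window
    const int          k)         // band half-width
{
    const int m   = int( pattern.size() );
    const int n   = int( text.size() );
    const int INF = INT_MAX / 2;

    // row 0: the alignment may start at any position in the text for free
    std::vector<int> prev( n+1, 0 ), curr( n+1, INF );

    for (int i = 1; i <= m; ++i)
    {
        for (int j = 0; j <= n; ++j)
        {
            // prune cells outside the band around the main diagonal
            if (j - i > k || i - j > k) { curr[j] = INF; continue; }
            if (j == 0)                 { curr[j] = i;   continue; }

            const int sub = prev[j-1] + (pattern[i-1] == text[j-1] ? 0 : 1);
            const int del = prev[j] + 1;    // gap in the text
            const int ins = curr[j-1] + 1;  // gap in the pattern
            curr[j] = std::min( sub, std::min( del, ins ) );
        }
        prev.swap( curr );
    }
    // semi-global: the alignment may end anywhere in the text
    return *std::min_element( prev.begin(), prev.end() );
}
```

In the pipeline above, the genome window for each hit is just the read's diagonal widened by the band, which is why the band also bounds the maximum reported edit distance.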
- Technically, we are almost done: we now have the mapping score for each candidate hit. In this tutorial we won't do anything with them, but the accompanying code actually shows how to keep track of the best scores for each read by using segmented reductions.
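The segmented reduction mentioned above can be sketched serially (plain C++, hypothetical helper name): group the scored hits by read id and keep the best score per read, where "best" here means highest, assuming a scoring convention in which higher is better. On the GPU the same result comes from sorting hits by read id and applying a reduce-by-key primitive:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Serial analogue of a segmented (per-read) max-reduction over hit scores:
// for every read id appearing among the scored hits, keep only its best score.
std::map<uint32_t,int16_t> best_scores(
    const std::vector<std::pair<uint32_t,int16_t>>& scored_hits) // (read_id, score)
{
    std::map<uint32_t,int16_t> best;
    for (const auto& hit : scored_hits)
    {
        auto it = best.find( hit.first );
        if (it == best.end() || hit.second > it->second)
            best[hit.first] = hit.second;
    }
    return best;
}
```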
- This example should have given you an idea of how to build a pipeline based on some of NVBIO's parallel constructs, from FM-index filters to DP alignment. Hopefully, you are now set to read through the rest of the documentation and discover how many other useful tools might come to your aid!