NVBIO
Sequence Data Input
This module contains a series of classes to load and represent sequence streams. A sequence stream is an object implementing a simple interface, SequenceDataInputStream, which allows streaming through a file or other set of reads in batches. Each batch is represented in memory by an object inheriting from SequenceData. There are several kinds of SequenceData containers, which keep the reads either in host memory or in CUDA device memory. Additionally, the same containers can be viewed with different SequenceData Views, which reinterpret the base arrays with iterators of different types, e.g. to perform vector loads or use LDG.
Specifically, it exposes the following core classes and methods:
as well as some additional accessors:

Sequence Data

The SequenceData class is the base class for all containers holding storage of sequence data. These containers are:
Each SequenceData object can contain several distinct sequences, represented as a packed string-set of sequence symbols accompanied by corresponding string-sets of sequence quality scores and sequence names. Internally, all the string-sets are stored as ConcatenatedStringSets, with an index specifying the position of the i-th sequence in the concatenated arrays. The packed sequences can in turn be encoded with a user-specified alphabet. SequenceData has only runtime knowledge of the alphabet encoding, however, and hence does not provide any method to perform decoding: it only exposes methods to obtain plain views of the underlying sequence storage. By providing compile-time knowledge of the alphabet, one can construct a SequenceDataAccess wrapper around any SequenceData (or SequenceDataView) object and access the decoded string-sets transparently. The following example shows how to load a sequence file and access it with the alphabet specified at compile time:
// load a SequenceData object
SharedPointer<io::SequenceDataHost> genome = io::load_sequence_data( DNA, "drosophila.fa" );
// access it specifying the alphabet at compile-time
const io::SequenceDataAccess<DNA> genome_access( genome.get() );
// name the decoding string-set type (assuming SequenceDataAccess exposes it as a nested typedef)
typedef io::SequenceDataAccess<DNA>::sequence_string_set_type sequence_string_set_type;
// fetch the decoding string-set
const sequence_string_set_type genome_string_set = genome_access.sequence_string_set();
// the number of sequences in the string-set
const uint32 n = genome_string_set.size();
for (uint32 i = 0; i < n; ++i)
{
    // fetch the i-th sequence
    const sequence_string_set_type::string_type gene = genome_string_set[i];
    // and do something with it...
    printf("gene %u contains %u bps:\n", i, length( gene ) );
    for (uint32 j = 0; j < length( gene ); ++j)
        printf("%c", to_char<DNA>( gene[j] ));
    printf("\n");
}
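
The storage layout described in this section can be illustrated with a small standalone sketch. It is not NVBIO code: the names PackedStringSet, PackedStringSetAccess, AlphabetTraits, DNA_2BIT and append are hypothetical and chosen for this illustration only, and the 2-bit encoding (A=0, C=1, G=2, T=3) is an assumption. The sketch shows a container that stores all sequences back to back in one packed array plus an offset index, carries only a runtime tag for its alphabet, and is decoded through a wrapper that receives the alphabet as a template parameter:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

typedef std::uint32_t uint32;

// Hypothetical runtime alphabet tag (not an NVBIO type).
enum Alphabet { DNA_2BIT };

// Compile-time description of an alphabet: symbol width and char conversion.
template <Alphabet A> struct AlphabetTraits;
template <> struct AlphabetTraits<DNA_2BIT> {
    static constexpr uint32 bits = 2;
    static char to_char(uint32 s) { return "ACGT"[s & 3u]; }
    // Sketch only: any character other than A, C, G is encoded as T.
    static uint32 from_char(char c) {
        return c == 'A' ? 0u : c == 'C' ? 1u : c == 'G' ? 2u : 3u;
    }
};

// Container with runtime-only knowledge of the encoding: it exposes raw
// storage but cannot decode symbols by itself.
struct PackedStringSet {
    Alphabet            alphabet; // runtime tag
    std::vector<uint32> words;    // packed symbols of all sequences, concatenated
    std::vector<uint32> index;    // index[i] = start of sequence i, in symbols

    explicit PackedStringSet(Alphabet a) : alphabet(a), index(1, 0u) {}
    uint32 size() const { return uint32(index.size()) - 1u; }
};

// Append one sequence, packing its symbols into 32-bit words.
template <Alphabet A>
void append(PackedStringSet& set, const std::string& seq) {
    typedef AlphabetTraits<A> traits;
    assert(set.alphabet == A);
    const uint32 per_word = 32u / traits::bits;
    uint32 pos = set.index.back();
    for (char c : seq) {
        if (pos % per_word == 0) set.words.push_back(0u); // start a new word
        set.words.back() |= traits::from_char(c) << ((pos % per_word) * traits::bits);
        ++pos;
    }
    set.index.push_back(pos);
}

// Wrapper adding compile-time knowledge of the alphabet, so that symbols
// can be extracted and decoded.
template <Alphabet A>
struct PackedStringSetAccess {
    typedef AlphabetTraits<A> traits;

    explicit PackedStringSetAccess(const PackedStringSet& s) : set(s) {
        assert(s.alphabet == A); // the compile-time and runtime alphabets must agree
    }
    uint32 length(uint32 i) const { return set.index[i + 1] - set.index[i]; }

    // Fetch the j-th symbol of the i-th sequence.
    uint32 symbol(uint32 i, uint32 j) const {
        const uint32 per_word = 32u / traits::bits;
        const uint32 pos      = set.index[i] + j;
        const uint32 shift    = (pos % per_word) * traits::bits;
        return (set.words[pos / per_word] >> shift) & ((1u << traits::bits) - 1u);
    }
    std::string decode(uint32 i) const {
        std::string out;
        for (uint32 j = 0; j < length(i); ++j)
            out.push_back(traits::to_char(symbol(i, j)));
        return out;
    }
    const PackedStringSet& set;
};
```

Because sequences are concatenated at symbol granularity, a sequence may start in the middle of a word; the index is what lets the wrapper locate it. Keeping the alphabet out of the container's type means one container type can serve any encoding, at the cost of needing the typed wrapper for decoding.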

Sequence Data Streams

Sometimes it is convenient to stream through sequences in batches. SequenceDataInputStream provides an abstract interface for doing just this:
// open a sequence file
SharedPointer<io::SequenceDataInputStream> reads_file = io::open_sequence_file( "reads.fastq" );
// instantiate a host SequenceData object
io::SequenceDataHost reads;
// declare how much sequence data we want to load in each batch
const uint32 seqs_per_batch = 128*1024; // the maximum number of sequences
const uint32 bps_per_batch = 128*1024*100; // the maximum number of base pairs
// loop through the stream in batches
while (io::next( DNA_N, &reads, reads_file.get(), seqs_per_batch, bps_per_batch ))
{
    // copy the loaded batch to the device
    const io::SequenceDataDevice device_reads( reads );
    ...
}
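
The effect of the two per-batch limits in the loop above can be sketched in isolation. The following is a minimal standalone illustration, not NVBIO's implementation: VectorStream and next_batch are hypothetical names, and the in-memory vector stands in for a sequence file. A batch is closed as soon as either the sequence cap or the base-pair cap would be exceeded:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an input stream: hands out sequences held in memory.
struct VectorStream {
    std::vector<std::string> seqs;
    std::size_t              cursor; // next sequence to be read

    explicit VectorStream(const std::vector<std::string>& s) : seqs(s), cursor(0) {}
};

// Fill `batch` with at most max_seqs sequences totalling at most max_bps symbols.
// Returns the number of sequences loaded; 0 signals that the stream is exhausted.
// A single sequence longer than max_bps is still loaded on its own, so that the
// caller's loop always makes progress (a choice made for this sketch).
std::size_t next_batch(VectorStream&             stream,
                       std::vector<std::string>& batch,
                       std::size_t               max_seqs,
                       std::size_t               max_bps)
{
    assert(max_seqs > 0);
    batch.clear();
    std::size_t bps = 0;
    while (stream.cursor < stream.seqs.size() && batch.size() < max_seqs) {
        const std::string& s = stream.seqs[stream.cursor];
        // stop before exceeding the base-pair budget, unless the batch is empty
        if (!batch.empty() && bps + s.size() > max_bps)
            break;
        batch.push_back(s);
        bps += s.size();
        ++stream.cursor;
    }
    return batch.size();
}
```

A caller would loop with `while (next_batch( stream, batch, max_seqs, max_bps ))`, mirroring the shape of the loop above. Bounding both quantities keeps the memory footprint of a batch predictable whether the input consists of many short reads or a few long ones.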

Technical Documentation

More documentation is available in the Sequence Data Input module.