: This module contains a series of classes to load and represent sequence streams. A sequence stream is an object implementing a simple interface, SequenceDataInputStream, which makes it possible to stream through a file or other set of reads in batches. Each batch is represented in memory by an object inheriting from SequenceData. There are several kinds of SequenceData containers, which keep the reads either in host memory or in CUDA device memory. Additionally, the same containers can be viewed through different SequenceData views, which reinterpret the base arrays with iterators of different types, e.g. to perform vector loads or use LDG.

: Specifically, it exposes the following core classes and methods:

: as well as some additional accessors:

Sequence Data

: The SequenceData class is the base class for all containers holding storage of sequence data. These containers are:

: Each SequenceData object may contain several distinct sequences, which are represented as a packed string-set of sequence symbols accompanied by corresponding string-sets of sequence quality scores and sequence names. Internally, all the string-sets are stored as ConcatenatedStringSet objects, with an index specifying the position of the i-th sequence in the concatenated arrays. The packed sequences can in turn be encoded with a user-specified alphabet. SequenceData, however, knows the alphabet encoding only at runtime, so it does not provide any method to perform decoding: it only exposes methods to obtain plain views of the underlying sequence storage. When the alphabet is known at compile time, one can construct a SequenceDataAccess wrapper around any SequenceData (or SequenceDataView) object and access the decoded string-sets transparently. The following example shows how to load a sequence file and access it with the alphabet specified at compile time:
typedef io::SequenceDataAccess<DNA>::sequence_string_set_type sequence_string_set_type;

// load a SequenceData object
SharedPointer<io::SequenceDataHost> genome = io::load_sequence_data( DNA, "drosophila.fa" );

// access it specifying the alphabet at compile-time
const io::SequenceDataAccess<DNA> genome_access( genome.get() );

// fetch the decoding string-set
const sequence_string_set_type genome_string_set = genome_access.sequence_string_set();

// loop over the sequences (n is assumed to hold the number of sequences in the set)
for (uint32 i = 0; i < n; ++i)
{
    // fetch the i-th sequence
    const sequence_string_set_type::string_type gene = genome_string_set[i];

    // and do something with it...
    printf("gene %u contains %u bps:\n", i, length( gene ) );
    for (uint32 j = 0; j < length( gene ); ++j)
        printf("%c", to_char<DNA>( gene[j] ));
    printf("\n");
}
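
: The storage layout described above (a single packed array holding all sequences back to back, an index marking where each sequence starts, and decoding that only becomes possible once the alphabet is fixed at compile time) can be illustrated with a small standalone sketch. It does not use NVBIO and is not its implementation: PackedStringSet, PackedAccess, append and Dna2Bit are hypothetical names, and the 2-bit encoding A=0, C=1, G=2, T=3 is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical 2-bit DNA alphabet (A=0, C=1, G=2, T=3); an assumption, not NVBIO's definition.
struct Dna2Bit {
    static constexpr uint32_t bits = 2;
    static char to_char(uint32_t code) { return "ACGT"[code & 3u]; }
    static uint32_t from_char(char c) {
        switch (c) {
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return 0;  // 'A' and, in this sketch, any unknown symbol
        }
    }
};

// Alphabet-agnostic storage: all sequences concatenated into one packed word
// array, plus an index where index[i] is the first symbol of sequence i and
// index.back() is the total number of symbols stored.
struct PackedStringSet {
    std::vector<uint32_t> words;
    std::vector<uint32_t> index = {0};
    uint32_t size() const { return uint32_t(index.size() - 1); }
};

// Append one sequence, packing its symbols with the given alphabet.
template <typename Alphabet>
void append(PackedStringSet& set, const std::string& seq) {
    constexpr uint32_t per_word = 32u / Alphabet::bits;
    uint32_t pos = set.index.back();
    for (char c : seq) {
        if (pos / per_word >= set.words.size()) set.words.push_back(0u);
        set.words[pos / per_word] |=
            Alphabet::from_char(c) << ((pos % per_word) * Alphabet::bits);
        ++pos;
    }
    set.index.push_back(pos);
}

// Compile-time alphabet wrapper: the storage itself cannot decode, but this
// accessor can because the alphabet is a template parameter.
template <typename Alphabet>
class PackedAccess {
public:
    explicit PackedAccess(const PackedStringSet& set) : m_set(set) {}

    uint32_t length(uint32_t i) const { return m_set.index[i + 1] - m_set.index[i]; }

    // Return the j-th symbol code of the i-th sequence.
    uint32_t symbol(uint32_t i, uint32_t j) const {
        assert(i < m_set.size() && j < length(i));
        constexpr uint32_t per_word = 32u / Alphabet::bits;
        constexpr uint32_t mask = (1u << Alphabet::bits) - 1u;
        const uint32_t pos = m_set.index[i] + j;
        return (m_set.words[pos / per_word] >> ((pos % per_word) * Alphabet::bits)) & mask;
    }

    // Decode the i-th sequence back to characters.
    std::string decode(uint32_t i) const {
        std::string out;
        for (uint32_t j = 0; j < length(i); ++j) out += Alphabet::to_char(symbol(i, j));
        return out;
    }

private:
    const PackedStringSet& m_set;
};
```

: In this sketch, calling append<Dna2Bit>( set, "ACGT" ) and then reading PackedAccess<Dna2Bit>( set ).decode( 0 ) round-trips the sequence. Keeping the alphabet out of the storage type lets one container type hold any encoding, while the templated accessor lets the compiler resolve the symbol width and shifts statically.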

Sequence Data Streams

: Sometimes it is convenient to stream through sequences in batches. SequenceDataInputStream provides an abstract interface for doing just this:
// open a sequence file
SharedPointer<io::SequenceDataInputStream> reads_file = io::open_sequence_file( "reads.fastq" );

// instantiate a host SequenceData object
io::SequenceDataHost reads;

// declare how much sequence data we want to load in each batch
const uint32 seqs_per_batch = 128*1024;     // the maximum number of sequences
const uint32 bps_per_batch  = 128*1024*100; // the maximum number of base pairs

// loop through the stream in batches
while (io::next( DNA_N, &reads, reads_file.get(), seqs_per_batch, bps_per_batch ))
{
    // copy the loaded batch to the device
    const io::SequenceDataDevice device_reads( reads );
    ...
}
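
: To make the two per-batch limits more concrete, here is a minimal standalone sketch of a batched reader that stops filling a batch when either the sequence budget or the base-pair budget is reached. It does not use NVBIO: Batch and LineStream are hypothetical names, the input is assumed to hold one sequence per line rather than FASTQ, and the rule that a batch always accepts at least one sequence (even one longer than the base-pair budget) is an assumption of this sketch, not a documented property of io::next.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical host-side batch: the loaded sequences and their total length.
struct Batch {
    std::vector<std::string> seqs;
    uint64_t bps = 0;
};

// Hypothetical reader over a text stream with one sequence per line.
// It keeps one line of look-ahead so that a sequence which does not fit in
// the current batch is deferred to the next one instead of being dropped.
class LineStream {
public:
    explicit LineStream(std::istream& in) : m_in(in) {}

    // Fill `batch` with at most max_seqs sequences and at most max_bps symbols.
    // Returns false once the stream is exhausted and nothing was loaded.
    bool next(Batch& batch, uint32_t max_seqs, uint64_t max_bps) {
        assert(max_seqs > 0 && max_bps > 0);
        batch.seqs.clear();
        batch.bps = 0;
        while (batch.seqs.size() < max_seqs) {
            if (!m_has_pending) {
                if (!std::getline(m_in, m_pending)) break;  // end of input
                m_has_pending = true;
            }
            // Stop if this sequence would exceed the symbol budget, unless the
            // batch is still empty (assumed rule: always make progress).
            if (!batch.seqs.empty() && batch.bps + m_pending.size() > max_bps) break;
            batch.bps += m_pending.size();
            batch.seqs.push_back(m_pending);
            m_has_pending = false;
        }
        return !batch.seqs.empty();
    }

private:
    std::istream& m_in;
    std::string   m_pending;
    bool          m_has_pending = false;
};
```

: A caller would drive it with the same loop shape as above, e.g. while (stream.next( batch, seqs_per_batch, bps_per_batch )) { ... }, reusing one Batch object across iterations so that its storage is recycled from batch to batch.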

Technical Documentation

: More documentation is available in the Sequence Data Input module.