NVBIO
nvSetBWT

nvSetBWT is an application built on top of NVBIO to build the BWT of a set of strings, typically reads.
Given an input FASTQ or text file with one read per line, it creates a file containing the BWT of the reads' forward and reverse-complemented strands. Alongside the main BWT file, it generates a file containing the mapping between the primary dollar tokens and their positions in the BWT. For example:
*  ./nvSetBWT my-reads.fastq my-reads.bwt
* 
will generate the following files:
*  my-reads.bwt
*  my-reads.pri
* 
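To make the output concrete, here is a minimal sketch of the BWT of a small string set, built naively by sorting suffixes. It is not the set-bwte algorithm and does not use NVBIO. The function names, the ordering of the per-string terminators (the `$` of string i sorts before the `$` of string j when i < j), and the choice to report the positions of `$` in the BWT as (position, string-id) pairs are all assumptions made for illustration; the tool's actual conventions may differ.

```python
# Hypothetical helpers for illustration only; not part of NVBIO.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Return the reverse-complemented strand of a DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def naive_set_bwt(strings):
    """Naive BWT of a string set by explicit suffix sorting.

    Assumption: string i is terminated by its own '$' token, which sorts
    before every letter and before the '$' of any string j > i.
    Returns the BWT text and a list of (position, string_id) pairs giving
    where each '$' lands in the BWT.
    """
    rows = []
    for sid, s in enumerate(strings):
        for k in range(len(s) + 1):
            # Sort key: letters are tagged 1, the terminator is tagged 0,
            # so the terminator always compares smaller than a letter.
            key = [(1, c) for c in s[k:]] + [(0, sid)]
            # The BWT symbol is the one preceding the suffix (cyclically).
            prev = s[k - 1] if k > 0 else "$"
            rows.append((key, prev, sid))
    rows.sort(key=lambda row: row[0])
    bwt = "".join(prev for _, prev, _ in rows)
    dollars = [(pos, sid) for pos, (_, prev, sid) in enumerate(rows) if prev == "$"]
    return bwt, dollars


bwt, dollars = naive_set_bwt(["banana"])
print(bwt, dollars)  # annb$aa [(4, 0)]
```

Sorting every suffix explicitly takes far too much time and memory for real read sets, which is the problem the algorithm described under Details is designed to solve.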

Options

nvSetBWT supports the following command options:
*    nvSetBWT [options] input_file output_file
*    options:
*     -v       | --verbosity     int (0-6) [5]
*     -c       | --compression   string    [1R]   (e.g. "1", ..., "9", "1R")
*     -F       | --skip-forward
*     -R       | --skip-reverse
* 

File Formats

The output BWT can be saved in one of the following formats:
*  .txt        ASCII
*  .txt.gz     ASCII, gzip compressed
*  .txt.bgz    ASCII, block-gzip compressed
*  .bwt        2-bit packed binary
*  .bwt.gz     2-bit packed binary, gzip compressed
*  .bwt.bgz    2-bit packed binary, block-gzip compressed
*  .bwt4       4-bit packed binary
*  .bwt4.gz    4-bit packed binary, gzip compressed
*  .bwt4.bgz   4-bit packed binary, block-gzip compressed
* 
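As an illustration of what "2-bit packed" means, the sketch below packs DNA symbols four to a byte. The symbol codes (A=0, C=1, G=2, T=3), the bit order within a byte, and the function names are assumptions chosen for the example; the actual on-disk layout written by nvSetBWT is not specified here and may differ, for instance in word size or bit order.

```python
# Hypothetical 2-bit packing for illustration; not the NVBIO file layout.

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
SYMBOLS = "ACGT"


def pack_2bit(seq):
    """Pack a DNA string into bytes, four symbols per byte, low bits first."""
    out = bytearray((len(seq) + 3) // 4)
    for i, c in enumerate(seq):
        out[i // 4] |= CODES[c] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(data, n):
    """Recover the first n symbols from bytes produced by pack_2bit."""
    return "".join(SYMBOLS[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


packed = pack_2bit("ACGTTG")
print(len(packed), unpack_2bit(packed, 6))  # 2 ACGTTG
```

A 2-bit code has exactly four values, so it leaves no spare code for a terminator symbol; any such symbol has to be recorded separately.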
The accompanying primary map file (.pri, .pri.gz or .pri.bgz) is a plain list of (position, string-id) pairs, in either ASCII or binary form. The ASCII file has the form:
*   #PRI
*   position[1] string[1]
*   ...
*   position[n] string[n]
* 
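A minimal sketch of reading the ASCII form follows. It assumes an uncompressed file whose two fields are whitespace-separated decimal integers, which the layout above suggests but does not state; `parse_pri_text` is a hypothetical helper, not an NVBIO function.

```python
def parse_pri_text(text):
    """Parse the ASCII primary map into a list of (position, string_id) pairs.

    Assumption: the first line is the '#PRI' header and every following
    non-empty line holds two whitespace-separated decimal integers.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "#PRI":
        raise ValueError("missing #PRI header")
    pairs = []
    for line in lines[1:]:
        if not line.strip():
            continue  # tolerate blank lines
        position, string_id = line.split()
        pairs.append((int(position), int(string_id)))
    return pairs


print(parse_pri_text("#PRI\n3 0\n5 1\n"))  # [(3, 0), (5, 1)]
```

For the .pri.gz variant, the text could first be obtained with Python's `gzip.open(path, "rt")`.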
The binary file has the format:
*   char[4] header = "PRIB";
*   struct { uint64 position; uint32 string_id; } pairs[n];
* 
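The binary form can be read along the same lines. The sketch below assumes little-endian, tightly packed 12-byte records (`"<QI"`); the byte order and struct padding are not stated above, so the record layout is left as a parameter. If the writer stores the struct with 8-byte alignment, each record would occupy 16 bytes and `"<QI4x"` would be the matching format. `parse_pri_binary` is a hypothetical helper, not an NVBIO function.

```python
import struct


def parse_pri_binary(data, record_format="<QI"):
    """Parse the binary primary map into a list of (position, string_id) pairs.

    Assumption: 'data' is the uncompressed file content, starting with the
    4-byte header 'PRIB' followed by fixed-size records laid out as
    described by 'record_format' (uint64 position, uint32 string_id).
    """
    if data[:4] != b"PRIB":
        raise ValueError("missing PRIB header")
    record = struct.Struct(record_format)
    body = data[4:]
    if len(body) % record.size != 0:
        raise ValueError("payload is not a whole number of records")
    return [record.unpack_from(body, offset)
            for offset in range(0, len(body), record.size)]


# Round-trip check on synthetic data built with the same assumed layout.
sample = b"PRIB" + struct.pack("<QI", 3, 0) + struct.pack("<QI", 5, 1)
print(parse_pri_binary(sample))  # [(3, 0), (5, 1)]
```

Checking that the payload length is a multiple of the record size is a cheap way to detect a wrong padding assumption before trusting the decoded pairs.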

Details

nvSetBWT implements a novel algorithm for the BWT construction of very large string sets, called set-bwte:

http://arxiv.org/pdf/1410.0562.pdf

The algorithm can be considered an adaptation of Ferragina's serial bwte algorithm to string sets and massive parallelism. It is well suited to processing reads of arbitrary length, and its design allows incremental updates, although this option is not yet implemented.