nvSetBWT is an application built on top of NVBIO to build the BWT of a set of strings, typically reads.
- Given an input FASTQ or text file with one read per line, it creates a file containing the BWT of the reads' forward and reverse-complemented strands. Alongside the main BWT file, it generates a file containing the mapping between the primary dollar tokens and their positions in the BWT. For example, running
* ./nvSetBWT my-reads.fastq my-reads.bwt
*
- will generate the following files:
* my-reads.bwt
* my-reads.pri
*
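- To make the two outputs concrete, the following Python sketch computes the BWT of a tiny string set by sorting all suffixes directly. It is a naive illustration of the concept, not the set-bwte algorithm. The function names, the ordering of the terminators (by string index, smaller than any symbol), and the placement of each reverse complement right after its forward strand are assumptions made for this example and may differ from what nvSetBWT does.

```python
def reverse_complement(read):
    """Reverse-complement a DNA read (N maps to N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[c] for c in reversed(read))


def naive_set_bwt(strings):
    """Return (bwt, primary) for a list of strings.

    Assumption: string i ends with its own terminator $_i, and
    $_0 < $_1 < ... < any ordinary symbol.
    primary lists (position in BWT, string id) for every '$' in the BWT.
    """
    suffixes = []
    for sid, s in enumerate(strings):
        for start in range(len(s) + 1):
            # Ordinary symbols rank above terminators: (1, c) > (0, sid).
            key = [(1, c) for c in s[start:]] + [(0, sid)]
            # The symbol preceding the whole string is its terminator.
            prev = s[start - 1] if start > 0 else "$"
            suffixes.append((key, prev, sid))
    suffixes.sort(key=lambda t: t[0])

    bwt = "".join(prev for _, prev, _ in suffixes)
    primary = [(pos, sid) for pos, (_, prev, sid) in enumerate(suffixes)
               if prev == "$"]
    return bwt, primary


reads = ["ACG"]
strings = []
for r in reads:
    strings += [r, reverse_complement(r)]

bwt, primary = naive_set_bwt(strings)
print(bwt)      # GT$A$CCG
print(primary)  # [(2, 0), (4, 1)]
```

- Sorting every suffix explicitly is only practical for toy inputs; it is shown here solely to illustrate what the BWT file and the primary map contain.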
Options
- nvSetBWT supports the following command-line options:
* nvSetBWT [options] input_file output_file
* options:
* -v | --verbosity int (0-6) [5]
* -c | --compression string [1R] (e.g. "1", ..., "9", "1R")
* -F | --skip-forward
* -R | --skip-reverse
*
File Formats
- The output BWT can be saved in one of the following formats:
* .txt ASCII
* .txt.gz ASCII, gzip compressed
* .txt.bgz ASCII, block-gzip compressed
* .bwt 2-bit packed binary
* .bwt.gz 2-bit packed binary, gzip compressed
* .bwt.bgz 2-bit packed binary, block-gzip compressed
* .bwt4 4-bit packed binary
* .bwt4.gz 4-bit packed binary, gzip compressed
* .bwt4.bgz 4-bit packed binary, block-gzip compressed
*
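- The extension table above can be read as an encoding suffix followed by an optional compression suffix. The sketch below decodes a file name that way. The helper `describe_bwt_output` is hypothetical and not part of nvSetBWT, and it assumes the format is determined by the file extension alone.

```python
# Encoding and compression suffixes, taken from the table above.
ENCODINGS = {
    ".txt": "ASCII",
    ".bwt": "2-bit packed binary",
    ".bwt4": "4-bit packed binary",
}
COMPRESSIONS = {
    ".gz": "gzip compressed",
    ".bgz": "block-gzip compressed",
}


def describe_bwt_output(filename):
    """Return (encoding, compression) for an output BWT file name.

    Hypothetical helper: it only mirrors the documented extension table.
    """
    stem, compression = filename, "uncompressed"
    for ext, name in COMPRESSIONS.items():
        if filename.endswith(ext):
            stem, compression = filename[: -len(ext)], name
            break
    for ext, name in ENCODINGS.items():
        if stem.endswith(ext):
            return name, compression
    raise ValueError(f"unrecognized BWT extension: {filename}")


print(describe_bwt_output("my-reads.bwt"))
print(describe_bwt_output("my-reads.bwt4.bgz"))
```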
- The accompanying primary map file (.pri|.pri.gz|.pri.bgz) is a plain list of (position, string-id) pairs, in either ASCII or binary form. The ASCII file has the form:
* #PRI
* position[1] string[1]
* ...
* position[n] string[n]
*
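- A minimal Python sketch for reading the ASCII form shown above. The function name `parse_pri_ascii` is hypothetical, and the sketch assumes that each record is a pair of whitespace-separated decimal integers on its own line and that the file has already been decompressed.

```python
def parse_pri_ascii(text):
    """Parse the contents of an ASCII .pri file into (position, string_id) pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#PRI":
        raise ValueError("missing #PRI header")
    pairs = []
    for ln in lines[1:]:
        # Assumption: two whitespace-separated decimal fields per line.
        position, string_id = ln.split()
        pairs.append((int(position), int(string_id)))
    return pairs


example = "#PRI\n2 0\n4 1\n"
print(parse_pri_ascii(example))  # [(2, 0), (4, 1)]
```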
- The binary file has the format:
* char[4] header = "PRIB";
* struct { uint64 position; uint32 string_id; } pairs[n];
*
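- The binary form can be read along the same lines. The layout above does not state the byte order or whether the struct is padded, so the sketch below treats both as assumptions: it reads little-endian fields and takes the record size as a parameter (12 bytes if the pairs are packed, 16 bytes if the compiler padded the struct to 8-byte alignment). The function name `parse_pri_binary` is hypothetical.

```python
import struct


def parse_pri_binary(data, record_size=12):
    """Parse the bytes of an uncompressed binary .pri file.

    Assumptions: little-endian fields; record_size is 12 (packed)
    or 16 (struct padded to 8-byte alignment).
    """
    if data[:4] != b"PRIB":
        raise ValueError("missing PRIB header")
    body = data[4:]
    if len(body) % record_size != 0:
        raise ValueError("file size does not match the assumed record size")
    pairs = []
    for offset in range(0, len(body), record_size):
        # uint64 position followed by uint32 string_id; any padding is skipped.
        position, string_id = struct.unpack_from("<QI", body, offset)
        pairs.append((position, string_id))
    return pairs


# Synthetic input built in memory, not real nvSetBWT output.
packed = b"PRIB" + struct.pack("<QI", 2, 0) + struct.pack("<QI", 4, 1)
print(parse_pri_binary(packed))  # [(2, 0), (4, 1)]
```

- If the size check fails with one record size, trying the other is a quick way to find out which layout a given file uses.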
Details
- nvSetBWT implements a novel algorithm for the BWT construction of very large string sets, called set-bwte:
http://arxiv.org/pdf/1410.0562.pdf
- The algorithm can be considered an adaptation of Ferragina's serial bwte algorithm to string sets and massive parallelism. It is well suited to processing reads of arbitrary length, and its design allows incremental updates, though this option is not yet implemented.