This module defines a series of classes to represent and operate on 2bit FMindices, both from the host and device CUDA code.
Rank Dictionaries
 A rank_dictionary is a data structure that, given a text T, allows to count the number of occurrences of any given character c in a given prefix T[0,i] in O(1) time.
 rank_dictionary is storagefree, in the sense it doesn't directly hold any allocated data  hence it can be instantiated both on host and device datastructures, and it can be conveniently passed as a kernel parameter (provided its template parameters are also PODs).
 It supports the following functions:
Function  Inputs  Description 
rank()
 dict, i, c  return the number of occurrences of a given character c in the prefix [0,i] 
rank()
 dict, range, c  return the number of occurrences of a given character c in the prefixes [0,range.x] and [0,range.y] 
rank_all()
 dict, i  return the number of occurrences of all characters of the alphabet in the prefix [0,i] 
rank_all()
 dict, range  return the number of occurrences of all characters of the alphabet in the prefixes [0,range.x] and [0,range.y] 
 NOTE: the O(1) time refers to the complexity in the length n of the text; more properly, if the alphabet contains s characters (i.e. if the number of bits per symbol is b = log(s)), the complexity is O(log(s)). Similarly, the amount of space needed for a sparse occurrence table is O(n s). In other words, the amount of space is exponential in the number of bits per character.
 For a more compact data structure requiring O(n log(s)) storage, useful with larger alphabets, please refer to Wavelet Trees.
Sampled Suffix Arrays
 A Sampled Suffix Array is a succint suffix array which has been sampled at a subset of its indices. Ferragina & Manzini showed that given such a data structure and the BWT of the original text it is possible to reconstruct the missing locations. Two such data structures have been proposed, with different tradeoffs:
 one storing only the entries that are a multiple of K, { SA[i]  SA[i] % K = 0 }
 one storing only the entries whose index is a multiple of K, { SA[i]  i % K = 0 }
 NVBIO provides both:

 Unlike for the rank_dictionary, which is a storagefree class, these classes own the (internal) storage needed to represent the underlying data structures, which resides on the host. Similarly, the module provides some counterparts that hold the corresponding storage for the device:

 While these classes hold device data, they are meant to be used from the host and cannot be directly passed to CUDA kernels. Plain views (see Host & Device), or contexts, can be obtained with the usual plain_view() function.
 The contexts expose the following interface:
struct SSA
{
bool has(
const uint32 i)
const
}
 Detailed documentation can be found in the Sampled Suffix Arrays module documentation.
FMIndices
 An fm_index is a selfcompressed text index as described by Ferragina & Manzini. It is built on top of the following ingredients:
 Given the above, it allows to count and locate all the occurrences in T of arbitrary patterns P in O(length(P)) time. Moreover, it does so with an incremental algorithm that proceeds character by character, a fundamental property that allows to implement sophisticated pattern matching algorithms based on backtracking.
 fm_index is storagefree, in the sense it doesn't directly hold any allocated data  hence it can be instantiated both on host and device datastructures, and it can be conveniently passed as a kernel parameter (provided its template parameters are also PODs).
 It supports the following functions:
Function  Inputs  Description 
rank()
 fmi, i, c  return the number of occurrences of a given character c in the prefix [0,i] 
rank_all()
 fmi, i  return the number of occurrences of all characters of the alphabet in the prefix [0,i] 
match()
 fmi, pattern  return the SA range of occurrences of a given pattern 
match_reverse()
 fmi, pattern  return the SA range of occurrences of a given reversed pattern 
locate()
 fmi, i  given a suffix array coordinate i, return its linear coordinate SA[i] 
Bidirectional FMindices
 NVBIO also supports forward and backwards extension using bidirectional FMindices:
 extend_forward() : extend the range corresponding to a pattern P to that of the pattern Pc
 extend_backwards() : extend the range corresponding to a pattern P to that of the pattern cP
 Instead of requiring an adhoc data structure, bidirectional queries are supported by simply using both a forward and a reverse FMindex. Note that extension can be done without a sampled suffix array, so that there's no need to store two of them: in practice, the FMindices can also be of type fm_index <RankDictionary,null_type>.
Batch Filtering
 Performing massively parallel FMindex queries requires careful load balancing, as finding all occurrences of a given set of strings in a text is a onetomany process with variablerate data expansion. NVBIO offers simple host and device contexts to rank a set of strings and enumerate all their occurrences automatically:

 Here is an example showing how to extract a set of seeds from a string set and find their occurrences on the device:
template <typename fm_index_type>
void search_seeds(
const fm_index_type fm_index,
const string_set_type string_set,
{
string_set,
uniform_seeds_functor<>( seed_len, seed_interval ),
seed_coords );
typedef InfixSet<string_set_type, const string_set_infix_coord_type*> seed_set_type;
seed_set_type seeds(
n_seeds,
string_set,
FMIndexFilterDevice fm_filter;
const uint64 n_hits = fm_filter.rank( fm_index, seeds );
for (
uint64 hits_begin = 0; hits_begin < n_hits; hits_begin += batch_size)
{
fm_filter.locate(
hits_begin,
hits_end,
hits.begin() );
...
}
}
MEM Filtering
 Additionally to the exact matching filters above, NVBIO also some provides builtin functionality to find all (Super) Maximal Extension Matches in a string or a stringset:
 find_kmems() : a host/device perthread function to find all kMEMs overlapping a given base of a pattern string
 find_threshold_kmems() : a host/device perthread function to find all kMEMs overlapping a given base of a pattern string for all threshold values of k
 MEMFilterHost : a parallel host context to enumerate all MEMs of a stringset
 MEMFilterDevice : a parallel device context to enumerate all MEMs of a stringset
 The filters are analogous to the ones introduced in the previous section, except that rather than finding exact matches for each string in a set, they will find all their MEMs or SMEMs.
Performance
 The graphs below show the performance of exact and approximate matching using the FMindex on the CPU and GPU, searching for 32bp fragments inside the whole human genome. Approximate matching is performed with the hamming_backtrack() function, in this case allowing up to 1 mismatch per fragment:
Technical Overview
 A complete list of the classes and functions in this module is given in the FMIndex documentation.