CUB
cub::DeviceSpmv Struct Reference

Detailed description

DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV).

Overview
The SpMV computation performs the matrix-vector operation y = alpha*A*x + beta*y, where:
  • A is an m×n sparse matrix whose non-zero structure is specified in compressed-storage-row (CSR) format (i.e., three arrays: values, row_offsets, and column_indices)
  • x and y are dense vectors
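  • alpha and beta are scalar multiplicands (the CsrMV method documented below computes the special case y = A*x, i.e., alpha = 1 and beta = 0)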
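To make the CSR layout and the operation above concrete, the following is a minimal host-side reference sketch of y = alpha*A*x + beta*y. The function name csr_spmv_reference and the use of std::vector are assumptions made for illustration. It is a sequential CPU loop, not the algorithm used by cub::DeviceSpmv.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for y = alpha*A*x + beta*y with A stored in CSR form.
// Hypothetical helper for illustration; not part of CUB.
template <typename ValueT>
void csr_spmv_reference(const std::vector<ValueT>& values,
                        const std::vector<int>& row_offsets,
                        const std::vector<int>& column_indices,
                        const std::vector<ValueT>& x,
                        std::vector<ValueT>& y,
                        ValueT alpha, ValueT beta)
{
    // One offset per row, plus a final entry marking the end of the last row.
    assert(row_offsets.size() == y.size() + 1);
    assert(values.size() == column_indices.size());

    const std::size_t num_rows = y.size();
    for (std::size_t row = 0; row < num_rows; ++row) {
        // Dot product of this sparse row with the dense vector x.
        ValueT dot = ValueT(0);
        for (int k = row_offsets[row]; k < row_offsets[row + 1]; ++k) {
            dot += values[k] * x[column_indices[k]];
        }
        y[row] = alpha * dot + beta * y[row];
    }
}
```

Calling the sketch with alpha = 1 and beta = 0 gives the plain product y = A*x, which can serve as a host-side check for small inputs.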
Usage Considerations
  • Dynamic parallelism. DeviceSpmv methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Static Public Methods

CSR matrix operations
template<typename ValueT >
static CUB_RUNTIME_FUNCTION
cudaError_t 
CsrMV (void *d_temp_storage, size_t &temp_storage_bytes, ValueT *d_values, int *d_row_offsets, int *d_column_indices, ValueT *d_vector_x, ValueT *d_vector_y, int num_rows, int num_cols, int num_nonzeros, cudaStream_t stream=0, bool debug_synchronous=false)
 This function performs the matrix-vector operation y = A*x.
 

Member Function Documentation

template<typename ValueT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSpmv::CsrMV (
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    ValueT *d_values,
    int *d_row_offsets,
    int *d_column_indices,
    ValueT *d_vector_x,
    ValueT *d_vector_y,
    int num_rows,
    int num_cols,
    int num_nonzeros,
    cudaStream_t stream = 0,
    bool debug_synchronous = false
)
inline static

This function performs the matrix-vector operation y = A*x.

Snippet
The code snippet below illustrates SpMV upon a 9x9 CSR matrix A representing a 3x3 lattice (24 non-zeros).
#include <cub/cub.cuh>   // or equivalently <cub/device/device_spmv.cuh>

// Declare, allocate, and initialize device-accessible pointers for input matrix A, input vector x,
// and output vector y
int    num_rows = 9;
int    num_cols = 9;
int    num_nonzeros = 24;
float* d_values;          // e.g., [1, 1, 1, 1, 1, 1, 1, 1,
                          //        1, 1, 1, 1, 1, 1, 1, 1,
                          //        1, 1, 1, 1, 1, 1, 1, 1]
int*   d_column_indices;  // e.g., [1, 3, 0, 2, 4, 1, 5, 0,
                          //        4, 6, 1, 3, 5, 7, 2, 4,
                          //        8, 3, 7, 4, 6, 8, 5, 7]
int*   d_row_offsets;     // e.g., [0, 2, 5, 7, 10, 14, 17, 19, 22, 24]
float* d_vector_x;        // e.g., [1, 1, 1, 1, 1, 1, 1, 1, 1]
float* d_vector_y;        // e.g., [ , , , , , , , , ]
...

// Determine temporary device storage requirements
// (with d_temp_storage == NULL, this call only writes temp_storage_bytes)
void*  d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes, d_values,
    d_row_offsets, d_column_indices, d_vector_x, d_vector_y,
    num_rows, num_cols, num_nonzeros);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run SpMV
cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes, d_values,
    d_row_offsets, d_column_indices, d_vector_x, d_vector_y,
    num_rows, num_cols, num_nonzeros);

// d_vector_y <-- [2, 3, 2, 3, 4, 3, 2, 3, 2]
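The example arrays in the snippet can be reproduced on the host before copying them to the device. The sketch below builds the CSR adjacency structure of a width x height lattice in which each node is connected to its up, left, right, and down neighbors. The names LatticeCsr and build_lattice_csr are hypothetical and chosen only for this illustration.

```cpp
#include <cassert>
#include <vector>

// CSR arrays for the adjacency matrix of a grid lattice.
// The struct and function names are hypothetical, chosen for this sketch.
struct LatticeCsr {
    int num_rows;
    int num_cols;
    std::vector<int>   row_offsets;     // num_rows + 1 entries
    std::vector<int>   column_indices;  // one entry per nonzero
    std::vector<float> values;          // one entry per nonzero, all 1
};

LatticeCsr build_lattice_csr(int width, int height)
{
    assert(width > 0 && height > 0);
    LatticeCsr csr;
    csr.num_rows = width * height;
    csr.num_cols = width * height;
    csr.row_offsets.push_back(0);
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            const int node = r * width + c;
            // Visit neighbors in ascending index order: up, left, right, down.
            if (r > 0)          csr.column_indices.push_back(node - width);
            if (c > 0)          csr.column_indices.push_back(node - 1);
            if (c < width - 1)  csr.column_indices.push_back(node + 1);
            if (r < height - 1) csr.column_indices.push_back(node + width);
            // Record where the next row starts.
            csr.row_offsets.push_back(static_cast<int>(csr.column_indices.size()));
        }
    }
    csr.values.assign(csr.column_indices.size(), 1.0f);
    return csr;
}
```

For build_lattice_csr(3, 3) this yields 24 nonzeros with the same row offsets and column indices as listed in the snippet. The resulting vectors would then be copied to device memory (for example with cudaMemcpy) before calling CsrMV.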
Template Parameters
ValueT  [inferred] Matrix and vector value type (e.g., float, double, etc.)
Parameters
[in] d_temp_storage  Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out] temp_storage_bytes  Reference to the size in bytes of the d_temp_storage allocation.
[in] d_values  Pointer to the array of num_nonzeros values of the corresponding nonzero elements of matrix A.
[in] d_row_offsets  Pointer to the array of num_rows + 1 offsets demarcating the start of every row in d_column_indices and d_values (with the final entry being equal to num_nonzeros).
[in] d_column_indices  Pointer to the array of num_nonzeros column indices of the corresponding nonzero elements of matrix A. (Indices are zero-based.)
[in] d_vector_x  Pointer to the array of num_cols values corresponding to the dense input vector x.
[out] d_vector_y  Pointer to the array of num_rows values corresponding to the dense output vector y.
[in] num_rows  Number of rows of matrix A.
[in] num_cols  Number of columns of matrix A.
[in] num_nonzeros  Number of nonzero elements of matrix A.
[in] stream  [optional] CUDA stream to launch kernels within. Default is stream 0.
[in] debug_synchronous  [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
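The constraints stated in the parameter descriptions can be checked on the host before the arrays are copied to the device. The following sketch is one way to do so. The function name csr_inputs_look_valid is hypothetical, and the requirement that the first offset be 0 is an assumption taken from the example data rather than from the parameter text.

```cpp
#include <cstddef>
#include <vector>

// Returns true when the CSR index arrays are consistent with the parameter
// descriptions above. Hypothetical helper for illustration; not part of CUB.
bool csr_inputs_look_valid(const std::vector<int>& row_offsets,
                           const std::vector<int>& column_indices,
                           int num_rows, int num_cols, int num_nonzeros)
{
    if (num_rows < 0 || num_cols < 0 || num_nonzeros < 0) return false;

    // d_row_offsets holds num_rows + 1 entries; d_column_indices holds num_nonzeros.
    if (row_offsets.size() != static_cast<std::size_t>(num_rows) + 1) return false;
    if (column_indices.size() != static_cast<std::size_t>(num_nonzeros)) return false;

    // Assumption: offsets start at 0, as in the snippet's example data.
    if (row_offsets.front() != 0) return false;
    // The final entry equals num_nonzeros.
    if (row_offsets.back() != num_nonzeros) return false;

    // Offsets must not decrease from one row to the next.
    for (std::size_t i = 1; i < row_offsets.size(); ++i) {
        if (row_offsets[i] < row_offsets[i - 1]) return false;
    }
    // Column indices are zero-based and must address a valid entry of x.
    for (int col : column_indices) {
        if (col < 0 || col >= num_cols) return false;
    }
    return true;
}
```

A check like this is cheap relative to the cost of debugging out-of-bounds reads on the device, but it is optional and adds a pass over the index arrays.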

The documentation for this struct was generated from the following file: