- This module contains a series of classes that implement a variety of CUDA work-queues. A work-queue is an object that allows executing (or consuming) a stream of work items according to different schedulers. Notably, there are two distinguishing features:
- continuations: work items can produce continuations; breaking up long computations into shorter pieces and using smart schedulers can yield better utilization when the execution time of individual work items is highly non-uniform;
- capacity constraints: queues can be assigned a maximum capacity, which can be used to control the amount of resources consumed to execute the work items in parallel (e.g. if each work item needs 1MB of temporary storage, on a 4GB GPU one might only execute 4096 items in parallel).
Work-Streams
- The user of these classes has to specify a WorkStream class responsible for feeding work to the queue in the shape of a nested WorkStream::WorkUnit class. The latter is responsible for specifying the data and execution of each unit. WorkStream has to implement the following interface:
uint32 size() const
void get(const uint32 i, WorkUnit* unit, const uint2 queue_slot) const
- queue_slot specifies a (queue position, queue id) pair, assuming there can be one input and one output continuation queue which are swapped between iterations. Knowing the queue slot can be useful to bind external data to a work-unit.
- When the method WorkQueue::consume( stream ) is called, the queue will launch a kernel to consume all WorkUnits in the stream. WorkUnit has to implement a single method:
bool WorkUnit::run(const WorkStream& context)
- which should run the associated work and indicate whether the unit has finished execution, or whether it has produced a continuation (stored in the WorkUnit itself) that has to be run further. The WorkQueue will automatically queue the continuation for later execution.
- Optionally, the class can also be passed a WorkMover which is responsible for moving external data attached to any WorkUnit when its continuation gets assigned a new execution slot. This must implement a method:
void move(
    const uint2 src_slot, WorkUnit* src_unit,
    const uint2 dst_slot, WorkUnit* dst_unit) const;
Example
struct MyWorkStream;

struct MyWorkUnit
{
    __device__ MyWorkUnit(const uint32 _i) : i(_i) {}

    __device__ bool run(MyWorkStream& stream);

private:
    uint32 i;
};
struct MyWorkStream
{
    MyWorkStream(const uint32 _size) : m_size(_size) {}

    __host__ __device__
    uint32 size() const { return m_size; }

    __device__
    void get(const uint32 i, MyWorkUnit* unit, const uint2 execution_slot) const;

private:
    uint32 m_size;
};
__device__
void MyWorkStream::get(const uint32 i, MyWorkUnit* unit, const uint2 execution_slot) const
{
    *unit = MyWorkUnit(i);
}
__device__ bool MyWorkUnit::run(MyWorkStream&)
{
    // odd values produce a continuation: halve and keep running
    if (i & 1) { i /= 2; return true; }
    return false;
}
{
    MyWorkStream work_stream( 1024*1024 );

    cuda::WorkQueue<cuda::InplaceQueueTag, MyWorkUnit> work_queue;

    work_queue.consume( work_stream );
}
Work-Queue Schedulers
The WorkQueue class is parameterized by a tag template parameter specifying the scheduler. The available schedulers are: