Hello Renderer!

Top: Contents

We can now go on to writing our first "Hello World" renderer in Fermat. We'll start step-by-step with a renderer that implements some basic path tracing.
Let's start from the class definition:
#include <renderer_interface.h>
#include <buffers.h>
#include <tiled_sequence.h>
// A "Hello Path Tracing" renderer
void init(int argc, char** argv, RenderingContext& renderer);
void render(const uint32 instance, RenderingContext& renderer);
void destroy() { delete this; }
So far, there should be hardly any surprises. We simply derived our class from the RendererInterface, and declared that we'll implement both its RendererInterface::init() and RendererInterface::render() methods. Our renderer will also have some options, which we'll pack into a simple struct:

<< Options Declaration >> :=

// our "Hello Path Tracing" options
uint32 max_path_length;
// default constructor
HelloPTOptions() : max_path_length(6) {}
// do some simple option parsing
void parse(const int argc, char** argv)
for (int i = 0; i < argc; ++i)
if (strcmp(argv[i], "-path-length") == 0)
max_path_length = atoi(argv[++i]);
Now, we are ready to declare the members of our class: its options, some device memory buffer to hold temporary data, and a sampling sequence.

<< Members Declaration >> :=

HelloPTOptions m_options; // the rendering options
DomainBuffer<CUDA_BUFFER, uint8> m_memory_pool; // some device storage for queues and such...
TiledSequence m_sequence; // a nice sampling sequence
Once the declaration is ready, we can start with the definition of the init method:
Parsing options is straightforward:

<< Parse options >> :=

// parse the options
m_options.parse( argc, argv );
The next step is allocating all temporary storage we'll need. Our renderer will be organized as a pipeline of separate ray tracing and shading stages, implemented as parallel kernels communicating through global memory queues. In order to simplify our life we'll make use of some prepackaged queue definition provided by Fermat, the PTRayQueue. These queues are defined by an array of rays, an array of ray hits, an array of sample weights associated to each ray, and finally an array of pixel descriptors. For reasons that will be explained later on, we'll need three such queues, and each of them will need to contain enough storage for as many pixels there are in our target framebuffer.

<< Alloc queue storage >> :=

// pre-alloc some queue storage
// keep track of how much storage we'll need
PTRayQueue input_queue;
PTRayQueue scatter_queue;
PTRayQueue shadow_queue;
alloc_queues( n_pixels, input_queue, scatter_queue, shadow_queue, arena );
fprintf(stderr, " allocating queue storage: %.1f MB\n", float(arena.size) / (1024*1024));
where alloc_queues() is a utility function defined as:
const uint32 n_pixels,
PTRayQueue& input_queue,
PTRayQueue& scatter_queue,
PTRayQueue& shadow_queue,
input_queue.rays = arena.alloc<MaskedRay>(n_pixels);
input_queue.hits = arena.alloc<Hit>(n_pixels);
input_queue.weights = arena.alloc<float4>(n_pixels);
input_queue.pixels = arena.alloc<uint4>(n_pixels);
input_queue.size = arena.alloc<uint32>(1);
scatter_queue.rays = arena.alloc<MaskedRay>(n_pixels);
scatter_queue.hits = arena.alloc<Hit>(n_pixels);
scatter_queue.weights = arena.alloc<float4>(n_pixels);
scatter_queue.pixels = arena.alloc<uint4>(n_pixels);
scatter_queue.size = arena.alloc<uint32>(1);
shadow_queue.rays = arena.alloc<MaskedRay>(n_pixels);
shadow_queue.hits = arena.alloc<Hit>(n_pixels);
shadow_queue.weights = arena.alloc<float4>(n_pixels);
shadow_queue.pixels = arena.alloc<uint4>(n_pixels);
shadow_queue.size = arena.alloc<uint32>(1);
Similarly, for sampling we'll use the TiledSequence class, considering that we will need to allocate up to 6 random numbers per path vertex:

<< Initialize sampler >> :=

// build the set of samples assuming 6 random numbers per path vertex and a tile size of 256 pixels
const uint32 n_dimensions = 6 * (m_options.max_path_length + 1);
const uint32 tile_size = 256;
fprintf(stderr, " initializing sampler: %u dimensions\n", n_dimensions);
m_sequence.setup(n_dimensions, tile_size);
And finally, we'll have to initialize our mesh light sampler:

<< Initialize mesh lights >> :=

// initialize the mesh lights sampler
renderer.m_mesh_lights.init( n_pixels, renderer );
Now that we have covered initialization, we can move on the definition of the render() method. We'll start by defining a context class that will be passed down to our device kernels. This class will encapsulate the renderer state, including views of the sampler, queues and so on.
struct HelloPTContext
HelloPTOptions options; // the options
TiledSequenceView sequence; // the sampling sequence
float frame_weight; // the weight given to samples in this frame
uint32 in_bounce; // the current path tracing bounce
PTRayQueue in_queue; // the input queue
PTRayQueue shadow_queue; // the scattering queue
PTRayQueue scatter_queue; // the shadow queue
At this point we can start sketching the main rendering algorithm. The idea is that, after some proper initializations, we'll generate primary rays and enqueue them to the input queue, and then start executing a pipeline, where we:
  1. trace rays
  2. shade the ray hits (i.e. the new path vertices), potentially generating new shadow and scattering rays
  3. trace any queued shadow rays
  4. shade the shadow hits
  5. and swap the input and scattering queues
which in code becomes:
Initializations are fairly trivial: we need to really allocate the queues out of the pre-allocated memory pool, initialize the sampling sequence, setup our context, and finally rescale the render targets containing results up to frame instance by a factor of instance / (instance + 1), for blending in the new one:

<< Perform initializations >> :=

const uint2 res = renderer.res();
const uint32 n_pixels = res.x * res.y;
// carve an arena out of our pre-allocated memory pool
cugar::memory_arena arena( m_memory_pool.ptr() );
// alloc all the queues
PTRayQueue in_queue;
PTRayQueue scatter_queue;
PTRayQueue shadow_queue;
arena );
// fetch a view of the renderer
RenderingContextView renderer_view = renderer.view(instance);
// fetch the ray tracing context
RTContext* rt_context = renderer.get_rt_context();
// setup the samples for this frame
// setup our context
HelloPTContext context;
context.options = m_options;
context.sequence = m_sequence.view();
context.frame_weight = 1.0f / (instance + 1);
context.in_queue = in_queue;
context.scatter_queue = scatter_queue;
context.shadow_queue = shadow_queue;
// rescale the previous render targets for blending/averaging in the new one
renderer.rescale_frame( instance );
In order to generate primary rays, we assume we have some kernel already available and a function to dispatch it:

<< Generate primary rays >> :=

// generate the primary rays
generate_primary_rays(context, renderer_view);
CUDA_CHECK(cugar::cuda::sync_and_check_error("generate primary rays"));
Now, we can start tackling the inner loop. Checking the size of the input queue is just a matter of copying the value of the queue's size field to the host:

<< Check input queue size and possibly bail-out >> :=

uint32 in_queue_size;
// fetch the amount of tasks in the queue
cudaMemcpy(&in_queue_size, context.in_queue.size, sizeof(uint32), cudaMemcpyDeviceToHost);
// check whether there's still any work left
if (in_queue_size == 0)
while tracing the rays can be done with a single RTContext::trace() call:

<< Trace rays inside the input queue >> :=

// trace the rays generated at the previous bounce
rt_context->trace(in_queue_size, (Ray*)context.in_queue.rays, context.in_queue.hits);
This step will generate a bunch of ray hit points, i.e. all we need to reconstruct the next wave of path vertices. Before we proceed to shading them, we'll have to clear the sizes of the shadow and scattering queues - again, just a matter of performing a memset in device memory:

<< Clear the scattering and shadow queues >> :=

// reset the output queue counters
cudaMemset(context.shadow_queue.size, 0x00, sizeof(uint32));
cudaMemset(context.scatter_queue.size, 0x00, sizeof(uint32));
Finally, we can proceed to shade the new path vertices, trace any shadow rays that their shading might have generated, and finally swap the input and scattering queues. For this, we'll assume two more functions exist, shade_vertices() and resolve_occlusion().

<< Shade ray hits >> :=

// perform lighting at this bounce
shade_vertices(in_queue_size, context, renderer_view);
CUDA_CHECK(cugar::cuda::sync_and_check_error("shade hits"));

<< Trace and shade shadow rays >> :=

// trace & accumulate occlusion queries
// fetch the amount of tasks in the queue
uint32 shadow_queue_size;
cudaMemcpy(&shadow_queue_size, context.shadow_queue.size, sizeof(uint32), cudaMemcpyDeviceToHost);
if (shadow_queue_size)
// trace the rays
rt_context->trace_shadow(shadow_queue_size, (MaskedRay*)context.shadow_queue.rays, context.shadow_queue.hits);
// shade the results
resolve_occlusion(shadow_queue_size, context, renderer_view);
CUDA_CHECK(cugar::cuda::sync_and_check_error("resolve occlusion"));

<< Swap the intput and scattering queues >> :=

// swap the input and output queues
std::swap(context.in_queue, context.scatter_queue);
At this point, we just need to fill in the missing details, and define the generate_primary_rays(), shade_vertices() and resolve_occlusion() kernels.

Generating Primary Rays

Generating primary rays will require us to write a fairly simple kernel, and a function to dispatch it:
// a kernel to generate the primary rays
HelloPTContext context,
// calculate the 2d pixel index given from the thread id
const uint2 pixel = make_uint2(
threadIdx.x + blockIdx.x*blockDim.x,
threadIdx.y + blockIdx.y*blockDim.y );
// check whether the pixel/thread is inside the render target
if (pixel.x >= renderer.res_x || pixel.y >= renderer.res_y)
// calculate a 1d pixel index
const int idx = pixel.x + pixel.y*renderer.res_x;
const MaskedRay ray = generate_primary_ray( context, renderer, pixel );
// write the output ray
context.in_queue.rays[idx] = ray;
// write the path weight
context.in_queue.weights[idx] = cugar::Vector4f(1.0f, 1.0f, 1.0f, 1.0f);
// write the pixel index
context.in_queue.pixels[idx] = make_uint4( idx, uint32(-1), uint32(-1), uint32(-1) );
// use thread 0 write out the total number of primary rays in the queue descriptor
if (pixel.x == 0 && pixel.y)
*context.in_queue.size = renderer.res_x * renderer.res_y;
// dispatch the generate_primary_rays kernel
HelloPTContext context,
dim3 blockSize(32, 16);
dim3 gridSize(cugar::divide_ri(renderer.res_x, blockSize.x), cugar::divide_ri(renderer.res_y, blockSize.y));
generate_primary_rays_kernel << < gridSize, blockSize >> > (context, renderer);

Shading Vertices

We'll start from a single-threaded device function that will be called each time we process a vertex. This function will receive an incoming ray and a Hit object, and use those to reconstruct a full path vertex. After the vertex is set up, it will proceed initializing the local sample sequence, performing next-event estimation, evaluating the local emission at the hit in the direction of the incoming ray, and finally evaluating any scattering/absorption events.
// shade a path vertex
// \param pixel_index the 1d pixel index associated with this path
// \param pixel the 2d pixel coordinates
// \param ray the incoming ray direction
// \param hit the hit point defining this vertex
// \param w the current path weight
// \param p_prev the solid angle probability of the last scattering event
// \return true if the path is continued, false if it terminates here
HelloPTContext& context,
const uint32 pixel_index,
const uint2 pixel,
const MaskedRay& ray,
const Hit hit,
const cugar::Vector3f w,
const float p_prev)
// check if this is a valid hit
if (hit.t > 0.0f && hit.triId >= 0)
return false; // this path terminates here
In order to setup the vertex, we'll use Fermat's EyeVertex class, an object representing a vertex sampled from the eye (i.e. using forward path tracing), which helps interpolating vertex attributes, keep tracking of the path sampling probabilities, setting up vertex Bsdf (evaluating the material, including any textures), and so on.

<< Setup path vertex >> :=

// setup an eye-vertex given the input ray, hit point, and path weight
ev.setup(ray, hit, w.xyz(), cugar::Vector4f(0.0f), context.in_bounce, renderer);
Once we have all the vertex information, we can write out any G-buffer information on primary rays (i.e. when we are processing the zero-th bounce):

<< Write out G-buffer on primary hits >> :=

// write out gbuffer information
if (context.in_bounce == 0)
renderer.fb.gbuffer.geo(pixel_index) = GBufferView::pack_geometry(ev.geom.position, ev.geom.normal_s);
renderer.fb.gbuffer.uv(pixel_index) = make_float4(hit.u, hit.v, ev.geom.texture_coords.x, ev.geom.texture_coords.y);
renderer.fb.gbuffer.tri(pixel_index) = hit.triId;
renderer.fb.gbuffer.depth(pixel_index) = hit.t;
Initializing the sampling coordinates requires fetching a batch 6 random numbers (3 for next-event estimation and 3 for scattering) from the TiledSequenceView sampling sequence, which again, can be done using a prepackaged utility function: vertex_sample()

<< Initialize sampling sequence >> :=

// initialize our shifted sampling sequence
float samples[6];
for (uint32 i = 0; i < 6; ++i)
samples[i] = vertex_sample(pixel, context, i);

Next-Event Estimation

Next-event estimation is trickier business. It will involve four basic steps: sampling a point on the light sources (in this case, the scene's mesh), evaluating the the EDF in the direction of the current vertex and the local BSDF in the direction joining the current vertex to the sampled point, calculating the sample weight, and finally enqueuing a shadow ray to check if the sample is occluded or visible. Notice that NEE essentially adds a vertex to a path, so if we allow a maximum path length of N, we can only perform it if our current path is shorter than N-1.

<< Perform Next-Event Estimation >> :=

// perform next-event estimation to compute direct lighting
if (context.in_bounce + 2 <= context.options.max_path_length)
// fetch the sampling dimensions
const float z[3] = { samples[0], samples[1], samples[2] }; // use dimensions 0,1,2
VertexGeometryId light_vertex;
VertexGeometry light_vertex_geom;
float light_pdf;
Edf light_edf;
// sample the light source surface
renderer.mesh_light.sample(z, &light_vertex.prim_id, &light_vertex.uv, &light_vertex_geom, &light_pdf, &light_edf);
// join the light sample with the current vertex
cugar::Vector3f out = (light_vertex_geom.position - ev.geom.position);
const float d2 = fmaxf(1.0e-8f, cugar::square_length(out));
// normalize the outgoing direction
out *= rsqrtf(d2);
// evaluate the light's EDF, predivided by the sample pdf
const cugar::Vector3f f_L = light_edf.f(light_vertex_geom, light_vertex_geom.position, -out) / light_pdf;
cugar::Vector3f f_s(0.0f);
float p_s(0.0f);
// evaluate the surface BSDF f() and its sampling pdf p() in one go
ev.bsdf.f_and_p(ev.geom, ev.in, out, f_s, p_s, cugar::kProjectedSolidAngle);
Computing the sample value would require multiplying together the current path weight, w, the EDF, f_L, the BSDF, f_s and the geometric throughput term, G, and divide everything by the sample pdf. In practice, we have pre-divided f_L by the pdf light_pdf, so we can now avoid this last division. However, as our path sampler will also be able to hit the light sources, we'll use multiple importance sampling (in short, MIS) between NEE and the possibility of hitting the same light on the same point.

<< Compute the sample value >> :=

// evaluate the geometric term
const float G = fabsf(cugar::dot(out, ev.geom.normal_s) * cugar::dot(out, light_vertex_geom.normal_s)) / d2;
// perform MIS with the possibility of directly hitting the light source
const float p1 = light_pdf;
const float p2 = p_s * G;
const float mis_w = context.in_bounce > 0 ? mis_heuristic<MIS_HEURISTIC>(p1, p2) : 1.0f;
// calculate the cumulative sample weight, equal to f_L * f_s * G / p
const cugar::Vector3f out_w = w.xyz() * f_L * f_s * G * mis_w;
Finally, we can enqueue a shadow ray, carrying the pixel index and sample weight together with the ray itself.

<< Enqueue a shadow ray >> :=

if (cugar::max_comp(out_w) > 0.0f && cugar::is_finite(out_w))
// enqueue the output ray
MaskedRay out_ray;
out_ray.origin = ev.geom.position - ray.dir * 1.0e-4f; // shift back in space along the viewing direction
out_ray.dir = (light_vertex_geom.position - out_ray.origin); //out;
out_ray.mask = 0x2u;
out_ray.tmax = 0.9999f; //d * 0.9999f;
// append the ray to the shadow queue
context.shadow_queue.warp_append( pixel_index, out_ray, cugar::Vector4f(out_w, 0.0f) );

Evaluating Emissive Hits

Evaluating emissive surface hits allows us to have a second technique to form complete light paths joining the camera to the light sources, which is often far more efficient than NEE when the BSDF is glossy or close to specular. In principle, it is very similar to evaluating NEE, except that sampling a point on the light source has to be replaced by evaluating the pdf of generating it, which again we'll need to perform MIS. Also, in this case we will not need to enqueue any additional rays (since we have already landed on a light source), and can just add the weighted sample contribution to the framebuffer. Notice that we will further weight the sample by frame_weight = 1 / (instance + 1), in order to average together this pass with all the previous ones.

<< Evaluate emissive hits >> :=

// accumulate the emissive component along the incoming direction
VertexGeometry light_vertex_geom = ev.geom; // the light source geometry IS the current vertex geometry
float light_pdf;
Edf light_edf;
// calculate the pdf of sampling this point on the light source
renderer.mesh_light.map(hit.triId, cugar::Vector2f(hit.u, hit.v), light_vertex_geom, &light_pdf, &light_edf );
// evaluate the edf's output along the incoming direction
const cugar::Vector3f f_L = light_edf.f(light_vertex_geom, light_vertex_geom.position, ev.in);
const float d2 = fmaxf(1.0e-10f, hit.t * hit.t);
// compute the MIS weight with next event estimation at the previous vertex
const float G_partial = fabsf(cugar::dot(ev.in, light_vertex_geom.normal_s)) / d2;
// NOTE: G_partial doesn't include the dot product between 'in and the normal at the previous vertex
const float p1 = G_partial * p_prev; // NOTE: p_prev = p_proj * dot(in,normal)
const float p2 = light_pdf;
const float mis_w = context.in_bounce > 0 ? mis_heuristic<MIS_HEURISTIC>(p1, p2) : 1.0f;
// and accumulate the weighted contribution
const cugar::Vector3f out_w = w.xyz() * f_L * mis_w;
// and accumulate the weighted contribution
if (cugar::max_comp(out_w) > 0.0f && cugar::is_finite(out_w))
// accumulate the sample contribution to the image
add_in<false>(renderer.fb(FBufferDesc::COMPOSITED_C), pixel_index, out_w, context.frame_weight);

Evaluating Scattering and Absorption

The last bit of shade_vertex() involves sampling a scattering or an absorption event. For this, we will use the last 3 of the random numbers we chose, and rely on the scatter() utility function. This function will sample the BSDF, and return both the BSDF value, predivided by the projected sampling pdf, as well as the pdf itself and a Bsdf::ComponentType flag telling us which component has been sampled. If the flag is equal to Bsdf::kAbsorption, the path should be terminated.

<< Evaluate scattering and absorption >> :=

// compute a scattering event / trace a bounce ray
if (context.in_bounce + 1 < context.options.max_path_length)
// fetch the sampling dimensions
const float z[3] = { samples[3], samples[4], samples[5] }; // use dimensions 3,4,5
// sample a scattering event
cugar::Vector3f out(0.0f);
cugar::Vector3f g(0.0f);
float p(0.0f);
float p_proj(0.0f);
Bsdf::ComponentType out_comp(Bsdf::kAbsorption);
// compute a scattering direction
scatter(ev, z, out_comp, out, p, p_proj, g, true, false, false, Bsdf::kAllComponents);
// compute the output weight
cugar::Vector3f out_w = g * w.xyz();
if (p != 0.0f && cugar::max_comp(out_w) > 0.0f && cugar::is_finite(out_w))
// enqueue the output ray
MaskedRay out_ray;
out_ray.origin = ev.geom.position;
out_ray.dir = out;
out_ray.mask = __float_as_uint(1.0e-3f);
out_ray.tmax = 1.0e8f;
// track the solid angle probability of this scattering event
const float out_p = p;
// append the ray to the scattering queue
// notice that we pack the sample probability together with the sample value in a single
// float4, so as to allow a single 16-byte write into the output queue.
context.scatter_queue.warp_append( pixel_index, out_ray, cugar::Vector4f( out_w, out_p ) );
return true; // continue the path

Packaging it All Together

Now that our shade_vertex() function is complete, it remains to package it into a kernel which fetches tasks (i.e. vertices to be shaded) from the input queue, and executes them. Again, this is nothing complex: we'll spawn one thread per queue entry, and have each thread fetch one queue item and pass it to shade_vertex().
// shade vertices kernel
void shade_vertices_kernel(const uint32 in_queue_size, HelloPTContext context, RenderingContextView renderer)
const uint32 thread_id = threadIdx.x + blockIdx.x * blockDim.x;
if (thread_id < in_queue_size) // *context.in_queue.size
const uint32 pixel_index = context.in_queue.pixels[thread_id].x;
const MaskedRay ray = context.in_queue.rays[thread_id];
const Hit hit = context.in_queue.hits[thread_id];
const cugar::Vector4f w = context.in_queue.weights[thread_id];
const uint2 pixel = make_uint2(
pixel_index % renderer.res_x,
pixel_index / renderer.res_x
w.w );
// dispatch the shade hits kernel
void shade_vertices(const uint32 in_queue_size, HelloPTContext context, RenderingContextView renderer)
const uint32 blockSize(64);
const dim3 gridSize(cugar::divide_ri(in_queue_size, blockSize));
shade_vertices_kernel<<< gridSize, blockSize >>>( in_queue_size, context, renderer );

Resolving Occlusion

The next and last bit of our pipeline is a kernel that takes tasks from the shadow queue, representing next-event samples, and resolves their occlusion using the corresponding ray hits. If the samples are not occluded, they will be accumulated to their originating pixel.
// a kernel to resolve NEE samples' occlusion
void resolve_occlusion_kernel(
const uint32 shadow_queue_size,
HelloPTContext context,
const uint32 thread_id = threadIdx.x + blockIdx.x * blockDim.x;
if (thread_id < shadow_queue_size) // *context.shadow_queue.size
const uint32 pixel_index = context.shadow_queue.pixels[thread_id].x;
const Hit hit = context.shadow_queue.hits[thread_id];
const cugar::Vector4f w = context.shadow_queue.weights[thread_id];
if (hit.t < 0.0f) // add this sample if and only if there was no intersection
add_in<false>( renderer.fb(FBufferDesc::COMPOSITED_C), pixel_index, w.xyz(), context.frame_weight );
// dispatch the resolve_occlusion kernel
void resolve_occlusion(
const uint32 shadow_queue_size,
HelloPTContext context,
const uint32 blockSize(64);
const dim3 gridSize(cugar::divide_ri(shadow_queue_size, blockSize));
resolve_occlusion_kernel<<< gridSize, blockSize >>>( shadow_queue_size, context, renderer );

The Plugin

In order to get the new renderer picked up by the Fermat executable (fermat.exe), we need to write a plugin DLL that can be loaded at runtime. As anticipated, this DLL needs to export a single function, in this case registering the new HelloPT factory:


#include <hellopt.h>
#include <renderer.h>
// define the plugin entry point
extern "C" uint32 __declspec(dllexport) __stdcall register_plugin(RenderingContext& renderer)
// register the new renderer factory and return the assigned id
return renderer.register_renderer("hellopt", &HelloPT::factory);

We're Done!

Pat yourself on the shoulder, we have just finished writing our first sample path tracer in Fermat!
I am sure you'll have some unanswered questions in your head, though hopefully they are not way too many...
In fact, if you have already compiled the entire Fermat solution, and you have the dll placed in the same directory as Fermat's executable, you should be able to run the HelloPT plugin renderer interactively by just launching:
*  fermat.exe -view -plugin hellopt.dll -path-length 6 -i ../../models/bathroom2/bathroom.obj -c ../../models/bathroom2/camera2.txt -o output/hellopt.tga
and if all goes well, waiting long enough you should get an image like this:

Despite the fact that a lot of details were hidden behind some helper functions and classes (e.g. for sampling and evaluating BSDFs, or mesh emitters), this example is still rather low level. In the following pages we'll see that Fermat provides much higher level constructs and libraries to implement both forward and bidirectional path tracers, based on the realization that the underlying structure of these path samplers is more or less always the same, and that one typically only needs to customize their behaviour at specific points (e.g. specifying how exactly NEE is performed, or performing a custom action, like checking a cache, any time a new vertex is generated, or again specifying how the samples are weighted and finally consumed). We will also see that Fermat typically splits these libraries into two distinct components: a core library of singled-threaded device or host/device functions (i.e. functions meant to be called by and operate within an individual thread, that are still completely thread-safe - so there could be millions such threads working in parallel), and a higher level library of parallel kernels and host dispatch functions, typically providing a generalized skeleton of the queue-based pipeline mechanism we just described here.

Next: PTLib