curobo.examples.getting_started.feature_mapping module

Fuse neural image features into a TSDF map and query semantic regions.

This tutorial extends curobo.examples.getting_started.volumetric_mapping with a learned feature channel. cuRobo still integrates depth frames into a block-sparse Truncated Signed Distance Field (TSDF), but each RGB frame is also encoded by NVIDIA C-RADIO and passed to CameraObservation as feature_grid. The mapper fuses those patch features into the allocated TSDF blocks, so later queries can find parts of the 3D map that are visually or semantically similar.

Feature Integration

C-RADIO (Reduce All Domains Into One) distills multiple vision foundation models, including DINOv2, SAM, CLIP, and SigLIP, into one backbone. This example uses the C-RADIO v3-B checkpoint and its per-image patch embeddings in two beginner-friendly ways:

  • Project image or map features to RGB with Principal Component Analysis (PCA) so feature clusters can be inspected visually.

  • When the viewer is enabled, project block features through the fixed SigLIP adaptor and match them against text prompts such as "table" or "chair".

This example downloads C-RADIO v3-B through torch.hub on first use. The first run must be able to reach NVlabs/RADIO on GitHub, download the checkpoint, and install any missing RADIO dependencies.
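For reference, the download is a standard torch.hub call along these lines (a minimal sketch; the exact version key is an assumption and is normally resolved by resolve_torchhub_version below):

import torch

# Load C-RADIO through NVlabs/RADIO's torch.hub entry point.
# The version key 'c-radio_v3-b' is an assumed mapping of 'c-radio_v3-B'.
model = torch.hub.load("NVlabs/RADIO", "radio_model",
                       version="c-radio_v3-b", progress=True)
model = model.eval().to("cuda")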

By the end of this tutorial you will have:

  • Loaded an RGB-D sequence from Sun3D

  • Extracted C-RADIO v3-B patch features for each selected RGB frame

  • Fused depth, color, and learned features into a TSDF map

  • Saved side-by-side RGB | PCA(features) images for quick inspection

  • Visualized the map and highlighted blocks that match a text prompt when --visualize is enabled

Before starting

Install the extra feature mapping dependencies. If your environment needs Hugging Face authentication for checkpoint downloads, export HF_TOKEN before running the example:

export HF_TOKEN=<your_huggingface_token>
uv pip install timm transformers torchvision einops

How the mapper uses features

The feature mapper follows the same geometry path as the volumetric mapping tutorial, with one extra input: a lower-resolution grid of learned patch features from the RGB image.

digraph FeatureMapping {
    rankdir=LR;
    edge [color="#2B4162", fontsize=10];
    node [shape="box", style="rounded, filled", fontsize=12, color="#cccccc"];
    rgb [label="RGB image", color="#708090", fontcolor="white"];
    depth [label="Depth image", color="#708090", fontcolor="white"];
    camera [label="Camera pose\n+ intrinsics", color="#708090", fontcolor="white"];
    radio [label="C-RADIO\npatch features", color="#558c8c", fontcolor="white"];
    obs [label="CameraObservation\ndepth + RGB + feature_grid", color="#76b900", fontcolor="white"];
    mapper [label="Mapper.integrate()\nblock-sparse TSDF", color="#76b900", fontcolor="white"];
    blocks [label="TSDF blocks\ngeometry + color + features", color="#558c8c", fontcolor="white"];
    pca [label="PCA colors\nfeature clusters", color="#708090", fontcolor="white"];
    text [label="Text matching\nSigLIP adaptor", color="#708090", fontcolor="white"];
    rgb -> radio -> obs;
    depth -> obs;
    camera -> obs;
    obs -> mapper -> blocks;
    blocks -> pca;
    blocks -> text;
}

RGB-D feature mapping data flow
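In code, one fused frame follows the diagram above. A minimal sketch, assuming the tutorial's CRadioInference helper and hedging on CameraObservation's exact constructor arguments:

# Encode the RGB frame, attach the patch features, and integrate.
feats = radio.extract_patch_features(rgb)   # (H_p, W_p, D) patch features
obs = CameraObservation(
    depth_image=depth,                      # (H, W) metric depth
    rgb_image=rgb,                          # (H, W, 3) uint8 color
    feature_grid=feats,                     # lower-resolution learned features
    pose=camera_pose,                       # camera-to-world pose
    intrinsics=intrinsics,
)
mapper.integrate(obs)                       # fuse into the block-sparse TSDF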

Step 1: Download the dataset

This tutorial uses the same Sun3D indoor RGB-D scene as the volumetric mapping tutorial. It contains color images, depth maps, camera intrinsics, and ground-truth camera poses.

Quick start (downloads one scene, about 1400 MB):

wget http://3dvision.princeton.edu/projects/2016/3DMatch/downloads/rgbd-datasets/sun3d-mit_76_studyroom-76-1studyroom2.zip
mkdir -p datasets/sun3d
unzip sun3d-mit_76_studyroom-76-1studyroom2.zip -d datasets/sun3d

The extracted directory should look like:

datasets/sun3d/sun3d-mit_76_studyroom-76-1studyroom2/
    camera-intrinsics.txt
    <sequence_name>/
        000001.color.png
        000001.depth.png
        000001.pose.txt
        ...
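The intrinsics and per-frame poses are plain-text matrices, so they are easy to inspect. A hedged sketch, assuming the usual 3DMatch/Sun3D conventions (3x3 intrinsics, 4x4 camera-to-world poses, 16-bit depth PNGs in millimeters):

import numpy as np

root = "datasets/sun3d/sun3d-mit_76_studyroom-76-1studyroom2"
K = np.loadtxt(f"{root}/camera-intrinsics.txt")               # (3, 3) intrinsics
pose = np.loadtxt(f"{root}/<sequence_name>/000001.pose.txt")  # (4, 4) cam-to-world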

Step 2: Run a quick feature-fusion pass

Start with a small number of frames because C-RADIO inference is heavier than plain depth integration:

python -m curobo.examples.getting_started.feature_mapping \
    --root ./datasets/sun3d/sun3d-mit_76_studyroom-76-1studyroom2 \
    --num-frames 50 \
    --stride 10 \
    --save-pca

When --save-pca is enabled, the tutorial writes side-by-side RGB and feature-PCA panels to ~/.cache/curobo/examples/feature_mapping/. The colors are not object labels; they are a three-dimensional PCA projection of high-dimensional feature vectors, so nearby colors usually indicate similar visual embeddings.
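The PCA coloring itself is a small amount of tensor math. A minimal sketch of the idea, not the tutorial's exact implementation, using the same 2%/98% percentile normalization:

import torch

def pca_rgb(feats_flat: torch.Tensor) -> torch.Tensor:
    """Map (N, D) features to (N, 3) uint8 colors via a 3-component PCA."""
    x = feats_flat - feats_flat.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=3)       # v: (D, 3) principal directions
    proj = x @ v                              # (N, 3) projected features
    lo = proj.quantile(0.02, dim=0)           # robust per-channel range
    hi = proj.quantile(0.98, dim=0)
    rgb = ((proj - lo) / (hi - lo + 1e-8)).clamp(0.0, 1.0)
    return (rgb * 255.0).to(torch.uint8)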

Step 3: Inspect the map interactively

Add --visualize to open a Viser server at http://localhost:8080:

python -m curobo.examples.getting_started.feature_mapping \
    --root ./datasets/sun3d/sun3d-mit_76_studyroom-76-1studyroom2 \
    --num-frames 100 \
    --stride 5 \
    --visualize

The viewer shows:

  • /reconstruction/features_pca: occupied voxels colored by fused C-RADIO features projected through a map-wide PCA basis.

  • /reconstruction/rgb: occupied voxels colored by the TSDF color channel. This layer is hidden by default and can be toggled from the scene tree.

  • Current RGB and Current Feature PCA panels for the latest frame.

Feature Integration

Step 4: Try text matching

Use --visualize to open the Text Matching panel. The example uses the C-RADIO v3-B SigLIP adaptor for text queries:

python -m curobo.examples.getting_started.feature_mapping \
    --root ./datasets/sun3d/sun3d-mit_76_studyroom-76-1studyroom2 \
    --num-frames 100 \
    --stride 5 \
    --visualize

Enter a prompt in the panel to highlight the top matching TSDF blocks under /reconstruction/text_matched. Clear Matches demonstrates how matched blocks can be removed from the dynamic map. For a geometric clearing example, --clear-aabb xmin ymin zmin xmax ymax zmax clears all allocated blocks that intersect the given world-space bounds in meters.
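Under the hood, matching reduces to cosine similarity in the adaptor's text-aligned space. A hedged sketch using this tutorial's CRadioInference helpers (the value of k and the variable names are illustrative):

text_feat = radio.encode_text(["chair"])         # (1, D_teacher), L2-normalized
block_feat = radio.project_features(map_feats)   # (N, D_teacher), L2-normalized
scores = (block_feat @ text_feat.T).squeeze(1)   # (N,) cosine similarities
top_blocks = scores.topk(k=20).indices           # best-matching TSDF blocks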

Text Feature Alignment

Step 5: Check the output

When the tutorial finishes successfully you will see output similar to:

Loading Sun3D from ./datasets/sun3d...
Found 200 frames
Loading C-RADIO (c-radio_v3-B) via NVlabs/RADIO torch.hub...
Feature dim: 768
Mapper initialized: 64.0 MB
integrating: 100%|...

Mapper memory: 64.0 MB
PCA panels saved to: ~/.cache/curobo/examples/feature_mapping

class CRadioInference(model_name='c-radio_v3-B', device='cuda:0', text_adaptor_name=None)

Bases: object

Own all C-RADIO neural-network inference used by this tutorial.

The mapper itself is not a neural network: it fuses depth, color, and feature tensors into a TSDF map. This class keeps the learned pieces together so the rest of the example can treat them as three simple operations:

  1. extract patch features from an RGB image;

  2. optionally encode text with the requested RADIO adaptor;

  3. optionally project map features into the adaptor’s text-aligned space.
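A hedged sketch of those three operations in order (the adaptor name 'siglip' is an assumption; pass whatever text_adaptor_name your checkpoint exposes):

radio = CRadioInference(model_name='c-radio_v3-B', device='cuda:0',
                        text_adaptor_name='siglip')   # adaptor name assumed
feats = radio.extract_patch_features(rgb)     # 1. RGB image -> patch features
text = radio.encode_text(["table"])           # 2. text -> teacher-space features
aligned = radio.project_features(map_feats)   # 3. map features -> text-aligned space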

Parameters:
  • model_name (str)

  • device (str)

  • text_adaptor_name (str | None)

__init__(model_name='c-radio_v3-B', device='cuda:0', text_adaptor_name=None)

Parameters:
  • model_name (str)

  • device (str)

  • text_adaptor_name (str | None)

static resolve_torchhub_version(model_name)

Map a C-RADIO model id to NVlabs/RADIO’s torch.hub version key.

Return type:

str

Parameters:

model_name (str)

extract_patch_features(rgb_uint8)

Extract patch features from one RGB image.

Parameters:

rgb_uint8 (Tensor) – (H, W, 3) uint8 image on self.device.

Return type:

Tensor

Returns:

(H_p, W_p, D) float32 feature tensor, where H_p = target_h // patch_size and similarly for W_p.
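For example, assuming rgb is an (H, W, 3) uint8 tensor already on self.device:

feats = radio.extract_patch_features(rgb)   # e.g. (H_p, W_p, 768) for C-RADIO v3-B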

encode_text(text)

Encode one or more strings to (N, D_teacher) L2-normalized features.

Return type:

Tensor

project_features(features)

Project (N, D_radio) map features to L2-normalized teacher features.

Return type:

Tensor

Parameters:

features (torch.Tensor)

pca_colorize_tensor(feats_flat, prev_basis=None, low_pct=0.02, high_pct=0.98)

Fit or reuse a 3-component PCA on (N, D) features and map to RGB.

Returns (colors, basis) where colors is (N, 3) uint8 and basis is (D, 3) float32. Non-finite rows are dropped from the fit and receive black in the output so bad inputs are visually obvious but don’t poison the principal directions. When a compatible prev_basis is provided, it is reused instead of refitting so the expensive PCA/SVD-style solve happens only once.

Return type:

Tuple[Tensor, Tensor]

Parameters:
  • feats_flat (Tensor)

  • prev_basis (Tensor | None)

  • low_pct (float)

  • high_pct (float)

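A hedged usage sketch of the basis reuse (variable names are illustrative):

colors, basis = pca_colorize_tensor(first_feats)                 # first frame: fit
colors2, _ = pca_colorize_tensor(later_feats, prev_basis=basis)  # later frames: reuse
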
pca_colorize_with_basis(feats, prev_basis=None, low_pct=0.02, high_pct=0.98)

Project (H, W, D) features to an (H, W, 3) uint8 image via PCA.

Return type:

Tuple[ndarray, Tensor]

Parameters:
  • feats (Tensor)

  • prev_basis (Tensor | None)

  • low_pct (float)

  • high_pct (float)

pca_colorize(feats, low_pct=0.02, high_pct=0.98)

Project (H, W, D) features to an (H, W, 3) uint8 image via PCA.

Return type:

ndarray

Parameters:
  • feats (Tensor)

  • low_pct (float)

  • high_pct (float)

upsample_nn(img_uint8, target_hw)

Nearest-neighbor upsample (H, W, 3) uint8 to target_hw.

Return type:

ndarray

Parameters:
  • img_uint8 (ndarray)

  • target_hw

downsample_for_gui(img_uint8, max_width=320)

Cheap preview downsample for viser GUI image widgets.

Return type:

ndarray

Parameters:
  • img_uint8 (ndarray)

  • max_width (int)

show_empty_reconstruction(visualizer, voxel_size)

Publish empty RGB and feature-PCA point clouds to clear the viewer.

Return type:

None

Parameters:

voxel_size (float)

show_feature_reconstruction(visualizer, voxels, block_colors_pca, voxel_size)

Draw occupied voxels as RGB and feature-PCA point clouds in Viser.

Return type:

None

Parameters:

voxel_size (float)

process_frame(obs, mapper, feature_model, depth_filter, prev_pca_basis=None, surface_only=True, extract_voxels=False, timer=None)

Integrate one RGB-D frame and optionally prepare visualization data.

The mapper expects a batched CameraObservation, even when this tutorial uses one camera. This helper keeps the per-frame flow in one place: clean depth, extract C-RADIO features, integrate into the mapper, and optionally extract occupied voxels for the live PCA point cloud.

Returns:

(feats, voxels, block_colors_pca, pca_basis, tsdf_time_ms). feats is the raw (H_p, W_p, D) RADIO patch map used for per-image PCA.

Parameters:
  • obs (CameraObservation)

  • mapper (Mapper)

  • feature_model (CRadioInference)

  • depth_filter

  • prev_pca_basis (Tensor | None)

  • surface_only (bool)

  • extract_voxels (bool)

  • timer

main()