CRISP: Object Pose and Shape Estimation with Test-Time Adaptation

Jingnan Shi1, Rajat Talak1, Luca Carlone1
1Laboratory for Information & Decision Systems (LIDS), Massachusetts Institute of Technology
We propose CRISP, a method for object pose and shape estimation with test-time adaptation.

Abstract

We consider the problem of estimating object pose and shape from an RGB-D image. Our first contribution is to introduce CRISP, a category-agnostic object pose and shape estimation pipeline. The pipeline implements an encoder-decoder model for shape estimation: it uses FiLM conditioning for implicit shape reconstruction and a DPT-based network for estimating pose-normalized points for pose estimation. As a second contribution, we propose an optimization-based pose and shape corrector that can correct estimation errors caused by a domain gap. Observing that the shape decoder is well behaved in the convex hull of known shapes, we approximate the shape decoder with an active shape model, and show that this reduces the shape correction problem to a constrained linear least squares problem, which can be solved efficiently by an interior point algorithm. Third, we introduce a self-training pipeline to perform self-supervised domain adaptation of CRISP. The self-training is based on a correct-and-certify approach, which leverages the corrector to generate pseudo-labels at test time and uses them to self-train CRISP. We demonstrate CRISP (and the self-training) on the YCBV, SPE3R, and NOCS datasets. CRISP shows strong performance on all three datasets. Moreover, our self-training is capable of bridging a large domain gap. Finally, CRISP also shows an ability to generalize to unseen objects.
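To make the shape-correction step concrete, the sketch below poses it as a constrained linear least squares over convex-combination weights of known shapes, as described above. The basis construction (stacking SDF samples of K known shapes into a matrix A) and the SciPy solver choice are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: shape correction as constrained linear least squares.
# Assumption (not from the paper text): the active shape model stacks SDF
# samples of K known shapes into a basis A (N x K), and the corrected shape
# is a convex combination A @ c with c on the probability simplex.
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def correct_shape(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Solve min_c ||A c - b||^2  s.t.  c >= 0, sum(c) = 1."""
    K = A.shape[1]
    c0 = np.full(K, 1.0 / K)                            # start at the simplex center
    simplex = LinearConstraint(np.ones((1, K)), 1.0, 1.0)

    def objective(c):
        r = A @ c - b
        return 0.5 * r @ r

    def gradient(c):
        return A.T @ (A @ c - b)

    res = minimize(objective, c0, jac=gradient,
                   bounds=[(0.0, None)] * K,
                   constraints=[simplex],
                   method="trust-constr")               # interior-point-style solver
    return res.x

# Usage example with synthetic data: b is the observed (noisy) SDF sample vector.
A = np.random.randn(500, 8)
c_true = np.array([0.6, 0.4, 0, 0, 0, 0, 0, 0])
b = A @ c_true + 0.01 * np.random.randn(500)
c_hat = correct_shape(A, b)
```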

Overview

Pipeline: Given a segmented RGB image and depth points of the object, CRISP uses a ViT backbone to extract features from the cropped image. It estimates the object pose by regressing pose-normalized coordinates (PNC) Z, and the object shape by reconstructing its signed distance field (SDF).
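Given the PNC prediction Z and the corresponding depth points, the pose can be recovered by aligning the two point sets. Below is a minimal sketch using the standard SVD-based (Umeyama-style) alignment; the registration solver actually used in CRISP may differ, and the scale term can be dropped if the PNC are already metric.

```python
# Minimal sketch of recovering pose from PNC/depth correspondences via
# SVD-based alignment. Only illustrates the pose-from-PNC step; it is not
# necessarily the exact solver used in the pipeline.
import numpy as np

def pose_from_correspondences(pnc: np.ndarray, depth_pts: np.ndarray):
    """Find R, t, s such that s * R @ pnc + t ~= depth_pts (both 3 x N)."""
    mu_p = pnc.mean(axis=1, keepdims=True)
    mu_d = depth_pts.mean(axis=1, keepdims=True)
    P, D = pnc - mu_p, depth_pts - mu_d
    U, S, Vt = np.linalg.svd(D @ P.T)                       # cross-covariance SVD
    sign = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, sign]) @ Vt                  # proper rotation, det = +1
    s = (S @ np.array([1.0, 1.0, sign])) / (P * P).sum()    # optimal scale
    t = mu_d - s * R @ mu_p
    return R, t, s
```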

Corrector: The pose and shape estimates are refined by the corrector, which solves a bi-level optimization problem using two solvers: PGD and LSQ.
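A minimal sketch of the pose half of such a corrector is shown below, assuming PGD denotes a projected-gradient step on the pose (with the rotation projected back onto SO(3)) and LSQ denotes the constrained least-squares shape solve sketched earlier; the actual bi-level coupling and the SDF-based objective are not reproduced here.

```python
# Minimal sketch of a projected-gradient (PGD) pose update for the corrector.
# The objective and its gradients would come from the SDF fitting residual;
# here they are passed in, and only the descend-then-project structure is shown.
import numpy as np

def project_to_SO3(M: np.ndarray) -> np.ndarray:
    """Project a 3x3 matrix onto the rotation group via SVD."""
    U, _, Vt = np.linalg.svd(M)
    return U @ np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))]) @ Vt

def pgd_pose_step(R, t, grad_R, grad_t, step=1e-2):
    """One PGD step: gradient descent on (R, t), then project R back to SO(3)."""
    return project_to_SO3(R - step * grad_R), t - step * grad_t
```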

Self-training: The self-training uses corrected estimates that pass the observable certification check as pseudo-labels. The SDF decoder is fixed during self-training.
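The control flow of the correct-and-certify self-training is sketched below in PyTorch. The component names (model.sdf_decoder, corrector, certify, pseudo_label_loss) are hypothetical placeholders for the modules described above; only the structure (certify before using a pseudo-label, keep the SDF decoder frozen) follows the text.

```python
# Minimal sketch of correct-and-certify self-training. All helper names are
# illustrative placeholders, not the actual CRISP API.
import torch

def self_train(model, corrector, certify, pseudo_label_loss, loader, epochs=1, lr=1e-5):
    # Freeze the SDF decoder; only the remaining modules are adapted.
    for p in model.sdf_decoder.parameters():
        p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optim = torch.optim.Adam(trainable, lr=lr)

    for _ in range(epochs):
        for rgb, depth in loader:
            with torch.no_grad():
                est = model(rgb, depth)             # raw pose/shape estimates
                corrected = corrector(est, depth)   # corrector refines them
            if not certify(corrected, depth):       # observable certification check
                continue                            # uncertified samples are skipped
            optim.zero_grad()
            loss = pseudo_label_loss(model(rgb, depth), corrected)
            loss.backward()
            optim.step()
```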

YCBV Qualitative Results

[Figure: RGB inputs, RGB with segmentation masks, and reconstructed meshes for YCBV scenes.]

SPE3R Qualitative Results

[Figure: RGB inputs, RGB with segmentation masks, and reconstructed meshes for SPE3R satellites (acrimsat, apollo_soyuz, cheops).]

NOCS REAL275 Qualitative Results

[Figure: RGB inputs and RGB with segmentation masks for NOCS REAL275 scenes.]