[Reading]PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects

  1. Problem Statement
  2. Method
    1. Structure
    2. Training

ABSTRACT: This paper introduces a self-supervised, end-to-end architecture that learns part-level implicit shape and appearance models and optimizes motion parameters jointly, without requiring any 3D supervision, motion, or semantic annotation. The training process is similar to the original NeRF, but extends the ray marching and volumetric rendering procedure to compose the two fields.

[Arxiv] [Github] [Project Page]

Problem Statement

The problem of articulated object reconstruction in this paper can be summarized as follows: given a start state $t=0$ and an end state $t=1$, with the corresponding multi-view RGB images $I^t$ and camera parameters, solve the two sub-problems below.

The first problem is to decouple the object into a static part and a movable part. The paper assumes that an object has exactly one static part and one movable part.

The second problem is to estimate the articulated motion $T\in\{f_{p,q}, f_{a,d}\}$. A revolute joint is parameterized as $f_{p,q}$, with a pivot point $p\in\mathbb{R}^3$ and a rotation given by a unit quaternion $q\in\mathbb{R}^4$, $\|q\|=1$. A prismatic joint is modeled as $f_{a,d}$, with a joint axis given by a unit vector $a\in\mathbb{R}^3$ and a translation distance $d$. Training adopts one of these parameterizations when the joint type is given as prior information; if no such prior is given, the motion is modeled as a general $\mathbf{SE}(3)$ transformation.
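As a rough illustration (not the authors' implementation; the function names and the `(w, x, y, z)` quaternion convention below are my own), the two joint parameterizations amount to the following point transformations:

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate a 3D vector v by a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    r = np.array([x, y, z])
    # Efficient rotation formula: v' = v + 2 r x (r x v + w v)
    return v + 2.0 * np.cross(r, np.cross(r, v) + w * v)

def revolute_transform(x, pivot, q):
    """Revolute joint f_{p,q}: rotate point x about pivot p by quaternion q."""
    return pivot + quat_rotate(q, x - pivot)

def prismatic_transform(x, axis, d):
    """Prismatic joint f_{a,d}: translate point x by distance d along unit axis a."""
    return x + d * (axis / np.linalg.norm(axis))

# Example: rotate a point by 90 degrees about the z-axis through the origin.
q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
print(revolute_transform(np.array([1.0, 0.0, 0.0]), np.zeros(3), q))  # ~[0, 1, 0]
```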

Method


This paper separates the parts by registering the input state $t$ to a canonical state $t^*$. The components that agree with the transformation are extracted as the movable part, and the remaining components form the static part.

Structure

The static and movable parts are learned jointly during training. Each is represented by a separate field with the same network structure, built upon InstantNGP. Their relationship is modeled explicitly by the transformation function $T$ described in the Problem Statement.

The fields are represented as:

$$
\left\{
\begin{aligned}
\mathtt{Static}:\ & \mathcal{F}^S(\mathbb{x}_t,\mathbb{d}_t)= \sigma^S(\mathbb{x}_t),\, c^S(\mathbb{x}_t,\mathbb{d}_t)\\
\mathtt{Mobile}:\ & \mathcal{F}^M(\mathbb{x}_{t^*},\mathbb{d}_{t^*})= \sigma^M(\mathbb{x}_{t^*}),\, c^M(\mathbb{x}_{t^*},\mathbb{d}_{t^*})
\end{aligned}
\right.
$$

Here $\mathbb{x}_t\in \mathbb{R}^3$ is a point sampled along a ray at state $t$ with direction $\mathbb{d}_t\in \mathbb{R}^3$. $\sigma(\mathbb{x})\in \mathbb{R}$ is the density value at point $\mathbb{x}$, and $c(\mathbb{x},\mathbb{d})$ is the RGB color predicted at point $\mathbb{x}$ viewed from direction $\mathbb{d}$.
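A minimal sketch of how the two fields might be queried together, assuming PyTorch; the class name, the placeholder field modules, and the `motion` interface are hypothetical stand-ins for the InstantNGP backbones and the learned transformation $T$, not the paper's actual code:

```python
import torch.nn as nn

class TwoPartField(nn.Module):
    """Hypothetical wrapper around a static and a mobile radiance field."""
    def __init__(self, static_field: nn.Module, mobile_field: nn.Module, motion: nn.Module):
        super().__init__()
        self.static_field = static_field  # F^S: (x_t, d_t) -> (sigma^S, c^S)
        self.mobile_field = mobile_field  # F^M: (x_{t*}, d_{t*}) -> (sigma^M, c^M)
        self.motion = motion              # learnable T mapping state t to canonical state t*

    def forward(self, x_t, d_t):
        # The static field is queried directly at the observed state t.
        sigma_s, c_s = self.static_field(x_t, d_t)
        # Samples for the mobile field are first warped to the canonical state t*.
        x_star, d_star = self.motion(x_t, d_t)
        sigma_m, c_m = self.mobile_field(x_star, d_star)
        return (sigma_s, c_s), (sigma_m, c_m)
```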

Training

The training pipeline is adapted from NeRF, with the ray marching and volumetric rendering procedure extended to compose the two fields.
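One common way to composite two radiance fields along a ray is to sum their densities and density-weight their colors at each sample, then apply standard alpha compositing. The sketch below assumes that formulation (PyTorch; the function name is mine) rather than reproducing the paper's exact rendering equations:

```python
import torch

def composite_render(sigma_s, color_s, sigma_m, color_m, deltas):
    """Render one ray: sigma_* are (N,) densities, color_* are (N, 3) RGB, deltas are (N,) step sizes."""
    eps = 1e-10
    # Assumed composition rule: densities add, colors are density-weighted per sample.
    sigma = sigma_s + sigma_m
    color = (sigma_s[:, None] * color_s + sigma_m[:, None] * color_m) / (sigma[:, None] + eps)

    # Standard volumetric rendering (alpha compositing along the ray).
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + eps])[:-1], dim=0)
    weights = alpha * trans
    rgb = (weights[:, None] * color).sum(dim=0)
    return rgb, weights
```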
