【Reading】Ditto: Building Digital Twins of Articulated Objects from Interaction

  1. Workflow-In Brief
    1.1. Two-Stream Encoder
  2. Training

This paper proposes a way to build the articulation model of an articulated object by encoding point cloud features and finding the correspondence between the static and mobile parts from visual observations before and after an interaction.

[Figure: method overview]

Workflow-In Brief

Two-Stream Encoder

  • Given point cloud observations before and after the interaction: $P_1, P_2 \in \mathbb{R}^{N\times 3}$.

  • Encode them with a shared PointNet++ encoder $\mu_{enc}$: $f_1=\mu_{enc}(P_1)$, $f_2=\mu_{enc}(P_2)$, with $f_1, f_2 \in \mathbb{R}^{N'\times d_{sub}}$, where $N' < N$ is the number of sub-sampled points and $d_{sub}$ is the dimension of the sub-sampled point features.

  • Fuse the features with an attention layer: $Attn_{12}=\mathrm{softmax}\!\left(\frac{f_1 f_2^T}{\sqrt{d_{sub}}}\right) f_2$, $f_{12}=[f_1, Attn_{12}]$, with $f_{12}\in \mathbb{R}^{N'\times 2 d_{sub}}$ (see the sketch after this list).

  • The fused feature is decoded by two PointNet++ decoders $\nu_{geo}$ and $\nu_{art}$, giving $f_{geo}=\nu_{geo}(f_{12})$ and $f_{art}=\nu_{art}(f_{12})$. Both $f_{geo}, f_{art}\in \mathbb{R}^{N\times d_{dense}}$ are dense point features aligned with $P_1$.

  • Feature encoding follows ConvONet (Convolutional Occupancy Networks):

    [Figure: ConvONet-style feature encoding]
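
A minimal PyTorch sketch of the attention-fusion step from the list above; the PointNet++ encoder is stubbed out with random features, the function name is my own, and shapes follow the notation above ($N'$, $d_{sub}$) rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def fuse_two_stream(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Cross-attention fusion of the two observations' point features.

    f1, f2: (N', d_sub) sub-sampled features, i.e. mu_enc(P1) and mu_enc(P2).
    Returns f12: (N', 2 * d_sub), aligned with the points of P1.
    """
    d_sub = f1.shape[-1]
    # Attn_12 = softmax(f1 f2^T / sqrt(d_sub)) f2: scaled dot-product
    # attention with f1 as queries and f2 as keys/values.
    scores = f1 @ f2.transpose(-1, -2) / d_sub ** 0.5  # (N', N')
    attn12 = F.softmax(scores, dim=-1) @ f2            # (N', d_sub)
    # f_12 = [f1, Attn_12]: concatenate along the feature dimension.
    return torch.cat([f1, attn12], dim=-1)             # (N', 2 * d_sub)

# Random features standing in for the PointNet++ encoder outputs:
f1 = torch.randn(256, 128)   # N' = 256, d_sub = 128
f2 = torch.randn(256, 128)
assert fuse_two_stream(f1, f2).shape == (256, 256)
```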

$f_{art}$ is projected onto 2D feature planes and $f_{geo}$ into voxel grids, as in ConvONet. The points that fall into the same pixel cell or voxel cell are aggregated together via max pooling.
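
A hedged sketch of this projection step, assuming points normalized to the unit cube; the function name and resolution are illustrative, and the pooling uses PyTorch's `Tensor.scatter_reduce` rather than the `torch_scatter` package the official ConvONet code relies on. The 2D-plane projection of $f_{art}$ is analogous, binning on two coordinates instead of three.

```python
import torch

def project_to_voxel_grid(points: torch.Tensor,
                          feats: torch.Tensor,
                          reso: int = 32) -> torch.Tensor:
    """Max-pool per-point features into a voxel grid.

    points: (N, 3) coordinates, assumed normalized to [0, 1).
    feats:  (N, d) point features (e.g. f_geo).
    Returns a (d, reso, reso, reso) feature volume.
    """
    n, d = feats.shape
    # Flat voxel index of each point.
    cells = (points * reso).long().clamp_(0, reso - 1)              # (N, 3)
    flat = (cells[:, 0] * reso + cells[:, 1]) * reso + cells[:, 2]  # (N,)
    # Max-pool features of points that fall into the same voxel cell.
    grid = feats.new_full((reso ** 3, d), float("-inf"))
    grid = grid.scatter_reduce(0, flat[:, None].expand(-1, d), feats,
                               reduce="amax", include_self=True)
    grid[grid == float("-inf")] = 0.0  # empty cells get zero features
    return grid.T.reshape(d, reso, reso, reso)

pts, f_geo = torch.rand(1024, 3), torch.randn(1024, 64)
assert project_to_voxel_grid(pts, f_geo, reso=16).shape == (64, 16, 16, 16)
```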

Training

[Equation screenshots: training loss terms]

Revolute joint:

[Equation screenshots: revolute joint loss terms]
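
The equation screenshots above did not survive extraction; as a hedged reconstruction in my own notation (not necessarily the paper's exact formulation), a revolute joint is typically supervised on its axis direction, pivot point, and joint state:

$$
\mathcal{L}_{rev} \;=\; \angle\big(\hat{\mathbf{a}}, \mathbf{a}\big) \;+\; d\big(\hat{\mathbf{p}},\, \ell(\mathbf{p}, \mathbf{a})\big) \;+\; \big|\hat{\theta} - \theta\big|,
$$

where $\hat{\mathbf{a}}, \hat{\mathbf{p}}, \hat{\theta}$ are the predicted axis direction, pivot point, and rotation state, the un-hatted symbols their ground truths, $\angle(\cdot,\cdot)$ an angular error, and $d(\hat{\mathbf{p}}, \ell(\mathbf{p}, \mathbf{a}))$ the distance from the predicted pivot to the ground-truth axis line; in practice each term would carry its own weight.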
