[Reading] Ditto: Building Digital Twins of Articulated Objects from Interaction

  1. Workflow in Brief
    1.1. Two Stream Encoder
  2. Training

This paper proposes a way to build an articulation model of an articulated object: it encodes visual observations taken before and after an interaction, then finds the correspondence between the static and mobile parts.

image-20230616163010066

Workflow in Brief

Two Stream Encoder

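  • Given point cloud observations before and after interaction: $P_1, P_2 \in \mathbb{R}^{N \times 3}$.
  • Encode them with a PointNet++ encoder $\mu_{enc}$: $f_1 = \mu_{enc}(P_1)$, $f_2 = \mu_{enc}(P_2)$, with $f_1, f_2 \in \mathbb{R}^{N' \times d_{sub}}$. $N'$ is the number of sub-sampled points, and $d_{sub}$ is the dimension of the sub-sampled point features.

  • Fuse the features with an attention layer: $\mathrm{Attn}_{12} = \mathrm{softmax}\left(\frac{f_1 f_2^T}{\sqrt{d_{sub}}}\right) f_2$, $f_{12} = [f_1, \mathrm{Attn}_{12}]$, $f_{12} \in \mathbb{R}^{N' \times 2 d_{sub}}$.

  • The fused feature is decoded by two PointNet++ decoders $\nu_{geo}$, $\nu_{art}$ to get $f_{geo} = \nu_{geo}(f_{12})$, $f_{art} = \nu_{art}(f_{12})$. $f_{geo}, f_{art} \in \mathbb{R}^{N \times d_{dense}}$ are point features aligned with $P_1$.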

  • Feature encoding based on ConvONet.

    image-20230616183619689
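The attention fusion step in the list above can be sketched in NumPy. This is a minimal sketch, not the Ditto implementation: it assumes standard scaled dot-product attention with the pre-interaction features as queries and the post-interaction features as keys and values, followed by concatenation with the pre-interaction features. The function name `fuse_with_attention`, the variable names, and the sizes (128 sub-sampled points, 64 channels) are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_with_attention(f1, f2):
    """Fuse two sets of sub-sampled point features.

    f1, f2: arrays of shape (N', d_sub), features before/after interaction.
    Returns an array of shape (N', 2 * d_sub).
    """
    d_sub = f1.shape[1]
    scores = f1 @ f2.T / np.sqrt(d_sub)   # (N', N') pairwise similarities
    weights = softmax(scores, axis=-1)    # each row sums to 1
    attn_12 = weights @ f2                # (N', d_sub) attended features
    return np.concatenate([f1, attn_12], axis=1)


# Illustrative sizes only (assumption, not from the paper).
rng = np.random.default_rng(0)
f1 = rng.normal(size=(128, 64))
f2 = rng.normal(size=(128, 64))
f12 = fuse_with_attention(f1, f2)
print(f12.shape)  # (128, 128)
```

Each point in the first observation gets a weighted summary of the second observation's features, so the fused feature carries information about how the neighbourhood of that point changed after the interaction.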

The geometry feature $f_{geo}$ is projected into 2D feature planes and the articulation feature $f_{art}$ is projected into voxel grids, as in ConvONets. Points that fall into the same pixel or voxel cell are aggregated via max pooling.
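The max-pooled projection described above can be sketched as follows. This is a hedged illustration rather than the ConvONet or Ditto code: it assumes the points are already normalized to the unit cube $[0, 1)^3$ and that empty cells are filled with zeros. The helper name `scatter_max`, the resolution, and the toy data are all hypothetical.

```python
import numpy as np


def scatter_max(points, feats, res, axes):
    """Max-pool point features into a regular grid.

    points: (N, 3) coordinates, assumed to lie in [0, 1).
    feats:  (N, d) per-point features.
    res:    number of cells along each kept axis.
    axes:   coordinate axes to keep, e.g. (0, 1) for the xy plane
            or (0, 1, 2) for a voxel grid.
    Returns an array of shape (res,) * len(axes) + (d,).
    """
    axes = list(axes)
    # Integer cell index of every point along each kept axis.
    idx = np.clip((points[:, axes] * res).astype(int), 0, res - 1)
    flat = np.ravel_multi_index(tuple(idx.T), (res,) * len(axes))
    grid = np.full((res ** len(axes), feats.shape[1]), -np.inf)
    # Unbuffered element-wise maximum, so repeated cell indices all count.
    np.maximum.at(grid, flat, feats)
    grid[np.isneginf(grid)] = 0.0  # assumption: empty cells become zero
    return grid.reshape((res,) * len(axes) + (feats.shape[1],))


# Toy data: the first two points share an xy cell but differ in z.
points = np.array([[0.10, 0.10, 0.10],
                   [0.12, 0.13, 0.90],
                   [0.90, 0.90, 0.90]])
feats = np.array([[1.0, 5.0],
                  [3.0, 2.0],
                  [7.0, 7.0]])

plane_xy = scatter_max(points, feats, res=4, axes=(0, 1))
voxels = scatter_max(points, feats, res=4, axes=(0, 1, 2))
print(plane_xy[0, 0])    # [3. 5.]
print(voxels[0, 0, 3])   # [3. 2.]
```

In the 2D plane the first two points collapse into one cell and their features are pooled channel by channel, while the voxel grid keeps them apart because they differ along z.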

Training

image-20230616185216164
image-20230616185327505

Revolute joint:

image-20230616185351385
image-20230616185433598