Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation

Shreyas Hampali; Sayan Deb Sarkar; Mahdi Rad; Vincent Lepetit

doi:10.1109/CVPR52688.2022.01081

Institute of Computer Graphics and Vision (7100)

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. State-of-the-art methods solve this problem by regressing a heatmap for each joint, which requires solving two problems simultaneously: localizing the joints and recognizing them. In this work, we propose to separate these tasks by relying on a CNN to first localize joints as 2D keypoints, and on self-attention between the CNN features at these keypoints to associate them with the corresponding hand joint. The resulting architecture, which we call 'Keypoint Transformer', is highly efficient as it achieves state-of-the-art performance with roughly half the number of model parameters on the InterHand2.6M dataset. We also show it can be easily extended to estimate the 3D pose of an object manipulated by one or two hands with high performance. Moreover, we created a new dataset of more than 75,000 images of two hands manipulating an object fully annotated in 3D and will make it publicly available.

Original language	English
Title of host publication	Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Publisher	IEEE Computer Society Publications
Pages	11080-11090
Number of pages	11
ISBN (Electronic)	9781665469463
DOIs	https://doi.org/10.1109/CVPR52688.2022.01081
Publication status	Published - 2022
Event	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition: CVPR 2022 - New Orleans, United States. Duration: 19 Jun 2022 → 24 Jun 2022

Conference

Conference	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Abbreviated title	CVPR 2022
Country/Territory	United States
City	New Orleans
Period	19/06/22 → 24/06/22

Keywords

3D from single images
Datasets and evaluation
Deep learning architectures and techniques
Pose estimation and tracking

ASJC Scopus subject areas

Software
Computer Vision and Pattern Recognition

Access to Document

10.1109/CVPR52688.2022.01081

Cite this

Hampali, S., Sarkar, S. D., Rad, M., & Lepetit, V. (2022). Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation. In Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 (pp. 11080-11090). IEEE Computer Society Publications. https://doi.org/10.1109/CVPR52688.2022.01081

Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation. / Hampali, Shreyas; Sarkar, Sayan Deb; Rad, Mahdi et al.
Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022. IEEE Computer Society Publications, 2022. p. 11080-11090.

Hampali, S, Sarkar, SD, Rad, M & Lepetit, V 2022, Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation. in Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022. IEEE Computer Society Publications, pp. 11080-11090, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, United States, 19/06/22. https://doi.org/10.1109/CVPR52688.2022.01081

@inproceedings{83d3327519ef4bc0b7f758b465b16c68,

title = "Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation",

abstract = "We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. State-of-the-art methods solve this problem by regressing a heatmap for each joint, which requires solving two problems simultaneously: localizing the joints and recognizing them. In this work, we propose to separate these tasks by relying on a CNN to first localize joints as 2D keypoints, and on self-attention between the CNN features at these keypoints to associate them with the corresponding hand joint. The resulting architecture, which we call 'Keypoint Transformer', is highly efficient as it achieves state-of-the-art performance with roughly half the number of model parameters on the InterHand2.6M dataset. We also show it can be easily extended to estimate the 3D pose of an object manipulated by one or two hands with high performance. Moreover, we created a new dataset of more than 75,000 images of two hands manipulating an object fully annotated in 3D and will make it publicly available.",

keywords = "3D from single images, Datasets and evaluation, Deep learning architectures and techniques, Pose estimation and tracking",

author = "Shreyas Hampali and Sarkar, {Sayan Deb} and Mahdi Rad and Vincent Lepetit",

note = "Funding Information: This work was supported by the Christian Doppler Laboratory for Semantic 3D Computer Vision, funded in part by Qualcomm Inc, and Chistera IPalm. Publisher Copyright: {\textcopyright} 2022 IEEE.; 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition: CVPR 2022; Conference date: 19-06-2022 through 24-06-2022",

year = "2022",

doi = "10.1109/CVPR52688.2022.01081",

language = "English",

pages = "11080--11090",

booktitle = "Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022",

publisher = "IEEE Computer Society Publications",

}

TY - GEN

T1 - Keypoint Transformer

T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition

AU - Hampali, Shreyas

AU - Sarkar, Sayan Deb

AU - Rad, Mahdi

AU - Lepetit, Vincent

N1 - Funding Information: This work was supported by the Christian Doppler Laboratory for Semantic 3D Computer Vision, funded in part by Qualcomm Inc, and Chistera IPalm. Publisher Copyright: © 2022 IEEE.

PY - 2022

Y1 - 2022

AB - We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. State-of-the-art methods solve this problem by regressing a heatmap for each joint, which requires solving two problems simultaneously: localizing the joints and recognizing them. In this work, we propose to separate these tasks by relying on a CNN to first localize joints as 2D keypoints, and on self-attention between the CNN features at these keypoints to associate them with the corresponding hand joint. The resulting architecture, which we call 'Keypoint Transformer', is highly efficient as it achieves state-of-the-art performance with roughly half the number of model parameters on the InterHand2.6M dataset. We also show it can be easily extended to estimate the 3D pose of an object manipulated by one or two hands with high performance. Moreover, we created a new dataset of more than 75,000 images of two hands manipulating an object fully annotated in 3D and will make it publicly available.

KW - 3D from single images

KW - Datasets and evaluation

KW - Deep learning architectures and techniques

KW - Pose estimation and tracking

UR - http://www.scopus.com/inward/record.url?scp=85139893938&partnerID=8YFLogxK

U2 - 10.1109/CVPR52688.2022.01081

DO - 10.1109/CVPR52688.2022.01081

M3 - Conference paper

AN - SCOPUS:85139893938

SP - 11080

EP - 11090

BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022

PB - IEEE Computer Society Publications

Y2 - 19 June 2022 through 24 June 2022

ER -
