EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba

Quang Nguyen1   Nhat Le2   Baoru Huang7   Minh Nhat Vu3   Chengcheng Tang4
Van Nguyen1   Ngan Le5   Thieu Vo6   Anh Nguyen7

1FPT Software AI Center   2The University of Western Australia   3TU Wien   4Meta  
5University of Arkansas   6NUS   7University of Liverpool

Abstract

Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines egocentric views and music with more than 36 hours of dance motion. Drawing on the success of diffusion models and Mamba in sequence modeling, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We show that our approach is theoretically grounded. Extensive experiments demonstrate that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.
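To give a rough intuition for what a "skeleton-aware" sequence block can look like, the sketch below combines adjacency-masked mixing across joints with a simple input-gated recurrence over time, in the general spirit of Mamba-style selective state updates. This is a minimal, hypothetical illustration only: the class name, shapes, and joint adjacency are assumptions made here for demonstration, not the EgoMusic Motion Network or Skeleton Mamba implementation described in the paper.

```python
# Hypothetical sketch of a skeleton-aware sequence block (NOT the authors' code).
# Per-joint features are mixed along the kinematic tree via an adjacency-masked
# linear map, then scanned over time with an input-gated recurrence.
import torch
import torch.nn as nn


class SkeletonAwareBlock(nn.Module):
    def __init__(self, num_joints: int, dim: int, adjacency: torch.Tensor):
        super().__init__()
        # Spatial mixing is restricted to adjacent joints (plus self-loops).
        self.register_buffer("adj", adjacency + torch.eye(num_joints))
        self.spatial = nn.Parameter(torch.randn(num_joints, num_joints) * 0.02)
        # Input-dependent gates and decay for the temporal scan.
        self.in_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.decay = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, joints, dim)
        b, t, j, d = x.shape
        # 1) Skeleton mixing: share information only between adjacent joints.
        w = self.spatial * self.adj                      # (J, J), masked
        x = torch.einsum("btjd,kj->btkd", x, w)
        # 2) Temporal scan: a simple input-gated recurrence (selective update).
        u = self.in_proj(x)
        a = torch.sigmoid(self.decay(x))                 # per-step decay
        g = torch.sigmoid(self.gate(x))                  # input-dependent gate
        h = torch.zeros(b, j, d, device=x.device, dtype=x.dtype)
        outs = []
        for step in range(t):
            h = a[:, step] * h + (1 - a[:, step]) * u[:, step]
            outs.append(g[:, step] * h)
        y = torch.stack(outs, dim=1)                     # (B, T, J, D)
        return self.out_proj(y)


if __name__ == "__main__":
    J, D = 24, 64                                        # assumed joint count / feature size
    adj = torch.zeros(J, J)                              # placeholder kinematic tree
    block = SkeletonAwareBlock(J, D, adj)
    motion = torch.randn(2, 120, J, D)                   # 2 clips, 120 frames
    print(block(motion).shape)                           # torch.Size([2, 120, 24, 64])
```

The explicit per-step loop is kept only for readability; practical state-space implementations replace it with a parallel scan.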

BibTeX

Coming soon!
    

Acknowledgements

We borrow the page template from HyperNeRF. Special thanks to them!