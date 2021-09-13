



Posted by Shan Yang, Software Engineer, and Angjoo Kanazawa, Research Scientist, Google Research

Dance is a universal language found in almost every culture, and it is an outlet many people use to express themselves on today’s media platforms. The ability to dance by composing movement patterns that align with the beat of music is a fundamental aspect of human behavior. However, dance is an art form that requires practice. In fact, specialized training is often required to equip a dancer with the rich repertoire of dance motions needed to create expressive choreography. This process is difficult for people, and it is even more challenging for a machine learning (ML) model, because the task requires generating continuous motion with high kinematic complexity while capturing the non-linear relationship between the movements and the accompanying music.

In “AI Choreographer: Music-Conditioned 3D Dance Generation with AIST++”, presented at ICCV 2021, we propose a full-attention cross-modal transformer (FACT) model that can mimic and understand dance motions and can even enhance a person’s ability to choreograph dance. Along with the model, we have released AIST++, a large-scale multimodal 3D dance motion dataset. It contains 5.2 hours of 3D dance motion in 1408 sequences covering 10 dance genres, each including multi-view videos with known camera poses. Through extensive user studies on AIST++, we find that the FACT model outperforms recent state-of-the-art methods both qualitatively and quantitatively.

We present a new full-attention cross-modal transformer (FACT) network that can generate realistic 3D dance motion conditioned on music (right) and a new 3D dance dataset, AIST++ (left).

We generate the proposed 3D motion dataset from the existing AIST Dance Database, a collection of dance videos with musical accompaniment but no 3D information. AIST contains 10 dance genres: Old School (Break, Pop, Lock, and Waack) and New School (Middle Hip-Hop, LA-style Hip-Hop, House, Krump, Street Jazz, and Ballet Jazz). Although it contains multi-view videos of dancers, the cameras are not calibrated.

For our purposes, we recovered the camera calibration parameters and the 3D human motion, expressed in terms of the parameters of the widely used SMPL 3D model. The resulting database, AIST++, is a large-scale 3D human dance motion dataset that contains a wide variety of 3D motion paired with music. Each frame includes extensive annotations:

- 9 views of camera intrinsic and extrinsic parameters.
- 17 COCO-format human joint locations in both 2D and 3D.
- 24 SMPL pose parameters, along with the global scaling and translation.

The motions are evenly distributed across all 10 dance genres and cover a wide range of musical tempos, measured in beats per minute (BPM). Each dance genre contains 85% basic movements and 15% advanced movements (longer choreographies freely designed by the dancers).

The AIST++ dataset also contains multi-view synchronized image data, which makes it useful for other research directions such as 2D/3D pose estimation. To our knowledge, AIST++ is the largest 3D human dance dataset, with 1408 sequences, 30 subjects, 10 dance genres, and both basic and advanced choreographies.

An example of a 3D dance sequence in the AIST++ dataset. Left: Three views of the dance video from the AIST database. Right: Reconstructed 3D motion visualized as a 3D mesh (top) and skeletons (bottom).

Because AIST is an instructional database, it records multiple dancers following the same choreography to different music with varying BPM, a common practice in dance. This poses a unique challenge for cross-modal sequence-to-sequence generation, as the model needs to learn a one-to-many mapping between audio and motion. We carefully construct non-overlapping train and test subsets of AIST++ to ensure that neither choreography nor music is shared between the subsets.
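The non-overlap constraint can be illustrated with a minimal sketch. It assumes each sequence record carries hypothetical `choreo_id` and `music_id` fields; these names and the toy records are illustrative and are not the AIST++ schema. Sequences that use both a held-out choreography and held-out music go to the test set, sequences that use neither go to the training set, and mixed cases are discarded.

```python
def split_sequences(sequences, test_choreos, test_music):
    """Split records so that no choreography or music appears in both subsets."""
    train, test, dropped = [], [], []
    for seq in sequences:
        in_test_choreo = seq["choreo_id"] in test_choreos
        in_test_music = seq["music_id"] in test_music
        if in_test_choreo and in_test_music:
            test.append(seq)
        elif not in_test_choreo and not in_test_music:
            train.append(seq)
        else:
            # Shares a choreography or a piece of music with the other subset.
            dropped.append(seq)
    return train, test, dropped


# Toy records with hypothetical field names (not the real dataset layout).
sequences = [
    {"name": "s0", "choreo_id": "c0", "music_id": "m0"},
    {"name": "s1", "choreo_id": "c0", "music_id": "m1"},
    {"name": "s2", "choreo_id": "c1", "music_id": "m1"},
    {"name": "s3", "choreo_id": "c2", "music_id": "m2"},
]
train, test, dropped = split_sequences(sequences, {"c1"}, {"m1"})
```

Discarding the mixed cases costs some data, but it is the simplest way to guarantee that the test set contains only unseen music and unseen choreography.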

Full Attention Cross-Modal Transformer (FACT) Model

Using this data, we train the FACT model to generate 3D dance from music. The model begins by encoding the seed motion and audio inputs using separate motion and audio transformers. The embeddings are then concatenated and sent to a cross-modal transformer, which learns the correspondence between the two modalities and generates N future motion sequences. These sequences are then used to train the model in a self-supervised manner. All three transformers are jointly learned end-to-end. At test time, we apply the model in an autoregressive framework, in which the predicted motion serves as the input to the next generation step. As a result, the FACT model can generate long-range dance motion frame by frame.

The FACT network takes in a music piece (Y) and a 2-second sequence of seed motion (X), then generates long-range future motion that correlates with the input music.
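The autoregressive test-time procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the released implementation: poses are reduced to single numbers, `toy_predictor` is a hypothetical stand-in for the trained network, and the sketch assumes that only the first of the N predicted frames is kept at each step before the motion and audio windows slide forward.

```python
def generate_dance(predict_future, seed_motion, audio, num_frames, audio_window):
    """Autoregressive rollout that keeps one predicted frame per step."""
    motion_window = list(seed_motion)
    generated = []
    for step in range(num_frames):
        audio_slice = audio[step:step + audio_window]
        future = predict_future(motion_window, audio_slice)  # N future frames
        next_frame = future[0]  # assumption: keep only the first prediction
        generated.append(next_frame)
        # Slide the motion window: drop the oldest frame, append the new one.
        motion_window = motion_window[1:] + [next_frame]
    return generated


def toy_predictor(motion_window, audio_slice, n_future=3):
    """Hypothetical stand-in for the network: last pose plus mean audio drift."""
    drift = sum(audio_slice) / len(audio_slice)
    last = motion_window[-1]
    return [last + drift * (k + 1) for k in range(n_future)]


audio = [1.0] * 10  # toy audio features, one value per frame
generated = generate_dance(toy_predictor, [0.0, 0.0], audio,
                           num_frames=5, audio_window=4)
```

Because each prediction is fed back as input, errors can accumulate over long rollouts, which is why the training objective matters for stability.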

FACT involves three key design choices that are critical for generating realistic 3D dance motion from music.

1. All of the transformers use a full-attention mask. This can be more expressive than a typical causal model because internal tokens have access to all inputs.
2. We train the model to predict N futures beyond the current input, instead of just the next motion. This encourages the network to pay more attention to the temporal context and helps prevent the motion from freezing or diverging after a few generation steps.
3. We fuse the two embeddings (motion and audio) early and employ a deep 12-layer cross-modal transformer module, which is essential for training a model that actually pays attention to the input music.
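The difference between a full-attention mask and a causal mask can be shown with a small sketch. The function names are illustrative and the masks are plain Boolean grids, where entry `[i][j]` says whether token `i` may attend to token `j`. How the masks are wired into a transformer layer is left out.

```python
def causal_mask(n):
    """Token i may attend only to tokens j <= i (itself and the past)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Every token may attend to every token in the input."""
    return [[True] * n for _ in range(n)]


def visible_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


causal_count = visible_pairs(causal_mask(4))  # 10 allowed pairs
full_count = visible_pairs(full_mask(4))      # 16 allowed pairs
```

With four tokens the causal mask allows 10 query-key pairs and the full mask allows all 16, so early tokens in the input window can also draw on later ones.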

Results

We evaluate performance based on three metrics:

Motion Quality: We calculate the Fréchet Inception Distance (FID) between the real dance motion sequences in the AIST++ test set and 40 model-generated motion sequences, each with 1200 frames (20 seconds). We denote the FIDs based on geometric and kinetic features as FIDg and FIDk, respectively.
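For reference, the Fréchet distance between two Gaussians fitted to feature sets is commonly written as shown below. The symbols are not taken from this post: \mu_r, \Sigma_r and \mu_g, \Sigma_g are assumed to denote the mean and covariance of the features extracted from the real and generated motions, respectively.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one.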

Generation Diversity: As in previous work, to evaluate the model’s ability to generate diverse dance motions, we calculate the average Euclidean distance in feature space across 40 motions generated on the AIST++ test set, in both the geometric feature space (Distg) and the kinetic feature space (Distk).

Four different dance choreographies generated using different music (right) but the same 2-second seed motion (left). The genres of the conditioning music are Break, Ballet Jazz, Krump, and Middle Hip-Hop. The seed motion comes from hip-hop dance.
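The diversity measure described above amounts to an average pairwise distance between feature vectors. A minimal sketch, assuming each generated motion has already been reduced to a fixed-length feature vector (the feature extraction itself is not shown, and the toy vectors are made up):

```python
import math
from itertools import combinations


def average_pairwise_distance(features):
    """Mean Euclidean distance over all unordered pairs of feature vectors."""
    pairs = list(combinations(features, 2))
    if not pairs:
        raise ValueError("need at least two feature vectors")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy 2D feature vectors standing in for per-motion features.
toy_features = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
diversity = average_pairwise_distance(toy_features)
```

A larger value means the generated motions are spread further apart in feature space.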

Motion-Music Correlation: Because there is no well-designed metric to measure the correlation between input music (music beats) and generated 3D motion (kinematic beats), we propose a new metric called the Beat Alignment Score (BeatAlign).

Kinetic velocity (blue curve) and kinematic beats (green dotted line) of the generated dance motion, along with the music beats (orange dotted line). Kinematic beats are extracted by finding the local minima of the kinetic velocity curve.
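The beat extraction described in the caption can be sketched as a search for local minima. The scoring function below is only an assumed Gaussian-weighted form, and `sigma` is a hypothetical tolerance parameter. The post does not give the BeatAlign formula, so treat this as an illustration of the idea rather than the metric's definition.

```python
import math


def kinematic_beats(velocity):
    """Return frame indices that are strict local minima of the velocity curve."""
    return [
        i for i in range(1, len(velocity) - 1)
        if velocity[i] < velocity[i - 1] and velocity[i] < velocity[i + 1]
    ]


def beat_alignment(kinematic, music, sigma=3.0):
    """Average Gaussian-weighted closeness of each kinematic beat
    to its nearest music beat (assumed form, not from the post)."""
    if not kinematic or not music:
        return 0.0
    total = 0.0
    for t in kinematic:
        nearest = min(abs(t - m) for m in music)
        total += math.exp(-(nearest ** 2) / (2 * sigma ** 2))
    return total / len(kinematic)


velocity = [3.0, 1.0, 2.0, 4.0, 0.5, 3.0, 2.0, 2.5]  # toy velocity curve
beats = kinematic_beats(velocity)
score = beat_alignment(beats, music=[1, 4, 6])
```

In this form the score is 1.0 when every kinematic beat coincides with a music beat, and it decays toward 0 as the beats drift apart.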

Quantitative Evaluation

We compare the performance of FACT on each of these metrics with that of other state-of-the-art methods.

Compared to three recent state-of-the-art methods (Li et al., Dancenet, and Dance Revolution), the FACT model generates motions that are more realistic, better correlated with the input music, and more diverse when conditioned on different music. *Note that the motions generated by Li et al. are discontinuous, which makes the average kinetic feature distance abnormally high.

We also perceptually evaluate the motion-music correlation with a user study in which each participant is asked to watch 10 videos, each showing one of our results and one random counterpart, and then select the dancer who is more in sync with the music. The study consisted of 30 participants, ranging from professional dancers to people who rarely dance. Compared to each baseline, 81% preferred the FACT model output to that of Li et al., 71% preferred FACT to Dancenet, and 77% preferred FACT to Dance Revolution. Interestingly, 75% of participants preferred the unpaired AIST++ dance motion to that generated by FACT. This is not surprising, as the original dance captures are highly expressive.

Qualitative Results

Compared with prior methods such as DanceNet (left) and Li et al. (center), the 3D dance generated using the FACT model (right) is more realistic and better correlated with the input music.

More 3D dances generated using the FACT model.

Conclusion and Discussion

We present a model that can not only learn the correspondence between audio and motion, but also generate high-quality 3D motion sequences conditioned on music. Because generating 3D motion from music is a nascent area of research, we hope our work will pave the way for future cross-modal audio-to-3D-motion generation. We are also releasing AIST++, the largest 3D human dance dataset to date. This multi-view, multi-genre, cross-modal 3D motion dataset can help not only research on conditional 3D motion generation, but also research on human understanding in general. We are releasing the code in our GitHub repository, along with the trained model.

While our results show a promising direction for the problem of music-conditioned 3D motion generation, there is more to be explored. First, our approach is kinematics-based and does not reason about physical interactions between the dancer and the floor. Therefore, the global translation can lead to artifacts such as foot sliding and floating. Second, our model is currently deterministic. Exploring how to generate multiple realistic dances for each piece of music is an exciting direction.

Acknowledgments

We gratefully acknowledge the contributions of our co-authors, including Ruilong Li and David Ross. We thank Chen Sun, Austin Myers, Bryan Seybold, and Abhijit Kundu for helpful discussions. We thank Emre Aksan and Jiaman Li for sharing their code. We also thank Kevin Murphy for his early attempts in this direction, as well as Peggy Chi and Pan Chen for their help with the user study experiments.

