OmniAvatar: Innovating Full-Body Video Generation Driven by Audio

In the realm of virtual humans and digital content creation, the technology for full-body video generation driven by audio is rapidly evolving. OmniAvatar, jointly introduced by Zhejiang University and Alibaba, represents the latest breakthrough in this field. OmniAvatar generates full-body virtual human videos from audio input, addressing issues such as rigid movements and insufficient lip-sync precision in existing technologies.


Core Features and Technology

High-Precision Lip-Sync and Full-Body Motion Generation

OmniAvatar introduces a pixel-wise multi-layer audio embedding strategy that keeps lip movements tightly synchronized with the audio while generating natural, fluid full-body motion, making the resulting virtual human videos more realistic and vivid.

Multimodal Input and Fine-Grained Control

OmniAvatar supports precise control through text prompts, such as specifying the character's emotions and actions. Users can control the virtual human's emotions and actions through simple text descriptions like "a sorrowful monologue" or "an impassioned speech."
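
For illustration, a hypothetical inference call pairing audio with such a prompt might look like the sketch below; the `OmniAvatarPipeline` class, `generate` method, and parameter names are assumptions for this example, not OmniAvatar's published interface.

```python
# Hypothetical usage sketch; the API below is assumed, not OmniAvatar's official interface.
from omniavatar import OmniAvatarPipeline  # assumed package and class names

pipe = OmniAvatarPipeline.from_pretrained("OmniAvatar/OmniAvatar-14B")  # assumed checkpoint id

video = pipe.generate(
    reference_image="host.png",    # a single photo fixing the character's identity
    audio="speech.wav",            # the driving audio track
    prompt="an impassioned speech, gesturing at a podium",  # text control of emotion and action
)
video.save("output.mp4")
```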

Dynamic Interaction and Scene Adaptation

OmniAvatar supports not only natural full-body motion but also interaction between the virtual human and surrounding objects, as well as dynamic background changes. For example, guided by text prompts, the virtual human can pick up a microphone and sing, or perform against different backgrounds.

Application Scenarios

Podcasts and Interview Videos

Using a single host photo and audio, OmniAvatar can automatically generate vivid host videos, suitable for podcasts and interview programs.

E-commerce Marketing Ads

OmniAvatar supports natural interaction between characters and objects, suitable for product display. For example, a virtual human can be generated to showcase products through text prompts, enhancing the appeal of advertisements.

Virtual Singer Performances

OmniAvatar excels in singing scenarios, generating precise lip-sync and natural body movements to create realistic stage performances.

Dynamic Scene Control

OmniAvatar supports dynamic background changes controlled by text prompts, such as generating a virtual human in a moving car.

Technical Principles

Pixel-Wise Multi-Layer Audio Embedding

OmniAvatar employs a pixel-wise multi-layer audio embedding strategy, aligning audio waveform features with video frames at the pixel level to significantly improve lip-sync precision.
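
A minimal sketch of the idea, assuming per-frame audio features (e.g., from a Wav2Vec-style encoder) and latent video feature maps; the module below, its shapes, and its injection points are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PixelWiseAudioEmbedding(nn.Module):
    """Illustrative sketch: project per-frame audio features into the latent
    channel space and add them at every spatial position, at several depths."""

    def __init__(self, audio_dim: int, latent_dim: int, num_layers: int):
        super().__init__()
        # One projection per injection depth ("multi-layer").
        self.proj = nn.ModuleList(
            nn.Linear(audio_dim, latent_dim) for _ in range(num_layers)
        )

    def forward(self, latents: torch.Tensor, audio_feats: torch.Tensor, layer: int):
        # latents:     (B, T, C, H, W) video latents at a given network depth
        # audio_feats: (B, T, D) one audio feature vector per video frame
        B, T, C, H, W = latents.shape
        emb = self.proj[layer](audio_feats)              # (B, T, C)
        # Broadcast each frame's embedding over all spatial positions, so the
        # audio condition reaches every "pixel" of the latent feature map.
        return latents + emb.view(B, T, C, 1, 1).expand(B, T, C, H, W)
```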

LoRA-Based Training Method

OmniAvatar applies LoRA-based fine-tuning to the layers of the DiT model, preserving the base model's strong generative capabilities while flexibly incorporating the audio condition.
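
A minimal LoRA sketch under standard assumptions: the adapter wraps a frozen linear projection inside a DiT block; the rank, scaling, and choice of target layers are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard LoRA adapter: freeze the base weight, learn a low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep pretrained DiT weights intact
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

During fine-tuning, only the low-rank `down`/`up` matrices receive gradients, so the audio condition can be learned without overwriting the pretrained weights.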

Frame Overlapping Mechanism and Reference Image Embedding

To maintain consistency and temporal continuity in long video generation, OmniAvatar incorporates a frame overlapping mechanism and reference image embedding strategy.
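
A schematic of how such chunked generation with overlap might look; `generate_chunk`, the chunk length, and the overlap size are hypothetical placeholders for illustration.

```python
def generate_long_video(audio_chunks, reference_image, chunk_len=48, overlap=8):
    """Illustrative frame-overlap loop: each chunk is conditioned on the
    reference image (identity anchor) and on the last `overlap` frames of
    the previous chunk (temporal continuity)."""
    video, prev_tail = [], None
    for audio in audio_chunks:
        frames = generate_chunk(               # hypothetical single-chunk generator
            audio=audio,
            reference_image=reference_image,   # re-embedded to keep identity stable
            prefix_frames=prev_tail,           # overlapping frames anchor the motion
            num_frames=chunk_len,
        )
        # Drop the overlapped prefix so frames are not duplicated in the output.
        video.extend(frames if prev_tail is None else frames[overlap:])
        prev_tail = frames[-overlap:]
    return video
```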

Advantages and Limitations

Advantages

  • Natural and Fluid Motion: OmniAvatar excels at generating natural, expressive characters with fluid full-body motion.
  • High-Precision Control: Text prompts combined with LoRA training enable precise control over the virtual human's actions and expressions.
  • Wide Range of Applications: Suitable for diverse video generation scenarios, including podcasts, human-object interaction, dynamic scenes, and singing.

Limitations

  • Color Shift: The model inherits some defects from its base model, such as color shift, which can make the generated videos differ in color from real scenes.
  • Error Accumulation in Long Videos: In long video generation, accumulated errors can degrade video quality over time.
  • Limited Complex Text Control: Although text prompts are supported, the model still struggles to distinguish between speakers or handle multi-character interactions under complex instructions.
  • Long Inference Time: Diffusion inference requires many denoising steps, so generation is slow and unsuited to real-time interaction.