ByteDance has unveiled OmniHuman-1, an advanced AI system capable of transforming a single photo into a lifelike video of a person speaking, gesturing, singing, or even playing an instrument.
According to a research paper published Sunday on arXiv, OmniHuman-1 "significantly outperforms existing methods" in generating natural human movement from minimal input, particularly audio.
The system accepts images of any aspect ratio, from close-up portraits to full-body shots, and produces smooth, realistic motion across these scenarios.
ByteDance has released sample videos demonstrating the system’s capabilities, including expressive hand and body movements, animated characters, and the recreation of historical figures. One example features Albert Einstein delivering a lecture, complete with nuanced facial expressions and natural hand gestures.
Trained on more than 18,700 hours of human video data, OmniHuman-1 combines text, audio, and physical pose inputs to create highly convincing digital humans. With this development, ByteDance, the parent company of TikTok, pushes deeper into the rapidly evolving AI-generated video landscape.