Creating Generated Video
When I presented my blogging bot here a couple of months ago, a friend suggested I try to create generated video content - check out the linked video to see what I’ve been able to do so far:
It’s not quite there - the body movements are random, the eyes aren’t focused on the camera, some of the mouth movements don’t seem to match the face. I’ve added a street scene in the background to make it a bit more interesting.
Like the blogging bot, the script is generated by an OpenAI model, seeded with comments scraped from an internet forum. The audio is created with a text-to-speech tool from ElevenLabs. I experimented with a few different tools for creating an animated avatar, and settled on working with Nvidia’s Audio2Face animation model in combination with Metahuman technology in Unreal Engine, an environment popular in game development.
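The script-and-audio stage described above can be sketched in Python. The OpenAI Completions endpoint and the ElevenLabs text-to-speech endpoint are real APIs, but the keys, the voice ID, and the `build_prompt` helper here are placeholders of my own - treat this as an illustration of the flow, not my exact pipeline:

```python
import json
import urllib.request

OPENAI_KEY = "YOUR_OPENAI_KEY"      # placeholder
ELEVEN_KEY = "YOUR_ELEVENLABS_KEY"  # placeholder
VOICE_ID = "YOUR_VOICE_ID"          # placeholder; each ElevenLabs voice has an ID

def build_prompt(comments, topic):
    """Seed the model with scraped forum comments (illustrative helper)."""
    joined = "\n---\n".join(comments)
    return (
        f"Here are forum comments about {topic}:\n{joined}\n\n"
        "Write a short, friendly video script summarising the discussion."
    )

def generate_script(prompt):
    """Call the (legacy) OpenAI Completions endpoint with text-davinci-003."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/completions",
        data=json.dumps({
            "model": "text-davinci-003",
            "prompt": prompt,
            "max_tokens": 512,
        }).encode(),
        headers={
            "Authorization": f"Bearer {OPENAI_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"].strip()

def synthesize_speech(text, out_path="narration.mp3"):
    """Send the finished script to ElevenLabs' text-to-speech endpoint."""
    req = urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        data=json.dumps({"text": text}).encode(),
        headers={"xi-api-key": ELEVEN_KEY, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

The resulting audio file is what then drives the face animation downstream.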
Unlike my blogging bot, the process for creating this video is not automated. The tooling, at least for someone with my experience and resources, does not seem to lend itself to automation. It looks like this could change in the very near future - Nvidia has announced, but has not yet released, its Omniverse Avatar Cloud Engine (ACE), which looks like it could facilitate the creation of generated content. If anyone from Nvidia is reading - I’d love early access.
The Guardian reported earlier this week that Kuwait News has introduced a generated news-presenting avatar. I could envision services like LinkedIn experimenting with an avatar that presents our personal news feed as generated video. It remains to be seen if this new generation of avatars will see greater success than earlier attempts like Microsoft Office’s Clippy!
source material was scraped from comments on a recent Hacker News thread ( https://news.ycombinator.com/item?id=35573345 )
the script was written by OpenAI’s text-davinci-003 model, with an added introduction and closing text
the script was narrated by ElevenLabs’ speech synthesis (the Rachel voice)
the video was rendered using an Unreal Metahuman model (Bernice), in Unreal Engine
the Metahuman face was animated with Nvidia’s Audio2Face model
the Metahuman body was animated with motion capture obtained from https://mocaponline.com/
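The collection step in the list above can be sketched against the official Hacker News Firebase API - the item endpoint and the `kids` field are real, while the tag-stripping helper is my own illustration of cleaning the HTML that HN comments are returned with:

```python
import html
import json
import re
import urllib.request

HN_API = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def fetch_item(item_id):
    """Fetch one story or comment from the public Hacker News API."""
    with urllib.request.urlopen(HN_API.format(item_id)) as resp:
        return json.load(resp)

def clean_comment(text):
    """Strip HTML markup from a comment (illustrative helper)."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop tags like <p> and <a href=...>
    text = html.unescape(text)            # decode entities like &amp; and &#x27;
    return " ".join(text.split())         # collapse runs of whitespace

def top_level_comments(story_id, limit=20):
    """Collect the cleaned text of a story's top-level comments."""
    story = fetch_item(story_id)
    comments = []
    for kid in story.get("kids", [])[:limit]:
        item = fetch_item(kid)
        if item and not item.get("deleted") and "text" in item:
            comments.append(clean_comment(item["text"]))
    return comments
```

For the thread linked above, `top_level_comments(35573345)` would return the seed material for the prompt.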
I followed these tutorials to create the video: