Talking Avatar using Wav2Lip

A step-by-step guide to creating your own AI avatar with cutting-edge AI tools and techniques that are open source and free to use.

This AI avatar was created using MidJourney, LeiaPix Converter, ElevenLabs, ChatGPT, and Wav2Lip.

1. MidJourney or Stable Diffusion (for creating images from a prompt)

2. ChatGPT (for creating text if you want)

3. LeiaPix (for converting 2D images to 3D Lightfield images)

4. ElevenLabs Prime AI (for generating audio from text)

5. Wav2Lip (for lip-syncing video to voice/audio)

Note: Wav2Lip is available on Google Colab, but you can also download it and run it locally.

To run it on Colab, search Google for ‘wav2lip tendeepfake eng.ipynb’ and it will take you to the latest Colab notebook.



Wav2Lip is a couple of years old. As of 2023 there are newer and better alternatives for lip-sync dubbing, including open source repos that are currently maintained.

Recent Alternatives to Wav2Lip.

There's SadTalker. Unfortunately it's still basically just Wav2Lip, but there are some improvements & it's under active development.

There's also a newer approach called DiffTalk, which is essentially diffusion on top of Wav2Lip.

Now there is also this one:

In this video you will learn the basics of how to create your own ‘digital avatars’ using free and open source tools.

I'll walk you through a step-by-step process that's super easy to follow.

You can use MidJourney or any of the other free alternatives to create an image.

Then we will convert this to a 3D animation.

Next we will use a Text-to-Voice tool from ElevenLabs to convert text to speech.

We will combine everything we have worked on to create a talking Avatar using an awesome open sourced and free tool as well.

Step 1:

Create an image or use an existing one. I'm using MidJourney, which is a text-to-image generation tool. Of course, you can use any of the other AI image generators.

This is the prompt that I'm using; I will put it in the video description. This is the image that I actually selected from the previous generation. For example, here are four images, and out of those I selected this one.

One thing you want to be careful about: make sure that in the image you select, the mouth is open and the teeth are showing.

This is really important when we animate this image into a 3D avatar.

Step 2:

Next we will convert the 2D image into a 3D Object.

We will use LeiaPix Converter to add some motion to the image. This will add a 3D movement to our image. Click Upload and select the image. There are a whole bunch of options; I have a detailed video coming up on this, so I'm going to put a link to that video.

Basically, there are a couple of options that we are interested in. The first is the length of the animation (the duration). We want to select the maximum in this case, which is six seconds, and you can see that it has already started animating our image.

The next one is the type of animation that you want. For example, if I click Horizontal, it simply moves the image horizontally. I like Tall Circle, so select either Tall Circle or Vertical. Actually, Vertical might be better, so let's select Vertical.

Next, click Share. We're going to store this as an MP4 file. Essentially, what happened here is that LeiaPix converted a 2D image into a 3D video.

Click Save. I'm going to name the file output_video, and you need to make sure that the file name is output_video because we will need this specific file name later.

Click Save again to save the video. I can actually play it, so let me open it up, and here's the file.

Now we have two assets. The first is the image and the second is the video.

Step 3: ElevenLabs - Audio

Now we need some audio to use in our digital avatar. For that we will be using ElevenLabs. I have already made a video on it, so I'm going to put the link to that video somewhere at the top. It's a text-to-speech converter, and you need an account. Once you log in, you will see a screen like this. Here you can type in your text, and here you can select different voices; for example, I used this one in one of my videos before. You can also select different settings for the audio. This is going to be my text: “Welcome to Prompt Engineering channel. Learn how to create digital avatars like me with open source and absolutely free tools. Now subscribe to the channel and hit the notification button for more awesome content.”

It will probably sound better with ElevenLabs. To generate the speech, all you need to do is click Generate and wait for it. It seems to have generated the audio, so let's play it back and check that it matches the text.

Let's download this. I'm going to store it in the same folder, but I want to name it ‘input_audio’, and I need to change the format from MP3 to WAV. So let's select All and change the extension to .wav, then click Save. That will download our audio, and that completes step three. Just make sure that at the end of step three you have two files: one is input_audio.wav and the second is output_video.mp4. In step four we are going to put both of them together.
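If you would rather convert the MP3 with a tool than change the extension in the save dialog, the step above can be sketched in Python as follows. This is only a minimal sketch: it assumes ffmpeg is installed and on your PATH, and the name elevenlabs_download.mp3 is a placeholder for whatever your downloaded file is called. The helper names are made up for this example.

```python
from pathlib import Path


def build_mp3_to_wav_command(mp3_path, wav_path="input_audio.wav"):
    """Return the ffmpeg argument list that re-encodes an MP3 file as WAV."""
    # -y overwrites an existing output file; -i names the input file.
    return ["ffmpeg", "-y", "-i", str(mp3_path), str(wav_path)]


def missing_assets(folder, names=("input_audio.wav", "output_video.mp4")):
    """Return the expected file names that are not present in the folder."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


if __name__ == "__main__":
    # "elevenlabs_download.mp3" is a placeholder for the downloaded clip.
    command = build_mp3_to_wav_command("elevenlabs_download.mp3")
    print(" ".join(command))
    # To actually convert, pass the list to subprocess.run(command, check=True).
    print("Missing files:", missing_assets("."))
```

An empty "Missing files" list means both assets for step four are in the current folder under the names this guide expects.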

Step 4:

We are going to put everything together and combine the audio and the video. One option is to use tools such as D-ID. D-ID is a paid tool and it can be pretty expensive, so we’ll use Wav2Lip, which is an open source model.

Using Wav2Lip Model on Colab

The Wav2Lip model is an open source model that is available on GitHub and Google Colab. I'm going to put a link to it in the description of the video. The Wav2Lip model basically takes a video without sound and an audio file, and then lip-syncs the video to the audio.

That is the reason why, in step number two, we converted a static image into a 3D video.

If you check out the links here, they have provided the link to the paper, which is actually a fascinating read with a wealth of knowledge in it, so if you're interested, check it out. The great thing is that you can download the source code, so everything is available here and you can run it locally. They have also provided two online environments where you can test it: one is this interactive demo, and the other is a Google Colab that you can use for your own data. The interactive demo is limited in terms of how long the video can be, whereas in Google Colab you can use any length you want.
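For the local route, the Wav2Lip repository ships an inference.py script that takes --checkpoint_path, --face, and --audio arguments. A rough sketch of assembling that command for the two files from this guide is shown below. The checkpoint location checkpoints/wav2lip_gan.pth is an assumption about where you saved the downloaded weights, and the helper name is made up for this example.

```python
def build_wav2lip_command(
    face="output_video.mp4",
    audio="input_audio.wav",
    checkpoint="checkpoints/wav2lip_gan.pth",  # assumed download location
):
    """Return the argument list for running Wav2Lip's inference.py locally."""
    return [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,  # pretrained model weights
        "--face", face,                   # silent video from step two
        "--audio", audio,                 # speech from step three
    ]


if __name__ == "__main__":
    # Run the printed command from inside a clone of the Wav2Lip repository.
    print(" ".join(build_wav2lip_command()))
```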

I'm going to walk you through both processes. The interactive demo is the easiest one, so let's click on it. Here's the interactive demo. You will notice they say that if you run this locally you can attempt to lip-sync higher resolution and longer videos, while in this interactive demo there is a limit of 20 seconds. Our test audio is around 11 seconds, so it will work perfectly.
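If you want to check the audio length against the demo's 20 second limit before uploading, a small sketch with Python's standard wave module could look like this. It assumes input_audio.wav is a genuine uncompressed WAV file; the function names are made up for this example.

```python
import wave

DEMO_LIMIT_SECONDS = 20  # limit stated for the interactive demo


def wav_duration_seconds(path):
    """Return the length of an uncompressed WAV file in seconds."""
    # wave.open raises wave.Error if the file is not real PCM WAV data.
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def fits_demo_limit(path, limit=DEMO_LIMIT_SECONDS):
    """Return True if the audio is no longer than the given limit."""
    return wav_duration_seconds(path) <= limit
```

For example, fits_demo_limit("input_audio.wav") should return True for the roughly 11 second clip used here.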

First you need to select the video file, so here is our video file, and then you need to select your audio file. They don't have to be exactly the same length. In this case the video file is around six seconds long, whereas our audio file is around 11 seconds long, so it doesn't matter if they are not the same length. Then click on ‘Sync this pair’.

Here are the results. Just a couple of things to note: it's an interactive demo, so the results are limited to 480p resolution (if you run it locally you can get a higher resolution video), and the output is capped at a maximum of 20 seconds. It did a pretty good job of syncing the lips and the audio, and it looks quite natural. There are no eye movements, because our input video didn't have any eye movements, but if you look closely there are minute movements of the cheeks as well, so I think it's pretty natural.

You will also notice that, because of step number two where we converted the image into a 3D video, it has natural motion as well, like the movements of the head that we would expect.

Let me download this file. We're going to do one bonus step that I'm calling step number five, but before that I also want to show you how you can run this on a much larger file using Google Colab. If you're not interested, just skip ahead to the next section.

Running it in Google Colab Notebook

We're going to be using this updated Google Colab, so click on it. If you are not familiar with Google Colab, it's an environment provided by Google where you can run different machine learning models. You need to do a couple of steps first. Come here, change the runtime type, make sure the GPU is selected, and click Save. Then come here and hit Connect. There are different code blocks that you need to run; anything that has this play button is a code block.
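To confirm from inside the notebook that a GPU runtime is actually attached, one option is a quick check like the sketch below, pasted into a new code cell. It assumes the NVIDIA nvidia-smi utility is present on GPU runtimes, which is typical but not guaranteed; the function name is made up for this example.

```python
import shutil
import subprocess


def gpu_runtime_available():
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # the utility is missing, so assume no GPU runtime
    # "nvidia-smi -L" prints one line per detected GPU.
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("GPU runtime detected:", gpu_runtime_available())
```

If this prints False, go back to the runtime settings and select the GPU before running the other code blocks.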

Let's first click on this play button, which will install the Wav2Lip model. It will ask you whether to run it and may give you a warning; just click ‘Run anyway’.

It basically starts downloading all the required Python files as well as all the libraries that are needed, and then it installs them. This process will take a few seconds, so just be patient. When a code block finishes executing, you will see this green check mark.

Now let's click on the second play button. This will ask us to upload our files, so select input_audio and output_video and click Open. It will upload both files. Once the upload is done, one thing you will notice is that there's a table of contents, and for each step that you are in it shows the corresponding documentation, which is pretty nice.

Next we will run the Wav2Lip model on our video and audio.

Click this play button again and it starts executing the code. This will take a minute or so. Once the execution of this block is complete, go to the next one, which plays the synced video. Click on it, and here's the video. The result seems to be very similar to what we just saw with the interactive demo.

Next we want to download the file, so we just run this block and it will download it locally. That's our downloaded file, so let's play it to make sure that it works. It works, so I'm not going to play it again and I'll just close it.

Next, run block number five. It deletes all the files from your Google Drive, because you already have the result downloaded to your computer. There are some other code blocks that you can run; these are different variations of the model that you can play around with, but I'm not going to do that. The output resolution is 480p, but you can use tools like Topaz Labs to improve the output resolution and take it to full HD or even 4K.

Step 5:

One last step that I'm going to do, which will help us with the resolution, is running the result through a thin plate spline model. In one of my previous videos I showed this thin plate spline motion model; I'll put a link to the video. Basically, the idea is that you have an image and a video, and you animate the image using the video, so it's a sort of deepfake. I have found that it actually helps improve the resolution of the output.

Our input image is going to be the original image that was created with MidJourney, and our driving video is the output video that we just created. Then simply click Submit. This will take a few minutes. Here is our outcome. It works on face detection, which is why it cropped the video. Let's click Download, and I'm going to save this as ‘output’.

Here is our final output. This one is from the Wav2Lip model, and you can see that the lips are a little bit blurred; this one is from the final step, the thin plate spline method. The difference may not be very noticeable, but I like to run it through this final step.

I hope you found the video helpful. If you did, please consider subscribing and turning on the bell notification button. Thanks for watching, and see you in the next one.
