Vegeta Avatar
Building a voice + video spokesperson
A passion project combining AI and creativity to bring Vegeta from Dragon Ball Z to life.

Introduction
This project started with a blank wall.
I was preparing for an upcoming episode of my podcast, a personal format where I paint portraits while having open conversations with guests. Only this time, I couldn't think of anyone I wanted to invite.
I wanted someone with presence and a truly unique take. Someone who wouldn't hold back.
And that’s when the idea hit me:
What if I brought Vegeta, my childhood hero, as a digital guest?
But there was something deeper too. I’ve always struggled with self-promotion. I build a lot, but rarely put it out there. So I thought: if I can’t talk about my work, maybe I can build someone who will be confident, brutally honest, and unfiltered.
Who better than Vegeta?
What began as a curiosity became an engineered voice-and-video AI agent that could speak on my behalf.
Try the demo here.
Project Goals & Constraints
From the beginning, the vision was ambitious:
- A real-time agent powered by WebRTC, capable of face-to-face interaction.
- No extensive reliance on external paid APIs.
- Minimal deployment cost (HF Spaces Pro at $9/month).
Voice Generation with VITS
For Vegeta's voice, I trained a VITS text-to-speech model on his audio clips.
Why VITS?
- Lower total latency
- End-to-end synthesis: no separate vocoder stage, unlike two-stage TTS systems
- Captures speaker identity more naturally
The initial model (V1) was trained with fewer steps and had a noticeably robotic, choppy tone. For this updated version, I trained for more steps, which led to better convergence and improved fluency. It still isn't perfect: because of limited dataset quality, especially noisy game-derived audio full of yelling and power-ups, some pronunciations come out as unintentional shouting.
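For reference, inference with the fine-tuned checkpoint boils down to a single text-to-waveform call. The sketch below assumes a Coqui-TTS-style VITS checkpoint; the paths and example line are placeholders, not the actual project files:

```python
from TTS.api import TTS  # Coqui TTS; assumed training/inference framework

# Placeholder paths for the fine-tuned Vegeta VITS checkpoint and its config.
tts = TTS(
    model_path="vegeta_vits/best_model.pth",
    config_path="vegeta_vits/config.json",
    gpu=True,
)

# Single-stage synthesis: text in, waveform out, with no separate vocoder step.
tts.tts_to_file(
    text="You call that training? Pathetic.",
    file_path="vegeta_reply.wav",
)
```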
Dataset Challenges
- Limited clean audio
- Includes anime + game voice clips (with screaming)
Phoneme Issues
The model uses espeak-ng for phoneme generation, which assumes a US English pronunciation baseline. Indian names or non-standard words aren't handled well. I implemented some fixes via manual overrides, but the long-term solution is to enrich the phoneme dictionary with examples.
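As a rough illustration, the overrides amount to a small substitution table applied to the text before it reaches espeak-ng; the entries below are made-up examples, not my actual table:

```python
import re

# Hypothetical override table: respell words that espeak-ng's US English
# rules mispronounce (Indian names, made-up anime terms, etc.).
PRONUNCIATION_OVERRIDES = {
    "Kakarot": "Kah-kah-rot",
    "Saiyan": "Say-an",
}

def apply_overrides(text: str) -> str:
    """Swap known-problem words for phonetic respellings before phonemization."""
    for word, respelling in PRONUNCIATION_OVERRIDES.items():
        text = re.sub(rf"\b{re.escape(word)}\b", respelling, text, flags=re.IGNORECASE)
    return text

print(apply_overrides("Kakarot is a low-class Saiyan."))
# -> "Kah-kah-rot is a low-class Say-an."
```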
XTTS as Future Option
XTTS may eventually replace VITS for real-time voice use cases. It supports cross-lingual cloning and more natural-sounding prosody. For now, in the tests I ran, VITS isn't a latency bottleneck, so I'm sticking with it.
Lip-Sync Video Generation
Initially, I wasn’t even aware that real-time lip-sync models conditioned on audio existed. Most open-source avatar models leaned toward diffusion pipelines for high-quality outputs, but they weren't suitable for real-time response. Eventually, I found MuseTalk, a GAN-based model that fits my use case perfectly.
Why MuseTalk
- Fast generation
- Accepts audio as a conditioning signal
- Uses a base video with subtle head movement
Unlike diffusion models, which require multiple iterative steps, MuseTalk generates lip-synced frames in latent space in a single forward pass. It outputs only lip and cheek movement, but for real-time agents this tradeoff works.
How It Works
- Uses a Whisper model to chunk and extract features from audio
- Predicts lip-synced frames for that chunk
- I then stitch all frames into a complete video response
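Put together, the per-response loop looks roughly like the sketch below. `whisper_chunks` and `lipsync_fn` stand in for MuseTalk's own audio feature extraction and inference, which I'm treating as black boxes here; only the frame stitching is spelled out:

```python
import cv2
import numpy as np

def stitch_response(whisper_chunks, base_frames, lipsync_fn, out_path, fps=25):
    """Generate lip-synced frames per audio chunk and stitch them into one video.

    whisper_chunks: per-chunk audio features (MuseTalk's Whisper-based extractor)
    base_frames:    frames from the base video with subtle head movement
    lipsync_fn:     wrapper around MuseTalk inference, (frame, features) -> frame
    """
    synced = [lipsync_fn(frame, feats) for frame, feats in zip(base_frames, whisper_chunks)]

    # Write all predicted frames into a single video response.
    h, w = synced[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in synced:
        writer.write(frame.astype(np.uint8))
    writer.release()
    # The audio track gets muxed back in afterwards (e.g. with ffmpeg).
```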
Challenges
- Only mouth and cheek movement (eyes are static)
- Occasional jitter depending on input length
Deployment Constraints
Why Hugging Face Spaces?
- Simple interface for deploying demos
- Free GPU for short bursts
- Cost capped at $9/month with Pro
Key Limitations
- TTS hosted on S3 + SageMaker (invoked as sketched below): prone to crashing due to memory constraints
- Gradio streaming bug: with the Space pinned to gradio==5.1.0, streamed video doesn't retain audio after the first response
- ZeroGPU builds are broken in latest Gradio versions, so I can’t upgrade until Hugging Face Spaces fully supports it
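For context, the Space reaches that hosted TTS with a plain endpoint call, roughly like the sketch below; the endpoint name and JSON payload shape are assumptions, not the actual contract:

```python
import json
import boto3

# Hypothetical endpoint name; the real one lives in the Space's secrets.
ENDPOINT_NAME = "vegeta-vits-tts"

runtime = boto3.client("sagemaker-runtime")

def synthesize(text: str) -> bytes:
    """Send text to the SageMaker TTS endpoint and return raw audio bytes."""
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"text": text}),
    )
    return response["Body"].read()
```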
These constraints are frustrating. I know the demo isn't as smooth as it could be, but I have to work within them for now.
What's Next: WebRTC & Beyond
v3: Real-Time WebRTC Agent
The original vision is still alive: a live avatar agent you can talk to in a video call.
- Working through timestamp-sync issues between audio and frames, and on smooth video streaming (see the sketch below)
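To make the sync issue concrete, here's a back-of-the-envelope sketch (my own illustration, not the streaming code): both streams carry their own clock, and the drift between them is what has to stay near zero.

```python
SAMPLE_RATE = 24_000  # audio samples per second (assumed for the TTS output)
FPS = 25              # video frames per second (assumed for MuseTalk output)

def audio_clock(samples_sent: int) -> float:
    """Presentation time of the audio stream, in seconds."""
    return samples_sent / SAMPLE_RATE

def video_clock(frames_sent: int) -> float:
    """Presentation time of the video stream, in seconds."""
    return frames_sent / FPS

# If drift grows, frames must be dropped or duplicated (or audio padded)
# so lips and voice stay aligned over WebRTC.
drift = audio_clock(48_000) - video_clock(49)  # 2.00s - 1.96s = 0.04s of drift
```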
Smarter Animation
MuseTalk only animates lips. I'd like to:
- Train a model that animates eyes, brows, and head tilt based on the sentiment of the audio or text
- Improve liveliness and expressiveness of the avatar
Final Thoughts
This project taught me more than just audio and video pipelines.
It taught me how to push through the frustration of broken toolchains, unclear documentation, and GPU cost ceilings.
Building an AI spokesperson based on Vegeta wasn’t just about showing off tech.
It was a way for me to confront something personal: the discomfort of talking about my own work.
I’m proud of what I’ve built so far, and I’m excited to see where it leads next.
If you’re working on avatar agents, GenAI systems, or building real-time ML infra, feel free to reach out.
I’d love to learn from your experience or jam on something together.