Vegeta Avatar
Building a voice + video spokesperson
A passion project combining AI and creativity to bring Vegeta from Dragon Ball Z to life.

Introduction
This project started with a blank wall.
I was preparing for an upcoming episode of my podcast, a personal format where I paint portraits while having open conversations with guests. Only this time, I couldn't think of anyone I wanted to invite.
I wanted someone with presence and a truly unique take. Someone who wouldn't hold back.
And that’s when the idea hit me:
What if I brought Vegeta, my childhood hero, as a digital guest?
But there was something deeper too. I’ve always struggled with self-promotion. I build a lot, but rarely put it out there. So I thought: if I can’t talk about my work, maybe I can build someone who will be confident, brutally honest, and unfiltered.
Who better than Vegeta?
What began as a curiosity became an engineered voice-and-video AI agent that could speak on my behalf.
Try the demo here.
Project Goals & Constraints
From the beginning, the vision was ambitious:
- A real-time agent powered by WebRTC, capable of face-to-face interaction.
- No extensive reliance on external paid APIs.
- Minimal deployment cost (HF Spaces Pro at $9/month).
Voice Generation with VITS
For Vegeta's voice, I trained a VITS text-to-speech model on his audio clips.
Why VITS?
- Lower total latency
- End-to-end synthesis: no separate vocoder stage, unlike two-stage TTS systems
- Captures speaker identity more naturally
The initial model (V1) was trained with fewer steps and had a noticeably robotic, choppy tone. For this updated version, I trained for more steps, which led to better convergence and improved fluency. It still isn't perfect: because of limited dataset quality, especially noisy game-derived audio full of yelling and power-ups, some pronunciations come out as unintentional shouting.
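For reference, inference with the fine-tuned checkpoint boils down to a single text-to-waveform call. The sketch below assumes a Coqui-TTS-style VITS checkpoint; the paths and example line are placeholders, not the actual project files:

```python
from TTS.api import TTS  # Coqui TTS; assumed training/inference framework

# Placeholder paths for the fine-tuned Vegeta VITS checkpoint and its config.
tts = TTS(
    model_path="vegeta_vits/best_model.pth",
    config_path="vegeta_vits/config.json",
    gpu=True,
)

# Single-stage synthesis: text in, waveform out, with no separate vocoder step.
tts.tts_to_file(
    text="You call that training? Pathetic.",
    file_path="vegeta_reply.wav",
)
```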
Dataset Challenges
- Limited clean audio
- Includes anime + game voice clips (with screaming)
Phoneme Issues
The model uses espeak-ng for phoneme generation, which assumes a US English pronunciation baseline. Indian names or non-standard words aren't handled well. I implemented some fixes via manual overrides, but the long-term solution is to enrich the phoneme dictionary with examples.
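As a rough illustration, the overrides amount to a small substitution table applied to the text before it reaches espeak-ng; the entries below are made-up examples, not my actual table:

```python
import re

# Hypothetical override table: respell words that espeak-ng's US English
# rules mispronounce (Indian names, made-up anime terms, etc.).
PRONUNCIATION_OVERRIDES = {
    "Kakarot": "Kah-kah-rot",
    "Saiyan": "Say-an",
}

def apply_overrides(text: str) -> str:
    """Swap known-problem words for phonetic respellings before phonemization."""
    for word, respelling in PRONUNCIATION_OVERRIDES.items():
        text = re.sub(rf"\b{re.escape(word)}\b", respelling, text, flags=re.IGNORECASE)
    return text

print(apply_overrides("Kakarot is a low-class Saiyan."))
# -> "Kah-kah-rot is a low-class Say-an."
```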
XTTS as Future Option
XTTS may eventually replace VITS for real-time voice use cases. It supports cross-lingual cloning and more natural-sounding prosody. For now, in the tests I ran, VITS isn't a latency bottleneck, so I'm sticking with it.
Lip-Sync Video Generation
Initially, I wasn’t even aware that real-time lip-sync models conditioned on audio existed. Most open-source avatar models leaned toward diffusion pipelines for high-quality outputs, but they weren't suitable for real-time response. Eventually, I found MuseTalk, a GAN-based model that fits my use case perfectly.
Why MuseTalk
- Fast generation
- Accepts audio as a conditioning signal
- Uses a base video with subtle head movement
Unlike diffusion models, which require multiple iterative steps, MuseTalk generates lip-synced frames in latent space in a single forward pass. It outputs only lip and cheek movement, but for real-time agents this tradeoff works.
How It Works
- Uses a Whisper model to chunk and extract features from audio
- Predicts lip-synced frames for that chunk
- I then stitch all frames into a complete video response
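Put together, the per-response loop looks roughly like the sketch below. `whisper_chunks` and `lipsync_fn` stand in for MuseTalk's own audio feature extraction and inference, which I'm treating as black boxes here; only the frame stitching is spelled out:

```python
import cv2
import numpy as np

def stitch_response(whisper_chunks, base_frames, lipsync_fn, out_path, fps=25):
    """Generate lip-synced frames per audio chunk and stitch them into one video.

    whisper_chunks: per-chunk audio features (MuseTalk's Whisper-based extractor)
    base_frames:    frames from the base video with subtle head movement
    lipsync_fn:     wrapper around MuseTalk inference, (frame, features) -> frame
    """
    synced = [lipsync_fn(frame, feats) for frame, feats in zip(base_frames, whisper_chunks)]

    # Write all predicted frames into a single video response.
    h, w = synced[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in synced:
        writer.write(frame.astype(np.uint8))
    writer.release()
    # The audio track gets muxed back in afterwards (e.g. with ffmpeg).
```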
Challenges
- Only mouth and cheek movement (eyes are static)
- Occasional jitter depending on input length
Deployment Constraints
Why Hugging Face Spaces?
- Simple interface for deploying demos
- Free GPU for short bursts
- Cost capped at $9/month with Pro
Key Limitations
- TTS hosted on S3 + SageMaker (invoked as sketched below): prone to crashing due to memory constraints
- Gradio streaming bug: with the Space pinned to gradio==5.1.0, streamed video doesn't retain audio after the first response
- ZeroGPU builds are broken in latest Gradio versions, so I can’t upgrade until Hugging Face Spaces fully supports it
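For context, the Space reaches that hosted TTS with a plain endpoint call, roughly like the sketch below; the endpoint name and JSON payload shape are assumptions, not the actual contract:

```python
import json
import boto3

# Hypothetical endpoint name; the real one lives in the Space's secrets.
ENDPOINT_NAME = "vegeta-vits-tts"

runtime = boto3.client("sagemaker-runtime")

def synthesize(text: str) -> bytes:
    """Send text to the SageMaker TTS endpoint and return raw audio bytes."""
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"text": text}),
    )
    return response["Body"].read()
```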
These constraints are frustrating. I know the demo isn't as smooth as it could be, but I have to work within them for now.
What's Next: WebRTC & Beyond
v3: Real-Time WebRTC Agent
The original vision is still alive: a live avatar agent you can talk to in a video call.
- Working through timestamp-sync issues between audio and frames, and on smooth video streaming (see the sketch below)
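To make the sync issue concrete, here's a back-of-the-envelope sketch (my own illustration, not the streaming code): both streams carry their own clock, and the drift between them is what has to stay near zero.

```python
SAMPLE_RATE = 24_000  # audio samples per second (assumed for the TTS output)
FPS = 25              # video frames per second (assumed for MuseTalk output)

def audio_clock(samples_sent: int) -> float:
    """Presentation time of the audio stream, in seconds."""
    return samples_sent / SAMPLE_RATE

def video_clock(frames_sent: int) -> float:
    """Presentation time of the video stream, in seconds."""
    return frames_sent / FPS

# If drift grows, frames must be dropped or duplicated (or audio padded)
# so lips and voice stay aligned over WebRTC.
drift = audio_clock(48_000) - video_clock(49)  # 2.00s - 1.96s = 0.04s of drift
```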
Smarter Animation
MuseTalk only animates lips. I'd like to:
- Train a model that animates eyes, brows, and head tilt based on the sentiment of the audio or text
- Improve liveliness and expressiveness of the avatar
Final Thoughts
This project taught me more than just audio and video pipelines.
It taught me how to push through the frustration of broken toolchains, unclear documentation, and GPU cost ceilings.
Building an AI spokesperson based on Vegeta wasn’t just about showing off tech.
It was a way for me to confront something personal: the discomfort of talking about my own work.
I’m proud of what I’ve built so far, and I’m excited to see where it leads next.
If you’re working on avatar agents, GenAI systems, or building real-time ML infra, feel free to reach out.
I’d love to learn from your experience or jam on something together.