🟡 LOW TIER — Learning & Testing
Basic Digital Clone Pipeline
Best for: Beginners, testing, learning ComfyUI. Not for client work.
Input: Upload a single portrait photo, type your script text into ComfyUI, and supply the short voice sample used for cloning. No audio cleaning at this tier.
Voice Cloning (XTTS v2): Coqui XTTS v2 clones a voice from a 6-second audio sample. Free, open-source, and runs on your own RunPod pod instead of a paid API. Good quality but not perfect.
Face Animation (SadTalker): Animates a still photo to match the audio. Older model: it works, but produces slightly robotic head movements.
Lip Sync (Wav2Lip): Syncs lip movements to the cloned voice audio. Basic accuracy, with visible artifacts around the mouth, especially at higher resolutions.
Face Restoration (GFPGAN): Restores face quality after lip sync distortion. Older model with a noticeable smoothing effect on skin.
Auto Captions (Whisper): OpenAI Whisper transcribes the audio and generates an SRT subtitle file automatically. Free and highly accurate.
Final Assembly (FFmpeg): Merges video + audio + captions into the final MP4. No 4K upscaling at this tier, so output is 720p max.
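The captions step can be sketched in plain Python. This is a minimal illustration, assuming the transcription arrives as a list of segments with `start`, `end`, and `text` fields (the shape of the `segments` list in an openai-whisper result). The function names and sample segments are hypothetical, not part of any ComfyUI node.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn a list of {"start", "end", "text"} dicts into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Hypothetical segments, shaped like Whisper's "segments" output.
example = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 61.25, "text": " This is my digital clone."},
]
print(segments_to_srt(example))
```

Writing the returned string to a `.srt` file gives FFmpeg a caption track to merge in the final step.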
⚙️ RunPod Setup — Low Tier
1. Create RunPod Account
Go to runpod.io → Sign up → Add $10 credit to start
2. Deploy ComfyUI Template
Pods → + New Pod → Search "ComfyUI" template → Select RTX 3080 (cheapest)
3. Install Custom Nodes via ComfyUI Manager
Open ComfyUI → Manager → Install Missing Nodes
Nodes to install:
→ ComfyUI-SadTalker
→ ComfyUI-Wav2Lip
→ ComfyUI-GFPGAN
→ ComfyUI-XTTS
→ ComfyUI-Whisper
4. Stop the Pod When Not in Use
⚠️ Always stop your pod after use — you are billed per hour even when idle!
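To see why idle time matters, here is a rough cost sketch. The hourly rate is an assumption for illustration (~$0.44/hr, a typical RTX 3080 price), and the hour counts are made-up usage patterns, not measurements.

```python
def pod_cost(rate_per_hour: float, hours: float) -> float:
    """Estimated bill in dollars for a pod running the given number of hours."""
    return round(rate_per_hour * hours, 2)


RATE = 0.44  # assumed hourly rate in USD; check the current RunPod price

# Hypothetical usage: 20 hours of actual work in a month.
stopped_when_idle = pod_cost(RATE, 20)

# Same month, but the pod is never stopped (24 hours x 30 days).
left_running = pod_cost(RATE, 24 * 30)

print(f"Stopped when idle: ${stopped_when_idle}")
print(f"Left running:      ${left_running}")
```

Under these assumptions the always-on pod costs roughly 36 times more than the one that is stopped after each session.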
🔵 MID TIER ⭐ — Professional & Sellable
Professional Digital Clone Pipeline
Best for: Freelancers, agencies, selling video services. Fully professional output.
GPU (RunPod): RTX 3080/3090
Input: Upload a portrait photo + script text + an optional raw audio recording (even a noisy mic is fine, because Demucs cleans it).
Audio Cleaning (Demucs): Meta's Demucs separates the voice from background noise. Reduces room echo, keyboard clicks, and fan noise for studio-clean audio.
Voice Cloning (XTTS v2): Same XTTS v2 as Low Tier, but now fed with Demucs-cleaned audio, which produces a significantly better voice clone.
Face Animation (LivePortrait): Major upgrade from SadTalker. LivePortrait produces natural head movements, eye blinks, and micro-expressions. Looks like a real person talking.
Lip Sync (MuseTalk): Major upgrade from Wav2Lip. MuseTalk produces near-perfect lip sync with few visible artifacts. Handles fast speech and multiple languages.
Face Restoration (CodeFormer): Upgrade from GFPGAN. CodeFormer preserves natural skin texture, pores, and hair detail. No more plastic-skin effect.
4K Upscaling (Real-ESRGAN): Upscales video from 720p to 4K resolution. Enhances sharpness and removes compression artifacts. Output is crisp on any screen.
AI Background (SDXL/FLUX): Generates a professional static AI background (office, studio, outdoor). The avatar is composited onto the background using chroma key or segmentation.
Auto Captions (Whisper): Same Whisper as Low Tier, generating accurate SRT captions. At Mid Tier, captions are styled and burned into the video automatically.
Final Assembly (FFmpeg): Merges all layers: avatar video + AI background + cleaned audio + styled captions → final 4K MP4 output. At this tier you start each render manually by clicking Queue Prompt in ComfyUI.
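The final assembly step can be sketched as an FFmpeg command built in Python. This is an illustration under stated assumptions, not the exact command the FFmpeg node runs: the file names are hypothetical, the avatar is assumed to be already composited onto its background, and captions are burned in with FFmpeg's `subtitles` filter. A 4x upscale of 1280x720 gives 5120x2880, so the sketch scales the frame to 3840x2160 for a standard 4K output.

```python
import shlex


def build_ffmpeg_command(video, audio, captions, output, width=3840, height=2160):
    """Build an FFmpeg argument list that scales, burns captions, and muxes audio."""
    # Scale first, then draw the subtitles on the final-size frame.
    video_filter = f"scale={width}:{height}:flags=lanczos,subtitles={captions}"
    return [
        "ffmpeg", "-y",
        "-i", video,          # input 0: upscaled, composited avatar video
        "-i", audio,          # input 1: cleaned voice track
        "-map", "0:v:0",      # take video from input 0
        "-map", "1:a:0",      # take audio from input 1
        "-vf", video_filter,
        "-c:v", "libx264",
        "-c:a", "aac",
        "-shortest",          # stop at the shorter of the two streams
        output,
    ]


# Hypothetical file names for illustration only.
cmd = build_ffmpeg_command("avatar.mp4", "voice.wav", "captions.srt", "final.mp4")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it on a machine with FFmpeg installed.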
⚙️ RunPod Setup — Mid Tier
1. Select RTX 3080 (10GB VRAM) or RTX 3090 (24GB VRAM)
RunPod → New Pod → RTX 3080 (~$0.44/hr) or RTX 3090 (~$0.74/hr). Use a Network Volume for persistent storage.
2. Deploy ComfyUI + Install Manager
Template: ComfyUI Official
Volume: 60–80GB Network Volume
Port: 8188 (ComfyUI UI)
Port: 22 (SSH access)
3. Install All Custom Nodes
Via ComfyUI Manager → Install:
→ ComfyUI-LivePortrait
→ ComfyUI-MuseTalk
→ ComfyUI-CodeFormer
→ ComfyUI-Real-ESRGAN
→ ComfyUI-XTTS
→ ComfyUI-Demucs
→ ComfyUI-Whisper
→ ComfyUI-FLUX (for backgrounds)
→ ComfyUI-FFmpeg-Node
4. Download Models to Network Volume
Models needed:
→ XTTS v2 weights (~1.8GB)
→ LivePortrait weights (~1.2GB)
→ MuseTalk weights (~900MB)
→ CodeFormer weights (~330MB)
→ Real-ESRGAN x4plus (~67MB)
→ FLUX.1-dev (~23GB) or SDXL (~7GB)
→ Whisper large-v3 (~3GB)
5. Load Workflow JSON → Click Queue Prompt
Import the pipeline workflow JSON → connect all nodes → upload your photo + script → click Queue Prompt → wait ~15 mins → download output.
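As a quick sanity check on the Network Volume size, the approximate model sizes listed in step 4 can be added up. The figures are the rough sizes quoted above; the dictionary and function names are only illustrative.

```python
# Approximate download sizes in GB, taken from the model list in step 4.
MODEL_SIZES_GB = {
    "XTTS v2": 1.8,
    "LivePortrait": 1.2,
    "MuseTalk": 0.9,
    "CodeFormer": 0.33,
    "Real-ESRGAN x4plus": 0.067,
    "Whisper large-v3": 3.0,
}

# Only one background model is required.
BACKGROUND_MODELS_GB = {"FLUX.1-dev": 23.0, "SDXL": 7.0}


def total_size_gb(background_model: str) -> float:
    """Total model storage in GB for the chosen background model."""
    base = sum(MODEL_SIZES_GB.values())
    return round(base + BACKGROUND_MODELS_GB[background_model], 2)


print("With FLUX.1-dev:", total_size_gb("FLUX.1-dev"), "GB")
print("With SDXL:", total_size_gb("SDXL"), "GB")
```

With FLUX.1-dev the models come to roughly 30GB, so a 60–80GB volume leaves room for custom nodes, inputs, and rendered videos.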
🟢 HIGH TIER — Agency & Scale
Full Agency Pipeline + B-Roll + Automation
Best for: Agencies, high-volume production, premium client work, zero manual work.
Auto Trigger (n8n): A client fills in a form, sends an email, or places an order → n8n detects it automatically → sends all inputs to ComfyUI via API. Zero manual clicking.
Audio Cleaning (Demucs): Same as Mid Tier. Demucs cleans audio automatically as part of the n8n-triggered pipeline.
Voice Cloning (ElevenLabs): Optional upgrade from XTTS v2. ElevenLabs produces an ultra-realistic, emotionally expressive voice that is often hard to distinguish from a human recording. Creator plan, $22/mo.
Face Animation (LivePortrait): Same LivePortrait as Mid Tier, but running on an A100 GPU, so rendering is 3–5x faster.
Lip Sync (MuseTalk): Same MuseTalk as Mid Tier, faster on A100. Same near-perfect lip sync output.
Face Restoration (CodeFormer): Same CodeFormer as Mid Tier, faster on A100.
4K Upscaling (Real-ESRGAN): Same Real-ESRGAN as Mid Tier, faster on A100.
AI Background (FLUX + ControlNet): Upgrade from basic SDXL. FLUX.1 + ControlNet + IP-Adapter produces photorealistic, cinematic backgrounds that look like a real film set rather than AI output.
B-Roll Video (AnimateDiff): New at High Tier. When the avatar talks about a "beach vacation", AnimateDiff generates matching moving beach footage as B-roll. The edit cuts between avatar and B-roll scenes automatically.
Auto Captions (Whisper): Same Whisper. Styled captions are burned in automatically as part of the pipeline.
Final Assembly (FFmpeg): Same FFmpeg as Mid Tier. Merges all layers, including B-roll scenes. Triggered automatically by n8n, not manually.
Auto Delivery (n8n): n8n detects that rendering is complete → uploads to Google Drive → posts to YouTube/TikTok → emails the client → logs to Notion → sends you a Telegram notification. All automatic.
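The "send to ComfyUI via API" step can be sketched as follows. ComfyUI accepts a workflow at its `/prompt` endpoint as a JSON body with a `prompt` field. This sketch assumes the workflow was exported in API format, that ComfyUI is reachable on port 8188, and that n8n (or any script) performs an equivalent HTTP POST. The one-node workflow and the `client_id` value are placeholders, not a working pipeline.

```python
import json
import urllib.request


def build_prompt_request(workflow: dict, client_id: str,
                         base_url: str = "http://127.0.0.1:8188"):
    """Build the POST request that queues a workflow in ComfyUI."""
    payload = json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_prompt(workflow: dict, client_id: str,
                 base_url: str = "http://127.0.0.1:8188") -> str:
    """Send the workflow and return the prompt_id ComfyUI assigns to it."""
    request = build_prompt_request(workflow, client_id, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["prompt_id"]


# Placeholder workflow in API format: node id -> class_type + inputs.
example_workflow = {
    "1": {"class_type": "LoadImage", "inputs": {"image": "portrait.png"}},
}
request = build_prompt_request(example_workflow, client_id="n8n-order-001")
print(request.get_method(), request.full_url)
```

In n8n the same call is usually an HTTP Request node pointed at the pod's public address, with the client's photo, script, and audio substituted into the workflow JSON before sending.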
🤖 n8n Automation Flow
📋 Client Form → n8n Detects → Send to ComfyUI → Pipeline Runs → Video Rendered → Upload Drive → Post Social → Notify Client
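The hand-off from "Pipeline Runs" to "Video Rendered" needs some way to detect completion. One common approach, sketched here as an assumption rather than the exact n8n setup, is to poll ComfyUI's `/history/<prompt_id>` endpoint until the prompt appears in the response. The polling interval and timeout are arbitrary, and a failed run can also appear in the history, so a production flow should inspect the entry before delivering.

```python
import json
import time
import urllib.request


def is_render_complete(history: dict, prompt_id: str) -> bool:
    """A prompt shows up in the history response once ComfyUI has finished it."""
    return prompt_id in history


def fetch_history(prompt_id: str, base_url: str = "http://127.0.0.1:8188") -> dict:
    """Fetch the history entry for one prompt from ComfyUI."""
    with urllib.request.urlopen(f"{base_url}/history/{prompt_id}") as response:
        return json.loads(response.read())


def wait_for_render(prompt_id: str, base_url: str = "http://127.0.0.1:8188",
                    interval_s: float = 10.0, timeout_s: float = 1800.0) -> dict:
    """Poll until the prompt is in the history, then return its entry."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        history = fetch_history(prompt_id, base_url)
        if is_render_complete(history, prompt_id):
            return history[prompt_id]
        time.sleep(interval_s)
    raise TimeoutError(f"Prompt {prompt_id} did not finish within {timeout_s} seconds")
```

The returned entry lists the output files, which the delivery steps (Drive upload, social posting, client email) can then pick up.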
📊 FULL COMPARISON
All Three Tiers — Side by Side
Every tool, every step, every cost — compared honestly.
| Step / Tool | 🟡 Low Tier | 🔵 Mid Tier ⭐ | 🟢 High Tier |
| --- | --- | --- | --- |
| Auto Trigger | ❌ Manual | ❌ Manual | ✅ n8n |
| Audio Cleaning | ❌ None | ✅ Demucs | ✅ Demucs |
| Voice Cloning | ⚠️ XTTS v2 | ✅ XTTS v2 | ✅✅ ElevenLabs* |
| Face Animation | ⚠️ SadTalker | ✅ LivePortrait | ✅ LivePortrait |
| Lip Sync | ⚠️ Wav2Lip | ✅ MuseTalk | ✅ MuseTalk |
| Face Restoration | ⚠️ GFPGAN | ✅ CodeFormer | ✅ CodeFormer |
| 4K Upscaling | ❌ None | ✅ Real-ESRGAN | ✅ Real-ESRGAN |
| AI Background | ❌ None | ✅ SDXL/FLUX | ✅✅ FLUX+ControlNet |
| B-Roll Video | ❌ None | ❌ None | ✅ AnimateDiff |
| Auto Captions | ✅ Whisper | ✅ Whisper | ✅ Whisper |
| Final Assembly | ✅ FFmpeg | ✅ FFmpeg | ✅ FFmpeg |
| Auto Delivery | ❌ Manual | ❌ Manual | ✅ n8n |
| GPU | RTX 3080 | RTX 3080/3090 | A100 / H100 |
| Render Speed | ~20 min/video | ~15 min/video | ~3–5 min/video |
| Output Quality | ⚠️ Basic | ✅ Professional | ✅✅ Agency |
| Paid APIs | $0 | $0 | $0–120+/mo* |
| RunPod Cost | $5–15/mo | $40–70/mo | $150–300/mo |
| Total Cost | $5–15/mo | $40–70/mo | $150–420/mo |
| Best For | Learning | Freelancers ⭐ | Agencies |
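The Total Cost row is simply the Paid APIs range plus the RunPod Cost range. A short check of that arithmetic, treating the open-ended "$120+" as 120 for the upper bound (an assumption):

```python
# (min, max) monthly cost in USD, copied from the table rows above.
TIERS = {
    "Low": {"paid_apis": (0, 0), "runpod": (5, 15)},
    "Mid": {"paid_apis": (0, 0), "runpod": (40, 70)},
    "High": {"paid_apis": (0, 120), "runpod": (150, 300)},
}


def total_range(tier: str) -> tuple:
    """Add the paid-API and RunPod ranges to get the total monthly range."""
    apis = TIERS[tier]["paid_apis"]
    pods = TIERS[tier]["runpod"]
    return (apis[0] + pods[0], apis[1] + pods[1])


for name in TIERS:
    low, high = total_range(name)
    print(f"{name} Tier: ${low}-{high}/mo")
```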
🟡 Low Tier — What You Get
- Basic working digital clone
- 720p output maximum
- Visible lip sync artifacts
- No background replacement
- Good for learning ComfyUI
- NOT suitable for selling
🔵 Mid Tier — What You Get ⭐
- Professional 4K output
- Near-perfect lip sync
- Natural face animation
- AI background replacement
- Studio-clean audio
- Fully sellable to clients
🟢 High Tier — What You Get
- Everything in Mid Tier PLUS
- Moving B-roll video scenes
- Cinematic backgrounds
- Ultra-realistic voice (ElevenLabs*)
- Fully automated pipeline
- Agency-level volume capacity
💡 Recommended Path
- Week 1–2: Start at Low Tier
- Learn ComfyUI node basics
- Week 3–4: Move to Mid Tier
- Start selling at Mid Tier
- Month 3+: Add High Tier extras
- Only upgrade when revenue justifies