LongCat Avatar Multi-Image Shot-Switching Digital Human Workflow

Watch the full video first if you want to understand how this LongCat Avatar workflow works in practice. The video shows how multiple reference images can be organized into a talking-avatar pipeline, how shot switching and loop extension are handled, and how to launch the workflow online without rebuilding the full ComfyUI environment locally.

This ComfyUI workflow is designed for LongCat Avatar multi-image shot-switching digital human generation. Its main purpose is to turn several reference images and one driving audio track into a longer talking-avatar video with controllable visual changes. Instead of using only one image for a single static talking head, this workflow builds a reference image pool and prompt pool so the avatar can switch between different prepared visual states while keeping the talking rhythm and audio-driven mouth movement.

The workflow is built around LongCat-Avatar-15_bf16.safetensors as the main avatar model, LongCat-Avatar DMD LoRA as the distilled acceleration layer, WanVideoWrapper generation nodes, WanVideo VAE, Whisper large v3 encoder, LongCat Avatar embed extension, and a segmented sampling / looping structure. The audio is loaded first and passed through Whisper, which extracts speech features for mouth movement and speaking behavior. This makes the workflow suitable for audio-driven digital human videos rather than ordinary silent image-to-video animation.

The image side is organized as a multi-reference system. The workflow includes up to ten image input groups. Each image is resized to a unified 1280×720 canvas, then encoded into a LongCat-compatible latent. These images can represent different characters, outfits, backgrounds, camera angles, or visual states. The workflow also includes an image index switch, allowing the user to select which reference image enters the generation path.
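The resize step above can be pictured with a small geometry sketch. It assumes a "scale to cover, then center-crop" strategy, which the workflow description does not confirm; the actual resize node may pad or stretch instead. The function name `cover_crop_box` is hypothetical and only illustrates how differently shaped reference images could end up on the same 1280×720 canvas.

```python
from typing import Tuple

TARGET_W, TARGET_H = 1280, 720  # canvas size stated in the workflow description


def cover_crop_box(
    src_w: int, src_h: int, dst_w: int = TARGET_W, dst_h: int = TARGET_H
) -> Tuple[Tuple[int, int], Tuple[int, int, int, int]]:
    """Return ((scaled_w, scaled_h), (left, top, right, bottom)).

    Assumption: the image is scaled until it covers the canvas and is then
    center-cropped. This is an illustration, not the node's confirmed behavior.
    """
    if src_w <= 0 or src_h <= 0:
        raise ValueError("image dimensions must be positive")

    # Integer cross-multiplication avoids floating-point rounding surprises.
    if src_w * dst_h >= src_h * dst_w:
        # Source is relatively wider than the canvas: match heights.
        scaled_h = dst_h
        scaled_w = -(-src_w * dst_h // src_h)  # ceiling division
    else:
        # Source is relatively taller than the canvas: match widths.
        scaled_w = dst_w
        scaled_h = -(-src_h * dst_w // src_w)  # ceiling division

    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return (scaled_w, scaled_h), (left, top, left + dst_w, top + dst_h)
```

With this strategy a square portrait loses its top and bottom bands, so reference images with the face near the vertical center are the safest choice.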

The prompt side is also modular. The graph contains GPT-5-based reverse prompt generation nodes and a prompt pool. Each image can have a corresponding LongCat talking-avatar prompt describing identity, appearance, scene relationship, camera framing, speaking behavior, lip-sync, subtle head movement, natural facial expression, hand gestures, and continuity locks. This makes the workflow more practical than manually writing every avatar prompt from scratch.
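The pairing of an image pool with a prompt pool can be sketched as a simple indexed structure. This is a plain-Python illustration and not the graph's node logic: `ReferenceShot`, `select_shot`, the file names, and the zero-based index are all assumptions made for the example, and the real index switch may count from 1.

```python
from dataclasses import dataclass
from typing import List

MAX_SHOTS = 10  # the workflow describes up to ten image input groups


@dataclass(frozen=True)
class ReferenceShot:
    """One prepared visual state: a reference image plus its avatar prompt."""
    image_path: str  # hypothetical field name
    prompt: str


def select_shot(pool: List[ReferenceShot], index: int) -> ReferenceShot:
    """Return the shot at a zero-based index (assumed indexing convention)."""
    if not pool:
        raise ValueError("the reference pool is empty")
    if len(pool) > MAX_SHOTS:
        raise ValueError(f"the pool holds more than {MAX_SHOTS} shots")
    if not 0 <= index < len(pool):
        raise IndexError(f"index {index} is outside the pool of {len(pool)}")
    return pool[index]


# Example pool with placeholder file names and prompts.
pool = [
    ReferenceShot("host_front.png", "A presenter facing the camera, speaking naturally."),
    ReferenceShot("host_side.png", "The same presenter in a three-quarter view, speaking."),
]
current = select_shot(pool, 1)
```

Keeping the image and its prompt in one record makes it harder for the image index and prompt index to drift apart when shots are added or reordered.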

The generation structure uses a first-stage render plus loop extension. The first stage generates the opening segment. The loop stage takes the tail frames, converts them back into latent space, and continues the avatar video while preserving continuity. The workflow uses 93-frame segments, 13-frame overlap, automatic audio-duration calculation, and automatic loop-count logic to cover the full audio length. This helps the output stay aligned with the driving audio instead of requiring manual duration calculation.
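The loop-count arithmetic described above can be sketched as follows. The 93-frame segment and 13-frame overlap come from the workflow description; the frame rate is not stated there, so `fps` is a required argument and the value 16 in the example is only a placeholder. The function names are hypothetical, and the graph's own math nodes may round differently.

```python
import math
from typing import List, Tuple

SEGMENT_FRAMES = 93  # frames per generated segment (from the description)
OVERLAP_FRAMES = 13  # tail frames reused by the next segment (from the description)


def loop_count(audio_seconds: float, fps: float,
               segment: int = SEGMENT_FRAMES, overlap: int = OVERLAP_FRAMES) -> int:
    """Number of loop-extension passes needed after the first-stage render."""
    if audio_seconds <= 0 or fps <= 0:
        raise ValueError("audio duration and fps must be positive")
    total_frames = math.ceil(audio_seconds * fps)
    stride = segment - overlap  # new frames contributed by each loop pass
    remaining = total_frames - segment  # frames not covered by the first stage
    if remaining <= 0:
        return 0
    return -(-remaining // stride)  # ceiling division


def segment_ranges(audio_seconds: float, fps: float) -> List[Tuple[int, int]]:
    """Half-open [start, end) frame ranges for the first stage and every loop."""
    stride = SEGMENT_FRAMES - OVERLAP_FRAMES
    loops = loop_count(audio_seconds, fps)
    return [(i * stride, i * stride + SEGMENT_FRAMES) for i in range(loops + 1)]


# Example: 60 seconds of audio at an assumed 16 fps.
loops = loop_count(60.0, 16)
ranges = segment_ranges(60.0, 16)
```

Each loop pass adds 93 − 13 = 80 new frames, so the last segment usually overshoots the audio slightly and the surplus frames are trimmed when the video is combined with the audio.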

Compared with ordinary single-image digital human workflows, this graph is better suited to creator production work. It handles multiple avatar references, longer audio, loop-based continuation, shot switching, and prompt-index control in one system. It is useful for AI presenters, virtual anchors, anime hosts, character narration, news-style avatars, educational videos, product explanations, Bilibili demonstrations, YouTube content, RunningHub showcases, and Civitai workflow publishing.

Main features:

LongCat Avatar multi-image digital human workflow
One audio track drives mouth movement and speaking rhythm
Whisper large v3 speech feature extraction
LongCat-Avatar-15 main model route
LongCat Avatar DMD LoRA support
Up to ten reference image input groups
Unified 1280×720 image resizing and latent encoding
Image index switching for multi-shot avatar control
GPT-5 reverse prompt pool for avatar descriptions
Prompt index switching for different visual states
93-frame segment generation with 13-frame overlap
Automatic loop count based on audio duration
First-stage generation plus loop continuation
Final video output through VHS VideoCombine

Suggested workflow:

Prepare a clean driving audio file first. The speech should be clear, stable, and not buried under loud background music. Then prepare several reference images for the avatar. Keep each image visually readable, with a clear face, stable lighting, and an unobstructed mouth area. Load the images into the reference image pool, then check the image index and prompt index settings. Start with a single image and a single prompt to confirm that the mouth movement, identity, and camera framing are stable. Once that test works, add more image references and switch between them through the image pool. Use automatic loop mode when you want the workflow to cover the full audio length. If continuity breaks, reduce the variation between reference images, tone down aggressive motion language in the prompts, and keep the overlap settings unchanged.

⚙️ RunningHub Workflow

Try the workflow online right now, no installation required.
👉 Workflow:
If the results meet your expectations, you can later deploy it locally for customization.

🎁 Fan Benefits: Register to get 1000 points + daily login 100 points — enjoy 4090 performance and 48 GB super power!

📺 Bilibili Updates (Mainland China & Asia-Pacific)

If you’re in Mainland China or the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
📺 Bilibili Video:

Support Me on Ko-fi

If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
👉 Ko-fi:

Business Contact

For collaboration or inquiries, please contact aiksk95 on WeChat.

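Model resources are updated regularly on Quark Netdisk (夸克网盘):
👉
These resources are mainly intended for local users, to support creation and learning.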
