LTX 2.3 Image and Text Video 10S Similarity Preservation Workflow

Watch the full video first if you want to understand how this LTX 2.3 image-and-text video workflow works in practice. The video shows how one reference image can be combined with text control, how the 10-second similarity system keeps the subject stable, and how to run the full workflow online without rebuilding a complex local ComfyUI environment.

This ComfyUI workflow is designed for LTX 2.3 image-reference video generation with text-controlled motion and 10-second likeness preservation. Its main purpose is to let creators start from one image, describe the desired action or camera movement with text, and generate a controlled video while keeping the original subject, composition, and visual identity more stable across the clip.

The workflow is built around the LTX 2.3 distilled 1.1 generation route. It uses the LTX 2.3 video checkpoint, Gemma3 fp8 text encoder, LTX Audio VAE, LTXVConditioning, LTXVImgToVideoConditionOnly, LTXVPreprocess, Image_Resize_longsize, LTX2_NAG, ManualSigmas, CFGGuider, SamplerCustomAdvanced, LTXVLatentUpsampler, LTXVConcatAVLatent, LTXVSeparateAVLatent, tiled decoding, and final video output. This makes the workflow more structured than a basic one-pass image-to-video graph.

The image side provides the visual anchor. The reference image is resized, prepared through LTXVPreprocess, and injected into the generation process through LTXVImgToVideoConditionOnly. This helps the model preserve the character, object, scene, lighting, clothing, and composition from the original image. The text prompt then controls the motion direction, expression, camera movement, atmosphere, and cinematic behavior.
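The resize step can be pictured as scaling the image so its longer edge hits a target size while the aspect ratio stays the same. The sketch below is only an illustration of that idea, not the Image_Resize_longsize node's actual code: the function name, the 1024 default, and the snapping to a multiple of 32 are assumptions made for the example.

```python
def resize_to_long_side(width: int, height: int,
                        long_side: int = 1024, multiple: int = 32) -> tuple[int, int]:
    """Scale (width, height) so the longer edge is about `long_side`.

    Hypothetical helper: the default target and the multiple-of-32 snapping
    are assumptions for illustration, not values read from the workflow.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    # One scale factor for both edges keeps the aspect ratio.
    scale = long_side / max(width, height)

    def snap(value: float) -> int:
        # Round to the nearest multiple, but never below one multiple.
        return max(multiple, int(round(value / multiple)) * multiple)

    return snap(width * scale), snap(height * scale)


if __name__ == "__main__":
    print(resize_to_long_side(1920, 1080))       # landscape reference image
    print(resize_to_long_side(1000, 1500, 768))  # portrait reference image
```

Keeping a single scale factor for both edges is what protects the composition of the reference image; only the final snapping step can change the ratio, and only slightly.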

The key update is the 10-second similarity preservation system. The workflow uses similarity and anchor-style guidance during the later stages, especially around the latent upscaling and HD refinement process. This helps reduce common image-to-video issues such as face drift, hairstyle changes, clothing inconsistency, subject deformation, background collapse, and unwanted identity changes. For creators making character videos, this is one of the most important improvements.

The generation process is divided into three stages. The first stage builds the initial composition and motion base. The second stage performs latent-space upscaling while keeping stronger similarity control and weak anchor stability. The third stage applies final high-definition refinement with lighter similarity control, improving sharpness and detail while trying not to damage the established character identity.
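As a rough mental model, the three stages can be written down as a small table of settings. The sketch below records only the qualitative levels given in the description; the class and field names are hypothetical, and no numeric strengths are shown because the description does not state any.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pass of the staged pipeline (hypothetical structure for illustration)."""
    name: str
    similarity: str  # qualitative level taken from the description
    anchor: str      # qualitative level taken from the description


# Levels marked "not stated" are not given in the workflow description.
STAGES = [
    Stage("initial composition and motion base", "not stated", "not stated"),
    Stage("latent-space upscaling", "stronger", "weak"),
    Stage("final HD refinement", "lighter", "not stated"),
]


def describe(stages):
    """Return one readable summary line per stage, in execution order."""
    return [
        f"{index}. {stage.name}: similarity={stage.similarity}, anchor={stage.anchor}"
        for index, stage in enumerate(stages, start=1)
    ]


if __name__ == "__main__":
    for line in describe(STAGES):
        print(line)
```

The ordering is the point: similarity control is strongest while the latent is being upscaled and is relaxed for the final pass, so detail is added after the identity has already been fixed.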

The workflow also includes LTX2_NAG and a universal negative prompt system. This helps suppress flicker, frame jitter, subtitles, watermarks, UI overlays, bad hands, broken mouth shapes, unstable motion, unwanted text, distorted audio artifacts, and sudden scene changes. Compared with ordinary image-to-video workflows, this version is better suited for publishable creator content because it combines reference image control, text-guided direction, similarity locking, staged sampling, and high-resolution refinement.
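A universal negative prompt of this kind is usually just a comma-separated list of unwanted artifacts. The sketch below assembles one from the terms named above; the helper name and the exact wording of each term are assumptions, since the actual negative prompt text used in the workflow is not shown here.

```python
# Artifact names taken from the description; the real prompt wording may differ.
NEGATIVE_TERMS = [
    "flicker", "frame jitter", "subtitles", "watermarks", "UI overlays",
    "bad hands", "broken mouth shapes", "unstable motion", "unwanted text",
    "distorted audio artifacts", "sudden scene changes",
]


def build_negative_prompt(terms, extra=()):
    """Join terms into one comma-separated prompt, dropping blanks and duplicates."""
    seen = set()
    ordered = []
    for term in [*terms, *extra]:
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:  # case-insensitive de-duplication
            seen.add(key)
            ordered.append(cleaned)
    return ", ".join(ordered)


if __name__ == "__main__":
    # `extra` lets a creator append project-specific terms without editing the base list.
    print(build_negative_prompt(NEGATIVE_TERMS, extra=["logo"]))
```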

This workflow is suitable for character animation, portrait-to-video, product motion shots, cinematic still animation, AI short clips, MV fragments, social media video, Bilibili demonstrations, YouTube showcases, RunningHub releases, and Civitai workflow publishing.

Main features:

LTX 2.3 image-and-text video workflow
One reference image + text motion control
10-second similarity preservation
LTX 2.3 distilled 1.1 checkpoint route
Gemma3 fp8 text encoder
LTX Audio VAE support
Image_Resize_longsize image preparation
LTXVPreprocess reference preprocessing
LTXVImgToVideoConditionOnly image guidance
LTX2_NAG universal negative guidance
Three-stage rendering structure
LTXVLatentUpsampler high-resolution transition
AV latent concatenation and separation
Final HD video output

Suggested workflow:

Prepare one clean reference image first. The subject should be clear, well-framed, and not blocked by complex foreground objects. Load the image into the workflow, then write a text prompt describing the motion, camera behavior, lighting, expression, atmosphere, and video style. Run only the first stage at the start to check whether the image identity and motion direction are correct. If the character changes too much, keep the 10S similarity settings active and simplify the prompt. If the video is too static, make the motion instruction more explicit. Once the base motion is stable, continue through latent upscaling and final HD refinement.
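The check after the first stage amounts to a small decision rule. The sketch below restates it as a function; the function and parameter names are hypothetical, and the two flags stand for what you judge by eye in the first-stage preview, not for anything the workflow measures automatically.

```python
def next_adjustment(identity_drift: bool, too_static: bool) -> list[str]:
    """Suggest what to change after reviewing the first-stage preview.

    Hypothetical helper: both flags are the creator's own visual judgement.
    """
    steps = []
    if identity_drift:
        steps.append("keep the 10S similarity settings active and simplify the prompt")
    if too_static:
        steps.append("make the motion instruction more explicit")
    if not steps:
        # Nothing to fix: the base motion is stable, so move on.
        steps.append("continue through latent upscaling and final HD refinement")
    return steps


if __name__ == "__main__":
    for step in next_adjustment(identity_drift=True, too_static=False):
        print(step)
```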

⚙️ RunningHub Workflow

Try the workflow online right now — no installation required.
👉 Workflow:
If the results meet your expectations, you can later deploy it locally for customization.

🎁 Fan Benefits: Register to get 1000 points, plus 100 points for each daily login, and enjoy 4090 performance with 48 GB of memory!

📺 Bilibili Updates (Mainland China & Asia-Pacific)

If you’re in mainland China or the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
📺 Bilibili Video:

Support Me on Ko-fi

If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
👉 Ko-fi:

Business Contact

For collaboration or inquiries, please contact aiksk95 on WeChat.

Quark Netdisk model resources (continuously updated):
👉
These resources are mainly intended for local users, to make creating and learning easier.
