
Prompt: A solitary man walks alongside his three camels across the expansive, sun-soaked desert, their pace steady under the relentless sun. The camels' hooves leave a series of rhythmic imprints in the golden sands as their elongated shadows stretch far into the distance, weaving patterns that dance toward the distant horizon.

The recently released Wan 2.1 is a groundbreaking open-source AI video model. It outperforms other open-source models like Hunyuan and LTX, as well as many commercial alternatives, and delivers truly impressive text2video and image2video generations with little effort. It trades some generation speed against its competitors, but the quality of its video output is exceptional.

So, ready to see your ideas come to life with Wan?

Verified to work on ThinkDiffusion Build: Mar 11, 2025

ComfyUI v0.3.26 with WAN FP16 model support.
Why do we specify the build date? Updates to ComfyUI or to this workflow's custom nodes released after this date may change the workflow's behavior or outputs.


What Makes Wan Unique


Prompt: In an old, cherished photograph, a sister beams brightly as she carries her younger brother, who wears a cozy hoodie, both capturing the warmth and joy of their sibling bond. Their smiles illuminate the image, exuding a timeless happiness that echoes fond memories of their shared moments.

💡
With a powerful new video VAE, Wan 2.1 delivers better quality than previously seen in open-source models. It generates videos with complex body movements, smooth transitions, and accurate real-world physics.

Wan Image to Video Architecture

Source: Wan

The Image-to-Video (I2V) task turns a picture into a video guided by a text prompt. It starts from an input image, and a frame mask decides which frames are kept and which are generated, while a 3D Variational Autoencoder (VAE) compresses the frames into a latent representation. This data goes into the DiT model, which has an extra input layer because it receives more channels. A CLIP image encoder extracts features from the image to provide context, and these are injected through cross-attention. After generation, the VAE decodes the latents back into frames we can see.
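To make that data flow concrete, here is a minimal PyTorch sketch of the conditioning path. All sizes, channel counts, and module choices below are illustrative assumptions, not the actual Wan 2.1 implementation.

```python
import torch
import torch.nn as nn

B, C_lat, T, H, W = 1, 16, 5, 16, 16      # toy latent-video sizes (assumptions)

# Frame mask: 1 = keep this frame (the input image), 0 = generate it
mask = torch.zeros(B, 1, T, H, W)
mask[:, :, 0] = 1.0

# Conditioning latent from a 3D VAE: the input image fills frame 0,
# the frames to be generated start out empty
cond_latent = torch.zeros(B, C_lat, T, H, W)
cond_latent[:, :, 0] = torch.randn(B, C_lat, H, W)   # stand-in for vae.encode(image)

noisy_latent = torch.randn(B, C_lat, T, H, W)         # current diffusion state

# The I2V DiT takes extra input channels: noisy latent + mask + conditioning latent
dit_input = torch.cat([noisy_latent, mask, cond_latent], dim=1)   # (B, 33, T, H, W)
patch_embed = nn.Conv3d(dit_input.shape[1], 256, kernel_size=(1, 2, 2), stride=(1, 2, 2))
tokens = patch_embed(dit_input).flatten(2).transpose(1, 2)        # (B, N, 256)

# CLIP image features provide context through cross-attention
clip_feats = torch.randn(B, 257, 1280)                # stand-in for CLIP vision tokens
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8,
                                   kdim=1280, vdim=1280, batch_first=True)
out, _ = cross_attn(tokens, clip_feats, clip_feats)
print(out.shape)                                      # (1, 320, 256): on to the DiT blocks
```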


Diffusion Process

Source: Wan
💡
Wan 2.1 uses Diffusion Transformers and Flow Matching. It takes multi-language text, encodes it with the umT5 text encoder, and injects it through cross-attention in each block. A shared MLP with a Linear and SiLU layer predicts six time-modulation parameters from the time embedding, with each block learning its own biases on top. Tests show this setup works better without making the model bigger, and the same design is used for both the small and large models.
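Here is a small PyTorch sketch of that shared time-modulation MLP with per-block biases. The dimensions and names are assumptions for illustration, not the Wan source code.

```python
import torch
import torch.nn as nn

class TimeModulation(nn.Module):
    """One shared MLP predicts six modulation parameters from the time embedding;
    every transformer block adds its own learnable bias on top."""

    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 6 * dim)
        )
        # Per-block learnable biases: one set of six vectors per block
        self.block_bias = nn.Parameter(torch.zeros(num_blocks, 6, dim))

    def forward(self, t_emb: torch.Tensor, block_idx: int) -> torch.Tensor:
        # Returns (batch, 6, dim): shift/scale/gate values for the block's sub-layers
        shared = self.shared_mlp(t_emb).view(-1, 6, t_emb.shape[-1])
        return shared + self.block_bias[block_idx]

mod = TimeModulation(dim=256, num_blocks=30)
t_emb = torch.randn(2, 256)               # stand-in time embedding
print(mod(t_emb, block_idx=0).shape)      # torch.Size([2, 6, 256])
```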


Prompt: Under the tranquil canopy of a blue starlit night, an anime scene unfolds where the silhouettes of a sweet couple stand closely together, framed by the gentle rustle of windy grass. The sparkling stars above cast a magical glow, enveloping the pair in a serene moment of connection and wonder.


How to run Wan 2.1 in ComfyUI for Image-to-Video Generation

Installation guide

💡
Download the workflow and drag & drop it into your ComfyUI window, whether locally or on ThinkDiffusion. If you're using ThinkDiffusion, you'll need at least the Turbo 24gb machine, though we recommend the Ultra 48gb machine.

Custom Nodes

If there are red nodes in the workflow, it means the workflow is missing some required custom nodes. Install them so the workflow can run:

  1. Go to ComfyUI Manager  > Click Install Missing Custom Nodes
  2. Check the list for any custom nodes that still need to be installed and click Install.

Models

For this guide, you'll need to download the following 4 models:

1. wan2.1_i2v_720p_14B_fp16.safetensors
2. umt5_xxl_fp16.safetensors
3. wan_2.1_vae.safetensors
4. clip_vision_h.safetensors

Models can be downloaded from Comfy.Org or the model manager.
  1. Go to ComfyUI Manager  > Click Model Manager
  2. Search for the models listed above; when you find the exact model you're looking for, click Install, and make sure to press Refresh when you're finished.

Model Path Source

Some of these models may not be available in the model manager. In that case, use the model path source instead: copy the model's link address and paste it into ThinkDiffusion MyFiles using Upload URL.

Model Name → ThinkDiffusion Upload Directory
wan2.1_i2v_720p_14B_fp16.safetensors → .../comfyui/models/diffusion_models/
umt5_xxl_fp16.safetensors → .../comfyui/models/text_encoders/
wan_2.1_vae.safetensors → .../comfyui/models/vae/
clip_vision_h.safetensors → .../comfyui/models/clip_vision/
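If you run ComfyUI locally and would rather script the downloads, here is a rough Python sketch using huggingface_hub. The repo id and the in-repo file paths are assumptions; check the Comfy-Org repackaged repo (linked in step 3 of the guide below) for the exact paths before running.

```python
import shutil
from pathlib import Path
from huggingface_hub import hf_hub_download

REPO_ID = "Comfy-Org/Wan_2.1_ComfyUI_repackaged"   # assumed repo id -- verify on Hugging Face
COMFYUI = Path("/path/to/comfyui")                 # your ComfyUI root

# In-repo paths (assumed) mapped to the ComfyUI model subfolders from the table above
FILES = {
    "split_files/diffusion_models/wan2.1_i2v_720p_14B_fp16.safetensors": "models/diffusion_models",
    "split_files/text_encoders/umt5_xxl_fp16.safetensors": "models/text_encoders",
    "split_files/vae/wan_2.1_vae.safetensors": "models/vae",
    "split_files/clip_vision/clip_vision_h.safetensors": "models/clip_vision",
}

for repo_path, subdir in FILES.items():
    cached = hf_hub_download(repo_id=REPO_ID, filename=repo_path)  # downloads into the HF cache
    dest = COMFYUI / subdir / Path(repo_path).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, dest)                                      # copy into the ComfyUI folder
    print(f"Placed {dest}")
```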

Step-by-step Workflow Guide

This workflow is easy to set up and runs well with the default settings. Here are a few steps where you might want to pay extra attention.

1. Load Image

Load an image
2. Set Size

When resizing the image, make sure that the resolution does not exceed 1280 x 720, as anything larger will increase generation time drastically. Reduce it to speed up generation.

Enable keep proportion to preserve the aspect ratio of your input image.
3. Set Models

Set the models as shown in the image. They should be the same files from the official source repo https://huggingface.co/Comfy-Org/Wan2.1ComfyUIrepackaged
4. Write a Prompt

You can write any prompt you want; Wan 2.1 has no prompt limitations.
5. Check the Generation Settings

We recommend the uni_pc sampler and the simple scheduler.

The length is 16 frames per second of video plus 1, so a 5-second clip is 81 frames (see the frame-count sketch after step 6).

6. Check the Generated Video

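As a quick check for the length setting in step 5, here is a tiny Python helper for converting a clip duration into Wan's frame count (the function name is just for illustration):

```python
# Wan's video length is 16 frames per second of output, plus one extra frame.
def wan_frame_count(seconds: float, fps: int = 16) -> int:
    return int(seconds * fps) + 1

print(wan_frame_count(3))  # 49 frames
print(wan_frame_count(5))  # 81 frames
```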

Examples

The examples below display a range of tasks such as portrait video, complex motion, animation, physical simulation, cinematic quality, and camera control.

💡
I tested and generated the examples below using the following settings:
- 20 steps
- 6 CFG
- uni_pc sampler
- simple scheduler

Portrait


Prompt: A professional man sits confidently on a vibrant yellow bench, his posture relaxed yet assertive as he faces the camera. His expression is animated and focused, with a lively conversation unfolding. Gesturing occasionally, his eyes convey enthusiasm and his smile hints at the depth of engagement in the dialogue, all set against the backdrop of a bustling urban park.

Complex Motions


Prompt: Five women, each elegantly dressed in white, command the stage as they dance in perfect harmony, their movements uniform and synchronous, creating a mesmerizing spectacle for the audience.

Animation


Prompt: A pretty anime girl sits atop a city building, silhouetted against the sprawling urban landscape. Her headphones rest comfortably over her ears as she sings softly, her melodic voice mingling with the distant hum of city life, while a gentle breeze ruffles her hair under the twinkling city lights.

Physical Simulation


Prompt: A cat stands in the bustling kitchen, adorably mimicking a real-life chef. With a tiny apron tied snugly around its waist and a little chef's hat perched atop its head, it playfully swats at the flour with its paw, causing a small cloud to rise. Its tail flicks with concentration as it carefully paws at the rolling pin, determined to bake a culinary masterpiece amidst the delicious aromas wafting through the air.

Cinematic Quality


Prompt: Ghost Rider races through the night's shadows on his blazing motorcycle, roaring down the city streets with supernatural speed. The tires, engulfed in flames, leave a trail of fiery light as they sear the pavement, creating a mesmerizing spectacle against the backdrop of twinkling city lights and towering skyscrapers.

Camera Control


Prompt: A drone glides effortlessly through the sky, capturing an aerial view that highlights the vivid structures of houses below. As it flies forward, the rooftops form an intricate tapestry, showcasing a colorful array of architecture nestled within the landscape, painting a vibrant portrait of the community from above.

💡
This image-to-video workflow is a great starting point to get running quickly with Wan. More examples of Wan 2.1 will be available soon, featuring capabilities such as flow editing, inpainting, outpainting, image referencing, video-to-audio conversion, visual text, and more.

If you’re having issues with installation or slow hardware, you can try any of these workflows on a more powerful GPU in your browser with ThinkDiffusion.

If you enjoy ComfyUI and you want to test out creating awesome images, then feel free to check out this OmniGen tutorial here. And have fun out there with your videos!

Here are some more image to video tutorials you might like:

AI Video Speed: How LTX is Reshaping Video2Video as We Know It
LTX Video to Video can deliver really powerful results at amazing speed, and that's the power of LTX: speed. With this video2video workflow we'll be transforming input videos into their AI counterparts with amazing efficiency.
Enhanced Video Generation with Hunyuan and LoRA
We can now use LoRAs together with the AI video model Hunyuan. Why? To keep character or object consistency in a video.
Meet LTX: Video Generation Speed Unleashed in ComfyUI
LTX speeds up the video-making process so you can focus on what really matters — telling your story and connecting with your audience. In this guide, we’ll focus on Image2Video generation specifically, and we’ll explore all the features that make LTX a game-changer.