BachVid: Training-Free Video Generation with Consistent Background and Character

1: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
2: Vertex Lab
3: Australian National University
*: Work done during internship at Vertex Lab

BachVid generates a batch of videos with consistent background and character using a training-free method.

Abstract

Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains challenging. Existing methods typically rely on reference images or extensive training, and often address only character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, which reveals its ability to extract foreground masks and identify matching points during the denoising process. Building on this finding, our method first generates an identity video and caches its intermediate variables, then injects these cached variables at the corresponding positions of subsequently generated videos, ensuring both foreground and background consistency across multiple videos. Experimental results demonstrate that BachVid achieves robust consistency in the generated videos, offering a novel and efficient solution for consistent video generation without relying on reference images or additional training.

Pipeline


An identity video is first generated to cache key intermediate variables. For each newly generated video, these cached key-values are injected at the matched points, using only the vital layers, to ensure both foreground and background consistency.
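To make this two-pass procedure concrete, here is a minimal Python sketch. The dit and scheduler objects and their denoise_step/decode methods are hypothetical placeholders for a CogVideoX-style backbone rather than the released API; the caching and injection logic simply mirrors the description above.

import torch

def generate_consistent_batch(dit, scheduler, identity_prompt, prompts, vital_layers):
    # Pass 1: denoise the identity video and cache keys/values at the vital layers.
    kv_cache = {}  # (timestep, layer index) -> cached (keys, values)
    latent = torch.randn(dit.latent_shape)
    for t in scheduler.timesteps:
        latent, layer_kv = dit.denoise_step(latent, identity_prompt, t, return_kv=vital_layers)
        for layer, kv in layer_kv.items():
            kv_cache[(t, layer)] = kv

    # Pass 2: generate each video in the batch, injecting the cached keys/values
    # at the matched points of the vital layers only.
    videos = []
    for prompt in prompts:
        latent = torch.randn(dit.latent_shape)
        for t in scheduler.timesteps:
            latent, _ = dit.denoise_step(
                latent, prompt, t,
                inject_kv={layer: kv_cache[(t, layer)] for layer in vital_layers},
            )
        videos.append(dit.decode(latent))
    return videos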

Method

Foreground Mask Extraction

Each prompt contains both foreground and background descriptions. Our method first extracts the foreground mask by comparing the cross-attention weights between the video-to-foreground-text and video-to-background-text pairs from specific layers and timesteps. The figure below presents the IoU between the ground-truth mask and the masks extracted across different layers and timesteps. This procedure enables the model to effectively separate the foreground (character) from the background during the denoising process.

In the figure, darker points indicate higher IoU values. CogVideoX consists of 42 transformer blocks, and we set the number of timesteps to 50.

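As a rough illustration of this comparison, the snippet below thresholds a single cross-attention map against the foreground and background text tokens. The inputs attn, fg_idx, and bg_idx are assumed to be available from the backbone and the prompt tokenization, and aggregation over the selected layers and timesteps is left implicit.

import torch

def extract_foreground_mask(attn, fg_idx, bg_idx):
    # attn: cross-attention map of shape (num_video_tokens, num_text_tokens),
    # taken from one of the selected layers/timesteps.
    # fg_idx / bg_idx: indices of the text tokens describing the foreground
    # (character) and the background, respectively.
    fg_score = attn[:, fg_idx].mean(dim=-1)  # average attention to foreground words
    bg_score = attn[:, bg_idx].mean(dim=-1)  # average attention to background words
    return fg_score > bg_score               # True where a video token attends more to the foreground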

Matching Point Identification

After obtaining the foreground masks, our method identifies matching points between the two video generation processes by measuring the similarity of attention outputs from selected layers and timesteps. The figure below shows the MSE between the ground-truth matching points and those identified at different layers and timesteps. This step establishes correspondences across video frames, providing anchor points that ensure consistent character positioning across multiple generated videos.

In the figure, darker points indicate higher MSE values. CogVideoX contains 42 transformer blocks, and the number of timesteps is set to 50.

Matching point visualization

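The matching itself can be sketched as a nearest-neighbour search over the attention outputs of the two generation processes. Cosine similarity is used here as one plausible similarity measure; the exact metric is an assumption of this sketch.

import torch
import torch.nn.functional as F

def match_points(feat_identity, feat_new):
    # feat_identity, feat_new: attention outputs of shape (num_video_tokens, channels)
    # from the same selected layer/timestep of the identity and the new generation process.
    sim = F.normalize(feat_new, dim=-1) @ F.normalize(feat_identity, dim=-1).T
    return sim.argmax(dim=-1)  # for each new-video token, the index of its matching identity-video token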

Vital Layers Determination

With the masks and correspondences obtained, we determine the layers at which to apply key-value injection by evaluating the aesthetic quality of the generated videos when each layer is selectively skipped. The line chart below shows the aesthetic (AES) score across layers. Restricting injection to these vital layers significantly reduces memory consumption while maintaining consistency between the two video generation processes, since storing all key-value pairs across every timestep and layer is infeasible for DiT-based video models due to their large depth and latent dimensionality.

CogVideoX consists of 42 transformer blocks.

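One way to operationalize this selection is sketched below. generate_video and aes_score are hypothetical helpers (a generator that can bypass one transformer block, and an aesthetic-score predictor), the number of layers kept (top_k) is illustrative, and treating the layers whose removal hurts the score most as vital is an assumption consistent with the description above.

def find_vital_layers(generate_video, aes_score, prompt, num_layers=42, top_k=8):
    # generate_video(prompt, skip_layer=i) regenerates the video with transformer block i bypassed;
    # aes_score(video) returns its aesthetic (AES) score. Both are assumed helpers.
    baseline = aes_score(generate_video(prompt, skip_layer=None))
    drops = []
    for i in range(num_layers):
        drop = baseline - aes_score(generate_video(prompt, skip_layer=i))
        drops.append((drop, i))
    drops.sort(reverse=True)  # layers with the largest quality drop first
    return [layer for _, layer in drops[:top_k]]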

Results

BibTeX


      @misc{yan2025bachvidtrainingfreevideogeneration,
        title={BachVid: Training-Free Video Generation with Consistent Background and Character}, 
        author={Han Yan and Xibin Song and Yifu Wang and Hongdong Li and Pan Ji and Chao Ma},
        year={2025},
        eprint={2510.21696},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2510.21696}, 
      }