CT-1 : Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-controllable Video Generation

Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin,
Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang

Fudan University; Tencent.


* This project page contains a large number of videos; please allow time for them to load.

* Note: due to the high level of realism and strong camera dynamics in the generated videos, extended viewing may cause visual discomfort for some viewers.

Abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive, manually specified camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose CT-1 (Camera Transformer 1), a novel Vision-Language-Camera model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. The predicted trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intent. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful, high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
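
To make the wavelet-domain objective concrete, the following is a minimal sketch of what a wavelet-based regularization loss on camera trajectories could look like. This is an illustrative reconstruction, not the exact training objective: the single-level Haar decomposition, the (batch, frames, pose_dims) trajectory layout, and the high-frequency weight lambda_hf are all assumptions.

import torch
import torch.nn.functional as F

def haar_dwt_1d(x):
    # Single-level Haar wavelet transform along the time axis.
    # x: (batch, frames, pose_dims), with an even number of frames.
    even, odd = x[:, 0::2], x[:, 1::2]
    low = (even + odd) / 2 ** 0.5   # approximation (low-frequency) coefficients
    high = (even - odd) / 2 ** 0.5  # detail (high-frequency) coefficients
    return low, high

def wavelet_regularization_loss(pred, target, lambda_hf=0.5):
    # Match predicted and ground-truth trajectories in the wavelet domain:
    # the low-frequency term anchors the coarse camera path, while the
    # high-frequency term captures fine motion such as handheld shake.
    pred_low, pred_high = haar_dwt_1d(pred)
    tgt_low, tgt_high = haar_dwt_1d(target)
    return F.l1_loss(pred_low, tgt_low) + lambda_hf * F.l1_loss(pred_high, tgt_high)

# Example: 16-frame trajectories of 12-dim flattened [R|t] poses.
loss = wavelet_regularization_loss(torch.randn(4, 16, 12), torch.randn(4, 16, 12))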

Framework

Overview of the proposed camera-controllable video generation framework based on the CT-1 model. The framework consists of three main components: (a) a vision–language module for semantic embedding, (b) a Diffusion Transformer module for modeling camera trajectory distributions, and (c) controllable video generation models that synthesize videos conditioned on the predicted trajectories.

"VLC + Diffusion" Paradigm: Camera-Decision-First, Generation-Next

In the "camera-decision" stage, our framework semantically determines how the camera moves based on the spatial knowledge within the user's intention, and generate camera trajectories over time. We refer to this category of models as Vision-Language-Camera (VLC) model. A key strength of the VLC model is its capacity to jointly reason about visual content and semantic cues for predicting future spatiotemporal camera poses, thereby promoting semantic alignment between camera trajectories and visual content in the video "generation" stage.

Qualitative example (a).


Qualitative example (b).


Camera coordinate system and motion semantics.
· The red axis (x-axis) denotes the horizontal direction, where the arrow direction corresponds to camera movement to the right, and the opposite direction indicates movement to the left.
· The blue axis (z-axis) represents the depth direction, with the arrow direction indicating zoom-in or forward motion, and the opposite direction indicating zoom-out or backward motion.
· The green axis (y-axis) denotes the vertical direction, where the arrow direction corresponds to downward camera motion, and the opposite direction indicates upward motion.
* All trajectory visualizations on this webpage follow these coordinate definitions (summarized in code below).
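
For readers who prefer code to prose, the same convention can be written as a mapping from motion semantics to unit direction vectors in the camera frame; the snippet below is purely illustrative.

import numpy as np

# (x, y, z) unit directions matching the axis definitions above.
AXIS_SEMANTICS = {
    "right":    np.array([ 1.0,  0.0,  0.0]),  # +x
    "left":     np.array([-1.0,  0.0,  0.0]),  # -x
    "down":     np.array([ 0.0,  1.0,  0.0]),  # +y
    "up":       np.array([ 0.0, -1.0,  0.0]),  # -y
    "forward":  np.array([ 0.0,  0.0,  1.0]),  # +z (zoom-in)
    "backward": np.array([ 0.0,  0.0, -1.0]),  # -z (zoom-out)
}

# Example: a per-frame step for a camera moving forward while drifting right.
step = 0.6 * AXIS_SEMANTICS["forward"] + 0.2 * AXIS_SEMANTICS["right"]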

1. CT-1: Camera Trajectory Estimation via Vision-Language Inputs

We introduce CT-1, our method for estimating camera trajectories directly from vision-language inputs. CT-1 leverages rich contextual information from both visual cues and textual descriptions to predict precise and natural camera movements. The examples below demonstrate CT-1's capability to generate diverse and dynamic camera paths; each example is accompanied by the specific prompt that guided its trajectory estimation.
Camera Prompt 1:

The unsteady camera quickly pans from left to right.

Camera Prompt 2:

The camera rotates around the helicopter, maintaining a steady motion.

Camera Prompt 3:

The camera moves forward, following the subject from behind with an unsteady motion, marked by noticeable shaking.

Camera Prompt 4:

The handheld camera slowly moves backward toward the man while panning right to track the walking woman.

Camera Prompt 5:

The camera smoothly dollies forward while simultaneously panning to the right, then continues moving straight.

Camera Prompt 6:

The camera smoothly moves backward while tilting down and panning left, maintaining a steady and fluid motion throughout.

1.1 CT-1: Camera Trajectories Estimation under Complex Camera Descriptions

CT-1 also demonstrates strong camera control performance under long and complex textual prompts.
Longer Camera Prompt 1:

The camera begins with a slight arc to the left, capturing a drone POV of a serene lake and dock, before transitioning into a smooth trucking motion from right to left, maintaining steadiness through to the end.

Longer Camera Prompt 2:

The handheld camera, initially stationary and steady, focuses on three people. It then quickly moves backward to concentrate on two of them as they engage in conversation, maintaining minimal shaking throughout.

Longer Camera Prompt 3:

The camera moves forward to capture the street view, slightly unsteady with some shaking. Midway, it pans right to adjust its orientation while continuing to move forward, revealing more of the scene.

Longer Camera Prompt 4:

The camera, mounted on the front of a motorcycle and facing the driver, leads the subject from the front as the motorcycle advances. As the motorcycle moves forward, the camera moves backward relative to the scene.

Longer Camera Prompt 5:

The camera smoothly executes a forward and right arc around the church building, maintaining a steady motion without any shaking, and concludes with a seamless transition into a trucking motion to the right.

Longer Camera Prompt 6:

The camera executes an unsteady side-tracking shot, following a person skateboarding to the left. As the skateboarder jumps mid-video, the camera moves up, maintaining its leftward trajectory.

1.2 Multi-instance cross-validation of CT-1 (Image-Prompts)

To systematically evaluate the camera trajectory modeling capability of CT-1 under different image-text pairing conditions, we design two complementary cross-validation experiments that examine “multiple camera descriptions with a single image” and “multiple images with a single camera description”, respectively.
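
The two protocols can be stated compactly in code. Here ct1_predict is a hypothetical wrapper around CT-1 inference, used only to make the pairing logic explicit.

def cross_validate(images, prompts, ct1_predict):
    # (1) Multiple camera descriptions with a single image: fix the image,
    # vary the prompt; the trajectories should differ per description.
    one_image_runs = [ct1_predict(images[0], p) for p in prompts]

    # (2) Multiple images with a single camera description: fix the prompt,
    # vary the image; the same motion should be realized, adapted to each scene.
    one_prompt_runs = [ct1_predict(im, prompts[0]) for im in images]

    return one_image_runs, one_prompt_runs
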
Multiple camera descriptions with a single image:
Input Image 1
Reference Image A
Output Trajectories A-1

Camera Prompt (a): The camera zooming in creates a sense of intimacy and draws attention to the man's facial expressions.

Output Trajectories A-2

Camera Prompt (b): The camera zooming in creates a sense of intimacy and draws attention to the bed.

Output Trajectories A-3

Camera Prompt (c): The camera zooming in creates a sense of intimacy and draws attention to the wardrobe.

Input Image 2
Reference Image B
Output Trajectories B-1

Camera Prompt (d): The shot focuses on the Greyhound's head and collar.

Output Trajectories B-2

Camera Prompt (e): The shot focuses on the left side of the living room.

Output Trajectories B-3

Camera Prompt (f): The shot focuses on the right side of the home dining room.

Input Image 3
Reference Image C
Output Trajectories C-1

Camera Prompt (g): The camera begins wide, stage left, revealing the empty stage. It slowly arcs right, around the actress. The shot tightens as it moves.

Output Trajectories C-2

Camera Prompt (h): The shot starts on the actress's feet. It's a tight close-up. The camera slowly zooms in, revealing her full costume. Simultaneously, it cranes upwards.

Output Trajectories C-3

Camera Prompt (i): The camera begins far back on the empty stage. It slowly dollies forward. The shot tightens as it approaches the actress.

1.3 Multi-instance cross-validation of CT-1 (Prompt-Images)

Multiple images with a single camera description:
Camera Prompt A:

The camera smoothly moves closer to the subject.

Generated Image 1

Reference Image A-1


Output Trajectories A-1
Generated Image 2

Reference Image A-2


Output Trajectories A-2
Generated Image 3

Reference Image A-3


Output Trajectories A-3
Camera Prompt B:

The camera rotates horizontally from right to left.

Generated Image 1

Reference Image B-1


Output Trajectories B-1
Generated Image 2

Reference Image B-2


Output Trajectories B-2
Generated Image 3

Reference Image B-3


Output Trajectories B-3
Camera Prompt C:

The screen moves downward as the camera pans down.

Generated Image 1

Reference Image C-1


Output Trajectories C-1
Generated Image 2

Reference Image C-2


Output Trajectories C-2
Generated Image 3

Reference Image C-3


Output Trajectories C-3

2. Video Generation with CT-1

We systematically investigate the challenges of camera trajectory estimation and video generation across diverse complex scenarios. As illustrated in the following examples, we focus on multiple representative complex scenes and consider two typical camera motion patterns: forward motion and front-left rotational motion. For each scenario, we design scene-specific textual prompts to guide the model toward reasonable camera motion inference and video generation.

Below, we present qualitative examples in these challenging scenarios.







2.1 Comparison with State-of-the-Art Foundation Models

Furthermore, we compare our method with multiple state-of-the-art foundation models, including CogVideoX, LTX-Video, Wan2.1, and Wan2.2, using the same image and text inputs across diverse scenes. The qualitative visualizations clearly indicate that our approach achieves more accurate and consistent camera control.

Camera Prompt: The unsteady camera quickly pans from left to right, then moves forward to approach the screen, with a slight shake throughout the rapid movement.


Camera Prompt: The camera smoothly dollies forward while simultaneously panning to the right, then continues moving straight in the same direction with minimal shaking.


Camera Prompt: The camera arcs counterclockwise with very smooth, minor movement, maintaining steadiness throughout.


Camera Prompt: The drone shot glides forward with a slight downward movement, maintaining a very smooth and steady trajectory throughout the flight.


Camera Prompt: The camera moves forward to capture the street view, slightly unsteady with some shaking. Midway, it pans right to adjust its orientation while continuing to move forward, revealing more of the scene.


Camera Prompt: The camera smoothly moves backward while tilting down and panning left, maintaining a steady and fluid motion throughout.


Camera Prompt: The drone shot smoothly flies forward and downward while tilting down, quickly closing in on a hot air balloon, with the camera movement remaining very smooth and free of any shaking.


Camera Prompt: The camera smoothly tracks backward while trucking right, maintaining minimal shaking throughout the movement.

2.2 Cross-Model: Applying CT-1 to Other Controllable Video Generation Models

In our camera-controllable video generation framework, the camera parameters predicted by CT-1 are designed to be compatible with existing video generation models. Accordingly, we feed the camera trajectories estimated by CT-1 into CameraCtrl and MotionCtrl, and evaluate their performance on different datasets, namely RealEstate10K and MultiCamVideo.
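
Because CameraCtrl conditions on per-pixel Plücker embeddings of the camera rays while MotionCtrl consumes flattened [R|t] pose vectors, the predicted trajectories need only a light conversion step. The sketch below illustrates both conversions under assumed conventions (world-to-camera extrinsics and a shared pinhole intrinsic matrix K); it is not the project's actual adapter code.

import numpy as np

def poses_to_motionctrl_condition(extrinsics):
    # Flatten per-frame [R|t] into the (num_frames, 12) pose vectors that
    # MotionCtrl-style camera conditioning consumes.
    return extrinsics[:, :3, :4].reshape(len(extrinsics), 12)

def poses_to_plucker(extrinsics, K, h, w):
    # Per-pixel Plücker ray embeddings, the camera representation used by
    # CameraCtrl-style conditioning. Returns (num_frames, h, w, 6).
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (h, w, 3) homogeneous pixels
    rays_cam = pix @ np.linalg.inv(K).T                # camera-frame ray directions

    out = []
    for E in extrinsics:                               # E: (4, 4) world-to-camera
        R, t = E[:3, :3], E[:3, 3]
        o = -R.T @ t                                   # camera center in world frame
        d = rays_cam @ R                               # world-frame directions (R^T d_cam)
        d = d / np.linalg.norm(d, axis=-1, keepdims=True)
        m = np.cross(o, d)                             # Plücker moment o x d
        out.append(np.concatenate([m, d], axis=-1))
    return np.stack(out)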

CT-1 to CameraCtrl Method

Ref.: CameraCtrl: Enabling Camera Control for Text-to-Video Generation

CT-1 to MotionCtrl Method

Ref.: MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

2.3 More Examples of Our Camera-Controllable Video Generation Framework

We further validate the effectiveness and robustness of the proposed approach across a broader range of diverse scenarios.

2.4 Extension: Performance of CT-1 on Driving Scenarios

Finally, we conduct cross-scenario validation of both the CT-1 model and the overall video generation framework in driving scenarios.