* This project page contains a large number of videos; please wait patiently for them to load.
* Note: due to the high level of realism and strong camera dynamics in the generated videos, extended viewing may cause visual discomfort for some viewers.
Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
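As a rough illustration of what a wavelet-based regularization loss on camera trajectories could look like, the sketch below compares a predicted and a target trajectory in the coefficient space of a multi-level Haar transform. This is not the CT-1 implementation: the page does not state the wavelet family, the number of levels, the weighting, or the trajectory parameterization, so the Haar basis, the `(T, D)` per-frame pose layout, the L1 distance, and the function names `haar_dwt`, `wavelet_coeffs`, and `wavelet_reg_loss` are all assumptions made for illustration.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar transform along the time axis.

    x has shape (T, D) with even T; returns (approx, detail), each (T/2, D).
    """
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def wavelet_coeffs(traj, levels):
    """Multi-level Haar decomposition of a trajectory.

    Returns the detail bands from finest to coarsest, followed by the
    final approximation band. T must be divisible by 2**levels.
    """
    approx = np.asarray(traj, dtype=np.float64)
    if approx.shape[0] % (2 ** levels) != 0:
        raise ValueError("trajectory length must be divisible by 2**levels")
    coeffs = []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        coeffs.append(detail)
    coeffs.append(approx)
    return coeffs

def wavelet_reg_loss(pred, target, levels=2):
    """Mean absolute difference between wavelet bands, averaged over bands.

    Assumed form only: equal band weights and an L1 distance.
    """
    p = wavelet_coeffs(pred, levels)
    t = wavelet_coeffs(target, levels)
    return sum(np.mean(np.abs(a - b)) for a, b in zip(p, t)) / len(p)

# Hypothetical example: 8 frames, 6 pose parameters per frame.
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 6))
pred = target + 0.1 * rng.normal(size=(8, 6))
loss = wavelet_reg_loss(pred, target)
```

Comparing band by band lets fast components of a trajectory (for example, handheld shake) and slow components (for example, a steady dolly) contribute separately to the penalty, which is one plausible reason to regularize in the frequency domain.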
Qualitative example (a).
Qualitative example (b).
The unsteady camera quickly pans from left to right.
The camera rotates around the helicopter, maintaining a steady motion.
The camera moves forward, following the subject from behind with an unsteady motion, marked by noticeable shaking.
The handheld camera slowly moves backward toward the man while panning right to track the walking woman.
The camera smoothly dollies forward while simultaneously panning to the right, then continues moving straight.
The camera smoothly moves backward while tilting down and panning left, maintaining a steady and fluid motion throughout.
The camera begins with a slight arc to the left, capturing a drone POV of a serene lake and dock, before transitioning into a smooth trucking from right to left, maintaining steadiness through to the end.
The handheld camera, initially stationary with a smooth steadiness, focuses on three people. It then quickly moves backward to concentrate on two of them as they engage in conversation, maintaining minimal shaking throughout.
The camera moves forward to capture the street view, slightly unsteady with some shaking. Midway, it pans right to adjust its orientation while continuing to move forward, revealing more of the scene.
The camera, mounted on the front of a motorcycle and facing the driver, leads the subject from the front as the motorcycle advances. As the motorcycle moves forward, the camera moves backward relative to the scene.
The camera smoothly executes a forward and right arc around the church building, maintaining a steady motion without any shaking, and concludes with a seamless transition into a trucking motion to the right.
The camera executes an unsteady side-tracking shot, following a person skateboarding to the left. As the skateboarder jumps mid-video, the camera moves up, maintaining its leftward trajectory.
Camera Prompt (a): The camera zooming in creates a sense of intimacy and draws attention to the man's facial expressions.
Camera Prompt (b): The camera zooming in creates a sense of intimacy and draws attention to the bed.
Camera Prompt (c): The camera zooming in creates a sense of intimacy and draws attention to the wardrobe.
Camera Prompt (d): The shot focuses on the Greyhound's head and collar.
Camera Prompt (e): The shot focuses on the left side of the living room.
Camera Prompt (f): The shot focuses on the right side of the home dining room.
Camera Prompt (h): The camera begins wide, stage left, revealing the empty stage. It slowly arcs right, around the actress. The shot tightens as it moves.
Camera Prompt (i): The shot starts on the actress's feet. It's a tight close-up. The camera slowly zooms in, revealing her full costume. Simultaneously, it cranes upwards.
Camera Prompt (j): The camera begins far back on the empty stage. It slowly dollies forward. The shot tightens as it approaches the actress.
The camera smoothly moves closer to the subject.
Reference Image A-1
Reference Image A-2
Reference Image A-3
The camera rotates horizontally from right to left.
Reference Image B-1
Reference Image B-2
Reference Image B-3
The screen moves downward as the camera pans down.
Reference Image C-1
Reference Image C-2
Reference Image C-3
Below, we give qualitative examples in challenging scenarios.
Camera Prompt: The unsteady camera quickly pans from left to right, then moves forward to approach the screen, with a slight shake throughout the rapid movement.
Camera Prompt: The camera smoothly dollies forward while simultaneously panning to the right, then continues moving straight in the same direction with minimal shaking.
Camera Prompt: The camera arcs counterclockwise with very smooth, minor movement, maintaining steadiness throughout.
Camera Prompt: The drone shot glides forward with a slight downward movement, maintaining a very smooth and steady trajectory throughout the flight.
Camera Prompt: The camera moves forward to capture the street view, slightly unsteady with some shaking. Midway, it pans right to adjust its orientation while continuing to move forward, revealing more of the scene.
Camera Prompt: The camera smoothly moves backward while tilting down and panning left, maintaining a steady and fluid motion throughout.
Camera Prompt: The drone shot smoothly flies forward and downward while tilting down, quickly closing in on a hot air balloon, with the camera movement remaining very smooth and free of any shaking.
Camera Prompt: The camera smoothly tracks backward while trucking right, maintaining minimal shaking throughout the movement.