[ICML'26] CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping

Haoyu Zhao, Jiaxi Gu, Haoran Chen, Qingping Zheng, Yeying Jin, Hongyi Yang, Junqi Cheng, Yuang Zhang,
Zenghui Lu, Huan Yu, Jie Jiang, Peng Shu, Zuxuan Wu, Yu-Gang Jiang
Fudan University; Tencent

* This project page contains a large number of videos; please wait for them to load.


Abstract

Precise camera pose control is critical for video diffusion, yet maintaining geometric consistency remains a challenge. Existing methods that directly inject numerical camera parameters into the diffusion backbone often fail to bridge the gap between abstract coordinates and visual content, leading to structural distortions. To address this issue, we propose CameraNoise, a flow-to-noise warping method that encodes camera motion into a temporally coherent stochastic representation. Unlike conventional conditioning, CameraNoise embeds camera poses directly into the noise space. This decouples motion from scene appearance while faithfully preserving trajectory dynamics. Specifically, we introduce a novel Geometry-guided Reprojection Flow and a noise warping algorithm, which jointly preserve the Gaussian prior of diffusion and ensure consistent noise propagation under camera transformations. By integrating CameraNoise into the diffusion process, our framework delivers stable, high-fidelity videos. Extensive experiments demonstrate that our approach significantly outperforms prior methods in both visual quality and trajectory faithfulness.

1. Warped CameraNoise from Source Video

CameraNoise is an appearance-agnostic noise representation that encodes the camera poses of the input video.

Pause the video at any frame to inspect a single frame: it follows a Gaussian distribution, and the motion information is no longer visible. The camera motion is carried only by how the noise changes from frame to frame.
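
The following toy example illustrates that observation. It is only a sketch under stated assumptions: the "warped" noise is simulated with a wrap-around horizontal shift (`np.roll`), which stands in for a simple pan and is not the CameraNoise warping algorithm. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, T = 128, 128, 8

# Toy stand-in for warped noise (assumption, not the paper's algorithm):
# one white-noise frame shifted by t pixels per frame, i.e. a horizontal pan.
base = rng.standard_normal((H, W))
frames = np.stack([np.roll(base, shift=t, axis=1) for t in range(T)])

def corr(a, b):
    """Pearson correlation between two equally shaped arrays."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

# A single paused frame: statistics of plain Gaussian noise.
frame_mean = float(frames[3].mean())
frame_std = float(frames[3].std())

# Consecutive frames compared pixel by pixel look unrelated ...
same_pixel_corr = corr(frames[3], frames[4])
# ... but after undoing the one-pixel shift they are identical.
aligned_corr = corr(np.roll(frames[3], 1, axis=1), frames[4])

print(f"mean={frame_mean:.3f} std={frame_std:.3f}")
print(f"same-pixel corr={same_pixel_corr:.3f} aligned corr={aligned_corr:.3f}")
```

Each frame on its own is indistinguishable from fresh noise; the motion only shows up in the correspondence between frames.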

2. Camera-controllable Image-to-Video (I2V) Generation

2.1 Our method transforms camera poses into CameraNoise, a noise-space representation of camera motion. This appearance-independent representation is introduced into the denoising process of the video diffusion model, enabling the generated videos to accurately capture camera motion trajectories.

2.2 More camera-controllable I2V results. Each video is generated from a reference image, camera poses, and a text prompt.

3. Camera-controllable Text-to-Video (T2V) Generation

3.1 Each video is generated from a text prompt and camera poses.

3.2 More camera-controllable T2V results.

4. Dynamic and outdoor scenes with different camera poses.

4.1 We demonstrate dynamic results across multiple scenes under three different types of camera motion. Given a reference image, camera poses, and textual descriptions, our model is able to generate the corresponding dynamic scenes.

Camera1: Move-Up Shot.

Camera2: Counterclockwise Rotation Shot.

Camera3: Move-Down Shot.

4.2 In the three examples below, we demonstrate the model's ability to generate dynamic videos under complex camera motion. All three videos use the same CameraNoise camera trajectory. Compared with the original 2-second setting, we directly generate longer 4-second videos to further verify the model's stability and dynamic performance in long-horizon generation. We choose amusement park scenes as representative examples to showcase the model's capability in dynamic scene generation.

4.3 In the three videos below, we further demonstrate the model's ability to generate dynamic content under a static camera. These results highlight two key properties. First, camera motion and object motion are not inherently coupled in our model: even under a static camera, the boat in the scene can still move naturally. Second, the motion state and speed of the object can be controlled effectively by text prompts, with different prompts leading to different speed patterns. This also shows that the model is not limited to slow motion, but can produce a broader range of dynamic behaviors.

Inference Prompts:

1) Low speed: a boat splashing down a steep water slope, huge arcs of water frozen in the air, wet rails, bright reflections.

2) Medium speed: a boat racing rapidly down a steep water slope, explosive arcs of water bursting into the air, wet rails, bright reflections.

3) High speed: a boat speeding down a steep water slope at high speed, massive splashes frozen in midair, wet rails, bright reflections.

5. Vehicle driving scene.

Vehicle driving scenarios are among the most challenging dynamic scenes. We test our image-to-video model on data from the DrivingDoJo dataset. The results show that it handles common driving situations effectively, including daytime, nighttime, straight driving, and turning.

6. Single scene under different camera poses.

Scene1: A vibrant forest scene is filled with various birds flying, surrounded by trees, green mossy ground, and sunlight.

Scene2: A golden retriever stands in a sunlit grassy field, with trees and open green space in the background.

Scene3: A cowboy rides a horse along a winding dirt road through a golden and sunlit field with fences.

We showcase the dynamic motion of the same scene under different camera poses, demonstrating that our model exhibits strong robustness across various scenes. The camera poses used are sourced from the MultiCamVideo dataset.

7. Camera pose transfer: source video to generated videos.

Given an input video, we can obtain its camera parameters using the VGGT model, and then convert these parameters into CameraNoise with our proposed algorithm to provide the model with camera control capability. We select two scenes from the MultiCamVideo dataset and transfer the camera motion under three different reference image conditions. The results demonstrate that our model can achieve lossless transfer of camera motion from the input video.

8. Framework


Overview of our framework. We introduce CameraNoise, a controlled noise signal that encodes the temporal correlations of camera poses into video diffusion. CameraNoise is constructed via Geometry-guided Reprojection Flow (GRFlow) and a Gaussian-preserving warping algorithm, and is then injected into the video diffusion model to enable precise viewpoint control. Bold green arrows illustrate the flow of control signals from camera poses to the synthesized video.
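
To give a feel for what "Gaussian-preserving" means in the warping step, here is a deliberately simplified sketch. It is not the warping algorithm of the paper: it uses a nearest-neighbour backward lookup so that every output pixel is either a copy of exactly one N(0, 1) sample or a freshly drawn one, which keeps the per-pixel distribution at N(0, 1). The function name `warp_noise_nearest` and the flow convention (target pixel to source pixel) are assumptions made for illustration.

```python
import numpy as np

def warp_noise_nearest(noise, flow, rng):
    """Backward-warp one noise frame with a nearest-neighbour lookup.

    noise: (H, W) samples from N(0, 1)
    flow:  (H, W, 2) per-pixel (dx, dy) from each target pixel to its source pixel
    rng:   numpy Generator used to refill pixels whose source is out of frame
    """
    H, W = noise.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    inside = (src_x >= 0) & (src_x < W) & (src_y >= 0) & (src_y < H)

    # Start from fresh noise so out-of-frame pixels are still N(0, 1).
    out = rng.standard_normal((H, W))
    # Copy (never average) source samples: averaging would shrink the variance.
    out[inside] = noise[src_y[inside], src_x[inside]]
    return out

rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
flow = np.zeros((64, 64, 2))
flow[..., 0] = 3.0  # every target pixel reads from 3 pixels to its right

warped = warp_noise_nearest(noise, flow, rng)
print(f"std before={noise.std():.3f} after={warped.std():.3f}")
```

A limitation of this stand-in is that several target pixels can copy the same source sample (for example under a zoom), which introduces spatial duplicates; a complete noise-warping method has to handle that case as well.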

9. Geometry-guided Reprojection Flow (GRFlow)

Reprojection of camera poses onto a 2D grid.

In this example, we show the GRFlows generated for the leftmost video under different alpha values in Eq. (5).
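
As a rough illustration of how a camera pose can be turned into a 2D flow field without looking at image content, the sketch below reprojects a pixel grid through a pinhole camera. It assumes every pixel lies at one constant pseudo-depth `d` and does not model the alpha parameter of Eq. (5); the function name `reprojection_flow` and the pose convention (`R`, `t` map source-camera coordinates to target-camera coordinates) are assumptions, so this should be read as standard reprojection geometry rather than the exact GRFlow definition.

```python
import numpy as np

def reprojection_flow(K, R, t, d, H, W):
    """Flow induced by a relative camera pose on a plane at constant depth d.

    K: (3, 3) intrinsics; R: (3, 3) rotation; t: (3,) translation.
    Returns an (H, W, 2) array of (du, dv) displacements in pixels.
    """
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                   # K^-1 p for every pixel
    pts = d * rays                                    # 3D points at depth d
    pts_new = pts @ R.T + np.asarray(t, dtype=np.float64)
    proj = pts_new @ K.T                              # back to the image plane
    uv_new = proj[..., :2] / proj[..., 2:3]           # perspective divide
    return uv_new - pix[..., :2]

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
identity = reprojection_flow(K, np.eye(3), np.zeros(3), 0.5, 4, 6)
near = reprojection_flow(K, np.eye(3), [0.1, 0.0, 0.0], 0.5, 4, 6)
far = reprojection_flow(K, np.eye(3), [0.1, 0.0, 0.0], 2.0, 4, 6)
print(identity.max(), near[0, 0], far[0, 0])
```

The flow depends only on the intrinsics, the pose, and `d`, so it carries no object contours or appearance. In this toy setup the same sideways translation produces a larger displacement at `d = 0.5` than at `d = 2.0`.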

10. Our Appearance-agnostic CameraNoise via GRFlow vs. Appearance-motion Entangled Optical-flow-based Noise

* To make the motion and appearance in the noise easier to see, the videos are played at 2x speed.

Since optical flow inherently contains object contours and appearance information, the warped noise derived from it inevitably carries appearance priors. During inference in diffusion models, such information can conflict semantically with the noise prior and control conditions, ultimately leading to generation failure.

11. CameraNoise warping with different values of pseudo-depth d.

11.1 We present the warping results of CameraNoise under different values of d. Smaller values of d amplify the camera motion. As d increases, push-in and pull-out camera movements become noticeably less effective, and rotation-based camera motions show more visible artifacts. In practice, we find that d=0.5 performs best.
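
The translation part of this trend can be reproduced with a short derivation. It is a sketch under assumptions that may differ from the paper's exact formulation: a pinhole camera with focal length f and principal point (c_x, c_y), all pixels at the same pseudo-depth d, and a relative translation t = (t_x, t_y, t_z) with no rotation.

```latex
% Pixel u is lifted to depth d, translated by t, and projected again.
\begin{aligned}
u' &= c_x + \frac{d\,(u - c_x) + f\,t_x}{d + t_z}, \\
\Delta u &= u' - u = \frac{f\,t_x - (u - c_x)\,t_z}{d + t_z}.
\end{aligned}
% Sideways move (t_z = 0):       \Delta u = f\,t_x / d
% Push-in / pull-out (t_x = 0):  \Delta u = -(u - c_x)\,t_z / (d + t_z)
```

For a sideways move the displacement scales as 1/d, so a small d amplifies the motion: with f = 100 and t_x = 0.1, d = 0.5 gives a shift of 20 pixels while d = 5 gives 2 pixels. For a push-in or pull-out the displacement tends to zero as d grows, which is consistent with those motions fading at large d.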

11.2 We further evaluate the video generation results under large d values and select the push-in/pull-out camera motion, which is the most sensitive to d, as the test case, using the template from the "cat" video in the fourth row above. As shown, when d becomes large, CameraNoise can no longer be properly represented, and the generated videos therefore fail to accurately express the intended push-in/pull-out motion.

12. Comparison with Previous State-of-the-Art Methods.

Since CameraCtrl, MotionCtrl, and Go-with-the-Flow were trained on different datasets, we visualize their zero-shot results on the MultiCamVideo dataset to ensure a fair comparison. Note that our focus is on controlling the camera motion in the scene, rather than dictating how the people within the scene move. GT denotes the ground truth.

We observe that all these methods exhibit varying degrees of degradation when applied to new scenes:

1) CameraCtrl shows declines in both camera control accuracy and visual content quality;

2) MotionCtrl almost completely loses its camera control capability in the new scenes;

3) Go-with-the-Flow suffers from a noticeable drop in visual content quality.

13. Comparison with State-of-the-Art Methods under OOD Scenarios.

Camera pose type 1: move-up shot.

Camera pose type 2: move-down shot.

Camera pose type 3: move-left shot.

Camera pose type 4: move-right shot.

Camera pose type 5: move-clockwise shot.

Camera pose type 6: move-in shot.

We evaluate MotionCtrl, CameraCtrl, Go-with-the-Flow, GEN3C, and our CameraNoise in six typical out-of-distribution (OOD) scenarios: valleys, fields, lakes, deserts, forests, and amusement parks (see the reference images in the first column). For each scene, we test these methods using six representative camera motions provided by GEN3C.

Based on these results, we summarize the characteristics and limitations of current mainstream methods under OOD conditions:

1) MotionCtrl and CameraCtrl: exhibit large deviations in camera following, indicating limited robustness in camera control;

2) Go-with-the-Flow: prone to excessive camera motion and occasional content collapse;

3) GEN3C: produces static scenes where objects cannot move, resulting in rigid video content. Additionally, due to its reliance on 3D feature modeling, it is susceptible to scene penetration issues (e.g., camera pose 5);

4) Our method: demonstrates superior performance in OOD scenarios in terms of camera control accuracy, content consistency, and motion dynamics.

BibTeX


      @article{zhao2026cameranoise,
        title={CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping},
        author={Zhao, Haoyu and Gu, Jiaxi and Chen, Haoran and Zheng, Qingping and Jin, Yeying and Yang, Hongyi and Cheng, Junqi and Zhang, Yuang and Lu, Zenghui and Yu, Huan and Jiang, Jie and Shu, Peng and Wu, Zuxuan and Jiang, Yu-Gang},
        journal={Forty-third International Conference on Machine Learning (ICML)},
        year={2026}
      }