CoDeF

CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

CoDeF (Content Deformation Fields) is a new type of video representation that can be used for a variety of temporal video processing tasks, such as video style transfer, keypoint tracking, and video super-resolution.

CoDeF consists of two parts: a canonical content field and a temporal deformation field. The canonical content field aggregates the static contents of the entire video, and rendering it yields a single canonical image. The temporal deformation field records the transformation from that canonical image to each individual frame along the time axis. Neither field is computed ahead of time; both are obtained by optimization against the target video.

Given a target video, the two fields are jointly optimized to reconstruct it through a tailored rendering pipeline. Regularizations added to the optimization push the canonical content field to inherit semantics from the video, such as object shape, so that image algorithms can operate on the canonical image.
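The reconstruction idea can be illustrated with a deliberately simplified sketch. It assumes the deformation for each frame is a single integer shift (dx, dy), keeps the canonical image fixed, and replaces gradient-based optimization with a brute-force search. The real method learns dense, non-rigid fields and updates both fields jointly, so every name below (render_frame, mse, fit_shift) is hypothetical and not part of the official code.

```python
# Toy sketch only: a "video" is a canonical image plus one (dx, dy) shift per frame.
# Assumption: integer translations stand in for CoDeF's learned deformation field.

def render_frame(canonical, shift):
    """Render a frame: pixel (x, y) reads canonical at (x + dx, y + dy), clamped to the border."""
    h, w = len(canonical), len(canonical[0])
    dx, dy = shift
    return [
        [canonical[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
         for x in range(w)]
        for y in range(h)
    ]

def mse(a, b):
    """Mean squared reconstruction error between two equally sized images."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n

def fit_shift(canonical, frame, max_shift=2):
    """Brute-force stand-in for optimization: pick the shift with the lowest error."""
    candidates = [(dx, dy)
                  for dx in range(-max_shift, max_shift + 1)
                  for dy in range(-max_shift, max_shift + 1)]
    return min(candidates, key=lambda s: mse(render_frame(canonical, s), frame))

canonical = [
    [0, 0, 0, 0, 0],
    [0, 9, 5, 0, 0],
    [0, 3, 7, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
true_shifts = [(0, 0), (-1, 0), (-1, -1)]

# Build a target video from known shifts, then recover the shifts by reconstruction.
video = [render_frame(canonical, s) for s in true_shifts]
recovered = [fit_shift(canonical, frame) for frame in video]
errors = [mse(render_frame(canonical, s), f) for s, f in zip(recovered, video)]
```

In this toy setting the recovered shifts match the ones used to build the video and the reconstruction error is zero. With real footage the error would not vanish, and the canonical image would have to be optimized as well.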

CoDeF has several advantages over existing video representations. First, it collects the static contents of a video into a single canonical image, which is a compact representation. This suits tasks such as video style transfer and video super-resolution, where an edit can be made once on the canonical image rather than separately on every frame.

Second, the temporal deformation field explicitly records how content moves from frame to frame. This suits tasks such as keypoint tracking and video stabilization, where temporal consistency matters.

Third, CoDeF lifts image processing algorithms to video processing algorithms. An image algorithm is applied to the canonical image, and the result is propagated to the entire video with the aid of the temporal deformation field. Existing image algorithms can therefore be reused for video without any additional training.
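A minimal sketch of this lifting step, under the same simplifying assumption that each frame is the canonical image shifted by an integer (dx, dy). The image algorithm here is a plain intensity inversion, chosen only because it is easy to check. The functions warp and invert are hypothetical illustrations, not the project's API.

```python
# Toy sketch only: lift an image algorithm to video through a per-frame shift.
# Assumption: integer translations stand in for the temporal deformation field.

def warp(canonical, shift):
    """Frame pixel (x, y) reads canonical at (x + dx, y + dy), clamped to the border."""
    h, w = len(canonical), len(canonical[0])
    dx, dy = shift
    return [
        [canonical[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
         for x in range(w)]
        for y in range(h)
    ]

def invert(image, max_value=9):
    """Stand-in image algorithm: invert pixel intensities."""
    return [[max_value - p for p in row] for row in image]

canonical = [
    [0, 0, 0, 0],
    [0, 9, 5, 0],
    [0, 3, 7, 0],
    [0, 0, 0, 0],
]
shifts = [(0, 0), (-1, 0), (-1, -1)]

# Lifted processing: run the image algorithm ONCE on the canonical image,
# then propagate the result to every frame with that frame's deformation.
edited_canonical = invert(canonical)
lifted_video = [warp(edited_canonical, s) for s in shifts]

# Baseline for comparison: run the algorithm separately on each rendered frame.
per_frame_video = [invert(warp(canonical, s)) for s in shifts]
```

For a deterministic, per-pixel algorithm such as this one, both routes give identical frames. The lifted route is the interesting one when the image algorithm is stochastic or context-dependent, because every frame then inherits the same single result instead of an independently computed one.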

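The paper's experiments lift image-to-image translation to video-to-video translation and keypoint detection to keypoint tracking, in both cases without any training. Because the image algorithm runs on only one image, the processed videos show better cross-frame consistency than existing video-to-video translation approaches, and tracking works even for non-rigid objects such as water and smog.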

Here are some examples of what CoDeF can do:

Transfer a new style to a video by translating its canonical image.
Track the keypoints of an object in a video over time.
Super-resolve a low-resolution video to a high-resolution video.
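The keypoint-tracking example can be sketched in the same toy setting. It assumes each frame is the canonical image shifted by an integer (dx, dy), so the mapping from canonical coordinates to frame coordinates has a closed form. The detector is a stand-in that returns the brightest pixel. All function names are hypothetical.

```python
# Toy sketch only: detect a keypoint once on the canonical image, then track it.
# Assumption: frame pixel (x, y) reads canonical at (x + dx, y + dy), so a
# canonical point (cx, cy) appears in that frame at (cx - dx, cy - dy).

def render_frame(canonical, shift):
    """Render a frame from the canonical image, clamping reads to the border."""
    h, w = len(canonical), len(canonical[0])
    dx, dy = shift
    return [
        [canonical[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
         for x in range(w)]
        for y in range(h)
    ]

def detect_keypoint(image):
    """Stand-in keypoint detector: return (x, y) of the brightest pixel."""
    h, w = len(image), len(image[0])
    return max(((x, y) for y in range(h) for x in range(w)),
               key=lambda p: image[p[1]][p[0]])

def track_keypoint(canonical_xy, shifts):
    """Map one canonical keypoint into every frame by inverting each shift."""
    cx, cy = canonical_xy
    return [(cx - dx, cy - dy) for dx, dy in shifts]

canonical = [
    [0, 0, 0, 0, 0],
    [0, 9, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
shifts = [(0, 0), (-1, 0), (-1, -1), (-2, -1)]
video = [render_frame(canonical, s) for s in shifts]

keypoint = detect_keypoint(canonical)     # detected once, on the canonical image
track = track_keypoint(keypoint, shifts)  # one (x, y) position per frame
```

The detector never sees the individual frames: the track comes entirely from the deformation. A learned non-rigid field has no such closed-form inverse, so a real implementation would need another way to map canonical points into each frame.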

CoDeF is a promising new video representation that could change how videos are processed. It is still under development, but its early results are encouraging.

Official PyTorch implementation of CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

qiuyu96.github.io/CoDeF/

Paper:

CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen

We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog. Project page can be found at qiuyu96.github.io/CoDeF/.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:

arXiv:2308.07926 [cs.CV]

(or arXiv:2308.07926v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2308.07926

Submission history

From: Hao Ouyang
