Announcements

[Data Science Lecture Series] Time: 2024.05.01 (Wed.) 15:30-17:20 @ EC115; Speaker: Jun-Cheng Chen, Associate Research Fellow, Research Center for Information Technology Innovation, Academia Sinica; Title: MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence

Details of the Data Science Lecture Series talk on May 1, 2024 (Wednesday) are below. Everyone is welcome to attend!

Time: 2024.05.01 (Wed.) 15:30-17:20

Venue: Room EC115, Engineering Building 3

Speaker (name, title/affiliation): Jun-Cheng Chen, Associate Research Fellow, Research Center for Information Technology Innovation, Academia Sinica


-----------------------------------------------------------------------------------------------------

Title: MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence

Abstract

With the great success of text-to-image technology, much research has moved on to developing advanced text-to-video technologies, such as OpenAI's Sora. Nevertheless, these text-to-video models still have difficulty maintaining temporal consistency and usually require huge amounts of data and video memory for training and evaluation. To tackle these issues, we introduce an efficient and effective method, MeDM, that utilizes pre-trained image diffusion models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observation-space scores in latent diffusion models. Notably, MeDM does not require fine-tuning or test-time optimization of the diffusion models. Extensive qualitative, quantitative, and subjective experiments on various benchmarks demonstrate the effectiveness and superiority of the proposed approach. If time permits, I will also discuss our recent work using pre-trained diffusion models to perform data augmentation for domain-adaptive object detection.
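
As a rough illustration of the closed-form step mentioned in the abstract (a minimal sketch only; the function name enforce_temporal_consistency and the trajectory-label representation below are hypothetical, not taken from the MeDM paper): if optical flow links pixels across frames into trajectories, the shared value v that minimizes the sum of squared differences to the frame-wise estimates on a trajectory is simply their mean, so one consistency pass amounts to averaging along each trajectory and scattering the result back to every frame.

import numpy as np

def enforce_temporal_consistency(frames, traj_ids):
    # frames   : (T, H, W, C) float array of independent frame-wise estimates.
    # traj_ids : (T, H, W) integer array; pixels sharing a label are assumed
    #            to be linked across frames by optical flow into one trajectory.
    # For each trajectory, minimizing sum_j ||v - x_j||^2 over a shared value v
    # has the closed-form solution v = mean(x_j).
    T, H, W, C = frames.shape
    ids = traj_ids.reshape(-1)                      # (T*H*W,)
    px = frames.reshape(-1, C)                      # (T*H*W, C)
    n = ids.max() + 1

    sums = np.zeros((n, C), dtype=frames.dtype)
    counts = np.zeros(n, dtype=frames.dtype)
    np.add.at(sums, ids, px)                        # scatter-add per trajectory
    np.add.at(counts, ids, 1)
    means = sums / np.maximum(counts, 1)[:, None]   # trajectory means (closed form)

    return means[ids].reshape(T, H, W, C)           # broadcast back to all frames

The full method would additionally map such an observation-space result back into the latent space of Stable Diffusion, as the abstract's workaround describes; this sketch covers only the averaging step.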

Speaker Bio

Jun-Cheng Chen is currently an associate research fellow at the Research Center for Information Technology Innovation (CITI), Academia Sinica. He joined CITI as an assistant research fellow in 2019. He received the B.S. and M.S. degrees in Computer Science and Information Engineering from National Taiwan University, Taiwan (R.O.C.), in 2004 and 2006, respectively, advised by Prof. Ja-Ling Wu, and the Ph.D. degree in Computer Science from the University of Maryland, College Park, USA, in 2016, advised by Prof. Rama Chellappa. From 2017 to 2019, he was a postdoctoral research fellow at the University of Maryland Institute for Advanced Computer Studies. His research interests include computer vision, machine learning, deep learning, and their applications to biometrics, such as face recognition/facial analytics and activity recognition/detection in the visual surveillance domain. He was a recipient of the ACM Multimedia Best Technical Full Paper Award in 2006.