VideoComposer: Compositional Video Synthesis with Motion Controllability

2024-04-23 05:48:01

VideoComposer decomposes a video into three distinct types of conditions: textual conditions, spatial conditions, and temporal conditions.

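Conditions:
a. Textual condition: describes coarse-grained visual content and motion; encoded with the text encoder of OpenCLIP ViT-H/14.
b. Spatial conditions: the goal is to achieve fine-grained spatial control.
ⅰ. Single image: a single image reveals the content and structure of the video; the first frame of the video is used as the spatial condition for image-to-video generation.
ⅱ. Single sketch: PiDiNet is used to extract a sketch of the first frame.
ⅲ. Style: to transfer the style of an image to the video, the image embedding is used as a condition, extracted with the image encoder of OpenCLIP ViT-H/14.
c. Temporal conditions:
ⅰ. Motion vector: an optical-flow map.
ⅱ. Depth sequence: a pretrained depth estimation model is used to extract depth.
ⅲ. Mask sequence: used for editing and inpainting tasks.
ⅳ. Sketch sequence.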
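Condition handling: the conditions fall into two groups, depending on whether they pass through the STC-encoder. The text and style (image embedding) conditions interact with the network through cross-attention. The remaining conditions pass through the STC-encoder, and each output has the same size as the video latent; these outputs are first combined by element-wise addition, and the sum is then concatenated with $x_t$ and fed into the network.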
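The add-then-concatenate step can be illustrated with a small NumPy sketch. It is only a shape-level illustration, not the paper's implementation: the function name `fuse_conditions`, the `(batch, channels, frames, height, width)` layout, and the choice of the channel axis for concatenation are all assumptions, and random arrays stand in for real STC-encoder outputs.

```python
import numpy as np

def fuse_conditions(x_t, encoded_conditions):
    """Sum encoded conditions element-wise, then concatenate the sum with x_t.

    Every encoded condition is assumed to already have the same shape as x_t
    (i.e. it has been through the STC-encoder). The channel axis (axis=1) is
    an assumed concatenation axis.
    """
    if not encoded_conditions:
        raise ValueError("need at least one encoded condition")
    for cond in encoded_conditions:
        if cond.shape != x_t.shape:
            raise ValueError(f"shape mismatch: {cond.shape} vs {x_t.shape}")
    # Element-wise addition over all conditions.
    fused = np.sum(np.stack(encoded_conditions, axis=0), axis=0)
    # Concatenate the noisy latent and the fused condition along channels.
    return np.concatenate([x_t, fused], axis=1)

# Hypothetical layout: (batch, channels, frames, height, width).
rng = np.random.default_rng(0)
x_t = rng.standard_normal((1, 4, 8, 16, 16))
depth = rng.standard_normal(x_t.shape)   # stand-in for an encoded depth sequence
sketch = rng.standard_normal(x_t.shape)  # stand-in for an encoded sketch sequence

net_in = fuse_conditions(x_t, [depth, sketch])
print(net_in.shape)
```

Because the conditions are summed before concatenation, the input width of the network stays fixed no matter how many conditions are supplied, which is what makes arbitrary combinations of conditions possible.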
Training strategy: two-stage training, with a pretraining stage followed by conditional video generation training.
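Inference: classifier-free guidance is used, $\widehat\epsilon_\theta(z_t,c,t) = \epsilon_\theta(z_t,c_1,t) + w(\epsilon_\theta(z_t,c_2,t)-\epsilon_\theta(z_t,c_1,t))$, where $c_1$ and $c_2$ are two sets of conditions and the guidance emphasizes the conditions that are in $c_2$ but not in $c_1$. For example, in text-driven video inpainting, $c_2$ is the caption plus the masked video, and $c_1$ is the masked video alone.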
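The guidance formula is a simple linear extrapolation between two noise predictions. A minimal sketch, assuming the two predictions $\epsilon_\theta(z_t,c_1,t)$ and $\epsilon_\theta(z_t,c_2,t)$ have already been computed by the denoiser; the function name `guided_noise` and the toy values are illustrative only.

```python
import numpy as np

def guided_noise(eps_c1, eps_c2, w):
    """Two-condition classifier-free guidance.

    eps_c1: noise predicted under condition set c1 (e.g. masked video only)
    eps_c2: noise predicted under condition set c2 (e.g. caption + masked video)
    w:      guidance scale
    """
    return eps_c1 + w * (eps_c2 - eps_c1)

# Toy predictions standing in for real denoiser outputs.
eps_c1 = np.array([0.0, 1.0])
eps_c2 = np.array([1.0, 3.0])

no_guidance = guided_noise(eps_c1, eps_c2, 0.0)  # equals eps_c1
plain_c2 = guided_noise(eps_c1, eps_c2, 1.0)     # equals eps_c2
amplified = guided_noise(eps_c1, eps_c2, 2.0)    # pushed past eps_c2
print(no_guidance, plain_c2, amplified)
```

With $w = 0$ the result reduces to the $c_1$ prediction, with $w = 1$ to the $c_2$ prediction, and with $w > 1$ the contribution of the extra conditions in $c_2$ is amplified.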
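Experiments:
a. Data: two datasets are used, WebVid10M and LAION-400M.
b. Evaluation metrics:
ⅰ. Frame consistency: the CLIP cosine similarity between adjacent frames.
ⅱ. Motion control: the per-pixel Euclidean distance between the predicted optical flow and the ground truth.
c. The paper first shows the model's ability to control video generation by composing conditions, including image-to-video (+ text), video inpainting, and sketch-to-video generation, with corresponding visualizations.
d. It then demonstrates the model's motion control ability.
e. Ablation study: verifies the effectiveness of the STC-encoder.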
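The two metrics in item b can be sketched as follows. This is a hedged illustration rather than the paper's evaluation code: the frame embeddings are assumed to be precomputed by a CLIP image encoder, the flows are assumed to be arrays whose last axis holds the (dx, dy) components, and averaging over frame pairs and over pixels is an assumption about how the per-pair and per-pixel values are aggregated.

```python
import numpy as np

def frame_consistency(frame_embeddings):
    """Mean cosine similarity between embeddings of adjacent frames.

    frame_embeddings: array of shape (num_frames, dim), assumed to come
    from a CLIP image encoder applied to each frame.
    """
    e = np.asarray(frame_embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # unit-normalize rows
    sims = np.sum(e[:-1] * e[1:], axis=1)             # frame i vs. frame i+1
    return float(np.mean(sims))

def end_point_error(flow_pred, flow_gt):
    """Mean per-pixel Euclidean distance between two flow fields.

    Both inputs have shape (..., 2), with (dx, dy) on the last axis.
    """
    diff = np.asarray(flow_pred, dtype=np.float64) - np.asarray(flow_gt, dtype=np.float64)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))

# Toy data: three 2-D "embeddings" and a 2x2 flow field.
embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
consistency = frame_consistency(embeddings)

flow_pred = np.zeros((2, 2, 2))
flow_gt = np.tile(np.array([3.0, 4.0]), (2, 2, 1))
epe = end_point_error(flow_pred, flow_gt)
print(consistency, epe)
```

Higher frame consistency indicates smoother appearance across frames, while a lower end-point error indicates that the generated motion follows the target motion more closely.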

Original post: https://blog.csdn.net/weixin_44994838/article/details/138028000
