[CLIP] Learning Transferable Visual Models From Natural Language Supervision

        Pre-training a network to match images with their text captions on 400 million (image, text) pairs learns SOTA image representations; the pre-trained model can then be applied zero-shot to downstream tasks.

1. Network Architecture

        1)simplified version of ConVIRT

        2)linear projection to map from each encoder's representation to the multi-modal embedding space

        3)image encoder

                -> ResNet

                        ResNet-D improvements and antialiased rect-2 blur pooling

                        replace global average pooling with attention pooling: a single layer of "transformer-style" multi-head QKV attention, where the query is conditioned on the global average-pooled representation of the image (see the first sketch below)

                -> Vision Transformer (ViT)

                        add an additional layer normalization to the combined patch and position embeddings before the transformer (see the second sketch below)

                        slightly different initialization scheme
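
        A minimal PyTorch sketch of the attention-pooling replacement described above (the module name, dimensions, and use of nn.MultiheadAttention are illustrative assumptions, not CLIP's exact implementation):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Single "transformer-style" multi-head QKV attention layer whose query
    is the global average-pooled feature map (illustrative sketch; CLIP's
    actual AttentionPool2d also adds positional embeddings)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, positions, dim) -- a flattened spatial feature map
        query = x.mean(dim=1, keepdim=True)   # query conditioned on the global average pool
        pooled, _ = self.attn(query, x, x)    # attend over all spatial positions
        return pooled.squeeze(1)              # (batch, dim) pooled image feature

# Usage: pool a flattened 7x7 ResNet feature map into a single vector.
# pool = AttentionPool(dim=2048)
# out = pool(torch.randn(4, 49, 2048))  # -> (4, 2048)
```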
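
        And a rough sketch of where the ViT's extra layer normalization sits (the conv patch stem and the omission of the class embedding are simplifying assumptions):

```python
import torch
import torch.nn as nn

class ViTInput(nn.Module):
    """Patch + position embeddings followed by the additional pre-transformer
    LayerNorm (simplified; CLIP's ViT also prepends a class embedding)."""
    def __init__(self, dim: int = 768, patch: int = 16, img: int = 224):
        super().__init__()
        n = (img // patch) ** 2
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n, dim))
        self.pre_ln = nn.LayerNorm(dim)  # the extra LayerNorm noted above

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.to_patches(img).flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        return self.pre_ln(x + self.pos_embed)  # normalized before the transformer blocks
```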

        4)text encoder

                -> Transformer

                        architecture modifications as described for GPT-2 (Radford et al., 2019)

                        a 63M-parameter, 12-layer, 512-wide model with 8 attention heads

                        lower-cased byte pair encoding (BPE) representation of the text with a 49152 vocab size

                        the max sequence length was capped at 76

                        the text sequence is bracketed with [SOS] and [EOS] tokens

                        the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text, which is layer-normalized and then linearly projected into the multi-modal embedding space (see the sketch below)
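
        A hedged sketch of that feature-extraction step (the function name and arguments are illustrative; locating [EOS] by comparing token ids is one possible implementation):

```python
import torch
import torch.nn as nn

def text_features(hidden: torch.Tensor,     # (batch, seq_len, width), top transformer layer
                  tokens: torch.Tensor,     # (batch, seq_len) BPE token ids
                  eos_id: int,
                  ln_final: nn.LayerNorm,
                  text_proj: torch.Tensor): # (width, embed_dim) linear projection
    # position of the first [EOS] token in each sequence
    eos_pos = (tokens == eos_id).int().argmax(dim=-1)
    feats = hidden[torch.arange(hidden.size(0)), eos_pos]  # (batch, width)
    # layer-normalize, then project into the multi-modal embedding space
    return ln_final(feats) @ text_proj
```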

        5)scale

                -> image encoder

                        increase the width, depth, and resolution of the model roughly equally (simplified EfficientNet-style compound scaling)

                -> text encoder

                        only scale the width of the model to be proportional to the calculated increase in width of the ResNet, and do not scale the depth at all

                        * CLIP's performance is less sensitive to the capacity of the text encoder

2. Data

        1)400 million (image, text) pairs from Internet

        2)many of the (image, text) pairs are only a single sentence

3. Training

        1)Contrastive Language-Image Pre-training (CLIP)

        2)predict only which text as a whole is paired with which image, not the exact words of that text

        3)Given a batch of N (image, text) pairs, predict which of the N x N possible (image, text) pairings actually occurred; N = 32,768

        4)jointly train an image encoder and text encoder

        5)maximize the cosine similarity of the N real pairs while minimizing the cosine similarity of the N^{2} - N incorrect pairs (see the loss sketch after this list)

        6)trained from scratch, without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights

        7)data augmentation

                a random square crop from resized images is the only augmentation used

        8)learnable temperature parameter \tau (controls the range of the logits in the softmax; optimized directly during training as a log-parameterized multiplicative scalar)
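
        The whole objective fits in a few lines. A PyTorch sketch in the spirit of the paper's numpy-style pseudocode (variable names are mine; the temperature is kept as a learnable log-parameterized scalar, as in 8)):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor,   # (N, d) projected image embeddings
              txt_emb: torch.Tensor,   # (N, d) projected text embeddings
              log_tau: torch.Tensor):  # learnable log-temperature (scalar)
    # L2-normalize so dot products are cosine similarities
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # N x N pairwise similarity logits, scaled by the learned temperature
    logits = img_emb @ txt_emb.t() * log_tau.exp()
    # the i-th image matches the i-th text: targets are the diagonal
    labels = torch.arange(logits.size(0), device=logits.device)
    # symmetric cross-entropy over rows (image->text) and columns (text->image)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```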

4. Advantages

        No softmax classifier has to be trained to predict results, so the pre-trained model can be applied flexibly to zero-shot tasks (see the sketch below).
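
        For example, zero-shot classification reduces to nearest-neighbor retrieval in the shared embedding space (a minimal sketch; the class prompts, e.g. "a photo of a {label}", are assumed to be pre-encoded by a trained CLIP text encoder):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb: torch.Tensor,        # (B, d) encoded test images
                       class_text_emb: torch.Tensor):  # (C, d) encoded class prompts
    # cosine similarity between each image and every class description
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    sims = image_emb @ class_text_emb.t()  # (B, C)
    return sims.argmax(dim=-1)             # best-matching class index per image
```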
