CLIP: Represent your images with text
Introduction
CLIP (Contrastive Language-Image Pre-Training) is a multi-modal neural network built by OpenAI that has zero-shot capabilities. A multi-modal neural network is one that works with more than one modality, that is, more than one kind of data; in CLIP's case, images and the text that describes them.
Typical vision models today are labor-intensive: they require large labelled datasets and training a lot of parameters. A further problem is that a model trained this way is tied to one specific dataset and will usually perform poorly on others, so you end up needing a separate model for every dataset. That approach doesn't fit real-world problems well, which is why many researchers have been working on zero-shot learning.
Zero-shot learning means generalizing a model so that it performs well on many different kinds of labels, including labels it has never seen during training.
Dataset
To make CLIP a zero-shot classifier, it was trained on 400 million web-scraped images paired with text descriptions (for comparison, ImageNet consists of only 1.28 million images!). During pre-training, CLIP is asked to predict which of 32,768 randomly sampled text snippets an image was actually paired with; the snippets describe the images, hence the term multi-modal neural network. Class labels were converted into captions if they were not already in that format, for example (a small code sketch of this templating follows the list):
- Dog → A photo of a dog
- Apple → A photo of an apple
- Object → A photo of an {object}
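As a rough illustration, this caption templating can be sketched in a few lines of Python. The class names and the article handling below are just assumptions for the example, not taken from the paper.

```python
# Turn bare class labels into caption-style prompts, the way CLIP expects them.
# The class names and article handling below are illustrative, not from the paper.
class_names = ["dog", "apple", "car"]

def to_caption(name: str) -> str:
    article = "an" if name[0].lower() in "aeiou" else "a"
    return f"A photo of {article} {name}"

captions = [to_caption(name) for name in class_names]
print(captions)  # ['A photo of a dog', 'A photo of an apple', 'A photo of a car']
```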
The task that CLIP performs is labelling images with text. CLIP gives better predictions when the target classes are phrases rather than single words, because the data was scraped from the internet, where images are usually described by full phrases rather than single words. Another famous OpenAI model, DALL-E, does the opposite: it generates images from a textual description.
CLIP Neural Network
CLIP does the following operations:
Image → Image Encoder (IE) → Encoded Image (N representations of N images, where each representation is a vector)
Text → Text Encoder (TE) → Encoded Text (N representations of N text snippets, where each representation is a vector)
We then ask the model: for image 'X', which of the N texts is the right match? This is why it is called a contrastive objective. Contrastive learning is based on the intuition that you can contrast, i.e., differentiate between, similar and dissimilar things. We train the model so that each image ends up closest to its own text label and not to any other text.
The model trains by minimizing the cosine distance between the embeddings of the N correct image-text pairs while maximizing the cosine distance for the N² − N incorrect pairings. (Cosine distance and cosine similarity move in opposite directions: as the distance between two vectors decreases, their similarity increases.) This is how CLIP was trained.
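The CLIP paper summarizes this objective with a short piece of pseudocode; here is a rough PyTorch sketch of the same idea. The projection and temperature handling is simplified and the function assumes you already have the encoder outputs, so treat it as an illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matching image-text pairs.

    image_features: [N, d] output of the image encoder (after projection)
    text_features:  [N, d] output of the text encoder (after projection)
    temperature:    scaling factor; in CLIP it is a learned parameter
    """
    # L2-normalize so that dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] similarity matrix: entry (i, j) compares image i with text j
    logits = image_features @ text_features.t() / temperature

    # The correct pairing for row i is column i (the diagonal of the matrix)
    labels = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy over rows (image -> text) and over columns (text -> image)
    loss_images = F.cross_entropy(logits, labels)
    loss_texts = F.cross_entropy(logits.t(), labels)
    return (loss_images + loss_texts) / 2
```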
The text encoder is a Transformer, while the image encoder is either a ResNet-50 variant or a Vision Transformer; the Vision Transformer gives roughly a 3x gain in compute efficiency over CLIP's ResNet. The best-performing CLIP model was trained on 256 GPUs for about two weeks.
Zero-Shot Classification
Zero-shot CLIP outperforms a linear probe on ResNet-50 features across 16 datasets, including ImageNet. If you think about it, zero-shot CLIP is a beautiful thing: it has been trained on so much web-scraped data that it outperforms many state-of-the-art convolutional neural networks, and keep in mind that it may never have seen the data it is being tested on.
Zero-shot CLIP's performance varies very little across different versions of ImageNet, whereas a classifier trained specifically on ImageNet degrades a lot when evaluated on those other versions. This shows how flexible CLIP is: it can serve as a general solution across many datasets.
To use zero-shot CLIP, you simply pass the target labels into the model as captions, and it gives you the predictions. It works well on most large image datasets, but not always; for some datasets you would still prefer a classic convolutional neural network or a transformer over CLIP.
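Here is a minimal sketch of zero-shot classification with the open-source clip package released by OpenAI, assuming PyTorch and that package are installed; the label list and the "cat.jpg" path are placeholders for your own data.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Target labels written out as captions, the way CLIP saw its training data
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = clip.tokenize(labels).to(device)

# "cat.jpg" is only a placeholder path for this example
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # Similarity scores between the image and every caption, turned into probabilities
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```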
But you can improve on the zero-shot score with a linear probe. In linear-probe CLIP, you keep the image encoder frozen and train a new classification layer on top of its features, so the classifier adjusts to the data you actually want to predict on.
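A minimal sketch of the linear-probe idea, again assuming the clip package is installed: embed your images with the frozen image encoder and fit a simple logistic-regression classifier on top. The helper names and the scikit-learn choice here are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_images(images):
    """Embed a list of PIL images with the frozen CLIP image encoder."""
    features = []
    with torch.no_grad():
        for img in images:
            x = preprocess(img).unsqueeze(0).to(device)
            features.append(model.encode_image(x).cpu().numpy())
    return np.concatenate(features)

def linear_probe(train_images, train_labels, test_images):
    """Fit a logistic-regression probe on frozen CLIP features and predict."""
    train_features = encode_images(train_images)
    test_features = encode_images(test_images)
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(train_features, train_labels)
    return classifier.predict(test_features)
```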
Existence of Multi-Modal Neurons
One of the most interesting findings about CLIP is the existence of a “Spider-Man” neuron. Like the biological “Halle Berry” neuron, there is a neuron in CLIP that responds to an image of a spider, an image of the text “spider”, and the famous superhero Spider-Man.
This finding suggests an important connection between artificial neural networks and our biological nervous system. Researchers further discovered that the highest layers of CLIP organize images as a loose semantic collection of ideas, which provides a simple explanation for the model's versatility.
Limitations and Conclusion
As mentioned, CLIP is flexible and general, but it doesn't perform well on certain tasks: examples include counting the number of objects in an image and fine-grained classification of flowers, aircraft variants, and car models. It even underperforms on MNIST, a very basic deep learning task. On the other hand, CLIP is competitive at detecting hateful memes without needing the ground-truth text.
CLIP is a wonderful neural network that never fails to amaze. It is also used in DALL-E, another amazing model. Further research is still required, but I do believe that one day, with more data and more innovation, general AI will exist.
References
[1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, Learning Transferable Visual Models From Natural Language Supervision (2021), CLIP Research Paper
[2] Gabriel Goh, Chelsea Voss, Daniela Amodei, Shan Carter, Michael Petrov, Justin Jay Wang, Nick Cammarata, Chris Olah, Multimodal Neurons in Artificial Neural Networks (2021), CLIP Multi-Modal Neurons Blog