Learn AI Series (#75) - Multimodal Models - Text Meets Vision

What will I learn

You will learn what multimodal means and why combining text and vision unlocks capabilities neither modality achieves alone;
CLIP: connecting images and text in a shared embedding space through contrastive learning;
vision-language models: LLaVA, BLIP-2, and how they give language models "eyes";
visual question answering and image captioning with working code;
grounded generation: models that point at things, not just talk about them;
building a practical multimodal image analysis tool from open-source components.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#75) - Multimodal Models - Text Meets Vision

Solutions to Episode #74 Exercises

Exercise 1: Model comparison pipeline.

import time
import random


class ModelComparer:
    """Compare models across prompts for quality, speed, and length."""

    def __init__(self, models, prompts):
        self.models = models
        self.prompts = prompts

    def run(self):
        results = {}
        for name, fn in self.models.items():
            timings = []
            lengths = []
            quality_scores = []

            for prompt in self.prompts:
                start = time.time()
                response = fn(prompt)
                elapsed = time.time() - start

                timings.append(elapsed)
                lengths.append(len(response.split()))

                # Quality: keywords + length + ends with punctuation
                keywords = set(
                    w.lower() for w in prompt.split()
                    if len(w) > 3
                )
                words = response.lower().split()
                kw_hits = sum(1 for w in words if w in keywords)
                kw_score = min(kw_hits / max(len(keywords), 1), 1.0)

                length_ok = 1.0 if 20 < len(words) < 200 else 0.3
                ends_ok = 1.0 if response.rstrip()[-1:] in ".!?" else 0.5
                q = (kw_score * 0.5 + length_ok * 0.3
                     + ends_ok * 0.2)
                quality_scores.append(q)

            avg_latency = sum(timings) / len(timings)
            avg_length = sum(lengths) / len(lengths)
            avg_quality = sum(quality_scores) / len(quality_scores)

            # Normalize for weighted score
            results[name] = {
                "avg_latency": avg_latency,
                "avg_length": avg_length,
                "avg_quality": avg_quality,
            }

        self._report(results)
        return results

    def _report(self, results):
        print(f"{'Model':8} {'Latency':>10} "
              f"{'Avg Len':>8} {'Score':>8}")
        print("=" * 58)

        # Compute weighted scores
        max_lat = max(r["avg_latency"] for r in results.values())
        scored = {}
        for name, r in results.items():
            lat_norm = 1.0 - (r["avg_latency"] / max_lat)
            len_norm = (1.0 if 30 < r["avg_length"] < 150
                        else 0.3)
            weighted = (0.4 * r["avg_quality"]
                        + 0.3 * lat_norm
                        + 0.3 * len_norm)
            scored[name] = weighted

            print(f"{name:8.3f} "
                  f"{r['avg_latency']:>9.4f}s "
                  f"{r['avg_length']:>8.1f} {weighted:>8.3f}")

        winner = max(scored, key=scored.get)
        print(f"\nWinner: {winner} ({scored[winner]:.3f})")


# Simulated models
def fast_dumb(prompt):
    time.sleep(0.001)
    return "Sure, the answer is yes. Done."


def slow_smart(prompt):
    time.sleep(0.05)
    keywords = [w for w in prompt.split() if len(w) > 3]
    return ("After careful analysis of " +
            ", ".join(keywords[:4]) +
            ", the conclusion involves multiple factors "
            "including theoretical foundations and "
            "practical considerations that affect the "
            "outcome significantly. The key insight is "
            "that each component contributes to the "
            "overall understanding of the topic.")


def balanced(prompt):
    time.sleep(0.01)
    keywords = [w for w in prompt.split() if len(w) > 3]
    return ("Regarding " + " ".join(keywords[:2]) +
            ": the main point is that these concepts "
            "work together in practice.")


models = {
    "fast-dumb": fast_dumb,
    "slow-smart": slow_smart,
    "balanced": balanced,
}

prompts = [
    "Explain how gradient descent optimizes a loss function",
    "What is the difference between supervised and unsupervised",
    "How does a convolutional neural network process images",
    "Describe the attention mechanism in transformers",
    "What is overfitting and how do you prevent it",
    "Explain the bias-variance tradeoff in machine learning",
    "How does backpropagation compute gradients efficiently",
    "What is transfer learning and when should you use it",
    "Describe how word embeddings capture semantic meaning",
    "What is the vanishing gradient problem in deep networks",
]

random.seed(42)
comparer = ModelComparer(models, prompts)
comparer.run()

The weighted scoring formula (40% quality, 30% latency, 30% length) forces you to think about what actually matters for your use case. In a production chatbot, you might weigh latency much higher. For an offline analysis tool, quality dominates. The point is that no single metric captures "best model" -- the weights encode your priorities.

Exercise 2: Dataset processing pipeline.

import time
import random


class DataPipeline:
    """Pure-Python dataset processing pipeline with stats."""

    def __init__(self, data):
        self.data = list(data)
        self.stats = []

    def map(self, fn, batched=False):
        start = time.time()
        rows_in = len(self.data)

        if batched:
            self.data = fn(self.data)
        else:
            self.data = [fn(example) for example in self.data]

        elapsed = time.time() - start
        self.stats.append({
            "op": "map", "rows_in": rows_in,
            "rows_out": len(self.data),
            "time": elapsed,
        })
        return self

    def filter(self, fn):
        start = time.time()
        rows_in = len(self.data)
        self.data = [ex for ex in self.data if fn(ex)]
        elapsed = time.time() - start

        self.stats.append({
            "op": "filter", "rows_in": rows_in,
            "rows_out": len(self.data),
            "time": elapsed,
        })
        return self

    def train_test_split(self, test_size=0.2, seed=42):
        start = time.time()
        rows_in = len(self.data)

        rng = random.Random(seed)
        shuffled = list(self.data)
        rng.shuffle(shuffled)

        split_idx = int(len(shuffled) * (1 - test_size))
        train = shuffled[:split_idx]
        test = shuffled[split_idx:]

        elapsed = time.time() - start
        self.stats.append({
            "op": "train_test_split",
            "rows_in": rows_in,
            "rows_out": len(train),
            "time": elapsed,
        })

        return train, test

    def select(self, indices):
        start = time.time()
        rows_in = len(self.data)
        self.data = [self.data[i] for i in indices]
        elapsed = time.time() - start

        self.stats.append({
            "op": "select", "rows_in": rows_in,
            "rows_out": len(self.data),
            "time": elapsed,
        })
        return self

    def print_stats(self):
        print(f"\n{'Operation':6} {'Out':>6} "
              f"{'Time':>10}")
        print("-" * 46)
        for s in self.stats:
            print(f"{s['op']:6} "
                  f"{s['rows_out']:>6} "
                  f"{s['time']:>9.5f}s")


# Generate 100 sample texts
topics = ["neural networks", "gradient descent",
          "transformers", "embeddings",
          "backpropagation"]
data = []
for i in range(100):
    topic = topics[i % len(topics)]
    n_words = random.randint(3, 15)
    filler = " ".join(
        random.choice(["the", "a", "deep", "model",
                        "learns", "from", "data",
                        "using", "weights", "loss"])
        for _ in range(n_words)
    )
    data.append({
        "id": i,
        "text": f"Sample text number {i} about {topic} {filler}",
    })

pipe = DataPipeline(data)

# Add word counts
pipe.map(lambda ex: {
    **ex,
    "word_count": len(ex["text"].split())
})
pipe.print_stats()

# Filter texts over 5 words
pipe.filter(lambda ex: ex["word_count"] > 5)
pipe.print_stats()

# Tokenize (split on spaces)
pipe.map(lambda ex: {
    **ex,
    "tokens": ex["text"].split()
})

# Split
train, test = pipe.train_test_split(test_size=0.2, seed=42)
pipe.print_stats()

print(f"\nFinal: {len(train)} train, {len(test)} test")
print(f"Sample train entry keys: {list(train[0].keys())}")

The chaining pattern (.map().filter().train_test_split()) mirrors how the real datasets library works. Each operation returns self, enabling method chaining. The stats tracking at each step is something you should always do in real pipelines -- knowing that your filter dropped 40% of the data tells you something important about your data quality.

Exercise 3: Hub search simulator.

class ModelHub:
    """Simulated Hugging Face Hub with search and filtering."""

    def __init__(self):
        self.models = []

    def add(self, model):
        self.models.append(model)

    def search(self, query):
        q = query.lower()
        results = []
        for m in self.models:
            if (q in m["id"].lower()
                    or any(q in t.lower() for t in m["tags"])):
                results.append(m)
        return ModelHub._wrap(results)

    def filter(self, task=None, library=None, license=None,
               min_downloads=None, max_parameters=None):
        results = list(self.models)
        if task:
            results = [m for m in results if m["task"] == task]
        if library:
            results = [m for m in results
                       if m["library"] == library]
        if license:
            results = [m for m in results
                       if m["license"] == license]
        if min_downloads is not None:
            results = [m for m in results
                       if m["downloads"] >= min_downloads]
        if max_parameters is not None:
            results = [m for m in results
                       if m["parameters"] <= max_parameters]
        return ModelHub._wrap(results)

    def sort(self, field, descending=True):
        self.models.sort(
            key=lambda m: m.get(field, 0),
            reverse=descending
        )
        return self

    def model_card(self, model_id):
        for m in self.models:
            if m["id"] == model_id:
                lines = [
                    f"# {m['id']}",
                    f"Task: {m['task']}",
                    f"Library: {m['library']}",
                    f"License: {m['license']}",
                    f"Parameters: {m['parameters']:,}",
                    f"Downloads: {m['downloads']:,}",
                    f"Likes: {m['likes']:,}",
                    f"Tags: {', '.join(m['tags'])}",
                ]
                return "\n".join(lines)
        return f"Model '{model_id}' not found."

    @staticmethod
    def _wrap(results):
        hub = ModelHub()
        hub.models = results
        return hub

    def __iter__(self):
        return iter(self.models)

    def __len__(self):
        return len(self.models)


hub = ModelHub()
entries = [
    {"id": "meta-llama/Llama-3.1-8B", "task": "text-generation",
     "library": "transformers", "downloads": 9500000,
     "likes": 4200, "license": "llama3.1",
     "parameters": 8000000000,
     "tags": ["llama", "causal-lm", "english"]},
    {"id": "meta-llama/Llama-3.1-70B", "task": "text-generation",
     "library": "transformers", "downloads": 3200000,
     "likes": 3800, "license": "llama3.1",
     "parameters": 70000000000,
     "tags": ["llama", "causal-lm", "english"]},
    {"id": "mistralai/Mistral-7B-v0.3", "task": "text-generation",
     "library": "transformers", "downloads": 7100000,
     "likes": 3100, "license": "apache-2.0",
     "parameters": 7000000000,
     "tags": ["mistral", "causal-lm"]},
    {"id": "google/gemma-2-9b", "task": "text-generation",
     "library": "transformers", "downloads": 4500000,
     "likes": 2500, "license": "gemma",
     "parameters": 9000000000,
     "tags": ["gemma", "causal-lm"]},
    {"id": "microsoft/phi-3-mini", "task": "text-generation",
     "library": "transformers", "downloads": 2800000,
     "likes": 1900, "license": "mit",
     "parameters": 3800000000,
     "tags": ["phi", "causal-lm", "small"]},
    {"id": "distilbert-base-uncased", "task": "text-classification",
     "library": "transformers", "downloads": 10000000,
     "likes": 5500, "license": "apache-2.0",
     "parameters": 66000000,
     "tags": ["distilbert", "classification", "english"]},
    {"id": "roberta-base", "task": "text-classification",
     "library": "transformers", "downloads": 8200000,
     "likes": 4100, "license": "mit",
     "parameters": 125000000,
     "tags": ["roberta", "classification"]},
    {"id": "bert-base-uncased", "task": "text-classification",
     "library": "transformers", "downloads": 9800000,
     "likes": 6200, "license": "apache-2.0",
     "parameters": 110000000,
     "tags": ["bert", "classification", "english"]},
    {"id": "cardiffnlp/twitter-roberta", "task": "text-classification",
     "library": "transformers", "downloads": 1500000,
     "likes": 800, "license": "mit",
     "parameters": 125000000,
     "tags": ["roberta", "sentiment", "twitter"]},
    {"id": "nlptown/bert-base-sentiment",
     "task": "text-classification",
     "library": "transformers", "downloads": 2100000,
     "likes": 950, "license": "mit",
     "parameters": 110000000,
     "tags": ["bert", "sentiment", "multilingual"]},
    {"id": "deepset/roberta-base-squad2",
     "task": "question-answering",
     "library": "transformers", "downloads": 6500000,
     "likes": 3200, "license": "cc-by-4.0",
     "parameters": 125000000,
     "tags": ["roberta", "qa", "squad"]},
    {"id": "distilbert-base-cased-distilled-squad",
     "task": "question-answering",
     "library": "transformers", "downloads": 7800000,
     "likes": 3900, "license": "apache-2.0",
     "parameters": 66000000,
     "tags": ["distilbert", "qa", "squad"]},
    {"id": "deepset/tinyroberta-squad2",
     "task": "question-answering",
     "library": "transformers", "downloads": 900000,
     "likes": 450, "license": "cc-by-4.0",
     "parameters": 82000000,
     "tags": ["roberta", "qa", "tiny"]},
    {"id": "Intel/dynamic-tinybert",
     "task": "question-answering",
     "library": "transformers", "downloads": 400000,
     "likes": 210, "license": "apache-2.0",
     "parameters": 67000000,
     "tags": ["tinybert", "qa", "intel"]},
    {"id": "aari1995/German-QA",
     "task": "question-answering",
     "library": "transformers", "downloads": 150000,
     "likes": 120, "license": "mit",
     "parameters": 110000000,
     "tags": ["bert", "qa", "german"]},
    {"id": "sentence-transformers/all-MiniLM-L6-v2",
     "task": "sentence-similarity",
     "library": "sentence-transformers",
     "downloads": 8900000, "likes": 5800,
     "license": "apache-2.0", "parameters": 22700000,
     "tags": ["sentence-transformers", "embedding"]},
    {"id": "sentence-transformers/all-mpnet-base-v2",
     "task": "sentence-similarity",
     "library": "sentence-transformers",
     "downloads": 5400000, "likes": 3200,
     "license": "apache-2.0", "parameters": 109000000,
     "tags": ["sentence-transformers", "embedding"]},
    {"id": "BAAI/bge-large-en-v1.5",
     "task": "sentence-similarity",
     "library": "sentence-transformers",
     "downloads": 4200000, "likes": 2900,
     "license": "mit", "parameters": 335000000,
     "tags": ["bge", "embedding", "english"]},
    {"id": "en_core_web_sm", "task": "text-classification",
     "library": "spacy", "downloads": 3500000,
     "likes": 1200, "license": "mit",
     "parameters": 12000000,
     "tags": ["spacy", "ner", "english"]},
    {"id": "en_core_web_trf", "task": "text-classification",
     "library": "spacy", "downloads": 1800000,
     "likes": 900, "license": "mit",
     "parameters": 460000000,
     "tags": ["spacy", "ner", "transformer"]},
]

for entry in entries:
    hub.add(entry)

# Search for "llama"
print("Search: 'llama'")
for m in hub.search("llama"):
    print(f"  {m['id']:40s} {m['downloads']:>10,} downloads")

# Filter by task + minimum downloads
print("\nFilter: text-generation, 3M+ downloads")
filtered = hub.filter(task="text-generation",
                      min_downloads=3000000)
for m in filtered:
    print(f"  {m['id']:40s} {m['downloads']:>10,}")

# Sort by likes, print top 5
hub.sort("likes", descending=True)
print("\nTop 5 by likes:")
for m in list(hub)[:5]:
    print(f"  {m['id']:40s} {m['likes']:>6,} likes")

# Model card for top result
print(f"\n{hub.model_card('bert-base-uncased')}")

The search + filter + sort chain is exactly how you navigate the real Hub. In practice, you almost never browse models randomly -- you filter by task first (what do I need this model to do?), then by library compatibility (does it work with my framework?), then sort by downloads or likes to find the community-vetted options. The model card method simulates what you should always read before downloading anything: license, size, training data, and intended use.

On to today's episode

Here we go! For the past 20+ episodes we've been building up two separate tracks of understanding. On one side: text. NLP fundamentals (#30), word embeddings (#31), transformers (#52-53), GPT (#58), BERT (#59), fine-tuning (#69), the entire Hugging Face ecosystem (#74). On the other side: vision. CNNs from theory to practice (#45-46), applications like detection and segmentation (#47), and Vision Transformers (#54). These have been running in parallel, each powerful in its own domain.

But the most exciting work happening in AI right now is at the intersection. Models that can look at an image and reason about it in natural language. Models that understand both a photograph and its caption as representations of the same underlying concept. That is what multimodal AI means, and it's where we're heading today ;-)

When you upload a photo to a vision-language model and ask "what's in this image?" -- something genuinely remarkable is happening under the hood. The model isn't doing image classificaton and then stitching text on top. It's processing both modalities simultaneously in a shared representation space, attending to visual features while generating language. Every component we've studied across 74 episodes contributes to making that work.

The fundamental challenge: bridging modalities

Text and images are fundamentally different data types. Text is a sequence of discrete tokens (we covered this in depth in episode #72). An image is a 2D grid of continuous pixel values. A typical sentence might have 20 tokens. A 224x224 image has 150,528 pixel values (three color channels). How do you build a model that treats the sentence "a golden retriever playing in snow" and a photograph of exactly that scene as representations of the same concept?

The answer -- and this is the core insight of the entire episode -- is shared embedding spaces. We introduced embeddings back in episode #63, where we mapped text into dense vector representations. The breakthrough in multimodal AI was extending this idea: learn an embedding space where images and text that describe similar things end up close together, regardless of which modality they came from.

Think of it as building a universal coordinate system. A photo of a cat and the words "a photo of a cat" should map to nearly the same point in this coordinate system. A photo of a dog and the words "a photo of a cat" should be far apart. That's the fundamental goal, and CLIP was the model that cracked it wide open.

CLIP: the bridge between text and images

CLIP (Contrastive Language-Image Pre-training, Radford et al., 2021) was a landmark. The architecture is straightforward -- almost deceptively simple:

Image  --> [Vision Encoder] --> image_embedding (512-dim)
                                       |
                              (should be similar for matching pairs)
                                       |
Text   --> [Text Encoder]   --> text_embedding  (512-dim)

Two separate encoders. A Vision Transformer (episode #54) processes the image. A text transformer processes the text. Both output embeddings in the same 512-dimensional space. Training uses contrastive learning: given a batch of image-text pairs, maximize the cosine similarity between matching pairs and minimize it for non-matching pairs.

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)

# Encode an image and multiple text descriptions
image = Image.open("dog_in_snow.jpg")
texts = [
    "a dog playing in snow",
    "a cat sleeping on a couch",
    "a landscape painting",
    "a golden retriever in a winter scene",
]

inputs = processor(
    text=texts, images=image,
    return_tensors="pt", padding=True
)

with torch.no_grad():
    outputs = model(**inputs)
    image_embeds = outputs.image_embeds   # (1, 512)
    text_embeds = outputs.text_embeds     # (4, 512)

# Cosine similarity between image and each text
similarities = torch.nn.functional.cosine_similarity(
    image_embeds.unsqueeze(1),  # (1, 1, 512)
    text_embeds.unsqueeze(0),   # (1, 4, 512)
    dim=-1
)
for text, sim in zip(texts, similarities[0]):
    print(f"  {sim:.3f}  {text}")
# The dog descriptions will score highest

CLIP was trained on 400 million image-text pairs scraped from the internet. The sheer scale of training data is what makes it work -- the model learns rich associations between visual concepts and their natural language descriptions across an enormous variety of contexts.

What makes CLIP truly remarkable (and what got everyone excited in 2021) is zero-shot transfer. Without any task-specific training, you can use CLIP for image classification by comparing an image's embedding to text embeddings of class descriptions:

def zero_shot_classify(image, class_names, model, processor):
    """Classify an image without any training on these classes."""
    # Wrap class names in a prompt template
    texts = [f"a photo of a {name}" for name in class_names]
    inputs = processor(
        text=texts, images=image,
        return_tensors="pt", padding=True
    )

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits_per_image  # (1, num_classes)
    probs = logits.softmax(dim=1)

    results = [
        (name, prob.item())
        for name, prob in zip(class_names, probs[0])
    ]
    return sorted(results, key=lambda x: x[1], reverse=True)


classes = ["cat", "dog", "bird", "car", "tree", "building"]
results = zero_shot_classify(image, classes, model, processor)
for name, prob in results:
    print(f"  {name}: {prob:.2%}")

No training data for these specific classes. No fine-tuning. Zero. CLIP's shared embedding space already understands that "a photo of a dog" should be semantically close to images of dogs, because it learned that relationship from 400 million examples of people describing images on the internet. This is a fundamentally different approach from traditional computer vision classifiers (like the CNNs we built in episodes #45-46), which require labeled training data for every class they need to recognize.

Why is this a big deal? Because traditional classifiers are limited to the classes they were trained on. A ResNet trained on ImageNet can classify 1,000 categories -- but if you need to distinguish between 5 types of industrial defects, you need to collect and label training data for those specific categories. CLIP can handle novel categories at inference time just by describing them in natural language. Having said that, CLIP's zero-shot accuracy on specialized domains (medical imaging, satellite photos, etc.) is typically lower than fine-tuned models. It's a generalist, not a specialist.

Vision-language models: giving LLMs eyes

CLIP connects images and text in a shared embedding space, which is powerful for search and classification. But CLIP doesn't generate text. It produces similarity scores. The next evolution: connect a vision encoder to a full language model, creating a system that can look at an image and have a conversation about it.

The general architecture of modern vision-language models (VLMs):

Image --> [Vision Encoder] --> image tokens --> [Projection] --> [LLM] --> text
                                                                  ^
                                                             text prompt

The vision encoder (typically a ViT, from episode #54) converts the image into a sequence of patch-level representations -- essentially "image tokens." A projection layer maps these into the language model's embedding space. Then the LLM processes both the image tokens and the text prompt together, generating a text response through the same autoregressive mechanism we studied in episode #58.

LLaVA (Large Language and Vision Assistant, Liu et al., 2023) demonstrated that this architecture works surprisingly well with a simple design:

Take a pretrained CLIP vision encoder (already understands images)
Take a pretrained language model like Llama or Vicuna (already understands language)
Connect them with a simple linear projection layer
Fine-tune on instruction-following data that includes images

from transformers import (
    LlavaForConditionalGeneration, AutoProcessor
)
from PIL import Image
import torch

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf"
)

image = Image.open("chart.png")
prompt = "\nWhat trends do you see in this chart?"

inputs = processor(
    text=prompt, images=image,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=300, temperature=0.7
    )

response = processor.decode(
    output[0], skip_special_tokens=True
)
print(response)

The key insight: you don't train the entire system from scratch. The vision encoder already understands images (from CLIP pre-training). The language model already understands language (from text pre-training). The projection layer just needs to learn how to map between these two well-trained representation spaces. This is why LLaVA works with a relatively small amount of multimodal fine-tuning data -- roughly 600K image-text instruction pairs, which is tiny compared to the billions of examples the individual components were trained on.

BLIP-2 (Li et al., 2023) takes a slightly different approach by inserting a "Q-Former" (Querying Transformer) between the vision encoder and the language model. The Q-Former learns a fixed set of query embeddings that extract the most relevant visual information from the image, in stead of passing all image tokens to the LLM. This is more compute-efficient -- fewer tokens flowing into the language model means faster inference:

from transformers import (
    Blip2ForConditionalGeneration, Blip2Processor
)

blip_model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
blip_processor = Blip2Processor.from_pretrained(
    "Salesforce/blip2-opt-2.7b"
)

image = Image.open("park_scene.jpg")

# Unconditional captioning
inputs = blip_processor(
    images=image, return_tensors="pt"
).to(blip_model.device)
output = blip_model.generate(**inputs, max_new_tokens=50)
caption = blip_processor.decode(
    output[0], skip_special_tokens=True
)
print(f"Caption: {caption}")

# Conditional captioning (guide the description)
inputs = blip_processor(
    images=image, text="This image shows",
    return_tensors="pt"
).to(blip_model.device)
output = blip_model.generate(**inputs, max_new_tokens=50)
guided = blip_processor.decode(
    output[0], skip_special_tokens=True
)
print(f"Guided: {guided}")

Both architectures (LLaVA's direct projection and BLIP-2's Q-Former) work well in practice. The field is iterating rapidly on architectural details, but the core idea remains the same: pretrained vision + pretrained language + a learned connection layer.

Visual question answering

VQA is the classic multimodal benchmark: given an image and a question, produce a correct answer. "What color is the car?" "How many people are in the photo?" "Is there a dog in this image?" Early VQA systems (before 2020) used separate vision and language encoders with custom fusion layers -- quit some architectural complexity for what turned out to be mediocre results.

Modern VQA is handled by vision-language models directly. No special architecture needed -- just ask:

def visual_qa(image_path, question, model, processor):
    """Answer a question about an image."""
    image = Image.open(image_path)
    prompt = f"\nAnswer concisely: {question}"

    inputs = processor(
        text=prompt, images=image, return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs, max_new_tokens=100
        )

    full = processor.decode(
        output[0], skip_special_tokens=True
    )
    # Extract the answer after the prompt
    answer = full.split(question)[-1].strip()
    return answer


# Examples
print(visual_qa(
    "office.jpg",
    "How many monitors are on the desk?",
    model, processor
))
print(visual_qa(
    "recipe.jpg",
    "What ingredients can you identify?",
    model, processor
))
print(visual_qa(
    "error_screenshot.png",
    "What error message is shown?",
    model, processor
))

That last example -- using a VLM to read error messages from screenshots -- is particularly practical. Think about it: automated bug reporting, accessibility tools for visually impaired users, customer support where the user sends a screenshot of their problem. These are real applications that combine OCR-level text recognition with semantic understanding of what the error means.

Image captioning and dense description

Image captioning goes the other direction: given an image, generate a natural language description. This might sound simple but it requires understanding objects, their spatial relationships, actions being performed, and sometimes context that isn't directly visible (like time of day from lighting).

Practical applications are everywhere: automatic alt-text generation for web accessibility (this alone is a massive use case -- the web is full of images with no alt text), image search indexing, content moderation (describe what's in an image to check against policies), photo library organization, and generating training data for other models.

The quality of captions has improved dramatically. Early models (2015-2018) would produce generic descriptions like "a man standing in a room." Modern VLMs produce descriptions like "a man in a blue flannel shirt standing in a well-lit kitchen, holding a wooden cutting board with freshly sliced vegetables." The difference is transformers + scale + better training data. The same story we've seen throughout this entire series ;-)

Grounded generation: pointing at things

Standard VQA tells you "there's a red car on the left." Grounded generation tells you exactly where: it outputs bounding box coordinates along with the text description. This bridges the gap between understanding (knowing what's in the image) and localization (knowing where it is).

from transformers import AutoProcessor, AutoModelForCausalLM

# Florence-2: a strong open-source grounding model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base"
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base"
)

image = Image.open("street_scene.jpg")
prompt = " car, person, traffic light"

inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
output = model.generate(**inputs, max_new_tokens=1024)

result = processor.post_process_generation(
    processor.decode(output[0], skip_special_tokens=True),
    task="",
    image_size=image.size,
)

# result contains bounding boxes for each detected object
for label, boxes in result.items():
    print(f"  {label}: {len(boxes)} found")
    for box in boxes[:2]:
        print(f"    bbox: {box}")

Grounding is essential for applications where you need to ACT on visual information, not just describe it. Robotics (the arm needs coordinates, not just the knowledge that an object exists), visual UI testing ("find and click the submit button in this screenshot"), document processing (locate specific fields in forms and invoices), and autonomous driving (where exactly is that pedestrian relative to the car?).

Florence-2 from Microsoft is worth knowing about because it handles multiple vision tasks (captioning, detection, segmentation, OCR) through a single unified architecture. You just change the task prompt. This is the "foundation model" approach applied to vision -- one model, many tasks. A part from Florence-2, there's also Grounding DINO and SAM (Segment Anything Model) which handle grounded detection and segmentation respectively.

Building a multimodal analysis tool

Let's combine everything into a practical image analysis class that you could actually use in a project:

from transformers import (
    LlavaForConditionalGeneration, AutoProcessor
)
from PIL import Image
import torch


class ImageAnalyzer:
    """Multimodal image analysis using a vision-language model."""

    def __init__(self, model_name="llava-hf/llava-v1.6-mistral-7b-hf"):
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = LlavaForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    def _generate(self, image, prompt, max_tokens=500):
        inputs = self.processor(
            text=prompt, images=image, return_tensors="pt"
        )
        inputs = {
            k: v.to(self.model.device) for k, v in inputs.items()
        }

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=0.3,
            )

        return self.processor.decode(
            output[0], skip_special_tokens=True
        )

    def describe(self, image_path):
        """Generate a detailed description of an image."""
        image = Image.open(image_path)
        return self._generate(
            image,
            "\nDescribe this image in detail."
        )

    def ask(self, image_path, question):
        """Answer a question about an image."""
        image = Image.open(image_path)
        return self._generate(
            image,
            f"\n{question}"
        )

    def compare(self, image_paths, question):
        """Apply the same question to multiple images."""
        results = {}
        for path in image_paths:
            results[path] = self.ask(path, question)
        return results

    def extract_text(self, image_path):
        """Extract visible text from an image (OCR-like)."""
        image = Image.open(image_path)
        return self._generate(
            image,
            "\nList all text visible in this image, "
            "exactly as written."
        )


# Usage examples
analyzer = ImageAnalyzer()

# Describe what you see
print(analyzer.describe("dashboard.png"))

# Ask specific questions
print(analyzer.ask(
    "code_screenshot.png",
    "What programming language is this and what does the code do?"
))

# Extract text from an image
print(analyzer.extract_text("whiteboard_photo.jpg"))

# Compare multiple images
results = analyzer.compare(
    ["chart_v1.png", "chart_v2.png"],
    "What data trends are shown?"
)
for path, answer in results.items():
    print(f"\n{path}:\n  {answer}")

This is a straightforward tool, but stop and think about what's actually happening: a single model architecture understands images, reads text within images, reasons about visual content, and generates coherent natural language responses. Every building block we've covered across 75 episodes is at work here -- CNNs and ViTs for visual feature extraction, transformers for language generation, embeddings for shared representations, attention mechanisms for connecting it all together, and the Hugging Face ecosystem (episode #74) for making it accessible with a few lines of code.

The contrastive learning mechanism

We should look a bit deeper into how CLIP's training actually works, because contrastive learning is the secret sauce behind most multimodal models and it connects directly to the embedding concepts from episode #63.

Given a batch of N image-text pairs, CLIP creates an N x N similarity matrix. The diagonal entries are the matching pairs (image_i with text_i). Everything off-diagonal is a negative pair. Training maximizes the diagonal similarities while minimizing the off-diagonal ones:

import torch
import torch.nn.functional as F


def contrastive_loss(image_features, text_features,
                     temperature=0.07):
    """Compute CLIP-style contrastive loss.

    Both inputs: (batch_size, embedding_dim), L2-normalized.
    """
    # Cosine similarity matrix: (batch, batch)
    logits = (image_features @ text_features.T) / temperature

    # Labels: diagonal entries are the matching pairs
    batch_size = logits.shape[0]
    labels = torch.arange(batch_size)

    # Symmetric loss: image-to-text AND text-to-image
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.T, labels)

    return (loss_i2t + loss_t2i) / 2


# Demo with simulated embeddings
batch_size = 8
embed_dim = 512

# Pretend these come from vision and text encoders
image_emb = F.normalize(
    torch.randn(batch_size, embed_dim), dim=-1
)
text_emb = F.normalize(
    torch.randn(batch_size, embed_dim), dim=-1
)

loss = contrastive_loss(image_emb, text_emb)
print(f"Contrastive loss: {loss:.4f}")
print(f"Random baseline (batch={batch_size}): "
      f"{torch.log(torch.tensor(float(batch_size))):.4f}")

The temperature parameter controls how "sharp" the similarity distribution is. Lower temperature makes the model more confident in its distinctions (similar to what we saw in episode #71 with text generation temperature). The value of 0.07 that CLIP uses is learned during training -- it's actually a trainable parameter, not a fixed hyperparameter.

The symmetric loss is important: we want image_i to find text_i (image-to-text retrieval) AND we want text_i to find image_i (text-to-image retrieval). Both directions matter because a good shared space supports search in either direction.

The reason contrastive learning scales so well is that every non-matching pair in the batch serves as a negative example. With a batch size of 32,768 (which CLIP used), each positive pair has 32,767 negatives to contrast against. That's an enormous amount of training signal from each batch, and it's "free" -- you don't need explicit negative examples, the batch structure provides them.

Limitations and what to watch for

Multimodal models are powerful but they have real limitations that you need to understand before deploying them:

Hallucination: VLMs can confidently describe objects that aren't in the image. If you ask "what color is the hat?" and there's no hat, many models will invent one rather than saying "I don't see a hat." This is the same hallucination problem we discussed in episode #73 (evaluation), but it's harder to catch in the visual domain because verifying visual claims requires looking at the image yourself.

Spatial reasoning: current VLMs struggle with precise spatial relationships. "Is the cup to the left or right of the plate?" often gets wrong answers. The models understand objects better than their spatial arrangement, partly because the ViT patch-based encoding loses some spatial precision.

Text in images: OCR-like tasks (reading text from images) have improved dramatically but still fail on small text, unusual fonts, handwriting, and text at odd angles. Don't trust a VLM for production-critical text extraction without verification.

Bias and safety: multimodal models inherit biases from both their vision and language training data. A model might associate certain visual features with stereotypes learned from internet image-text pairs. This is an active area of research and something you need to evaluate carefully for any deployment.

Samengevat

Multimodal models connect vision and language by learning shared embedding spaces where images and text describing the same concept map to nearby vectors -- the fundamental breakthrough that makes cross-modal understanding possible;
CLIP bridges the gap through contrastive training on 400M image-text pairs, enabling zero-shot image classification and cross-modal search without any task-specific training data;
vision-language models (LLaVA, BLIP-2) connect a pretrained vision encoder to a pretrained LLM through a projection layer -- each component brings its pre-existing knowledge, and only the connection needs to be learned;
visual question answering, image captioning, and grounded generation are the core multimodal tasks, each building on the same vision-language architecture pattern;
contrastive learning is the training mechanism that makes shared embedding spaces work -- every non-matching pair in a batch is a free negative example, which is why it scales so well with batch size;
open-source models (LLaVA, BLIP-2, Florence-2) make multimodal AI accessible on consumer hardware -- you can run a capable vision-language model on a single GPU using the same Hugging Face tools we covered in episode #74.

Exercises

Exercise 1: Build a CLIP-style similarity search engine. Create a class ImageTextSearch that: (a) maintains a registry of image entries (each entry has a path, a description, and a simulated 128-dim embedding vector -- generate random embeddings normalized to unit length), (b) implements .add_image(path, description) that generates and stores the embedding, (c) implements .search_by_text(query, top_k=5) that creates a text embedding (simulated as a normalized random vector seeded by the hash of the query string, so the same query always produces the same embedding) and returns the top-k most similar images by cosine similarity, (d) implements .search_by_image(image_path, top_k=5) that finds the most similar images to a given image (using the stored embedding), (e) implements .find_duplicates(threshold=0.95) that returns all pairs of images whose embeddings are above the similarity threshold. Pre-populate with 20 image entries across 4 categories (animals, landscapes, food, architecture -- 5 each) with descriptive names. Demonstrate text search for "sunset over mountains", image-based search for one of the landscape entries, and duplicate detection. Print formatted results with similarity scores.

Exercise 2: Build a contrastive learning trainer. Create a class ContrastiveTrainer that simulates CLIP's training process: (a) generate a synthetic dataset of 100 "image-text pairs" where each pair shares a category label (10 categories, 10 pairs per category), (b) initialize random embeddings for both modalities (64-dim, normalized), (c) implement compute_loss(image_batch, text_batch, temperature) that computes the symmetric contrastive loss (InfoNCE) as described in the episode, (d) implement a training loop that runs 50 gradient steps using manual gradient computation (compute loss, adjust embeddings to increase matching-pair similarity and decrease non-matching similarity by a learning rate of 0.1), (e) after training, compute and print the alignment score: average cosine similarity of matching pairs vs average cosine similarity of random non-matching pairs. Show how the alignment gap (matching minus non-matching) increases over training. Print the loss curve and alignment scores at steps 0, 10, 25, and 50.

Exercise 3: Build a visual question answering evaluator. Create a class VQAEvaluator that: (a) defines 15 test cases, each with an image_description (text describing what's in the image), a question, a ground_truth answer, and a category (one of: counting, color, spatial, yes-no, reading), (b) implements a simulated VQA model function that extracts answers from the image description using simple keyword matching (e.g., for a counting question, look for number words in the description; for color questions, look for color words), (c) implements evaluate(model_fn) that runs all test cases and computes: exact-match accuracy per category, "relaxed" accuracy (answer is contained in the ground truth or vice versa), and average confidence (1.0 if exact match, 0.5 if relaxed match, 0.0 if wrong), (d) implements error_analysis() that groups failures by category and prints which question types the model struggles with most. Run the evaluation, print the full report with per-category breakdown, and identify the hardest category.

Bedankt en tot de volgende keer!

Hive account@scipio

Learn AI Series (#75) - Multimodal Models - Text Meets Vision

Learn AI Series (#75) - Multimodal Models - Text Meets Vision

What will I learn

Requirements

Difficulty

Curriculum (of the Learn AI Series):

Learn AI Series (#75) - Multimodal Models - Text Meets Vision

Solutions to Episode #74 Exercises

On to today's episode

The fundamental challenge: bridging modalities

CLIP: the bridge between text and images

Vision-language models: giving LLMs eyes

Visual question answering

Image captioning and dense description

Grounded generation: pointing at things

Building a multimodal analysis tool

The contrastive learning mechanism

Limitations and what to watch for

Samengevat

Exercises

Bedankt en tot de volgende keer!

Curriculum (of the `Learn AI Series`):