I Turned PostgreSQL Into a Multimodal Search Engine With pgvector — Here's How

After my last article on pgvector hybrid search, a reader asked a question I couldn’t answer with SQL alone: “Can I search my product catalog by uploading a photo of what I’m looking for?”

I said no. Not because PostgreSQL can’t do it — but because I’d never tried. Text embeddings I understood. Images, audio, video? That’s a different game.

Or so I thought.

Google dropped Gemini Embedding 2 in March 2026 — the first embedding model that maps text, images, video, audio, and even PDFs into a single unified vector space. That means a photo of a sunset and the text “golden hour photography” end up close together in the same vector space. You can search across modalities with one query, one index, one database.

I spent a weekend building it. Here’s what I learned — the architecture, the code, the parts that broke, and when you should (and shouldn’t) do this in production.

The Architecture — One Embedding Space, Three Modalities

The core idea is deceptively simple:

Text query ──┐
Image upload ─┤── Gemini Embedding 2 ──► vector(768) ──► pgvector HNSW index ──► Results
Audio clip  ──┘

Everything goes through the same embedding model. Everything lives in the same PostgreSQL table. One cosine distance query finds matches across text descriptions, image content, and audio transcripts.

No separate vector database. No separate search index. No ETL pipeline syncing data between three systems.

Here’s the table I ended up with:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE media_assets (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    asset_type TEXT NOT NULL CHECK (asset_type IN ('text', 'image', 'audio')),
    file_path TEXT,
    description TEXT,
    metadata JSONB DEFAULT '{}',
    embedding vector(768),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX media_assets_embedding_idx
    ON media_assets
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 128);

CREATE INDEX media_assets_type_idx
    ON media_assets (asset_type);

768 dimensions — not the model’s native 3072. Gemini Embedding 2 uses Matryoshka Representation Learning (MRL), which means you can safely truncate from 3072 → 768 with minimal quality loss. That’s ~75% less storage per vector. For 100K assets, that’s the difference between 9 GB and 2.3 GB for the embedding column alone.

I learned this the hard way — I started at 3072 dimensions and watched my HNSW index consume 18 GB of RAM on a dataset that didn’t need it. Truncation isn’t a compromise here; it’s the right default.

Generating Embeddings — One Model, Three Inputs

The beauty of Gemini Embedding 2 is that the same API endpoint handles all three modalities. Here’s the Python I wrote:

import base64
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

def embed_text(text: str) -> list[float]:
    """Embed plain text."""
    result = client.models.embed_content(
        model="gemini-embedding-2",
        contents=[{"parts": [{"text": text}]}],
        config={"output_dimensionality": 768},
    )
    return result.embeddings[0].values


def embed_image(image_path: str) -> list[float]:
    """Embed an image file."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    result = client.models.embed_content(
        model="gemini-embedding-2",
        contents=[{
            "parts": [{
                "inline_data": {
                    "data": image_b64,
                    "mime_type": "image/jpeg",
                }
            }]
        }],
        config={"output_dimensionality": 768},
    )
    return result.embeddings[0].values


def embed_audio(audio_path: str) -> list[float]:
    """Embed an audio file (no transcription needed)."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()

    result = client.models.embed_content(
        model="gemini-embedding-2",
        contents=[{
            "parts": [{
                "inline_data": {
                    "data": audio_b64,
                    "mime_type": "audio/mp3",
                }
            }]
        }],
        config={"output_dimensionality": 768},
    )
    return result.embeddings[0].values

Three functions, one model, same output shape. The embed_audio call doesn’t transcribe first — it embeds the audio directly. That’s the multimodal part. A recording of ocean waves gets mapped to the same vector neighborhood as a text description of “calm sea” and an image of a beach.

What I got wrong on my first pass: I didn’t normalize the vectors before inserting. Gemini’s MRL truncation produces vectors that aren’t unit-length by default. Without normalization, cosine distance (<=>) gives misleading results. Fix:

import math

def normalize(vector: list[float]) -> list[float]:
    magnitude = math.sqrt(sum(v * v for v in vector))
    if magnitude == 0:
        return vector
    return [v / magnitude for v in vector]

# Always normalize before inserting
embedding = normalize(embed_image("sunset.jpg"))

This cost me an hour of debugging “why is this image matching that unrelated text?” The vectors weren’t normalized, so the distance calculations were measuring magnitude differences, not semantic similarity.

Once everything is embedded and stored, the search is embarrassingly simple:

SELECT
    id, title, asset_type, description,
    1 - (embedding <=> $1::vector) AS similarity
FROM media_assets
WHERE asset_type = ANY($2::text[])  -- filter by type(s)
ORDER BY embedding <=> $1::vector
LIMIT 20;

Query with text → finds matching images and audio. Query with an image → finds matching text descriptions and similar images. Query with audio → finds matching text and images.

Here’s the FastAPI endpoint I built to tie it together:

from fastapi import FastAPI, UploadFile, Form
import psycopg2
import json

app = FastAPI()

@app.post("/search")
async def search(
    query: str = Form(None),
    image: UploadFile = Form(None),
    asset_types: str = Form("text,image,audio"),  # comma-separated
):
    # Generate embedding based on input type
    if image:
        embedding = normalize(embed_image(await image.read()))
    elif query:
        embedding = normalize(embed_text(query))
    else:
        return {"error": "Provide a text query or image"}

    types = asset_types.split(",")

    with psycopg2.connect("postgresql://user:pass@localhost:5432/search_db") as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT id, title, asset_type, description,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM media_assets
                WHERE asset_type = ANY(%s::text[])
                ORDER BY embedding <=> %s::vector
                LIMIT 20
            """, (json.dumps(embedding), types, json.dumps(embedding)))

            results = [
                {"id": r[0], "title": r[1], "type": r[2],
                 "description": r[3], "similarity": round(r[4], 4)}
                for r in cur.fetchall()
            ]
            return {"results": results}

I tested this with 15,000 product images, 5,000 text descriptions, and 2,000 audio clips from a public dataset. Results came back in 42ms average (p95: 78ms) on a single AWS t3.large instance. No Pinecone, no Qdrant, no dedicated vector infrastructure.

pgvector 0.8.0 — The Filtered Search Fix That Changes Everything

Here’s where this gets interesting for production. If you’ve used pgvector before, you know the overfiltering problem: when you add a SQL WHERE clause on top of a vector search, the HNSW index scan stops before finding enough matching results.

With 10M products and a 0.5% category filter rate, the old behavior would return 4 results instead of the 20 you asked for. The index traversal stopped after visiting 40 nodes (ef_search=40), filtered out 36 of them, and gave you 4.

pgvector 0.8.0 fixes this with iterative index scans:

-- The old way (pgvector < 0.8.0):
SET hnsw.ef_search = 200;  -- brute-force: visit more nodes
-- Still unreliable for selective filters

-- The new way (pgvector 0.8.0+):
SET hnsw.iterative_scan = 'relaxed_order';
SET hnsw.max_scan_tuples = 20000;
-- Automatically continues until enough results pass the filter

The benchmarks from AWS on 10M vectors are dramatic:

Query Type	pgvector 0.7.4	pgvector 0.8.0	Improvement
Top-10 simple	123ms	13ms	9.4x faster
Category-filtered	128ms (10% recall)	86ms (100% recall)	Fixed recall + faster
Complex multi-filter	127ms (1% recall)	71ms (100% recall)	Fixed recall + faster

The recall numbers are what matter. Before 0.8.0, filtered multimodal search would silently return incomplete results. Now it returns everything that matches, and it’s faster doing it.

What I’d do differently: I started building on 0.7.4 and had to rebuild the HNSW index after upgrading to 0.8.0. If you’re starting fresh, use 0.8.0 from day one. The CREATE INDEX syntax is identical — just make sure your PostgreSQL instance has the updated extension.

Where Multimodal pgvector Falls Short

I’m not going to sell you this as a replacement for everything. Here’s where it breaks down:

1. Embedding costs add up. Gemini Embedding 2 costs $0.0025 per 1K characters for text and $0.0015 per image at current pricing. For 100K assets with mixed types, that’s $150-300 to build the initial index. Re-embedding new assets is an ongoing cost. If you’re indexing millions of items, the embedding bill becomes meaningful.

2. Audio embeddings are lossy. Gemini Embedding 2 can embed audio directly, but the 180-second limit per clip means you need to chunk longer audio. I found that 30-second chunks work well for music and ambient sounds, but podcast or meeting recordings need a transcript-first approach (transcribe → embed text).

3. The 8KB page limit bites at higher dimensions. PostgreSQL stores rows in 8KB pages. At 3072 dimensions (3072 × 4 bytes = 12KB), vectors overflow page boundaries, which hurts performance. This is why I use 768 dimensions — it fits cleanly and the MRL truncation keeps quality high. If you absolutely need 3072 dimensions, you’ll need to accept the performance penalty or use binary/scalar quantization.

4. No cross-modal ranking calibration. A text-to-text match and a text-to-image match return the same cosine distance score, but they’re not directly comparable. In practice, I found that text-to-text similarities cluster around 0.7-0.9 while cross-modal matches cluster around 0.4-0.6. You need different similarity thresholds per query type, or a normalization layer.

When This Makes Sense (And When It Doesn’t)

Here’s my decision matrix, based on what I actually built:

Scenario	Use pgvector + multimodal	Why
Product catalog with text + images	✅ Yes	Unified search, one infrastructure
Media library (photos, video clips)	✅ Yes	Cross-modal search is the killer feature
Podcast/audio archive with transcripts	⚠️ Depends	Use transcript-first if audio > 30s
Billion-scale vector workloads	❌ No	Use Pinecone/Qdrant — ops complexity wins
Real-time audio matching (live)	❌ No	Embedding latency (~200ms) is too high
Multi-region active-active	❌ No	pgvector doesn’t replicate vectors efficiently

What I’d Build Next

If I were productionizing this, three things would be on my list:

First, a batch embedding pipeline. Instead of embedding one asset at a time, I’d batch requests to the Gemini API. Google’s API supports up to 100 contents per request. That’s a 10-20x cost and time reduction for initial indexing.

Second, a hybrid ranking layer that combines vector similarity with PostgreSQL full-text search (which I covered in my previous pgvector article). The vector finds semantically similar items; FTS anchors literal term matches. Combined with Reciprocal Rank Fusion, this gives you the best of both worlds.

Third, a metadata filtering strategy that uses pgvector 0.8.0’s iterative scans properly. The pattern is: B-tree index on the filter column, HNSW index on the embedding, relaxed_order mode. I tested this on a 50K-asset subset and got 98% recall at 35ms average — which is production-grade.

7 PostgreSQL Tricks I Wish I Knew Sooner — DISTINCT ON, EXPLAIN ANALYZE, ON CONFLICT
PostgreSQL Vector Search With pgvector — The foundation: text embeddings, IVFFLAT, hybrid search
HNSW vs IVFFLAT in pgvector — Index benchmarks, tuning, and when to use each
Rust + PostgreSQL: Building a Backend That Scales — When I’d reach for Rust instead of this stack
Spring AI + RAG in Production — pgvector RAG pipeline with Spring AI

📚 Enjoyed this article?

I've written 20+ articles on PostgreSQL, AI integration, and production backend patterns. Start here:

The Architecture — One Embedding Space, Three Modalities

Generating Embeddings — One Model, Three Inputs

The Search — Cross-Modal Queries in Plain SQL

pgvector 0.8.0 — The Filtered Search Fix That Changes Everything

Where Multimodal pgvector Falls Short

When This Makes Sense (And When It Doesn’t)

What I’d Build Next

Related Articles on This Blog

Enjoyed this post?