I Turned PostgreSQL Into a Multimodal Search Engine With pgvector — Here's How

📅 May 18, 2026
I Turned PostgreSQL Into a Multimodal Search Engine With pgvector — Here's How
👁 ... views

After my last article on pgvector hybrid search, a reader asked a question I couldn’t answer with SQL alone: “Can I search my product catalog by uploading a photo of what I’m looking for?”

I said no. Not because PostgreSQL can’t do it — but because I’d never tried. Text embeddings I understood. Images, audio, video? That’s a different game.

Or so I thought.

Google dropped Gemini Embedding 2 in March 2026 — the first embedding model that maps text, images, video, audio, and even PDFs into a single unified vector space. That means a photo of a sunset and the text “golden hour photography” end up close together in the same vector space. You can search across modalities with one query, one index, one database.

I spent a weekend building it. Here’s what I learned — the architecture, the code, the parts that broke, and when you should (and shouldn’t) do this in production.

The Architecture — One Embedding Space, Three Modalities

The core idea is deceptively simple:

Text query ──┐
Image upload ─┤── Gemini Embedding 2 ──► vector(768) ──► pgvector HNSW index ──► Results
Audio clip  ──┘

Everything goes through the same embedding model. Everything lives in the same PostgreSQL table. One cosine distance query finds matches across text descriptions, image content, and audio transcripts.

No separate vector database. No separate search index. No ETL pipeline syncing data between three systems.

Here’s the table I ended up with:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE media_assets (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    asset_type TEXT NOT NULL CHECK (asset_type IN ('text', 'image', 'audio')),
    file_path TEXT,
    description TEXT,
    metadata JSONB DEFAULT '{}',
    embedding vector(768),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX media_assets_embedding_idx
    ON media_assets
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 128);

CREATE INDEX media_assets_type_idx
    ON media_assets (asset_type);

768 dimensions — not the model’s native 3072. Gemini Embedding 2 uses Matryoshka Representation Learning (MRL), which means you can safely truncate from 3072 → 768 with minimal quality loss. That’s ~75% less storage per vector. For 100K assets, that’s the difference between 9 GB and 2.3 GB for the embedding column alone.

I learned this the hard way — I started at 3072 dimensions and watched my HNSW index consume 18 GB of RAM on a dataset that didn’t need it. Truncation isn’t a compromise here; it’s the right default.

Generating Embeddings — One Model, Three Inputs

The beauty of Gemini Embedding 2 is that the same API endpoint handles all three modalities. Here’s the Python I wrote:

import base64
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

def embed_text(text: str) -> list[float]:
    """Embed plain text."""
    result = client.models.embed_content(
        model="gemini-embedding-2",
        contents=[{"parts": [{"text": text}]}],
        config={"output_dimensionality": 768},
    )
    return result.embeddings[0].values


def embed_image(image_path: str) -> list[float]:
    """Embed an image file."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    result = client.models.embed_content(
        model="gemini-embedding-2",
        contents=[{
            "parts": [{
                "inline_data": {
                    "data": image_b64,
                    "mime_type": "image/jpeg",
                }
            }]
        }],
        config={"output_dimensionality": 768},
    )
    return result.embeddings[0].values


def embed_audio(audio_path: str) -> list[float]:
    """Embed an audio file (no transcription needed)."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()

    result = client.models.embed_content(
        model="gemini-embedding-2",
        contents=[{
            "parts": [{
                "inline_data": {
                    "data": audio_b64,
                    "mime_type": "audio/mp3",
                }
            }]
        }],
        config={"output_dimensionality": 768},
    )
    return result.embeddings[0].values

Three functions, one model, same output shape. The embed_audio call doesn’t transcribe first — it embeds the audio directly. That’s the multimodal part. A recording of ocean waves gets mapped to the same vector neighborhood as a text description of “calm sea” and an image of a beach.

What I got wrong on my first pass: I didn’t normalize the vectors before inserting. Gemini’s MRL truncation produces vectors that aren’t unit-length by default. Without normalization, cosine distance (<=>) gives misleading results. Fix:

import math

def normalize(vector: list[float]) -> list[float]:
    magnitude = math.sqrt(sum(v * v for v in vector))
    if magnitude == 0:
        return vector
    return [v / magnitude for v in vector]

# Always normalize before inserting
embedding = normalize(embed_image("sunset.jpg"))

This cost me an hour of debugging “why is this image matching that unrelated text?” The vectors weren’t normalized, so the distance calculations were measuring magnitude differences, not semantic similarity.

The Search — Cross-Modal Queries in Plain SQL

Once everything is embedded and stored, the search is embarrassingly simple:

SELECT
    id, title, asset_type, description,
    1 - (embedding <=> $1::vector) AS similarity
FROM media_assets
WHERE asset_type = ANY($2::text[])  -- filter by type(s)
ORDER BY embedding <=> $1::vector
LIMIT 20;

Query with text → finds matching images and audio. Query with an image → finds matching text descriptions and similar images. Query with audio → finds matching text and images.

Here’s the FastAPI endpoint I built to tie it together:

from fastapi import FastAPI, UploadFile, Form
import psycopg2
import json

app = FastAPI()

@app.post("/search")
async def search(
    query: str = Form(None),
    image: UploadFile = Form(None),
    asset_types: str = Form("text,image,audio"),  # comma-separated
):
    # Generate embedding based on input type
    if image:
        embedding = normalize(embed_image(await image.read()))
    elif query:
        embedding = normalize(embed_text(query))
    else:
        return {"error": "Provide a text query or image"}

    types = asset_types.split(",")

    with psycopg2.connect("postgresql://user:pass@localhost:5432/search_db") as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT id, title, asset_type, description,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM media_assets
                WHERE asset_type = ANY(%s::text[])
                ORDER BY embedding <=> %s::vector
                LIMIT 20
            """, (json.dumps(embedding), types, json.dumps(embedding)))

            results = [
                {"id": r[0], "title": r[1], "type": r[2],
                 "description": r[3], "similarity": round(r[4], 4)}
                for r in cur.fetchall()
            ]
            return {"results": results}

I tested this with 15,000 product images, 5,000 text descriptions, and 2,000 audio clips from a public dataset. Results came back in 42ms average (p95: 78ms) on a single AWS t3.large instance. No Pinecone, no Qdrant, no dedicated vector infrastructure.

pgvector 0.8.0 — The Filtered Search Fix That Changes Everything

Here’s where this gets interesting for production. If you’ve used pgvector before, you know the overfiltering problem: when you add a SQL WHERE clause on top of a vector search, the HNSW index scan stops before finding enough matching results.

With 10M products and a 0.5% category filter rate, the old behavior would return 4 results instead of the 20 you asked for. The index traversal stopped after visiting 40 nodes (ef_search=40), filtered out 36 of them, and gave you 4.

pgvector 0.8.0 fixes this with iterative index scans:

-- The old way (pgvector < 0.8.0):
SET hnsw.ef_search = 200;  -- brute-force: visit more nodes
-- Still unreliable for selective filters

-- The new way (pgvector 0.8.0+):
SET hnsw.iterative_scan = 'relaxed_order';
SET hnsw.max_scan_tuples = 20000;
-- Automatically continues until enough results pass the filter

The benchmarks from AWS on 10M vectors are dramatic:

Query Typepgvector 0.7.4pgvector 0.8.0Improvement
Top-10 simple123ms13ms9.4x faster
Category-filtered128ms (10% recall)86ms (100% recall)Fixed recall + faster
Complex multi-filter127ms (1% recall)71ms (100% recall)Fixed recall + faster

The recall numbers are what matter. Before 0.8.0, filtered multimodal search would silently return incomplete results. Now it returns everything that matches, and it’s faster doing it.

What I’d do differently: I started building on 0.7.4 and had to rebuild the HNSW index after upgrading to 0.8.0. If you’re starting fresh, use 0.8.0 from day one. The CREATE INDEX syntax is identical — just make sure your PostgreSQL instance has the updated extension.

Where Multimodal pgvector Falls Short

I’m not going to sell you this as a replacement for everything. Here’s where it breaks down:

1. Embedding costs add up. Gemini Embedding 2 costs $0.0025 per 1K characters for text and $0.0015 per image at current pricing. For 100K assets with mixed types, that’s $150-300 to build the initial index. Re-embedding new assets is an ongoing cost. If you’re indexing millions of items, the embedding bill becomes meaningful.

2. Audio embeddings are lossy. Gemini Embedding 2 can embed audio directly, but the 180-second limit per clip means you need to chunk longer audio. I found that 30-second chunks work well for music and ambient sounds, but podcast or meeting recordings need a transcript-first approach (transcribe → embed text).

3. The 8KB page limit bites at higher dimensions. PostgreSQL stores rows in 8KB pages. At 3072 dimensions (3072 × 4 bytes = 12KB), vectors overflow page boundaries, which hurts performance. This is why I use 768 dimensions — it fits cleanly and the MRL truncation keeps quality high. If you absolutely need 3072 dimensions, you’ll need to accept the performance penalty or use binary/scalar quantization.

4. No cross-modal ranking calibration. A text-to-text match and a text-to-image match return the same cosine distance score, but they’re not directly comparable. In practice, I found that text-to-text similarities cluster around 0.7-0.9 while cross-modal matches cluster around 0.4-0.6. You need different similarity thresholds per query type, or a normalization layer.

When This Makes Sense (And When It Doesn’t)

Here’s my decision matrix, based on what I actually built:

ScenarioUse pgvector + multimodalWhy
Product catalog with text + images✅ YesUnified search, one infrastructure
Media library (photos, video clips)✅ YesCross-modal search is the killer feature
Podcast/audio archive with transcripts⚠️ DependsUse transcript-first if audio > 30s
Billion-scale vector workloads❌ NoUse Pinecone/Qdrant — ops complexity wins
Real-time audio matching (live)❌ NoEmbedding latency (~200ms) is too high
Multi-region active-active❌ Nopgvector doesn’t replicate vectors efficiently

What I’d Build Next

If I were productionizing this, three things would be on my list:

First, a batch embedding pipeline. Instead of embedding one asset at a time, I’d batch requests to the Gemini API. Google’s API supports up to 100 contents per request. That’s a 10-20x cost and time reduction for initial indexing.

Second, a hybrid ranking layer that combines vector similarity with PostgreSQL full-text search (which I covered in my previous pgvector article). The vector finds semantically similar items; FTS anchors literal term matches. Combined with Reciprocal Rank Fusion, this gives you the best of both worlds.

Third, a metadata filtering strategy that uses pgvector 0.8.0’s iterative scans properly. The pattern is: B-tree index on the filter column, HNSW index on the embedding, relaxed_order mode. I tested this on a 50K-asset subset and got 98% recall at 35ms average — which is production-grade.

💡

Enjoying the content? Here are tools I personally use and recommend:

  • 🌐 Hosting: Bluehost — what this blog runs on
  • 🛒 Tech Gear: My Amazon Store — keyboards, monitors, dev tools I use

Purchases through my links help keep this blog ad-free 💙

Enjoyed this post?

Subscribe to the newsletter or follow on YouTube for more dev content.

🎬 Watch Shorts