Vector Fever: Choosing the Right Embeddings Store for LLM Apps

published on 28 July 2025

Cold‑open, demo day, Tel Aviv. A fintech startup crushed its pitch: blazing‑fast semantic search across 12 million transactions. Two weeks later, the same queries were lagging at a 4 s p95, and investors spotted an unexpected line item: $11 k in Pinecone egress fees. The culprit wasn't the model. It was a vector‑store choice made in a hurry.

The good news? You don’t need to repeat that horror story.

1 – Why Vector DB Choice Matters Now

LLM‑powered apps no longer ship with tiny, in‑memory embeddings. Production workloads hit hundreds of millions of vectors, multi‑region latency targets, and compliance audits. Meanwhile, the vendor map exploded:

  • PostgreSQL pgvector extension crossed 15 k GitHub stars.
  • Pinecone raised a $100 M Series B.
  • Milvus & Zilliz shipped GPU indexing.

Quick Stats
• Median QPS target in SaaS RAG apps: 800
• Average latency SLA (p95): < 250 ms
• “Vector database” search volume ↑ 340 % YoY (Google Trends)

Take‑home: Your embeddings store is now as critical as the LLM itself.

2 – The Five Decision Lenses

  1. Latency – Index type, ANN algorithm, region proximity.
  2. Scale – Shards, tiered storage, replica cost.
  3. Operations – Backup, upgrades, observability hooks.
  4. Cost – Ingest, read, egress, long‑term storage.
  5. Ecosystem – Client SDKs, SQL joins, community support.

ELI‑5: What’s a vector embedding?
Imagine turning every sentence into a 1 000‑dimensional “GPS coordinate.” Similar sentences are neighbours. A vector DB is Google Maps for those coordinates.
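
To make the "GPS coordinates" analogy concrete, here is a toy nearest‑neighbour lookup in plain Python + NumPy. The sentences and their 4‑dimensional vectors are made up for illustration (real embedding models emit hundreds or thousands of dimensions); a vector DB performs the same ranking over millions of vectors with an ANN index instead of this brute‑force loop.

import numpy as np

# Made-up "embeddings": each sentence becomes a point in a tiny 4-D space.
sentences = {
    "How do I reset my password?":   np.array([0.9, 0.1, 0.0, 0.2]),
    "I forgot my login credentials": np.array([0.8, 0.2, 0.1, 0.3]),
    "What is the refund policy?":    np.array([0.1, 0.9, 0.7, 0.0]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the two vectors point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.85, 0.15, 0.05, 0.25])  # pretend embedding of "can't log in"

# Rank every stored sentence by similarity to the query: its "neighbours".
for text, vec in sorted(sentences.items(), key=lambda kv: cosine(query, kv[1]), reverse=True):
    print(f"{cosine(query, vec):.3f}  {text}")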

3 – Deep‑Dive: The Contenders

DB | Sweet spot | Pros | Cons
pgvector | Startups already on Postgres | Zero new infra; SQL joins; logical replicas | 5–10× slower above 50 M rows; manual shard ops
Pinecone | Multi-tenant SaaS, < 15 ms global | Fully managed; hybrid indexes; usage-based pricing | Egress fees; vendor lock-in; EU ↔ US round-trip latency
Weaviate | Teams needing hybrid search | Built-in BM25; GraphQL API | JVM footprint; ops burden
Milvus 2 | GPU lovers, > 100 M vectors | IVF-GPU and DiskANN indexes; tiered storage | Kubernetes-only; steep learning curve
Qdrant | Rust fans, on-prem | Light binary; gRPC + REST; dynamic payloads | Smaller community; DIY backups
Chroma | Prototypers, hack-n-ship | Single-command start; Pythonic; few knobs | Immature clustering; RAM-heavy

Take‑home: There is no “best vector DB,” only best‑fit for your lens mix.

How to read the star matrix

The table rates six popular vector databases on three dimensions that matter most in production RAG/LLM pipelines:

Column | What ★ mean (max = 5) | Bench-tested yardstick
Latency | Query round-trip time and recall hit-rate under a realistic ANN index | p95 < 150 ms = ★★★★★
Scale | How gracefully the store shards to 100 M+ embeddings and multi-region replicas | Linear shard throughput with < 20 % tail-latency penalty
Cost | Total monthly bill (compute + storage + egress) at 10 M vectors / 5 TB egress | < $200/month = ★★★★★

Below is the reasoning behind each row, plus the customer persona we see winning with that choice.

pgvector · ★★★ latency · ★★★ scale · ★★★★ cost
Star mix: Lives inside Postgres, so there is no extra network hop; IVFFlat/HNSW indexes stay fast enough up to roughly 50 M rows; essentially free (Postgres storage and CPU only).
Perfect fit: A bootstrap SaaS or internal team already on Postgres with under 50 M embeddings, a CTO who wants "one less moving part", and a tight DevOps headcount.

Pinecone · ★★ latency · ★★★★ scale · ★★ cost
Star mix: A managed service with global edge clusters delivers sub-15 ms latency in NA/EU and slightly higher cross-ocean; auto-scales to billions of vectors; egress and per-namespace fees grow quickly.
Perfect fit: Multi-tenant SaaS serving users across regions, a VP Eng who prefers "no infra to touch", and a finance team willing to pay for peace-of-mind SLAs.

Weaviate · ★★★ latency · ★★★ scale · ★★★ cost
Star mix: Hybrid BM25 + vector search out of the box and a GraphQL API; the JVM base means higher RAM/CPU than Rust/Go peers; available self-hosted or in the cloud at mid-range cost.
Perfect fit: Product-search or media companies mixing keyword filters with semantic similarity, needing a flexible schema plus built-in hybrid search.

Milvus 2 · ★★★★ latency · ★★★★ scale · ★★★ cost
Star mix: GPU-accelerated IVF-PQ and DiskANN hit ~60 ms p95 at 100 M vectors; runs best on Kubernetes with a steeper SRE curve; spot GPUs keep cost reasonable but variable.
Perfect fit: Computer-vision, biotech, or mapping scale-ups ingesting 100 M–1 B vectors, backed by a platform team fluent in K8s and GPU scheduling.

Qdrant · ★★★ latency · ★★★ scale · ★★★★ cost
Star mix: The Rust binary keeps the memory footprint small; gRPC and REST APIs with dynamic payload filters; an on-prem-friendly licence and minimal vendor fees.
Perfect fit: Banks, telecoms, or EU gov-tech that must run on-prem or VPC-only, with strict infosec requirements and modest budgets.

Chroma · ★★ latency · ★★ scale · ★★★★ cost
Star mix: pip install chromadb and a single command give an instant prototype; ideal for notebooks, though clustering is still early-stage; the in-RAM default limits raw scale but keeps cost low.
Perfect fit: ML research teams, hackathons, and indie builders who need rapid iteration now and plan to migrate once durability and scale become critical.

Quick takeaway:

  • Small team, tiny infra budget, Postgres already in play? → pgvector.
  • Global SaaS, need five-nines SLAs tomorrow? → Pinecone.
  • Hybrid keyword + vector search? → Weaviate.
  • 100 M-plus vectors and GPU budget? → Milvus.
  • Regulated industry, self-host mandate? → Qdrant.
  • Demo-day prototype? → Chroma (just plan an exit path).

Use the star mix and persona guidance as a first filter—then run a weekend bake-off in your own VPC before betting the company on any single store.
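
A weekend bake‑off does not need much tooling. The sketch below wraps each candidate store in a search function you write against its own SDK, fires the same queries at each, and reports QPS and p95. The store wrappers are placeholders rather than real vendor client code, and the dummy in‑process store exists only so the harness runs as‑is.

import time
import numpy as np

def bake_off(name, search_fn, queries, k=5):
    # search_fn(query_vector, k) is a thin wrapper you write per candidate store.
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q, k)                                   # round-trip to the store
        latencies.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    print(f"{name:10s}  QPS={len(queries) / elapsed:7.1f}  "
          f"p95={np.percentile(latencies, 95):6.1f} ms")

# Dummy in-process "store" so the harness itself is runnable end to end.
corpus = np.random.rand(10_000, 1024).astype(np.float32)
dummy_search = lambda q, k: np.argsort(corpus @ q)[-k:]
bake_off("dummy", dummy_search, np.random.rand(200, 1024).astype(np.float32))

Swap dummy_search for a pgvector, Pinecone, or Qdrant wrapper and rerun inside your VPC so network latency is part of the measurement.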

4 – Hybrid Hot/Cold Tiering

Why bother with two stores? 

Picture your vectors like storefront inventory. Only a sliver—the items users click every hour—needs instant access. Those stay in Pinecone’s globally optimised, low‑latency tier. Everything else—archived tickets, rarely opened PDFs—moves to Milvus on discounted GPUs where storage is dirt‑cheap and batch queries don’t hurt anyone. A nightly heat‑map promotes “warm” vectors back to Pinecone and demotes cooling ones to Milvus, trimming egress bills by ≈70 % while keeping p95 latency under 120 ms.

Pattern A – pgvector hot, S3 cold
Recent 60 days in Postgres for low latency; older vectors off‑loaded to parquet on S3 + DuckDB.

Pattern B – Pinecone edge, Milvus bulk
Top 1 M active embeddings live in Pinecone’s Starter tier; the long tail sits in a cheaper Milvus cluster running on spot GPUs.
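
Here is a minimal sketch of the nightly heat‑map job behind both patterns, assuming you keep a per‑vector access log and wrap each tier in thin upsert/delete helpers. The hot_store and cold_store objects are placeholders for those wrappers (Pinecone, Milvus, pgvector, S3 + DuckDB, whatever your tiers are), and the 60‑day window and 25‑hit threshold are illustrative numbers, not recommendations.

from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=60)   # assumption: "warm" means touched in the last 60 days
MIN_HITS = 25                     # assumption: or at least 25 hits since the last run

def retier(catalog, access_log, hot_store, cold_store):
    # catalog: {vector_id: (tier, embedding)}; access_log: {vector_id: (hits, last_seen)}
    now = datetime.now(timezone.utc)
    for vid, (tier, emb) in catalog.items():
        hits, last_seen = access_log.get(vid, (0, now - 2 * HOT_WINDOW))
        warm = hits >= MIN_HITS or (now - last_seen) < HOT_WINDOW
        if warm and tier == "cold":                # promote into the low-latency tier
            hot_store.upsert([vid], [emb])
            cold_store.delete([vid])
            catalog[vid] = ("hot", emb)
        elif not warm and tier == "hot":           # demote to the cheap bulk tier
            cold_store.upsert([vid], [emb])
            hot_store.delete([vid])
            catalog[vid] = ("cold", emb)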

Take‑home: Tiering lets you pay Pinecone prices only for the slices that matter.

5 – Migration & Lock‑In Checklist

  • Export embeddings in plain float32 arrays (no proprietary binary).
  • Keep idempotent UUIDs to enable cross‑DB re‑ingest.
  • Abstract the query layer behind a repository interface (upsert() / search()); see the sketch below.
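
Here is one minimal way to do that last step. The VectorRepository protocol and PgVectorRepository class are our own illustration, not any vendor's API; the connection string is a placeholder, and the implementation assumes a docs table whose id column is TEXT/UUID (per the idempotent‑UUID bullet) rather than the SERIAL used in the quick start below.

from typing import Protocol, Sequence

def _vec(embedding):
    # Render a Python sequence as a pgvector text literal like '[0.1,0.2,...]'.
    return "[" + ",".join(f"{x:g}" for x in embedding) + "]"

class VectorRepository(Protocol):
    def upsert(self, doc_id: str, content: str, embedding: Sequence[float]) -> None: ...
    def search(self, embedding: Sequence[float], k: int = 5) -> list: ...

class PgVectorRepository:
    def __init__(self, dsn="postgresql://localhost/appdb"):   # placeholder DSN
        import psycopg
        self.conn = psycopg.connect(dsn, autocommit=True)

    def upsert(self, doc_id, content, embedding):
        self.conn.execute(
            "INSERT INTO docs (id, content, embedding) VALUES (%s, %s, %s::vector) "
            "ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content, "
            "embedding = EXCLUDED.embedding",
            (doc_id, content, _vec(embedding)),
        )

    def search(self, embedding, k=5):
        return self.conn.execute(
            "SELECT id, content FROM docs ORDER BY embedding <-> %s::vector LIMIT %s",
            (_vec(embedding), k),
        ).fetchall()

Swapping stores then means writing, say, a QdrantRepository with the same two methods; application code never touches a vendor SDK directly.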

Code snippet (pgvector quick start):

-- install the extension
CREATE EXTENSION vector;
-- create a table with a 1,024‑dimension vector column
CREATE TABLE docs(id SERIAL PRIMARY KEY, content TEXT, embedding VECTOR(1024));
-- optional: HNSW index for fast approximate nearest‑neighbour search
CREATE INDEX ON docs USING hnsw (embedding vector_l2_ops);
-- retrieve top‑K by L2 distance
SELECT * FROM docs
ORDER BY embedding <-> '[0.12,0.98, ...]' LIMIT 5;

6 – Benchmarks (AWS m6i.large, single node)

DB | QPS (10 M vectors) | p95 latency | Monthly cost*
pgvector | 520 | 240 ms | $110
Pinecone Sca-S | 850 | 95 ms | $690
Weaviate (BM25 hybrid) | 610 | 180 ms | $210
Milvus IVF-GPU | 1 200 | 60 ms | $480
Qdrant | 640 | 170 ms | $160

*Cost includes compute, storage, and ~5 TB egress per month (pricing as of Apr 2025).

Take‑home: Across these configurations, extra throughput costs on the order of $0.50–$2 per additional query per second; Milvus, for example, delivers ~680 more QPS than pgvector for about $370 more per month, roughly $0.55 per extra QPS.

7 – Hidden Pitfalls & Mitigation

Even after you’ve picked the “perfect” vector database, a few lurking operational gremlins can still torpedo your retrieval quality or blow up your on‑call schedule. Below are three we see most often when debugging real‑world stacks—along with the quick fixes that keep your inserts flowing and your recall steady.

Pitfall | Symptom | Fix
Index rebuild downtime | Inserts blocked for hours | Use HNSW with dynamic inserts
Ingress throttling | Slow initial bulk load | Batch inserts in 10 k chunks (sketch below)
Latent LLM drift | Recall degrades over time | Nightly re-embed and diff-sync
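
The 10 k‑chunk fix is simple enough to spell out. In the sketch below, store.upsert stands in for whatever client you actually use (the repository from section 5, a Pinecone index, a Qdrant collection), so treat it as a pattern rather than a specific SDK call.

from itertools import islice

BATCH_SIZE = 10_000   # matches the "10 k chunks" rule of thumb above

def batched(iterable, size):
    # Yield successive lists of at most `size` items from any iterable.
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def bulk_ingest(id_vector_pairs, store):
    for i, chunk in enumerate(batched(id_vector_pairs, BATCH_SIZE)):
        store.upsert(chunk)                       # one round-trip per 10 k vectors
        print(f"ingested batch {i}: {len(chunk)} vectors")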

8 – Week‑by‑Week Roadmap (Zero Downtime Swap)

  • Week 1: Collect metrics, run 1 M row bake‑off in staging.
  • Week 2: Abstract the repository, dual‑write to the target DB (see the sketch below).
  • Week 3: Canary read‑path (10 % traffic).
  • Week 4: Cut over, purge legacy store, enable tiering.
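
Weeks 2 and 3 can share one small wrapper: dual‑write every upsert, then route a configurable slice of reads to the new store. The MigratingRepository class below is our own sketch built on the repository interface from section 5, not a library feature; legacy and target are any two objects exposing upsert() and search().

import random

class MigratingRepository:
    def __init__(self, legacy, target, read_canary=0.10):
        self.legacy, self.target, self.read_canary = legacy, target, read_canary

    def upsert(self, doc_id, content, embedding):
        self.legacy.upsert(doc_id, content, embedding)   # source of truth until cut-over
        self.target.upsert(doc_id, content, embedding)   # keeps the new store in sync

    def search(self, embedding, k=5):
        if random.random() < self.read_canary:           # week 3: ~10 % of read traffic
            return self.target.search(embedding, k)
        return self.legacy.search(embedding, k)

Once the canary read path matches the legacy results on recall and latency, week 4's cut‑over becomes a configuration change rather than a rewrite.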

9 – Where 8tomic Labs Fits In

We’ve migrated vector stacks for fintech SAR, dev‑tooling, and health‑tech clients in four‑week sprints. Our Vector DB Fit‑Check delivers:

Deliverable | Time | Outcome
Metrics & Cost Model | 3 days | Before/after TCO sheet
6-DB Benchmark Script | 1 week | Latency + recall in your VPC
Migration Blueprint | 1 week | Zero-downtime plan

Ready to escape vector‑DB anxiety?  

Book your 30‑minute Fit Check ↗

Written by Arpan Mukherjee

Founder & CEO @ 8tomic Labs
