Cold‑open, demo day, Tel Aviv. A fintech startup crushed its pitch: blazing‑fast semantic search over 12 million transactions. Two weeks later, the same queries were crawling at a 4 s p95, and investors spotted an unexpected line item: $11 k of Pinecone egress. The culprit wasn’t the model. It was a vector‑store choice made in a hurry.
The good news? You don’t need to repeat that horror story.
1 – Why Vector DB Choice Matters Now
LLM‑powered apps no longer ship with tiny, in‑memory embeddings. Production workloads hit hundreds of millions of vectors, multi‑region latency targets, and compliance audits. Meanwhile, the vendor map exploded:
- PostgreSQL pgvector extension crossed 15 k GitHub stars.
- Pinecone raised a $100 M Series B.
- Milvus & Zilliz shipped GPU indexing.
Quick Stats
• Median QPS target in SaaS RAG apps: 800
• Average latency SLA (p95): < 250 ms
• “Vector database” search volume ↑ 340 % YoY (Google Trends)
Take‑home: Your embeddings store is now as critical as the LLM itself.
2 – The Five Decision Lenses
- Latency – Index type, ANN algorithm, region proximity.
- Scale – Shards, tiered storage, replica cost.
- Operations – Backup, upgrades, observability hooks.
- Cost – Ingest, read, egress, long‑term storage.
- Ecosystem – Client SDKs, SQL joins, community support.
ELI‑5: What’s a vector embedding?
Imagine turning every sentence into a 1 000‑dimensional “GPS coordinate.” Similar sentences are neighbours. A vector DB is Google Maps for those coordinates.
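To make the GPS analogy concrete, here is a minimal, self‑contained sketch (plain NumPy, no vector DB involved) of what every store in this article ultimately does: rank stored embeddings by their distance to a query embedding. The texts and vector values are made up for illustration.
Code sketch (nearest‑neighbour ranking, NumPy):
import numpy as np

# Toy "embeddings": 4 dimensions instead of 1,000; values invented for the example.
docs = {
    "refund my card payment":      np.array([0.9, 0.1, 0.0, 0.2]),
    "reset my password":           np.array([0.1, 0.8, 0.3, 0.0]),
    "chargeback on a transaction": np.array([0.8, 0.2, 0.1, 0.3]),
}
query = np.array([0.85, 0.15, 0.05, 0.25])  # embedding of "dispute a card charge"

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A vector DB performs exactly this ranking, just over millions of rows behind an ANN index.
for text, vec in sorted(docs.items(), key=lambda kv: -cosine_similarity(query, kv[1])):
    print(f"{cosine_similarity(query, vec):.3f}  {text}")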
3 – Deep‑Dive: The Contenders
DB | Sweet Spot | Pros | Cons |
---|---|---|---|
pgvector | Startups already on Postgres | Zero new infra. SQL joins. Logical replication. | 5–10× slower above ~50 M rows. Manual shard ops. |
Pinecone | Multi-tenant SaaS, < 15 ms global | Fully managed. Hybrid indexes. Usage-based pricing. | Egress fees. Vendor lock-in. Cross-ocean latency EU ↔ US. |
Weaviate | Teams needing hybrid search | Built-in BM25. GraphQL API. | Heavier resource footprint. Ops burden. |
Milvus 2 | GPU lovers, > 100 M vectors | IVF-GPU, disk-ANN. Tiered storage. | Kubernetes-only. Steep learning curve. |
Qdrant | Rust fans, on-prem | Light binary. gRPC + REST. Dynamic payloads. | Smaller community. DIY backups. |
Chroma | Prototypers, hack-n-ship | Single command start. Pythonic. Few knobs. | Immature clustering. RAM heavy. |
Take‑home: There is no “best vector DB,” only best‑fit for your lens mix.
How to read the star matrix
The table rates six popular vector databases on three dimensions that matter most in production RAG/LLM pipelines:
Column | What the ★ rating means (max = 5) | Bench-tested yardsticks |
---|---|---|
Latency | Query round-trip time & recall hit-rate under a realistic ANN index | p95 < 150 ms = ★★★★★ |
Scale | How gracefully the store shards to 100 M+ embeddings and multi-region replicas | Linear shard-throughput with < 20 % tail latency penalty |
Cost | Total monthly bill (compute + storage + egress) at 10 M vectors / 5 TB egress | <$200 = ★★★★★ |
Below is the reasoning behind each row, plus the customer persona we see winning with that choice.
DB | Star mix explained | Perfect-fit situations / personas |
---|---|---|
pgvector ★★★ latency · ★★★ scale · ★★★★ cost | Lives inside Postgres—no extra hop. IVFFlat / HNSW fast enough to ~50 M rows. Essentially free (PG storage & CPU only). | Bootstrap SaaS or internal team already on Postgres with < 50 M embeddings. CTO wants “one less moving part” and DevOps headcount is tight. |
Pinecone ★★ latency · ★★★★ scale · ★★ cost | Managed service with global edge clusters gives sub-15 ms latency in NA/EU and slightly higher cross-ocean. Auto-scales to billions of vectors. Egress and per-namespace fees grow quickly. | Multi-tenant SaaS serving users across regions. VP Eng prefers “no infra to touch,” finance is willing to pay for peace-of-mind SLAs. |
Weaviate ★★★ latency · ★★★ scale · ★★★ cost | Hybrid BM25 + vector search out of the box and a GraphQL API. Heavier RAM/CPU footprint than Rust peers when self-hosted. Available self-hosted or in cloud at mid-range cost. | Product-search or media companies mixing keyword filters with semantic similarity and needing a flexible schema plus built-in hybrid search. |
Milvus 2 ★★★★ latency · ★★★★ scale · ★★★ cost | GPU-accelerated IVF-PQ and disk-ANN hit ~60 ms p95 at 100 M vectors. Runs best on Kubernetes with a steeper SRE curve. Spot GPUs keep cost reasonable but variable. | Computer-vision, biotech, or mapping scale-ups ingesting 100 M–1 B vectors, backed by a platform team fluent in K8s and GPU scheduling. |
Qdrant ★★★ latency · ★★★ scale · ★★★★ cost | Rust binary yields a small memory footprint. gRPC & REST with dynamic payload filters. On-prem-friendly licence and minimal vendor fees. | Banks, telecoms, or EU Gov-Tech that must run on-prem / VPC-only with strict infosec and modest budgets. |
Chroma ★★ latency · ★★ scale · ★★★★ cost | A single pip install chromadb plus a few lines of Python gives an instant prototype. Ideal for notebooks; clustering is early-stage. In-RAM default limits raw scale but keeps cost low. | ML research teams, hackathons, indie builders needing rapid iteration now and planning to migrate once durability and scale become critical. |
Quick takeaway:
- Small team, tiny infra budget, Postgres already in play? → pgvector.
- Global SaaS, need five-nines SLAs tomorrow? → Pinecone.
- Hybrid keyword + vector search? → Weaviate.
- 100 M-plus vectors and GPU budget? → Milvus.
- Regulated industry, self-host mandate? → Qdrant.
- Demo-day prototype? → Chroma (just plan an exit path).
Use the star mix and persona guidance as a first filter—then run a weekend bake-off in your own VPC before betting the company on any single store.
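One way to run that first filter is to turn the star matrix into a weighted score. The sketch below uses the latency / scale / cost stars from the matrix; the ops and ecosystem stars and the lens weights are our own illustrative guesses, so adjust them to your workload before trusting the ranking.
Code sketch (first-cut scoring across the five lenses):
# Weight the five decision lenses for *your* workload, score each DB by its stars,
# and keep the top two or three candidates for the weekend bake-off.
WEIGHTS = {"latency": 0.30, "scale": 0.25, "ops": 0.15, "cost": 0.20, "ecosystem": 0.10}

STARS = {  # latency/scale/cost from the matrix above; ops/ecosystem are rough guesses
    "pgvector": {"latency": 3, "scale": 3, "cost": 4, "ops": 4, "ecosystem": 5},
    "Pinecone": {"latency": 2, "scale": 4, "cost": 2, "ops": 5, "ecosystem": 4},
    "Weaviate": {"latency": 3, "scale": 3, "cost": 3, "ops": 3, "ecosystem": 4},
    "Milvus 2": {"latency": 4, "scale": 4, "cost": 3, "ops": 2, "ecosystem": 3},
    "Qdrant":   {"latency": 3, "scale": 3, "cost": 4, "ops": 3, "ecosystem": 3},
    "Chroma":   {"latency": 2, "scale": 2, "cost": 4, "ops": 4, "ecosystem": 3},
}

def score(db: str) -> float:
    return sum(WEIGHTS[lens] * stars for lens, stars in STARS[db].items())

for db in sorted(STARS, key=score, reverse=True):
    print(f"{db:10s} {score(db):.2f}")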
4 – Hybrid Hot/Cold Tiering
Why bother with two stores?
Picture your vectors like storefront inventory. Only a sliver—the items users click every hour—needs instant access. Those stay in Pinecone’s globally optimised, low‑latency tier. Everything else—archived tickets, rarely opened PDFs—moves to Milvus on discounted GPUs where storage is dirt‑cheap and batch queries don’t hurt anyone. A nightly heat‑map promotes “warm” vectors back to Pinecone and demotes cooling ones to Milvus, trimming egress bills by ≈70 % while keeping p95 latency under 120 ms.
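The nightly heat-map job does not need to be sophisticated. Below is a minimal sketch of the promote/demote loop; hot_store and cold_store are assumed to be thin wrappers of your own (Pinecone and Milvus in this pattern) exposing the same contains / fetch / upsert / delete surface, and the thresholds are illustrative rather than tuned.
Code sketch (nightly hot/cold rebalance):
from datetime import datetime, timedelta, timezone

HOT_HITS_PER_DAY = 5              # promote anything queried at least this often
COLD_AFTER = timedelta(days=14)   # demote anything untouched for two weeks

def rebalance(access_log, hot_store, cold_store):
    """Nightly heat-map pass. access_log maps vector id -> (hits_last_24h, last_hit_at)."""
    now = datetime.now(timezone.utc)
    for vec_id, (hits, last_hit) in access_log.items():
        if hits >= HOT_HITS_PER_DAY and not hot_store.contains(vec_id):
            record = cold_store.fetch(vec_id)     # vector + payload
            hot_store.upsert([record])            # promote to the low-latency tier
        elif now - last_hit > COLD_AFTER and hot_store.contains(vec_id):
            record = hot_store.fetch(vec_id)
            cold_store.upsert([record])           # demote to the cheap tier
            hot_store.delete([vec_id])            # stop paying hot-tier prices for it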
Pattern A – pgvector hot, S3 cold
The most recent 60 days stay in Postgres for low latency; older vectors are off‑loaded to Parquet on S3 and queried with DuckDB.
Pattern B – Pinecone edge, Milvus bulk
Top 1 M active embeddings live in Pinecone’s Starter tier; the long tail sits in a cheaper Milvus cluster running on spot GPUs.
Take‑home: Tiering lets you pay Pinecone prices only for the slices that matter.
5 – Migration & Lock‑In Checklist
- Export embeddings as plain float32 arrays (no proprietary binary format).
- Keep stable UUIDs—they make cross‑DB re‑ingest idempotent.
- Abstract the query layer behind a repository interface (upsert(), search()); a minimal sketch follows below.
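The repository layer can stay tiny. Here is a minimal sketch of the abstraction the checklist refers to; the names VectorRecord and VectorRepository are ours, not any vendor’s SDK.
Code sketch (repository interface):
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class VectorRecord:
    id: str                     # stable UUID, reused across stores
    embedding: list[float]      # plain float32 values, no proprietary binary
    payload: dict = field(default_factory=dict)

class VectorRepository(Protocol):
    """Everything the application is allowed to know about the vector store."""
    def upsert(self, records: list[VectorRecord]) -> None: ...
    def search(self, query: list[float], k: int = 5) -> list[VectorRecord]: ...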
Code snippet (pgvector quick start):
-- install the extension
CREATE EXTENSION vector;
-- create a table with a 1,024-dim vector column
CREATE TABLE docs(id SERIAL PRIMARY KEY, content TEXT, embedding VECTOR(1024));
-- optional: HNSW index for fast approximate nearest-neighbour search (pgvector >= 0.5)
CREATE INDEX ON docs USING hnsw (embedding vector_l2_ops);
-- retrieve top-K by L2 distance
SELECT * FROM docs
ORDER BY embedding <-> '[0.12,0.98, ...]' LIMIT 5;
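For completeness, here is a hedged sketch of a pgvector-backed implementation of that repository interface using psycopg2. It assumes the docs table uses a stable TEXT/UUID primary key (per the checklist) rather than the SERIAL shown above, and that each record carries its text under a content payload key; neither assumption comes from any vendor SDK.
Code sketch (pgvector repository, psycopg2):
import psycopg2

def _vec(values):
    # pgvector's text input format: '[0.12,0.98,...]'
    return "[" + ",".join(str(v) for v in values) + "]"

class PgVectorRepository:
    """pgvector-backed implementation of the repository interface sketched above."""

    def __init__(self, dsn: str):
        self.conn = psycopg2.connect(dsn)

    def upsert(self, records) -> None:
        with self.conn, self.conn.cursor() as cur:
            for r in records:
                cur.execute(
                    "INSERT INTO docs (id, content, embedding) "
                    "VALUES (%s, %s, %s::vector) "
                    "ON CONFLICT (id) DO UPDATE SET "
                    "content = EXCLUDED.content, embedding = EXCLUDED.embedding",
                    (r.id, r.payload.get("content", ""), _vec(r.embedding)),
                )

    def search(self, query, k: int = 5):
        # Returns raw (id, content, distance) rows; mapping back to VectorRecord is omitted.
        with self.conn.cursor() as cur:
            cur.execute(
                "SELECT id, content, embedding <-> %s::vector AS distance "
                "FROM docs ORDER BY distance LIMIT %s",
                (_vec(query), k),
            )
            return cur.fetchall()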
6 – Benchmarks (AWS m6i.large, single node)
DB | 10 M vectors QPS | p95 Latency | Monthly Cost* |
---|---|---|---|
pgvector | 520 | 240 ms | $110 |
Pinecone Sca-S | 850 | 95 ms | $690 |
Weaviate (BM25 hybrid) | 610 | 180 ms | $210 |
Milvus IVF-GPU | 1 200 | 60 ms | $480 |
Qdrant | 640 | 170 ms | $160 |
*Cost includes compute, storage, and ~5 TB of egress per month; prices as of Apr 2025.
Take‑home: Throughput isn’t free; across these configurations you’re paying on the order of $1 for each additional request per second over the pgvector baseline.
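That rule of thumb falls out of the table itself: divide each store’s extra monthly cost over the pgvector baseline by its extra throughput.
Code sketch (marginal cost per req/s, numbers from the table above):
# Marginal cost of throughput vs. the pgvector baseline.
baseline_qps, baseline_cost = 520, 110

for name, qps, cost in [("Pinecone", 850, 690), ("Weaviate", 610, 210),
                        ("Milvus IVF-GPU", 1200, 480), ("Qdrant", 640, 160)]:
    dollars_per_extra_qps = (cost - baseline_cost) / (qps - baseline_qps)
    print(f"{name:15s} ${dollars_per_extra_qps:.2f} per extra req/s")
# Output ranges from roughly $0.4 to $1.8 per additional request per second.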
7 – Hidden Pitfalls & Mitigation
Even after you’ve picked the “perfect” vector database, a few lurking operational gremlins can still torpedo your retrieval quality or blow up your on‑call schedule. Below are three we see most often when debugging real‑world stacks—along with the quick fixes that keep your inserts flowing and your recall steady.
Pitfall | Symptom | Fix |
---|---|---|
Index rebuild downtime | Inserts blocked hours | Use HNSW with dynamic insert |
Ingress throttling | Slow initial load | Batch insert in 10 k chunks |
Embedding model drift | Recall ↓ over time | Nightly re-embed & diff-sync |
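For the ingress-throttling row, the fix is usually nothing fancier than chunking the load. A minimal sketch, assuming repo implements the upsert() interface from section 5 and using the 10 k batch size from the table:
Code sketch (batched bulk load):
from itertools import islice

def batched(iterable, size=10_000):
    """Yield lists of at most `size` records so bulk loads don't trip rate limits."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def bulk_load(records, repo):
    for chunk in batched(records, size=10_000):
        repo.upsert(chunk)   # one bounded request per chunk instead of one giant payload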
8 – Week‑by‑Week Roadmap (Zero Downtime Swap)
- Week 1: Collect metrics, run 1 M row bake‑off in staging.
- Week 2: Abstract the repository, dual‑write to the target DB (see the sketch after this list).
- Week 3: Canary read‑path (10 % traffic).
- Week 4: Cut over, purge legacy store, enable tiering.
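Week 2 is where the repository abstraction from section 5 pays off: wrap the legacy and target stores in a dual-write shim so writes land in both while reads stay on the legacy path until the canary. A minimal sketch, with class names of our own invention:
Code sketch (dual-write shim):
import logging

class DualWriteRepository:
    """Writes go to both stores; reads stay on the legacy store until the Week 3 canary."""

    def __init__(self, legacy, target):
        self.legacy, self.target = legacy, target

    def upsert(self, records):
        self.legacy.upsert(records)          # legacy store remains the source of truth
        try:
            self.target.upsert(records)      # best-effort shadow write to the new store
        except Exception:
            logging.exception("shadow write failed; the backfill job will catch it up")

    def search(self, query, k=5):
        return self.legacy.search(query, k)  # flip to self.target at cut-over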
9 – Where 8tomic Labs Fits In
We’ve migrated vector stacks for fintech SAR, dev‑tooling, and health‑tech clients in four‑week sprints. Our Vector DB Fit‑Check delivers:
Deliverable | Time | Outcome |
---|---|---|
Metrics & Cost Model | 3 days | Before/after TCO sheet |
6-DB Benchmark Script | 1 week | Latency + recall in your VPC |
Migration Blueprint | 1 week | Zero-downtime plan |
Ready to escape vector‑DB anxiety?
Book your 30‑minute Fit Check ↗
Written by Arpan Mukherjee
Founder & CEO @ 8tomic Labs