Executive Snapshot
Challenge: A leading no-code platform needed fast, private recommendation capabilities embedded in their on-prem deployments—without compromising compliance or performance.
Solution: 8tomic Labs designed an on-prem vector search and recommendation engine using pgvector, FastAPI, and a tailored Llama3-27B model, orchestrated via Celery.
Impact: Achieved <100 ms p95 recommendation latency, a 4× throughput increase, and $120 K in annual OpEx savings on cloud vendor fees.
1. Client Context & Challenge
The client—a mid-sized enterprise offering a proprietary no-code app builder—enables citizen developers to visually design business apps and forms through a drag-and-drop interface. Their platform supports dynamic forms, conditional fields, and workflow orchestration, empowering non-technical users to launch custom applications in hours rather than weeks.
However, they faced four critical requirements:
- Dynamic Configuration & Extensibility: Admins must onboard new data sources and recommendation flows via UI—no redeployments or code changes. The solution needed to let non-technical users supply SQL connections and custom queries on the fly to feed the recommendation engine.
- Data Privacy & Compliance: All customer data and recommendation logic must run 100 % on-prem; no cloud egress allowed due to pharma regulations.
- Performance SLAs: Embedded recommendations must surface in under 150 ms to maintain seamless end-user interactions within the no-code forms and apps.
- Zero-Code Integration: Product teams and end-users expected a plug-and-play widget—auto-embed into any form—without writing integration code for each deployment.
They explored cloud-based AI solutions, but their major pharmaceutical customers mandate air-gapped on-prem deployments—cloud-based offerings were architecturally infeasible for compliance reasons.
2. Solution Architecture
Our architecture balanced privacy, speed, and ease of integration:
The diagram shows how data ingestion, vector storage, model serving, and the API layer interconnect to support sub-100 ms recommendations.
3. Implementation Details
We selected each component for its on-prem friendliness and performance:
- Vector Store: pgvector extension on PostgreSQL 14, handling 10 M vectors with IVFFlat indexing (see the schema sketch after this list).
- Embedding Model: Llama3-27B, containerized for fully on-prem serving and quantized to 4-bit for memory savings.
- Orchestration: Celery distributed task queues manage long-running ingestion, cleaning, and re-vectorization jobs.
- API: FastAPI with Uvicorn worker pool; Redis LRU cache for hot queries.
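For concreteness, here is a minimal sketch of the kind of schema and IVFFlat index this component list implies; the table name, columns, and embedding dimension are assumptions for illustration, not the client's actual schema.

```python
# Illustrative setup script; table name, columns, and embedding dimension
# are assumptions for this sketch, not the client's actual schema.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS record_embeddings (
    id        BIGSERIAL PRIMARY KEY,
    source_id TEXT NOT NULL,      -- admin-configured data source this row came from
    payload   JSONB NOT NULL,     -- cleaned source record
    embedding vector(4096)        -- dimension depends on the embedding model
);

-- IVFFlat over cosine distance; `lists` would be tuned for the ~10 M vector corpus.
CREATE INDEX IF NOT EXISTS record_embeddings_ivfflat
    ON record_embeddings USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);
"""

with psycopg2.connect("postgresql://localhost/recs") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```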
Dynamic Data Pipeline & API Definition
To support the no-code builder’s dynamic app and form creation, we implemented a configurable pipeline and API definition that administrators manage via the platform:
- SQL Connection & Query Interface: Admins supply a JDBC/ODBC connection string and a custom SQL query to pull historical or legacy data directly into the pipeline.
- Automated Cleaning & Vectorization: A Celery task picks up new ingestion jobs, executes the provided query, applies cleaning routines (null filtering, normalization), and generates embeddings for each record (sketched after this list).
- Dynamic API Registration: For each new data source, the system dynamically registers a FastAPI endpoint and Swagger definition. This endpoint handles user-triggered calls at runtime, fetching live vector searches when the no-code form is used (see the registration sketch below).
- Form-Builder Injection: The no-code platform injects a customizable widget into each form or app. At runtime, when a user interacts (e.g., starts filling a field), the widget calls the dynamic API to surface contextual recommendations—guiding the user or determining the next step in the workflow.
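The ingestion steps above map naturally onto a Celery task. The sketch below is a minimal version under our assumptions; the task signature, the cleaning rules, and the `embed_batch()` helper standing in for the on-prem model-server call are all hypothetical.

```python
# Hypothetical Celery ingestion task; names, cleaning rules, and the
# embed_batch() helper (the call into the on-prem model server) are illustrative.
import json

import psycopg2
import requests
from celery import Celery

app = Celery("ingestion", broker="redis://localhost:6379/0")

def embed_batch(texts):
    # Assumed endpoint shape for the on-prem embedding server.
    resp = requests.post("http://model-server:8000/embed", json={"texts": texts})
    resp.raise_for_status()
    return resp.json()["embeddings"]

@app.task(bind=True, max_retries=3)
def ingest_source(self, source_id: str, conn_str: str, query: str):
    """Run the admin-supplied SQL, clean rows, embed them, and load into pgvector."""
    # 1. Pull records from the customer database using the admin-supplied query.
    with psycopg2.connect(conn_str) as src, src.cursor() as cur:
        cur.execute(query)
        cols = [c[0] for c in cur.description]
        rows = [dict(zip(cols, r)) for r in cur.fetchall()]

    # 2. Cleaning: drop nulls (normalization and other routines would slot in here).
    cleaned = [{k: v for k, v in row.items() if v is not None} for row in rows]

    # 3. Vectorize each record and load into the recommendation store.
    vectors = embed_batch([json.dumps(r, default=str) for r in cleaned])

    with psycopg2.connect("postgresql://localhost/recs") as dst, dst.cursor() as cur:
        for record, vec in zip(cleaned, vectors):
            cur.execute(
                "INSERT INTO record_embeddings (source_id, payload, embedding) "
                "VALUES (%s, %s::jsonb, %s::vector)",
                (source_id, json.dumps(record, default=str),
                 "[" + ",".join(map(str, vec)) + "]"),
            )
```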
This pattern lets non-technical admins onboard new data sources and recommendation flows without code changes—enabling true no-code extensibility and real-time AI assistance.
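Dynamic registration itself can be expressed with FastAPI's `add_api_route`. The sketch below is illustrative; the route shape and names are assumed, and `hybrid_search()` is sketched in section 4.

```python
# Illustrative dynamic registration; the route shape, request model, and the
# hybrid_search() helper (sketched in section 4) are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RecommendRequest(BaseModel):
    context: dict      # current form state sent by the embedded widget
    top_k: int = 5

def register_source_endpoint(source_id: str) -> None:
    """Attach a POST /recommend/{source_id} route when an admin onboards a source."""

    async def recommend(req: RecommendRequest):
        suggestions = hybrid_search(source_id, req.context, req.top_k)
        return {"source": source_id, "suggestions": suggestions}

    app.add_api_route(
        f"/recommend/{source_id}",
        recommend,
        methods=["POST"],
        name=f"recommend_{source_id}",
    )
    app.openapi_schema = None   # force the cached Swagger/OpenAPI spec to regenerate
```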
4. Runtime Recommendations
Once the dynamic API is registered and the form widget injected, runtime recommendations follow a dual-search approach:
- Vector Similarity Search: The widget sends the current form context (e.g., filled fields, user selections) as an embedding request to Llama3-27B. The resulting vector is queried against pgvector to retrieve top-K similar records from historical data.
- Textual Keyword Search: In parallel, a lightweight full-text index (PostgreSQL tsvector) performs keyword matches on form fields or metadata.
- Score Blending: We normalize both the cosine similarity scores and text search relevance scores, applying configurable weights (default: 70 % vector, 30 % text).
- Final Ranking & Response: The combined score determines the ranked list of suggestions, which the widget displays contextually—either as inline hints, dropdown recommendations, or next-action prompts.
This dual-search design ensures both semantic relevance and exact keyword matching—delivering more accurate and comprehensive recommendations in real time.
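A condensed sketch of how such a blend can be computed against pgvector and tsvector follows; the SQL, the min-max normalization, and the helper names are our assumptions, and it reuses the hypothetical `embed_batch()` from section 3.

```python
# Condensed sketch of the dual search and 70/30 blend described above.
# Table/column names, SQL, and the embed_batch() helper are assumptions.
import psycopg2

VECTOR_WEIGHT, TEXT_WEIGHT = 0.7, 0.3   # configurable blend (default 70 % / 30 %)

VECTOR_SQL = """
SELECT id, payload, 1 - (embedding <=> %s::vector) AS score
FROM record_embeddings
WHERE source_id = %s
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""

# In production a tsvector expression index would back this query.
TEXT_SQL = """
SELECT id, payload,
       ts_rank(to_tsvector('english', payload::text),
               plainto_tsquery('english', %s)) AS score
FROM record_embeddings
WHERE source_id = %s
  AND to_tsvector('english', payload::text) @@ plainto_tsquery('english', %s)
LIMIT %s;
"""

def _minmax(scores: dict) -> dict:
    """Normalize raw scores into [0, 1] so the two score scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def hybrid_search(source_id: str, context: dict, top_k: int = 5) -> list:
    query_text = " ".join(str(v) for v in context.values())
    qvec = embed_batch([query_text])[0]    # hypothetical on-prem model call (section 3)
    qvec_lit = "[" + ",".join(map(str, qvec)) + "]"

    with psycopg2.connect("postgresql://localhost/recs") as conn, conn.cursor() as cur:
        # Over-fetch both candidate lists so the blend has room to re-rank.
        cur.execute(VECTOR_SQL, (qvec_lit, source_id, qvec_lit, top_k * 4))
        vec_hits = {rid: (payload, score) for rid, payload, score in cur.fetchall()}
        cur.execute(TEXT_SQL, (query_text, source_id, query_text, top_k * 4))
        txt_hits = {rid: (payload, score) for rid, payload, score in cur.fetchall()}

    v = _minmax({k: s for k, (_, s) in vec_hits.items()})
    t = _minmax({k: s for k, (_, s) in txt_hits.items()})
    blended = {k: VECTOR_WEIGHT * v.get(k, 0.0) + TEXT_WEIGHT * t.get(k, 0.0)
               for k in set(v) | set(t)}

    payloads = {k: p for k, (p, _) in {**txt_hits, **vec_hits}.items()}
    ranked = sorted(blended, key=blended.get, reverse=True)[:top_k]
    return [{"id": k, "score": blended[k], "record": payloads[k]} for k in ranked]
```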
5. Performance & Benchmark Results
We load-tested the system at 1000 QPS of recommendation queries. The results show the on-prem design outperforming the previous cloud setup on both speed and cost:
| Metric | Baseline (Cloud) | On-Prem (8tomic) | Improvement |
| --- | --- | --- | --- |
| p95 Latency | 180 ms | <100 ms | 1.8× faster |
| Throughput | 500 RPS | 2 000 RPS | 4× |
| Cost / Month | >$12 K | $2 K | 83 % savings |
| Cache Hit Rate | — | 75 % | — |
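As a sanity check on the cost row: going from >$12 K to $2 K per month is at least a (12 - 2) / 12 ≈ 83 % reduction, i.e. roughly $10 K saved per month or $120 K per year, which matches the annual savings figure in section 7.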
6. Lessons Learned & Trade-Offs
- Caching vs. Freshness: The Redis cache cut latency by 40 % but introduced staleness; 30 s TTLs balance the cache benefit against freshness (see the sketch after this list).
- Quantization Impact: 4-bit quantization of Llama3-27B reduced memory by 60 % with only 2 % loss in embedding quality.
- Schema Evolution: New fields in customer CSVs broke ingestion. Solution: auto-generate schema migrations via validation hooks in the Celery ingestion pipeline.
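As a reference for the caching trade-off, here is a minimal sketch of a 30 s TTL wrapper around the hybrid search; the key scheme and serialization are illustrative, not the production implementation.

```python
# Minimal sketch of the 30 s TTL cache wrapper; key scheme and
# serialization are illustrative, not the production implementation.
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379, db=1)
CACHE_TTL_SECONDS = 30   # short enough to bound staleness, long enough to absorb hot traffic

def cached_hybrid_search(source_id: str, context: dict, top_k: int = 5) -> list:
    key = "rec:" + hashlib.sha256(
        json.dumps([source_id, context, top_k], sort_keys=True).encode()
    ).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)    # cache hit (~75 % of queries in production)

    result = hybrid_search(source_id, context, top_k)
    cache.set(key, json.dumps(result), ex=CACHE_TTL_SECONDS)
    return result
```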
7. Business Impact & ROI
- User Engagement Lift: Click-through rates on recommended items jumped 27 % post-launch.
- Operational Savings: Eliminating cloud vendor fees saved $120 K annually.
ROI Calculator: estimate your own savings by adjusting monthly query volume and per-query unit cost; a simplified version of the formula follows.
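A back-of-the-envelope version of that calculation, with the formula and example numbers as our simplification rather than the client's exact cost model:

```python
# Back-of-the-envelope ROI estimate; the formula and example numbers are a
# simplification for illustration, not the client's exact cost model.
def monthly_savings(queries_per_month: int,
                    cloud_cost_per_1k: float,
                    onprem_cost_per_1k: float) -> float:
    """Estimated monthly savings from serving recommendation traffic on-prem."""
    return queries_per_month / 1000 * (cloud_cost_per_1k - onprem_cost_per_1k)

# Roughly matches the case-study numbers:
# 80 M queries/month at $0.15 vs $0.025 per 1 K queries ≈ $10 K/month saved.
print(monthly_savings(80_000_000, 0.15, 0.025))
```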
8. Where 8tomic Labs Can Help
If you’re evaluating on-prem AI recommendations, 8tomic Labs can help with:
- Architecture workshops to align on SLAs and compliance.
- Prototype builds within 2 weeks.
- Performance tuning and long-term support.
Ready to power your enterprise app with lightning-fast, private recommendations?
Book an OnPrem AI Blueprint Session Today↗
Written by Arpan Mukherjee
Founder & CEO @ 8tomic Labs