How We Delivered On-Prem AI Recommendations to Power a No-Code Enterprise App

published on 29 July 2025

Executive Snapshot
Challenge: A leading no-code platform needed fast, private recommendation capabilities embedded in their on-prem deployments—without compromising compliance or performance.
Solution: 8tomic Labs designed an on-prem vector search and recommendation engine using pgvector, FastAPI, and a tailored Llama3-27B model, orchestrated via Celery.
Impact: Achieved <100 ms p95 recommendation latency, a 4× throughput increase, and ~$120 K in annual OpEx savings on cloud vendor fees.

1. Client Context & Challenge

The client—a mid-sized enterprise offering a proprietary no-code app builder—enables citizen developers to visually design business apps and forms through a drag-and-drop interface. Their platform supports dynamic forms, conditional fields, and workflow orchestration, empowering non-technical users to launch custom applications in hours rather than weeks.

However, they faced four critical requirements:

  1. Dynamic Configuration & Extensibility: Admins must onboard new data sources and recommendation flows via UI—no redeployments or code changes. The solution needed to let non-technical users supply SQL connections and custom queries on the fly to feed the recommendation engine.
  2. Data Privacy & Compliance: All customer data and recommendation logic must run 100 % on-prem; no cloud egress allowed due to pharma regulations.
  3. Performance SLAs: Embedded recommendations must surface in under 150 ms to maintain seamless end-user interactions within the no-code forms and apps.
  4. Zero-Code Integration: Product teams and end-users expected a plug-and-play widget—auto-embed into any form—without writing integration code for each deployment.

They explored cloud-based AI solutions, but their major pharmaceutical customers mandate air-gapped on-prem deployments, making cloud offerings infeasible for compliance reasons.

2. Solution Architecture

Our architecture balanced privacy, speed, and ease of integration:

Architecture diagram: data ingestion, vector storage, model serving, and the API layer interconnect to support sub-100 ms recommendations.

3. Implementation Details

We selected each component for its on-prem friendliness and performance:

  • Vector Store: pgvector extension on PostgreSQL 14. Handles 10 M vectors with IVFFlat indexing (see the index sketch after this list).
  • Embedding Model: Llama3-27B containerized via Modal; quantized to 4-bit for memory savings.
  • Orchestration: Celery distributed task queues manage long-running ingestion, cleaning, and re-vectorization jobs.
  • API: FastAPI with Uvicorn worker pool; Redis LRU cache for hot queries.
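
For concreteness, here is a minimal sketch of the vector store setup. The recommendations table name, column layout, and 1024-dimension embedding size are illustrative assumptions (match the dimension to the embedding model's output); the lists value follows pgvector's rule of thumb of roughly the square root of the row count for corpora past 1 M rows.

```python
import psycopg  # psycopg 3

# Hypothetical schema: one table holding each record's payload, a tsvector
# column for keyword search, and the pgvector embedding column.
DDL = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS recommendations (
        id        bigserial PRIMARY KEY,
        payload   jsonb NOT NULL,
        tsv       tsvector,               -- full-text index for keyword search
        embedding vector(1024) NOT NULL   -- placeholder dimension; match the model
    )
    """,
    # IVFFlat trades a little recall for speed; lists ~= sqrt(rows),
    # so ~3000 for a 10 M-vector corpus.
    """
    CREATE INDEX IF NOT EXISTS recommendations_embedding_idx
        ON recommendations USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 3000)
    """,
    "CREATE INDEX IF NOT EXISTS recommendations_tsv_idx ON recommendations USING gin (tsv)",
]

with psycopg.connect("postgresql://app@localhost/recs") as conn:
    for stmt in DDL:
        conn.execute(stmt)
```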

Dynamic Data Pipeline & API Definition

To support the no-code builder’s dynamic app and form creation, we implemented a configurable pipeline and API definition that administrators manage via the platform:

  1. SQL Connection & Query Interface: Admins supply a JDBC/ODBC connection string and a custom SQL query to pull historical or legacy data directly into the pipeline.
  2. Automated Cleaning & Vectorization: A Celery task picks up new ingestion jobs, executes the provided query, applies cleaning routines (null filtering, normalization), and generates embeddings for each record (see the ingestion sketch after this list).
  3. Dynamic API Registration: For each new data source, the system dynamically registers a FastAPI endpoint and Swagger definition. This endpoint handles user-triggered calls at runtime—fetching live vector searches when the no-code form is used (a registration sketch follows the summary below).
  4. Form-Builder Injection: The no-code platform injects a customizable widget into each form or app. At runtime, when a user interacts (e.g., starts filling a field), the widget calls the dynamic API to surface contextual recommendations—guiding the user or determining the next step in the workflow.
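
A minimal sketch of steps 1 and 2, assuming a Celery app backed by Redis, an internal embedding service at a hypothetical EMBED_URL, and the recommendations table sketched in the component list above. A production task would batch embedding calls rather than make one HTTP request per record.

```python
import httpx
import psycopg
from celery import Celery
from psycopg.types.json import Json

app = Celery("pipeline", broker="redis://localhost:6379/0")

EMBED_URL = "http://embedder.internal:8000/embed"  # assumed endpoint shape
PG_DSN = "postgresql://app@localhost/recs"

@app.task
def ingest_source(source_dsn: str, query: str) -> int:
    """Run the admin-supplied SQL, clean each row, embed it, and store it."""
    with psycopg.connect(source_dsn) as src, psycopg.connect(PG_DSN) as dst:
        cur = src.execute(query)
        cols = [d.name for d in cur.description]
        count = 0
        for row in cur:
            # Cleaning: drop nulls; real pipelines also normalize values here.
            record = {c: v for c, v in zip(cols, row) if v is not None}
            text = " ".join(str(v) for v in record.values())
            # Call the on-prem embedding service (hypothetical API shape).
            vec = httpx.post(EMBED_URL, json={"text": text}, timeout=30.0).json()["embedding"]
            dst.execute(
                "INSERT INTO recommendations (payload, tsv, embedding) "
                "VALUES (%s, to_tsvector(%s), %s::vector)",
                (Json(record), text, str(vec)),
            )
            count += 1
    return count
```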

This pattern lets non-technical admins onboard new data sources and recommendation flows without code changes—enabling true no-code extensibility and real-time AI assistance.
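
Step 3 can be illustrated with FastAPI's add_api_route, which registers routes at runtime and surfaces them in the generated Swagger docs. The function and endpoint names below are hypothetical, and run_dual_search stands in for the dual search described in the next section.

```python
from fastapi import FastAPI

api = FastAPI(title="Recommendation APIs")

async def run_dual_search(source_id: str, q: str, k: int) -> list[dict]:
    """Placeholder for the dual vector + keyword search described in section 4."""
    return []

def register_source_endpoint(source_id: str) -> None:
    """Register a per-source route at runtime; it appears in /docs (Swagger)."""

    async def recommend(q: str, k: int = 5) -> list[dict]:
        return await run_dual_search(source_id, q, k)

    api.add_api_route(
        f"/sources/{source_id}/recommend",
        recommend,
        methods=["GET"],
        name=f"recommend_{source_id}",
        summary=f"Contextual recommendations for data source '{source_id}'",
    )

# Example: a newly admin-configured source becomes a live endpoint.
register_source_endpoint("crm_accounts")
```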

4. Runtime Recommendations

Once the dynamic API is registered and the form widget injected, runtime recommendations follow a dual-search approach:

  1. Vector Similarity Search: The widget sends the current form context (e.g., filled fields, user selections) as an embedding request to Llama3-27B. The resulting vector is queried against pgvector to retrieve top-K similar records from historical data.
  2. Textual Keyword Search: In parallel, a lightweight full-text index (PostgreSQL tsvector) performs keyword matches on form fields or metadata.
  3. Score Blending: We normalize both the cosine similarity scores and the text search relevance scores, applying configurable weights (default: 70 % vector, 30 % text); a query sketch follows this section's summary.
  4. Final Ranking & Response: The combined score determines the ranked list of suggestions, which the widget displays contextually—either as inline hints, dropdown recommendations, or next-action prompts.

This dual-search design ensures both semantic relevance and exact keyword matching—delivering more accurate and comprehensive recommendations in real time.
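
Here is a sketch of how the two searches and the 70/30 blend can be expressed as a single query against the schema assumed earlier. Note that ts_rank and cosine similarity live on different scales; this simplified version clamps missing scores to zero, whereas a production version would min-max normalize each score set before blending.

```python
import psycopg

BLEND_SQL = """
WITH vec AS (
    SELECT id, payload, 1 - (embedding <=> %(vec)s::vector) AS vscore  -- cosine similarity
    FROM recommendations
    ORDER BY embedding <=> %(vec)s::vector
    LIMIT %(k)s
),
txt AS (
    SELECT id, payload, ts_rank(tsv, plainto_tsquery(%(q)s)) AS tscore
    FROM recommendations
    WHERE tsv @@ plainto_tsquery(%(q)s)
    ORDER BY tscore DESC
    LIMIT %(k)s
)
SELECT id,
       COALESCE(vec.payload, txt.payload) AS payload,
       0.7 * COALESCE(vec.vscore, 0) + 0.3 * COALESCE(txt.tscore, 0) AS score
FROM vec FULL OUTER JOIN txt USING (id)
ORDER BY score DESC
LIMIT %(k)s
"""

def blended_search(conn: psycopg.Connection, query_text: str,
                   query_vec: list[float], k: int = 5) -> list[tuple]:
    """Run both searches and return the top-k blended results."""
    params = {"q": query_text, "vec": str(query_vec), "k": k}
    return conn.execute(BLEND_SQL, params).fetchall()
```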

5. Performance & Benchmark Results

We ran load tests simulating 1 000 QPS of recommendation queries; a sketch of the kind of harness used follows.
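
The exact tooling isn't shown here, but a harness along these lines reproduces the shape of the test; the endpoint URL, pacing logic, and connection limits are illustrative.

```python
import asyncio
import statistics
import time

import httpx

URL = "http://localhost:8000/sources/crm_accounts/recommend?q=invoice"  # illustrative
QPS, SECONDS = 1000, 30

async def main() -> None:
    latencies: list[float] = []

    async def one(client: httpx.AsyncClient) -> None:
        t0 = time.perf_counter()
        await client.get(URL)
        latencies.append((time.perf_counter() - t0) * 1000)  # ms

    limits = httpx.Limits(max_connections=2000)
    async with httpx.AsyncClient(limits=limits) as client:
        tasks = []
        for _ in range(QPS * SECONDS):
            tasks.append(asyncio.create_task(one(client)))
            await asyncio.sleep(1 / QPS)  # pace requests at ~1000 QPS
        await asyncio.gather(*tasks)

    # 95th percentile: the 19th of 19 cut points when splitting into 20 groups.
    print(f"p95 latency: {statistics.quantiles(latencies, n=20)[18]:.1f} ms")

if __name__ == "__main__":
    asyncio.run(main())
```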

The benchmarks below quantify how the on-prem design compares with the previous cloud setup on both speed and cost.

Metric           Baseline (Cloud)   On-Prem (8tomic)   Improvement
p95 Latency      180 ms             <100 ms            1.8× faster
Throughput       500 RPS            2 000 RPS          4× higher
Cost / Month     >$12 K             $2 K               83 % savings
Cache Hit Rate   —                  75 %               —

6. Lessons Learned & Trade-Offs

  1. Caching vs. Freshness: The Redis cache cut latency by 40 % but introduced staleness. We added 30 s TTLs to balance freshness against the cache benefit (a sketch follows this list).
  2. Quantization Impact: 4-bit quantization of Llama3-27B reduced memory by 60 % with only 2 % loss in embedding quality.
  3. Schema Evolution: New fields in customer CSVs broke ingestion. Solution: auto-generate migrations via validation hooks in the Celery ingestion pipeline.
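
The caching lesson reduces to a familiar pattern: a short-TTL read-through cache in front of the dual search. A minimal sketch, with hypothetical key naming:

```python
import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=1)
TTL_SECONDS = 30  # bounds staleness while keeping most of the ~40% latency win

def cached_recommend(source_id: str, query: str, compute) -> list:
    """Return cached results when fresh; otherwise compute and cache them."""
    key = "rec:" + hashlib.sha1(f"{source_id}:{query}".encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)          # cache hit: skip the dual search
    result = compute(source_id, query)  # cache miss: fall through to dual search
    r.setex(key, TTL_SECONDS, json.dumps(result))
    return result
```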


7. Business Impact & ROI

User Engagement Lift: Click-through rates on recommended items jumped 27 % post-launch.
Operational Savings: Eliminating cloud vendor fees saved $120 K annually.

ROI Calculator: estimate your own savings by adjusting query volume and unit cost.
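
For readers without the interactive widget, the same estimate is a short function. The per-1 000-query unit costs below are illustrative placeholders, chosen so the default inputs reproduce this deployment's ~$10 K/month delta.

```python
def annual_savings(queries_per_month: float,
                   cloud_cost_per_1k: float,
                   onprem_cost_per_1k: float) -> float:
    """Annual OpEx delta for a given query volume and per-1k-query unit costs."""
    monthly = queries_per_month * (cloud_cost_per_1k - onprem_cost_per_1k) / 1000
    return monthly * 12

# Illustrative inputs roughly matching this deployment (~$12 K vs ~$2 K per month):
print(f"${annual_savings(50_000_000, 0.24, 0.04):,.0f}/yr")  # -> $120,000/yr
```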

8. Where 8tomic Labs Can Help

If you’re evaluating on-prem AI recommendations, 8tomic Labs can help with:

  • Architecture workshops to align on SLAs and compliance.
  • Prototype builds within 2 weeks.
  • Performance tuning and long-term support.

Ready to power your enterprise app with lightning-fast, private recommendations?

Book an OnPrem AI Blueprint Session Today↗

Written by Arpan Mukherjee

Founder & CEO @ 8tomic Labs
