Nov 28, 2025
The Tech Behind a Reliable eCommerce AI Agent
For Borgh, we set out to build an AI assistant that feels less like a chatbot and more like a domain-aware product expert.

Justin Plagis
Chief Product
With a catalog of 15,000+ Milwaukee and Makita SKUs, the challenge wasn’t generating text — it was getting retrieval, structure and trust right.
From a CTO's lens, this project demonstrates how to combine vector search, traditional filtering, prompt governance and lightweight orchestration into a production-safe e-commerce agent. The result runs across Supabase, LangChain (via n8n), OpenAI and LangFuse, with a React front-end on Vercel.
Below is the architecture and reasoning behind it.
Why we didn’t “just add a chatbot”
Tool shoppers aren’t browsing for inspiration — they’re solving a job. Their queries typically involve:
Brand preferences (“Makita M-series?”)
Specific variants (“M12 vs M18?”)
Edge constraints (budget, category, availability)
SKU-level precision
Generic LLM chat falls short here. It hallucinates SKUs, misses category boundaries, and lacks observability. We needed an agent that behaves more like a typed API client than a text generator — with controlled formatting, deterministic retrieval, and prompt governance.
This set the core requirements:
Hybrid retrieval: structured filtering + pgvector semantic search
Deterministic output structure: always SKU-first, always markdown
Observability: traces, versioned prompts, LLM-as-a-judge scoring
Composable orchestration: no monolithic backend, just tools + workflows
Replaceable components: interchangeable LLM, modular edge functions, simple front-end integration
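The "deterministic output structure" requirement can be pinned down as a typed response shape. A minimal sketch in TypeScript, where the field names are our illustration rather than the production schema:

```typescript
// Hypothetical response shape for the agent's structured answers.
// Field names are illustrative, not the production schema.
interface ProductOption {
  sku: string; // always present: the SKU-first rule
  name: string;
  brand: string;
  price: number;
}

interface AgentAnswer {
  markdown: string;         // rendered by the React widget
  options: ProductOption[]; // never more than three
}

// Render options as SKU-first markdown bullets.
function toMarkdown(options: ProductOption[]): string {
  return options
    .slice(0, 3) // hard cap: at most three options
    .map((o) => `- **${o.sku}**: ${o.name} (${o.brand}, €${o.price})`)
    .join("\n");
}
```

Treating the answer as data first and prose second is what makes the output testable downstream.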
The Architectural Overview
We built the stack around four components:
Experience layer:
React chat widget on Vercel
Sends messages + session ID to an n8n webhook
Renders structured markdown answers
Orchestration layer (n8n):
Fetches latest system prompt from LangFuse
Runs an OpenAI 5.1 mini model as an agent
Provides two Supabase-backed tools:
search-products
search-information
Converts the model output into a structured JSON response
Logs to Google Sheets for lightweight QA
Data layer (Supabase):
Postgres for structured product data
pgvector for semantic embeddings
Edge functions that expose search endpoints to the agent
Acts as the central, typed data platform
Content governance (LangFuse):
Stores versioned system prompts
Captures traces and tool usage
Runs LLM-as-a-judge evaluations for regression testing
Provides analytics for prompt iteration
This architecture keeps each piece isolated, replaceable and inspectable.
Keeping 15,000 SKUs fresh: Magento → Supabase → pgvector
Rather than reindexing the full catalog, we rely on delta ingestion from Magento:
Magento exports daily change logs (CRUD operations) to an sFTP location.
A scheduled Supabase edge function pulls and parses the logs.
Each changed product is normalized into:
Core fields (SKU, price, brand, category, stock)
Descriptive text for embeddings
We compute OpenAI embeddings and update:
Postgres (structured data)
pgvector (semantic layer)
This gives the agent two search primitives:
Precision via SQL filters
Flexibility via semantic similarity
It’s lightweight, predictable, and avoids coupling to Magento’s runtime performance.
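The normalization step above can be sketched as two pure functions: one producing the structured fields for Postgres, one producing the text that gets embedded. The row shape and field names here are assumptions about the Magento export, not its actual format:

```typescript
// Sketch of the normalization step in the delta-ingestion job.
// The row shape is an assumption about the Magento change log.
interface MagentoRow {
  op: "create" | "update" | "delete";
  sku: string;
  name: string;
  brand: string;
  category: string;
  price: number;
  stock: number;
  description: string;
}

// Core fields stored in Postgres for SQL filtering.
function coreFields(row: MagentoRow) {
  const { sku, price, brand, category, stock } = row;
  return { sku, price, brand, category, stock };
}

// Descriptive text that gets embedded for the pgvector layer.
function embeddingText(row: MagentoRow): string {
  return `${row.brand} ${row.name} (${row.category}). ${row.description}`;
}
```

Keeping the two outputs separate is what gives the agent its two distinct search primitives.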
Making the agent reliable: a multi-step retrieval strategy
The key decision was to treat the LLM not as a “chat model”, but as a controller that produces deterministic search requests.
Everything the agent does (tone, structure, constraints, and methodology) is defined in the system prompt and governed through LangFuse.
Clarify-First
Before any search, the agent must ask 1–3 targeted questions if the query is ambiguous (brand, category, price, number of results). This prevents retrieval drift and significantly reduces irrelevant matches.
Internal Query Plan
The LLM must internally construct (but not display) a typed JSON query plan defining:
Intent (find, compare, browse, order_note)
Mode (close vs. wide)
Constraints (brand, category_ids, min/max price, result count)
Tool configuration (search type, allowed categories, limit, shape)
This forces the model to reason explicitly before making a tool call.
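A query plan along these lines could look like the following sketch. The exact schema lives in the system prompt, so treat these type and field names as assumptions:

```typescript
// Illustrative shape of the internal query plan; the real schema is
// defined in the system prompt, so these names are assumptions.
type Intent = "find" | "compare" | "browse" | "order_note";
type Mode = "close" | "wide";

interface QueryPlan {
  intent: Intent;
  mode: Mode;
  constraints: {
    brand?: string;
    categoryIds?: number[];
    minPrice?: number;
    maxPrice?: number;
    resultCount: number;
  };
  tool: {
    searchType: "hybrid" | "semantic" | "sql";
    limit: number;
  };
}

// Example plan for a query like "Makita drill under 150":
const plan: QueryPlan = {
  intent: "find",
  mode: "close",
  constraints: { brand: "Makita", maxPrice: 150, resultCount: 3 },
  tool: { searchType: "hybrid", limit: 20 },
};
```

Because the plan is typed, a malformed tool call fails loudly instead of producing a vague search.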
Close → Wide search pattern
We enforce a two-stage search:
Close mode
Constrain to category IDs (derived from a seed SKU or query)
Apply brand + price filters
Use hybrid search as default
Wide mode (fallback)
Triggered automatically if close returns insufficient results
Removes category constraints
Uses semantic/hybrid retrieval over the full catalog
This mirrors how human sales experts narrow → broaden depending on available inventory.
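The close → wide fallback is simple to express in code. In this sketch, `search` stands in for the Supabase tool call, and its signature plus the widening threshold are assumptions:

```typescript
// Sketch of the close -> wide fallback. `search` stands in for the
// Supabase tool call; its signature here is an assumption.
interface SearchParams {
  query: string;
  categoryIds?: number[];
  brand?: string;
  maxPrice?: number;
}

type SearchFn = (params: SearchParams) => { sku: string }[];

const MIN_RESULTS = 3; // widening threshold (illustrative)

function closeThenWide(search: SearchFn, params: SearchParams) {
  // Close mode: keep all constraints, including category IDs.
  const close = search(params);
  if (close.length >= MIN_RESULTS) {
    return { mode: "close", results: close };
  }
  // Wide mode: drop the category constraint, keep everything else.
  const { categoryIds, ...rest } = params;
  return { mode: "wide", results: search(rest) };
}
```

The agent never has to decide *whether* to widen; the pattern makes the fallback automatic and observable in traces.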
Post-Filtering & Diversity
After retrieval, the agent must:
Remove irrelevant or cross-category items
Eliminate accessories (strict rule)
Produce a maximum of three options
Provide brand or variant diversity wherever possible
Finally, the answer is formatted in stable markdown with SKU-first bullets.
This is the difference between “chatbot answers” and ecommerce-grade product recommendations.
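The post-filter rules above can be sketched as a pure function: drop accessories and cross-category hits, cap at three, and greedily prefer brand diversity. The candidate shape and the `isAccessory` flag are illustrative assumptions:

```typescript
// Sketch of the post-retrieval filter. The candidate shape and the
// accessory flag are illustrative assumptions, not the real schema.
interface Candidate {
  sku: string;
  brand: string;
  category: string;
  isAccessory: boolean;
}

function postFilter(candidates: Candidate[], allowedCategories: string[]): Candidate[] {
  // Strict rules: no accessories, no cross-category items.
  const relevant = candidates.filter(
    (c) => !c.isAccessory && allowedCategories.includes(c.category)
  );
  // Greedy brand diversity: pick unseen brands first...
  const picked: Candidate[] = [];
  const seen = new Set<string>();
  for (const c of relevant) {
    if (picked.length === 3) break;
    if (!seen.has(c.brand)) {
      picked.push(c);
      seen.add(c.brand);
    }
  }
  // ...then fill remaining slots up to the max of three.
  for (const c of relevant) {
    if (picked.length === 3) break;
    if (!picked.includes(c)) picked.push(c);
  }
  return picked;
}
```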
Prompt governance & observability with LangFuse
The project only works because we built observability into the core.
Versioned prompts
LangFuse stores the system prompt (borgh-chat-prompt) as a versioned artifact. n8n retrieves the latest version on every request. This lets us:
Update behavior without deployment
Roll back instantly
Run controlled A/B or phased rollouts
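The instant-rollback property boils down to versions being immutable and "latest" being a movable pointer. LangFuse handles this in production; this toy in-memory model just illustrates the mechanic:

```typescript
// Toy in-memory model of versioned prompts with instant rollback.
// LangFuse provides this in production; this only illustrates the idea.
class PromptStore {
  private versions: string[] = [];
  private active = -1;

  // Publishing appends a new immutable version and activates it.
  publish(prompt: string): number {
    this.versions.push(prompt);
    this.active = this.versions.length - 1;
    return this.active;
  }

  latest(): string {
    return this.versions[this.active];
  }

  // Rollback is just moving the pointer: no redeploy involved.
  rollback(toVersion: number): void {
    if (toVersion < 0 || toVersion >= this.versions.length) {
      throw new Error("unknown version");
    }
    this.active = toVersion;
  }
}
```

Because n8n fetches the active prompt on every request, a pointer move takes effect immediately.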
Traces & tool visibility
Every interaction logs:
User input
System prompt version
Tool calls (search-products, search-information)
Final output
Confidence scores
This makes the agent fully inspectable — a non-negotiable requirement for commercial use.
LLM-as-a-judge
We run automated evaluations that check:
Relevance of recommended products
Structural compliance (markdown, SKU-first, max 3 options)
Tone consistency with the Borgh persona
Correctness of “clarify-first” logic
These scores allow regression testing for each prompt or retrieval change, similar to traditional API behavior tests.
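Not every check needs an LLM judge: structural compliance is cheap to verify deterministically. A sketch of such a checker, where the SKU pattern is an assumption rather than the production regex:

```typescript
// Deterministic structural-compliance check used alongside the
// LLM-as-a-judge scores: SKU-first bullets and the max-three rule.
// The SKU regex is an assumption, not the production pattern.
function checkStructure(markdown: string): { ok: boolean; reasons: string[] } {
  const reasons: string[] = [];
  const bullets = markdown
    .split("\n")
    .filter((line) => line.trimStart().startsWith("- "));
  if (bullets.length > 3) {
    reasons.push("more than three options");
  }
  for (const b of bullets) {
    // SKU-first: each bullet must open with a bolded SKU-like token.
    if (!/^- \*\*[A-Z0-9-]+\*\*/.test(b.trim())) {
      reasons.push(`bullet not SKU-first: ${b.trim()}`);
    }
  }
  return { ok: reasons.length === 0, reasons };
}
```

Running the cheap deterministic checks first keeps judge calls for the questions that genuinely need one, like relevance and tone.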
Lightweight QA in Google Sheets
We complement LangFuse with a Google Sheet log for human QA, making stakeholder review easy without exposing internal systems.
What’s next
With retrieval, governance and UX foundations in place, the next steps are straightforward: expand the agent's capabilities to include:
Negotiating discounts so customers get the lowest prices
Letting customers order directly from the agent via the Magento backend
Handling basic support actions such as invoicing, shipping and billing information
Last but not least, here's an outline of the architecture:
[ Shopper ]
     |
     v
[ Website (React, Vercel) ]
     |  POST message
     v
[ n8n Orchestrator ]
  - Fetch prompt from LangFuse
  - LangChain AI Agent (OpenAI 5.1 mini)
  - Tools:
      * search-products (Supabase)
      * search-information (Supabase)
  - Parse output
  - Log to Sheets
     |
     v
[ Front-end renders structured markdown ]

Side connections:
[ LangFuse ] <-> prompts, traces, evaluations
[ Supabase ] -> Postgres + pgvector + edge functions
[ Magento ]  -> sFTP -> Supabase ingestion job

