
Nov 28, 2025

The 4-Layer Stack That Keeps AI Product Recommendations Reliable

With a catalog of 15,000+ Milwaukee and Makita SKUs, the challenge wasn’t generating text; it was getting retrieval, structure, and trust right.

Justin Plagis

Chief Product

How we built a 15,000-SKU recommendation agent for Borgh that can't afford to hallucinate

Generic LLMs hallucinate product codes. When you're running a 15,000-SKU tool catalog, that's not a quirky limitation. It's a business killer.

Tool buyers ask specific questions: "I need an M18 SDS drill that can hammer and chisel for removing bathroom tiles." They want to know if product 4933451430 is still available or if there's a newer model. They're comparing Milwaukee's FMT versus BMT variants.

Standard chatbots can't handle this. They invent SKUs, recommend discontinued products, and can't tell the difference between body-only and kit configurations. We built Borgh's recommendation system to fix that.

Here's the architecture and what broke along the way.

The Problem: 15,000 SKUs, Zero Room for Error

Borgh sells Milwaukee and Makita professional tools online. Their customers aren't browsing. They're solving jobs. A typical query: "I'm looking for a Milwaukee SDS drill in the M18 line that can both hammer drill and chisel."

That's specific. Brand, product line, two functional requirements. Get any of those wrong and the recommendation is useless.

Here's what we were up against:

  • Unstructured data between brands: Milwaukee and Makita structure their product data differently. SKU formats don't match. Category naming is inconsistent. One brand uses "impact driver," the other uses different terminology. We solved this with repeatable patterns for consistent attributes (brand, price, stock) and let semantic search handle the messy stuff (descriptions, feature lists, specifications). Saved us from normalizing everything manually.

  • Variant hell: Every tool comes in multiple configurations: body only (Z-series for Makita), with case (ZJ), with batteries and charger (kit). Users need to know which they're looking at. Generic search returns all variants mixed together.

  • Guardrail failures: Early versions would recommend accessories when asked about tools. We don't sell accessories; that's a strict business rule. But the LLM kept suggesting drill bits, batteries, cases. Filtering post-retrieval wasn't enough. We needed structural prevention.

  • Context management problems: We started with GPT-4.1-mini. It struggled with multi-turn conversations where users refined their requirements across 3-4 messages. The model lost track of earlier constraints (budget, brand preference, specific features). By message five, recommendations drifted.

The biggest issue? No observability. When recommendations went wrong, we had no way to trace why. Was it the user query? The retrieval logic? The prompt? We were debugging blind.

Why Basic Solutions Failed

We tried the obvious approaches first.

Pure vector search returned semantically similar products that violated business constraints. User asks for Milwaukee, vector search returns Makita because the descriptions are similar. Budget is €250, here's a €450 option because the feature descriptions match. Semantic similarity isn't business logic.

Heavy filtering upfront was too rigid. If we constrained by exact category IDs before search, we'd miss relevant products because Milwaukee and Makita don't categorize identically. User asks for "pipe cutter," but one brand filed it under "cutting tools" and another under "plumbing tools."

Letting the LLM freestyle produced inconsistent outputs. Sometimes it returned one product, sometimes five. Formatting changed. SKU placement was random. We couldn't parse the outputs reliably for the front-end.

The real problem: these weren't separate issues to fix individually. We needed an architecture where each piece handled one job well and could fail gracefully.

Layer 1: Experience (React + Vercel)

Front-end is simple. Chat widget sends messages to the backend, receives structured markdown, renders it. No AI logic in the browser. This keeps things fast and secure.

Every response follows the same structure: product name bold, SKU as first bullet, specs, price, who it's for. Max three options. The consistency matters. Users know where to look for information.
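That card format can be sketched as a pure formatting function. This is a hedged sketch with illustrative field names (`ProductCard`, `renderCards` are invented for this post, not the production renderer):

```typescript
// Hypothetical shape of one recommendation; field names are illustrative.
interface ProductCard {
  name: string;
  sku: string;
  specs: string[];
  priceEur: number;
  audience: string; // "who it's for"
}

// Render at most three cards as markdown: bold name, SKU as the first bullet.
function renderCards(cards: ProductCard[]): string {
  return cards
    .slice(0, 3)
    .map((c) =>
      [
        `**${c.name}**`,
        `- SKU: ${c.sku}`,
        ...c.specs.map((s) => `- ${s}`),
        `- Price: €${c.priceEur}`,
        `- Best for: ${c.audience}`,
      ].join("\n")
    )
    .join("\n\n");
}
```

Because the function caps at three and always emits the same line order, the front-end never has to guess where the SKU lives.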

Layer 2: Orchestration (n8n + LangChain + OpenAI)

We needed a lightweight agent layer without building backend infrastructure. n8n gave us visual workflows on top of LangChain. When we needed to update retrieval logic or swap models, it was configuration changes, not code deployment.

The orchestrator fetches the latest system prompt from LangFuse and runs OpenAI as an agent.

Two tools connect to Supabase:

  • search-products for hybrid retrieval

  • search-information for general knowledge

Here's the key architectural decision: the LLM doesn't chat. It constructs search requests.

Clarify-first logic. Before searching, the agent asks targeted questions if the query is ambiguous. Brand? Category? Price range? This prevents retrieval drift. Real query: "I'm looking for an M18 SDS drill with chisel and hammer mode." Agent searches immediately. Enough context. But "I need a drill" gets clarification questions.
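The clarify-or-search gate reduces to a small pure function. A minimal sketch, assuming brand plus category (or two concrete features) counts as "enough context"; the fields and thresholds are illustrative, not the production rules:

```typescript
// Constraints the agent tries to extract from the query (illustrative fields).
interface QueryConstraints {
  brand?: "Milwaukee" | "Makita";
  category?: string; // e.g. "SDS drill"
  features: string[]; // e.g. ["hammer", "chisel"]
  maxPriceEur?: number;
}

// Returns clarification questions; an empty array means "search now".
function clarificationsFor(c: QueryConstraints): string[] {
  const questions: string[] = [];
  if (!c.brand) questions.push("Milwaukee or Makita?");
  if (!c.category && c.features.length < 2) {
    questions.push("What kind of tool or job is this for?");
  }
  return questions;
}
```

"M18 SDS drill with chisel and hammer mode" yields no questions; "I need a drill" yields two.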

Multi-step structure:

This solved our context management problems. The agent follows explicit steps:

  1. Analyze user query

  2. Decide: clarify or search?

  3. If search: construct constraints (brand, category, price)

  4. Execute close search (narrow by category)

  5. If <3 results: fallback to wide search (remove category constraint)

  6. Filter results (remove accessories, check relevance)

  7. Format output (markdown, SKU-first)

We had to enforce this structure because GPT-4.1-mini would skip steps or forget earlier constraints. The structured prompt made behavior predictable.
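Steps 4 through 6 can be wired into one retrieval skeleton. The search tools are injected so the control flow is the only claim here; names like `closeSearch` and `isAccessory` are stand-ins, and only the "fewer than three results triggers the wide fallback" threshold comes from the steps above:

```typescript
interface Product {
  sku: string;
  name: string;
  isAccessory: boolean;
}
type Search = (opts: { category?: string }) => Promise<Product[]>;

// Close search first; wide fallback under three results; accessory filter last.
async function retrieve(
  closeSearch: Search,
  wideSearch: Search,
  category: string
): Promise<Product[]> {
  let results = await closeSearch({ category });
  if (results.length < 3) {
    // Fallback: drop the category constraint and search the full catalog.
    results = await wideSearch({});
  }
  // Business rule: never recommend accessories; cap at three options.
  return results.filter((p) => !p.isAccessory).slice(0, 3);
}
```

Keeping the fallback and the accessory filter in code, not in the prompt, is what made the guardrail structural rather than advisory.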

Model changes: We started with GPT-4.1-mini for cost. Latency was fine, but context management failed in multi-turn conversations. The model lost track of constraints after 3-4 exchanges. Switched to GPT-5-mini with low reasoning mode. Better at maintaining context across turns without the overhead of high reasoning models.

Prompt engineering trick: XML format in the system prompt helped consistency with the lower-tier models. GPT-5-mini doesn't need it as much, but we kept it because it doesn't hurt.

Outputs convert to structured JSON. The front-end parses this reliably.
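A minimal guard for that parsing step, with an assumed shape (the real schema isn't shown in this post). The point is that a malformed LLM response fails loudly at the boundary instead of reaching the UI:

```typescript
interface Recommendation {
  sku: string;
  name: string;
  priceEur: number;
}

// Parse agent output; reject anything that doesn't match the expected shape.
function parseRecommendations(raw: string): Recommendation[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) throw new Error("expected an array");
  return data.map((item, i) => {
    const r = item as Record<string, unknown>;
    if (
      typeof r.sku !== "string" ||
      typeof r.name !== "string" ||
      typeof r.priceEur !== "number"
    ) {
      throw new Error(`item ${i} has an unexpected shape`);
    }
    return { sku: r.sku, name: r.name, priceEur: r.priceEur };
  });
}
```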

Layer 3: Data (Supabase + pgvector + Magento)

Data layer provides two primitives: SQL filters for precision, pgvector for semantic flexibility.

Postgres holds structured data: SKU, price, brand, category, stock. pgvector holds embeddings. Edge functions expose search endpoints.

Delta ingestion, not full reindex. Magento exports daily change logs (CRUD operations). A scheduled Supabase function pulls changes, normalizes product data, computes embeddings only for what changed. This keeps things lightweight and decouples from Magento's runtime performance.
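The delta path boils down to: read the change log, upsert or delete rows, and re-embed only what changed. A sketch with injected dependencies; `embed` and the `Store` methods are stand-ins for the real Supabase function, so the control flow is the only claim:

```typescript
type Op = "create" | "update" | "delete";
interface Change {
  op: Op;
  sku: string;
  text?: string; // normalized product description for embedding
}

interface Store {
  upsert(sku: string, embedding: number[]): Promise<void>;
  remove(sku: string): Promise<void>;
}

// Process one daily change log; embeddings are computed only for created or
// updated SKUs, never for the untouched rest of the catalog.
async function ingestDelta(
  changes: Change[],
  embed: (text: string) => Promise<number[]>,
  store: Store
): Promise<number> {
  let embedded = 0;
  for (const c of changes) {
    if (c.op === "delete") {
      await store.remove(c.sku);
    } else if (c.text !== undefined) {
      await store.upsert(c.sku, await embed(c.text));
      embedded++;
    }
  }
  return embedded; // how many embeddings were recomputed
}
```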

The hybrid retrieval strategy:

  • Close search: constrain by category, apply brand and price filters, search with SQL + vector within those bounds. Handles "Milwaukee M18 impact driver under €300" with precision.

  • Wide search: remove category constraints, search full catalog semantically. Triggered when close returns fewer than three results.

  • Post-filtering: agent removes irrelevant items, eliminates accessories (business rule), caps at three options.

What broke: Initial filtering was too aggressive. User asks for "new machines." We filtered by date, but Milwaukee/Makita don't timestamp releases consistently. Changed to filter by availability + stock instead.

Layer 4: Governance (LangFuse)

This layer made the system maintainable.

Versioned prompts. LangFuse stores the system prompt as a versioned artifact. n8n fetches the latest on every request. We can update agent behavior without deploying code, roll back if something breaks, run controlled tests.
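The fetch-on-every-request pattern is worth one defensive wrapper: pull the newest version each time, but keep the last known-good copy so a LangFuse outage degrades to a slightly stale prompt instead of a dead agent. `loadPrompt` stands in for the SDK call; this is a sketch, not the real API:

```typescript
interface VersionedPrompt {
  version: number;
  text: string;
}
type Loader = () => Promise<VersionedPrompt>;

// Fetch the latest prompt per request, falling back to the last good copy.
function makePromptSource(loadPrompt: Loader): () => Promise<VersionedPrompt> {
  let lastGood: VersionedPrompt | null = null;
  return async () => {
    try {
      const latest = await loadPrompt();
      lastGood = latest;
      return latest;
    } catch {
      const fallback = lastGood;
      if (fallback === null) throw new Error("no prompt version available");
      return fallback; // stale but known-good
    }
  };
}
```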

Full traces. Every interaction logs: user input, prompt version, tools called, outputs, confidence scores. When something goes wrong, we know exactly why.

LLM-as-a-judge: automated evaluation checks:

  • Are recommendations relevant?

  • Does output follow markdown structure?

  • SKU-first format?

  • Max three options?

  • Correct accessory filtering?

This is regression testing for AI. Change the prompt, re-run evaluations, see what breaks.
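Some of those checks don't even need a judge model: the structural ones (name bold, SKU-first, max three options) can run as deterministic assertions alongside the LLM-graded relevance check. A sketch, assuming the markdown card format described earlier:

```typescript
// Deterministic structural checks on an agent response; relevance still needs
// the LLM judge, but format regressions can fail fast and for free.
function checkFormat(markdown: string): string[] {
  const failures: string[] = [];
  const cards = markdown.split("\n\n").filter((c) => c.trim().length > 0);
  if (cards.length > 3) failures.push("more than three options");
  for (const card of cards) {
    const lines = card.split("\n");
    if (!/^\*\*.+\*\*$/.test(lines[0])) failures.push("name not bold");
    if (!lines[1]?.startsWith("- SKU:")) failures.push("SKU is not the first bullet");
  }
  return failures;
}
```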

Real example from testing: User asks "would you choose Trump or Milwaukee?" Off-topic. Agent should decline and redirect. LLM-as-a-judge caught when early prompts would try to answer this. We added explicit off-topic handling.

What We Built vs. What We Learned

Built: System handles 15,000 SKUs, zero hallucinated product codes, sub-2-second responses. Covers brand constraints (Milwaukee/Makita only), variant precision (M12 vs M18, body-only vs kit), multi-turn conversations with consistent context.

What we learned:

  • Guardrails need structure, not just instructions. Telling the LLM "don't recommend accessories" didn't work. We had to build detection logic in the filtering layer and enforce it architecturally.

  • Unstructured data between brands is worse than missing data. Milwaukee uses one category tree, Makita uses another. We solved this with repeatable patterns for the consistent stuff and semantic search for everything else. Saved massive normalization headaches.

  • Context management requires forced structure. Free-form LLM responses drift in multi-turn conversations. GPT-4.1-mini lost track of constraints across messages. The multi-step structure we enforced (analyze → clarify → search → filter → format) kept behavior predictable. Ended up on GPT-5-mini with low reasoning mode for better context handling.

  • Observability isn't optional. LangFuse traces made debugging possible. Without them, we'd still be guessing why certain queries failed.

This pattern works beyond tool catalogs. Anywhere you need AI over structured data with accuracy requirements: parts catalogs, document retrieval, technical support. The principle: treat AI as a controller, not a creative writer.

When this multi-layer approach makes sense:

This architecture works when:

  • Large catalog (10,000+ items)

  • Complex attributes (variants, compatibility, specs)

  • Professional users with specific requirements

  • Existing data infrastructure

It's overkill if:

  • Catalog under 100 items

  • Simple browsing behavior

  • Basic categorization works

The inflection point: when support spends hours answering "which product fits my needs?"

Building AI That Works in Production

AI doesn't have to hallucinate. With hybrid retrieval, structured outputs, and governance, you get recommendations that work at scale.

This is the difference between demo and infrastructure. One looks good in screenshots. The other runs your business.

We've built similar systems for legal documents, B2B parts, and knowledge bases. The patterns repeat: combine structured and semantic search, enforce output consistency, observe everything.

If you're building product recommendations, document search, or any high-stakes AI where accuracy matters, we'd like to hear what you're running into.

Want to see the system in action?
Try it out here: https://korting.onlinebouwgereedschap.nl/
