Multi-tenant RAG + Agent platform built 100% on AWS Bedrock — Cohere
embed-multilingual-v3for embeddings, Claude Sonnet 4.6 for generation, a LangGraph tool-calling agent over RAG + mock-ERP tools, FastAPI + pgvector backend, Svelte 5 frontend, packaged as a Helm chart with Prometheus/Grafana observability, fronted by a Kong API gateway and traced with LangFuse.Built by a CTO with 22+ years of enterprise systems (SAP / Epicor / Tiptop ERP + BI) — practical AI engineering grounded in real production constraints, not toy demos.
🔒 This is a showcase repository
Architecture, decisions, demo, and tech writeup are public here. Full source code lives in a private repository — recruiters can request read access by emailing [email protected] with their GitHub username.
30-second walkthrough — upload README → ask in Traditional Chinese → get source-cited answer.
RAG test cases the system handles correctly:
A LangGraph agent decides which tool to call — RAG knowledge-base search or one of five mock-ERP tools — and chains them for multi-step reasoning. Every response includes an auditable tool trace.
erp_check_inventory then erp_estimate_delivery, and summarizes stock + delivery dateerp_list_inventory (structured ERP data, not the knowledge base)Hard guardrails live in the harness (tool allow-list, tenant isolation, recursion cap), not just the prompt.
The agent chained erp_check_inventory → erp_estimate_delivery and rendered a stock + delivery summary — the tool trace above the answer is fully auditable.
After 22+ years deploying SAP, Epicor and Tiptop ERP across manufacturing and consulting, I keep seeing the same pattern when companies try to "do AI":
Enterprise AI Platform is the reference architecture I built to address exactly that — a multi-tenant RAG service where every BU shares one infrastructure, all LLM traffic goes through a single AWS Bedrock IAM key, and source-cited answers ground the AI in actual documents (not hallucinations).
This is the kind of system I would deploy on day one if asked to lead AI integration at a manufacturing or ERP-heavy organization.
flowchart TB
UI["Svelte 5 (runes) + Vite + Tailwind<br/>RAG chat · Agent chat (tool trace) · Upload"]
API["FastAPI async (Python 3.12)<br/>/ingest · /rag/query · /agent/chat · /erp/* · /metrics"]
AGENT["LangGraph agent (custom StateGraph)<br/>agent → tools → agent loop"]
TOOLS["Tools: RAG search + 5 mock-ERP<br/>(inventory · orders · partners · delivery)"]
DB[("PostgreSQL 16 + pgvector<br/>1024 dim · cosine · multi-tenant")]
EMB["Bedrock: Cohere embed-multilingual-v3"]
LLM["Bedrock: Claude Sonnet 4.6<br/>(US cross-region inference profile)"]
UI -->|fetch JSON| API
API --> AGENT
AGENT -->|bind_tools| LLM
AGENT --> TOOLS
API -->|SQLAlchemy 2.0 async| DB
API -->|boto3| EMB
TOOLS -.->|RAG| DB
EMB -.->|1024-d vector| DB
Deployment & ops: packaged as a single Helm chart (backend + frontend + in-cluster pgvector); Prometheus scrapes /metrics via a ServiceMonitor, and a Grafana dashboard auto-imports (verified on minikube + kube-prometheus-stack). A Kong API gateway (DB-less) fronts the backend with API-key auth + rate limiting, and LangFuse traces every agent run.
search_document (ingest) vs search_query (retrieval). Higher accuracy than symmetric embeddings on multilingual content.us.anthropic.claude-sonnet-4-6 auto-fails over across US regions. No extra cost, no manual fallback code.tenant_id. Designed for the real enterprise case of one shared service across BUs.AWS_BEARER_TOKEN_BEDROCK covers both Cohere + Claude. One bill, one audit trail.[1][2] citations + collapsible source cards. Validated against "out-of-scope" prompts to prevent hallucination.StateGraph routes between RAG and five mock-ERP tools, chains them for multi-step reasoning (stock → delivery date), and returns an auditable tool_trace./metrics with custom agent_tool_calls_total{tool} + auto-imported Grafana dashboard.| Layer | Choice | Why |
|---|---|---|
| Frontend | Svelte 5 (runes) + Vite + Tailwind | Small bundle, no SSR overhead, runes for clean state |
| Backend | FastAPI + SQLAlchemy 2.0 async | Async friendly for LLM-long-tail latency, OpenAPI baked in |
| Vector DB | pgvector + cosine distance | Join with metadata in the same transaction, no extra ops |
| Embedding | Cohere embed-multilingual-v3 via Bedrock |
100+ languages, leading Traditional Chinese quality |
| Generation | Claude Sonnet 4.6 (US cross-region IP) via Bedrock | Top-tier Chinese reasoning, AWS-governed billing |
| Async Bedrock | boto3 + asyncio.to_thread |
Avoids pulling in aioboto3 dependency tree |
| Chunking | LangChain RecursiveCharacterTextSplitter | Chinese punctuation aware split points |
| Agent | LangGraph custom StateGraph + langchain-aws |
Explicit tool-calling loop, auditable trace |
| Orchestration | Kubernetes + Helm chart | Whole stack, one helm install |
| Observability | Prometheus + Grafana (kube-prometheus-stack) | /metrics + custom agent/RAG metrics |
| API Gateway | Kong (DB-less, declarative) | API-key auth + rate limiting, config as code |
| LLM Tracing | LangFuse (self-hosted) | Per-run agent trace: question → tools → answer |
| Package mgmt | uv (backend) + pnpm (frontend) |
Fastest installs, smallest on-disk footprint |
| Container | Docker multi-stage (builder + runtime) | ~250 MB backend, ~50 MB nginx frontend |
| CI | GitHub Actions (lint + test + docker smoke) | Three parallel jobs |
The private repo contains 9 full Architecture Decision Records following Michael Nygard's format. Summaries:
Single IAM key for both Cohere + Claude, monthly AWS billing (vs prepaid Stripe), cross-region inference profile for free HA, AWS-native plumbing (S3 / CloudWatch / Budgets). Bypasses Taiwan-specific Stripe / VAT friction. Trade-off: one-time use-case form on first Anthropic invocation.
embed-multilingual-v3 over Amazon TitanLeading multilingual quality on Traditional Chinese, first-class asymmetric retrieval API (search_document vs search_query), 1024-dim sweet spot. 5× the unit cost of Titan v2 is acceptable for the quality gap.
RAG generation is "synthesize from supplied context" — Sonnet handles it cleanly at 5× lower cost and 2× faster than Opus. Opus's edge is novel reasoning, which RAG doesn't need. Cost-aware model selection is itself a senior engineering signal.
One stateful system instead of two — vector lives in the same Postgres as metadata, JOIN in one transaction, multi-tenant via WHERE tenant_id = …. Pinecone's $70+/month adds vendor + bill + network hop without proportional benefit at this scale (< 10M vectors).
Single-page chat UI has no SSR / file-based routing / per-route data needs. Vanilla Vite produces a pure static SPA that's trivial to mount into FastAPI later. Cuts framework surface in half without sacrificing reactivity.
StateGraph over prebuilt create_react_agentThe explicit agent → tools → agent loop with a tools_condition edge and recursion cap makes the control flow self-documenting and leaves clean seams for enterprise concerns (approval gates, guardrails, auditable tool traces). Depends only on stable StateGraph / ToolNode primitives, not a churning prebuilt helper.
Self-contained helm install on bare minikube (no cloud account for the demo) and demonstrates stateful-workload competence (StatefulSet + PVC + readiness probe). Managed Postgres is documented as the production swap (override DATABASE_URL).
Gateway policy lives in one version-controlled kong.yml (no gateway database), verifiable in three curls (401 / 200 / 429) and clean for GitOps. Kong Ingress Controller and DB-backed Kong were considered but add machinery without demo benefit.
Postgres-only self-host (avoids the heavy v3 ClickHouse stack) for a real, reproducible trace. Instruments with the low-level LangFuse client rather than the LangChain handler, sidestepping a v2-handler-needs-old-langchain version conflict. Tracing is optional and graceful; LangFuse Cloud + the v4 SDK is the documented production swap.
Numbers are from manual smoke tests on a local Mac. Not production benchmarks.
| Metric | Value | Notes |
|---|---|---|
| Embedding latency | ~150 ms | Cohere multilingual v3, batch of 6 chunks |
| RAG P50 latency | ~1.8 s | top_k = 3, Sonnet 4.6 |
| RAG P95 latency | ~3.5 s | with full source content rendered |
| Docker image size | ~250 MB | multi-stage, Python slim |
| Bedrock cold start | ~6 s | first invocation only |
| Item | Unit price | Demo usage | Cost |
|---|---|---|---|
| Cohere embed | $0.10 / 1M tokens | 50K tokens | $0.005 |
| Claude Sonnet input | $3 / 1M tokens | 10K tokens | $0.03 |
| Claude Sonnet output | $15 / 1M tokens | 5K tokens | $0.08 |
| Total per demo run | < $0.20 |
AWS monthly billing — no Stripe prepay, no VAT validation friction.
| Week | Scope | Status |
|---|---|---|
| 1 | RAG core (ingest + query + Svelte 5 UI) | ✅ |
| 2 | LangGraph Agent + tool calling + mock ERP webhook | ✅ |
| 3 | Helm Chart + K8s deployment + Prometheus monitoring | ✅ |
| 4 | Kong API Gateway + LangFuse tracing + ADR catalog | ✅ |
Hugo Peng — CTO at an EdTech company, MBA, with 22+ years across the dominant ERP ecosystems in Greater China:
| Years | Role | Stack |
|---|---|---|
| 2025 – now | CTO @ EdTech company (current) | AI strategy + engineering |
| 2018 – 2025 | Consultant → CIO → CTO @ itelligensys | Epicor ERP consulting and platform leadership |
| 2015 – 2017 | Senior Engineer @ Vivotek | SAP ERP (PP / MM / PM / QM / SD / BI) |
| 2010 – 2014 | SAP ERP PP + BI Leader @ TSMT | SAP module ownership + BI team lead |
| 2006 – 2010 | ERP / BI Specialist @ DragonJet | Tiptop ERP |
| 2004 – 2006 | ERP Logistics Specialist @ Rossmax | Tiptop ERP |
Especially in roles where enterprise system integration and AI engineering need the same person in the room.
The complete implementation lives in a private GitHub repository containing:
For recruiters and hiring managers:
The showcase content (this README, diagrams, demo gif) is published under MIT for portfolio purposes. See LICENSE.