In Part 1, I wrote about why we built an AI travel interface from scratch, what we learned from our first failed attempt in 2019, and why a plugin can never get you to a fully integrated conversational experience. This part is the how. The architecture, the search pipeline, the memory problem that took four attempts to solve, and the UI/UX work that makes 25 seconds feel like 5.
The architecture
Since this is a tech blog and not a marketing page, let me show you what's actually running. This isn't a weekend project. This is production infrastructure.
```
Frontend (React + TypeScript + Zustand)
    │
    │ WebSocket (real-time, bidirectional)
    │
FastAPI Backend
├── Query Orchestrator
│   ├── LLM Client (multi-provider with automatic fallback)
│   ├── Tool Registry (9 specialized tools)
│   └── Observer System (async post-turn analysis)
├── RAG Pipeline
│   ├── Vector Database (3 content collections)
│   ├── Embedding Engine
│   └── Reranker (dual provider, switchable)
├── Conversation Memory (dual storage)
│   ├── SQL (message history)
│   └── Vector DB (semantic search over past messages)
├── Security Layer
│   ├── Input Guard (prompt injection defense)
│   ├── Output Guard (credential leak prevention)
│   └── Rate Limiting + IP Management
├── MCP Server (Model Context Protocol)
│   └── External AI assistant access to our data
└── Lead Management
    ├── Proposal Generation
    └── Specialist Handoff
```
The frontend is a React application with real-time WebSocket communication. No polling. No request-response cycles where you wait for a full answer. Every token streams in real-time, buffered at 50ms intervals to prevent UI flicker. Product results push to the interface the moment they're found, before the LLM even starts composing its response.
The backend is Python/FastAPI with a modular tool system. The LLM doesn't just generate text. It has access to 9 specialized tools and decides autonomously which ones to use. Need to search trips? It calls search_products. Need destination information? It calls search_knowledge. Need to recall what was discussed earlier? It queries conversation memory semantically.
The LLM isn't generating answers. It's orchestrating tools and composing a response from real data.
The LLM layer runs with a multi-provider fallback system. If the primary provider goes down, the system automatically switches to the secondary. No downtime. No errors. The customer never notices.
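At its core, a fallback loop like that is just trying each client in order. This is a hedged sketch; the real failover logic (health checks, retries, streaming) is more involved, and `providers` here are plain callables standing in for actual LLM clients.

```python
# Illustrative provider fallback, not the production implementation.
def complete(prompt: str, providers: list) -> str:
    """Try each provider in order; move to the next on any failure."""
    last_err = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as err:
            last_err = err  # remember why this provider failed
    raise RuntimeError("all providers failed") from last_err
```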
And there's already an MCP server running, the protocol I described in my Web 4.0 post. External AI assistants can already query our travel data through it. It's early days, but the infrastructure for AI-to-business communication is live. When Lisa's AI assistant wants to search 333travel's products, the endpoint already exists.
The RAG pipeline: where the real work happens
RAG, Retrieval-Augmented Generation, is the backbone of the entire system. It's what separates "GPT with a knowledge base" from "an AI that actually knows your product." But RAG done poorly is almost worse than no RAG at all.
We maintain three separate vector collections:
| Collection | What's in it | What it does |
|---|---|---|
| Travel Products | Roundtrips, hotels, tours, cruises | Powers the main product search |
| Travel Details | Day programs, excursions, inclusions | Answers specific itinerary questions |
| Travel Knowledge | Blogs, destination guides, travel info | Provides context and inspiration |
When a customer asks something, the query gets embedded and thrown against the relevant collection. But here's the critical part: we don't just take the top vector matches and call it a day.
Vector similarity alone is mediocre. It gets you in the right ballpark, but it doesn't get you the best results. So we do a broad fetch, significantly more results than we need, and then run them through a reranker. The reranker evaluates each query-document pair for actual semantic relevance, not just embedding proximity.
The difference is enormous. It's the difference between "here are trips that contain some of the words you used" and "here are trips that actually match what you're looking for."
But it goes further than that. Before the query even hits the vector database, it runs through a fuzzy normalization layer. Every country and location name in our system is cached at startup. When a customer types "Bali" we know they mean Indonesia. When they write "Tailand" we know what they meant. Aliases, spelling variations, language differences, all resolved before the search begins. Without this, you get empty results for queries that should have matched. And on top of that, the search combines semantic similarity with metadata filters. Country, duration, product type, these aren't just fields in a database. They're active filters that work together with the vector search to narrow results before the reranker even starts.
After reranking, the results split into two paths. The full set of reranked results pushes to the frontend immediately, but blurred. The customer sees product cards appearing while Joy is still thinking. It signals progress without overwhelming. Meanwhile, the LLM gets a stripped-down version of those results, only the fields it needs to reason about. This projection saves roughly a third of the tokens, which means faster responses and lower costs. Then Joy composes her answer and discusses the 4 or 5 trips that genuinely fit the question. Only those discussed products become visible to the customer. The rest disappear. What the customer sees is a curated, considered selection. Not a list of 10 search results. A recommendation of 5 that actually make sense.
Broad fetch → Rerank → Project. Three steps that make the difference between a search result and a recommendation.
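Those three steps can be sketched as a single function. Everything here is an assumption for illustration: the `vector_db` and `reranker` interfaces, the 5x over-fetch factor, and the projected field names are not our actual code.

```python
# Sketch of broad fetch -> rerank -> project; interfaces are illustrative.
def search(query: str, vector_db, reranker, k_final: int = 10):
    # 1. Broad fetch: pull far more candidates than we need
    candidates = vector_db.query(query, top_k=k_final * 5)

    # 2. Rerank: score each (query, document) pair for true relevance,
    #    not just embedding proximity
    scores = reranker.score(query, [c["text"] for c in candidates])
    ranked = [c for _, c in sorted(zip(scores, candidates),
                                   key=lambda pair: pair[0],
                                   reverse=True)][:k_final]

    # 3. Project: the LLM only sees the fields it needs to reason about,
    #    which is where the roughly one-third token saving comes from
    projection = [{"title": c["title"], "country": c["country"],
                   "duration": c["duration"], "price": c["price"]}
                  for c in ranked]
    return ranked, projection
```

The full `ranked` set goes to the frontend (blurred), the lean `projection` goes to the LLM.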
The memory problem
This is the part that took the most iterations to get right.
Conversational AI has a dirty secret: it doesn't remember anything. Every LLM call is stateless. The model has no idea what was said before unless you explicitly feed it the context. And there's a limit to how much context you can feed.
For a simple Q&A chatbot, this doesn't matter. Someone asks a question, gets an answer, done.
But for a travel assistant? Memory is everything.
A customer says: "I want to go to Thailand for two weeks." Three turns later: "Actually, make it three weeks. And I want to include Laos." Five turns later: "What was that trip you showed me earlier with the cooking class?"
Without memory, every turn is a blank slate. The system doesn't know about Thailand. Doesn't know about the three-week change. Doesn't know about Laos. Definitely doesn't know about the cooking class trip from seven messages ago.
Without memory, every message is a first date. Your AI has no idea what happened before.
This wasn't solved in one attempt. Not in two. Not even in three.
The first version was simple: feed the last 10 question-answer pairs into the LLM context. It worked for short conversations. But our conversations aren't short. Customers explore, compare, come back to earlier ideas, change their mind. Ten turns isn't enough when someone has been chatting for twenty minutes about three different countries.
So we tried generating summaries in the background. The idea was sound: condense the conversation into key points so the LLM always has the full picture without eating up the entire context window. The reality was different. The LLM started hallucinating details into the summaries. Mixing up countries. Attributing preferences the customer never mentioned. The summary was supposed to be the source of truth, and it was making things up.
Then we designed a three-stage memory system: short-term, ephemeral, and long-term. The ephemeral layer was supposed to track which topics belonged to which destination, which preferences applied to which part of the trip. On paper, elegant. In practice, a nightmare. It required the AI to understand complex relational context, like "when they said budget was flexible, they meant for Thailand, not for Vietnam." It couldn't. Too many edge cases. Too fragile.
What we landed on was simpler and more robust. Every message gets embedded as a vector in a dedicated memory collection in the same database we use for product search. When the customer says "that trip with the cooking class," the system runs a semantic search across all past messages in that session. It finds the relevant message even if the exact words don't match. Maybe Joy called it a "culinary experience" three turns ago. Vector search doesn't care about word matching. It matches meaning. Combined with the last 10 exchanges in the prompt for immediate context, Joy doesn't need to make that tool call for every question. She has the recent conversation right there, and the full history searchable when she needs it.
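A stripped-down sketch of that dual memory, with a toy cosine similarity standing in for the real vector database. All names here, and the injected `embed` function, are illustrative assumptions:

```python
import math

class ConversationMemory:
    """Dual memory sketch: a recency window plus semantic recall."""

    def __init__(self, embed):
        self.embed = embed      # callable: text -> vector
        self.messages = []      # [(text, vector)] in conversation order

    def add(self, text: str):
        self.messages.append((text, self.embed(text)))

    def recent(self, n: int = 10) -> list[str]:
        """Short-term: the last n messages go straight into the prompt."""
        return [t for t, _ in self.messages[-n:]]

    def recall(self, query: str, top_k: int = 3) -> list[str]:
        """Long-term: semantic search across the whole session."""
        q = self.embed(query)

        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self.messages, key=lambda m: cos(q, m[1]), reverse=True)
        return [t for t, _ in ranked[:top_k]]
```

This is why "that trip with the cooking class" can find a message that only ever said "culinary experience": `recall` matches meaning, while `recent` keeps the last exchanges in the prompt for free.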
Four attempts. Each one taught us something about what AI can and can't do with context. The final version works not because it's the smartest design, but because it's the most honest about the limitations.
AI memory isn't one problem. It's two: recalling what was said, and understanding what was meant. We needed two different systems to solve them.
UI/UX: the invisible work
Here's where most AI projects die, and they don't even realize it.
You can have the most sophisticated RAG pipeline in the world. The smartest retrieval. The best reranking. But if the interface feels clunky, slow, or confusing, none of it matters. The customer doesn't care about your architecture. They care about how it feels.
If your AI is smart but your interface is slow, you built a genius locked in a closet.
We went deep on this. And "deep" means dozens of iterations on things most people would consider details. But details are the product.
Streaming that feels natural. LLM responses stream token by token over WebSocket. But raw streaming creates flickering text that's jittery and unpleasant. We buffer at 50ms intervals. Fast enough to feel real-time, slow enough to prevent constant re-renders. The difference is subtle but massive in terms of perceived quality.
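The buffering idea fits in a few lines of asyncio. Here `send` stands in for the real WebSocket send call, and `token_source` for the LLM stream; this is a sketch of the technique, not our production streamer.

```python
import asyncio

async def stream_buffered(token_source, send, interval: float = 0.05):
    """Batch streamed tokens into ~50 ms flushes to avoid UI flicker."""
    buffer: list[str] = []
    loop = asyncio.get_running_loop()
    next_flush = loop.time() + interval
    async for token in token_source:
        buffer.append(token)
        if loop.time() >= next_flush:
            await send("".join(buffer))  # one UI update per interval
            buffer.clear()
            next_flush = loop.time() + interval
    if buffer:
        await send("".join(buffer))      # flush whatever is left at the end
```

The customer still sees text arriving continuously; the UI just re-renders at most ~20 times a second instead of once per token.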
The speed tradeoff. Let's be honest about something: a product search with a full answer takes 20 to 30 seconds. That's long. We know it's long. But we made a deliberate choice: quality over speed. The broad fetch, the reranking, the semantic matching, every step in the pipeline exists because it makes the results better. Cut any of them and you get faster responses with worse recommendations. We'd rather have Joy work for 25 seconds and deliver 5 trips that genuinely fit than return 10 mediocre matches in 8 seconds.
So the question became: how do we make 25 seconds not feel like 25 seconds? That's where the UX work came in. Joy shows what she's doing. Searching products. Looking up destination information. Checking conversation history. Every tool call is visible as it happens. Product cards push in blurred while she's still composing her answer. Once the answer is ready, only the trips Joy actually discusses become visible. The rest fade away. The customer never stares at a blank screen. They see the process unfold. And it turns out that watching something work feels completely different from waiting for something to finish.
Part of the latency is also a language problem. Dutch is poorly represented in most embedding and language models. The semantic understanding that works beautifully in English degrades when your customers and your product data are in Dutch. It means more processing, more careful matching, more effort to get the same level of quality.
What is GPT-NL? GPT-NL is a Dutch initiative to build a large language model specifically trained on and optimized for the Dutch language.

Where current models like GPT and Claude treat Dutch as an afterthought, GPT-NL aims to make Dutch a first-class citizen in AI. For applications like ours, where both the customer and the product data are Dutch, a model that truly understands the language could significantly improve semantic search quality and reduce the processing overhead we currently need to compensate for that gap. We're watching this project closely.
And then there's model selection. The quality of the conversation depends heavily on which LLM you use. We run on Sonnet 4.6 as our primary, and the difference compared with smaller or cheaper models is not subtle. It's the difference between a conversation that feels natural and one that feels scripted. Haiku, the faster and cheaper option, feels like a downgrade the moment you switch. GPT serves as our automatic fallback, but it's a step back in conversational quality. These aren't interchangeable commodities. The model you choose defines the experience your customer gets.
Instant page loads. The entire conversation persists in local storage. When you refresh the page or come back later, your conversation is already there. Instantly. No loading spinner. No "reconnecting..." message. The sync with the server happens silently in the background.
Markdown normalization. LLMs have opinions about formatting. Claude loves bold headers with emoji. GPT loves numbered lists. We normalize everything server-side. Bold-with-emoji gets converted to clean headers. Horizontal rules get stripped. Font weights get tuned down. The result looks like it was designed, not generated. The customer should never feel like they're reading AI output.
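A couple of representative rules, sketched with regular expressions. The real rule set is larger and these exact patterns are assumptions based on the behaviors described above:

```python
import re

# Toy normalization rules, illustrative only.
BOLD_LINE = re.compile(r"^\*\*(.+?)\*\*\s*$", re.MULTILINE)   # bold pseudo-headers
HORIZONTAL_RULE = re.compile(r"^\s*(-{3,}|\*{3,}|_{3,})\s*$", re.MULTILINE)

def normalize_markdown(text: str) -> str:
    """Turn bold pseudo-headers into real headers, strip horizontal rules."""
    text = BOLD_LINE.sub(r"### \1", text)
    text = HORIZONTAL_RULE.sub("", text)
    # Collapse the blank runs left behind by stripped rules
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```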
Session continuity. Your session survives across page refreshes, browser closes, and reconnections. Come back a day later, your conversation is still there. Your saved trips are still there. And if a session does expire, you get a clean overlay explaining what happened, not a broken page or a cryptic error.
The lead flow. When a customer is ready to talk to a specialist, the system generates a travel proposal pre-filled with what it already knows from the conversation. Destinations discussed, trip preferences, products explored. Minimal friction. Maximum context. The human specialist picks up exactly where Joy left off. That's not a handoff. That's the whole point. Joy exists to make that human conversation better, not to avoid it.
Every one of these details took multiple iterations. Every one of them is invisible to the customer. That's the point.
Security: an honest note
I'll keep this section brief and honest. Security isn't something you finish. It's something you work on constantly.
We have a dual guard system. An input guard before the LLM that catches prompt injection attempts, and an output guard after the LLM that catches credential leaks or system prompt fragments in responses. Both are deterministic and add virtually zero latency.
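In skeleton form, deterministic guards are pattern checks on the way in and redaction on the way out. The patterns below are toy examples for illustration, not our real rule set:

```python
import re

# Illustrative patterns only; a real guard has far more rules.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]
LEAK_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),           # API-key-like strings
    re.compile(r"system prompt", re.IGNORECASE),  # prompt fragments
]

def input_guard(message: str) -> bool:
    """True if the user message looks like a prompt injection attempt."""
    return any(p.search(message) for p in INJECTION_PATTERNS)

def output_guard(response: str) -> str:
    """Redact anything that looks like a credential or prompt leak."""
    for p in LEAK_PATTERNS:
        response = p.sub("[redacted]", response)
    return response
```

Because both guards are plain pattern matching rather than an extra LLM call, they run in microseconds, which is how they stay effectively latency-free.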
The system prompt is hardened with anti-jailbreak instructions. There's rate limiting, bot protection, and IP management for persistent abuse.
Is it bulletproof? No. Nothing is. But it's built with defense in depth in mind, and we iterate on it as new attack patterns emerge. If you're building any AI interface that talks to the public and you haven't thought about input/output guards, you have a problem waiting to happen.
The bigger picture: an extra channel, not a replacement
Let me be very clear about what Joy is and what she isn't.
Joy is not a replacement for our website. She's not a replacement for phone or email. She's an additional channel. A new way for customers to explore our products, alongside everything that already exists. She will never be a replacement for human contact. We value that the most in our company. Joy is a means to human contact. She helps customers figure out what they want, so that when they talk to one of our specialists, that conversation starts at a completely different level.
Joy isn't the destination. She's the road that leads to a real conversation with a real person.
The MCP server is already live, which means AI assistants can already query our data programmatically. When the Web 4.0 vision materializes, when Lisa's AI assistant autonomously searches for her perfect Thailand trip, the endpoint already exists. Not because we scrambled to build it. Because we've been exploring this infrastructure for over a year.
Joy isn't the end product. She's the beginning of a channel strategy that includes both human-to-AI and AI-to-AI interactions. And she's a test, a very serious, very thorough test, of whether conversational AI can actually add value for travel customers today.
The results so far have been better than I expected. Not flawless, nothing is, but genuinely impressive in how natural the conversations feel and how relevant the recommendations are. When you see a customer describe a vague idea and Joy comes back with trips that actually fit, you realize this isn't a gimmick. It works. And it works because of everything underneath it.
So, what are we trying to solve again?
I opened with this question because it's the one that guided everything we built. Not "how do we use AI?" Not "what are our competitors doing?" Not "how do we look innovative?"
Just: what problem are we solving, and does this solution actually work?
We built Joy to find out. With real architecture. Real data. Real customers. And the willingness to say "it's not ready yet" if that turned out to be the case.
If you're thinking about adding AI to your product, start there. Not with the tool. Not with the vendor pitch. Not with the pressure to keep up.
Start with the question. Build from the answer.
Your customers deserve better than a FAQ with a face. They deserve a real conversation. And eventually, a real person. Make sure your AI leads them there, not away from it.
If you haven't read Part 1, that's where the story starts: why we built this, what we learned from failing at it in 2019, and why you can't get here with a plugin.

