LLM Memory Management — How a $4k/month Token Leak Nearly Broke Our Chatbot
Stop treating LLM memory as a black box.
- Semantic Memory: Stores user facts and preferences. Production risk: unbounded growth if extraction thresholds are too low.
- Episodic Memory: Stores conversation summaries. Production risk: summary drift over time if not re-summarized.
- Procedural Memory: Stores system behavior rules. Production risk: prompt injection via user-controlled memory updates.
- Memory Extraction: LLM call to parse raw text into structured factoids. Production risk: cost explosion if you extract after every turn.
- Memory Retrieval: Vector search to find relevant memories. Production risk: stale embeddings after schema change.
- Memory Consolidation: Merging and deduplicating memories. Production risk: data loss if merge logic is too aggressive.
Think of LLM memory like a sticky note system. The model starts each conversation with a blank slate, so you have to write down what it learned from previous chats. If you write too much, the notes get expensive and slow. If you write too little, the model forgets who you are. This article shows you how to write the right notes, at the right time, without burning cash.
Three weeks ago, our customer support chatbot’s monthly token bill jumped from $2,400 to $6,800. No traffic spike. No model upgrade. Just a silent memory leak. The memory system we'd built — a simple vector store of user preferences — was growing unbounded. Every conversation extracted 15-20 new factoids, and we were injecting all of them into every prompt. The p99 latency went from 1.2s to 4.7s. Users started seeing 'I'm sorry, I can't answer that' timeouts. We had built a memory system that remembered everything and cost us everything.
Most tutorials on LLM memory management stop after showing you how to extract and store memories. They don't tell you about the memory consolidation pipeline you need to prevent token bloat. They don't mention that retrieval will silently break when the embedding model or schema changes. They definitely don't show you how to debug a memory system at 2am when the p99 is screaming red.
This article covers exactly what I wish I'd known before that incident: how memory extraction actually works under the hood, the production patterns for scaling to 10K+ users, the common mistakes that cost real money, and a debugging guide for when things go wrong. Every section includes a real incident, runnable code, and a production insight that the docs won't tell you.
How LLM Memory Extraction Actually Works Under the Hood
Memory extraction is not magic. It's a structured LLM call that takes raw conversation text and outputs a JSON array of factoids. The prompt typically looks like: 'Extract important facts about the user from this conversation. Return a JSON array of objects with keys: content, importance (1-10), category.' The LLM then parses the conversation and generates these factoids.
What the docs don't tell you: the LLM will hallucinate facts if the prompt is too vague. We saw this when our extraction prompt didn't specify 'only extract facts explicitly stated by the user'. The model started inferring preferences like 'User likes blue color' from a message that mentioned 'blue sky'. We fixed this by adding an explicit constraint: 'Only extract facts that are directly stated, not inferred.'
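A minimal sketch of the extraction contract described above, assuming the model replies with a plain JSON array. The prompt wording combines the two quoted instructions, and parse_factoids and sample_reply are hypothetical names used only for illustration. Validation like this catches malformed output; it cannot detect an inferred fact, which is why the explicit constraint still belongs in the prompt.

```python
import json

# Hypothetical prompt text, assembled from the instructions quoted above.
EXTRACTION_PROMPT = (
    "Extract important facts about the user from this conversation. "
    "Only extract facts that are directly stated, not inferred. "
    "Return a JSON array of objects with keys: content, importance (1-10), category."
)


def parse_factoids(raw_response):
    """Parse the model's reply and keep only well-formed factoids."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # unparseable output is treated as "nothing extracted"
    if not isinstance(data, list):
        return []

    factoids = []
    for item in data:
        if not isinstance(item, dict):
            continue
        content = item.get("content")
        importance = item.get("importance")
        category = item.get("category")
        if not isinstance(content, str) or not content.strip():
            continue  # empty or missing content
        if isinstance(importance, bool) or not isinstance(importance, int):
            continue  # importance must be an integer
        if not 1 <= importance <= 10:
            continue  # outside the 1-10 scale the prompt asks for
        if not isinstance(category, str):
            continue
        factoids.append(
            {"content": content.strip(), "importance": importance, "category": category}
        )
    return factoids


# Stand-in for a real model reply: one valid factoid, two invalid ones.
sample_reply = json.dumps([
    {"content": "User prefers email over phone", "importance": 7, "category": "preference"},
    {"content": "", "importance": 3, "category": "misc"},
    {"content": "User likes blue", "importance": 42, "category": "preference"},
])
print(parse_factoids(sample_reply))
```

Dropping invalid items instead of raising keeps one bad factoid from discarding the rest of the batch.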
Another hidden detail: extraction is expensive. Each call consumes ~200-500 tokens for the prompt + output. If you extract after every user message, you pay that cost on every turn. We now extract only after every 3rd message, or when the conversation exceeds 2000 tokens. This cut our extraction costs by 60%.
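The extraction cadence can be sketched as a small buffer that releases a batch on every 3rd message or once the token threshold is crossed. ExtractionBuffer is a hypothetical helper, and counting tokens accumulated since the last extraction (rather than over the whole conversation) is an assumption about how the 2000-token rule is applied.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractionBuffer:
    """Collects messages and releases them in batches for one extraction call."""

    every_n_messages: int = 3
    token_threshold: int = 2000
    pending: List[str] = field(default_factory=list)
    pending_tokens: int = 0

    def add(self, message: str, token_count: int) -> Optional[List[str]]:
        """Buffer a message; return the batch to extract from, or None."""
        self.pending.append(message)
        self.pending_tokens += token_count
        due_by_count = len(self.pending) >= self.every_n_messages
        due_by_tokens = self.pending_tokens > self.token_threshold
        if due_by_count or due_by_tokens:
            batch = self.pending
            # Clear the buffer so the same messages are never extracted twice.
            self.pending = []
            self.pending_tokens = 0
            return batch
        return None


buffer = ExtractionBuffer()
for text, tokens in [("hi", 5), ("I use Python", 8), ("I live in Oslo", 9)]:
    batch = buffer.add(text, tokens)
    if batch is not None:
        print("extract from:", batch)
```

Resetting the buffer inside add is the part that matters most: a buffer that never clears re-extracts the same messages on every call.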
Building a Production-Grade Memory Store with ChromaDB and Deduplication
Once you have extracted memories, you need to store them efficiently. We use ChromaDB for vector storage because it's simple to set up and has good Python bindings. But the default setup is not production-ready. You need to add deduplication at write time, not just at read time.
The deduplication logic: before inserting a new memory, compute its embedding and check cosine similarity against all existing memories for that user. If similarity > 0.85, skip insertion. This prevents the store from filling with near-duplicate facts like 'User likes Python' and 'User enjoys programming in Python'.
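The write-time check can be sketched in plain Python as follows. The two-dimensional vectors are toy values standing in for real embeddings, and insert_if_novel is a hypothetical helper, not a ChromaDB call. With a vector store, the linear scan would typically be replaced by a nearest-neighbour query filtered by user; stores often return a distance rather than a similarity, so the 0.85 threshold may need converting.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def insert_if_novel(user_memories, embedding, content, threshold=0.85):
    """Append the memory unless a near-duplicate already exists for this user.

    Returns True if the memory was stored, False if it was skipped.
    """
    for existing in user_memories:
        if cosine_similarity(existing["embedding"], embedding) > threshold:
            return False  # near-duplicate: skip the write
    user_memories.append({"embedding": embedding, "content": content})
    return True


store = []
print(insert_if_novel(store, [1.0, 0.0], "User likes Python"))
print(insert_if_novel(store, [0.98, 0.2], "User enjoys programming in Python"))
print(insert_if_novel(store, [0.0, 1.0], "User lives in Oslo"))
```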
We also add a timestamp and a hit counter to each memory. The hit counter increments every time a memory is retrieved and injected into a prompt. This allows us to prune low-value memories (those with < 5 hits in 30 days) during consolidation.
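A rough sketch of the hit counter and pruning rule. The Memory dataclass and function names are hypothetical, and one interpretation is assumed: a memory is pruned only once it is at least 30 days old and has fewer than 5 hits, so new memories get time to accumulate hits.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Memory:
    content: str
    created_at: datetime
    hit_count: int = 0


def record_hit(memory):
    """Call whenever a memory is retrieved and injected into a prompt."""
    memory.hit_count += 1


def prune(memories, now, min_hits=5, window=timedelta(days=30)):
    """Keep memories that are younger than the window or have enough hits."""
    kept = []
    for memory in memories:
        still_young = now - memory.created_at < window
        if still_young or memory.hit_count >= min_hits:
            kept.append(memory)
    return kept


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
memories = [
    Memory("old and unused", datetime(2023, 12, 1, tzinfo=timezone.utc), hit_count=0),
    Memory("old but popular", datetime(2023, 12, 1, tzinfo=timezone.utc), hit_count=9),
    Memory("new and unused", datetime(2024, 1, 20, tzinfo=timezone.utc), hit_count=0),
]
print([m.content for m in prune(memories, now)])
```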
When NOT to Use LLM Memory — The Case for Stateless Design
Not every application needs long-term memory. In fact, adding memory to a system that doesn't need it adds latency, cost, and complexity. Here's when you should skip it:
- One-shot tasks: If users interact with your app once (e.g., a translation tool), memory adds no value. The user won't come back.
- Highly sensitive data: If your app deals with PII or health data, storing user conversations as memories creates compliance headaches. GDPR right-to-erasure becomes a nightmare when memories are spread across vector stores.
- High-throughput, low-latency systems: If you need sub-200ms responses, the memory retrieval step adds 50-100ms. Skip it.
- When the context window is enough: For short conversations (< 4K tokens), just include the raw history. No need for extraction.
We learned this the hard way when we added memory to our internal log analysis tool. Users would run a single query, get an answer, and leave. The memory store grew to 50K entries in a month, and nobody ever retrieved them. We removed memory and saved $800/month in embedding API costs.
Production Patterns for Scaling Memory to 10K+ Users
Scaling memory to thousands of users requires more than just a vector store. Here are the patterns we use in production:
- Shard by user ID: Use consistent hashing to distribute users across multiple ChromaDB instances. This prevents a single instance from becoming a bottleneck.
- Batch extraction: Don't extract memories after every message. Batch them: collect 5-10 messages, then extract in one call. This reduces API calls by 80%.
- Lazy retrieval: Don't retrieve memories on every turn. Retrieve only when the conversation enters a new topic (detected by embedding similarity drop > 0.3).
- Memory TTL: Set a time-to-live on memories. For most apps, 30 days is enough. After that, archive to cold storage (S3) and only retrieve if explicitly needed.
- Pre-compute embeddings: For known users, pre-compute and cache their top 10 memories every hour. This avoids the retrieval step for most interactions.
We serve 15K active users with this setup. The p99 retrieval latency is 45ms, and the monthly embedding cost is $1,200.
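The sharding pattern above can be illustrated with a small consistent-hash ring. This is a sketch, not the setup described here: the instance names (chroma-0 and so on), the virtual-node count, and the choice of SHA-256 are all assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps user IDs to storage nodes; adding a node only moves a fraction of users."""

    def __init__(self, nodes, vnodes=100):
        if not nodes:
            raise ValueError("at least one node is required")
        ring = []
        for node in nodes:
            # Several virtual points per node smooth out the distribution.
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [node for _, node in ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, user_id):
        """Return the first node clockwise from the user's hash position."""
        idx = bisect.bisect(self._points, self._hash(user_id)) % len(self._points)
        return self._owners[idx]


ring = ConsistentHashRing(["chroma-0", "chroma-1", "chroma-2"])
print(ring.node_for("user-42"))
```

The reason to prefer this over hash(user_id) % N is resharding: adding an instance reassigns only the users that land on the new node, not almost everyone.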
Common Mistakes That Cost Real Money — With Specific Examples
Here are the three most expensive mistakes we've seen teams make with LLM memory:
- Extracting after every turn: A team building a personal assistant extracted memories after every user message. With 10 messages per session and 1,000 users (assuming one session per user per day), that's 10K extraction calls per day. At $0.0015 per call (GPT-4o-mini), that's $15/day or $450/month. But they also injected all memories into the prompt, adding 2000 tokens per turn. That's another $20/day. Total: $35/day for a feature that didn't improve user satisfaction. Fix: extract every 5th message, inject only top 5 memories.
- No deduplication: Another team stored every extracted factoid without checking for duplicates. After a week, one user had 200 memories, 80% of which were duplicates like 'User likes coffee' and 'User prefers coffee'. The injection prompt was 4000 tokens just for memories. Fix: add cosine similarity dedup at write time.
- Switching embedding models without re-embedding: A team used 'text-embedding-3-small' both to embed stored memories and to embed retrieval queries. When they switched to 'text-embedding-3-large' for better accuracy, the stored embeddings became incompatible with the new query embeddings, and retrieval returned garbage. Fix: version your embeddings. Store the model name in metadata and re-embed on model change.
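The embedding-versioning fix can be sketched as a metadata check before retrieval. The embedding_model metadata key and the helper name are hypothetical. The idea is only that every stored memory records which model produced its vector, so stale entries can be found and re-embedded.

```python
CURRENT_EMBEDDING_MODEL = "text-embedding-3-large"


def split_by_embedding_model(memories, current_model=CURRENT_EMBEDDING_MODEL):
    """Separate memories embedded with the current model from stale ones.

    Memories with no recorded model are treated as stale, since their
    vectors cannot be trusted to match the current query embeddings.
    """
    fresh, stale = [], []
    for memory in memories:
        model = memory.get("metadata", {}).get("embedding_model")
        if model == current_model:
            fresh.append(memory)
        else:
            stale.append(memory)
    return fresh, stale


memories = [
    {"content": "User likes coffee", "metadata": {"embedding_model": "text-embedding-3-small"}},
    {"content": "User lives in Oslo", "metadata": {"embedding_model": "text-embedding-3-large"}},
    {"content": "User has a dog", "metadata": {}},
]
fresh, stale = split_by_embedding_model(memories)
print(len(fresh), "usable,", len(stale), "need re-embedding")
```

Running a check like this at deploy time turns a silent retrieval failure into an explicit re-embed job.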
Comparison: LangMem vs. Custom Memory vs. Mem0
We evaluated three approaches for memory management: LangMem (LangChain's memory module), a custom-built system, and Mem0 (an open-source memory layer). Here's the production comparison:
LangMem: Good for quick prototyping. It handles extraction and storage out of the box. But it's opinionated: it uses LangChain's abstractions, which can be hard to customize. We found it hard to add custom deduplication logic. Also, it uses OpenAI embeddings by default, which adds API costs. For production, we needed more control.
Mem0: Excellent for teams that want a turnkey solution. It handles extraction, storage, and retrieval with a simple API. But it's a black box: when something goes wrong (e.g., token leak), it's hard to debug. We also hit a bug where Mem0's consolidation cron job ran every hour and caused latency spikes. The fix was to disable the cron and run it manually.
Custom system: This is what we ended up with. It gives us full control over every aspect: extraction prompt, deduplication logic, storage backend, retrieval strategy. The trade-off is development time: it took us 2 weeks to build vs. 2 days to integrate LangMem. But for a system handling 15K users, the control is worth it.
Recommendation: Start with LangMem or Mem0 for MVP. Switch to custom when you hit scaling or customization limits.
Debugging and Monitoring Memory Systems in Production
You can't fix what you can't see. Here's the monitoring setup we use for our memory system:
- Memory store size per user: Track the number of memories per user. Alert if any user exceeds 500 memories (indicates dedup failure).
- Extraction call count: Track the number of extraction calls per user per session. Alert if > 10 calls per session (indicates buffer not clearing).
- Injection token count: Track the number of tokens injected into the prompt from memory. Alert if > 2000 tokens (indicates no re-ranking).
- Retrieval latency: Track p50, p95, p99 of memory retrieval. Alert if p99 > 200ms.
- Embedding model version: Track the current embedding model version. Alert if it changes without a re-embed job.
We use Prometheus for metrics and Grafana for dashboards. Here's a sample metric definition.
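As an illustration, the alert thresholds listed above can be written down as plain data and evaluated in a few lines. This is a Python sketch rather than an actual Prometheus rule file, and the metric names are hypothetical, chosen only to mirror the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # hypothetical metric name
    threshold: float  # alert fires when the observed value exceeds this
    meaning: str      # what a breach usually indicates


RULES = [
    AlertRule("memories_per_user", 500, "dedup failure"),
    AlertRule("extraction_calls_per_session", 10, "buffer not clearing"),
    AlertRule("memory_injection_tokens", 2000, "no re-ranking"),
    AlertRule("retrieval_latency_p99_ms", 200, "retrieval p99 above target"),
]


def firing_alerts(observed, rules=RULES):
    """Return the rules whose threshold is exceeded by the observed values."""
    return [rule for rule in rules if observed.get(rule.metric, 0) > rule.threshold]


observed = {
    "memories_per_user": 620,
    "extraction_calls_per_session": 4,
    "memory_injection_tokens": 2600,
    "retrieval_latency_p99_ms": 45,
}
for rule in firing_alerts(observed):
    print(f"ALERT {rule.metric} > {rule.threshold}: {rule.meaning}")
```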
The Unbounded Memory Leak That Cost $4,000 in One Weekend
The leak traced back to extract_memories(), which returned 5-10 new factoids per turn, even if they were redundant with existing ones. The vector store grew linearly with conversation length, and we injected all memories into the system prompt without any budget or relevance filtering.
- Always set a hard cap on the number of memories stored per user. Unbounded growth is a ticking time bomb.
- Implement deduplication at extraction time, not just at retrieval time. It's cheaper to skip a write than to filter a read.
- Monitor memory store size and injection token count as a standard metric. We now have a dashboard for 'memories per user' and 'memory token % of prompt'.
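The first lesson, together with the missing budget from the incident, might look like the sketch below. The field names (importance, score, tokens) and the default limits are assumptions for illustration. Evicting the lowest-importance memories first is one possible policy, not necessarily the one used here.

```python
def enforce_cap(memories, max_per_user=500):
    """Keep at most max_per_user memories, dropping the least important first."""
    ranked = sorted(memories, key=lambda m: m["importance"], reverse=True)
    return ranked[:max_per_user]


def select_for_injection(memories, max_memories=5, token_budget=2000):
    """Pick the most relevant memories that fit within the token budget."""
    ranked = sorted(memories, key=lambda m: m["score"], reverse=True)
    chosen, used = [], 0
    for memory in ranked[:max_memories]:
        if used + memory["tokens"] > token_budget:
            continue  # skip anything that would overflow the budget
        chosen.append(memory)
        used += memory["tokens"]
    return chosen


candidates = [
    {"content": "a", "score": 0.9, "tokens": 800},
    {"content": "b", "score": 0.8, "tokens": 1500},
    {"content": "c", "score": 0.7, "tokens": 300},
    {"content": "d", "score": 0.1, "tokens": 50},
]
picked = select_for_injection(candidates, max_memories=3, token_budget=2000)
print([m["content"] for m in picked])
```

The cap bounds storage and the budget bounds prompt size, so a failure in one does not automatically become a token bill.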
- Run grep 'extract_memories' /var/log/app/llm.log | wc -l and compare with yesterday's count. If the count is more than 2x, the extraction prompt is too aggressive.
- Run collection.get(where={'user_id': 'abc'}, limit=50). Check if old memories have high similarity to the current query.
- Run curl -X POST http://localhost:8000/embed -d '{"input": "test"}'. Compare the hash of the model config with the prod config.
- Run grep 'consolidated' /var/log/app/memory.log. If the deletion count is more than 20% of the total, reduce the similarity threshold from 0.85 to 0.75.
- python -c "import json; logs = [json.loads(l) for l in open('llm.log') if 'extract' in l]; print(f'Extractions in last hour: {len(logs)}')"
- python -c "import json; logs = [json.loads(l) for l in open('llm.log') if 'inject_memories' in l]; print('Avg token injection:', sum(e['tokens'] for e in logs) / max(len(logs), 1))"
- Set extraction_interval=5 in config.
Common mistakes to avoid
- Appending full history to every prompt
- No deduplication on memory writes
- Using an LLM to summarize memory every turn
- Storing memory per-session instead of per-user
Interview Questions on This Topic
How would you design a memory system for an LLM chatbot that handles 100K users?