Senior 7 min · May 22, 2026

LLM Memory Management — How a $4k/month Token Leak Nearly Broke Our Chatbot

Stop treating LLM memory as a black box.

Naren · Founder
Quick Answer
  • Semantic Memory: Stores user facts and preferences. Production risk: unbounded growth if extraction thresholds are too low.
  • Episodic Memory: Stores conversation summaries. Production risk: summary drift over time if not re-summarized.
  • Procedural Memory: Stores system behavior rules. Production risk: prompt injection via user-controlled memory updates.
  • Memory Extraction: LLM call to parse raw text into structured factoids. Production risk: cost explosion if you extract after every turn.
  • Memory Retrieval: Vector search to find relevant memories. Production risk: stale embeddings after schema change.
  • Memory Consolidation: Merging and deduplicating memories. Production risk: data loss if merge logic is too aggressive.
Definition (~90s read)
What is LLM Memory Management?

LLM memory management is the infrastructure layer that gives chatbots persistent context across sessions, solving the fundamental statelessness problem of large language models. Without it, every conversation starts from scratch — no recall of user preferences, past decisions, or ongoing tasks.

The core challenge isn't storing data; it's extracting, deduplicating, and retrieving the right information from unstructured conversation history at inference time, all while keeping token costs under control. A single misconfigured memory pipeline can silently leak thousands of dollars monthly through redundant context injection, as the $4k/month token leak in this article demonstrates.

Under the hood, memory management typically involves three phases: extraction (parsing conversation logs into structured facts or summaries using LLM calls), storage (persisting embeddings and metadata in vector databases like ChromaDB or pgvector), and retrieval (fetching relevant memories at query time via semantic similarity search). Production systems must handle deduplication — avoiding storing 'user likes pizza' ten times — and implement eviction policies for stale or contradictory memories.

The trade-off is always between recall quality and latency/cost: injecting too much context bloats prompts, while too little makes the bot seem amnesiac.

When to skip memory entirely: stateless design wins for single-turn tasks (translation, summarization, code generation) where context is provided inline. Memory adds complexity, latency, and cost — don't use it unless your use case genuinely requires cross-session continuity.

For production scaling beyond 10K users, you'll need sharded vector stores, async extraction pipelines, and tiered memory (short-term vs. long-term) so that retrieval latency and write-time deduplication checks don't grow with the total number of stored memories. Tools like LangMem and Mem0 abstract some of this but introduce vendor lock-in and opaque pricing; custom solutions with ChromaDB give you full control over deduplication logic and cost optimization, which is critical when every token counts.

[Figure: LLM Memory Management architecture. Stages: (1) Conversation: incoming messages; (2) Working Memory: active context window; (3) External Store: Redis / Postgres; (4) Retriever: semantic search; (5) Context Builder: merge + rank; (6) LLM Response: memory-grounded reply. Edges are labeled "store" and "retrieve". Source: THECODEFORGE.IO]
Plain-English First

Think of LLM memory like a sticky note system. The model starts each conversation with a blank slate, so you have to write down what it learned from previous chats. If you write too much, the notes get expensive and slow. If you write too little, the model forgets who you are. This article shows you how to write the right notes, at the right time, without burning cash.

Three weeks ago, our customer support chatbot’s monthly token bill jumped from $2,400 to $6,800. No traffic spike. No model upgrade. Just a silent memory leak. The memory system we'd built — a simple vector store of user preferences — was growing unbounded. Every conversation extracted 15-20 new factoids, and we were injecting all of them into every prompt. The p99 latency went from 1.2s to 4.7s. Users started seeing 'I'm sorry, I can't answer that' timeouts. We had built a memory system that remembered everything and cost us everything.

Most tutorials on LLM memory management stop after showing you how to extract and store memories. They don't tell you about the memory consolidation pipeline you need to prevent token bloat. They don't mention that your embedding model will silently break after a schema change. They definitely don't show you how to debug a memory system at 2am when the p99 is screaming red.

This article covers exactly what I wish I'd known before that incident: how memory extraction actually works under the hood, the production patterns for scaling to 10K+ users, the common mistakes that cost real money, and a debugging guide for when things go wrong. Every section includes a real incident, runnable code, and a production insight that the docs won't tell you.

How LLM Memory Extraction Actually Works Under the Hood

Memory extraction is not magic. It's a structured LLM call that takes raw conversation text and outputs a JSON array of factoids. The prompt typically looks like: 'Extract important facts about the user from this conversation. Return a JSON array of objects with keys: content, importance (1-10), category.' The LLM then parses the conversation and generates these factoids.

What the docs don't tell you: the LLM will hallucinate facts if the prompt is too vague. We saw this when our extraction prompt didn't specify 'only extract facts explicitly stated by the user'. The model started inferring preferences like 'User likes blue color' from a message that mentioned 'blue sky'. We fixed this by adding an explicit constraint: 'Only extract facts that are directly stated, not inferred.'

Another hidden detail: extraction is expensive. Each call consumes ~200-500 tokens for the prompt + output. If you extract after every user message, you're burning tokens. We now extract only after every 3rd message, or when the conversation exceeds 2000 tokens. This cut our extraction costs by 60%.

memory_extraction.py (Python)
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_memories(conversation: list[dict]) -> list[dict]:
    """
    Extract structured memories from a conversation.
    Returns list of dicts with keys: content, importance, category.
    """
    # JSON mode returns a single JSON object, so the array is requested under a "memories" key.
    prompt = f"""
Extract important facts about the user from this conversation.
Only extract facts that are explicitly stated by the user, not inferred.
Return a JSON object with one key, "memories", whose value is an array of objects with keys: content (str), importance (int 1-10), category (str: preference|background|goal|other).

Conversation:
{json.dumps(conversation, indent=2)}

Memories:
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheaper model for extraction
        messages=[
            {"role": "system", "content": "You are a memory extraction assistant. Be precise and conservative."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"},
        temperature=0.0,  # reduces run-to-run variation
        seed=42,  # best-effort reproducibility, not a guarantee
    )
    # Parse the JSON response; treat malformed output as "no memories"
    try:
        raw = json.loads(response.choices[0].message.content or "{}")
    except json.JSONDecodeError:
        return []
    # Validate structure: accept {"memories": [...]} or a bare list
    items = raw.get("memories", []) if isinstance(raw, dict) else raw
    if not isinstance(items, list):
        return []
    memories = []
    for m in items:
        if not isinstance(m, dict):
            continue
        if "content" not in m:
            m["content"] = m.get("fact", "")  # fallback for different key names
        if m["content"]:
            memories.append(m)
    return memories
Extraction is not idempotent by default
Running the same conversation through extraction twice can yield different factoids due to LLM non-determinism. Set temperature=0.0 and seed=42 to make results more reproducible. The seed is best-effort rather than a guarantee, so still deduplicate before storing.
Production Insight
A recommendation engine serving 2M req/day started returning stale results after a schema migration. We changed the memory extraction prompt to include a new field 'timestamp', but the old memories didn't have it. The LLM started hallucinating timestamps for old memories, causing the re-ranker to prioritize them incorrectly. Fix: we ran a one-time migration script that re-extracted all memories with the new schema.
Key Takeaway
Memory extraction is a structured LLM call that needs strict constraints to avoid hallucination and token waste. Always validate output schema and deduplicate before storing.
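
The "always validate output schema" step can be sketched as a small, dependency-free function that runs on the parsed JSON before anything is stored. This is a minimal sketch, not the validation the team actually ran: the name `validate_memories`, the default importance of 5, and the choice to drop exact-text repeats within a batch are all assumptions.

```python
from typing import Any

# Categories taken from the extraction prompt in this article.
ALLOWED_CATEGORIES = {"preference", "background", "goal", "other"}


def validate_memories(raw: Any) -> list[dict]:
    """Keep only well-formed memories and normalize their fields."""
    # Accept either {"memories": [...]} or a bare list.
    if isinstance(raw, dict):
        items = raw.get("memories", [])
    elif isinstance(raw, list):
        items = raw
    else:
        items = []
    if not isinstance(items, list):
        items = []

    cleaned: list[dict] = []
    seen: set[str] = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        content = item.get("content")
        if not isinstance(content, str) or not content.strip():
            continue  # no usable text
        content = content.strip()
        key = content.lower()
        if key in seen:
            continue  # exact repeat inside the same batch
        seen.add(key)

        try:
            importance = int(item.get("importance", 5))  # assumed default
        except (TypeError, ValueError):
            importance = 5
        importance = max(1, min(10, importance))  # clamp to the 1-10 range

        category = item.get("category", "other")
        if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
            category = "other"

        cleaned.append(
            {"content": content, "importance": importance, "category": category}
        )
    return cleaned
```

Running this between `json.loads` and the store means a malformed or over-eager extraction degrades to fewer memories rather than bad rows in the vector store.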

Building a Production-Grade Memory Store with ChromaDB and Deduplication

Once you have extracted memories, you need to store them efficiently. We use ChromaDB for vector storage because it's simple to set up and has good Python bindings. But the default setup is not production-ready. You need to add deduplication at write time, not just at read time.

The deduplication logic: before inserting a new memory, compute its embedding and check cosine similarity against all existing memories for that user. If similarity > 0.85, skip insertion. This prevents the store from filling with near-duplicate facts like 'User likes Python' and 'User enjoys programming in Python'.
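
The threshold check itself is small enough to show in isolation. The sketch below uses toy 2-D vectors as stand-ins for real embeddings, and `is_near_duplicate` is a hypothetical helper name. Vector stores usually report cosine distance, which is `1 - similarity`, so a similarity threshold of 0.85 corresponds to a distance below 0.15.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(
    new_vec: list[float],
    existing_vecs: list[list[float]],
    threshold: float = 0.85,
) -> bool:
    """True if any stored vector is more similar than the threshold."""
    return any(cosine_similarity(new_vec, v) > threshold for v in existing_vecs)


# Toy example: two vectors pointing almost the same way, one orthogonal.
stored = [[1.0, 0.0]]
print(is_near_duplicate([0.9, 0.1], stored))  # nearly parallel to the stored vector
print(is_near_duplicate([0.0, 1.0], stored))  # orthogonal to the stored vector
```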

We also add a timestamp and a hit counter to each memory. The hit counter increments every time a memory is retrieved and injected into a prompt. This allows us to prune low-value memories (those with < 5 hits in 30 days) during consolidation.

memory_store.py (Python)
import hashlib

import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB with persistent storage
client = chromadb.PersistentClient(path="./memory_store")

# Use a local embedding model (no API calls for embedding)
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

def get_or_create_collection(user_id: str):
    """Get or create a collection for a user. Each user gets their own collection."""
    # "hnsw:space": "cosine" makes query distances cosine distances, so
    # 1 - distance is cosine similarity. Chroma's default space is squared L2,
    # where that conversion would be wrong.
    return client.get_or_create_collection(
        name=f"user_{user_id}",
        embedding_function=sentence_transformer_ef,
        metadata={"hnsw:space": "cosine"},
    )

def add_memory_with_dedup(user_id: str, memory: dict, similarity_threshold: float = 0.85) -> bool:
    """
    Add a memory to the user's collection, but only if it's not a near-duplicate.
    Returns True if the memory was stored, False if it was skipped.
    """
    collection = get_or_create_collection(user_id)

    # Compute embedding for the new memory
    new_embedding = sentence_transformer_ef([memory["content"]])[0]

    # Query for similar existing memories
    results = collection.query(
        query_embeddings=[new_embedding],
        n_results=5,
        include=["distances", "metadatas"]
    )

    # Check if any existing memory is too similar
    if results["distances"] and len(results["distances"][0]) > 0:
        min_distance = min(results["distances"][0])
        if (1 - min_distance) > similarity_threshold:  # cosine distance to similarity
            print(f"Skipping duplicate: {memory['content']} (similarity: {1 - min_distance:.2f})")
            return False

    # Stable ID: built-in hash() is salted per process, so its values change between runs
    content_hash = hashlib.sha256(memory["content"].encode("utf-8")).hexdigest()[:16]

    # Add the new memory, reusing the embedding computed above
    collection.add(
        documents=[memory["content"]],
        embeddings=[new_embedding],
        metadatas=[{
            "importance": memory.get("importance", 5),
            "category": memory.get("category", "other"),
            "timestamp": memory.get("timestamp", ""),
            "hit_count": 0
        }],
        ids=[f"mem_{user_id}_{content_hash}"]
    )
    print(f"Added memory: {memory['content']}")
    return True
Use per-user collections for isolation
Putting all users in one collection with a 'user_id' filter is fine for small scale (< 10K users). Beyond that, use separate collections or a partitioned index to avoid cross-user contamination during retrieval.
Production Insight
We hit a 23% accuracy drop in our recommendation engine when we switched from per-user collections to a single collection with filters. The vector search was returning memories from other users because the filter was applied after the ANN search, not during it. Fix: we switched to per-user collections, which also improved search latency by 40%.
Key Takeaway
Always deduplicate at write time to prevent store bloat. Use per-user collections for isolation at scale. Track hit counts to enable smart pruning.
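
The hit counter and the pruning rule described above can be sketched without a vector store. This is an illustrative sketch, not the team's consolidation job: `MemoryRecord`, `record_hit`, and `select_for_pruning` are hypothetical names, and "fewer than 5 hits in 30 days" is interpreted here as "at least 30 days old with fewer than 5 lifetime hits". A true rolling window would need a timestamp per hit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryRecord:
    memory_id: str
    content: str
    created_at: datetime
    hit_count: int = 0


def record_hit(record: MemoryRecord) -> None:
    """Call whenever a memory is retrieved and injected into a prompt."""
    record.hit_count += 1


def select_for_pruning(
    records: list[MemoryRecord],
    now: datetime,
    min_hits: int = 5,
    window_days: int = 30,
) -> list[str]:
    """Return ids of memories old enough to judge and rarely used."""
    cutoff = now - timedelta(days=window_days)
    return [
        r.memory_id
        for r in records
        if r.created_at <= cutoff and r.hit_count < min_hits
    ]


now = datetime(2026, 5, 22, tzinfo=timezone.utc)
records = [
    MemoryRecord("m1", "User likes Python", now - timedelta(days=45), hit_count=2),
    MemoryRecord("m2", "User works at Acme", now - timedelta(days=45), hit_count=9),
    MemoryRecord("m3", "User is learning Rust", now - timedelta(days=3)),
]
print(select_for_pruning(records, now))  # only the old, rarely used memory
```

New memories are exempt until they are older than the window, so a fact extracted yesterday is not pruned just because it has not been retrieved yet.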

When NOT to Use LLM Memory — The Case for Stateless Design

Not every application needs long-term memory. In fact, adding memory to a system that doesn't need it adds latency, cost, and complexity. Here's when you should skip it:

  1. One-shot tasks: If users interact with your app once (e.g., a translation tool), memory adds no value. The user won't come back.
  2. Highly sensitive data: If your app deals with PII or health data, storing user conversations as memories creates compliance headaches. GDPR right-to-erasure becomes a nightmare when memories are spread across vector stores.
  3. High-throughput, low-latency systems: If you need sub-200ms responses, the memory retrieval step adds 50-100ms. Skip it.
  4. When the context window is enough: For short conversations (< 4K tokens), just include the raw history. No need for extraction.
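
For the last case, "just include the raw history" can be as simple as the sketch below. It assumes chat-style message dicts with `role` and `content` keys, and it uses a rough words-times-1.3 token estimate rather than a real tokenizer. `build_stateless_messages` is a hypothetical name.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 1.3 tokens per whitespace-separated word."""
    return int(len(text.split()) * 1.3)


def build_stateless_messages(
    system_prompt: str,
    history: list[dict],
    user_message: str,
    budget_tokens: int = 4000,
) -> list[dict]:
    """Send raw history as-is, dropping the oldest turns if it exceeds the budget."""
    kept: list[dict] = []
    used = estimate_tokens(user_message)
    # Walk backwards so the most recent turns are kept first.
    for msg in reversed(history):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return [
        {"role": "system", "content": system_prompt},
        *kept,
        {"role": "user", "content": user_message},
    ]
```

There is no extraction call, no embedding, and no store to maintain. The only state is the history the client already holds.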

We learned this the hard way when we added memory to our internal log analysis tool. Users would run a single query, get an answer, and leave. The memory store grew to 50K entries in a month, and nobody ever retrieved them. We removed memory and saved $800/month in embedding API costs.

no_memory_decision.py (Python)
def should_use_memory(config: dict) -> bool:
    """
    Decision function for whether to enable long-term memory.
    Returns False if memory would add cost without benefit.
    """
    # If average session length is 1 interaction, skip memory
    if config["avg_session_length"] <= 1:
        return False
    
    # If latency budget is under 200ms, skip memory (adds 50-100ms)
    if config["latency_budget_ms"] < 200:
        return False
    
    # If dealing with PII and no GDPR compliance path, skip memory
    if config["has_pii"] and not config["gdpr_compliant"]:
        print("Warning: Memory with PII without GDPR compliance is risky")
        return False
    
    # If context window is large enough for full conversation, skip memory
    if config["max_tokens"] >= config["avg_conversation_tokens"] * 2:
        return False
    
    return True
Memory is a feature, not a requirement
Before adding memory, ask: 'Will the user's experience be noticeably worse without it?' If the answer is 'maybe', start without memory and add it later when you have data to justify the cost.
Production Insight
A fraud detection pipeline we consulted on added memory to track user behavior across sessions. It caused a 12% false positive rate increase because the memory system was retrieving outdated behavior patterns. The fix was to add a 'recency_weight' to memory retrieval, decaying older memories. But the simpler fix was to remove memory entirely and use a real-time feature store instead.
Key Takeaway
Memory is not free. Evaluate whether your use case actually benefits from cross-session context. If not, save the latency and cost.
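
If you do keep memory, the 'recency_weight' idea from the fraud-detection example can be sketched as a decay factor multiplied into the similarity score. The exponential form and the 30-day half-life are assumptions chosen for illustration; the source does not say which decay function was used.

```python
def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Weight halves every `half_life_days`: 1.0 when new, 0.5 at one half-life."""
    return 0.5 ** (age_days / half_life_days)


def rerank_by_recency(
    candidates: list[dict], half_life_days: float = 30.0
) -> list[dict]:
    """
    candidates: dicts with 'content', 'similarity' (0-1), and 'age_days'.
    Returns copies sorted by similarity * recency weight, best first.
    """
    scored = [
        {**c, "score": c["similarity"] * recency_weight(c["age_days"], half_life_days)}
        for c in candidates
    ]
    return sorted(scored, key=lambda c: c["score"], reverse=True)


candidates = [
    {"content": "Usually pays by wire transfer", "similarity": 0.9, "age_days": 90},
    {"content": "Switched to card payments", "similarity": 0.7, "age_days": 3},
]
for c in rerank_by_recency(candidates):
    print(f"{c['score']:.3f}  {c['content']}")
```

With these toy numbers the 90-day-old memory is weighted by 0.125, so the fresher but less similar memory ranks first.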

Production Patterns for Scaling Memory to 10K+ Users

Scaling memory to thousands of users requires more than just a vector store. Here are the patterns we use in production:

  1. Shard by user ID: Use consistent hashing to distribute users across multiple ChromaDB instances. This prevents a single instance from becoming a bottleneck.
  2. Batch extraction: Don't extract memories after every message. Batch them: collect 5-10 messages, then extract in one call. This reduces API calls by 80%.
  3. Lazy retrieval: Don't retrieve memories on every turn. Retrieve only when the conversation enters a new topic (detected by embedding similarity drop > 0.3).
  4. Memory TTL: Set a time-to-live on memories. For most apps, 30 days is enough. After that, archive to cold storage (S3) and only retrieve if explicitly needed.
  5. Pre-compute embeddings: For known users, pre-compute and cache their top 10 memories every hour. This avoids the retrieval step for most interactions.

We serve 15K active users with this setup. The p99 retrieval latency is 45ms, and the monthly embedding cost is $1,200.
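
For the shard-by-user-ID pattern, a plain `hash % shard_count` is stable but remaps most users whenever the shard count changes. A hash ring avoids that. The sketch below is a generic consistent-hashing ring, not the production setup described here; `HashRing`, the shard names, and the 100 virtual nodes per shard are assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable integer hash (MD5 is used for distribution, not security)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent hashing with virtual nodes to spread load evenly."""

    def __init__(self, shards: list[str], vnodes: int = 100):
        # Each shard owns `vnodes` points on the ring.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def get_shard(self, user_id: str) -> str:
        # First ring point clockwise from the user's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(user_id)) % len(self._ring)
        return self._ring[idx][1]


ring4 = HashRing([f"shard-{i}" for i in range(4)])
ring5 = HashRing([f"shard-{i}" for i in range(5)])
users = [f"user_{i}" for i in range(1000)]
moved = sum(ring4.get_shard(u) != ring5.get_shard(u) for u in users)
print(f"{moved} of {len(users)} users moved after adding a fifth shard")
```

Users that move after adding a shard only ever move to the new shard, so a rebalance touches a fraction of the data instead of nearly all of it.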

memory_scaling.py (Python)
import hashlib
from typing import List

import numpy as np

# Assumed to exist elsewhere in the codebase (not shown here):
#   extract_memories(messages)            -> extraction call from memory_extraction.py
#   retrieve_memories(user_id, embedding) -> vector-store lookup
#   cache                                 -> Redis-like wrapper with get(key) and set(key, value, ttl=seconds)

# Shard configuration: map user_id to shard number
SHARD_COUNT = 4

def get_shard(user_id: str) -> int:
    """
    Stable hash-mod sharding: the same user always maps to the same ChromaDB instance.
    Note: this is not true consistent hashing; changing SHARD_COUNT remaps most users.
    """
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return hash_val % SHARD_COUNT

def batch_extract(conversation_buffer: List[dict]) -> List[dict]:
    """
    Extract memories from a batch of messages.
    The caller triggers this every 5 messages or when the buffer reaches 2000 tokens.
    """
    if len(conversation_buffer) < 3:
        return []  # Not enough context for extraction

    # Call extraction LLM (simplified)
    memories = extract_memories(conversation_buffer)

    # Clear buffer after extraction
    conversation_buffer.clear()
    return memories

def lazy_retrieve(user_id: str, current_embedding: List[float], threshold: float = 0.3) -> List[dict]:
    """
    Only retrieve memories if the current query is semantically different from the last one.
    """
    # Get last query embedding from cache (Redis)
    last_embedding = cache.get(f"last_embedding_{user_id}")

    if last_embedding is not None:
        # Compute cosine similarity
        similarity = np.dot(current_embedding, last_embedding) / (
            np.linalg.norm(current_embedding) * np.linalg.norm(last_embedding)
        )
        if similarity > (1 - threshold):  # Same topic: reuse cached memories
            cached = cache.get(f"cached_memories_{user_id}")
            if cached is not None:
                return cached
            # Cached memories expired: fall through and retrieve again

    # Retrieve fresh memories
    memories = retrieve_memories(user_id, current_embedding)

    # Update cache
    cache.set(f"last_embedding_{user_id}", current_embedding, ttl=300)
    cache.set(f"cached_memories_{user_id}", memories, ttl=300)

    return memories
Cache aggressively to avoid redundant retrieval
Most users' conversations stay on the same topic for 5-10 turns. Caching the last retrieval result and only refreshing on topic change can cut retrieval calls by 70%.
Production Insight
We forgot to clear the conversation buffer after batch extraction. The buffer grew to 50 messages, and the extraction prompt exceeded the 8K token limit. The LLM started returning truncated responses, losing half the memories. Fix: always clear the buffer after extraction, and add a hard cap of 10 messages or 3000 tokens before forcing extraction.
Key Takeaway
Batch extraction, lazy retrieval, and pre-computed caches are essential for scaling. Always clear buffers after extraction to prevent overflow.
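
The hard cap from the incident above (10 messages or 3000 tokens) is easiest to enforce inside the buffer itself, so callers cannot forget it. This is a minimal sketch: `ExtractionBuffer` is a hypothetical class, `extract_fn` stands in for the real extraction call, and token counts use a rough words-times-1.3 estimate.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 1.3 tokens per whitespace-separated word."""
    return int(len(text.split()) * 1.3)


class ExtractionBuffer:
    """Collects messages and forces extraction once a hard cap is reached."""

    def __init__(
        self,
        extract_fn: Callable[[list[dict]], list[dict]],
        max_messages: int = 10,
        max_tokens: int = 3000,
    ):
        self.extract_fn = extract_fn
        self.max_messages = max_messages
        self.max_tokens = max_tokens
        self._messages: list[dict] = []
        self._tokens = 0

    def __len__(self) -> int:
        return len(self._messages)

    def add(self, message: dict) -> list[dict]:
        """Buffer a message; returns extracted memories if a cap was hit, else []."""
        self._messages.append(message)
        self._tokens += estimate_tokens(message["content"])
        if len(self._messages) >= self.max_messages or self._tokens >= self.max_tokens:
            return self.flush()
        return []

    def flush(self) -> list[dict]:
        """Extract from whatever is buffered (also call this at session end)."""
        if not self._messages:
            return []
        # Swap the batch out before calling the LLM, so the buffer is empty
        # even if extraction raises. The trade-off: a failed batch is dropped
        # unless the caller retries with it.
        batch, self._messages, self._tokens = self._messages, [], 0
        return self.extract_fn(batch)
```

Exporting `len(buffer)` as a metric gives an early warning that does not depend on latency alerts.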

Common Mistakes That Cost Real Money — With Specific Examples

Here are the three most expensive mistakes we've seen teams make with LLM memory:

  1. Extracting after every turn: A team building a personal assistant extracted memories after every user message. With 10 messages per session and 1000 users, that's 10K extraction calls per day. At $0.0015 per call (GPT-4o-mini), that's $15/day or $450/month. But they also injected all memories into the prompt, adding 2000 tokens per turn. That's another $20/day. Total: $35/day for a feature that didn't improve user satisfaction. Fix: extract every 5th message, inject only top 5 memories.
  2. No deduplication: Another team stored every extracted factoid without checking for duplicates. After a week, one user had 200 memories, 80% of which were duplicates like 'User likes coffee' and 'User prefers coffee'. The injection prompt was 4000 tokens just for memories. Fix: add cosine similarity dedup at write time.
  3. Switching embedding models without re-embedding: A team embedded their stored memories with 'text-embedding-3-small'. When they switched to 'text-embedding-3-large' for better accuracy, new queries were embedded with the new model while the stored vectors still came from the old one. Vectors from different models are not comparable, so retrieval returned garbage. Fix: version your embeddings. Store the model name in metadata and re-embed on model change.
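
The arithmetic behind the first mistake is worth checking by hand. The sketch below reproduces the figures quoted above, assuming one session per user per day and a 30-day month. The $20/day injection cost is taken from the text as given rather than derived, and `extraction_cost_per_day` is a hypothetical helper.

```python
def extraction_cost_per_day(
    messages_per_session: int,
    sessions_per_day: int,
    cost_per_call: float,
    extract_every: int = 1,
) -> float:
    """Daily extraction spend when extracting once every `extract_every` messages."""
    calls_per_session = messages_per_session // extract_every
    return calls_per_session * sessions_per_day * cost_per_call


COST_PER_CALL = 0.0015          # per extraction call, as quoted above
INJECTION_COST_PER_DAY = 20.0   # quoted above, not derived here

every_turn = extraction_cost_per_day(10, 1000, COST_PER_CALL)
every_fifth = extraction_cost_per_day(10, 1000, COST_PER_CALL, extract_every=5)

print(f"Extract every turn:  ${every_turn:.2f}/day, ${every_turn * 30:.0f}/month")
print(f"Extract every 5th:   ${every_fifth:.2f}/day, ${every_fifth * 30:.0f}/month")
print(f"Every turn + inject: ${every_turn + INJECTION_COST_PER_DAY:.2f}/day")
```

Moving from every turn to every fifth message cuts extraction calls from 10,000 to 2,000 per day under these assumptions.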
cost_mistakes.py (Python)
import hashlib

# Illustrative snippets: conversation, user_id, collection, embed_model,
# extract_memories, store_memories and re_embed_all_memories are assumed
# to be defined elsewhere.

def memory_id(content: str) -> str:
    """Stable ID; built-in hash() is salted per process and changes between runs."""
    return "mem_" + hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]

# Mistake 1: Extracting after every turn (bad)
# This costs $0.0015 per call * 10K calls/day = $15/day
for message in conversation:
    memories = extract_memories([message])  # Too frequent!
    store_memories(user_id, memories)

# Fix: Batch extraction
conversation_buffer = []
for message in conversation:
    conversation_buffer.append(message)
    if len(conversation_buffer) >= 5:  # Extract every 5th message
        memories = extract_memories(conversation_buffer)
        store_memories(user_id, memories)
        conversation_buffer.clear()
# Leftover messages (fewer than 5) stay buffered until the next batch or session end

# Mistake 2: No deduplication (bad)
# User ends up with 200 memories, 80% duplicates
def store_memories_no_dedup(user_id, memories):
    for m in memories:
        collection.add(documents=[m["content"]], ids=[memory_id(m["content"])])

# Fix: Add deduplication (assumes the collection uses cosine distance)
def store_memories_with_dedup(user_id, memories, collection):
    for m in memories:
        # Compute embedding (encode returns a NumPy array; pass Chroma a plain list)
        emb = embed_model.encode([m["content"]]).tolist()
        # Query for similar
        results = collection.query(query_embeddings=emb, n_results=1)
        if results["distances"] and len(results["distances"][0]) > 0:
            if (1 - results["distances"][0][0]) > 0.85:
                continue  # Skip duplicate
        collection.add(documents=[m["content"]], embeddings=emb, ids=[memory_id(m["content"])])

# Mistake 3: No embedding versioning (bad)
# Old embeddings become incompatible after model change
EMBEDDING_MODEL = "text-embedding-3-small"  # Hardcoded, no version tracking

# Fix: Version your embeddings
def add_memory_with_version(user_id, memory, model_version="v1"):
    collection.add(
        documents=[memory["content"]],
        metadatas=[{"embedding_model": model_version}],  # one metadata dict per document
        ids=[memory_id(memory["content"])]
    )

# On model change, re-embed all memories
if current_model_version != stored_model_version:
    re_embed_all_memories(new_model)
Embedding model changes break retrieval silently
There's no error when you switch embedding models. Retrieval just starts returning garbage. Always store the model version in metadata and re-embed on change.
Production Insight
A team building a CRM assistant forgot to deduplicate. One user had 47 memories about their company name. The injection prompt was 3000 tokens of 'Company name is Acme Corp' variations. The assistant started ignoring other memories because the prompt was saturated. Fix: deduplication reduced memories from 47 to 3, and the assistant started working correctly.
Key Takeaway
The three most expensive mistakes are: extracting too often, not deduplicating, and ignoring embedding model versioning. Each can cost thousands per month.
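
Because a model switch fails silently, the version check has to be explicit. A minimal sketch, assuming each stored memory carries an `embedding_model` entry in its metadata; `needs_reembed` and `plan_reembed` are hypothetical helpers, and records with no recorded model are treated as stale.

```python
def needs_reembed(stored_metadata: dict, current_model: str) -> bool:
    """True if the vector was produced by a different (or unrecorded) model."""
    return stored_metadata.get("embedding_model") != current_model


def plan_reembed(memories: list[dict], current_model: str) -> list[str]:
    """Return ids of memories whose vectors must be recomputed."""
    return [
        m["id"]
        for m in memories
        if needs_reembed(m.get("metadata", {}), current_model)
    ]


memories = [
    {"id": "m1", "metadata": {"embedding_model": "text-embedding-3-small"}},
    {"id": "m2", "metadata": {"embedding_model": "text-embedding-3-large"}},
    {"id": "m3", "metadata": {}},  # written before versioning was added
]
print(plan_reembed(memories, "text-embedding-3-large"))  # ids still on the old model
```

Running this check at startup and refusing to serve retrieval while the list is non-empty turns a silent quality drop into a visible deployment step.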

Comparison: LangMem vs. Custom Memory vs. Mem0

We evaluated three approaches for memory management: LangMem (LangChain's memory module), a custom-built system, and Mem0 (an open-source memory layer). Here's the production comparison:

LangMem: Good for quick prototyping. It handles extraction and storage out of the box. But it's opinionated: it uses LangChain's abstractions, which can be hard to customize. We found it hard to add custom deduplication logic. Also, it uses OpenAI embeddings by default, which adds API costs. For production, we needed more control.

Mem0: Excellent for teams that want a turnkey solution. It handles extraction, storage, and retrieval with a simple API. But it's a black box: when something goes wrong (e.g., token leak), it's hard to debug. We also hit a bug where Mem0's consolidation cron job ran every hour and caused latency spikes. The fix was to disable the hourly cron and run consolidation ourselves, nightly during low traffic.

Custom system: This is what we ended up with. It gives us full control over every aspect: extraction prompt, deduplication logic, storage backend, retrieval strategy. The trade-off is development time: it took us 2 weeks to build vs. 2 days to integrate LangMem. But for a system handling 15K users, the control is worth it.

Recommendation: Start with LangMem or Mem0 for MVP. Switch to custom when you hit scaling or customization limits.

comparison.py (Python)
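# LangMem example (quick start)
# Illustrative pseudocode: the class and method names below show the shape of
# the workflow and may not match LangMem's actual API. Check the LangMem docs
# for the current entry points before copying this.
from langmem import MemoryManager

manager = MemoryManager()
# This handles extraction and storage automatically, but hard to customize
manager.add_conversation(conversation, user_id="user_123")
memories = manager.get_relevant_memories("What does the user like?", user_id="user_123")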

# Mem0 example (turnkey)
from mem0 import Memory

m = Memory()
# Black box: extraction, storage, retrieval all in one call
m.add("User likes Python", user_id="user_123")
memories = m.search("programming preferences", user_id="user_123")

# Custom system (full control)
# We control every step:
# 1. Extraction prompt
memories = extract_memories(conversation)  # custom prompt
# 2. Deduplication
for mem in memories:
    add_memory_with_dedup(user_id, mem)  # custom logic
# 3. Retrieval with re-ranking
memories = retrieve_and_rerank(user_id, query, top_k=5)  # custom strategy
Don't over-engineer early
For the first 1000 users, LangMem or Mem0 will work fine. The complexity of a custom system only pays off when you hit specific scaling or customization issues.
Production Insight
We started with Mem0 and hit a 2-second latency spike every hour due to its consolidation cron job. The cron was re-embedding all memories hourly. For 15K users, that's 15K embedding calls per hour, which saturated the API rate limit. Fix: we disabled the cron and ran consolidation nightly during low traffic.
Key Takeaway
LangMem and Mem0 are great for prototyping. Custom systems give you the control needed for production at scale. Choose based on your team's bandwidth and scaling needs.
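
One way to keep the "start turnkey, switch to custom later" path cheap is to hide the memory layer behind a small interface from day one. The sketch below is an assumption about how that could look, not how any of the three options is structured: `MemoryBackend` is a hypothetical protocol, and `InMemoryBackend` is a toy that ranks by word overlap instead of embeddings.

```python
from typing import Protocol


class MemoryBackend(Protocol):
    """The only surface the rest of the app is allowed to depend on."""

    def add(self, user_id: str, content: str) -> None: ...

    def search(self, user_id: str, query: str, top_k: int = 5) -> list[str]: ...


class InMemoryBackend:
    """Toy backend for tests: ranks by shared words, not by embeddings."""

    def __init__(self) -> None:
        self._store: dict[str, list[str]] = {}

    def add(self, user_id: str, content: str) -> None:
        self._store.setdefault(user_id, []).append(content)

    def search(self, user_id: str, query: str, top_k: int = 5) -> list[str]:
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(content.lower().split())), content)
            for content in self._store.get(user_id, [])
        ]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [content for score, content in ranked if score > 0][:top_k]


def build_context(backend: MemoryBackend, user_id: str, query: str) -> str:
    """Format retrieved memories for injection into a prompt."""
    return "\n".join(f"- {m}" for m in backend.search(user_id, query))


backend = InMemoryBackend()
backend.add("user_123", "User likes Python")
backend.add("user_123", "User lives in Berlin")
print(build_context(backend, "user_123", "python preferences"))
```

A wrapper around a library, or around the custom ChromaDB pipeline, would implement the same two methods, so switching backends does not touch prompt-building code.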

Debugging and Monitoring Memory Systems in Production

You can't fix what you can't see. Here's the monitoring setup we use for our memory system:

  1. Memory store size per user: Track the number of memories per user. Alert if any user exceeds 500 memories (indicates dedup failure).
  2. Extraction call count: Track the number of extraction calls per user per session. Alert if > 10 calls per session (indicates buffer not clearing).
  3. Injection token count: Track the number of tokens injected into the prompt from memory. Alert if > 2000 tokens (indicates no re-ranking).
  4. Retrieval latency: Track p50, p95, p99 of memory retrieval. Alert if p99 > 200ms.
  5. Embedding model version: Track the current embedding model version. Alert if it changes without a re-embed job.

We use Prometheus for metrics and Grafana for dashboards. Here's a sample metric definition.

memory_monitoring.py (Python)
from prometheus_client import Counter, Histogram, Gauge
import time

# Assumed to exist elsewhere (not shown): retrieve_memories(), extract_memories(),
# and count_memories(user_id), a hypothetical helper returning the total number
# of memories stored for a user.

# Metrics
# Note: a user_id label creates one time series per user. At 10K+ users,
# consider exporting aggregates (max, percentiles) to limit label cardinality.
MEMORY_STORE_SIZE = Gauge('memory_store_size_per_user', 'Number of memories per user', ['user_id'])
EXTRACTION_CALLS = Counter('extraction_calls_total', 'Total extraction calls', ['user_id'])
INJECTION_TOKENS = Histogram('injection_tokens_per_prompt', 'Tokens injected per prompt', buckets=[100, 500, 1000, 2000, 5000])
RETRIEVAL_LATENCY = Histogram('retrieval_latency_seconds', 'Latency of memory retrieval', buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1.0])

def estimate_tokens(memories: list[dict]) -> float:
    """Rough estimate: about 1.3 tokens per whitespace-separated word."""
    return sum(len(m["content"].split()) for m in memories) * 1.3

def monitored_retrieve(user_id: str, query: str) -> list[dict]:
    start = time.perf_counter()
    memories = retrieve_memories(user_id, query)
    RETRIEVAL_LATENCY.observe(time.perf_counter() - start)

    # Retrieved memories are what gets injected, so track injection tokens here
    INJECTION_TOKENS.observe(estimate_tokens(memories))

    # Store size is the total stored for the user, not the number retrieved
    MEMORY_STORE_SIZE.labels(user_id=user_id).set(count_memories(user_id))

    return memories

def monitored_extract(user_id: str, conversation: list[dict]) -> list[dict]:
    EXTRACTION_CALLS.labels(user_id=user_id).inc()
    return extract_memories(conversation)
Alert on extraction call count, not just latency
A sudden spike in extraction calls is often the first sign of a bug (e.g., buffer not clearing). Latency alerts come too late — you're already burning money.
Production Insight
We had a silent failure where the extraction buffer wasn't clearing after a batch extract. The buffer grew to 50 messages, and extraction calls started taking 10+ seconds. The p99 latency alert fired, but by then we'd already spent $200 on extra extraction calls. Fix: we added a metric for buffer size and alerted if it exceeded 10 messages.
Key Takeaway
Monitor memory store size, extraction call count, injection token count, and retrieval latency. Alert early on extraction call spikes to catch bugs before they cost money.
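
The five alert conditions listed in this section can be expressed as plain data plus one check function, which is handy for unit-testing thresholds before wiring them into an alerting system. A minimal sketch: the snapshot keys and `check_alerts` are hypothetical names, and real alerts would normally live in the monitoring stack rather than in application code.

```python
# Thresholds taken from the monitoring list in this section.
ALERT_RULES = {
    "memories_per_user": 500,            # dedup failure
    "extraction_calls_per_session": 10,  # buffer not clearing
    "injection_tokens": 2000,            # no re-ranking
    "retrieval_p99_ms": 200,             # slow retrieval
}


def check_alerts(snapshot: dict) -> list[str]:
    """Return a human-readable alert for every threshold the snapshot exceeds."""
    alerts = [
        f"{name}={snapshot[name]} exceeds {limit}"
        for name, limit in ALERT_RULES.items()
        if snapshot.get(name, 0) > limit
    ]
    # Model used for new queries vs. model the stored vectors were built with.
    if snapshot.get("embedding_model") != snapshot.get("indexed_embedding_model"):
        alerts.append("embedding model changed without a re-embed job")
    return alerts


snapshot = {
    "memories_per_user": 120,
    "extraction_calls_per_session": 4,
    "injection_tokens": 3500,
    "retrieval_p99_ms": 45,
    "embedding_model": "text-embedding-3-large",
    "indexed_embedding_model": "text-embedding-3-small",
}
for alert in check_alerts(snapshot):
    print(alert)
```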
Production incident · Post-mortem · Severity: high

The Unbounded Memory Leak That Cost $4,000 a Month

Symptom
P99 latency jumped from 1.2s to 4.7s. Daily token usage spiked from 2M to 6.5M. Users saw 'Sorry, I'm having trouble processing your request' errors.
Assumption
We assumed memory extraction was idempotent and that the LLM would naturally stop extracting when it had enough information about a user.
Root cause
The extraction pipeline had no deduplication step. After every user message, we called extract_memories(), which returned 5-10 new factoids per turn, even if they were redundant with existing ones. The vector store grew linearly with conversation length, and we injected all memories into the system prompt without any budget or relevance filtering.
Fix
  1. Added a deduplication step: before inserting a new memory, we compute cosine similarity against existing memories. If similarity > 0.85, skip insertion.
  2. Implemented a token budget: limit memory injection to 1500 tokens per prompt, using a re-ranker to select the most relevant memories.
  3. Added a consolidation cron job that runs daily to merge similar memories and prune ones older than 30 days with no hits.
  4. Set a hard limit of 500 memories per user in the vector store.
Key lesson
  • Always set a hard cap on the number of memories stored per user. Unbounded growth is a ticking time bomb.
  • Implement deduplication at extraction time, not just at retrieval time. It's cheaper to skip a write than to filter a read.
  • Monitor memory store size and injection token count as a standard metric. We now have a dashboard for 'memories per user' and 'memory token % of prompt'.
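
The 1500-token budget from the fix can be sketched as a greedy selection over already-scored memories. This is an illustration under stated assumptions, not the production re-ranker: each memory is assumed to carry a `relevance` score from an earlier step, tokens are estimated at about 1.3 per word, and `select_within_budget` is a hypothetical name.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 1.3 tokens per whitespace-separated word."""
    return int(len(text.split()) * 1.3)


def select_within_budget(
    memories: list[dict], budget_tokens: int = 1500
) -> list[dict]:
    """
    memories: dicts with 'content' and 'relevance' (higher is better).
    Greedily keeps the most relevant memories that fit in the budget.
    """
    selected: list[dict] = []
    used = 0
    for mem in sorted(memories, key=lambda m: m["relevance"], reverse=True):
        cost = estimate_tokens(mem["content"])
        if used + cost > budget_tokens:
            continue  # too big; a smaller, lower-ranked memory may still fit
        selected.append(mem)
        used += cost
    return selected
```

Whatever the re-ranker returns, the prompt never receives more than the budget, so unbounded store growth can no longer turn into unbounded prompt growth.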
Production debug guide: when the token bill spikes at 2am (4 entries)
Symptom · 01
Sudden token cost increase without traffic change
Fix
Check memory extraction logs: grep 'extract_memories' /var/log/app/llm.log | wc -l vs yesterday. If count > 2x, extraction prompt is too aggressive.
Symptom · 02
Stale or irrelevant memories being injected
Fix
Query the vector store for the user: collection.get(where={'user_id': 'abc'}, limit=50). Check if old memories have high similarity to current query.
Symptom · 03
Memory retrieval returns no results for known users
Fix
Verify embedding model is consistent: curl -X POST http://localhost:8000/embed -d '{"input": "test"}'. Compare hash of model config with prod config.
Symptom · 04
Memory consolidation removes too many memories
Fix
Check consolidation logs: grep 'consolidated' /var/log/app/memory.log. If deletion count > 20% of total, raise the similarity threshold from 0.85 to 0.90 so that only closer matches are merged.
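To see how the consolidation threshold drives the deletion rate, a dry run before the real job is useful. The sketch below is an assumption about how such a check might be written, not the actual cron job: `plan_consolidation` is a hypothetical name, and it only drops near-duplicates instead of merging their text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def plan_consolidation(memories, threshold):
    """Dry run: keep a memory only if no already-kept memory is more similar
    than `threshold`. Returns (kept, deleted_fraction) without touching storage."""
    kept = []
    for text, emb in memories:
        if any(cosine_similarity(emb, k) > threshold for _, k in kept):
            continue
        kept.append((text, emb))
    if not memories:
        return kept, 0.0
    return kept, (len(memories) - len(kept)) / len(memories)


# Toy 2-d embeddings; the first two are about 0.90 similar.
memories = [
    ("likes pizza", [1.0, 0.0]),
    ("enjoys Italian food", [0.9, 0.436]),
    ("lives in Berlin", [0.0, 1.0]),
]
_, loose = plan_consolidation(memories, threshold=0.85)
_, strict = plan_consolidation(memories, threshold=0.95)
too_aggressive = loose > 0.20  # the 20% guard from the debug guide
```

A higher threshold requires memories to be closer before they count as duplicates, so it deletes fewer of them.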
★ LLM Memory Management Triage Cheat Sheet
Copy-paste diagnostics. When it's 2am and you need answers fast.
Token cost spike
Immediate action
Check memory extraction frequency
Commands
python -c "import json; logs = [json.loads(l) for l in open('llm.log') if 'extract' in l]; print(f'Extractions logged: {len(logs)}')"
python -c "import json; logs = [json.loads(l) for l in open('llm.log') if 'inject_memories' in l]; print('Avg token injection:', sum(e['tokens'] for e in logs) / max(len(logs), 1))"
Fix now
Temporarily reduce extraction frequency to every 5th message. Set extraction_interval=5 in config.
No memories retrieved for known user
Immediate action
Verify vector store connection and embedding model
Commands
python -c "import chromadb; client = chromadb.HttpClient(); print(client.heartbeat())"
python -c "from sentence_transformers import SentenceTransformer; m = SentenceTransformer('all-MiniLM-L6-v2'); print(m.encode('test').shape)"
Fix now
If heartbeat fails, restart ChromaDB: docker restart chromadb.
Memory consolidation deletes too much
Immediate action
Check consolidation threshold and last run
Commands
python -c "import json; print(json.load(open('memory_config.json'))['consolidation_threshold'])"
python -c "import datetime; ts = float(open('consolidation_last_run.txt').read().strip()); print('Last consolidation:', datetime.datetime.fromtimestamp(ts))"
Fix now
Increase threshold from 0.85 to 0.90 and re-run consolidation manually.
Memory injection causes prompt overflow
Immediate action
Check token budget and current injection size
Commands
python -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-4'); print('Budget: 1500, Current:', len(enc.encode(open('last_prompt.txt').read())))"
python -c "print('Memories in prompt:', len(open('last_memories.txt').read().splitlines()))"
Fix now
Reduce token budget to 800 and enable relevance re-ranking.
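The "every 5th message" mitigation for a token cost spike comes down to a one-line gate in front of the extraction call. A minimal sketch, assuming a 1-based turn counter per conversation; `should_extract` is a hypothetical helper, and only the `extraction_interval` setting comes from the cheat sheet.

```python
def should_extract(turn_index, extraction_interval=5):
    """Return True on every `extraction_interval`-th user message.

    turn_index is the 1-based count of user messages in the conversation.
    """
    if extraction_interval < 1:
        raise ValueError("extraction_interval must be >= 1")
    return turn_index % extraction_interval == 0


# Turns on which extraction would run during a 10-message conversation.
extraction_turns = [t for t in range(1, 11) if should_extract(t)]
```

With an interval of 5, a 10-message conversation triggers two extraction calls instead of ten.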
LLM Memory Solutions Comparison
| Concern | LangMem | Custom Memory (ChromaDB) | Mem0 | Recommendation |
|---|---|---|---|---|
| Setup complexity | Low (pip install) | Medium (write code) | Medium (API key) | LangMem for prototyping |
| Deduplication | Basic (exact match) | Custom (hash + semantic) | Built-in (semantic) | Custom for control |
| Cross-session merging | Manual | Manual | Automatic | Mem0 for multi-session |
| Token cost control | Sliding window only | Full control (budget, importance) | Configurable | Custom for cost |
| Scaling to 10K+ users | Not designed for scale | Yes (shard by user_id) | Yes (managed service) | Custom or Mem0 |
| Cost | Free (open source) | Free (self-hosted) | Paid (usage-based) | Custom for low cost |

Key takeaways

1. Never append raw conversation history to prompts: use a vector store with semantic deduplication to keep context under 4K tokens.
2. Implement a sliding window with importance scoring: drop low-value memories first, not oldest ones.
3. Stateless design wins for high-throughput APIs: only add memory when user explicitly references past context.
4. Monitor token usage per session with a real-time dashboard; set alerts for >10% token leak above baseline.
5. Use ChromaDB with HNSW indexing and batch dedup at write time to avoid O(n²) comparisons at read time.
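Dropping low-value memories first, rather than the oldest, can be sketched as follows. The record shape (`text`, `importance`, `tokens`) and the function name `evict_to_budget` are assumptions for illustration; how importance is scored is left to your pipeline.

```python
def evict_to_budget(memories, budget_tokens):
    """Drop the least important memories until the total fits the budget.

    memories: list of dicts with 'text', 'importance' and 'tokens' keys,
    ordered oldest first. Survivors keep their original order.
    """
    total = sum(m["tokens"] for m in memories)
    # Least important first; ties are broken by age (older dropped first).
    order = sorted(range(len(memories)), key=lambda i: (memories[i]["importance"], i))
    dropped = set()
    for i in order:
        if total <= budget_tokens:
            break
        dropped.add(i)
        total -= memories[i]["tokens"]
    return [m for i, m in enumerate(memories) if i not in dropped]


window = [
    {"text": "allergic to peanuts", "importance": 0.9, "tokens": 5},
    {"text": "said hello", "importance": 0.1, "tokens": 4},
    {"text": "prefers email", "importance": 0.6, "tokens": 4},
]
survivors = evict_to_budget(window, budget_tokens=9)
```

In this example the oldest memory survives because it is the most important one, which is exactly the behaviour a plain age-based window would get wrong.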

Common mistakes to avoid


Appending full history to every prompt

Symptom
Token usage grows linearly with conversation length; $4k/month bill spike on 10K users
Fix
Store conversation chunks in ChromaDB with cosine similarity dedup; retrieve only top-3 relevant chunks per turn.

No deduplication on memory writes

Symptom
Duplicate entries cause redundant token consumption and confusing model responses
Fix
Hash each memory chunk (e.g., SHA-256 of normalized text) and skip insert if hash exists in user's memory collection.

Using LLM to summarize memory every turn

Symptom
Latency spikes to 5s+ and token cost doubles due to recursive summarization calls
Fix
Summarize only every 5 turns or when memory tokens exceed 70% of the token budget; use a lightweight model (e.g., GPT-4o-mini) for summarization.

Storing memory per-session instead of per-user

Symptom
Cross-session context lost; users repeat information every conversation
Fix
Key memory by user_id, not session_id; merge sessions via timestamp ordering and dedup on content.
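Two of the fixes above, hashing normalized text to skip duplicate writes and keying memory by user_id, fit in one small sketch. The `UserMemoryStore` class and its in-memory dictionary are assumptions standing in for a per-user collection in a real vector store; only the SHA-256-of-normalized-text idea comes from the list above.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, trim, and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def memory_hash(text):
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


class UserMemoryStore:
    """In-memory stand-in for a store keyed by user_id rather than session_id."""

    def __init__(self):
        self._by_user = {}  # user_id -> {hash: original text}

    def add(self, user_id, text):
        """Insert unless this user already has the same normalized text."""
        bucket = self._by_user.setdefault(user_id, {})
        digest = memory_hash(text)
        if digest in bucket:
            return False
        bucket[digest] = text
        return True

    def get(self, user_id):
        return list(self._by_user.get(user_id, {}).values())


store = UserMemoryStore()
added = store.add("u1", "User likes pizza")
repeat = store.add("u1", "  user likes   PIZZA ")
other_user = store.add("u2", "User likes pizza")
```

A hash only catches duplicates that are identical after normalization; paraphrases such as "loves pizza" still need a semantic similarity check.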
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
How would you design a memory system for an LLM chatbot that handles 100...
Q02 · SENIOR
Explain the trade-offs between stateless and stateful LLM architectures.
Q03 · SENIOR
How would you debug a token leak in production?
Q04 · SENIOR
What is the role of deduplication in LLM memory systems?
Q05 · SENIOR
How would you implement cross-session memory for a chatbot?
Q01 of 05 · SENIOR

How would you design a memory system for an LLM chatbot that handles 100K users?

ANSWER
Use a vector database (e.g., ChromaDB) sharded by user_id. Store compressed memory chunks with timestamps and importance scores. Retrieve top-3 chunks per query via cosine similarity. Deduplicate at write time using hash of normalized text. Implement a sliding window with token budget (e.g., 4K tokens). Use async batch writes and TTL for stale memories. Monitor token usage per user with alerts.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How do I prevent token leaks in LLM memory systems?
02
What is the difference between LangMem and Mem0?
03
Should I use stateless or stateful design for my LLM chatbot?
04
How do I scale LLM memory to 10K+ users?
05
What is the cost of LLM memory in production?