Biggest mistake: using MemoryVectorStore in serverless — it re-indexes on every cold start
Plain-English First
Imagine hiring a senior engineer who has memorized your entire codebase. You ask 'how does auth work?' and they instantly pull the relevant files and explain. That's RAG: your code is indexed into a vector database, and when you ask a question, the system finds the most relevant chunks and feeds them to an LLM that answers using YOUR code, not generic internet advice.
Most AI coding assistants hallucinate because they lack context about your specific codebase. The fix is RAG — Retrieval-Augmented Generation — which grounds the LLM in your own source code.
This 2026 guide builds a production-grade assistant. You'll use tree-sitter for AST-aware chunking, store vectors in pgvector (Neon), rate limit with Upstash Redis, and stream responses with Vercel AI SDK. Indexing runs in CI on every push — the API route only queries, never embeds.
By the end you'll have an assistant that retrieves context from your codebase and starts streaming an answer in under 200ms, token by token.
pgvector = permanent library (persists across deploys)
Upstash = bouncer at door (rate limits)
AI SDK = courier delivering pages as they're written (streaming)
Production Insight
Indexing 10k files costs ~$0.40 once, then $0.01 per push for changed files (hash dedupe).
MemoryVectorStore re-indexing cost one team $340 in a weekend.
Rule: if your vector store is in RAM, you're doing it wrong.
Key Takeaway
RAG has three stages: index (CI), retrieve (pgvector), generate (AI SDK). Never combine indexing and retrieval in the same request.
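The "hash dedupe" mentioned above can be sketched as follows. This is a minimal illustration, not the article's actual CI script: the `HashIndex` type, the `filesToReembed` helper, and the sample file paths are all hypothetical, and it assumes the hashes from the previous index run are available as a path-to-hash map.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape: file path -> SHA-256 of the content at the last index run.
type HashIndex = Map<string, string>;

function sha256(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Returns only the files whose content hash differs from the stored one,
// so unchanged files are never sent to the embedding API again.
function filesToReembed(
  files: { path: string; content: string }[],
  previous: HashIndex,
): { path: string; content: string; hash: string }[] {
  const changed: { path: string; content: string; hash: string }[] = [];
  for (const f of files) {
    const hash = sha256(f.content);
    if (previous.get(f.path) !== hash) changed.push({ ...f, hash });
  }
  return changed;
}

// Example: one unchanged file, one edited file, one new file.
const previous: HashIndex = new Map([
  ['src/auth.ts', sha256('export const a = 1;')],
  ['src/db.ts', sha256('export const b = 2;')],
]);
const changed = filesToReembed(
  [
    { path: 'src/auth.ts', content: 'export const a = 1;' },
    { path: 'src/db.ts', content: 'export const b = 3;' },
    { path: 'src/new.ts', content: 'export const c = 4;' },
  ],
  previous,
);
console.log(changed.map((f) => f.path)); // [ 'src/db.ts', 'src/new.ts' ]
```

Only the edited and the new file are returned, which is why a typical push costs a fraction of the initial index.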
Next.js 16 Streaming with Vercel AI SDK
Don't create a ReadableStream by hand. Vercel AI SDK v5 handles stream framing, backpressure, edge runtime support, and token usage reporting. Time to first token drops from about 8s to about 80ms.
Use gpt-4o for generation (128k context, $2.50/1M input tokens) with temperature 0.1 so code answers stay close to deterministic.
app/api/chat/route.ts (TypeScript)
```typescript
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { PGVectorStore } from '@langchain/community/vectorstores/pgvector';
import { OpenAIEmbeddings } from '@langchain/openai';
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
import { encoding_for_model } from 'tiktoken';

// Created once per instance and reused across warm invocations.
const ratelimit = new Ratelimit({ redis: Redis.fromEnv(), limiter: Ratelimit.slidingWindow(100, '1 d') });
const embeddings = new OpenAIEmbeddings({ model: 'text-embedding-3-large' });
const store = await PGVectorStore.initialize(embeddings, {
  postgresConnectionOptions: { connectionString: process.env.DATABASE_URL! },
  tableName: 'code_embeddings',
});
const enc = encoding_for_model('gpt-4o');

// Keep chunks in ranked order until the token budget is spent.
function capContext(chunks: string[], maxTokens = 20000): string[] {
  let total = 0;
  const kept: string[] = [];
  for (const c of chunks) {
    const t = enc.encode(c).length;
    if (total + t > maxTokens) break;
    kept.push(c);
    total += t;
  }
  return kept;
}

export async function POST(req: Request) {
  const { messages } = await req.json();

  // x-forwarded-for can hold a comma-separated proxy chain; the first entry is the client.
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ?? 'anon';
  const { success } = await ratelimit.limit(ip);
  if (!success) return new Response('Rate limited', { status: 429 });

  const query = messages.at(-1).content;
  const results = await store.similaritySearchWithScore(query, 12);
  // PGVectorStore reports a distance (cosine by default, lower is closer), so convert
  // it to a similarity before applying the 0.78 threshold. Verify for your version.
  const filtered = results.filter(([, distance]) => 1 - distance > 0.78);
  if (filtered.length === 0) return new Response("I don't see this in the codebase.");

  const context = capContext(
    filtered.map(([d]) => `[${d.metadata.source}]\n${d.pageContent}`),
  ).join('\n\n---\n\n');

  const result = streamText({
    model: openai('gpt-4o'),
    system: `You are a coding assistant. Answer using ONLY this code. Cite files.\n\n${context}`,
    messages,
    temperature: 0.1,
  });
  return result.toTextStreamResponse();
}
```
AI SDK vs Manual Streaming
A hand-rolled ReadableStream tends to miss backpressure and error propagation. The AI SDK handles stream framing, retries, and usage tracking for you. Prefer streamText().toTextStreamResponse(), which returns the generated text as a plain streamed response.
Production Insight
AI SDK reduces streaming code from 40 lines to 5. It also tracks token usage for billing.
Temperature 0.1 is critical — at 0.7, gpt-4o invents APIs that match your naming style but don't exist.
Key Takeaway
Use Vercel AI SDK for streaming. Never build ReadableStream manually. Add Upstash rate limiting — in-memory Maps don't work on serverless.
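To see why an in-memory Map fails as a rate limiter on serverless, here is a small simulation. It is a sketch under stated assumptions: `InMemoryLimiter` is a hypothetical class, the three objects stand in for three separate function instances, and requests are assumed to be routed round-robin.

```typescript
// A naive per-process limiter: the counter lives in this instance's memory only.
class InMemoryLimiter {
  private counts = new Map<string, number>();
  private max: number;

  constructor(max: number) {
    this.max = max;
  }

  allow(key: string): boolean {
    const used = this.counts.get(key) ?? 0;
    if (used >= this.max) return false;
    this.counts.set(key, used + 1);
    return true;
  }
}

// Simulate three serverless instances, each with its own private memory.
const instances = [new InMemoryLimiter(2), new InMemoryLimiter(2), new InMemoryLimiter(2)];

// One client sends 6 requests; the platform spreads them across instances.
let allowed = 0;
for (let i = 0; i < 6; i++) {
  if (instances[i % instances.length].allow('203.0.113.7')) allowed++;
}

console.log(allowed); // 6, although the intended limit was 2
```

Each instance only sees its own share of the traffic, so the effective limit scales with the number of instances. A shared store such as Upstash Redis gives every instance the same counter.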
Tree-Sitter AST Chunking for Code
Regex chunking breaks on arrow functions, generics, and decorators. Tree-sitter parses code into an AST and extracts complete functions, classes, and methods, so each chunk is semantically complete.
Prepend the file's import block to every chunk so the LLM knows its dependencies. Store the symbol name, file path, and content hash in metadata for deduplication.
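The chunk assembly step described above might look like the sketch below. It deliberately leaves out the tree-sitter pass itself: `ExtractedSymbol` is a hypothetical type standing in for whatever your AST traversal yields, and `buildChunks` is an illustrative helper. The `source` metadata key is chosen to match the `d.metadata.source` lookup in the API route.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical output of an AST pass: one entry per complete function, class, or method.
interface ExtractedSymbol {
  name: string;
  source: string;
}

interface CodeChunk {
  pageContent: string;
  metadata: { source: string; symbol: string; hash: string };
}

// Prepends the file's import block to each symbol and records dedupe metadata.
function buildChunks(filePath: string, importBlock: string, symbols: ExtractedSymbol[]): CodeChunk[] {
  return symbols.map((s) => {
    const pageContent = importBlock ? `${importBlock}\n\n${s.source}` : s.source;
    const hash = createHash('sha256').update(pageContent, 'utf8').digest('hex');
    return { pageContent, metadata: { source: filePath, symbol: s.name, hash } };
  });
}

// Example with one extracted function.
const chunks = buildChunks(
  'src/users.ts',
  "import { db } from './db';",
  [{ name: 'getUserById', source: 'export function getUserById(id: string) {\n  return db.users.find(id);\n}' }],
);
console.log(chunks[0].metadata.symbol, chunks[0].metadata.hash.slice(0, 8));
```

Hashing the final chunk text (imports included) means a changed import also invalidates the chunk, which is usually what you want since the dependency context changed.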
Exclude: **/.env, **/*.pem, **/secrets/**, **/node_modules/**. Run a git-secrets scan before indexing. Never index compiled output.
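One way to enforce these exclusions in an indexing script is a path filter applied before any file is read. The sketch below uses regular expressions instead of a glob library to stay dependency-free; `DENYLIST` and `isIndexable` are hypothetical names, and the compiled-output directory names (dist, build, .next) are assumptions you should adapt to your project.

```typescript
// Hypothetical denylist over POSIX-style relative paths.
const DENYLIST: RegExp[] = [
  /(^|\/)\.env(\..*)?$/, // .env, .env.local, ...
  /\.pem$/, // private keys and certificates
  /(^|\/)secrets\//, // anything under a secrets/ directory
  /(^|\/)node_modules\//, // dependencies
  /(^|\/)(dist|build|\.next)\//, // compiled output (assumed directory names)
];

// Returns true only when no denylist pattern matches the path.
function isIndexable(path: string): boolean {
  const normalized = path.replace(/\\/g, '/'); // tolerate Windows separators
  return !DENYLIST.some((re) => re.test(normalized));
}

console.log(isIndexable('src/auth.ts')); // true
console.log(isIndexable('apps/web/.env.local')); // false
console.log(isIndexable('config/secrets/db.json')); // false
```

A path filter is only the first layer; it does not catch a key pasted into an otherwise ordinary source file, which is why the secrets scan still matters.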
Production Insight
Hybrid search surfaces getUserById in cases where vector search alone ranks getUser first (0.81 vs 0.79 similarity).
Upstash rate limiting costs $0 but prevents $500 surprise bills from bots.
Key Takeaway
Use hybrid search, exclude secrets, rate limit with Redis, monitor with Helicone/Langfuse.
Production incident · Post-mortem · Severity: high
MemoryVectorStore caused $340 in embedding costs and 90s cold starts
Symptom
First query after deploy took 90 seconds then timed out. OpenAI dashboard showed 4,000 embedding calls in one hour. Users saw 504 errors.
Assumption
The team assumed 'lazy-loading' the vector store was efficient.
Root cause
MemoryVectorStore lives in RAM and is lost on every serverless cold start. The code called indexRepository() inside the API route. Each of 20 Vercel instances re-embedded the entire repo on first request, costing $0.17 per instance and exceeding the 10s function timeout.
Fix
Moved indexing to a GitHub Action that runs on push. Vectors are now stored in Neon pgvector. The API route queries in 45ms. Embedding costs dropped from $340/month to $0.40/month (incremental re-index only).
Key lesson
Never index in the request path — do it offline in CI
MemoryVectorStore is for demos only — use pgvector/Pinecone in production
Production debug guide: common failures with AI SDK + pgvector (5 entries)
Symptom 01: Assistant gives generic answers
Fix: Query pgvector directly: SELECT content FROM code_embeddings ORDER BY embedding <=> $1 LIMIT 5. If empty, CI indexing failed.
Symptom 02: First query after deploy is slow
Fix: You're indexing in-request. Check API route for indexRepository() calls. Move to CI.
Symptom 03: Responses truncated
Fix: Calculate total tokens with tiktoken. Cap context at 20k tokens. gpt-4o needs room for reasoning.
Symptom 04: Rate limit bypassed
Fix: You're using in-memory Map. Switch to Upstash Redis, because serverless has multiple instances.
Symptom 05: Finds getUser but not getUserById
Fix: Add hybrid search. Pure vector misses exact matches. Use pgvector + tsvector with alpha 0.7.
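The weighted combination behind "alpha 0.7" can be sketched in a few lines. This is an illustration under assumptions, not the article's query: it assumes alpha weights the vector side and (1 - alpha) the keyword side, that both scores are already normalized to the 0..1 range, and the keyword scores below (1 for an exact symbol match, 0 for none) are made up. `Candidate`, `hybridScore`, and `rankHybrid` are hypothetical names.

```typescript
interface Candidate {
  symbol: string;
  vectorSimilarity: number; // cosine similarity, 0..1
  keywordScore: number; // normalized keyword score, 0..1
}

// Assumption: alpha weights the vector side, (1 - alpha) the keyword side.
function hybridScore(c: Candidate, alpha = 0.7): number {
  return alpha * c.vectorSimilarity + (1 - alpha) * c.keywordScore;
}

// Sorts a copy of the candidates by descending hybrid score.
function rankHybrid(candidates: Candidate[], alpha = 0.7): Candidate[] {
  return [...candidates].sort((a, b) => hybridScore(b, alpha) - hybridScore(a, alpha));
}

// Query: "getUserById". Vector search slightly prefers getUser.
const candidates: Candidate[] = [
  { symbol: 'getUser', vectorSimilarity: 0.81, keywordScore: 0 },
  { symbol: 'getUserById', vectorSimilarity: 0.79, keywordScore: 1 },
];

console.log(rankHybrid(candidates, 1)[0].symbol); // getUser (vector only)
console.log(rankHybrid(candidates)[0].symbol); // getUserById (hybrid, alpha 0.7)
```

In a real deployment the same weighted sum would be computed in SQL so the database does the ranking; the in-memory version here only shows how the exact-match signal overturns a narrow vector lead.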
2026 Vector Store Comparison

| Feature | Neon pgvector | Pinecone Serverless | MemoryVectorStore |
|---|---|---|---|
| Persistence | Yes (Postgres) | Yes | No (lost on restart) |
| Cost for 1M vectors | $5-10/mo | $70/mo | $0 (but OOMs) |
| Hybrid search | Native (tsvector) | Paid add-on | No |
| Serverless cold start | 45ms | 30ms | 90,000ms (re-index) |
| Best for | Production, self-hosted | Enterprise, managed | Demos only |
Key takeaways
1. 2026 stack: tree-sitter + pgvector + AI SDK + Upstash. Never MemoryVectorStore.
2. Index offline in CI, query online in <50ms. Never embed in the request path.
3. Hybrid search (vector + BM25) is required for code. Pure vector misses exact symbols.
4. Relevance threshold 0.78 + context cap 20k tokens prevents hallucinations.
5. Temperature 0.1, rate limit with Redis, exclude secrets via .gitignore patterns.
Common mistakes to avoid (6 patterns)
1. Indexing in API route with MemoryVectorStore
Symptom: 90s cold starts, $300+ embedding bills
Fix: Index in GitHub Action, store in pgvector, query only in API
2. Pure vector search, no hybrid
Symptom: Can't find exact function names
Fix: Use pgvector + tsvector hybrid with alpha 0.7
3. No relevance threshold
Symptom: Confident hallucinations on out-of-scope questions
Fix: Filter by cosine >0.78, return 'I don't see this' if below
4. In-memory rate limiting
Symptom: Rate limit bypassed across instances
Fix: Use Upstash Redis with @upstash/ratelimit
5. Regex chunking
Symptom: Splits functions mid-body, 60% recall
Fix: Use tree-sitter to extract complete AST nodes
6. Indexing secrets
Symptom: API keys in vector store
Fix: Exclude .env, .pem, secrets/** via .gitignore patterns in CI
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 03 · Senior
Why can't you use MemoryVectorStore in production serverless?
Answer:
It stores vectors in process RAM, which is lost on every cold start. Serverless platforms spin up multiple instances, and each one would re-embed the entire repo on its first request. For a 2k-file repo, that's ~$0.17 and 90 seconds per instance. Use pgvector or Pinecone to persist vectors externally, so API routes only query (45ms) and never index.
Q02 of 03 · Senior
Explain hybrid search and why it's critical for code RAG.
Answer:
Pure vector search finds semantically similar code but misses exact symbol matches (getUserById vs getUser). Pure keyword search finds exact matches but misses semantic variations. Hybrid search combines a BM25-style keyword score with vector cosine similarity as a weighted sum (typically 0.7/0.3). In Postgres, use tsvector for the keyword side and pgvector's <=> operator for the vector side. Note that <=> returns cosine distance, so use 1 - distance as the similarity before weighting, then order by the weighted sum. This achieves 95%+ recall for code vs 70% for vector alone.
Q03 of 03 · Senior
How do you prevent a RAG assistant from indexing secrets?
Answer:
Three layers: 1) In the CI indexing script, respect .gitignore plus an explicit denylist ['**/.env', '**/*.pem', '**/secrets/**']. 2) Run a git-secrets scan before embedding. 3) In Postgres, add a row-level security policy so rows for files matching those patterns are never returned. Also hash content and periodically audit the stored chunk text for high-entropy strings.
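The high-entropy audit mentioned in that answer can be approximated with Shannon entropy per token. This is a rough sketch, not a replacement for a dedicated secrets scanner: `shannonEntropy` and `suspiciousTokens` are hypothetical helpers, and the thresholds (20 characters, 4 bits per character) are illustrative assumptions that will need tuning.

```typescript
// Shannon entropy in bits per character.
function shannonEntropy(text: string): number {
  const chars = Array.from(text);
  if (chars.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const ch of chars) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / chars.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Flags tokens that are both long and high-entropy. Thresholds are assumptions.
function suspiciousTokens(content: string, minLength = 20, minEntropy = 4): string[] {
  return content
    .split(/[\s'"`=:,;()]+/)
    .filter((t) => t.length >= minLength && shannonEntropy(t) >= minEntropy);
}

console.log(suspiciousTokens('const name = "getUserById";')); // []
console.log(suspiciousTokens('KEY=0123456789abcdefghijklmnopqrstuv'));
// [ '0123456789abcdefghijklmnopqrstuv' ]
```

Ordinary identifiers are short and repetitive, so they pass; long random-looking strings are flagged for manual review. Expect false positives on hashes and minified assets.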
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
Can I use local models instead of OpenAI?
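Yes. Replace openai('gpt-4o') with ollama('codellama:13b'), using a community Ollama provider for the AI SDK. Use nomic-embed-text for embeddings, and re-index after switching, since vectors from different embedding models are not comparable. Expect roughly 25% lower quality on complex reasoning, in exchange for no API costs and code that stays on-premise.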
02
How much does this cost in production?
Neon pgvector: $5/mo. Upstash Redis: free tier. Initial index (10k files): ~$0.40 one-time. 100 daily queries with gpt-4o: ~$8-12/day. With caching and gpt-4o-mini for simple queries: ~$3/day. Compare to $340/month for re-indexing every cold start.
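As a sanity check on the daily figure, the input side of the bill can be estimated from numbers already given in this guide: $2.50 per 1M input tokens and a 20k-token context cap. The helper below is a hypothetical back-of-the-envelope calculation; it treats every query as hitting the full cap and ignores output tokens and chat history, which are billed on top.

```typescript
// Price from this guide: gpt-4o input at $2.50 per 1M tokens.
const INPUT_USD_PER_MILLION = 2.5;

// Upper-bound input cost per day when every query uses the full context budget.
function dailyInputCostUsd(queriesPerDay: number, contextTokensPerQuery: number): number {
  return (queriesPerDay * contextTokensPerQuery * INPUT_USD_PER_MILLION) / 1_000_000;
}

console.log(dailyInputCostUsd(100, 20_000)); // 5
```

So retrieved context alone accounts for up to about $5/day at 100 queries, which makes lowering the context cap or routing simple questions to a cheaper model the most direct cost levers.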
03
Why tree-sitter over LangChain splitter?
LangChain splits on characters, breaking functions mid-body (60% recall). Tree-sitter parses AST and extracts complete functions/classes (98% recall). For TypeScript, it correctly handles generics, decorators, and arrow functions that regex misses.
04
Do I need LangChain in 2026?
Only for indexing. For querying, AI SDK + pgvector is simpler and faster. LangChain adds 400kb bundle size and abstraction overhead. Use it in CI scripts, not in the hot path.