How to Build Your Own AI Coding Assistant with Next.js 16, OpenAI & RAG (2026 Stack)
- Build a custom AI coding assistant using Next.js 16 API routes, Vercel AI SDK, OpenAI gpt-4o, and RAG
- RAG grounds the LLM in YOUR codebase via vector search – no more hallucinated APIs
- 2026 stack: tree-sitter for AST chunking, pgvector/Neon for persistent vectors, Upstash Redis for rate limits, AI SDK for streaming
- Index offline in CI (GitHub Action), query online in <100ms – never embed in the request path
- Hybrid search (BM25 + vector) finds exact symbols AND semantic matches
- Guardrails: similarity >0.78, context budget 20k tokens, temperature 0.1, rate limit 100/day
- Biggest mistake: using MemoryVectorStore in serverless – it re-indexes on every cold start
Most AI coding assistants hallucinate because they lack context about your specific codebase. The fix is RAG (Retrieval-Augmented Generation), which grounds the LLM in your own source code.
This 2026 guide builds a production-grade assistant. You'll use tree-sitter for AST-aware chunking, store vectors in pgvector (Neon), rate limit with Upstash Redis, and stream responses with the Vercel AI SDK. Indexing runs in CI on every push – the API route only queries, never embeds.
By the end you'll have an assistant that answers questions about your codebase in <200ms with token-by-token streaming.
Architecture: 2026 RAG Pipeline for Code
Indexing (CI, offline): GitHub Action walks the repo → tree-sitter extracts functions/classes → OpenAI text-embedding-3-large generates vectors → stored in Neon pgvector with metadata (file, symbol, imports).
Retrieval (API, online): user query → hybrid search (vector + BM25) → top 12 chunks → cap at 20k tokens using tiktoken → inject into the system prompt.
Generation (streaming): Vercel AI SDK calls gpt-4o via streamText() → tokens stream over Server-Sent Events → the client renders them in real time.
Critical: same embedding model for index and query. Use tree-sitter, not regex, for chunking.
```typescript
// scripts/index-repo.ts — runs in CI (GitHub Action), never in the request path
import Parser from 'tree-sitter';
import TreeSitterTS from 'tree-sitter-typescript';
import { OpenAIEmbeddings } from '@langchain/openai';
import { PGVectorStore } from '@langchain/community/vectorstores/pgvector';
import fs from 'fs/promises';
import crypto from 'crypto';
import { glob } from 'glob';

const IGNORED = ['**/node_modules/**', '**/.git/**', '**/.env*', '**/*.pem', '**/dist/**'];

const parser = new Parser();
parser.setLanguage(TreeSitterTS.typescript);

const embeddings = new OpenAIEmbeddings({ model: 'text-embedding-3-large', batchSize: 100 });
const store = await PGVectorStore.initialize(embeddings, {
  postgresConnectionOptions: { connectionString: process.env.DATABASE_URL },
  tableName: 'code_embeddings',
});

export async function indexRepo() {
  const files = await glob('**/*.{ts,tsx}', { ignore: IGNORED });
  for (const path of files) {
    const source = await fs.readFile(path, 'utf-8');
    const tree = parser.parse(source);
    // Extract complete constructs, not arbitrary character windows
    const nodes = tree.rootNode.descendantsOfType([
      'function_declaration',
      'class_declaration',
      'method_definition',
    ]);
    // Prepend the file's imports so each chunk carries its dependencies
    const imports = tree.rootNode
      .descendantsOfType('import_statement')
      .map((n) => source.slice(n.startIndex, n.endIndex))
      .join('\n');
    const docs = nodes.map((node) => {
      const body = source.slice(node.startIndex, node.endIndex);
      const content = `${imports}\n\n${body}`;
      // Content hash doubles as a stable ID for deduplication
      const hash = crypto.createHash('sha256').update(content).digest('hex');
      return {
        pageContent: content,
        metadata: { source: path, symbol: node.childForFieldName('name')?.text, hash },
      };
    });
    await store.addDocuments(docs, { ids: docs.map((d) => d.metadata.hash) });
  }
}
```
- CI = librarian cataloging books (runs on push)
- pgvector = permanent library (persists across deploys)
- Upstash = bouncer at door (rate limits)
- AI SDK = courier delivering pages as they're written (streaming)
Next.js 16 Streaming with Vercel AI SDK
Don't hand-roll a ReadableStream. Use Vercel AI SDK v5 – it handles SSE, backpressure, the edge runtime, and token counting. Time to first token drops from ~8s (waiting on the full completion) to ~80ms.
Use gpt-4o for generation (128k context, $2.50/1M input) with temperature 0.1 for near-deterministic code output.
```typescript
// app/api/chat/route.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { PGVectorStore } from '@langchain/community/vectorstores/pgvector';
import { OpenAIEmbeddings } from '@langchain/openai';
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
import { encoding_for_model } from 'tiktoken';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(100, '1 d'),
});

// Must be the same model used at index time
const embeddings = new OpenAIEmbeddings({ model: 'text-embedding-3-large' });
const store = await PGVectorStore.initialize(embeddings, {
  postgresConnectionOptions: { connectionString: process.env.DATABASE_URL! },
  tableName: 'code_embeddings',
});

const enc = encoding_for_model('gpt-4o');

function capContext(chunks: string[], maxTokens = 20000) {
  let total = 0;
  const kept: string[] = [];
  for (const c of chunks) {
    const t = enc.encode(c).length;
    if (total + t > maxTokens) break;
    kept.push(c);
    total += t;
  }
  return kept;
}

export async function POST(req: Request) {
  const { messages } = await req.json();
  const ip = req.headers.get('x-forwarded-for') ?? 'anon';
  const { success } = await ratelimit.limit(ip);
  if (!success) return new Response('Rate limited', { status: 429 });

  const query = messages.at(-1).content;
  const results = await store.similaritySearchWithScore(query, 12);
  // Check score direction for your distance strategy: with raw cosine
  // distance, LOWER means more similar. This filter assumes a similarity score.
  const filtered = results.filter(([, score]) => score > 0.78);
  if (filtered.length === 0) return new Response("I don't see this in the codebase.");

  const context = capContext(
    filtered.map(([d]) => `[${d.metadata.source}]\n${d.pageContent}`)
  ).join('\n\n---\n\n');

  const result = streamText({
    model: openai('gpt-4o'),
    system: `You are a coding assistant. Answer using ONLY this code. Cite files.\n\n${context}`,
    messages,
    temperature: 0.1,
  });
  return result.toTextStreamResponse();
}
```
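On the client, the streamed response can be consumed with plain fetch and a stream reader. A minimal sketch (the `/api/chat` path and `ui.append` in the usage comment are assumptions, not part of the route above):

```typescript
// Minimal client-side consumer for a streamed text response.
// Works with any Response whose body is a text ReadableStream.
export async function readTextStream(
  body: ReadableStream<Uint8Array>,
  onToken: (chunk: string) => void
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let full = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = decoder.decode(value, { stream: true });
    full += chunk;
    onToken(chunk); // e.g. append to the chat UI as tokens arrive
  }
  return full;
}

// Usage (hypothetical route path and UI helper):
// const res = await fetch('/api/chat', { method: 'POST', body: JSON.stringify({ messages }) });
// const answer = await readTextStream(res.body!, (t) => ui.append(t));
```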
Tree-Sitter AST Chunking for Code
Regex chunking breaks on arrow functions, generics, and decorators. Tree-sitter parses code into an AST and extracts complete functions, classes, and methods, so each chunk is semantically complete.
Prepend the file's import block to every chunk so the LLM knows the dependencies. Store the symbol name, file path, and content hash in metadata for deduplication.
```typescript
import Parser from 'tree-sitter';
import TreeSitterTS from 'tree-sitter-typescript';

const parser = new Parser();
parser.setLanguage(TreeSitterTS.typescript);

export function chunkCode(source: string, filePath: string) {
  const tree = parser.parse(source);
  // Every chunk carries the file's imports so the LLM sees dependencies
  const imports = tree.rootNode
    .descendantsOfType('import_statement')
    .map((n) => source.slice(n.startIndex, n.endIndex))
    .join('\n');
  const nodes = tree.rootNode.descendantsOfType([
    'function_declaration',
    'class_declaration',
    'method_definition',
    'arrow_function',
  ]);
  return nodes
    .map((node) => ({
      content: `${imports}\n\n${source.slice(node.startIndex, node.endIndex)}`,
      metadata: {
        source: filePath,
        // Arrow functions have no name field; fall back to 'anonymous'
        symbol: node.childForFieldName('name')?.text ?? 'anonymous',
      },
    }))
    .filter((c) => c.content.length > 50); // drop trivial fragments
}
```
Regex-based chunkers choke on constructs like `export const foo = async <T>() => {}`. Tree-sitter handles all syntax, comments, and nested functions. It's the difference between 60% recall and 98% recall.
Production Guardrails: Hybrid Search, Costs, Security
Pure vector search fails on exact symbol names. Hybrid search combines BM25 (keyword) and vector (semantic) scores, weighted alpha = 0.7 toward the vector score.
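To make the alpha weighting concrete, here is a minimal in-process sketch of score fusion. The min-max normalization and the `Scored`/`hybridRank` names are assumptions for illustration; in production the SQL query in this section does this work inside Postgres:

```typescript
// Hybrid ranking sketch: min-max normalize each score list, then
// combine with alpha weighting (alpha = vector weight, here 0.7).
interface Scored { id: string; vectorScore: number; bm25Score: number }

function normalize(values: number[]): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min || 1; // avoid division by zero on uniform scores
  return values.map((v) => (v - min) / range);
}

export function hybridRank(results: Scored[], alpha = 0.7): Scored[] {
  const v = normalize(results.map((r) => r.vectorScore));
  const k = normalize(results.map((r) => r.bm25Score));
  return results
    .map((r, i) => ({ ...r, combined: alpha * v[i] + (1 - alpha) * k[i] }))
    .sort((a, b) => b.combined - a.combined);
}
```

A chunk that is an exact keyword hit but a weak semantic match can still surface, which is exactly what pure vector search loses.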
Security: exclude .env, keys, secrets from indexing. Use .gitignore-style patterns.
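A minimal sketch of that .gitignore-style exclusion (the `EXCLUDED` list and hand-rolled `globToRegExp` are illustrative; a real indexer should use a proven matcher such as the `ignore` npm package):

```typescript
// Sketch: block secret-bearing paths from ever reaching the indexer.
// Handles only the * and ** wildcards, which covers the common cases.
const EXCLUDED = ['**/.env*', '**/*.pem', '**/secrets/**', '**/node_modules/**'];

function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\*\*\//g, '(?:.*/)?')       // '**/' matches any directory depth
    .replace(/\*\*/g, '.*')               // bare '**' matches anything
    .replace(/\*/g, '[^/]*');             // '*' matches within one path segment
  return new RegExp(`^${escaped}$`);
}

export function isIndexable(path: string): boolean {
  return !EXCLUDED.some((p) => globToRegExp(p).test(path));
}
```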
Cost control: cache embeddings in CI, use Upstash for rate limits, monitor with Helicone.
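Caching embeddings by content hash is what makes CI re-runs cheap: unchanged chunks are never re-embedded. A minimal in-memory sketch (`cachedEmbedder` and `embedFn` are hypothetical names; a real CI cache would persist to disk or the database):

```typescript
import { createHash } from 'crypto';

// Sketch of the CI embedding cache: re-embed a chunk only when its
// content hash changes. `embedFn` stands in for the real OpenAI call.
type EmbedFn = (text: string) => Promise<number[]>;

export function cachedEmbedder(embedFn: EmbedFn) {
  const cache = new Map<string, number[]>(); // persist this in real CI
  let apiCalls = 0;
  return {
    async embed(text: string): Promise<number[]> {
      const key = createHash('sha256').update(text).digest('hex');
      const hit = cache.get(key);
      if (hit) return hit; // unchanged chunk: zero API cost
      apiCalls++;
      const vector = await embedFn(text);
      cache.set(key, vector);
      return vector;
    },
    get apiCalls() { return apiCalls; },
  };
}
```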
```sql
-- Neon pgvector hybrid search (vector + full-text, 0.7/0.3 weighting).
-- Note: Postgres doesn't allow SELECT aliases inside ORDER BY expressions,
-- so the score expressions are repeated.
SELECT
  content,
  metadata,
  1 - (embedding <=> $1) AS vector_score,
  ts_rank(to_tsvector('english', content), plainto_tsquery($2)) AS keyword_score
FROM code_embeddings
WHERE 1 - (embedding <=> $1) > 0.78
ORDER BY
  0.7 * (1 - (embedding <=> $1))
  + 0.3 * ts_rank(to_tsvector('english', content), plainto_tsquery($2)) DESC
LIMIT 12;
```
| Feature | Neon pgvector | Pinecone Serverless | MemoryVectorStore |
|---|---|---|---|
| Persistence | Yes (Postgres) | Yes | No β lost on restart |
| Cost for 1M vectors | $5-10/mo | $70/mo | $0 (but OOMs) |
| Hybrid search | Native (tsvector) | Paid add-on | No |
| Serverless cold start | 45ms | 30ms | 90,000ms (re-index) |
| Best for | Production, self-hosted | Enterprise, managed | Demos only |
Key Takeaways
- 2026 stack: tree-sitter + pgvector + AI SDK + Upstash – never MemoryVectorStore
- Index offline in CI, query online in <100ms – never embed in the request path
- Hybrid search (vector + BM25) is required for code – pure vector misses exact symbols
- Relevance threshold 0.78 plus a 20k-token context cap prevents hallucinations
- Temperature 0.1, rate limit with Redis, exclude secrets via .gitignore-style patterns
Interview Questions on This Topic
- Q: Why can't you use MemoryVectorStore in production serverless? (Senior)
- Q: Explain hybrid search and why it's critical for code RAG. (Mid-level)
- Q: How do you prevent a RAG assistant from indexing secrets? (Senior)
Frequently Asked Questions
Can I use local models instead of OpenAI?
Yes. Swap openai('gpt-4o') for a local model like codellama:13b via a community AI SDK Ollama provider, and use nomic-embed-text for embeddings. Remember to re-index, since index and query must share the same embedding model. Quality drops ~25% for complex reasoning, but it eliminates API costs and keeps code on-premise.
How much does this cost in production?
Neon pgvector: $5/mo. Upstash Redis: free tier. Initial index (10k files): ~$0.40 one-time. 100 daily queries with gpt-4o: ~$8-12/day. With caching and gpt-4o-mini for simple queries: ~$3/day. Compare to $340/month for re-indexing every cold start.
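The daily figure is easy to sanity-check with back-of-envelope math. A sketch using the article's $2.50/1M input price plus an assumed $10/1M output price and ~1k output tokens per answer:

```typescript
// Back-of-envelope daily cost model. Prices and token counts are
// assumptions: gpt-4o at $2.50/1M input and $10/1M output.
const PRICE_PER_M_INPUT = 2.5;
const PRICE_PER_M_OUTPUT = 10;

function dailyCost(queries: number, inputTokens: number, outputTokens: number): number {
  const perQuery =
    (inputTokens / 1_000_000) * PRICE_PER_M_INPUT +
    (outputTokens / 1_000_000) * PRICE_PER_M_OUTPUT;
  return queries * perQuery;
}

// 100 queries/day at the full 20k-token context budget, ~1k tokens out
const cost = dailyCost(100, 20_000, 1_000); // ≈ $6/day
```

That lands in the same ballpark as the $8-12/day figure once retries and longer answers are included.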
Why tree-sitter over LangChain splitter?
LangChain splits on characters, breaking functions mid-body (60% recall). Tree-sitter parses AST and extracts complete functions/classes (98% recall). For TypeScript, it correctly handles generics, decorators, and arrow functions that regex misses.
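The recall gap is easy to demonstrate: a fixed-size character splitter cuts functions mid-body, so no single chunk contains a complete unit. `splitByChars` below is an illustrative stand-in for any character-based splitter, not LangChain's actual implementation:

```typescript
// Why character-based splitting breaks code: with a chunk size smaller
// than the function, no chunk contains both the signature and the body.
function splitByChars(source: string, chunkSize: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < source.length; i += chunkSize) {
    chunks.push(source.slice(i, i + chunkSize));
  }
  return chunks;
}

const source = `export function add(a: number, b: number) {\n  return a + b;\n}\n`;
const chunks = splitByChars(source, 30);
// The first chunk ends mid-signature; retrieval can never return the whole function.
const complete = chunks.some((c) => c.includes('function add') && c.includes('return a + b'));
```

An AST chunker would emit this function as exactly one chunk.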
Do I need LangChain in 2026?
Only for indexing. For querying, AI SDK + pgvector is simpler and faster. LangChain adds 400kb bundle size and abstraction overhead. Use it in CI scripts, not in the hot path.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.