FastAPI Streaming & File Responses: Stop Loading Everything Into Memory
FastAPI streaming responses and file responses explained with production patterns.
20+ years shipping production Python across data and backend systems. Lessons pulled from things that broke in production.
- ✓ Basic FastAPI app setup
- ✓ Understanding of async/await in Python
- ✓ Familiarity with HTTP responses
Use StreamingResponse when you need to stream dynamically generated or large data chunk by chunk. Use FileResponse to serve static files directly from disk without buffering in Python. Both prevent memory exhaustion under load.
Imagine you're filling a swimming pool with a bucket. Normal Response fills the bucket completely, walks to the pool, dumps it, and repeats. StreamingResponse is a hose — water flows continuously as you turn the tap. FileResponse is a pipe connected directly to the reservoir — the water never touches your bucket at all.
Most FastAPI tutorials show you returning a list of dictionaries or a Pydantic model. That works fine for a JSON API returning 100 users. But the moment you try to return a 2GB CSV export or stream a live video feed, your server falls over. Memory spikes, workers crash, and you're debugging at 2 AM why your container got OOM-killed. The problem isn't FastAPI — it's that you're buffering the entire response in memory before sending a single byte. StreamingResponse and FileResponse are the tools that fix this. By the end of this article, you'll know exactly when to use each, how to avoid the common pitfalls that burn production systems, and how to serve large data without breaking a sweat.
Why You Shouldn't Return a List When You Can Stream
The default FastAPI response pattern (return a list or dict, let it serialize to JSON) works by building the entire response body in memory before sending. For small payloads, this is fine. For large ones, it's a disaster. Every concurrent request holds its own full copy of the body, so memory grows linearly with concurrency: at 100 concurrent requests for a 500MB CSV, you're looking at 50GB of RAM. StreamingResponse lets you send data as it's produced. The client sees the first bytes almost instantly, and your server memory stays roughly flat regardless of response size. The trade-off is that you lose the automatic Content-Length header and error handling becomes trickier: the status code and headers are already sent by the time the generator runs, so if the stream fails midway, the client gets a truncated response. But for large datasets, video, or real-time data, there's no real alternative.
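As a rough sketch of the streaming pattern described above, the generator below produces CSV text one row at a time instead of building the whole file first. The name `iter_csv`, the sample columns, and the `fetch_rows()` helper in the closing comment are assumptions made for illustration; none of them come from a library.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


def iter_csv(rows: Iterable[Sequence[object]], header: Sequence[str]) -> Iterator[str]:
    """Yield CSV text one line at a time; only one row is held in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    def drain() -> str:
        # Hand back what the writer produced, then reset the buffer for reuse.
        text = buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)
        return text

    writer.writerow(header)
    yield drain()
    for row in rows:
        writer.writerow(row)
        yield drain()


# Assumed wiring inside a FastAPI route (fetch_rows() is a hypothetical
# function that yields rows lazily, for example from a database cursor):
#
#   return StreamingResponse(
#       iter_csv(fetch_rows(), ["id", "name"]),
#       media_type="text/csv",
#   )
```

Because `rows` can be any lazy iterable, peak memory depends on the size of one row rather than the size of the export.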
FileResponse: The OS Does the Heavy Lifting
When you need to serve a file that already exists on disk, FileResponse is your friend. It never loads the whole file into a bytes object. Starlette, which FastAPI builds on, reads the file in small fixed-size chunks on a worker thread and hands each chunk to the server, so memory stays flat however large the file is. Some ASGI servers can go further and send the file by path, which lets the operating system's sendfile syscall move data from the file descriptor to the socket without it passing through Python at all; whether you get that zero-copy path depends on your Starlette and server versions, so verify it rather than assume it. Either way, for large files (videos, ISOs, logs) this is far more efficient than reading the file into memory and returning it. Recent Starlette releases also answer HTTP Range requests from FileResponse, which enables pause/resume downloads and video seeking; older releases ignore the Range header and return the full file, so check the version you run. The catch: it only works with actual files on disk, not dynamically generated content. And a slow filesystem (NFS, network mounts) still ties up a worker thread and stalls the download for as long as each read takes.
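To make the chunking and range ideas concrete, here is a minimal stdlib sketch of how a single `Range: bytes=START-END` header maps to byte offsets and how that slice can be read without loading the file. This is not Starlette's implementation; `parse_single_range`, `content_range`, `iter_file_range`, and the 64 KiB chunk size are all assumptions chosen for the example.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 64 * 1024  # assumed chunk size for this sketch


def parse_single_range(header: str, file_size: int) -> Tuple[int, int]:
    """Turn 'bytes=START-END' into inclusive offsets clamped to the file size."""
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError("only a single bytes range is handled in this sketch")
    start_text, _, end_text = spec.strip().partition("-")
    if start_text == "":
        # Suffix form 'bytes=-N' means the last N bytes of the file.
        start = max(file_size - int(end_text), 0)
        end = file_size - 1
    else:
        start = int(start_text)
        # An open-ended range ('bytes=N-') runs to the last byte.
        end = int(end_text) if end_text else file_size - 1
        end = min(end, file_size - 1)
    if start > end or start >= file_size:
        raise ValueError("range not satisfiable")
    return start, end


def content_range(start: int, end: int, file_size: int) -> str:
    """Value of the Content-Range header for a 206 Partial Content reply."""
    return f"bytes {start}-{end}/{file_size}"


def iter_file_range(path: str, start: int, end: int,
                    chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the inclusive byte range [start, end] in bounded chunks."""
    remaining = end - start + 1
    with open(path, "rb") as handle:
        handle.seek(start)
        while remaining > 0:
            chunk = handle.read(min(chunk_size, remaining))
            if not chunk:
                break  # file is shorter than expected
            remaining -= len(chunk)
            yield chunk
```

In practice you would let FileResponse or your reverse proxy do this work; the sketch only illustrates why a ranged download never needs more than one chunk in memory.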
To test a range request: curl -H "Range: bytes=0-1023" http://localhost:8000/download/video.mp4
Streaming JSON: Don't Wait for the Whole Array
Returning a large JSON array as a single Response forces the client to wait for the entire serialization to complete. For paginated APIs, this is fine. But for real-time dashboards or log streams, you want to send JSON objects as they become available. The trick is to use StreamingResponse with a generator that yields JSON-encoded objects separated by newlines (JSON Lines format) or as one continuous array. With JSON Lines, the client can parse each line as it arrives. This is how Twitter's streaming API works. The downside: standard JSON parsers expect a complete document, so the client has to cooperate. For JSON Lines, that means parsing line by line (json.loads per line in Python, JSON.parse per line in JavaScript); for a single streamed array, it means an incremental parser such as ijson for Python.
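A small sketch of the JSON Lines approach, covering both ends of the connection. The server-side generator emits one JSON document per line; the client-side helper reassembles lines from network chunks that may split anywhere. Both function names are assumptions for illustration, and the media type in the closing comment is a common convention rather than a registered standard.

```python
import json
from typing import Any, Iterable, Iterator


def iter_json_lines(records: Iterable[dict]) -> Iterator[str]:
    """Server side: emit one JSON document per line as records arrive."""
    for record in records:
        # json.dumps escapes newlines inside strings, so each record is one line.
        yield json.dumps(record) + "\n"


def parse_json_lines(chunks: Iterable[str]) -> Iterator[Any]:
    """Client side: rebuild complete lines from arbitrary chunks and parse each."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        while "\n" in pending:
            line, pending = pending.split("\n", 1)
            if line:
                yield json.loads(line)
    if pending.strip():
        # A final record that arrived without a trailing newline.
        yield json.loads(pending)


# Assumed wiring inside a FastAPI route (records would be your own iterable):
#
#   return StreamingResponse(
#       iter_json_lines(records),
#       media_type="application/x-ndjson",
#   )
```

The client helper matters because chunk boundaries on the wire rarely line up with record boundaries.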
When Streaming Breaks: The Gotchas You'll Hit
StreamingResponse isn't magic. It has sharp edges. First, if your generator raises an exception mid-stream, the client gets a truncated response with no error indication. Always wrap generator logic in try/except and log errors. Second, without a Content-Length, proxies like Nginx or Cloudflare may buffer the entire response before forwarding it to the client, negating the memory benefit. Set the X-Accel-Buffering: no header for Nginx, or disable buffering in your reverse proxy. Third, streaming is incompatible with some middleware that reads the entire body (e.g., compression middleware). If you need compression, compress chunks yourself. Finally, async generators must not perform blocking I/O — use async libraries or run blocking calls in a thread pool.
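Two of these gotchas can be sketched together: offloading a blocking call to a worker thread, and catching a mid-stream failure so it is logged and signalled instead of silently truncating the response. Everything here is illustrative: `slow_lookup` is a hypothetical stand-in for a blocking backend call, and the trailing `{"error": ...}` line is an assumed convention that your clients would need to agree on. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import json
import logging
import time
from typing import AsyncIterator

logger = logging.getLogger("streaming")


def slow_lookup(n: int) -> dict:
    """Hypothetical blocking call; fails on the third item to show the error path."""
    time.sleep(0.01)
    if n == 3:
        raise RuntimeError("backend went away")
    return {"n": n}


async def guarded_stream(count: int) -> AsyncIterator[str]:
    """Yield JSON lines, logging and flagging any failure part-way through."""
    try:
        for n in range(1, count + 1):
            # Run the blocking call in a worker thread so the event loop stays free.
            record = await asyncio.to_thread(slow_lookup, n)
            yield json.dumps(record) + "\n"
    except Exception:
        # Headers are already sent, so the status code cannot change now.
        logger.exception("stream failed mid-response")
        # Assumed convention: a final marker line lets clients detect truncation.
        yield json.dumps({"error": "stream aborted"}) + "\n"


# Assumed wiring inside a FastAPI route, including the Nginx buffering header:
#
#   return StreamingResponse(
#       guarded_stream(100),
#       media_type="application/x-ndjson",
#       headers={"X-Accel-Buffering": "no"},
#   )
```

Catching `Exception` rather than `BaseException` is deliberate: cancellation from a client disconnect should still propagate and stop the generator.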
If your async generator calls time.sleep() or requests.get(), the entire event loop blocks. Use asyncio.sleep() and httpx.AsyncClient instead. For blocking calls you can't replace, use asyncio.to_thread(); for CPU-bound work, use a process pool. Otherwise, your streaming endpoint becomes a single-threaded bottleneck.
FileResponse vs StreamingResponse: When to Use Which
FileResponse is for static files on disk. It avoids loading the file into memory, can answer range requests, and sets Content-Type from the file name automatically. Use it for serving uploaded files, static assets, or any file that exists before the request. StreamingResponse is for dynamically generated content: database query results, real-time data, or data that doesn't exist as a file. It gives you full control over the output format and timing. The rule of thumb: if the data is already on disk, use FileResponse. If you're generating it on the fly, use StreamingResponse. Never read a file into memory just to return it as a Response; that's the classic rookie mistake that costs you memory.
Production Patterns: Rate Limiting and Backpressure
Streaming responses can overwhelm clients or downstream services if you produce data faster than they can consume it. In a production system, you need backpressure. One pattern is to use an asyncio.Queue with a max size: if the queue is full, the producer waits. Another is to yield chunks at a controlled rate using asyncio.sleep(). For file downloads, the OS's TCP stack provides natural backpressure: the send stalls when the socket buffer is full. But for generated streams, you must implement it yourself. I've seen a logging pipeline crash a Kafka cluster because the producer streamed events faster than the broker could ingest them. The fix was to add a sliding window that limited the number of outstanding unacknowledged events.
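The bounded-queue pattern can be sketched as follows. The producer task fills an `asyncio.Queue` with a fixed `maxsize`, and the async generator drains it, so the producer can never run more than `max_pending` items ahead of the consumer. The names `bounded_stream`, `_produce`, and `max_pending` are assumptions for this example.

```python
import asyncio
import contextlib
from typing import AsyncIterator, Iterable

_DONE = object()  # sentinel marking the end of the stream


async def _produce(queue: asyncio.Queue, items: Iterable[object]) -> None:
    for item in items:
        # put() suspends here whenever the queue already holds maxsize items.
        await queue.put(item)
    await queue.put(_DONE)


async def bounded_stream(items: Iterable[object],
                         max_pending: int = 2) -> AsyncIterator[object]:
    """Yield items while keeping at most max_pending of them buffered."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
    producer = asyncio.create_task(_produce(queue, items))
    try:
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            yield item
    finally:
        # Stop the producer if the consumer goes away before the end.
        producer.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await producer
```

A real producer would usually be another coroutine reading from a database or message broker; the queue bound is what turns a slow client into a slow producer instead of a growing buffer.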
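Use await queue.put() so the producer waits when the queue is full; put_nowait() does not wait and raises an asyncio.QueueFull exception instead.
When Not to Stream: The Case for Simple Responses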
Streaming is not free. It adds complexity: error handling, missing Content-Length, proxy buffering issues, and client-side parsing requirements. If your response fits comfortably in memory (say, under 10MB), just return a normal Response. The overhead of setting up a generator and managing the stream isn't worth it. Also, if your API consumers expect a complete JSON document (most do), streaming JSON requires them to use a streaming parser. For internal APIs where both sides are under your control, streaming is fine. For public APIs, stick with pagination or standard responses unless you have a clear need for low latency or huge payloads.
The 4GB Container That Kept Dying
- Never build the entire response body in memory.
- If you can generate it iteratively, stream it.
- Replace return Response(content=open('file','rb').read()) with return FileResponse(path='file').
- Monitor memory with docker stats or ps aux.
- Replace time.sleep() with asyncio.sleep().
- Use asyncio.to_thread() for blocking calls and a process pool for CPU-bound work.
- Profile with cProfile or py-spy.
- Check the size of a download: `curl -s -o /dev/null -w '%{size_download}' http://localhost:8000/export`
- Watch container memory: `docker stats <container>`
- Replace return Response(content=data) with return StreamingResponse(generator()) or return FileResponse(path).

| File | Command / Code | Purpose |
|---|---|---|
| csv_streaming.py | from fastapi import FastAPI | Why You Shouldn't Return a List When You Can Stream |
| file_response_example.py | from fastapi import FastAPI | FileResponse |
| json_streaming.py | from fastapi import FastAPI | Streaming JSON |
| streaming_with_error_handling.py | from fastapi import FastAPI | When Streaming Breaks |
| backpressure_example.py | from fastapi import FastAPI | Production Patterns |
Key takeaways
Interview Questions on This Topic
How does FileResponse handle HTTP Range requests internally, and what happens if the file is modified between the request and the response?