TensorFlow.js for JavaScript Developers – Machine Learning in the Browser
- TensorFlow.js moves ML inference to the browser — zero server latency, full data privacy, zero inference server costs at scale.
- Always use pre-trained models converted from Python with tensorflowjs_converter. Training complex models in-browser is not practical for production.
- Memory management is the #1 production concern. Use tf.tidy() for sync code and try/finally with .dispose() for async code. Monitor tf.memory().numTensors as a health metric.
- TensorFlow.js lets you run ML inference and training directly in the browser or Node.js — no Python server required
- Use tf.loadLayersModel() to load pre-trained models from HTTP, IndexedDB, or file system
- WebGPU backend provides 2-10x speedup over WebGL for matrix operations on supported browsers
- Models run client-side: zero server cost, zero API latency, full data privacy by default
- Biggest mistake: training complex models in-browser instead of importing pre-trained ones from Python
- Production rule: always quantize models to float16 before browser deployment — halves size with negligible accuracy loss
Production Debug Guide

Common signals when browser-based ML goes wrong and what to check first.

GPU backend not activating

```js
console.log('Current backend:', tf.getBackend());
console.log('WebGL2:', !!document.createElement('canvas').getContext('webgl2'));
console.log('WebGPU:', 'gpu' in navigator);

await tf.setBackend('webgl');
await tf.ready();
console.log('Backend after init:', tf.getBackend());
```

Memory keeps growing with each prediction

```js
console.log('Before:', tf.memory());
const result = tf.tidy(() => model.predict(inputTensor));
const data = await result.data();
result.dispose();
console.log('After:', tf.memory());
// numTensors should be stable between predictions. If it grows, tensors are
// leaking. Check every code path that creates tensors — especially error
// handling branches where dispose() might be skipped.
```

Check tf.memory() before and after each prediction. If numTensors grows, you are leaking. Wrap prediction code in tf.tidy() for synchronous operations. For async code with await, call tensor.dispose() manually on every tensor after extracting data with tensor.data().

Model prediction accuracy is much lower than the Python version

```js
const input = tf.browser.fromPixels(image).toFloat();
console.log('Raw pixel range:', input.min().dataSync()[0], '-', input.max().dataSync()[0]);
input.dispose();
// Compare this output with Python: np.array(image).astype('float32').min(), .max().
// Check: resize dimensions, normalization formula, channel order (RGB in browser,
// potentially BGR in Python/OpenCV), and whether the Python model expects NCHW
// vs NHWC layout.
```

Use tensor.print() and compare values against the Python preprocessing pipeline. Also check for division by zero in any custom preprocessing steps.

predict() throws a shape mismatch error

The input tensor shape does not match what the model expects. Print model.inputs to see expected shapes. Common causes: missing the batch dimension (use expandDims(0)), wrong image dimensions (224x224 vs 256x256), or wrong number of channels (grayscale vs RGB). The error message contains the expected and received shapes — read it carefully.

Most ML tutorials assume a Python backend. But JavaScript developers already ship production applications to billions of browsers and hundreds of millions of Node.js servers. TensorFlow.js bridges that gap.
The core value proposition is simple: move inference to the client. This eliminates round-trip latency to a prediction server, reduces infrastructure costs at scale, and keeps sensitive data — photos, voice, health metrics — on the user's device where it belongs. For real-time applications like gesture detection, live audio classification, or interactive image editing, server-side inference introduces latency that users can feel and that degrades the experience.
The common misconception is that browser-based ML is toy-grade. It is not. Models like MobileNet, PoseNet, and custom-trained classifiers run at 30+ FPS on modern hardware with WebGL. With WebGPU, performance jumps another 2-10x. The constraint is model size and memory, not capability. The key is knowing which models to run client-side and which to keep on the server.
Setting Up TensorFlow.js
Installation depends on your deployment target. For quick browser prototypes, use the CDN script tag. For production applications built with bundlers like Webpack, Vite, or Next.js, install via npm. The library provides two main packages: @tensorflow/tfjs bundles the full runtime including all backends, while @tensorflow/tfjs-core provides just the tensor operations for custom builds where bundle size matters.
The setup step that most tutorials skip — and that causes the most production issues — is backend verification. TensorFlow.js selects a compute backend automatically based on device capabilities, but this selection can fail silently. If the WebGL backend fails to initialize (common on older mobile devices or headless environments), the library falls back to CPU without any warning. Your code runs, your predictions work, and everything is 50x slower than it should be. Always verify the active backend after initialization.
```js
// Option 1: CDN — simplest for prototypes and demos
// <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@4.22.0"></script>

// Option 2: npm — for production bundlers (Next.js, Vite, Webpack)
// npm install @tensorflow/tfjs @tensorflow/tfjs-backend-webgl

import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgl'; // Explicit backend import

// Verify installation, backend, and GPU availability
async function initTF() {
  await tf.ready();
  const backend = tf.getBackend();
  const memInfo = tf.memory();

  console.log(`TensorFlow.js v${tf.version.tfjs}`);
  console.log(`Backend: ${backend}`);
  console.log(`GPU Tensors: ${memInfo.numTensors}`);
  console.log(`GPU Memory: ${(memInfo.numBytes / 1e6).toFixed(1)}MB`);

  if (backend === 'cpu') {
    console.warn(
      'WARNING: Running on CPU backend. GPU acceleration is not available. ' +
      'Performance will be 10-50x slower than WebGL/WebGPU.'
    );
  }

  // Quick sanity test — verify tensor operations work
  const test = tf.tensor([1, 2, 3, 4]);
  console.log('Sanity check:', test.dataSync()); // [1, 2, 3, 4]
  test.dispose();

  return backend;
}

initTF();
```
Backend: webgl
GPU Tensors: 0
GPU Memory: 0.0MB
Sanity check: Float32Array(4) [1, 2, 3, 4]
- webgpu — fastest, requires Chrome 113+ or Edge 113+. Uses GPU compute shaders directly. Best for large models and real-time video.
- webgl — wide support across all modern browsers. Uses GPU fragment shaders repurposed for parallel compute. The production default.
- wasm — WebAssembly backend. Runs on CPU but uses SIMD instructions. Good fallback for environments without GPU access.
- cpu — slowest but universally available. Pure JavaScript. Use only for tiny models, debugging, or server-side Node.js without native bindings.
Backend initialization is asynchronous. If you call model.predict() before tf.ready() resolves, the CPU backend may be used silently — your code works but at 50x slower performance with no error or warning. Call tf.ready() at app initialization, before any tensor operation, and log tf.getBackend() to verify GPU activation.

Loading Pre-trained Models
The most common production pattern is loading a pre-trained model, not training in the browser. TensorFlow.js supports models converted from Python TensorFlow/Keras via the tensorflowjs_converter CLI, as well as models hosted directly on TensorFlow Hub or custom CDN endpoints. Two loading functions handle different model formats: tf.loadLayersModel() for Keras Sequential and Functional models, and tf.loadGraphModel() for TensorFlow SavedModels converted to graph format.
Model loading involves three network-dependent steps: fetching the model.json topology file, downloading the weight shard files (one or more .bin files), and initializing the computation graph in GPU memory. The topology fetch is small (typically 10-100KB), but weight shards can be tens of megabytes. Progressive loading with an onProgress callback lets you show meaningful load indicators to users instead of a frozen screen.
The detail that catches every team at least once: the first prediction after loading is always slow. This is not a bug. The GPU backend needs to compile shader programs for every unique operation in the model graph. Shader compilation happens lazily on the first inference call, not during model load. Running a dummy prediction during the loading phase — a warm-up pass — moves this cost out of the user's interaction path.
```js
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgl';

// Load from URL (CDN-hosted model)
const MODEL_URL = 'https://storage.googleapis.com/your-bucket/models/classifier_v2/model.json';

async function loadModel(url = MODEL_URL) {
  await tf.ready();
  console.log(`Backend: ${tf.getBackend()}`);

  const startLoad = performance.now();
  const model = await tf.loadLayersModel(url, {
    onProgress: (fraction) => {
      // Update a loading bar in the UI
      console.log(`Loading: ${(fraction * 100).toFixed(1)}%`);
    }
  });
  const loadTime = performance.now() - startLoad;

  console.log(`Model loaded in ${loadTime.toFixed(0)}ms`);
  console.log(`Input shape: ${JSON.stringify(model.inputs[0].shape)}`);
  console.log(`Output shape: ${JSON.stringify(model.outputs[0].shape)}`);

  // Warm up — first prediction compiles GPU shaders
  const startWarmup = performance.now();
  const dummyInput = tf.zeros(model.inputs[0].shape.map(d => d || 1));
  const warmupOutput = model.predict(dummyInput);
  await warmupOutput.data(); // Force GPU sync — shaders compile here
  const warmupTime = performance.now() - startWarmup;
  tf.dispose([dummyInput, warmupOutput]);
  console.log(`Warm-up inference: ${warmupTime.toFixed(0)}ms (includes shader compilation)`);

  return model;
}

// Load from IndexedDB (cached model for offline and repeat visits)
async function loadCachedModel(modelId) {
  try {
    const model = await tf.loadLayersModel(`indexeddb://${modelId}`);
    console.log(`Loaded cached model: ${modelId}`);
    return model;
  } catch (err) {
    console.log(`No cached model found for ${modelId}, loading from network`);
    return null;
  }
}

// Save to IndexedDB after first network load
async function cacheModel(model, modelId) {
  await model.save(`indexeddb://${modelId}`);
  console.log(`Model cached as: ${modelId}`);
}

// Full loading strategy with cache-first pattern
async function loadModelWithCache(modelId, networkUrl) {
  // Try cache first
  let model = await loadCachedModel(modelId);
  if (!model) {
    // Cache miss — load from network (loadModel already warms up)
    model = await loadModel(networkUrl);
    await cacheModel(model, modelId);
    return model;
  }
  // Warm up the cached model — shaders still need to compile
  const dummy = tf.zeros(model.inputs[0].shape.map(d => d || 1));
  const warm = model.predict(dummy);
  await warm.data();
  tf.dispose([dummy, warm]);
  return model;
}
```
Loading: 25.0%
Loading: 50.0%
Loading: 75.0%
Loading: 100.0%
Model loaded in 1847ms
Input shape: [null,224,224,3]
Output shape: [null,5]
Warm-up inference: 342ms (includes shader compilation)
Running Inference in the Browser
Inference is the primary use case for TensorFlow.js in production. The pattern is straightforward: convert input data (an image, audio clip, or text) to a tensor, run model.predict(), and convert the output back to JavaScript arrays for display or decision-making.
The critical detail that determines whether your model works or produces garbage output is input preprocessing. The JavaScript preprocessing pipeline must exactly reproduce what the Python training pipeline did — same resize dimensions, same normalization formula, same channel ordering. A model trained on images normalized to [-1, 1] will produce nonsensical predictions if you feed it images normalized to [0, 1]. The values look plausible, the shapes are correct, the code runs without errors, and every prediction is wrong.
The second critical detail is memory management. Every call to model.predict() allocates new GPU memory for the output tensor. Every intermediate operation — fromPixels, resizeBilinear, toFloat, div — allocates an additional tensor. Without explicit cleanup, running inference in a loop (video processing, real-time camera feed) will exhaust GPU memory and crash the browser tab within seconds. tf.tidy() is the primary defense.
```js
import * as tf from '@tensorflow/tfjs';

// Image classification pipeline — single image
async function classifyImage(model, imageElement, labels) {
  // Preprocess: resize, normalize, add batch dimension
  // CRITICAL: normalization must match the Python training pipeline
  const inputTensor = tf.tidy(() => {
    return tf.browser.fromPixels(imageElement)  // [H, W, 3] uint8
      .resizeBilinear([224, 224])               // Match model input shape
      .toFloat()                                // Cast to float32
      .div(127.5)                               // Scale to [0, 2]
      .sub(1.0)                                 // Shift to [-1, 1] (MobileNet convention)
      .expandDims(0);                           // Add batch dim: [1, 224, 224, 3]
  });

  // Run inference
  const predictions = model.predict(inputTensor);
  const probabilities = await predictions.data(); // GPU → CPU transfer

  // Cleanup — prevent memory leaks
  tf.dispose([inputTensor, predictions]);

  // Map to class labels and sort by confidence
  const results = Array.from(probabilities)
    .map((prob, i) => ({ label: labels[i], confidence: prob }))
    .sort((a, b) => b.confidence - a.confidence);

  return { topPrediction: results[0], allPredictions: results };
}

// Real-time video classification at target FPS
async function classifyVideoStream(model, videoElement, labels, targetFPS = 30) {
  const frameInterval = 1000 / targetFPS;
  let lastFrameTime = 0;
  let isProcessing = false;

  async function processFrame(timestamp) {
    // Skip frame if previous inference is still running
    if (isProcessing || timestamp - lastFrameTime < frameInterval) {
      requestAnimationFrame(processFrame);
      return;
    }
    isProcessing = true;
    lastFrameTime = timestamp;

    // All tensor ops wrapped in tidy for automatic cleanup
    const outputTensor = tf.tidy(() => {
      const frame = tf.browser.fromPixels(videoElement);
      const resized = tf.image.resizeBilinear(frame, [224, 224]);
      const normalized = resized.toFloat().div(127.5).sub(1.0);
      const batched = normalized.expandDims(0);
      return model.predict(batched);
    });

    const result = await outputTensor.data();
    outputTensor.dispose();

    // Use result — update UI, trigger action, etc.
    const topIndex = result.indexOf(Math.max(...result));
    console.log(`${labels[topIndex]}: ${(result[topIndex] * 100).toFixed(1)}%`);

    isProcessing = false;
    requestAnimationFrame(processFrame);
  }

  requestAnimationFrame(processFrame);
}

// Example: classify a file upload
async function handleFileUpload(model, file) {
  const img = new Image();
  img.src = URL.createObjectURL(file);
  await img.decode(); // Wait for image to load completely

  const labels = ['cat', 'dog', 'bird', 'fish', 'horse'];
  const result = await classifyImage(model, img, labels);
  URL.revokeObjectURL(img.src); // Clean up object URL
  return result;
}
```
[{ label: 'cat', confidence: 0.942 }, { label: 'dog', confidence: 0.031 }, { label: 'bird', confidence: 0.015 }, { label: 'fish', confidence: 0.008 }, { label: 'horse', confidence: 0.004 }]
- Wrap all tensor creation and operations inside tf.tidy() callbacks — it automatically disposes intermediate tensors when the callback returns.
- Only the tensor returned from tf.tidy() survives — assign it to a variable, extract data with .data(), then dispose it manually.
- Never use async/await inside tf.tidy(). It only tracks synchronous tensor operations. For async code, dispose tensors manually in a try/finally block.
- Monitor with tf.memory().numTensors — this number should be stable between predictions. If it grows, you have a leak.
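To make the "only the return value survives" rule concrete, here is a toy re-implementation of the tidy() idea in plain JavaScript — a simplified sketch, not TF.js internals. The miniTidy helper and its fake tensor objects are invented for illustration; real TF.js tensors expose the same dispose() method.

```javascript
// Toy sketch of tidy()-style scoping: every "tensor" created inside the
// scope is tracked, and everything except the returned one is disposed
// when the scope exits. miniTidy and make() are hypothetical helpers.
function miniTidy(createdLog, fn) {
  const scope = [];
  const make = (name) => {
    const t = { name, disposed: false, dispose() { this.disposed = true; } };
    scope.push(t);
    createdLog.push(t);
    return t;
  };
  const result = fn(make);
  for (const t of scope) {
    if (t !== result) t.dispose(); // intermediates freed, return value survives
  }
  return result;
}

// Usage: three tensors created, two intermediates disposed, one survivor
const created = [];
const out = miniTidy(created, (make) => {
  make('float');          // intermediate — disposed on scope exit
  make('divided');        // intermediate — disposed on scope exit
  return make('batched'); // survives the scope
});
console.log(out.disposed);                           // false
console.log(created.filter(t => t.disposed).length); // 2
```

This mirrors why chained calls like toFloat().div().expandDims() are safe inside tf.tidy() but leak outside it.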
Converting Python Models to TensorFlow.js
Most production models are trained in Python using TensorFlow or Keras, then converted for browser deployment. The tensorflowjs_converter CLI tool handles this conversion, transforming SavedModel directories, Keras HDF5 files, or TensorFlow Hub modules into the TensorFlow.js graph model format that can be loaded in the browser.
Conversion is not just a format change — it is also the right place to apply optimizations. The --quantize_float16 flag halves model size by storing weights as 16-bit floats instead of 32-bit, with typically less than 1% accuracy loss. Weight sharding splits the model into multiple smaller files for parallel download and CDN-friendly caching. Both optimizations should be applied to every model before browser deployment.
The conversion step is also where you discover op compatibility issues. TensorFlow.js supports a subset of TensorFlow operations. Models that use custom ops, complex control flow with dynamic shapes, or string-based operations will fail during conversion with an explicit error listing the unsupported ops. This is the point to address those issues — either by replacing unsupported ops in the Python model or by restructuring the graph.
```bash
# Install the converter
pip install tensorflowjs

# Convert Keras .h5 model with float16 quantization
tensorflowjs_converter \
  --input_format=keras \
  --output_format=tfjs_graph_model \
  --quantize_float16 \
  --weight_shard_size_bytes=4194304 \
  ./models/my_model.h5 \
  ./tfjs_models/my_model

# Convert SavedModel directory
tensorflowjs_converter \
  --input_format=tf_saved_model \
  --output_format=tfjs_graph_model \
  --signature_name=serving_default \
  --saved_model_tags=serve \
  --quantize_float16 \
  --weight_shard_size_bytes=4194304 \
  ./saved_model/ \
  ./tfjs_models/my_model

# Convert Keras .keras format (TF 2.16+)
tensorflowjs_converter \
  --input_format=keras \
  --output_format=tfjs_graph_model \
  --quantize_float16 \
  ./models/my_model.keras \
  ./tfjs_models/my_model

# Verify converted model output structure
ls -la ./tfjs_models/my_model/
# model.json            (graph topology + weight manifest)
# group1-shard1of3.bin  (weight data, ~4MB each)
# group1-shard2of3.bin
# group1-shard3of3.bin

# Check model size after conversion
du -sh ./tfjs_models/my_model/
# 23M  ./tfjs_models/my_model/  (from 46MB float32 original)
```
Float16 quantization: 46.2MB → 23.1MB (50.0% reduction)
Model converted successfully.
./tfjs_models/my_model/
total 23M
-rw-r--r-- 1 user user 84K model.json
-rw-r--r-- 1 user user 4.0M group1-shard1of3.bin
-rw-r--r-- 1 user user 4.0M group1-shard2of3.bin
-rw-r--r-- 1 user user 3.1M group1-shard3of3.bin
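If conversion succeeds but you want to audit which ops a converted model actually uses, the ops are listed in the graph-model model.json under modelTopology.node. A minimal sketch of such a scan — the SUPPORTED allow-list here is illustrative, not the real TF.js kernel registry, and the fake topology is invented for the example:

```javascript
// Sketch: scan a converted graph-model model.json for ops outside a
// hand-maintained allow-list. SUPPORTED is illustrative only — the real
// set of registered kernels depends on the TF.js version and backend.
const SUPPORTED = new Set([
  'Const', 'Placeholder', 'Identity', 'Conv2D', 'Relu', 'MatMul',
  'BiasAdd', 'Softmax', 'MaxPool', 'Reshape', 'AddV2'
]);

function findUnverifiedOps(modelJson) {
  const nodes = (modelJson.modelTopology && modelJson.modelTopology.node) || [];
  const unknown = new Set();
  for (const node of nodes) {
    if (!SUPPORTED.has(node.op)) unknown.add(node.op);
  }
  return [...unknown].sort();
}

// Example with a tiny fake topology (hypothetical model)
const fake = {
  modelTopology: {
    node: [
      { name: 'input', op: 'Placeholder' },
      { name: 'conv', op: 'Conv2D' },
      { name: 'lookup', op: 'HashTableV2' } // string op — typically unsupported in browser
    ]
  }
};
console.log(findUnverifiedOps(fake)); // ['HashTableV2']
```

Running this against your own model.json before deployment turns a runtime "kernel not registered" surprise into a build-time check.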
WebGPU Acceleration
WebGPU is the next-generation GPU API that replaces WebGL for general-purpose GPU compute in browsers. Where WebGL repurposes graphics fragment shaders for matrix operations (a clever hack that works but has overhead), WebGPU provides direct access to GPU compute shaders designed for parallel computation. TensorFlow.js uses WebGPU as a backend for faster matrix operations, memory transfers, and kernel dispatch.
The performance gain from WebGPU over WebGL varies by model architecture and operation mix. Matrix-heavy models (transformers, large dense layers) see the largest improvements — 2-10x speedup is typical. Models dominated by small convolutions may see smaller gains because the overhead reduction matters less when each kernel is already fast.
WebGPU support is expanding but not universal. Chrome 113+, Edge 113+, and Firefox Nightly support it. Safari has experimental support behind a flag. For production applications, you must implement a fallback chain: attempt WebGPU first, fall back to WebGL, and use CPU as the last resort. Feature detection is straightforward — check 'gpu' in navigator before attempting initialization.
```js
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgpu';

// Initialize with fallback chain: WebGPU → WebGL → CPU
async function initBestBackend() {
  const backends = ['webgpu', 'webgl', 'cpu'];

  for (const backend of backends) {
    try {
      // Feature detection for WebGPU
      if (backend === 'webgpu' && !('gpu' in navigator)) {
        console.log('WebGPU: not available in this browser');
        continue;
      }
      await tf.setBackend(backend);
      await tf.ready();
      console.log(`Backend initialized: ${backend}`);
      return backend;
    } catch (err) {
      console.warn(`${backend} backend failed: ${err.message}`);
    }
  }
  throw new Error('No TensorFlow.js backend available');
}

// Benchmark to compare backends on the actual device
async function benchmarkBackend(iterations = 10) {
  const backend = tf.getBackend();
  const a = tf.randomNormal([1024, 1024]);
  const b = tf.randomNormal([1024, 1024]);

  // Warm up — first run includes shader compilation
  const warmup = tf.matMul(a, b);
  await warmup.data();
  warmup.dispose();

  // Timed runs
  const times = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    const c = tf.matMul(a, b);
    await c.data(); // Force GPU sync
    times.push(performance.now() - start);
    c.dispose();
  }
  tf.dispose([a, b]);

  const avg = times.reduce((s, t) => s + t, 0) / times.length;
  const min = Math.min(...times);
  const max = Math.max(...times);
  console.log(`Backend: ${backend}`);
  console.log(`1024x1024 matMul (${iterations} runs):`);
  console.log(`  Avg: ${avg.toFixed(1)}ms`);
  console.log(`  Min: ${min.toFixed(1)}ms`);
  console.log(`  Max: ${max.toFixed(1)}ms`);
  return { backend, avg, min, max };
}

// Usage
const activeBackend = await initBestBackend();
await benchmarkBackend();
```
webgl backend failed: WebGL2 context creation failed
Backend initialized: cpu
-- or on a supported device: --
Backend initialized: webgpu
1024x1024 matMul (10 runs):
Avg: 4.2ms
Min: 3.8ms
Max: 5.1ms
Memory Management in the Browser
Browsers enforce strict memory budgets per tab — typically 200-500MB on mobile and 1-4GB on desktop. TensorFlow.js allocates GPU memory for every tensor created, and unlike regular JavaScript objects, tensors are not managed by the garbage collector. You must dispose them explicitly.
This is the number one production issue with TensorFlow.js. It manifests as tabs crashing after running inference multiple times, especially on mobile devices with tight memory constraints. The failure mode is not graceful — the browser kills the tab with an Out of Memory error, losing any unsaved user state.
The core rule is simple: every tensor must be disposed after use. The practical challenge is that tensor operations create intermediate tensors that are easy to lose track of. A single line like tensor.toFloat().div(255.0).expandDims(0) creates three intermediate tensors, each consuming GPU memory. tf.tidy() solves this by tracking all tensor allocations within its callback and automatically disposing everything except the return value. For async operations where tf.tidy() cannot be used, manual disposal in try/finally blocks is required.
```js
import * as tf from '@tensorflow/tfjs';

// PATTERN 1: tf.tidy for synchronous automatic cleanup
function predictSafely(model, imageElement) {
  // All intermediate tensors created inside tf.tidy are disposed automatically
  // Only the returned tensor survives
  return tf.tidy(() => {
    const input = tf.browser.fromPixels(imageElement)
      .toFloat()       // intermediate tensor 1
      .div(255.0)      // intermediate tensor 2
      .expandDims(0);  // intermediate tensor 3
    return model.predict(input); // only this survives
  });
}

// PATTERN 2: Manual disposal for async operations
async function predictAsync(model, imageElement) {
  let input = null;
  let output = null;
  try {
    input = tf.tidy(() => {
      return tf.browser.fromPixels(imageElement)
        .toFloat().div(255.0).expandDims(0);
    });
    output = model.predict(input);
    const result = await output.data(); // async — cannot use tf.tidy for this
    return Array.from(result);
  } finally {
    // Dispose in finally block — runs even if an error is thrown
    if (input) input.dispose();
    if (output) output.dispose();
  }
}

// ANTI-PATTERN: Memory leak — tensors never disposed
function predictLeaky(model, imageElement) {
  // BAD: three intermediate tensors leak on every call
  const pixels = tf.browser.fromPixels(imageElement); // leaked
  const floats = pixels.toFloat();                    // leaked
  const normalized = floats.div(255.0);               // leaked
  const batched = normalized.expandDims(0);           // leaked
  const output = model.predict(batched);              // leaked
  return output.data();
  // Nothing is ever disposed — GPU memory grows until crash
}

// Memory monitoring — use in development to detect leaks
function assertNoLeaks(label, fn) {
  const before = tf.memory().numTensors;
  fn();
  const after = tf.memory().numTensors;
  if (after > before + 1) { // +1 for the returned tensor
    console.error(
      `[LEAK] ${label}: ${after - before} tensors created, ` +
      `expected at most 1. Before: ${before}, After: ${after}`
    );
  }
}

// Full lifecycle monitoring
function logMemory(label) {
  const info = tf.memory();
  console.log(
    `[${label}] Tensors: ${info.numTensors} | ` +
    `Bytes: ${(info.numBytes / 1e6).toFixed(1)}MB | ` +
    `Unreliable: ${info.unreliable}`
  );
}

// Cleanup when a model is no longer needed
function disposeModel(model) {
  model.dispose(); // Frees all weight tensors and GPU resources
  console.log('Model disposed. Remaining tensors:', tf.memory().numTensors);
}
```
[predictAsync] Tensors: 0 (all disposed in finally block)
[predictLeaky] Tensors: +5 per call — LEAK DETECTED
[Before prediction] Tensors: 42 | Bytes: 23.4MB | Unreliable: false
[After prediction] Tensors: 42 | Bytes: 23.4MB | Unreliable: false
(Stable count = no leak)
- tf.tidy() handles disposal for synchronous operations — use it everywhere possible. It is the single most important API for preventing leaks.
- For async code paths (any function with await between tensor creation and disposal), you must call .dispose() manually — use try/finally to guarantee cleanup even on errors.
- model.predict() returns a new tensor every call — the result must be disposed after extracting data with .data() or .dataSync().
- tf.memory().numTensors should be stable between predictions. If it grows by more than 0-1 per prediction cycle, you have a leak that will eventually crash the tab.
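The stability check above can be turned into a production metric. A minimal sketch of an "every Nth prediction" sampler — the createTensorCountSampler helper and its injected getMemory/report callbacks are hypothetical; in a real app you would pass () => tf.memory() and your telemetry client:

```javascript
// Sketch: emit numTensors as a metric every Nth prediction call.
// getMemory and report are injected so this stays framework-agnostic.
function createTensorCountSampler(getMemory, report, every = 100) {
  let calls = 0;
  let baseline = null;
  return function sample() {
    calls++;
    if (calls % every !== 0) return;
    const { numTensors } = getMemory();
    if (baseline === null) baseline = numTensors;
    // A steadily growing drift relative to baseline is the pre-crash signal
    report({ calls, numTensors, drift: numTensors - baseline });
  };
}

// Usage with a fake memory source that leaks one tensor per prediction
let tensors = 40;
const reports = [];
const sample = createTensorCountSampler(
  () => ({ numTensors: tensors }),
  (m) => reports.push(m),
  100
);
for (let i = 0; i < 300; i++) { tensors++; sample(); }
console.log(reports.map(r => r.drift)); // [0, 100, 200]
```

A monotonically increasing drift in these reports is exactly the leak signature described above: fix it before the tab crashes, not after.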
Track tf.memory().numTensors in development and in production error reporting. Emit this value as a metric on every Nth prediction call. A growing count is a pre-crash signal that gives you time to fix the leak before users experience tab crashes. For long-running inference loops, put tf.tidy() inside the loop body, monitor tf.memory().numTensors every N iterations, and assert that it stays stable. When a model is no longer needed, call model.dispose() to free all weight tensors and associated GPU memory, and verify with tf.memory().

Integration with Next.js
TensorFlow.js requires special handling in Next.js because of server-side rendering. The library accesses browser APIs — WebGL context, canvas elements, navigator.gpu — that do not exist in Node.js. Importing TensorFlow.js in a server component or during SSR will throw errors like 'self is not defined' or 'WebGL context creation failed'.
The solution is twofold: mark components that use TensorFlow.js with the 'use client' directive, and import them with Next.js dynamic import using ssr: false. This prevents the component from being evaluated during server-side rendering and ensures TensorFlow.js only loads in the browser.
The second production concern is component lifecycle management. Next.js re-renders components on route changes and state updates. If the model loads in a useEffect without a corresponding cleanup function, navigating away and back creates duplicate model instances — each consuming GPU memory for the full set of weights. After three or four navigations, the tab runs out of memory. Always dispose the model in the useEffect cleanup return function.
```jsx
// components/ImageClassifier.jsx
'use client';

import { useState, useEffect, useRef, useCallback } from 'react';

export default function ImageClassifier() {
  const [loading, setLoading] = useState(true);
  const [result, setResult] = useState(null);
  const [error, setError] = useState(null);
  const modelRef = useRef(null);
  const tfRef = useRef(null);

  useEffect(() => {
    let cancelled = false;

    async function init() {
      try {
        // Dynamic import of TensorFlow.js — only in browser
        const tf = await import('@tensorflow/tfjs');
        await import('@tensorflow/tfjs-backend-webgl');
        tfRef.current = tf;

        await tf.ready();
        console.log(`Backend: ${tf.getBackend()}`);

        const model = await tf.loadLayersModel('/models/classifier/model.json');

        // Warm up with dummy prediction
        const inputShape = model.inputs[0].shape.map(d => d || 1);
        const dummy = tf.zeros(inputShape);
        const warm = model.predict(dummy);
        await warm.data();
        tf.dispose([dummy, warm]);

        if (!cancelled) {
          modelRef.current = model;
          setLoading(false);
        } else {
          model.dispose(); // Component unmounted during loading
        }
      } catch (err) {
        if (!cancelled) {
          setError(err.message);
          setLoading(false);
        }
      }
    }
    init();

    // Cleanup on unmount — prevents GPU memory leak on route change
    return () => {
      cancelled = true;
      if (modelRef.current) {
        modelRef.current.dispose();
        modelRef.current = null;
        console.log('Model disposed on component unmount');
      }
    };
  }, []);

  const handlePredict = useCallback(async (imageElement) => {
    const model = modelRef.current;
    const tf = tfRef.current;
    if (!model || !tf) return null;

    const prediction = tf.tidy(() => {
      const input = tf.browser.fromPixels(imageElement)
        .resizeBilinear([224, 224])
        .toFloat().div(127.5).sub(1.0)
        .expandDims(0);
      return model.predict(input);
    });

    const data = await prediction.data();
    prediction.dispose();

    const labels = ['cat', 'dog', 'bird', 'fish', 'horse'];
    const topIndex = Array.from(data).indexOf(Math.max(...data));
    const newResult = { label: labels[topIndex], confidence: data[topIndex] };
    setResult(newResult);
    return newResult;
  }, []);

  if (error) return <p>ML Error: {error}</p>;
  if (loading) return <p>Loading ML model...</p>;
  return <div>Model ready. Result: {result?.label} ({(result?.confidence * 100)?.toFixed(1)}%)</div>;
}

// app/page.jsx — dynamic import prevents SSR
import dynamic from 'next/dynamic';

const ImageClassifier = dynamic(
  () => import('@/components/ImageClassifier'),
  { ssr: false, loading: () => <p>Initializing ML engine...</p> }
);

export default function Home() {
  return (
    <main>
      <h1>Browser ML Demo</h1>
      <ImageClassifier />
    </main>
  );
}
```
Performance Optimization
Browser ML performance depends on three factors: model size, backend selection, and input preprocessing pipeline. Optimizing all three is required for real-time applications. A 30 FPS target means the entire pipeline — image capture, preprocessing, inference, post-processing, and UI update — must complete within 33 milliseconds per frame.
Model size is the most impactful lever. A MobileNetV2 (14MB quantized) runs 10x faster than a ResNet-50 (98MB quantized) with comparable accuracy for many classification tasks. Choosing the right architecture for the deployment target is more effective than any runtime optimization.
Input resolution is the second lever. Reducing input from 224x224 to 128x128 cuts the pixel count by roughly 67%, which proportionally reduces memory allocation, data transfer, and computation time. Many real-time applications achieve acceptable accuracy at lower resolutions — test before assuming 224x224 is required.
Batching helps throughput but hurts latency. For video processing where you want maximum FPS on a single stream, process one frame at a time. For scenarios where you have multiple independent inputs (batch of uploaded images), stack them into a single tensor and run one predict() call. GPU utilization is higher on batch operations.
```js
import * as tf from '@tensorflow/tfjs';

// Technique 1: Batch predictions for throughput
async function batchPredict(model, images) {
  // Stack multiple images into one batch tensor — one GPU dispatch
  const batchTensor = tf.tidy(() => {
    const tensors = images.map(img =>
      tf.browser.fromPixels(img)
        .resizeBilinear([224, 224])
        .toFloat().div(127.5).sub(1.0)
    );
    return tf.stack(tensors); // [batch, 224, 224, 3]
  });

  const predictions = model.predict(batchTensor);
  const results = await predictions.array();
  tf.dispose([batchTensor, predictions]);
  return results;
}

// Technique 2: Reduce input resolution for real-time speed
function preprocessAtResolution(imageElement, targetSize = 128) {
  return tf.tidy(() => {
    return tf.browser.fromPixels(imageElement)
      .resizeBilinear([targetSize, targetSize]) // 128x128 = ~67% fewer pixels than 224x224
      .toFloat().div(127.5).sub(1.0)
      .expandDims(0);
  });
}

// Technique 3: Profile inference to find bottlenecks
async function profileInference(model, inputShape, runs = 20) {
  const input = tf.randomNormal(inputShape);

  // Warm up — exclude shader compilation from timing
  const warmup = model.predict(input);
  await warmup.data();
  warmup.dispose();

  // Timed runs
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    const output = model.predict(input);
    await output.data(); // Force GPU sync
    times.push(performance.now() - start);
    output.dispose();
  }
  input.dispose();

  const avg = times.reduce((s, t) => s + t, 0) / times.length;
  const sorted = times.sort((a, b) => a - b);
  const p95 = sorted[Math.floor(sorted.length * 0.95)];
  const min = sorted[0];

  console.log(`Inference profile (${runs} runs, ${tf.getBackend()} backend):`);
  console.log(`  Average: ${avg.toFixed(1)}ms`);
  console.log(`  P95: ${p95.toFixed(1)}ms`);
  console.log(`  Min: ${min.toFixed(1)}ms`);
  console.log(`  Target: ${avg < 33 ? '✓ 30 FPS achievable' : '✗ Too slow for 30 FPS'}`);
  return { avg, p95, min };
}

// Technique 4: Skip frames when inference cannot keep up
class AdaptiveInference {
  constructor(model, targetFPS = 30) {
    this.model = model;
    this.targetInterval = 1000 / targetFPS;
    this.isProcessing = false;
    this.lastTime = 0;
    this.droppedFrames = 0;
    this.processedFrames = 0;
  }

  async processFrame(imageElement, timestamp) {
    if (this.isProcessing) {
      this.droppedFrames++;
      return null; // Skip — previous frame still processing
    }
    if (timestamp - this.lastTime < this.targetInterval) {
      return null; // Skip — too soon since last frame
    }
    this.isProcessing = true;
    this.lastTime = timestamp;

    const output = tf.tidy(() => {
      const input = tf.browser.fromPixels(imageElement)
        .resizeBilinear([128, 128])
        .toFloat().div(127.5).sub(1.0)
        .expandDims(0);
      return this.model.predict(input);
    });

    const result = await output.data();
    output.dispose();
    this.processedFrames++;
    this.isProcessing = false;
    return result;
  }

  getStats() {
    const total = this.processedFrames + this.droppedFrames;
    return {
      processed: this.processedFrames,
      dropped: this.droppedFrames,
      dropRate: total > 0 ? (this.droppedFrames / total * 100).toFixed(1) + '%' : '0%'
    };
  }
}

// Usage
const profiler = await profileInference(model, [1, 224, 224, 3]);
const adaptive = new AdaptiveInference(model, 30);
```
```
  Average: 18.3ms
  P95: 22.1ms
  Min: 16.7ms
  Target: ✓ 30 FPS achievable
```
- Preprocessing (resize, normalize) typically takes 2-8ms depending on resolution — budget for it explicitly.
- Model inference dominates the budget. Profile it separately with `performance.now()` around `model.predict()` plus `await data()`.
- If inference alone exceeds 25ms, reduce input resolution or switch to a smaller model architecture — tuning other parameters will not close the gap.
- Batching helps throughput on multiple images but increases per-frame latency. For real-time single-stream video, always predict one frame at a time.
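The budgeting advice above can be condensed into a quick check. This is an illustrative helper, not a TensorFlow.js API; the `frameBudget` name, its fields, and the 25ms heuristic are assumptions for this sketch:

```javascript
// Illustrative frame-budget check: given measured preprocessing and
// inference times (ms), report achievable FPS and whether the
// target-FPS budget still holds.
function frameBudget(preprocessMs, inferenceMs, targetFPS = 30) {
  const budgetMs = 1000 / targetFPS;          // ~33.3ms for 30 FPS
  const totalMs = preprocessMs + inferenceMs; // end-to-end per frame
  return {
    totalMs,
    achievableFPS: Math.floor(1000 / totalMs),
    fitsBudget: totalMs <= budgetMs,
    // If inference alone blows the budget, only resolution or model
    // changes will help — preprocessing tuning cannot close the gap.
    inferenceBound: inferenceMs > budgetMs,
  };
}

// 5ms preprocessing + 18ms inference: comfortably within 30 FPS
console.log(frameBudget(5, 18));
// 4ms preprocessing + 40ms inference: inference alone is too slow
console.log(frameBudget(4, 40));
```

Feed it the numbers from your own profiling runs rather than assumptions; the split between preprocessing and inference is what tells you which knob to turn.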
| Backend | Speed | Browser Support | Best For | Fallback Risk |
|---|---|---|---|---|
| WebGPU | Fastest (2-10x vs WebGL) | Chrome 113+, Edge 113+, Firefox (recent) | Large models, transformers, real-time video | Not universally supported — must implement WebGL fallback |
| WebGL | Fast (baseline GPU) | All modern browsers including mobile | General inference, widest device reach | Some older mobile GPUs have limited texture sizes |
| WASM | Medium (CPU with SIMD) | All browsers with WebAssembly support | Web Workers, environments without GPU access | Slower than GPU backends but predictable performance |
| CPU | Slowest (10-50x vs GPU) | Universal — always available | Tiny models under 1MB, debugging, unit tests | Always available — the final fallback |
| Node.js (native bindings) | Fast (C++ TF runtime) | N/A — server only | Server-side inference, batch processing | Not browser-compatible — separate deployment |
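The fallback chain the table implies can be sketched as a priority list built from feature detection. `pickBackendOrder` is an illustrative helper, not part of TensorFlow.js; the browser feature checks and `tf.setBackend()` / `tf.ready()` calls shown in the comments are real APIs but need a browser to run:

```javascript
// Illustrative backend selection: given feature-detection results,
// return backend names to try in priority order (fastest first).
function pickBackendOrder(features) {
  const order = [];
  if (features.webgpu) order.push('webgpu'); // Chrome/Edge 113+
  if (features.webgl2) order.push('webgl');  // all modern browsers
  if (features.wasm) order.push('wasm');     // CPU with SIMD
  order.push('cpu');                         // universal final fallback
  return order;
}

// In a browser, feature detection and initialization would look like:
//   const order = pickBackendOrder({
//     webgpu: 'gpu' in navigator,
//     webgl2: !!document.createElement('canvas').getContext('webgl2'),
//     wasm: typeof WebAssembly === 'object',
//   });
//   for (const backend of order) {
//     if (await tf.setBackend(backend)) { await tf.ready(); break; }
//   }

console.log(pickBackendOrder({ webgpu: false, webgl2: true, wasm: true }));
```

Note that `tf.setBackend()` resolves to a boolean indicating whether initialization succeeded, which is what makes the loop-until-success pattern work.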
🎯 Key Takeaways
- TensorFlow.js moves ML inference to the browser — zero server latency, full data privacy, zero inference server costs at scale.
- Always use pre-trained models converted from Python with tensorflowjs_converter. Training complex models in-browser is not practical for production.
- Memory management is the #1 production concern. Use `tf.tidy()` for sync code and try/finally with `.dispose()` for async code. Monitor `tf.memory().numTensors` as a health metric.
- Quantize models to float16 before deployment — halves download size and memory footprint with negligible accuracy loss for most model types.
- WebGPU provides 2-10x speedup over WebGL but requires a fallback chain for unsupported devices. Feature-detect, do not assume.
- In Next.js, always use 'use client' and dynamic import with ssr: false. Dispose models in useEffect cleanup to prevent GPU leaks on route changes.
- Warm up models with a dummy prediction during loading — the first inference includes shader compilation and is 5-30x slower than steady state.
Interview Questions on This Topic
- Q: How does TensorFlow.js differ from running TensorFlow on a Python server? (Junior)
- Q: A user reports that your TensorFlow.js model gives different results than the Python version. How do you debug this? (Mid-level)
- Q: Explain `tf.tidy()` and why it is critical for production TensorFlow.js applications. (Mid-level)
- Q: How would you design a real-time hand gesture recognition system using TensorFlow.js at 30 FPS? (Senior)
Frequently Asked Questions
Can I train a model from scratch in the browser with TensorFlow.js?
Technically yes — TensorFlow.js supports model.fit() with the full training API. But it is not recommended for production models. Browser-based training is limited by GPU memory (200-500MB per tab on mobile), lacks optimized training kernels that native CUDA provides, cannot persist checkpoints reliably across sessions, and is dramatically slower than server-side training on equivalent hardware. The practical use case for in-browser training is transfer learning — take a pre-trained model like MobileNet, freeze all layers except the last few, and fine-tune on a small dataset (50-500 examples) that the user provides directly. This works well for personalization features where the user labels their own images and the model adapts without data leaving the device.
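The freezing step described above can be sketched as follows. `freezeAllButLast` is an illustrative helper (not a tfjs API), but it relies on the real boolean `trainable` property that tfjs layers expose; a real fine-tune would load MobileNet with `tf.loadLayersModel()`, freeze its layers the same way, then call `model.compile()` and `model.fit()`:

```javascript
// Illustrative transfer-learning setup: freeze every layer except the
// last few, so only the new head is updated during fine-tuning.
function freezeAllButLast(layers, trainableCount = 2) {
  layers.forEach((layer, i) => {
    // tfjs layers expose this same `trainable` flag
    layer.trainable = i >= layers.length - trainableCount;
  });
  return layers;
}

// Stand-in for base.layers — in a real app:
//   const base = await tf.loadLayersModel('https://.../mobilenet/model.json');
//   freezeAllButLast(base.layers, 2);
//   base.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy' });
//   await base.fit(xs, ys, { epochs: 5 });
const layers = [
  { name: 'conv1', trainable: true },
  { name: 'conv2', trainable: true },
  { name: 'dense1', trainable: true },
  { name: 'softmax', trainable: true },
];
freezeAllButLast(layers, 2);
console.log(layers.map(l => `${l.name}: ${l.trainable}`));
```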
What is the maximum model size I can deploy in the browser?
There is no hard limit imposed by TensorFlow.js, but practical constraints narrow the range significantly. Models over 50MB cause noticeable download delays on mobile networks (2-5 seconds on 4G). Models over 100MB risk Out of Memory errors during graph initialization on devices with 2-3GB total RAM. Models over 200MB will crash most mobile browsers. The production sweet spot for broad device compatibility is 5-30MB after float16 quantization. For applications targeting only desktop users with modern hardware, you can push to 100MB with progressive loading and a good loading UX. Use navigator.deviceMemory (where available) to detect device capability and serve appropriately sized models — a 30MB model for desktop, an 8MB model for mobile.
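The capability-based serving described above can be sketched with a small picker. The `pickModelUrl` helper, the URLs, and the 8GB threshold are assumptions for this sketch; `navigator.deviceMemory` is a real (Chromium-only) API and is `undefined` in Safari and Firefox, so default conservatively:

```javascript
// Illustrative model picker based on navigator.deviceMemory (GB).
function pickModelUrl(deviceMemoryGB) {
  // deviceMemory is undefined outside Chromium — assume a low-end
  // device rather than risking an OOM on mobile.
  if (deviceMemoryGB === undefined) return '/models/small/model.json';
  return deviceMemoryGB >= 8
    ? '/models/large/model.json'   // e.g. ~30MB desktop model
    : '/models/small/model.json';  // e.g. ~8MB mobile model
}

// In the browser:
//   const model = await tf.loadLayersModel(pickModelUrl(navigator.deviceMemory));
console.log(pickModelUrl(8));
console.log(pickModelUrl(undefined));
```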
How do I handle model updates after deployment?
Use content-hash filenames for model weight shards — for example, group1-shard1of3.a3f8b2.bin. When the model is retrained and redeployed, the hash changes and CDN caches serve the new version automatically. The model.json manifest contains all shard filenames and must be updated to reference the new hashes. For IndexedDB-cached models, implement a version check on app load: store a model version hash in localStorage, compare it against a version endpoint on your server, and delete and re-download the cached model if they differ. For Service Worker caching, increment the cache name in your Service Worker script to trigger re-download of all model files on the next activation.
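The IndexedDB version check described above can be sketched with dependencies injected so the logic is testable. `ensureFreshModel`, the storage key, and the endpoint are assumptions; in a real app, `storage` would be `localStorage`, `fetchVersion` would hit your version endpoint, and `reloadModel` would call the real tfjs cache APIs `tf.io.removeModel('indexeddb://...')` and `model.save('indexeddb://...')`:

```javascript
// Illustrative cache invalidation: compare the stored model version
// hash against the server's and re-download on mismatch.
async function ensureFreshModel({ storage, fetchVersion, reloadModel }) {
  const cached = storage['model-version'];
  const latest = await fetchVersion(); // e.g. GET /api/model-version
  if (cached !== latest) {
    await reloadModel();               // delete cached copy, re-download
    storage['model-version'] = latest;
    return 'reloaded';
  }
  return 'cached';                     // hashes match — keep cached model
}

// Usage with injected fakes:
const storage = { 'model-version': 'a3f8b2' };
ensureFreshModel({
  storage,
  fetchVersion: async () => 'c9d1e4',
  reloadModel: async () => {},
}).then(status => console.log(status, storage['model-version']));
```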
Does TensorFlow.js work in Web Workers?
TensorFlow.js supports Web Workers with the WASM and CPU backends. The WebGL and WebGPU backends require access to the DOM — specifically a canvas element for GPU context creation — which is not available in Workers. The practical architecture for Worker-based ML is: run preprocessing (image decode, resize, normalization) in a Worker to keep the main thread responsive, transfer the preprocessed data back to the main thread, and run GPU inference on the main thread. Alternatively, use OffscreenCanvas (supported in Chrome and Firefox) to create a WebGL context inside a Worker, though this path has less community testing and documentation. For CPU-bound models (small classifiers, text processing), running the entire pipeline in a Worker with the WASM backend keeps the main thread completely free.
How does TensorFlow.js compare to ONNX Runtime Web for browser ML?
Both run ML models in the browser. TensorFlow.js is tightly integrated with the TensorFlow and Keras ecosystem — if your models are trained in TensorFlow or Keras, the conversion and deployment path is well-tested and documented. ONNX Runtime Web supports models from any framework (PyTorch, TensorFlow, scikit-learn) exported to the ONNX format, giving it broader framework compatibility. Performance is comparable on WebGL for most model architectures. TensorFlow.js has a larger community, more tutorials, and pre-built model packages (MobileNet, PoseNet, etc.). ONNX Runtime Web has the advantage for teams with PyTorch-trained models who want browser deployment without going through a TensorFlow conversion step. Choose based on your training framework and team expertise.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.