# AI Object Removal in the Browser: 2026 State of the Art
WASM inpainting models including LaMa and MI-GAN, memory and throughput limits in the browser, and how local inference now compares to cloud services.
Removing a person from a beach photo used to mean Photoshop's Content-Aware Fill and a prayer. Then cloud services like Runway and Cleanup.pictures took over for a few years. As of 2026, you can run models of the same quality directly in the browser: no upload, no account, no server bill. The gap closed faster than most people noticed.
## The model landscape
Four inpainting architectures matter for browser deployment:
- LaMa (Large Mask Inpainting) - 2021 paper, Fourier convolutions, handles large masks well. ONNX weights clock in at 196 MB at fp16.
- MI-GAN - distilled, mobile-oriented, 11 MB at int8. Fast but noticeably worse on complex backgrounds.
- MAT (Mask-Aware Transformer) - 2022, strong on structural completion (windows, tiles, regular textures). 340 MB at fp16.
- ZITS++ - the one most "magic eraser" features quietly use for low-resolution masks. Good speed-quality tradeoff.
In April 2026, LaMa remains the default for general-purpose removal. The Fourier convolutions generalize better than transformer-based models when the mask covers 20 percent or more of the image, which is the case most users hit.
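That threshold is easy to encode. A minimal sketch of the decision rule, assuming a single-channel 0/255 mask; the function names and the mapping of the 20 percent cutoff to specific model files are illustrative, not a library API:

```js
// Estimate what fraction of the image the mask covers.
// Assumes a single-channel mask: one byte per pixel, 255 = remove.
function maskCoverage(mask) {
  let covered = 0;
  for (let i = 0; i < mask.length; i++) {
    if (mask[i] > 127) covered++;
  }
  return covered / mask.length;
}

// Illustrative model choice: LaMa's global Fourier receptive field
// wins on large masks; MI-GAN's 11 MB download wins on small ones.
function pickModel(mask) {
  return maskCoverage(mask) >= 0.2 ? 'lama-fp16' : 'migan-int8';
}
```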
## Running inference in the browser
Three runtimes handle the inference path:
| Runtime | Backend | LaMa latency (1024x1024) | Notes |
|---|---|---|---|
| ONNX Runtime Web | WebGPU | 1.8 s | Best numbers on M-series and RTX |
| ONNX Runtime Web | WASM SIMD | 14 s | CPU fallback |
| transformers.js | WebGPU | 2.1 s | Friendlier API, slight overhead |
| MediaPipe Tasks | WebGPU | 1.6 s | Google-maintained, limited model set |
WebGPU is the headline. It's been available in Chrome since 113, Safari since 18, and Firefox since 141 (behind a flag until 144). If you ship to WebGPU-capable browsers, inference times for a 1 megapixel image now land between 1 and 3 seconds on any machine made in the last three years.
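Wiring up the CPU fallback is mostly feature detection. A sketch using ONNX Runtime Web, with a placeholder model URL; note that some onnxruntime-web versions ship the WebGPU provider in a separate `onnxruntime-web/webgpu` bundle, so check your version:

```js
import * as ort from 'onnxruntime-web';

// navigator.gpu exists in WebGPU-capable browsers, but
// requestAdapter() can still resolve to null (blocklisted GPUs,
// software renderers), so probe the adapter too.
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;
  try {
    return (await navigator.gpu.requestAdapter()) !== null;
  } catch {
    return false;
  }
}

// Placeholder path: point this at your hosted LaMa ONNX weights.
const MODEL_URL = '/models/lama-fp16.onnx';

async function createSession() {
  // Providers are tried in order; WASM SIMD is the CPU fallback.
  const eps = (await hasWebGPU()) ? ['webgpu', 'wasm'] : ['wasm'];
  return ort.InferenceSession.create(MODEL_URL, { executionProviders: eps });
}
```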
## Memory limits that bite
The real constraint is not compute; it's memory. A few numbers from production:
- Chrome on a 4 GB Chromebook: WebGPU heap typically caps around 1.5 GB of usable GPU memory. LaMa at fp16 plus an 8 megapixel image plus the WebGL/WebGPU frontbuffer overflows.
- iOS Safari: aggressive page reclamation. Tabs in the background get their WebGPU device revoked within about 30 seconds; plan for reload (a detection sketch follows this list).
- Any browser, any machine: a single `Uint8ClampedArray` larger than 4 GB throws. For images above 24 megapixels you have to tile.
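For the iOS Safari case, the device-loss signal is observable if you hold the raw `GPUDevice` yourself; runtimes like ONNX Runtime Web manage the device internally, so treat this as a sketch for hand-rolled WebGPU paths:

```js
// device.lost is a promise that settles when the browser revokes
// the GPUDevice -- which iOS Safari does to backgrounded tabs.
function watchDeviceLoss(device, onLost) {
  device.lost.then((info) => {
    // reason === 'destroyed' means you destroyed it deliberately;
    // anything else means the browser pulled it out from under you.
    if (info.reason !== 'destroyed') onLost(info);
  });
}

// Example: rebuild the session (or reload) when the device vanishes.
// watchDeviceLoss(device, () => location.reload());
```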
The standard mitigation is tile-based inference: split the image into overlapping 512x512 or 1024x1024 tiles, run inpainting per tile, blend with feathered masks. LaMa tolerates this reasonably well because its receptive field via Fourier convolutions is global within a tile. Models like MAT produce seams at tile boundaries and need larger overlaps (128+ pixels).
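A sketch of the blending arithmetic, assuming a hypothetical `runInpaint(image, mask, x, y, w, h)` helper that crops a tile, runs the model, and returns RGBA bytes; tile extraction and model I/O are the runtime's job:

```js
// Tile-based inference with feathered blending: accumulate weighted
// pixels, then normalize, so overlapping tiles average smoothly.
const TILE = 1024;    // per-tile inference size
const OVERLAP = 128;  // MAT-style models may need more

// Feather weight: near 1.0 in the tile interior, ramping toward 0
// at the tile edges over `margin` pixels.
function featherWeight(x, y, w, h, margin) {
  const fx = Math.min(x + 1, w - x, margin) / margin;
  const fy = Math.min(y + 1, h - y, margin) / margin;
  return fx * fy;
}

async function inpaintTiled(image, mask, width, height, runInpaint) {
  const acc = new Float32Array(width * height * 4);
  const wsum = new Float32Array(width * height);
  const step = TILE - OVERLAP;

  for (let ty = 0; ty < height; ty += step) {
    for (let tx = 0; tx < width; tx += step) {
      const w = Math.min(TILE, width - tx);
      const h = Math.min(TILE, height - ty);
      const tile = await runInpaint(image, mask, tx, ty, w, h);

      for (let y = 0; y < h; y++) {
        for (let x = 0; x < w; x++) {
          const wgt = featherWeight(x, y, w, h, OVERLAP);
          const src = (y * w + x) * 4;
          const dst = ((ty + y) * width + (tx + x)) * 4;
          for (let c = 0; c < 4; c++) acc[dst + c] += tile[src + c] * wgt;
          wsum[(ty + y) * width + (tx + x)] += wgt;
        }
      }
    }
  }

  // Normalize by the total feather weight at each pixel.
  const out = new Uint8ClampedArray(width * height * 4);
  for (let i = 0; i < width * height; i++) {
    const wgt = wsum[i] || 1;
    for (let c = 0; c < 4; c++) out[i * 4 + c] = acc[i * 4 + c] / wgt;
  }
  return out;
}
```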
## How local compares to cloud
For a 1024x1024 image with a person-sized mask:
- Cloud (Runway, Cleanup.pictures): upload plus inference plus download. Round-trip time 3 to 8 seconds on a good connection. Higher on mobile LTE. Privacy tradeoff: your photo lives on their servers, governed by their retention policy.
- Browser LaMa on WebGPU: 1.8 seconds after the first-load model download. The model is cached by the service worker (a cache-first sketch follows this list), so subsequent runs start instantly. Images never leave the device.
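A minimal cache-first service worker for the weights file; the `/models/` URL pattern is a placeholder, and the same strategy fits any large static asset:

```js
// sw.js -- cache-first strategy for the model weights.
const MODEL_CACHE = 'inpaint-models-v1';

self.addEventListener('fetch', (event) => {
  // Placeholder match: adjust to wherever your weights are hosted.
  if (!event.request.url.includes('/models/')) return;

  event.respondWith(
    caches.open(MODEL_CACHE).then(async (cache) => {
      const hit = await cache.match(event.request);
      if (hit) return hit; // ~100 MB served from disk, no network
      const resp = await fetch(event.request);
      // Clone before caching: a Response body can only be read once.
      cache.put(event.request, resp.clone());
      return resp;
    })
  );
});
```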
Where cloud still wins: very high resolutions (12+ megapixels) where the cloud GPU can process in one pass, and specialized models that are too large for reasonable browser download (5 GB+ diffusion-based inpainters).
Where browser wins: everything under 8 megapixels, anything touching private photos, anything where users will make ten edits in a row, and any workflow that batches hundreds of images. Round-tripping a thousand product photos to a cloud endpoint at 4 seconds each costs real money; doing it locally is free.
## A practical integration
```js
import { pipeline } from '@xenova/transformers';

// First call downloads the model; later calls reuse the cached copy.
const inpainter = await pipeline('image-inpainting', 'Xenova/lama-onnx', {
  device: 'webgpu',
  dtype: 'fp16', // half-precision weights, about 100 MB quantized
});

const result = await inpainter({ image, mask });
```
That is the full API. The model downloads once (about 100 MB quantized), caches in the origin's service worker, and runs locally from then on.
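That also makes the batch workflow from the cloud comparison cheap to sketch: reuse one pipeline instance across files. `loadImagePair` here is a hypothetical helper that decodes a photo and its mask:

```js
// One pipeline, many images: the model loads once, and each call
// after that is pure local inference with no per-image network cost.
async function batchRemove(files, inpainter) {
  const results = [];
  for (const file of files) {
    const { image, mask } = await loadImagePair(file); // hypothetical
    results.push(await inpainter({ image, mask }));
  }
  return results;
}
```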
## Limits worth naming
Browser inpainting still loses to cloud on three fronts in 2026:
- Very large masks spanning central subjects. Diffusion-based cloud models hallucinate plausible new content; LaMa and MI-GAN fill conservatively, extending surrounding texture rather than inventing detail.
- Text regions. No browser-size model reconstructs legible text. If your mask covers signage, expect garbled output.
- Hair and fine detail against complex backgrounds. Alpha-matting in the mask creation stage matters more than the inpainter itself. A crude mask produces a crude result.
For the other 80 percent of use cases (people on beaches, cars in driveways, logos on t-shirts, scratches on scans) the browser path is now the correct default. Processing happens client-side and the image never touches a server. That's the story of 2026 in image AI more broadly: the edge caught up.