Back to blog
tech 2026-02-05

Running AI Models in the Browser with ONNX Runtime

For years, “AI in the browser” sounded like a toy demo: a tiny classifier on MNIST, maybe a style transfer that melted your laptop. That changed when ONNX Runtime Web matured and WebGPU/WebGL paths became reliable enough for real image workloads. At Ai2Done, we use ONNX-powered models to drive features like enhancement and segmentation on-device, because sending user photos to a remote GPU contradicts everything we stand for.

Why ONNX, specifically?

The ONNX format decouples model authoring from deployment. Researchers train in PyTorch or elsewhere, export to ONNX, and we consume a single artifact that ONNX Runtime can optimize for different execution providers. In the browser, that means we can target WebGPU when available and fall back gracefully when it is not.

Conceptually, the inference loop looks like this:

// Pseudocode: create a session with a WebGPU-first provider list,
// feed input tensors, and read output tensors.
const session = await ort.InferenceSession.create("/models/segmentation.onnx", {
  executionProviders: ["webgpu", "wasm"], // fall back to WASM when WebGPU is unavailable
});
const feeds = { input: inputTensor }; // an ort.Tensor built from preprocessed pixels
const results = await session.run(feeds);
const mask = results.output; // consumed by canvas / WASM glue

The Go and WASM layers in our stack stay thin: they move bytes, expose progress, and keep business rules in internal/apps/ai2done/tool—not inside the model runtime itself.

Memory, tensors, and honesty

Neural networks are hungry. A wrong assumption—loading a “full resolution everything” model on a five-year-old laptop—creates tab crashes and angry users. We mitigate that with:

  • Model quantization where quality allows
  • Progressive loading so the UI stays responsive
  • Clear ceilings on input size, with user-visible messaging
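The last point is the cheapest to implement and the most often skipped: check the pixel budget before allocating a single tensor. A sketch, where the 4-megapixel ceiling and the function name are illustrative assumptions, not Ai2Done's actual limits:

```javascript
// Hypothetical guard: reject oversized inputs up front with a message the UI
// can show, instead of letting the tab die mid-inference.
const MAX_PIXELS = 4 * 1024 * 1024; // illustrative ceiling, not our real one

function checkInputSize(width, height) {
  const pixels = width * height;
  if (pixels > MAX_PIXELS) {
    return {
      ok: false,
      message: `Image is ${width}x${height} (${pixels} px); the limit is ${MAX_PIXELS} px. Please resize it first.`,
    };
  }
  return { ok: true, message: "" };
}
```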

This is the same philosophy we apply to PDF and video WASM: respect browser limits instead of pretending the web is a datacenter.

Privacy as a technical guarantee

When inference runs locally, the privacy story writes itself. Your image never touches our disks during enhancement—not because a banner makes a soft promise, but because the architecture has no upload path for that operation. That distinction matters for regulated environments and for anyone who simply does not want vacation photos on a stranger’s GPU.

Where Go fits

We still love Go for orchestration, static serving, and embedding WASM bundles. The mental model is clean: Go ships the app, JS bridges ONNX, WASM handles deterministic transforms where shared code with the server is valuable. DDD boundaries keep each layer honest—domain logic in tool/, services coordinating requests, no “smart” templates.

Debugging real-world drift

Models behave differently across devices: color profiles, WebGPU availability, and floating-point quirks can shift outputs subtly. We invest in golden-file tests on representative inputs and telemetry-free user feedback (literally: “report issue” without exfiltrating pixels) to catch edge cases.
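A golden-file check for float outputs needs a tolerance, or device-level drift will fail every run. A minimal sketch; the function name and the 1e-2 tolerance are assumptions for illustration, not our production values:

```javascript
// Hypothetical golden-file comparison: the model output matches the stored
// reference if every element is within `tolerance` of its golden counterpart.
function matchesGolden(output, golden, tolerance = 1e-2) {
  if (output.length !== golden.length) return false;
  for (let i = 0; i < output.length; i++) {
    if (Math.abs(output[i] - golden[i]) > tolerance) return false;
  }
  return true;
}
```

Exact equality is the wrong test here: WebGPU and WASM backends legitimately disagree in the low bits, and the tolerance is what separates "different device" from "regression."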

Looking forward

On-device ML will keep improving as browsers expose more performance and as models shrink. Ai2Done will keep riding that wave without turning your media into someone else’s training data. If you are an engineer evaluating ONNX in the browser, our advice is simple: treat memory and fallbacks as first-class requirements, and your users will feel the difference.