How to Add a Local AI Assistant to a Mobile Browser: Lessons from Puma for Developers
An architectural guide to adding a private, low-latency, on-device AI assistant to mobile browsers and PWAs, with lessons from Puma and 2026 trends.
Stop shipping cloud-first assistants that leak data: build a fast, private, on-device AI for mobile browsers
Too many students and teachers I work with juggle fragmented tutorials, expensive SaaS, and unclear privacy trade-offs when they try to add an AI helper to a web project. If you want an assistant that answers instantly, keeps user data local, and works inside a mobile browser or PWA, you need a different architecture. In 2026 that architecture is achievable: modern NPUs, quantized LLM runtimes, WebAssembly/WebGPU, and real-world examples like Puma show how a secure local AI can live inside a mobile browsing experience.
Why local AI in a browser matters in 2026
Late 2025 and early 2026 accelerated three trends that make on-device assistants practical for mobile browsers and PWAs:
- Edge-capable silicon: modern phones (Apple, Qualcomm, MediaTek) ship with NN accelerators and better GPU compute for ML.
- Runtime maturity: libraries like llama.cpp, GGML, and WASM/WebGPU backends let trimmed, quantized LLMs run in constrained environments.
- User demand for privacy: users and institutions prefer assistants that don't send every page or selection to cloud APIs — a major selling point for browsers like Puma that offer local AI.
That combination means developers can now build assistants that are fast (low latency), private (no external requests by default), and integrated with browser UI — ideal for students, teachers, and lifelong learners who need reliable, offline-capable help.
High-level architecture: components you must design
Design the assistant as a set of modular layers you can evolve independently:
- Model runtime — lightweight LLM or instruction-tuned model running via native binary, WASM, or Core ML.
- Inference manager — queues, batching, and resource arbitration between UI and model to avoid freezes and battery spikes.
- Context manager — collects page content, selected text, metadata, and short-term memory for retrieval-augmented generation.
- Embedding & vector store — on-device index for recent browsing history and user data (encrypted at rest).
- Retrieval pipeline — rank candidate snippets and pass condensed context to the model to keep token use low.
- UI bridge — browser/PWA integration: floating affordances, selection menu, and accessibility hooks.
- Sync & update — optional encrypted cloud sync for user-controlled backups, model updates fetched via signed packages.
Why separation matters
Keep the model runtime and context/embeddings decoupled so you can swap models (smaller for offline, larger when plugged in) without changing the UI layers. This also helps with privacy and auditing: the context manager can filter or redact PII before embeddings are created.
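As an illustration of that boundary, here is a minimal redaction sketch for the context manager; the regex patterns and the redactForEmbedding helper are assumptions for illustration, not a complete PII scrubber.
// Hypothetical context-manager helper: redact obvious PII before any text
// reaches the embedding layer. The regexes are illustrative, not exhaustive.
const PII_PATTERNS = [
  { name: 'email', re: /[\w.+-]+@[\w-]+\.[\w.]+/g },
  { name: 'phone', re: /\+?\d[\d\s().-]{7,}\d/g },
];
function redactForEmbedding(rawText) {
  let text = rawText;
  for (const { name, re } of PII_PATTERNS) {
    text = text.replace(re, `[${name} redacted]`);
  }
  return text;
}
// The embedding layer only ever sees the redacted text:
// const emb = await embedder.embed(redactForEmbedding(pageText));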
Lesson from Puma: a pragmatic local-first model
Puma demonstrates an important product design point for developers: prioritize user control and transparency. Puma offers a local AI experience on iPhone and Android that you can use as a blueprint:
- Make local inference the default, with a clear toggle for cloud-only features.
- Offer model choices (tiny/fast vs. accurate/offline vs. cloud LLM) and explain the latency/privacy trade-offs.
- Keep requests visible: data never leaves the device unless the user explicitly opts in. See companion writeups on building internal assistants for developer teams for a similar design approach: From Claude Code to Cowork.
"Local-first" means the assistant should work offline for core features and fall back to cloud resources only when necessary and consented to.
Practical steps: build an on-device assistant for a mobile browser or PWA
Below is a step-by-step blueprint. I'll give concrete options for iOS, Android, and cross-platform PWAs, and include UX recommendations so the assistant actually gets used.
1) Choose the right model & runtime
For mobile, prefer quantized models in the 1B–7B parameter range for a good cost/latency trade-off. Practical runtime choices in 2026:
- Native — llama.cpp / GGML builds for ARM64 (Android NDK, iOS). Use Metal/Vulkan backends for speed.
- Core ML — convert models to .mlmodel for iOS and take advantage of Apple Neural Engine.
- WASM / WebGPU — for PWAs, run quantized models in-browser using WebAssembly and WebGPU (where available).
Actionable: start with a 2–3B parameter model quantized to 4-bit or 8-bit for testing. Benchmark with realistic prompts and measure tokens/sec on target devices, using platform profilers such as Android Systrace/Perfetto and Xcode Instruments.
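Below is a rough benchmarking sketch for that measurement. It assumes a hypothetical runtime wrapper whose generate() call streams tokens as an async iterable, regardless of whether the backend is a llama.cpp binding, a Core ML bridge, or a WASM build.
// Rough tokens/sec benchmark. `runtime` is a hypothetical wrapper around
// your chosen backend that yields tokens as an async iterable.
async function benchmark(runtime, prompts) {
  const results = [];
  for (const prompt of prompts) {
    const start = performance.now();
    let tokens = 0;
    for await (const _token of runtime.generate(prompt, { maxTokens: 128 })) {
      tokens += 1;
    }
    const seconds = (performance.now() - start) / 1000;
    results.push({ prompt: prompt.slice(0, 40), tokens, tps: tokens / seconds });
  }
  return results;
}
// Usage: console.table(await benchmark(runtime, realisticPrompts));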
2) Build a lightweight embedding & vector index (on-device)
Embeddings let you provide retrieved context without sending the full page every time. For on-device indexes:
- Use a compact embedding model (distilled sentence transformers or a tiny transformer converted to ONNX/Core ML).
- Persist vectors in an encrypted SQLite database and maintain an HNSW index built with a small library compiled for mobile (hnswlib or a trimmed C++ implementation).
- Limit index size by evicting old items and storing only summaries for long-term pages.
Actionable: implement a simple pipeline: extract text, create a 384-dimensional embedding, and insert it into SQLite alongside an HNSW index file. Encrypt the database with the platform keystore.
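The sketch below shows the shape of that pipeline. The embedder and encryptedDb objects are hypothetical wrappers for your embedding model and encrypted SQLite layer, and the brute-force cosine scan stands in for a real HNSW index.
// Minimal embedding pipeline. `embedder` and `encryptedDb` are hypothetical
// wrappers; the linear cosine scan below stands in for an HNSW index.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}
async function indexPage(embedder, encryptedDb, pageText, meta) {
  const vector = await embedder.embed(pageText);   // e.g. a 384-dim Float32Array
  await encryptedDb.insert({ vector: Array.from(vector), meta, text: pageText });
}
async function searchIndex(embedder, encryptedDb, query, k = 5) {
  const q = await embedder.embed(query);
  const rows = await encryptedDb.all();            // swap for HNSW search in production
  return rows
    .map((row) => ({ ...row, score: cosine(q, row.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}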
3) Build the retrieval & prompt assembly pipeline
To save tokens and latency, send the model a compact, ranked context:
- Collect signals: selected text, page title/URL, recent visits, user notes.
- Query the on-device vector index for top-k candidates.
- Summarize and stitch candidates into a short context block with source annotations.
- Send the assembled prompt to the model runtime with explicit system instructions to be concise and cite local sources.
Actionable: implement two-stage retrieval: retrieve five candidates, run a micro-summarizer (a 50–100M parameter model) to compress each into 1–2 sentences, then pass the compressed context to the primary assistant model.
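A sketch of that two-stage pipeline follows, assuming hypothetical searchIndex(query, k), microSummarizer, and assistantModel helpers (the index search could be the one sketched in step 2, bound to its embedder and store).
// Two-stage retrieval: fetch candidates, compress each with a micro-summarizer,
// then assemble a compact prompt. All helpers here are hypothetical wrappers.
async function answerWithContext(question, deps) {
  const { searchIndex, microSummarizer, assistantModel } = deps;
  // Stage 1: pull the top candidates from the on-device vector index.
  const candidates = await searchIndex(question, 5);
  // Stage 2: compress each candidate to 1-2 sentences to keep the token budget small.
  const summaries = await Promise.all(
    candidates.map(async (c) => ({
      source: c.meta.url,
      summary: await microSummarizer.summarize(c.text, { maxSentences: 2 }),
    }))
  );
  const context = summaries
    .map((s, i) => `[${i + 1}] (${s.source}) ${s.summary}`)
    .join('\n');
  const prompt =
    'You are a concise assistant. Answer using only the local sources below, ' +
    'and cite them by number.\n\n' + context + '\n\nQuestion: ' + question;
  return assistantModel.generate(prompt);
}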
4) Inference management: prevent freezes and battery spikes
Run heavy inference in background threads or native services. On mobile browsers and PWAs this means:
- For native apps: use background services/APIs with CPU/GPU affinity and adaptive throttling.
- For PWAs: use Web Workers, SharedArrayBuffer (where allowed), and fallback to server-side for long jobs only when the user grants permission.
- Implement prewarming and keep a small model resident for instant replies, escalating to a larger local model or cloud model if the user requests more depth.
Actionable: implement a lightweight LRU cache that keeps the token-embedding matrix and first model layers warm, and serialize assistant state across background/foreground transitions. For patterns and platform guidance on running services and prewarm behaviors, see developer notes on edge-first developer patterns.
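For a PWA, the worker side of that inference manager might look like the sketch below. loadModel and model.generate are placeholders for whatever WASM runtime you embed; the prewarm message and the cancellation flag are the parts that matter.
// worker.js — minimal inference worker with prewarm and cancellation.
// `loadModel` and `model.generate` are placeholders for your WASM runtime.
let model = null;
let cancelled = false;
self.onmessage = async ({ data }) => {
  if (data.type === 'prewarm') {
    model = model || (await loadModel(data.modelUrl)); // keep a small model resident
    self.postMessage({ type: 'ready' });
  } else if (data.type === 'generate') {
    cancelled = false;
    for await (const token of model.generate(data.prompt)) {
      if (cancelled) break;
      self.postMessage({ type: 'token', token });
    }
    self.postMessage({ type: 'done' });
  } else if (data.type === 'cancel') {
    cancelled = true;
  }
};
// Main thread:
// const worker = new Worker('worker.js');
// worker.postMessage({ type: 'prewarm', modelUrl: '/models/q4.bin' });
// worker.postMessage({ type: 'generate', prompt });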
5) UX & integration patterns for browsers & PWAs
Good UX is crucial — users won't adopt an assistant that's slow or intrusive. Use these patterns, many inspired by Puma's approach:
- Context-aware affordances — selection menu action (“Ask Assistant about selection”), address bar suggestions, and a floating FAB that opens a minimal assistant sheet.
- Page snapshot permissions — explicit permission to read page content; show what will be used and allow per-site granular controls.
- Progressive answers — stream partial results while the model completes longer reasoning tasks to give perceived instant responses.
- Explainability — provide source snippets and an option to view the retrieval chain so users and teachers can verify facts.
- Offline fallback — when offline, offer limited summarization and local search over cached content. If you want an offline-first reference, check the Pocket Zen Note field review for patterns that favor local-first UX.
Actionable: design a compact assistant sheet that shows 1) query, 2) concise answer, 3) sources (collapsible), 4) a feedback button to flag hallucinations.
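A streaming render path for that sheet might look like the following sketch, which assumes the worker protocol above and a hypothetical sheet component with query, answer, sources, and feedback slots.
// Stream partial tokens into the assistant sheet for perceived-instant replies.
// `sheet` is a hypothetical UI component; `request` is assembled by retrieval.
function streamAnswer(worker, request, sheet) {
  // request: { question, text, sources }
  sheet.showQuery(request.question);
  sheet.setAnswer(''); // filled token by token
  worker.onmessage = ({ data }) => {
    if (data.type === 'token') {
      sheet.appendAnswer(data.token);     // progressive rendering
    } else if (data.type === 'done') {
      sheet.showSources(request.sources); // collapsible source snippets
      sheet.enableFeedback();             // "flag hallucination" button
    }
  };
  worker.postMessage({ type: 'generate', prompt: request.text });
}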
6) Privacy and security: default-local, auditable, and minimal
Privacy is the main reason to go local. Build with these principles:
- Default local-only: only use cloud APIs after explicit user consent for specific features (e.g., large-model reasoning or syncing).
- Data minimization: only store embeddings and short-term caches; purge raw pages unless the user bookmarks or allows storage.
- Encryption at rest: platform keystore/Keychain to protect vectors and local history.
- Transparent UI: show when data is used, what model ran, and provide an audit log for the last N queries.
- Secure updates: sign model binaries and restrict dynamic model loading unless the package is verified.
Actionable: implement a per-site permission UI that records consent in an auditable ledger and includes a one-tap revoke feature. Consult consent and measurement playbooks for how to show and measure permission impact in the UX: consent operational playbook.
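One way to record that consent is sketched below, with localStorage standing in for encrypted, keystore-backed storage; the site/scope keys and ledger shape are assumptions for illustration.
// Minimal per-site consent ledger with an auditable history and one-tap revoke.
// localStorage stands in here for encrypted, keystore-backed storage.
const LEDGER_KEY = 'assistant-consent-ledger';
function readLedger() {
  return JSON.parse(localStorage.getItem(LEDGER_KEY) || '[]');
}
function recordConsent(site, scope, granted) {
  const ledger = readLedger();
  ledger.push({ site, scope, granted, at: new Date().toISOString() });
  localStorage.setItem(LEDGER_KEY, JSON.stringify(ledger));
}
function hasConsent(site, scope) {
  // The most recent entry for this site/scope wins, so revokes take effect instantly.
  const entries = readLedger().filter((e) => e.site === site && e.scope === scope);
  return entries.length > 0 && entries[entries.length - 1].granted;
}
// One-tap revoke is just another ledger entry:
// recordConsent('example.edu', 'read-page', false);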
7) Performance & token budget strategies
On-device resources are finite. Use:
- Retrieval-augmented prompts to keep token usage low.
- Progressive summarization to compress context before feeding it to the main model.
- Quantized checkpoints (4-bit or 8-bit) to reduce memory and inference time.
- Smart caching — cache previous answers per page; reuse embeddings across sessions.
Actionable: measure wall-clock latency for common flows and set thresholds for fallback to a cloud call: for example, if on-device inference exceeds 3 seconds for a short question, show a “slow — try cloud” prompt with an opt-in button.
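A sketch of that threshold check, assuming hypothetical generateLocally, offerCloudFallback, and callCloudModel hooks:
// Race local inference against a latency budget; if the budget is exceeded,
// keep the local job running but surface an opt-in cloud fallback.
const LATENCY_BUDGET_MS = 3000;
const TIMED_OUT = Symbol('timed-out');
async function answer(question) {
  const local = generateLocally(question); // on-device inference promise
  const timer = new Promise((resolve) =>
    setTimeout(() => resolve(TIMED_OUT), LATENCY_BUDGET_MS)
  );
  const first = await Promise.race([local, timer]);
  if (first !== TIMED_OUT) return first;   // local answer arrived within budget
  // Slow path: explicit, per-query opt-in to the cloud; otherwise keep waiting locally.
  const useCloud = await offerCloudFallback(question);
  return useCloud ? callCloudModel(question) : local;
}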
Architecture diagram (textual)
Think of the system as layered microservices inside the device:
UI Layer (PWA / Browser UI)
↕
Bridge / RPC (Web Worker, Native Module)
↕
Inference Manager (Queue, Prewarm) ↔ Model Runtime (quantized LLM)
↕
Retrieval Pipeline ↔ Embedding Store (encrypted SQLite + HNSW)
↕
Optional Sync (encrypted backups, signed models)
Developer checklist & starter tasks
Start small and iterate. Here's a 10-step checklist to move from prototype to production:
- Pick a quantized model (2–3B) and runtime (llama.cpp or WASM) and run a hello-world prompt locally on a target device.
- Implement a minimal UI: floating button + assistant sheet in your PWA or browser extension.
- Build a context extractor for web pages (DOM -> cleaned text) and a selection action hook.
- Create a tiny embedding pipeline and store vectors in encrypted SQLite.
- Implement retrieval + micro-summarizer to produce compact context blocks.
- Wire inference manager: background worker + prewarm + cancellation.
- Design permission screens; default local-only and log consents.
- Benchmark latency and memory; adjust to smaller models or more aggressive quantization as needed.
- Build privacy-preserving, on-device analytics: count usage and measure latency without storing or transmitting user content.
- Run user testing with students and teachers to refine UX (clarity, trust, edge cases).
Edge cases and where cloud helps
On-device assistants aren’t a silver bullet. There are times a cloud LLM is preferable:
- Broad knowledge queries that need up-to-date search across the live web.
- Heavy multimodal workloads (high-resolution image understanding) beyond local NPU capability.
- When a user explicitly requests higher-quality results and consents to cloud processing.
Design a clear hybrid flow: local-first with an explicit, reversible opt-in to cloud processing for individual queries.
Real-world example: quick PWA implementation sketch
Below is a minimal PWA flow using WASM for inference. This is pseudocode to translate into your stack.
// PWA flow (simplified). embedAPI, localIndex, microSummarizer, wasmRuntime,
// assemblePrompt, and renderStream are placeholders for your own wrappers.
// 1. User selects text and taps "Ask"
const selection = window.getSelection().toString().trim();
const userQuery = 'Explain the selected text'; // or read the typed question from the assistant sheet
const pageMeta = { title: document.title, url: location.href };
// 2. Build embeddings locally and index the selection
const emb = await embedAPI.embed(selection);
await localIndex.insert({ emb, meta: pageMeta, text: selection });
// 3. Retrieve top-k neighbors and compress them with the micro-summarizer
const candidates = await localIndex.search(emb, 5);
const summaries = await microSummarizer.summarize(candidates);
// 4. Assemble a compact prompt and stream tokens from the WASM runtime
const prompt = assemblePrompt({ question: userQuery, context: summaries });
const responseStream = wasmRuntime.generate(prompt);
renderStream(responseStream); // progressive rendering in the assistant sheet
Testing, monitoring, and governance
For educational audiences, add safeguards:
- Content filters for inappropriate material and a review mode for teachers.
- Auditable logs (locally stored) for student queries if required by policy — pair local auditability with operational decision plans for edge deployments: edge auditability.
- Automated tests for hallucination rates with canned prompts and ground-truth checks.
Actionable: add a feedback button that tags replies as “correct/incorrect” and stores the examples locally for retraining or heuristics.
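A minimal version of that feedback store is sketched below, with localStorage standing in for the assistant's encrypted local store.
// Store correct/incorrect feedback locally for later heuristics or retraining.
// localStorage is a stand-in for the assistant's encrypted storage layer.
function recordFeedback(query, answer, verdict) {
  const key = 'assistant-feedback';
  const items = JSON.parse(localStorage.getItem(key) || '[]');
  items.push({ query, answer, verdict, at: Date.now() }); // verdict: 'correct' | 'incorrect'
  localStorage.setItem(key, JSON.stringify(items));
}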
Future-proofing: trends to watch in 2026 and beyond
Watch these capabilities that will shape next-generation on-device assistants:
- WebNN & WebGPU maturation — faster in-browser inference for PWAs across platforms.
- Tighter OS support — Android and iOS will add richer NPU scheduling and sandboxing for local ML.
- Model distillation & retrieval improvements — smaller expert models or adapters that dramatically increase accuracy per-token.
- Verifiable computation — cryptographic proofs to attest that a computation happened on-device for privacy-sensitive workflows. See operational notes on edge auditability for design patterns.
Design your assistant so model and runtime layers are replaceable — that’s how you’ll adopt future advances without rewriting UI or privacy controls.
Final lessons from Puma and practical takeaways
- Make local-first choices obvious — users should know their data stays on-device unless they opt in.
- Provide model choices and transparent trade-offs — speed vs. depth vs. privacy.
- Keep the assistant lightweight and responsive — prewarm, quantize, summarize.
- Design UI to grant and revoke access per-site — that’s critical for adoption in schools and classrooms.
Call to action
If you’re building an on-device assistant for learning, start with a small, local-first prototype today: pick a 2–3B quantized model, implement a secure embedding index, and design a single permissioned interaction (selection -> "Ask") into your PWA or extension. Want a starter kit? Download our developer checklist and a lightweight PWA template (includes a wasm runtime integration example and encrypted SQLite index) to get your local assistant running on a test device in under a day. Join our community to share experiments and get feedback from students and teachers — privacy-first assistants will shape the next generation of web tools.
Related Reading
- From Claude Code to Cowork: Building an Internal Developer Desktop Assistant
- Edge Containers & Low-Latency Architectures for Cloud Testbeds
- Pocket Zen Note & Offline-First Routines for Field Creators
- EU Data Residency Rules and What Cloud Teams Must Change in 2026