ArXiv Paper Digest: Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application

2026-04-28

Authors: William Oliveira

Everyone talks about running AI locally on your phone: no cloud, no data leaks, no network round trips. This paper is a brutally honest field report from a developer who actually tried it, documenting what happens when you shove a small language model (SLM) into a real Android app and ship it.

The author integrated two small models, Gemma 4 E2B (2.6 billion parameters) and Qwen3 0.6B (600 million parameters), into Palabrita, a production word-guessing game on Android. Drawing on a five-day sprint of 204 commits, the paper catalogues the engineering pain points that benchmarks and demo videos conveniently skip over:

Model size vs. app size: Even "small" models are enormous by mobile app standards. A 2.6B-parameter model can balloon your APK to multiple gigabytes, pushing past app store limits and killing install rates.
Memory and thermal pressure: Running inference on-device competes with everything else the phone is doing. Models that run fine in a benchmark can cause thermal throttling, UI jank, or outright crashes on real mid-range hardware.
Latency that users actually feel: Token generation speeds that seem acceptable in a terminal are painfully slow when a user is waiting for a game hint. The gap between "technically works" and "feels good" is wide.
Quality cliffs: Smaller models save resources but produce noticeably worse output. The 600M-parameter Qwen3 struggled with tasks the 2.6B Gemma handled adequately, forcing constant prompt engineering trade-offs.
Toolchain immaturity: The ecosystem for on-device inference on Android — runtime libraries, quantization tools, model format converters — is fragmented and poorly documented compared to server-side deployment.
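
To make the size and latency points above concrete, here is a rough back-of-envelope sketch. It is not taken from the paper. The parameter counts are the ones quoted in this digest, while the bit widths, the 40-token hint length, and the 8 tokens-per-second decode rate are illustrative assumptions. The helper names `model_size_gb` and `generation_seconds` are hypothetical.

```python
def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores tokenizer files, format metadata, and runtime buffers,
    so real on-disk and in-memory footprints will be larger.
    """
    return num_params * bits_per_weight / 8 / 1e9


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to decode num_tokens at a steady rate.

    Ignores model load time and prompt processing, which add to the wait.
    """
    return num_tokens / tokens_per_second


# Parameter counts as quoted in the digest.
MODELS = {"Gemma 4 E2B": 2.6e9, "Qwen3 0.6B": 0.6e9}

if __name__ == "__main__":
    for name, params in MODELS.items():
        # Assumed precisions for illustration, not the paper's settings.
        for bits in (16, 8, 4):
            size = model_size_gb(params, bits)
            print(f"{name} at {bits}-bit: ~{size:.2f} GB of weights")

    # Assumed workload: a 40-token hint decoded at 8 tokens per second.
    wait = generation_seconds(40, 8.0)
    print(f"Hint latency: ~{wait:.1f} s")
```

Under these assumptions, even aggressive 4-bit quantization leaves the 2.6B model with well over a gigabyte of weights, and a short hint still takes several seconds to appear. That is the "technically works" versus "feels good" gap in numbers.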

The key insight is that on-device SLMs sit in an awkward middle ground: they're too large and resource-hungry to be invisible to users, yet too small to match the quality people now expect from cloud AI. The paper doesn't argue the technology is hopeless — it argues that the engineering community needs to be clear-eyed about the current gap between the promise and the reality, so that tooling, model compression, and hardware support can improve in the right places.

What makes this paper stand out is its format: a longitudinal practitioner case study rather than a benchmark comparison. It captures the kind of messy, real-world friction — build system issues, device-specific bugs, user experience compromises — that controlled experiments rarely surface.

Why it matters: As the industry rushes toward on-device AI, this paper provides a grounded, experience-driven catalogue of the engineering barriers that must be solved before local SLMs can deliver on their privacy and offline promises in production mobile apps.
