Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application

2026-04-28

Authors: William Oliveira

ArXiv: 2604.24636v1


Everyone talks about running AI locally on your phone — no cloud, no data leaks, no latency. This paper is a brutally honest field report from a developer who actually tried it, documenting what happens when you shove a small language model (SLM) into a real Android app and ship it.

The author integrated two small models, Gemma 4 E2B (2.6 billion parameters) and Qwen3 0.6B (600 million parameters), into Palabrita, a production word-guessing game on Android. Drawing on a five-day sprint of 204 commits, the paper catalogues the engineering pain points that benchmarks and demo videos conveniently skip over.

The key insight is that on-device SLMs sit in an awkward middle ground: they're too large and resource-hungry to be invisible to users, yet too small to match the quality people now expect from cloud AI. The paper doesn't argue the technology is hopeless — it argues that the engineering community needs to be clear-eyed about the current gap between the promise and the reality, so that tooling, model compression, and hardware support can improve in the right places.

What makes this paper stand out is its format: a longitudinal practitioner case study rather than a benchmark comparison. It captures the kind of messy, real-world friction — build system issues, device-specific bugs, user experience compromises — that controlled experiments rarely surface.

Why it matters: As the industry rushes toward on-device AI, this paper provides a grounded, experience-driven catalogue of the engineering barriers that must be overcome before local SLMs can deliver on their privacy and offline promises in production mobile apps.
