
Guide to Running Gemma 3n on Your Mobile Device

Experience advanced AI assistance anytime, anywhere: detailed instructions for running Gemma 3n on your phone.


Gemma 3n, a family of AI language models developed by Google DeepMind, is designed specifically for efficient execution on low-resource devices such as mobile phones. The models offer a range of benefits for mobile users, including natural language understanding in emergency scenarios, multimodal input support, and low-latency inference.

Key Features of Gemma 3n

Gemma 3n models feature multimodal input support (text, image, video, audio), quantized integer-4 weights with float activations to reduce model size and improve speed, and the use of the LiteRT-LM inference backend optimized for CPU and GPU acceleration on mobile hardware.
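
To see why integer-4 weights matter on a phone, here is a rough back-of-envelope size estimate. This is a sketch only: it assumes a hypothetical 4B-parameter variant and ignores embeddings, activations, and runtime overhead.

```kotlin
// Back-of-envelope estimate of weight storage for a given bit width.
fun weightsSizeGiB(params: Double, bitsPerWeight: Int): Double =
    params * bitsPerWeight / 8.0 / (1 shl 30)

fun main() {
    val params = 4e9 // hypothetical ~4B-parameter variant
    println("fp16 weights: %.1f GiB".format(weightsSizeGiB(params, 16))) // ~7.5 GiB
    println("int4 weights: %.1f GiB".format(weightsSizeGiB(params, 4)))  // ~1.9 GiB
}
```

Going from 16-bit to 4-bit weights cuts the model's footprint by roughly 4x, which is what makes a multi-billion-parameter model plausible within a phone's RAM budget.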

The Gemma 3n variants documented here sit at the 2B to 4B parameter scale and support context lengths of up to 32K tokens. They run efficiently on devices such as the Samsung S24 Ultra, using either the CPU (4 threads with the XNNPACK delegate) or the GPU (assuming the model is already cached).

Performance and Benchmarks

Benchmarked on the Samsung S24 Ultra, Gemma 3n achieves a prefill speed of approximately 110 tokens per second on the CPU and 816 tokens per second on the GPU. Decode speeds are around 16 tokens per second on the CPU and 15.6 tokens per second on the GPU. On a 2023 MacBook Pro (M3) CPU, prefill runs at approximately 232 tokens per second and decode at 27.6 tokens per second.
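
To put those numbers in user terms: prefill throughput governs how quickly the prompt is ingested, and decode throughput governs how quickly the reply appears. A quick illustration using the S24 Ultra CPU figures above (the prompt and reply lengths are arbitrary examples):

```kotlin
// Rough end-to-end latency from prefill/decode throughput (tokens per second).
fun responseSeconds(promptTokens: Int, outputTokens: Int,
                    prefillTps: Double, decodeTps: Double): Double =
    promptTokens / prefillTps + outputTokens / decodeTps

fun main() {
    // S24 Ultra CPU figures quoted above: 110 tok/s prefill, 16 tok/s decode.
    val t = responseSeconds(promptTokens = 500, outputTokens = 100,
                            prefillTps = 110.0, decodeTps = 16.0)
    println("~%.1f s for a 500-token prompt and a 100-token reply".format(t))
    // ≈ 500/110 + 100/16 ≈ 4.5 + 6.3 ≈ 10.8 seconds
}
```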

Gemma 3n outperforms similarly sized open models, especially on multilingual and reasoning tasks. For instance, one medical benchmark showed a 36% improvement in accuracy after fine-tuning on medical data, and multimodal diagnostic accuracy reached approximately 35.5% on specific vision-language tasks.

Benefits on Mobile

Gemma 3n's efficient quantized model makes it suitable for offline use, reducing dependency on internet connectivity. This is particularly valuable in emergency scenarios where connectivity may be unavailable. Additionally, multimodal input support enables richer applications such as medical assistants or emergency-response tools.

Running Gemma 3n on Mobile

To run Gemma 3n on a mobile device, you'll need a device with sufficient RAM and a modern CPU/GPU, such as the Samsung S24 Ultra or an equivalent. You'll also need to install the LiteRT runtime with the XNNPACK delegate enabled, obtain the appropriate quantized Gemma 3n model files, and set up a developer environment for mobile inference.
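
On Android, one common way to get a LiteRT-backed runtime without hand-rolling the integration is Google's MediaPipe LLM Inference API. A minimal Gradle setup might look like the following; the artifact version is illustrative, so check the current MediaPipe release before using it:

```kotlin
// app/build.gradle.kts — pull in the MediaPipe GenAI tasks library,
// which bundles the LiteRT-backed LLM inference runtime.
dependencies {
    implementation("com.google.mediapipe:tasks-genai:0.10.14") // version is illustrative
}
```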

Running Gemma 3n on mobile involves the following steps, illustrated in the sketch after this list:

  1. Obtain the quantized Gemma 3n model files.
  2. Set up the inference backend.
  3. Integrate the model into your mobile app.
  4. Initialise the model runtime.
  5. Run inference.
  6. Optimise for latency.
  7. Deploy the app for offline use.
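
As a concrete sketch of steps 4 and 5, the following Kotlin code initialises the runtime and runs a single inference via the MediaPipe LLM Inference API. The model path is a placeholder for wherever your app stores the downloaded model, and the sampling parameters are examples rather than recommendations:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Initialise the on-device runtime once (it is expensive) and reuse it.
fun createEngine(context: Context): LlmInference {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma-3n.task") // placeholder path
        .setMaxTokens(1024)  // combined prompt + response budget
        .setTopK(40)         // example sampling settings, not recommendations
        .setTemperature(0.8f)
        .build()
    return LlmInference.createFromOptions(context, options)
}

// Blocking single-shot inference; call from a background thread, not the UI thread.
fun answer(engine: LlmInference, prompt: String): String =
    engine.generateResponse(prompt)
```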

Possible Uses of Gemma 3n on Mobile

Gemma 3n can be used for various purposes on mobile devices, such as an offline medical emergency assistant, multilingual conversational AI or translation tools, on-device assistive AI for context-aware multimedia understanding, and general-purpose lightweight generative language models for productivity apps, chatbots, or personal assistants.
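
For chatbot-style uses in particular, perceived latency improves if partial results stream to the UI as tokens are decoded rather than arriving all at once. Here is a sketch of that pattern with the same API; treat the listener wiring as an assumption to verify against the current MediaPipe documentation:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Stream partial results to the UI as tokens are decoded, instead of
// blocking until the full response is ready.
fun streamAnswer(context: Context, prompt: String) {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma-3n.task") // placeholder path
        .setResultListener { partialResult, done ->
            print(partialResult)  // stand-in for appending to the chat UI
            if (done) println()   // response complete
        }
        .build()
    val engine = LlmInference.createFromOptions(context, options)
    engine.generateResponseAsync(prompt) // returns immediately; listener fires as tokens arrive
}
```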

The Future of AI with Soumil Jain

Soumil Jain, a Data Scientist, AWS Certified Solutions Architect, and AI & ML Innovator, specialises in Machine Learning, Deep Learning, and AI-driven solutions. His work spans Generative AI, Anomaly Detection, Fake News Detection, and Emotion Recognition. With his expertise, he is passionate about shaping the future of AI.

With the introduction of Gemma 3n, users can streamline activities, spark new insights, and build connections, all without an internet connection. Whether you're a casual user, a busy professional, or a developer, Gemma 3n offers opportunities to explore and personalise technology.

  1. The artificial intelligence model, Gemma 3n, developed by Google DeepMind, is designed for efficient execution on low-resource devices like smartphones, providing natural language understanding in emergency scenarios, multimodal input support, and low-latency inference.
  2. Gemma 3n models are optimized for CPU and GPU acceleration on mobile hardware, featuring multimodal input support and quantized integer-4 weights with float activations to reduce model size and improve speed.
  3. Benchmarked on a Samsung S24 Ultra, Gemma 3n achieves a prefill speed of 816 tokens per second on the GPU, outperforming similarly-sized open models, especially for multilingual and reasoning tasks.
  4. Gemma 3n's efficient quantized model makes it suitable for offline use, reducing dependency on internet connectivity and enabling applications like offline medical emergency assistants and multilingual conversational AI.
  5. Soumil Jain, a Data Scientist specialising in Machine Learning, Deep Learning, and AI-driven solutions, is passionate about shaping the future of AI with innovations like Gemma 3n, which lets casual users, busy professionals, and developers alike streamline activities, spark new insights, and build connections, all without an internet connection.
