• Business
  • Entertainment
  • Health
  • News
  • Sport
  • Tech
  • World
Newsy Today
news of today
Home - Gemma
Tag:

Gemma

Tech

Gemma 2 12B: Enabling On-Device Multimodal Agentic Workflows

by Chief Editor June 8, 2026
written by Chief Editor

Gemma 4 12B is a new, encoder-free multimodal model designed to run agentic, intelligent workflows directly on local laptops. By eliminating traditional multi-stage vision and audio encoders, the model allows for faster, more efficient processing of multimodal inputs on consumer-grade hardware, according to technical documentation released in June 2026.

How does the encoder-free architecture improve local AI?

Traditional multimodal models rely on separate, heavy encoders for vision and audio, which creates latency and increases memory usage. Gemma 4 12B bypasses this by feeding multimodal data directly into the LLM backbone, using a single decoder-only transformer. This architecture mirrors the advanced structure found in the 31B Dense model, enabling a reduced memory footprint that fits on devices with 16GB of VRAM or unified memory.

How does the encoder-free architecture improve local AI?

The system handles visual data by using a 35M-parameter vision embedder that projects 48×48 pixel patches directly into the LLM’s hidden space. For audio, the model skips separate encoders entirely. Instead, it slices 16 kHz audio into 40 ms frames and projects them linearly into the input space, a shift that simplifies fine-tuning processes for developers.

What can you build with Gemma 4 12B?

Developers are using the model to execute scripts and generate code on the fly. Through the Google AI Edge Gallery app, users can turn natural language instructions into functional programs. One demonstration showcased the model creating a Python script to render a PNG chart comparing girl names from 2024 and 2025.

The model’s utility extends to various developer environments. It supports integration with tools like LiteRT-LM, which allows for the launch of OpenAI-compatible servers using the litert-lm serve command. It is also compatible with llama.cpp, Hugging Face, Ollama, and LM Studio, providing flexibility for local deployment.

Pro Tip: If you are looking for a deep technical analysis of the model’s structure, Maarten Grootendorst has published a detailed visual guide exploring the architecture and implementation of Gemma 4 12B.

What are users saying about performance?

Early feedback from the developer community on Reddit highlights a mix of excitement and practical testing. User LoveMind_AI noted that the encoder-free design is a significant development for local models, specifically praising the inclusion of native audio. Another user, few, reported success using the model to build a full-stack Python application with a server and client side, noting the model’s effective handling of long-context tasks.

View this post on Instagram about Hugging Face, Google Cloud
From Instagram — related to Hugging Face, Google Cloud

However, performance expectations vary by task. User triynizzles suggested that while the model excels at explaining code paths and fixing logic bugs, it may struggle with more ambiguous, complex tasks compared to larger models like Qwen 3.6. These real-world accounts suggest that while the 12B model is a powerful tool for localized agentic workflows, its output quality remains task-dependent.

Frequently Asked Questions

  • Does Gemma 4 12B require a high-end server? No. It is designed to run locally on laptops equipped with 16GB of VRAM or unified memory.
  • Can it process audio natively? Yes. It is the first medium-sized model in the Gemma family capable of native audio ingestion without a separate encoder.
  • Where can I download the model? It is available through platforms including Hugging Face, Ollama, LM Studio, and Google Cloud.

Ready to start building?

Explore the Google AI Edge Gallery to see how you can deploy these workflows on your own machine. Have you experimented with Gemma 4 12B yet? Share your findings or questions in the comments below.

NEW Google Gemma 4 12B AI Update 🤯

June 8, 2026 0 comments
0 FacebookTwitterPinterestEmail
Tech

Google LiteRT-LM Boosts Gemma 4 Inference Speed by 2.2x

by Chief Editor June 5, 2026
written by Chief Editor

The Future of On-Device AI: Why LiteRT-LM Changes Everything

For years, the promise of Artificial Intelligence has been shackled to the cloud. We’ve relied on massive server farms to process even the simplest queries, sacrificing privacy and speed for the sake of model size. However, the release of LiteRT-LM—the evolution of TensorFlow Lite—marks a definitive shift toward a “local-first” AI future.

By bringing native support for Gemma 4 Multi-Token Prediction (MTP) directly to mobile and edge hardware, developers can now achieve inference speeds up to 2.2x faster than previous iterations. This isn’t just an incremental update; it’s a fundamental rethinking of how Large Language Models (LLMs) interact with our devices.

Pro Tip: If you’re building mobile AI applications, prioritize hardware-accelerated kernels like XNNPACK. By keeping your KV cache and activations on the GPU, you can eliminate the latency bottlenecks caused by cross-IP data transfers.

Breaking the Latency Barrier with Speculative Decoding

The biggest hurdle for on-device LLMs has always been the “stutter”—the delay between a prompt and the generated output. LiteRT-LM tackles this through a specialized orchestration layer that enforces memory locality. By running both the primary model and the MTP drafter on the same hardware IP, the system avoids the costly penalties of moving data back and forth.

According to recent benchmarks, this architecture delivers remarkable performance gains:

  • Gemma 4 E2B: 1.6x faster decoding.
  • Gemma 4 E4B: 2.2x faster decoding.
  • Competitive Edge: 1.8x to 3.7x faster performance compared to frameworks like llama.cpp and ONNX.

Efficiency as a Competitive Advantage

High performance is meaningless if it drains your battery or hogs all your RAM. LiteRT-LM addresses this by treating memory efficiency as a first-class citizen. By dynamically loading image and audio encoders only when they are needed and keeping per-layer embeddings out of memory, the runtime remains incredibly lean.

Consider this: a ~2.58GB model can now function with a footprint of just 607MB on Apple mobile CPUs. This level of optimization ensures that sophisticated, agentic AI can run in the background without impacting the user’s ability to run other apps.

Did you know? LiteRT-LM allows for “Thinking Mode” and native function-calling. This means your phone’s AI can pause, handle a structured tool request, and resume execution seamlessly—bringing us one step closer to truly autonomous, helpful digital agents.

The Road Ahead: Agentic Capabilities and Beyond

The future of on-device AI isn’t just about faster text generation; it’s about agentic workflows. With native support for constrained decoding and function-calling, LiteRT-LM is paving the way for apps that can proactively manage tasks. Imagine a device that manages your calendar, processes sensitive financial data locally, and interacts with other apps—all without sending a single byte of data to a central server.

Gemma 4 12B – Google's Unified Multimodal Model Running Locally

As the framework expands its reach to Swift and JavaScript APIs, the barrier to entry for developers is falling. Whether you are working on Android, iOS, or web-based projects, the tools to build high-performance, private AI are now readily available on GitHub.

Frequently Asked Questions (FAQ)

What is the primary benefit of LiteRT-LM for mobile developers?

LiteRT-LM provides a highly optimized runtime that enables native support for Gemma 4, allowing for significantly faster inference speeds (up to 2.2x) and a reduced memory footprint on mobile devices.

Frequently Asked Questions (FAQ)
Token Prediction

Does LiteRT-LM require a cloud connection?

No. LiteRT-LM is designed specifically for on-device inference, allowing models to run locally on your hardware. This improves user privacy and ensures functionality even without an internet connection.

How does LiteRT-LM handle multi-token prediction?

It uses speculative decoding, where a lightweight “drafter” model predicts future tokens. These are verified by the primary model in a single pass, which significantly reduces the data movement between VRAM and compute units.

Can I use LiteRT-LM for complex agentic tasks?

Yes. The framework includes native support for function-calling and “Thinking Mode,” which allows models to handle structured outputs and pause/resume execution for tool-based interactions.


Are you experimenting with on-device LLMs? Share your experience with LiteRT-LM in the comments below, or subscribe to our newsletter for deep dives into the latest edge computing trends.

June 5, 2026 0 comments
0 FacebookTwitterPinterestEmail

Recent Posts

  • John Constable’s Cello to be Played Publicly After 100 Years

    June 9, 2026
  • Natalie Portman Turns 45: Love, Parenthood & Aging Gracefully

    June 9, 2026
  • Captain Assane Sarr to Diomaye Faye: “This Trophy Is Only the Beginning, Senegal Aims Higher

    June 9, 2026
  • Apple Users Win Again: The Latest Tech Update

    June 9, 2026
  • Neal Shipley and Vaughn Harber Qualify for U.S. Open

    June 9, 2026

Popular Posts

  • 1

    Maya Jama flaunts her taut midriff in a white crop top and denim jeans during holiday as she shares New York pub crawl story

    April 5, 2025
  • 2

    Saar-Unternehmen hoffen auf tiefgreifende Reformen

    March 26, 2025
  • 3

    Marta Daddato: vita e racconti tra YouTube e podcast

    April 7, 2025
  • 4

    Unlocking Success: Why the FPÖ Could Outperform Projections and Transform Austria’s Political Landscape

    April 26, 2025
  • 5

    Mecimapro Apologizes for DAY6 Concert Chaos: Understanding the Controversy

    May 6, 2025

Follow Me

Follow Me
  • Cookie Policy
  • CORRECTIONS POLICY
  • PRIVACY POLICY
  • TERMS OF SERVICE

Hosted by Byohosting – Most Recommended Web Hosting – for complains, abuse, advertising contact: o f f i c e @byohosting.com


Back To Top
Newsy Today
  • Business
  • Entertainment
  • Health
  • News
  • Sport
  • Tech
  • World