Idle GPUs: The New Goldmine for AI Infrastructure
Every GPU cluster experiences downtime. Training jobs complete, workloads shift, and powerful hardware sits unused even as electricity and cooling costs continue to accrue. For neocloud operators, these idle cycles represent lost revenue. Now, a new platform called InferenceSense, developed by FriendliAI, aims to turn that downtime into revenue by putting idle GPUs to work serving AI inference.
From Continuous Batching to InferenceSense: A Revolution in GPU Utilization
The core technology behind InferenceSense stems from research led by Byung-Gon Chun, founder of FriendliAI. Chun’s work on Orca introduced “continuous batching,” a technique that dynamically processes inference requests instead of waiting for fixed batches. This innovation became foundational to vLLM, the widely used open-source inference engine.
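The idea behind continuous batching can be illustrated with a toy scheduler. This is a minimal sketch of iteration-level scheduling, not Orca's or vLLM's actual implementation: requests join and leave the running batch at every decode step, so a short request never waits behind a long one.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: str
    tokens_left: int   # decode steps remaining for this request
    generated: int = 0

def continuous_batching(requests, max_batch=4):
    """Iteration-level scheduling: after every decode step, finished
    requests leave the batch and waiting requests join immediately,
    instead of waiting for the whole fixed batch to drain."""
    queue = deque(requests)
    active, completed, steps = [], [], 0
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode iteration: every active request emits one token.
        for r in active:
            r.generated += 1
            r.tokens_left -= 1
        steps += 1
        # Retire finished requests at the iteration boundary.
        for r in [r for r in active if r.tokens_left == 0]:
            active.remove(r)
            completed.append(r.rid)
    return completed, steps

reqs = [Request("a", 3), Request("b", 1), Request("c", 5),
        Request("d", 2), Request("e", 2)]
order, steps = continuous_batching(reqs, max_batch=2)
```

With 13 total tokens and a batch width of 2, the toy scheduler finishes in 7 iterations, the theoretical minimum; a fixed-batch scheduler would idle slots whenever a short request finished early.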
FriendliAI’s approach differs from simply renting out spare GPU capacity through spot markets. While spot instances allow cloud vendors to monetize unused hardware, they still require users to build and manage their own inference stack. InferenceSense runs inference directly on the idle hardware, optimizing for token throughput and splitting the revenue with the operator. It’s akin to Google AdSense for GPU cycles.
How InferenceSense Works: A Seamless Integration
InferenceSense integrates with existing Kubernetes infrastructure, which is commonly used by neocloud operators for resource orchestration. Operators allocate a pool of GPUs to a Kubernetes cluster managed by FriendliAI, specifying conditions for reclaiming those resources. When GPUs are idle, InferenceSense spins up isolated containers to serve paid inference workloads, supporting open-weight models like DeepSeek, Qwen, Kimi, GLM, and MiniMax.
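In Kubernetes terms, one plausible way to express this arrangement is a low PriorityClass for the scavenged inference pods, so the operator's own jobs can always preempt them. This sketch is purely illustrative; the resource names, labels, and image are hypothetical, not FriendliAI's actual configuration.

```yaml
# Hypothetical sketch — names and values are illustrative only.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: scavenged-inference      # low value: evicted first
value: 1000
preemptionPolicy: Never          # scavenged pods never preempt others
description: "Opportunistic inference on idle GPUs"
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
  labels:
    app: idle-inference
spec:
  priorityClassName: scavenged-inference
  containers:
    - name: serving
      image: example.com/inference-engine:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Under a scheme like this, any higher-priority pod the operator schedules can reclaim the GPU, which matches the fast-preemption behavior the platform describes.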
The system prioritizes the operator’s jobs; when a scheduler needs hardware, inference workloads are preempted within seconds. Demand is aggregated through FriendliAI’s clients and partners like OpenRouter. The operator provides the capacity, while FriendliAI handles demand, model optimization, and the serving stack. There are no upfront fees or minimum commitments, and operators receive a real-time dashboard tracking model usage, tokens processed, and revenue earned.
Token Throughput: The Key to Maximizing Revenue
The focus on token throughput, rather than raw capacity rental, is a crucial differentiator. Spot markets monetize GPU time, while InferenceSense monetizes the actual work done: the tokens processed. FriendliAI claims its engine delivers two to three times the throughput of a standard vLLM deployment, though performance varies depending on the workload.
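A back-of-the-envelope calculation shows why throughput, not just GPU-hours, drives the economics. Every number below is a hypothetical assumption for illustration, not a published figure from FriendliAI:

```python
def monthly_revenue(gpus, idle_frac, tok_per_sec_per_gpu,
                    usd_per_million_tokens, revenue_share):
    """Back-of-the-envelope operator revenue from idle GPU capacity.
    All inputs are illustrative assumptions, not published figures."""
    seconds = 30 * 24 * 3600                      # ~one month
    tokens = gpus * idle_frac * tok_per_sec_per_gpu * seconds
    gross = tokens / 1e6 * usd_per_million_tokens
    return gross * revenue_share                  # operator's cut

# 100 GPUs idle 30% of the time, 2,000 tokens/s each,
# $0.50 per million tokens, a 50/50 split — all hypothetical.
rev = monthly_revenue(100, 0.30, 2000, 0.50, 0.5)
```

The key observation: doubling per-GPU throughput doubles revenue at the same hardware and power cost, which is why a 2–3× faster engine matters more here than in a time-rental model.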
This performance boost comes from a purpose-built inference engine written in C++ that relies on its own GPU kernels rather than Nvidia’s cuDNN library. FriendliAI has also developed its own model representation layer and implementations of techniques like speculative decoding, quantization, and KV-cache management.
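Speculative decoding, one of the techniques named above, can be sketched as follows. This greedy accept/reject loop and the toy "models" are an illustrative simplification, not FriendliAI's implementation: a cheap draft model proposes several tokens at once, and the expensive target model verifies them in a single pass, accepting the longest agreeing prefix.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    the target verifies them, and the longest agreeing prefix is
    accepted along with the target's first correction."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft proposes k tokens autoregressively (cheap model).
        proposal, ctx = [], out[:]
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target scores all k positions in one verification pass.
        accepted, ctx = [], out[:]
        for t in proposal:
            want = target(ctx)
            if t == want:
                accepted.append(t)   # draft guessed right: keep it
                ctx.append(t)
            else:
                accepted.append(want)  # target's correction; stop here
                break
        out.extend(accepted)
    return out[len(prompt):][:max_new]

# Toy "models": the next token is a function of the last token.
target = lambda ctx: (ctx[-1] + 1) % 10
draft  = lambda ctx: (ctx[-1] + 1) % 10 if ctx[-1] != 5 else 0  # errs at 5

tokens = speculative_decode(target, draft, [3], k=4, max_new=8)
```

When the draft agrees with the target, several tokens are produced per expensive verification pass instead of one, which is where the speedup comes from; a mismatch costs nothing beyond falling back to the target's own token.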
The Impact on AI Infrastructure Costs
Traditionally, AI engineers have chosen between neoclouds and hyperscalers based on price and availability. InferenceSense introduces a new factor: the economic incentive for neoclouds to maintain competitive token prices by monetizing idle capacity.
While it’s still early days, this could lead to downward pressure on API pricing for models like DeepSeek and Qwen over the next year. As Chun explains, “When we have more efficient suppliers, the overall cost will go down. With InferenceSense we can contribute to making those models cheaper.”
Pro Tip:
When evaluating inference costs, consider the potential benefits of neoclouds utilizing platforms like InferenceSense. The ability to monetize idle capacity could translate to lower prices for end users.
FAQ
Q: What is continuous batching?
A: Continuous batching is a technique for processing inference requests dynamically, rather than waiting to fill a fixed batch, improving efficiency.
Q: What is a token in the context of AI inference?
A: A token is a unit of text used by large language models. The cost of inference is often measured in tokens processed.
Q: Does InferenceSense require any upfront investment?
A: No, there are no upfront fees or minimum commitments required to use InferenceSense.
Q: How does InferenceSense handle priority jobs?
A: The operator’s jobs always take priority. InferenceSense preempts inference workloads when the scheduler needs to reclaim GPUs.
Q: What types of models does InferenceSense support?
A: InferenceSense supports a wide range of open-weight models, including DeepSeek, Qwen, Kimi, GLM, and MiniMax.
