Inferact Raises $150M to Power AI Inference with vLLM
What happens when one of the most widely used open source tools for running AI models becomes a company? The answer is Inferact, a new startup founded by the original creators of vLLM that just raised $150 million in seed funding at a staggering $800 million valuation. With backing from top-tier investors like Andreessen Horowitz and Lightspeed Venture Partners, Inferact is poised to reshape how businesses deploy large language models (LLMs) in production: faster, cheaper, and more efficiently.
As AI shifts from flashy demos to real-world applications, the bottleneck isn’t training anymore—it’s inference: the process of actually using trained models to generate responses, analyze data, or power user-facing features. That’s where vLLM, and now Inferact, come in.
Why Inference Is the New Battleground in AI
For years, headlines focused on who could train the biggest model. But in 2026, the industry’s attention has pivoted decisively toward efficient inference—the moment an AI model delivers value to end users. Training might be expensive, but inference happens millions of times a day across apps, websites, and enterprise systems. If it’s slow or costly, adoption stalls.
Enter vLLM, an open source project launched in 2023 from UC Berkeley’s Sky Computing Lab, a research group that Databricks co-founder Ion Stoica helped establish. Designed to dramatically speed up LLM inference while slashing cloud costs, vLLM quickly became a favorite among developers at companies ranging from Amazon Web Services to major consumer apps. Its secret? A technique called PagedAttention, which manages the GPU memory holding a model’s attention cache the way an operating system manages virtual memory, in fixed-size pages, allowing the same hardware to serve far more concurrent requests.
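To make the virtual-memory analogy concrete, here is a minimal Python sketch of the block-table idea behind PagedAttention. Everything in it, the class names, the block size, the allocator, is illustrative rather than vLLM’s actual internals:

```python
# Illustrative sketch (not vLLM internals): the KV cache is carved into
# fixed-size physical blocks, and each sequence keeps a "block table"
# mapping its logical token positions to physical blocks -- the same
# trick page tables use for virtual memory.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out physical block IDs from a shared free pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()  # any free block will do; no contiguity needed

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new physical block only when the current one fills up,
        # so memory is never reserved for tokens that don't exist yet.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):  # generate 40 tokens
    seq.append_token()
print(seq.block_table)  # 3 blocks are enough to cover 40 tokens
```

Because blocks are fixed-size and need not be contiguous, blocks freed by finished requests are immediately reusable by new ones, which is where the utilization gains come from.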
Now, that same team is commercializing their breakthrough under the name Inferact, signaling a broader trend: the rise of infrastructure startups built around open source AI tooling.
From Open Source Project to $800M-Valued Startup
Inferact’s leap from academic lab to venture-backed company mirrors a growing playbook in the AI ecosystem. Just days before its announcement, another Berkeley-born project, SGLang, spun out as RadixArk with a $400 million valuation. Both projects emerged from the same research environment and share a mission: making LLMs practical for everyday use.
But Inferact stands out for its traction. Even before incorporation, vLLM was already integrated into production systems at scale. According to CEO Simon Mo—one of vLLM’s original creators—the tool is actively used by AWS and a major shopping app (believed to be Temu or Shein, though not officially named). That real-world validation gave investors confidence to back the team early and aggressively.
The $150 million seed round, co-led by a16z and Lightspeed, is unusually large for a pre-product company—but it reflects the urgency around inference optimization. “We’re not building another model,” Mo told Bloomberg. “We’re building the rails that let every model run better.”
How vLLM Changes the Economics of AI Deployment
Running LLMs in production is notoriously expensive. A single query can tie up significant GPU memory, and inefficient systems often leave hardware underutilized. vLLM tackles this with continuous batching, which lets new requests join a running batch as earlier ones finish rather than waiting for an entire batch to drain, and memory-efficient attention, enabling servers to achieve up to 24x higher throughput than standard frameworks like Hugging Face Transformers.
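For a sense of what this looks like from the developer’s side, here is a minimal offline-inference script using the open source library; the model name is only a placeholder, and the continuous batching happens automatically inside the engine:

```python
from vllm import LLM, SamplingParams

# Load a Hugging Face-format model (the name here is just a placeholder).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# The engine batches these continuously under the hood: new requests
# join the running batch as earlier ones complete.
outputs = llm.generate(
    [
        "Explain PagedAttention in one sentence.",
        "Why has AI's bottleneck shifted from training to inference?",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```

The same engine also powers an OpenAI-compatible HTTP server, which is how most production deployments consume it.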
For enterprises, that translates directly into cost savings. One internal benchmark showed vLLM reducing inference latency by 70% while cutting cloud spend by over half. In an era where CTOs are under pressure to justify AI budgets, such efficiency isn’t just nice-to-have—it’s essential.
Inferact plans to build on this foundation with a managed service that offers monitoring, auto-scaling, and enterprise-grade support—features absent in the open source version. The goal? To become the default inference layer for any company deploying LLMs, whether they’re fine-tuning open models or using proprietary APIs.
The Berkeley AI Pipeline: Where Research Meets Revenue
It’s no accident that both Inferact and RadixArk trace their roots to UC Berkeley’s Sky Lab. Under Ion Stoica—a serial entrepreneur who co-founded Databricks and Anyscale—the lab has become a launchpad for AI infrastructure startups that bridge academic innovation and commercial need.
Unlike pure research labs, Sky Lab encourages rapid prototyping and open source release, creating immediate feedback loops with developers. vLLM gained over 20,000 GitHub stars within a year, a rare feat for a low-level systems project. That community adoption de-risked the technology long before venture capital arrived.
This model—open source first, company second—is proving powerful in 2026. It builds trust, validates demand, and attracts talent who already know the codebase. For Inferact, it means hitting the ground running with a product that’s battle-tested and widely understood.
What’s Next for Inferact—and the Inference Ecosystem
With $150 million in the bank, Inferact isn’t just maintaining vLLM—it’s expanding it. The team plans to add support for multimodal models, real-time streaming responses, and tighter integrations with vector databases and orchestration platforms. They’re also hiring aggressively, particularly in systems engineering and developer experience.
Meanwhile, the inference market is heating up. Competitors like NVIDIA’s TensorRT-LLM and llama.cpp offer alternative approaches, but vLLM’s broad model support and Python-friendly interface give it wide appeal. Crucially, Inferact isn’t locking users into a walled garden; the core remains open source, aligning with developer expectations in 2026.
As AI moves beyond chatbots into customer service, legal analysis, coding assistants, and real-time decision engines, the need for reliable, scalable inference will only grow. Inferact may have started as a research project—but it’s now positioned at the heart of AI’s next phase.
Why This Matters for Developers and Businesses
If you’re building with LLMs today, inference performance directly impacts your user experience and bottom line. Slow responses frustrate customers; high cloud bills scare off executives. Tools like vLLM aren’t just technical optimizations—they’re enablers of viable AI products.
For developers, Inferact’s emergence means continued investment in the tools they already rely on. For businesses, it signals that the AI stack is maturing: instead of reinventing the wheel, they can plug into optimized, supported infrastructure.
And for the broader ecosystem, Inferact’s success reinforces a key truth in 2026: the future of AI isn’t just about who has the best model—it’s about who can run it best. With $150 million and a proven foundation, Inferact is betting everything on that vision. So far, the market agrees.