ElastixAI Emerges From Stealth With FPGA Approach to Gen AI Supercomputing
Today, ElastixAI, a Seattle-based AI hardware startup founded by former Apple and Meta machine learning engineers, emerged from stealth to launch an FPGA-based inference platform it claims delivers up to 50 times lower total cost of ownership and 80% less power consumption than Nvidia GPU-based deployments for large language model inference.
The company, which raised an $18 million seed round in May 2025 led by Fuse VC, is positioning its Elastix Rack as a drop-in replacement for GPU server infrastructure, with first shipments planned for mid-2026.
In an exclusive interview with All About Circuits ahead of the launch, co-founders Mohammad Rastegari (CEO), Saman Naderiparizi (CTO), and Mahyar Najibi (CSO) spoke with us to lay out the technical case for why FPGAs are better suited to LLM inference than GPUs—and why they believe the timing is right.
AI Training vs. AI Inference
Their core argument is that GPUs were designed to handle computationally intensive workloads, such as LLM training. But when tasked with memory-bound workloads like LLM inference, GPUs become inefficient and exhibit much lower computer utilization. "Training is heavily compute-bound; inference is heavily memory-bound," Rastegari said. This mismatch results in low GPU compute utilization at inference.

How ElastixAI’s approach tackles key LLM inference challenges.
Hardware inflexibility compounds the problem: 4-bit quantization theoretically doubles throughput, but Rastegari noted that on hardware like the H100 that lacks native support, operators "had to build a software kernel around it that could just utilize 10% of its potential."
While top-tier accelerators rely on the fastest and most expensive forms of memory, ElastixAI optimizes for the metrics that actually drive TCO: cost-per-bandwidth and cost-per-capacity. By using ML-defined software specialization, ElastixAI extracts maximum performance from cost-effective hardware (for example, advanced DDR and HBM) running on commercial off-the-shelf FPGA servers. According to the team, that approach delivers the memory bandwidth required for high-performance inference at a significantly lower cost per gigabyte than the industry's most premium memory tiers.
Why FPGAs Over Custom Silicon
The case for FPGA over custom silicon comes down to the pace of ML advancement relative to silicon development cycles. Rastegari, who co-founded Xnor.ai—acquired by Apple in 2020 for around $200 million—and later led the inference optimization of Meta's Llama 405B model, pointed to Mixture-of-Experts as an previous example of the risk.
"Many companies were raising capital to build a chip based on the status quo at the time, but then Mixture-of-Experts showed up." he said. "Suddenly, these companies had to go back and redesign their silicon to support Mixture-of-Experts, which didn't exist at the onset of their design process." The problem is evident. Custom silicon takes more than three years from design to production; the ML landscape can shift that significantly in months.
Inference throughput demands illustrate the same point. When Rastegari joined Meta, 20 tokens per second was sufficient for voice interaction. "But with reasoning, you want to have generated tokens in the background faster; now 200 tokens per second is needed." FPGAs can be reconfigured as those requirements shift.
“There is a fundamental tradeoff between generality and efficiency. The moment you want to be more general, you are losing efficiency because you have to add extra silicon to cover many different diverse workloads."
Rastegari argued that transformer architecture is now structurally stable enough to make FPGA implementation tractable, while the underlying optimization layer continues evolving fast enough that locking in a fixed silicon design remains risky. On the question of eventually taping out custom silicon, he was measured: "What is going to define if and when we're going to tape out the silicon is literally the rate of change in the ML improvement."

ElastixAI’s approach has several key advantages over standard GPU rack implementations for AI computing.
Power, Cost, and Rack Compatibility
Naderiparizi was careful to qualify the headline performance figures. "Depending on what token rate we are, we could show ten times to even fifty times improvement at cost compared to [the Nvidia] B200," he said, noting the range reflects different "per user latencies" (or alternatively tokens/second/user) of a target.
Those numbers span both CapEx and OpEx across a full data center deployment and have been validated through partnerships with FPGA manufacturers and data center operators. On power, Naderiparizi put the figure at a five times reduction per token at equivalent throughput.
The Elastix Rack fits within standard 17-19 kW rack power envelopes and uses air cooling, whereas Nvidia's GB200 NVL72 requires between 120 kW and 200 kW and specialized liquid-cooled infrastructure that most existing data centers cannot support.
Drop-In Replacement
Integration is handled through a vLLM plug-in that swaps out the Nvidia CUDA back end while leaving the front-end OpenAI-compatible API unchanged, so operators migrating from GPU infrastructure don't need to modify their application stack.
ElastixAI plans to eventually open its model conversion tooling to ML researchers—a strategy Naderiparizi compared explicitly to how Nvidia built the CUDA ecosystem. "At the beginning, Nvidia was releasing their software for free to researchers. But the thing was that CUDA was for Nvidia—whatever people were developing for that CUDA framework would help Nvidia." ElastixAI intends to build the same compounding developer flywheel around its own platform.
The founding team also includes Najibi, who contributed to Apple Intelligence and was previously a lead scientist at Waymo. Among the company's board members is Jon Gelsey, who served as CEO of Xnor.ai and founding CEO of Auth0, which was acquired by Okta for $6.5 billion. Gelsey serves as head of strategy and marketing at ElastixAI.
ElastixAI is currently available to select enterprise partners and data center operators, with hardware shipments targeting mid-2026.
All images used courtesy of ElastixAI.