According to TheRegister.com, Perplexity’s research team has developed software optimizations enabling trillion-parameter mixture-of-experts (MoE) AI models to run efficiently on older, cheaper hardware like H100 and H200 GPUs using Amazon’s Elastic Fabric Adapter networking. These innovations address critical memory and network latency challenges in serving massive models like DeepSeek V3 (671 billion parameters) and Kimi K2 (1 trillion parameters), which don’t fit on a single eight-GPU system. The work specifically targets AWS’s proprietary EFA networking protocol, which previously suffered performance penalties compared to Nvidia’s ConnectX NICs due to limitations in message sizes and a lack of GPUDirect Async support. Perplexity claims its new kernels achieve lower latency than DeepSeek’s DeepEP framework, and the company validated performance gains on AWS p5en (H200) instances, showing meaningful improvements at medium batch sizes despite EFA being up to 14x slower than NVLink.
The hardware dilemma
Here’s the thing about these massive AI models – they’re hitting a wall. You’ve got trillion-parameter beasts like Kimi K2 that simply won’t fit on the eight-GPU systems that most companies can actually afford. The “easy” solution would be to deploy on Nvidia’s GB200 or GB300 NVL72 rack systems, but let’s be real – those things are crazy expensive and basically impossible to get unless you’re OpenAI or Google. And good luck if you’re not in the right geography.
So what’s left? Older H100 and H200 systems are plentiful and comparatively cheap, but these models are too big for any single one of them, which means splitting the model across multiple nodes. And that’s where the performance hits come in. It’s like trying to coordinate a complex conversation between people in different buildings connected by slow walkie-talkies instead of having everyone in the same room.
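To make that memory wall concrete, here’s a back-of-the-envelope check in Python. Every number is a simplifying assumption: FP8 weights at one byte per parameter, and a flat 30% of HBM held back as a stand-in for KV cache, activations, and runtime overhead.

```python
# Rough check: do the raw weights fit in one node's HBM?
# Assumptions: FP8 weights (1 byte/parameter); 30% of HBM reserved
# as stand-in headroom for KV cache, activations, and runtime overhead.
GIB = 1024**3
HEADROOM = 0.70  # fraction of HBM usable for weights (assumption)

models = {
    "DeepSeek V3 (671B)": 671e9,   # parameters
    "Kimi K2 (1T)": 1.0e12,
}
nodes = {
    "8x H100 (80 GB each)": 8 * 80 * GIB,
    "8x H200 (141 GB each)": 8 * 141 * GIB,
}

for model, params in models.items():
    weights_gib = params / GIB  # 1 byte per parameter at FP8
    for node, hbm in nodes.items():
        usable_gib = hbm * HEADROOM / GIB
        verdict = "fits" if weights_gib <= usable_gib else "needs more than one node"
        print(f"{model} on {node}: {weights_gib:,.0f} GiB of weights vs "
              f"~{usable_gib:,.0f} GiB usable -> {verdict}")
```

Run it and Kimi K2 overflows even the H200 box once you leave room for the cache, which is exactly why serving it means multiple nodes, and why the network between those nodes suddenly matters so much.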
Why EFA was the problem
Amazon’s Elastic Fabric Adapter is AWS’s homegrown networking tech, and it’s everywhere in their cloud. The issue isn’t raw bandwidth: EFA supports up to 400 Gbps, same as Nvidia’s ConnectX-7 NICs. The problem is in the details. EFA’s performance falls off at the specific message sizes MoE models send during their “dispatch and combine” operations. More importantly, it lacks GPUDirect Async support, so the GPU can’t initiate network transfers itself; the CPU has to stay in the loop to drive communication, and that detour adds latency to every exchange.
Basically, it’s like having a highway with unnecessary toll booths slowing everything down. For companies that need every microsecond of performance, that’s a deal-breaker.
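You can put rough numbers on those toll booths. The toy model below charges a fixed cost per message on top of wire time; every constant in it is invented for illustration, not a measured EFA or ConnectX figure. Cap the message size lower, and make each message pricier because the CPU has to proxy it, and the same payload takes noticeably longer to deliver.

```python
# Toy latency model for shipping one MoE dispatch payload over a NIC.
# All constants are illustrative assumptions, not measured figures.
import math

def transfer_time_us(payload_bytes, max_msg_bytes, per_msg_overhead_us,
                     bandwidth_gbps):
    """Serial cost model: per-message fixed overhead plus time on the wire."""
    n_msgs = math.ceil(payload_bytes / max_msg_bytes)
    wire_us = payload_bytes * 8 / (bandwidth_gbps * 1e3)  # bits / (bits per us)
    return n_msgs * per_msg_overhead_us + wire_us

payload = 4 * 1024 * 1024  # 4 MiB of routed tokens (illustrative)

# Same 400 Gbps link in both cases; only message size and per-message cost differ.
for label, max_msg, overhead_us in [
    ("large messages, GPU-initiated", 1024 * 1024, 1.0),
    ("small messages, CPU-proxied", 64 * 1024, 3.0),
]:
    t = transfer_time_us(payload, max_msg, overhead_us, 400.0)
    print(f"{label}: {t:6.1f} us for the same 4 MiB payload")
```

Bandwidth is identical in both runs; the second is slower purely because the fixed costs repeat more often. Which is why matching ConnectX on paper bandwidth didn’t translate into matching it on performance.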
Perplexity’s clever workaround
What Perplexity did was develop optimized kernels (basically highly tuned communication routines) that work around EFA’s limitations. They’re claiming lower latency than even DeepSeek’s DeepEP framework, which was built around Nvidia’s ConnectX interconnects. That’s pretty bold when you think about it: beating a framework tuned for Nvidia’s own networking while running on Amazon’s.
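For context on what those kernels actually do: in an MoE layer, each token gets routed to a few experts, which in a big deployment live on other GPUs, so its activations have to be shipped out (dispatch) and the experts’ weighted outputs summed back up (combine). Here’s a single-process NumPy sketch of that data movement, with toy sizes and random matrices standing in for real experts; the production kernels run this same pattern across nodes over RDMA.

```python
# Single-process sketch of MoE dispatch/combine -- the all-to-all pattern
# being optimized. Shapes and the random "experts" are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 8, 16      # toy batch
n_experts, top_k = 4, 2        # each token is routed to its top_k experts

tokens = rng.standard_normal((n_tokens, d_model))
router_logits = rng.standard_normal((n_tokens, n_experts))

# Routing: pick top_k experts per token, softmax weights over those picks.
topk_idx = np.argsort(router_logits, axis=1)[:, -top_k:]    # (n_tokens, top_k)
topk_logits = np.take_along_axis(router_logits, topk_idx, axis=1)
gates = np.exp(topk_logits)
gates /= gates.sum(axis=1, keepdims=True)

# Dispatch: group token indices by destination expert. Distributed, each
# group becomes network traffic to whichever GPU hosts that expert.
expert_inputs = {e: [] for e in range(n_experts)}
for t in range(n_tokens):
    for k in range(top_k):
        expert_inputs[int(topk_idx[t, k])].append(t)

# Each "expert" is a fixed random matrix standing in for an FFN.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# Combine: weighted sum of expert outputs back at each token's home.
output = np.zeros_like(tokens)
for e, token_ids in expert_inputs.items():
    if not token_ids:
        continue
    y = tokens[token_ids] @ experts[e]      # the expert's computation
    for row, t in zip(y, token_ids):
        k = list(topk_idx[t]).index(e)      # which slot routed token t to e
        output[t] += gates[t, k] * row

print(output.shape)  # (8, 16): one combined vector per token
```

In the distributed version, those Python loops become all-to-all exchanges between GPUs, and it’s the latency of exactly those exchanges that Perplexity claims to have cut.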
The real proof came when they tested on actual AWS H200 instances running both DeepSeek V3 and Kimi K2. The performance gains were most noticeable at medium batch sizes, which is actually where a lot of real-world inference happens. It’s not about peak theoretical performance – it’s about making things work better where it matters.
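That claim is testable in principle with a simple sweep, and since the code is public you can run one yourself. The harness below is only a sketch of the method: make_payload and dispatch_combine are placeholders I’ve invented, not anything from Perplexity’s release. Warm up, then time the communication path at each batch size and compare implementations.

```python
# Sketch of a batch-size latency sweep. The two placeholder functions
# stand in for real payload construction and the kernel under test.
import time

def make_payload(batch):        # placeholder: real code builds GPU tensors
    return [1.0] * batch

def dispatch_combine(payload):  # placeholder for the kernel being measured
    return sum(payload)

def sweep(fn, batch_sizes, warmup=10, iters=200):
    """Average per-call latency in microseconds at each batch size."""
    results = {}
    for batch in batch_sizes:
        payload = make_payload(batch)
        for _ in range(warmup):               # discard cold-start effects
            fn(payload)
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(payload)
        results[batch] = (time.perf_counter() - t0) / iters * 1e6
    return results

print(sweep(dispatch_combine, [1, 16, 128, 1024]))
```

Swap the placeholders for the released kernels and a baseline, and a medium-batch gap like the one reported is exactly what a sweep of this shape would surface.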
What this means for the rest of us
This is bigger than just Perplexity optimizing their own infrastructure. By open-sourcing this work on GitHub and publishing their research paper, they’re effectively democratizing access to cutting-edge AI models. Companies can now sweat their existing hardware longer or take advantage of discounted AWS instances without being locked out of the latest model capabilities.
Think about the industrial applications here: companies running complex simulations, manufacturing operations, or research facilities often rely on computing infrastructure that has to keep working for years. For outfits like that, being able to point existing, already-amortized systems at AI workloads instead of buying new racks is huge.
Perplexity says they’re continuing to optimize for EFA, which makes sense given how dominant AWS is in the cloud market. The bigger picture? We might be entering an era where software optimizations matter as much as hardware advancements in the AI race. And honestly, that’s probably a good thing for everyone except the companies trying to sell us ever-more-expensive hardware.
