The landscape of distributed computing is undergoing a seismic shift, away from reliance on ever-larger hardware caches and toward tighter integration of software and silicon. For years, the industry standard for maintaining low latency involved stacking increasingly massive CPU caches to mask inefficiencies in software execution. A new era of infrastructure is emerging, however, one where the intelligence lies in how software exploits massive parallelism rather than in how much data a processor can hold in its immediate memory buffer. This evolution is best exemplified by recent breakthroughs in edge stack optimization, where the focus has moved from “bigger caches” to “smarter cores.”

The Architectural Pivot: From Cache Dependency to Massive Parallelism
In the traditional model of server design, engineers often faced a dilemma. Software was frequently written in ways that were difficult to scale across dozens or hundreds of processor cores. To compensate for this, hardware manufacturers focused on expanding the L3 cache—a high-speed memory area located on the CPU itself. A larger cache allowed the processor to keep more data close to the execution engine, reducing the time spent waiting for data to arrive from the much slower system RAM. This approach worked well for legacy software, but it hit a wall of diminishing returns in terms of power consumption and physical space.
Imagine a DevOps engineer managing a global network. They might notice that while their servers are powerful, they are consuming massive amounts of electricity and generating intense heat just to maintain a steady response time. This is because the system is essentially using “brute force” memory to make up for software that cannot efficiently distribute its workload. The transition toward edge stack optimization involves breaking this cycle. Instead of trying to hide software inefficiencies with massive hardware buffers, the goal is to rewrite the software to thrive on a multitude of smaller, faster, and more efficient cores.
This shift is driven by the arrival of next-generation silicon, such as the AMD EPYC Turin series. These processors offer a staggering number of cores—up to 192 in a single chip—but they do not necessarily come with the gargantuan L3 caches found in previous specialized generations. For a company like Cloudflare, this presented a massive engineering challenge: initial tests with these high-core-count, lower-cache-density chips actually resulted in a 50% spike in latency. This was a clear signal that the old software stack, which relied on the “safety net” of a large cache, was fundamentally incompatible with the new hardware reality.
Why Abandon Large L3 Caches?
It might seem counterintuitive to move away from more memory. After all, more cache generally means faster access to frequently used data. However, there are several technical and economic reasons for this pivot. First, large caches are incredibly expensive to manufacture and occupy significant physical real estate on the silicon die. This space could instead be used to add more CPU cores, which provides a much higher ceiling for total throughput.
Second, there is the issue of power density. As data centers strive for better performance-per-watt, the energy required to maintain and manage massive L3 caches becomes a liability. By shifting the burden of performance from the hardware cache to the software’s ability to manage data, providers can achieve much higher density in their server racks. This allows for more computing power in the same physical footprint without a corresponding explosion in electricity costs.
The Engineering Battle: Hardware-Software Co-Design
The true breakthrough in modern edge stack optimization is not found in a single chip or a single line of code, but in the intersection of both. This concept, known as hardware-software co-design, suggests that software should be built with the specific architectural quirks of the underlying silicon in mind. When Cloudflare encountered that 50% latency penalty, they didn’t just wait for a better chip; they rebuilt their entire software foundation.
The move to a Rust-based stack, specifically the FL2 architecture, was the decisive factor. Rust is a systems programming language that provides memory safety without the need for a garbage collector. In high-performance computing, a garbage collector is a serious liability because it introduces unpredictable pauses in execution. By using Rust, developers can manage memory with extreme precision, ensuring that data is laid out in a way that is “cache-friendly” even if the cache itself is smaller.
Consider a software architect trying to write code that scales across 192 cores. In an old, cache-dependent model, the code might frequently jump around in memory, hoping the L3 cache will catch the data. In a modern, optimized model, the code is designed to follow strict, predictable patterns. This is often referred to as improving spatial and temporal locality. By ensuring that the data the CPU needs next is almost always exactly where the CPU expects it to be, the software can effectively “simulate” the benefits of a large cache through sheer efficiency.
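To make that concrete, here is a minimal Rust sketch of the two access patterns; the `Sample` type and the element count are invented for illustration, not taken from any real edge stack. Both functions compute the same sum, but the contiguous version hands the hardware prefetcher a predictable stream, while the boxed version forces the CPU to chase pointers to scattered heap addresses.

```rust
// The `Sample` record and the element count are invented for this sketch.
struct Sample {
    value: u64,
}

/// Cache-unfriendly: each Box lives at its own heap address, so every
/// step of the loop may jump to a distant cache line.
fn sum_scattered(samples: &[Box<Sample>]) -> u64 {
    samples.iter().map(|s| s.value).sum()
}

/// Cache-friendly: the records are contiguous, so the hardware
/// prefetcher can stream the next cache line before it is needed.
fn sum_contiguous(samples: &[Sample]) -> u64 {
    samples.iter().map(|s| s.value).sum()
}

fn main() {
    let scattered: Vec<Box<Sample>> =
        (0..1_000_000u64).map(|v| Box::new(Sample { value: v })).collect();
    let contiguous: Vec<Sample> =
        (0..1_000_000u64).map(|v| Sample { value: v }).collect();
    assert_eq!(sum_scattered(&scattered), sum_contiguous(&contiguous));
}
```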
The Role of Rust in High-Core Environments
A common question arises: Is it just the language, or is it the way the developers use it? While Rust’s ownership model prevents many common bugs, its real power in edge stack optimization lies in how it encourages better memory access patterns. Because Rust forces developers to be explicit about how data is moved and shared, it naturally leads to architectures that avoid “contention”—a situation where multiple cores fight over the same piece of memory, causing the whole system to slow down.
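As a small, hedged illustration of ownership preventing contention, the sketch below moves a distinct shard of data into each worker thread; because each shard has exactly one owner, there is nothing to lock. The sharding scheme is invented for the example and says nothing about FL2's internals.

```rust
use std::thread;

fn main() {
    // Four invented shards; each worker takes *ownership* of one, so
    // no other thread can touch it: no lock, no cache-line contention.
    let shards: Vec<Vec<u64>> = (0..4u64)
        .map(|i| (i * 1000..(i + 1) * 1000).collect())
        .collect();

    let handles: Vec<_> = shards
        .into_iter()
        .map(|shard| thread::spawn(move || shard.iter().sum::<u64>()))
        .collect();

    // Results are combined only after each worker finishes its shard.
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("total = {total}");
}
```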
In the new FL2 stack, the architecture focuses on reducing dynamic memory allocation. Every time a program asks the system for more memory during execution, it risks a delay. By pre-allocating memory and using more efficient data structures, the software can maintain a steady, high-speed flow of requests, making it perfectly suited for the massive parallelism offered by the AMD EPYC Turin 9965 processors.
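The pattern itself is straightforward to show in plain Rust. In this hedged sketch (the `Request` type and the buffer size are placeholders, not FL2 APIs), the scratch buffer is allocated once and reused, so the allocator is never called on the hot path:

```rust
// `Request` and the 64 KiB figure are placeholders, not FL2 APIs.
struct Request {
    payload: Vec<u8>,
}

fn main() {
    // Reserve the scratch buffer once, up front, instead of letting it
    // grow (and re-allocate) in the middle of the hot path.
    let mut scratch: Vec<u8> = Vec::with_capacity(64 * 1024);

    for i in 0..3u8 {
        let req = Request { payload: vec![i; 1024] };

        // `clear` keeps the capacity, so the loop body never calls the
        // allocator: the same memory is reused for every request.
        scratch.clear();
        scratch.extend_from_slice(&req.payload);
        println!("handled request of {} bytes", scratch.len());
    }
}
```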
Breaking Down the Gen 13 Specifications
To understand the scale of this achievement, one must look at the raw numbers that define the Gen 13 server architecture. These aren’t just incremental upgrades; they represent a fundamental reimagining of what an edge server should be. The hardware is built to feed a hungry, highly parallel software engine.
- Processor: The heart of the system is the 192-core AMD EPYC Turin 9965. This provides a massive amount of raw compute power, allowing for unprecedented levels of concurrency.
- Memory: With 768 GB of DDR5-6400 RAM, the system ensures that the massive number of cores is never “starved” for data. The high frequency of DDR5 is crucial for maintaining the bandwidth required by 192 individual cores.
- Storage: The inclusion of 24 TB of PCIe 5.0 NVMe storage allows for lightning-fast data retrieval, which is essential for edge applications that need to serve content or process data locally.
- Networking: A dual 100 GbE network interface card ensures that the massive throughput generated by the CPU can actually reach the outside world without creating a bottleneck at the network layer.
When these components work in harmony, the results are transformative. The Gen 13 platform can handle up to twice the amount of traffic compared to the previous Gen 12 architecture, all while maintaining the same strict response-time targets. This is the holy grail of infrastructure: increasing capacity without sacrificing speed.
Efficiency and Rack Density
One of the most impressive metrics is the 60% increase in capacity per rack. In a data center environment, space is at a premium. If you can fit 60% more computing power into the same rack without increasing the power draw, you have fundamentally changed the economics of the cloud. This is achieved through the synergy of high-core-count processors and the optimized software stack, which allows each watt of electricity to perform more meaningful work.
This efficiency also extends to thermal management. High-performance chips can be notoriously difficult to cool. However, by spreading the workload across 192 cores rather than concentrating it on a few “hot” cores with massive caches, the thermal load is more evenly distributed. This makes it easier to maintain stable operating temperatures, which in turn extends the lifespan of the hardware and reduces the energy spent on cooling systems.
Practical Implementation: How to Approach High-Parallelism Software
For developers and engineers working on their own distributed systems, the lessons from this architectural shift are highly applicable. If you are facing challenges with scaling or latency, you might consider the following steps to align your software with modern high-core hardware.
1. Audit Your Memory Access Patterns
The first step is to identify where your application is “cache-missing.” Use profiling tools to see how often your CPU is waiting for data from the main RAM. If you see high latency, it is likely because your data structures are scattered across memory. Try to group related data together so that when one piece is loaded into the cache, the next piece is already sitting right next to it.
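On Linux, `perf stat -e cache-misses,cache-references ./your-app` is one quick way to get that signal. Once a hot loop is identified, a common remedy is to split frequently scanned “hot” fields out of fat records so they pack densely into cache lines. Below is a minimal Rust sketch of that split; the connection-record fields are invented for illustration.

```rust
// All field names below are invented for illustration.

// Before: hot and cold fields interleaved. Scanning just the byte
// counters still drags the large cold fields through the cache.
#[allow(dead_code)]
struct ConnFat {
    bytes_sent: u64,              // hot: read on every pass
    peer_name: String,            // cold: read rarely
    negotiated_params: [u8; 256], // cold: read rarely
}

// After: hot counters live in their own contiguous array, so one
// 64-byte cache line holds eight counters instead of part of a
// single fat record.
#[allow(dead_code)]
struct ConnCold {
    peer_name: String,
    negotiated_params: [u8; 256],
}

struct Connections {
    bytes_sent: Vec<u64>, // hot array, scanned together
    cold: Vec<ConnCold>,  // cold data, touched on demand
}

impl Connections {
    fn total_bytes(&self) -> u64 {
        self.bytes_sent.iter().sum()
    }
}

fn main() {
    let conns = Connections {
        bytes_sent: vec![10, 20, 30],
        cold: Vec::new(),
    };
    println!("total = {}", conns.total_bytes());
}
```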
2. Minimize Shared State and Locking
In a 192-core environment, traditional “mutexes” or locks can become a massive bottleneck. If every core has to wait for a single lock to access a piece of data, you effectively turn your supercomputer into a single-core machine. Instead, look into lock-free data structures or “actor models” where each core or thread manages its own local state and communicates through message passing.
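As a minimal sketch of the message-passing alternative, the Rust program below gives one worker thread its own local state and feeds it work over a channel. A real system would run one such worker per core; the single-worker version is only meant to show the shape of the pattern.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<u64>();

    // One worker shown for brevity; a real system would run one
    // channel/worker pair per core.
    let worker = thread::spawn(move || {
        let mut local_total: u64 = 0; // owned by this thread alone
        for job in rx {
            local_total += job; // no lock: no other thread sees this state
        }
        local_total
    });

    for job in 1..=100u64 {
        tx.send(job).unwrap();
    }
    drop(tx); // closing the channel ends the worker's receive loop

    println!("worker total = {}", worker.join().unwrap());
}
```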
3. Embrace Memory Safety and Predictability
If possible, move critical paths of your application to languages like Rust or C++. These languages allow for the level of control required to manage memory manually and predictably. Avoid heavy reliance on managed runtimes that use garbage collection if your primary goal is consistent, low-latency performance at the edge.
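The payoff is deterministic memory behavior. In the hedged sketch below (the request handler is hypothetical), every allocation is released at a scope boundary the compiler fixes statically, so nothing can pause the request path the way a collector can:

```rust
use std::time::Instant;

// A hypothetical request handler: the allocation it makes is freed at
// a point the compiler determines statically (the end of the call), so
// no collector can pause the request path at an arbitrary moment.
fn handle_request(payload: &[u8]) -> usize {
    let scratch = payload.to_vec(); // allocated here
    scratch.len()
    // `scratch` is dropped, and its memory freed, exactly here
}

fn main() {
    let payload = vec![0u8; 4096];
    let start = Instant::now();
    let mut total = 0usize;
    for _ in 0..10_000 {
        total += handle_request(&payload);
    }
    // Elapsed time reflects only the work performed: no GC pauses.
    println!("processed {} bytes in {:?}", total, start.elapsed());
}
```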
4. Leverage Hardware Accelerators
Modern server architectures are increasingly including specialized hardware for specific tasks, such as PCIe encryption or AI acceleration. Rather than trying to do everything in the general-purpose CPU, offload these heavy, repetitive tasks to the dedicated hardware. This frees up your CPU cores to handle the complex logic of your application.
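One way to structure code for this, shown in the hedged sketch below, is to hide where the work runs behind a small interface; every name here is hypothetical, and the CPU checksum stands in for whatever operation the real hardware would accelerate.

```rust
// Every name below is hypothetical; the CPU checksum stands in for
// whatever operation the real hardware would accelerate.

trait ChecksumEngine {
    fn checksum(&self, data: &[u8]) -> u32;
}

/// General-purpose CPU fallback: always available, never the fastest.
struct CpuEngine;

impl ChecksumEngine for CpuEngine {
    fn checksum(&self, data: &[u8]) -> u32 {
        data.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
    }
}

// A deployment with offload hardware would add, say, a hypothetical
// `NicOffloadEngine` implementing the same trait over the vendor's
// driver API, leaving `process` and its callers untouched.

fn process(engine: &dyn ChecksumEngine, packet: &[u8]) -> u32 {
    engine.checksum(packet)
}

fn main() {
    let engine = CpuEngine;
    let packet = [1u8, 2, 3, 4];
    println!("checksum = {}", process(&engine, &packet));
}
```

The design choice is that application logic depends only on the trait, so adopting a NIC or crypto offload later becomes a new implementation rather than a rewrite.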
The Future of the Edge: Beyond the Core
As we look toward the future, the trend of edge stack optimization is only going to accelerate. We are moving toward an era where the distinction between “hardware” and “software” becomes increasingly blurred. We will see more “software-defined hardware” where the architecture of the silicon is specifically tuned to the needs of the workloads it will run.
The introduction of improved support for thermally demanding PCIe accelerators and native PCIe encryption hardware in the Gen 13 servers is a precursor to this. As edge computing moves into more complex territory—such as real-time AI inference and massive-scale cybersecurity filtering—the ability to seamlessly integrate specialized hardware with highly parallel software will be the defining characteristic of successful platforms.
The shift from relying on large L3 caches to leveraging massive parallelism represents more than just a hardware upgrade; it is a fundamental change in how engineers think about computing efficiency. By prioritizing smarter software and more granular hardware, the industry is paving the way for a faster, greener, and more scalable internet.
The journey from a 50% latency penalty to a 60% increase in rack capacity is a testament to the power of engineering when hardware and software are treated as a single, unified system. As these technologies continue to mature, the edge will become even more capable of handling the increasingly complex demands of our digital world.