What is NUMA and Why is it Important for Inference Optimization?
When running a model on a powerful server, we tend to assume that all CPUs and memory are accessible "at the same speed." In reality, that's far from true. This is where the architecture called NUMA - Non-Uniform Memory Access - comes into play.
What Does This Mean in Practice?
Modern servers have multiple processors (sockets), and each of them has:
- Its own cores.
- Memory (RAM) that is “physically close” to it.
Access to memory directly connected to your processor is very fast. But if the processor needs to access the memory of a “neighbor” - that is, memory connected to another processor - it does so via an internal interconnect, which is slower.
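You can see this asymmetry directly on a Linux server. The sketch below (assuming a Linux box; `numactl` may or may not be installed) counts the NUMA nodes the kernel exposes and, when `numactl` is available, prints the distance matrix - local access is normalized to 10, while remote access is typically 20 or more:

```shell
# Count the NUMA nodes the kernel exposes via sysfs (works on any Linux kernel).
NUM_NODES=$(ls -d /sys/devices/system/node/node* 2>/dev/null | wc -l | tr -d ' ')
echo "NUMA nodes: ${NUM_NODES}"

# With numactl installed, print the full topology: per-node CPUs, memory,
# and the node distance matrix (local ~10, remote typically 20+).
if command -v numactl >/dev/null 2>&1; then
    numactl --hardware
fi
```

On a single-socket machine you will usually see one node, and the distance matrix question disappears; on a two-socket server, two nodes with asymmetric distances are the norm.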
Why is This Important for Inference?
In inference, every small delay adds up: if a process running on one socket repeatedly accesses memory located on another socket, the extra latency can add tens of percent to the total time.
This happens, for example, when:
- A model runs on cores in one CPU, but its data resides in the memory of another CPU.
- The operating system automatically distributes processes without understanding the implications.
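To diagnose whether this is happening to you, you can check where a process's memory actually lives. The sketch below (assuming Linux; `numastat -p <PID>` is the usual tool, shown here in a comment since the PID is a placeholder) falls back to the kernel's `numa_maps` file, where `N0=`, `N1=` entries show how many pages of each mapping sit on each node:

```shell
# With numastat installed, per-node memory for your inference process (PID is
# a placeholder) would be:
#   numastat -p <PID>
# Large counts under a node your process's CPUs do NOT run on mean remote access.

# Portable fallback: /proc/<pid>/numa_maps lists per-mapping node placement
# (N0=..., N1=... are page counts per node). Sample a few lines for this process:
PLACEMENT=$(grep -m 5 'N[0-9]=' /proc/self/numa_maps 2>/dev/null || echo "numa_maps not available")
echo "$PLACEMENT"
```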
How Do You Solve This?
- Place processes and memory on the same node - what’s called “NUMA-aware allocation.”
- Use tools like `numactl` to assign each process to a specific socket and its cores.
- Maintain locality - ensure that all computation and memory access happen as "close to home" as possible.
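In practice, NUMA-aware binding is a one-line wrapper around your launch command. The sketch below (where `./infer_server` is a placeholder for your actual inference binary) pins both CPU scheduling and memory allocation to node 0, then verifies the policy the shell inherits:

```shell
# Pin an inference process to NUMA node 0 (placeholder binary name):
#   --cpunodebind=0 restricts scheduling to node 0's cores
#   --membind=0     forces allocations to come from node 0's RAM
#
#   numactl --cpunodebind=0 --membind=0 ./infer_server

# Verify the NUMA policy currently in effect:
if command -v numactl >/dev/null 2>&1; then
    POLICY=$(numactl --show)
else
    POLICY="numactl not installed"
fi
echo "$POLICY"
```

With `--membind`, allocations fail rather than spill to a remote node, which makes placement mistakes visible instead of silently slow; `--preferred=0` is the softer alternative when you'd rather degrade than fail.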
Bottom Line
NUMA is not just an architectural trivia point - it’s a direct factor in latency. Awareness of NUMA and proper binding of processes to the correct memory can make the difference between a sluggish system and one that delivers predictions at nearly double the speed.
📚 More in this Series: Hardware Inference Optimization
- Part 1 Why Do We Need to Understand Hardware for Inference Optimization?
- Part 3 What are Cores and Threads?
- Part 4 What is Cache and Why Does It Change Everything?
- Part 5 Core Management - How to Properly Manage Your Processing Power
- Part 6 Thread Affinity - How to Bind Cores Smartly
- Part 7 Divided Resources - How to Allocate Resources Between Models or Processes
- Part 8 Resource Optimization - How All Factors Impact Latency and TPS
- Part 9 Series Summary: From NUMA to Throughput - How Optimization Turns Hardware into Performance