What is Cache and Why Does It Change Everything?
Alongside the cores, every processor contains a critical component called the cache - and it’s the secret to its speed.
Why Do We Even Need It?
Accessing main memory (RAM) is much slower than the computations the processor performs - a single RAM access can cost hundreds of CPU cycles. That’s why every processor includes a small, extremely fast memory - the cache - where it stores the data it uses repeatedly.
In fact, when a model performs inference, it doesn’t access main memory every time. The data and variables it needs most are stored in the cache, saving valuable time.
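To make the cache effect concrete, here is a minimal, self-contained sketch (all names are illustrative): summing a 2D array row by row walks memory contiguously and stays in cache, while summing the same array column by column takes large strides and triggers far more cache misses.

```python
import time

N = 1000
matrix = [[1] * N for _ in range(N)]  # an N x N grid of ones

def sum_row_major(m):
    # Walk each row left to right: consecutive elements, cache-friendly.
    total = 0
    for row in m:
        for value in row:
            total += value
    return total

def sum_column_major(m):
    # Walk each column top to bottom: large strides, cache-unfriendly.
    total = 0
    for col in range(len(m[0])):
        for row in range(len(m)):
            total += m[row][col]
    return total

start = time.perf_counter()
row_sum = sum_row_major(matrix)
row_time = time.perf_counter() - start

start = time.perf_counter()
col_sum = sum_column_major(matrix)
col_time = time.perf_counter() - start

assert row_sum == col_sum == N * N  # same result, different memory behavior
print(f"row-major: {row_time:.4f}s, column-major: {col_time:.4f}s")
```

In pure Python the interpreter overhead masks much of the gap; in a compiled language (C, C++, Rust) the same access-pattern change can make a several-fold difference, which is exactly why inference kernels are written to be cache-friendly.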
Types of Cache
There are several “layers” of cache:
- L1 - The smallest and fastest (typically tens of KB), located directly on each core.
- L2 - Larger (hundreds of KB to a few MB) but slightly slower, usually still per-core.
- L3 - The largest and slowest of the three, typically shared across all cores on a socket and used for sharing data between them.
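You can inspect this hierarchy on your own machine. On Linux, the kernel exposes each cache level under `/sys/devices/system/cpu/cpu0/cache/`; the sketch below reads those files and degrades gracefully when they are missing (paths and availability vary by system, so treat this as a Linux-only example).

```python
import glob
import os

def cpu0_caches():
    """Return one dict per cache level visible to CPU 0 (Linux sysfs)."""
    caches = []
    for index in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cache/index*")):
        def read(name):
            try:
                with open(os.path.join(index, name)) as f:
                    return f.read().strip()
            except OSError:
                return "?"
        caches.append({
            "level": read("level"),                  # 1, 2, or 3
            "type": read("type"),                    # Data, Instruction, or Unified
            "size": read("size"),                    # e.g. "32K" or "8192K"
            "shared_with": read("shared_cpu_list"),  # which cores share it
        })
    return caches

for cache in cpu0_caches():
    print(f"L{cache['level']} {cache['type']}: {cache['size']} "
          f"(shared with CPUs {cache['shared_with']})")
```

Running this typically shows L1 and L2 shared with only one core (or its hyperthread sibling), while L3’s `shared_cpu_list` spans many cores - the hierarchy described above, straight from the kernel.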
Why is This Important for Inference?
If threads migrate between cores, the data they built up in the old core’s L1 and L2 caches is left behind, and they restart from a cold cache - which makes performance fluctuate.
This is one of the reasons why understanding Thread Affinity (binding tasks to a fixed core) is crucial - a topic we’ll dive into in the next post.
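As a small taste of that next post: on Linux, a process can pin itself to a fixed set of cores with `os.sched_setaffinity`, so its working set stays warm in that core’s cache. This is a minimal, Linux-specific sketch, not a full affinity strategy.

```python
import os

if hasattr(os, "sched_setaffinity"):           # Linux-only API
    original = os.sched_getaffinity(0)         # 0 means "this process"
    target = {min(original)}                   # pick one core we may run on
    os.sched_setaffinity(0, target)            # pin to that single core
    assert os.sched_getaffinity(0) == target
    os.sched_setaffinity(0, original)          # restore the original mask
    print("pinned to core", min(original), "and restored", sorted(original))
else:
    print("sched_setaffinity is not available on this platform")
```

Pinning trades scheduling flexibility for cache stability - exactly the trade-off the next post explores.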
📚 More in this Series: Hardware Inference Optimization
- Part 1 Why Do We Need to Understand Hardware for Inference Optimization?
- Part 2 What is NUMA and Why is it Important for Inference Optimization?
- Part 3 What are Cores and Threads?
- Part 5 Core Management - How to Properly Manage Your Processing Power
- Part 6 Thread Affinity - How to Bind Cores Smartly
- Part 7 Divided Resources - How to Allocate Resources Between Models or Processes
- Part 8 Resource Optimization - How All Factors Impact Latency and TPS
- Part 9 Series Summary: From NUMA to Throughput - How Optimization Turns Hardware into Performance