Resource Optimization - How All Factors Impact Latency and TPS

👤 Efrat Bdil 📅 1/7/2026 ⏱️ 2 min read

📚 Hardware Inference Optimization - Part 8 Machine Learning Hardware #Resource Optimization #NUMA #Inference

Table of Contents

Resource Optimization - How All Factors Impact Latency and TPS

In any inference system, every small change - in cores, memory, or the way resources are allocated - impacts two main metrics:

Latency - How long it takes for a single request to complete.
TPS (Transactions Per Second) - How many predictions the system can process per second.

To improve one, you often sacrifice the other. Understanding this balance is the essence of true optimization.

How NUMA, Core Allocation, and Resource Division Work Together

NUMA (Non-Uniform Memory Access) - When a model accesses “remote” memory, it wastes time on inter-processor communication. The result: increased latency even if processing power remains the same.
Core Allocation (Thread Affinity) - When threads migrate between cores, they lose their local cache, reducing efficiency and adding unnecessary delays. Maintaining affinity improves stability and prevents “spikes” in latency.
Resource Division (Allocating Resources Between Models) - If two models compete for the same resources, each gets less of the CPU or memory it needs. This increases latency and can dramatically lower throughput. Smart management (proper isolation or load balancing) restores balance.

How to Measure Correctly and Identify Bottlenecks

Superficial runtime measurement doesn’t tell the whole story. To understand where the bottleneck is, you need to examine several layers:

CPU Utilization - Are all cores truly busy, or are some “wasting time” waiting?
Memory Bandwidth - Is there access contention between NUMA nodes?
Thread Scheduling - Are threads jumping between cores?
Latency Distribution - Not just the average, but also the standard deviation: instability is a sign of improper allocation.

Principles for Experimentation and Optimization Testing

Change only one parameter per experiment - otherwise, you can’t know what really had an impact.
Measure over time - inference systems are also affected by load fluctuations.
Track consistent metrics (TPS, P99 latency, CPU usage).
Compare NUMA/affinity/isolation configurations under identical conditions.
Document every change - even small optimizations can appear reversed when not tested in the right context.

Bottom Line

No component - NUMA, cores, or resources - stands alone. Only by understanding how they all interact can you achieve true optimization: A system where every computation is utilized, every thread is in the right place, and every prediction is delivered on time - without wasting a single microsecond.

Resource Optimization - How All Factors Impact Latency and TPS

Resource Optimization - How All Factors Impact Latency and TPS

How NUMA, Core Allocation, and Resource Division Work Together

How to Measure Correctly and Identify Bottlenecks

Principles for Experimentation and Optimization Testing

Bottom Line

📚 More in this Series: Hardware Inference Optimization

🔗 Related Posts

Comments

Resource Optimization - How All Factors Impact Latency and TPS

How NUMA, Core Allocation, and Resource Division Work Together

How to Measure Correctly and Identify Bottlenecks

Principles for Experimentation and Optimization Testing

Bottom Line

📚 More in this Series: Hardware Inference Optimization

🔗 Related Posts

What is a Model Pipeline?

NVIDIA - How a Graphics Card Company Became the Queen of AI

What is an Accelerator?

Comments