Why Does Your Model “Feel Slow”?

Table of Contents

Why Does Your Model “Feel Slow”?

It’s time to talk about profiling - the tool that will reveal why this happens.

When we run a model, sometimes it works - but not fast enough. Then the question arises: what exactly is slowing it down? This is where profiling comes in - a process that measures where the performance “disappears.”

What is Profiling?

Profiling is an accurate analysis of the execution time of each stage in the system. The goal: to identify bottlenecks - stages that consume unnecessary time or resources.

In inference systems, this includes, for example:

  • How long the Prefill stage takes compared to Decoding.
  • Whether the accelerator (GPU) is utilized 100% or waiting for data from the CPU.
  • How much time is wasted on data transfer to memory (Memory I/O).
  • What percentage of the time the model is actually “thinking” versus “waiting.”

What Does It Look Like in Practice?

The result of profiling is usually a time graph or a flame graph where you can see each operation - and how long it took.

For example:

  • 40% of the time is dedicated to Matrix Multiplication.
  • 25% to building the KV Cache.
  • 20% to data transfer to memory.
  • 15% waiting for the CPU Scheduler.

This data allows for pinpoint optimization, instead of “guessing.”

Common Tools

  • NVIDIA Nsight / CUPTI - for measuring GPU loads.
  • PyTorch Profiler / TensorBoard - for analyzing deep learning code.
  • Perf / VTune - for general CPU and memory analysis tools.

Each one answers the same question from a different angle: Where is your time being lost?

Why It’s Important

Without profiling, it’s hard to know whether to improve the code, shift load to the accelerator, or change the batch size. With profiling, you see the picture - and decide the next step with confidence.

Comments