Serving - How a Model Starts “Talking to the World”
Serving - How a Model Starts “Talking to the World”
You’ve trained a model? Great. But now comes the stage where it needs to start responding to real people. This is where Serving comes in - the way to turn a trained model into a live service.
What is Serving?
When we talk about “Serving,” we mean the stage where the model:
- Is loaded into memory (like a running application).
- Listens for requests (e.g., “Give me a prediction”).
- Returns a response (quickly and accurately).
It’s essentially the service that allows any other system to use the model - through a simple API.
Two Types of Serving
Real-Time Serving
When you need an answer now. For example: A user asks a chatbot → the model responds immediately. Focus: Response speed (Latency).
Batch Serving
When processing large amounts of data at once. For example: Updating predictions for all users once a day. Focus: Efficiency and high Throughput.
Why is This Important?
Because a great model without Serving - is just a nice file. Serving is what turns it into part of a real system, serving people, applications, and organizations.
How Does It Work in Practice?
- API Layer - Receives requests from the outside (e.g., via HTTP or gRPC).
- Model Engine - Performs the actual computation (CPU or GPU).
- Scheduler / Load Balancer - Distributes requests to prevent system overload.
- Cache - Stores repeated results to avoid recalculating.
Conclusion
Serving is the stage where AI becomes a product. Without it, the model stays in the lab. With it - it talks, responds, and delivers real value.