1. Problem background

1) At low loads, most of the GPU memory is allocated but not used, occupying the GPU memory and preventing it from being used by other services;

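GPU memory that is allocated but unused cannot be used by other services either -> the paper notes that they use this technique to co-locate offline workloads.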

2) At high loads, due to the GPU memory allocation threshold set by the inference engine, up to 10%-20% of the GPU memory remains unused and idle. Hence, the current GPU memory management is inefficient;

Because the inference process is unpredictable, a portion of GPU memory is usually held in reserve, which wastes 10%-20% of it.

3) The prefill and decode stages of the inference process have significantly different demands on GPU memory

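The prefill and decode stages have significantly different GPU memory requirements (under a PD-disaggregated architecture, the P node does not need to store the KV cache, while the D node does).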
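To put rough numbers on that difference, here is a back-of-the-envelope estimate of KV cache size. The formula is the standard one (a key and a value vector per layer per token); the model shape, sequence length, and batch size are assumptions chosen for illustration and do not come from the paper.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Factor 2: each layer stores one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_bytes(seq_len, batch_size, **model):
    # Total KV cache for batch_size sequences of seq_len tokens each.
    return seq_len * batch_size * kv_cache_bytes_per_token(**model)


# Hypothetical model shape (roughly a 7B-class decoder), fp16 values.
MODEL = dict(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)

per_token = kv_cache_bytes_per_token(**MODEL)

# A prefill node handling one 2048-token prompt holds its KV cache only briefly.
prefill_peak = kv_cache_bytes(seq_len=2048, batch_size=1, **MODEL)

# A decode node keeping 32 such sequences resident holds 32 times as much.
decode_resident = kv_cache_bytes(seq_len=2048, batch_size=32, **MODEL)

GIB = 2**30
print(per_token // 1024, "KiB per token")
print(prefill_peak / GIB, "GiB for one prompt")
print(decode_resident / GIB, "GiB resident on the decode side")
```

Under these assumed numbers the decode side needs tens of GiB for KV cache while a single prefill pass needs about one, which is the kind of gap the third point is describing.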

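2. Solution

Overall architecture

Both the Prefill Agent and the Decode Agent contain:
1. Queue: manages the request queue. The Request Router and the Schedule Queue are each a Queue
2. Memory Predictor: predicts memory demand, then calls the CUDA API to allocate GPU memory
3. Dynamic block manager: presumably manages the virtual address space
Executor: fairly similar to Qianfan's ModelServer
1. KV Cache: the KV cache, similar to our AttentionStore
2. Work: the inference engine, responsible for token generation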
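To make the division of labor between the Queue, the Memory Predictor, and the Dynamic block manager concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the class and method names, the fixed-ratio prediction rule, and the block size are hypothetical and are not the paper's design. It makes no CUDA calls. If the guess about virtual address space is right, the real counterpart would likely be the CUDA driver's virtual memory management API (`cuMemAddressReserve`, `cuMemCreate`, `cuMemMap`, `cuMemUnmap`), which this sketch only imitates with counters.

```python
import math
from collections import deque
from dataclasses import dataclass

BLOCK_TOKENS = 16  # assumed number of tokens per KV block


@dataclass
class Request:
    req_id: str
    prompt_len: int
    max_new_tokens: int


class MemoryPredictor:
    """Hypothetical rule: expect a fixed fraction of max_new_tokens to be generated."""

    def __init__(self, expected_output_ratio=0.5):
        self.ratio = expected_output_ratio

    def predict_blocks(self, req):
        tokens = req.prompt_len + math.ceil(req.max_new_tokens * self.ratio)
        return math.ceil(tokens / BLOCK_TOKENS)


class DynamicBlockManager:
    """Toy stand-in: a large virtual range, backed by physical blocks only on demand."""

    def __init__(self, virtual_blocks, physical_blocks):
        self.virtual_blocks = virtual_blocks  # reserved range, never fully backed
        self.free_physical = physical_blocks  # blocks that can still be mapped
        self.mapped = {}                      # req_id -> number of mapped blocks

    def try_map(self, req_id, n):
        used_virtual = sum(self.mapped.values())
        if n > self.free_physical or used_virtual + n > self.virtual_blocks:
            return False
        self.free_physical -= n
        self.mapped[req_id] = self.mapped.get(req_id, 0) + n
        return True

    def unmap(self, req_id):
        # Physical blocks go back to the pool so other work can use them.
        self.free_physical += self.mapped.pop(req_id, 0)


def schedule(queue, predictor, manager):
    """Admit requests in FIFO order while their predicted blocks can be mapped."""
    admitted = []
    while queue:
        req = queue[0]
        if not manager.try_map(req.req_id, predictor.predict_blocks(req)):
            break  # head-of-line request does not fit yet; wait for memory
        admitted.append(queue.popleft().req_id)
    return admitted


queue = deque([
    Request("a", prompt_len=100, max_new_tokens=56),
    Request("b", prompt_len=500, max_new_tokens=200),
    Request("c", prompt_len=40, max_new_tokens=16),
])
predictor = MemoryPredictor()
manager = DynamicBlockManager(virtual_blocks=1024, physical_blocks=40)

first_round = schedule(queue, predictor, manager)   # only "a" fits
manager.unmap("a")                                  # "a" finishes, blocks are released
second_round = schedule(queue, predictor, manager)  # now "b" fits, "c" still waits
print(first_round, second_round, manager.free_physical)
```

The point of the sketch is that physical memory is only committed when a request is admitted and is returned as soon as it finishes, so at low load the unmapped blocks stay available to other services instead of sitting inside a preallocated pool.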
