1) At low loads, most of the GPU memory is allocated but sits unused, yet it cannot be reclaimed by other services;
Allocated-but-unused GPU memory is off-limits to other services as well -> the paper mentions exploiting this to colocate offline workloads.
2) At high loads, because of the GPU memory allocation threshold set by the inference engine, 10%-20% of the GPU memory remains idle. Hence, current GPU memory management is inefficient;
Because the inference process is nondeterministic, a portion of GPU memory is usually held in reserve, which wastes 10%-20% of it.
3) The prefill and decode stages of the inference process place significantly different demands on GPU memory.
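To make the allocation threshold and the decode-time KV-cache pressure concrete, here is a back-of-the-envelope sketch in Python. All numbers are illustrative assumptions (a 40 GiB card, a 7B fp16 model, a 0.90 utilization cap in the spirit of an engine-level threshold such as vLLM's `gpu_memory_utilization`, and a Llama-7B-like 32-layer / 32-head / 128-dim layout), not measurements from the paper.

```python
GiB = 1024 ** 3

# Assumed hardware and engine settings (hypothetical).
total = 40 * GiB
utilization_cap = 0.90          # the engine never allocates past this threshold
weights = 14 * GiB              # ~7B params * 2 bytes (fp16)

# Memory the engine may actually use vs. what sits idle by design.
usable = total * utilization_cap
idle_by_threshold = total - usable   # the reserved headroom described above

# Per-token KV-cache cost: 2 tensors (K and V) * layers * heads * head_dim * bytes.
layers, heads, head_dim, dtype_bytes = 32, 32, 128, 2
kv_per_token = 2 * layers * heads * head_dim * dtype_bytes

# Prefill allocates KV cache for the whole prompt at once; decode then grows it
# one token per step, so the leftover budget bounds the tokens resident at once.
kv_budget = usable - weights
max_resident_tokens = int(kv_budget // kv_per_token)

print(f"idle by threshold: {idle_by_threshold / GiB:.1f} GiB")
print(f"KV bytes per token: {kv_per_token}")
print(f"max resident tokens: {max_resident_tokens}")
```

With these numbers the threshold alone idles 4 GiB, and each generated token permanently claims 512 KiB of KV cache until its sequence finishes, which is why prefill (bursty, prompt-sized) and decode (slow, monotonic growth) stress the allocator so differently.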
Note that Visual Profiler and nvprof will be deprecated in a future CUDA release. The NVIDIA Volta platform is the last architecture on which these tools are fully supported. It is recommended to use the next-generation tools: NVIDIA Nsight Systems for GPU and CPU sampling and tracing, and NVIDIA Nsight Compute for GPU kernel profiling.