Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: Reduce model image sizes (#225)
This PR proposes changing the base image from `nvcr.io/nvidia/pytorch:23.10-py3` to the smaller `python:3.8-slim` to reduce our container image sizes. I came to this conclusion because:

1. AKS preinstallation: On each GPU node, AKS preinstalls the NVIDIA driver (which comes with basic CUDA runtime functionality). AKS also preinstalls nvidia-container-toolkit (which includes nvidia-container-runtime) and a few other necessary NVIDIA libraries, which can be verified on an AKS node using:
   ```
   ls /usr/bin | grep nvidia
   ```
2. NVIDIA Device Plugin for K8s: Enabled by default on AKS clusters with N-series node pools, this DaemonSet advertises GPU resources to the K8s scheduler, allowing pods to request and be allocated GPUs. Note that this plugin does not install NVIDIA drivers, CUDA, or the NVIDIA Container Toolkit on the nodes; these components are preconfigured by AKS (point 1).
3. Container runtime: AKS nodes come preinstalled with `nvidia-container-toolkit`, which includes `nvidia-container-runtime`. This supports GPU passthrough to containers via containerd. This configuration lets containers access the necessary NVIDIA drivers and libraries at the host level, so these components do not need to be bundled into individual container images. The container runtime configuration confirms this (`cat /etc/containerd/config.toml`).
4. PyTorch installation: When installed via pip, PyTorch automatically includes the additional GPU acceleration libraries it needs within its binaries (CUDA, cuBLAS, cuDNN, NCCL, etc.), eliminating the reliance on the `nvcr.io/nvidia/pytorch` image for these.
5. `.dockerignore`: Added to ignore the `.git` LFS files, which were taking up too much space.

Based on these findings, the `python:3.8-slim` base image should suffice for our requirements. I have validated that this works locally, with further testing planned for the built images.
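A minimal sketch of how points 3 and 4 could be checked from inside a container built on the slim image, assuming PyTorch was installed there via pip. The helper name `gpu_stack_report` is hypothetical and not part of this PR; it only calls PyTorch APIs that report what the installed build ships and what the host exposes.

```python
"""Report what a pip-installed PyTorch can see inside the container."""
import importlib.util


def gpu_stack_report():
    """Return a dict describing the GPU stack; safe to call even if torch is missing."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "cuda_version": None,
        "cudnn_version": None,
        "nccl_available": False,
        "devices": [],
    }
    if not report["torch_installed"]:
        return report

    import torch
    import torch.distributed as dist

    # CUDA version the wheel was built against (None on CPU-only wheels).
    report["cuda_version"] = torch.version.cuda
    # cuDNN version bundled with the wheel (None if cuDNN is unavailable).
    report["cudnn_version"] = torch.backends.cudnn.version()
    # NCCL support compiled into the wheel.
    report["nccl_available"] = bool(dist.is_available() and dist.is_nccl_available())
    # True only if the host driver is actually reachable from this container.
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in gpu_stack_report().items():
        print(f"{key}: {value}")
```

If `cuda_version` and `cudnn_version` are populated but `cuda_available` is false, the wheel carries its libraries and the likely gap is on the host side (driver or container runtime), which is the split this PR relies on.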
Aside: I found that the NVIDIA GPU Operator offers additional functionality such as DCGM metrics, runtime validation, and dynamic MIG profile management, though none of it is required for our current needs.

Sources:
- https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/microsoft-aks.html
- https://catalog.ngc.nvidia.com/orgs/nvidia/containers/gpu-operator
- https://github.com/Azure/aks-engine/blob/master/examples/addons/nvidia-device-plugin/README.md
- https://hub.docker.com/r/nvidia/cuda/tags?page=1&name=%25-base
- https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-10.html
- https://stackoverflow.com/questions/45293580/whats-the-relation-between-nvidia-driver-cuda-driver-and-cuda-toolkit
- https://earthly.dev/blog/buildingrunning-nvidiacontainer/
- https://discuss.pytorch.org/t/how-to-check-if-torch-uses-cudnn/21933
- https://discuss.pytorch.org/t/is-nvidia-driver-already-included-cuda-and-cuda-toolkit/184411
- https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool
1 parent fbdd2ec, commit 7201161
Showing 6 changed files with 66 additions and 23 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -1,4 +1,4 @@` |
| | | `-FROM nvcr.io/nvidia/pytorch:23.10-py3` |
| | | `+FROM python:3.8-slim` |
| | | |
| | | `ARG WEIGHTS_PATH` |
| | | `ARG MODEL_TYPE` |