docs: quick deploy using terraform #634

Merged
merged 1 commit into from
Oct 16, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -36,7 +36,7 @@ The above figure presents the Kaito architecture overview. Its major components

## Installation

Please check the installation guidance [here](./docs/installation.md).
Please check the installation guidance [here](./docs/installation.md) for deployment using Azure CLI and [here](./terraform/README.md) for deployment using Terraform.

## Quick start

34 changes: 34 additions & 0 deletions terraform/.gitignore
@@ -0,0 +1,34 @@
# Local .terraform directories
**/.terraform/*

# .tfstate files
*.tfstate
*.tfstate.*

# Crash log files
crash.log
crash.*.log

# Exclude all .tfvars files, which are likely to contain sensitive data, such as
# password, private keys, and other secrets. These should not be part of version
# control as they are data points which are potentially sensitive and subject
# to change depending on the environment.
*.tfvars
*.tfvars.json

# Ignore override files as they are usually used to override resources locally and so
# are not checked in
override.tf
override.tf.json
*_override.tf
*_override.tf.json

# Include override files you do wish to add to version control using negated pattern
# !example_override.tf

# Include tfplan files to ignore the plan output of command: terraform plan -out=tfplan
# example: *tfplan*

# Ignore CLI configuration files
.terraformrc
terraform.rc
81 changes: 81 additions & 0 deletions terraform/README.md
@@ -0,0 +1,81 @@
# Deploy KAITO on AKS using Terraform

This sample shows how to deploy open-source KAITO on a new Azure Kubernetes Service (AKS) cluster using Terraform. It deploys the following resources:

- Azure Kubernetes Service (AKS) cluster
- Azure Container Registry (ACR) with a short-lived, repository-scoped token
- Azure Managed Identity with a federated credential and role assignment for the GPU Provisioner
- KAITO GPU Provisioner Helm chart
- KAITO Workspace Helm chart
- Kubernetes Secret for the ACR token

## Prerequisites

- Terraform 1.9.7 or later
- Azure CLI 2.65.0 or later
- kubectl 1.30.5 or later
- Helm 3.16.2 or later

## Setup

To deploy this sample, use the Azure CLI to log in to your Azure account and select a subscription, then use the Terraform CLI to provision the Azure resources and run the Helm installations for the KAITO operators.

Log in to your Azure account and set the subscription you want to use.

```bash
az login
az account set -s <subscription-id>
```

Export the subscription ID for Terraform to use.

```bash
export ARM_SUBSCRIPTION_ID=$(az account show --query id -o tsv)
```
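Version 4 of the `azurerm` provider, which this sample pins in `main.tf`, expects the subscription ID to be supplied explicitly, and the `ARM_SUBSCRIPTION_ID` environment variable is one way to do that. A quick sanity check before running Terraform might look like the sketch below. The function name `check_subscription_id` is hypothetical, and the check only confirms that the value is shaped like a GUID, not that the subscription exists.

```bash
# Succeeds only if ARM_SUBSCRIPTION_ID is set and shaped like a GUID.
# (check_subscription_id is an illustrative helper, not part of the sample.)
check_subscription_id() {
  if printf '%s' "${ARM_SUBSCRIPTION_ID:-}" |
    grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'; then
    echo "ARM_SUBSCRIPTION_ID looks valid"
  else
    echo "ARM_SUBSCRIPTION_ID is missing or malformed" >&2
    return 1
  fi
}
```

Calling `check_subscription_id` right after the `export` catches the case where `az account show` returned nothing, for example because the login step was skipped.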

Initialize the Terraform providers.

```bash
terraform init
```

> [!NOTE]
> The following variables in the [variables.tf](./variables.tf) file are available for customization:
>
> - `location` - The Azure region in which to deploy the resources. Make sure you have the necessary quota in that region.
> - `kaito_gpu_provisioner_version` - The version of the KAITO GPU Provisioner.
> - `kaito_workspace_version` - The version of the KAITO Workspace.
> - `registry_repository_name` - The name of the container registry repository that the ACR token is scoped to.

Run the Terraform apply command and enter `yes` when prompted to deploy the Azure resources.

```bash
terraform apply
```
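The defaults in `variables.tf` can be overridden without editing that file. One option is a `terraform.tfvars` file in this directory, which Terraform loads automatically and which the `.gitignore` in this change keeps out of version control. The sketch below writes such a file before `terraform apply` is run. The helper name `write_tfvars` and the region `eastus` are illustrative assumptions.

```bash
# Write a tfvars file that overrides two of the sample's variables.
# $1: output path, $2: Azure region, $3: KAITO workspace chart version
# (write_tfvars is an illustrative helper, not part of the sample.)
write_tfvars() {
  cat > "$1" <<EOF
location                = "$2"
kaito_workspace_version = "$3"
EOF
}

# Example with illustrative values:
# write_tfvars terraform.tfvars eastus 0.3.1
```

The same values could instead be passed as `TF_VAR_location` and `TF_VAR_kaito_workspace_version` environment variables, which suits CI jobs where writing files is inconvenient.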

Download credentials for the AKS cluster into your kubeconfig.

```bash
az aks get-credentials -g $(terraform output -raw rg_name) -n $(terraform output -raw aks_name)
```
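The one-liner above passes empty arguments to `az` if either `terraform output` call fails, for example when it is run outside the `terraform` directory. A slightly more defensive variant, sketched here under the hypothetical name `get_kaito_credentials`, reads each output first and stops early on failure.

```bash
# Fetch kubeconfig credentials for the cluster created by this sample.
# Stops early if either Terraform output cannot be read.
# (get_kaito_credentials is an illustrative helper, not part of the sample.)
get_kaito_credentials() {
  rg_name="$(terraform output -raw rg_name)" || return 1
  aks_name="$(terraform output -raw aks_name)" || return 1
  az aks get-credentials -g "$rg_name" -n "$aks_name"
}
```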

Verify installation of the KAITO operators.

```bash
helm list -n gpu-provisioner
helm list -n kaito-workspace
```
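Because `kaito.tf` gives each Helm release the same name as its namespace, the check can also be scripted. The sketch below uses `helm status`, which exits with a non-zero code when the release does not exist. The function name `check_kaito_releases` is hypothetical, and the check only confirms that each release is present, not that it is healthy.

```bash
# Report whether both KAITO Helm releases exist in their namespaces.
# (check_kaito_releases is an illustrative helper, not part of the sample.)
check_kaito_releases() {
  for release in gpu-provisioner kaito-workspace; do
    # Release name and namespace are identical in this sample.
    if helm status "$release" -n "$release" >/dev/null 2>&1; then
      echo "release $release found"
    else
      echo "release $release not found" >&2
      return 1
    fi
  done
}
```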

Check the status of the KAITO pods.

```bash
kubectl get po -n gpu-provisioner
kubectl get po -n kaito-workspace
```
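Instead of polling `kubectl get po` by hand, `kubectl wait` can block until the pods report `Ready`. The following is a sketch only: the function name `wait_for_kaito_pods` and the 300-second default timeout are assumptions, and `kubectl wait` returns an error if a namespace contains no pods yet.

```bash
# Block until every pod in both KAITO namespaces is Ready, or time out.
# $1: optional timeout passed to kubectl (defaults to 300s)
# (wait_for_kaito_pods is an illustrative helper, not part of the sample.)
wait_for_kaito_pods() {
  wait_timeout="${1:-300s}"
  for ns in gpu-provisioner kaito-workspace; do
    kubectl wait --for=condition=Ready pod --all -n "$ns" \
      --timeout="$wait_timeout" || return 1
  done
}
```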

## Cleanup

Run the Terraform destroy command and enter `yes` when prompted to delete the Azure resources.

```bash
terraform destroy
```
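`terraform destroy` removes the Azure resources but leaves the entries that `az aks get-credentials` added to the local kubeconfig. The sketch below captures the cluster name before destroying, because the outputs are no longer available afterwards, and then removes the stale entries. It assumes the context and cluster entries are named after the AKS cluster, which is believed to be the Azure CLI default. The function name `destroy_and_clean_kubeconfig` is hypothetical.

```bash
# Destroy the sample's resources, then drop the stale kubeconfig entries.
# Assumes the kubeconfig context and cluster are named after the AKS cluster.
# (destroy_and_clean_kubeconfig is an illustrative helper, not part of the sample.)
destroy_and_clean_kubeconfig() {
  # Read the name first: outputs disappear once the state is destroyed.
  aks_name="$(terraform output -raw aks_name)" || return 1
  terraform destroy || return 1
  kubectl config delete-context "$aks_name"
  kubectl config delete-cluster "$aks_name"
}
```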
20 changes: 20 additions & 0 deletions terraform/gpu-provisioner-values.tmpl
@@ -0,0 +1,20 @@
controller:
  env:
    - name: ARM_SUBSCRIPTION_ID
      value: ${AZURE_SUBSCRIPTION_ID}
    - name: LOCATION
      value: ${LOCATION}
    - name: AZURE_CLUSTER_NAME
      value: ${AKS_NAME}
    - name: AZURE_NODE_RESOURCE_GROUP
      value: ${AKS_NRG_NAME}
    - name: ARM_RESOURCE_GROUP
      value: ${RG_NAME}
    - name: LEADER_ELECT
      value: "false"
workloadIdentity:
  clientId: ${KAITO_IDENTITY_CLIENT_ID}
  tenantId: ${AZURE_TENANT_ID}
settings:
  azure:
    clusterName: ${AKS_NAME}
74 changes: 74 additions & 0 deletions terraform/kaito.tf
@@ -0,0 +1,74 @@
# Create managed identity that the gpu-provisioner will use to interact with Azure
resource "azurerm_user_assigned_identity" "kaito" {
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
  name                = "kaitoprovisioner"
}

# Grant the managed identity the Contributor role to create new AKS nodes
resource "azurerm_role_assignment" "kaito_aks_contributor" {
  principal_id                     = azurerm_user_assigned_identity.kaito.principal_id
  scope                            = azurerm_kubernetes_cluster.example.id
  role_definition_name             = "Contributor"
  skip_service_principal_aad_check = true
}

# Create a federated identity credential for the managed identity to be used by the gpu-provisioner via workload identity
resource "azurerm_federated_identity_credential" "kaito" {
  resource_group_name = azurerm_resource_group.example.name
  parent_id           = azurerm_user_assigned_identity.kaito.id
  name                = "kaitoprovisioner"
  issuer              = azurerm_kubernetes_cluster.example.oidc_issuer_url
  audience            = ["api://AzureADTokenExchange"]
  subject             = "system:serviceaccount:gpu-provisioner:gpu-provisioner"
}

# Install the gpu-provisioner chart
resource "helm_release" "gpu_provisioner" {
  name             = "gpu-provisioner"
  chart            = "https://raw.githubusercontent.com/Azure/kaito/refs/heads/gh-pages/charts/kaito/gpu-provisioner-${var.kaito_gpu_provisioner_version}.tgz"
  namespace        = "gpu-provisioner"
  create_namespace = true

  values = [
    templatefile("${path.module}/gpu-provisioner-values.tmpl",
      {
        AZURE_TENANT_ID          = data.azurerm_client_config.current.tenant_id
        AZURE_SUBSCRIPTION_ID    = data.azurerm_client_config.current.subscription_id
        RG_NAME                  = azurerm_resource_group.example.name
        LOCATION                 = azurerm_resource_group.example.location
        AKS_NAME                 = azurerm_kubernetes_cluster.example.name
        AKS_NRG_NAME             = azurerm_kubernetes_cluster.example.node_resource_group
        KAITO_IDENTITY_CLIENT_ID = azurerm_user_assigned_identity.kaito.client_id
      }
    )
  ]
}

# Install the kaito-workspace chart
resource "helm_release" "kaito_workspace" {
  name             = "kaito-workspace"
  chart            = "https://raw.githubusercontent.com/Azure/kaito/refs/heads/gh-pages/charts/kaito/workspace-${var.kaito_workspace_version}.tgz"
  namespace        = "kaito-workspace"
  create_namespace = true
}

# Create a secret to store the Azure Container Registry credentials for the workspace to refer to when pushing and pulling images from the registry
resource "kubernetes_secret" "example" {
  metadata {
    name = "myregistrysecret"
  }

  type = "kubernetes.io/dockerconfigjson"

  data = {
    ".dockerconfigjson" = jsonencode({
      auths = {
        "${azurerm_container_registry.example.login_server}" = {
          "username" = azurerm_container_registry_token.example.name
          # password1 is a block; the generated secret is exposed through its `value` attribute
          "password" = azurerm_container_registry_token_password.example.password1[0].value
        }
      }
    })
  }
}
31 changes: 31 additions & 0 deletions terraform/kubernetes.tf
@@ -0,0 +1,31 @@
resource "azurerm_kubernetes_cluster" "example" {
  resource_group_name       = azurerm_resource_group.example.name
  location                  = azurerm_resource_group.example.location
  name                      = "aks-${local.random_name}"
  dns_prefix                = "aks-${local.random_name}"
  oidc_issuer_enabled       = true
  workload_identity_enabled = true

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D2_v2"

    upgrade_settings {
      drain_timeout_in_minutes      = 0
      max_surge                     = "10%"
      node_soak_duration_in_minutes = 0
    }
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_role_assignment" "aks_acr_pull" {
  principal_id                     = azurerm_kubernetes_cluster.example.kubelet_identity[0].object_id
  scope                            = azurerm_container_registry.example.id
  role_definition_name             = "AcrPull"
  skip_service_principal_aad_check = true
}
3 changes: 3 additions & 0 deletions terraform/locals.tf
@@ -0,0 +1,3 @@
locals {
  random_name = "kaitodemo${random_integer.example.result}"
}
63 changes: 63 additions & 0 deletions terraform/main.tf
@@ -0,0 +1,63 @@
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=4.5.0"
    }

    random = {
      source  = "hashicorp/random"
      version = "=3.6.3"
    }

    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "=2.33.0"
    }

    helm = {
      source  = "hashicorp/helm"
      version = "=2.16.1"
    }
  }
}

provider "azurerm" {
  features {
    resource_group {
      prevent_deletion_if_contains_resources = false
    }
  }
}

provider "kubernetes" {
  host                   = azurerm_kubernetes_cluster.example.kube_config.0.host
  username               = azurerm_kubernetes_cluster.example.kube_config.0.username
  password               = azurerm_kubernetes_cluster.example.kube_config.0.password
  client_certificate     = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.client_certificate)
  client_key             = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.client_key)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.cluster_ca_certificate)
}

provider "helm" {
  kubernetes {
    host                   = azurerm_kubernetes_cluster.example.kube_config.0.host
    username               = azurerm_kubernetes_cluster.example.kube_config.0.username
    password               = azurerm_kubernetes_cluster.example.kube_config.0.password
    client_certificate     = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.client_certificate)
    client_key             = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.client_key)
    cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.cluster_ca_certificate)
  }
}

data "azurerm_client_config" "current" {}

resource "random_integer" "example" {
  min = 10
  max = 99
}

resource "azurerm_resource_group" "example" {
  name     = "rg-${local.random_name}"
  location = var.location
}
7 changes: 7 additions & 0 deletions terraform/outputs.tf
@@ -0,0 +1,7 @@
output "rg_name" {
  value = azurerm_resource_group.example.name
}

output "aks_name" {
  value = azurerm_kubernetes_cluster.example.name
}
34 changes: 34 additions & 0 deletions terraform/registry.tf
@@ -0,0 +1,34 @@
resource "azurerm_container_registry" "example" {
  resource_group_name    = azurerm_resource_group.example.name
  location               = azurerm_resource_group.example.location
  name                   = "acr${local.random_name}"
  sku                    = "Standard"
  admin_enabled          = false
  anonymous_pull_enabled = false
}

resource "azurerm_container_registry_scope_map" "example" {
  name                    = "default"
  container_registry_name = azurerm_container_registry.example.name
  resource_group_name     = azurerm_resource_group.example.name

  actions = [
    "repositories/${var.registry_repository_name}/content/read",
    "repositories/${var.registry_repository_name}/content/write"
  ]
}

resource "azurerm_container_registry_token" "example" {
  name                    = "default"
  container_registry_name = azurerm_container_registry.example.name
  resource_group_name     = azurerm_resource_group.example.name
  scope_map_id            = azurerm_container_registry_scope_map.example.id
}

resource "azurerm_container_registry_token_password" "example" {
  container_registry_token_id = azurerm_container_registry_token.example.id

  password1 {
    expiry = timeadd(timestamp(), "168h") # 7 days
  }
}
23 changes: 23 additions & 0 deletions terraform/variables.tf
@@ -0,0 +1,23 @@
variable "location" {
  type        = string
  default     = "brazilsouth"
  description = "Azure region in which to deploy the resources"
}

variable "kaito_gpu_provisioner_version" {
  type        = string
  default     = "0.2.0"
  description = "kaito gpu provisioner version"
}

variable "kaito_workspace_version" {
  type        = string
  default     = "0.3.1"
  description = "kaito workspace version"
}

variable "registry_repository_name" {
  type        = string
  default     = "fine-tuned-adapters/kubernetes"
  description = "container registry repository name"
}