Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VC-36444: Add a health check and liveness probe #580

Merged
merged 2 commits into from
Oct 3, 2024

Conversation

maelvls
Copy link
Member

@maelvls maelvls commented Oct 1, 2024

Ref: VC-36444

Why? Customers don’t like the fact that OpenShift shows warnings whenever they install Venafi Kubernetes Agent. These warnings say that this pod doesn’t have a liveness and a readiness configuration in the Pod specification, and that the best practice is to have one for each pod.

What? The idea isn’t to define what “ready” means in the agent. The agent doesn’t receive traffic, so it wouldn’t make sense to configure a readiness probe (readiness probes are used by Kubernetes to know when to start routing traffic to that pod). And regarding the liveness probe, although it could be useful to indicate that the agent isn’t able to push data to the API, defining a “real” liveness probe is out of scope for this ticket. Instead, the idea is to implement a “ping pong” liveness and readiness probe so that OpenShift customers stop seeing warnings.

Note to the reviewer: I've chosen to combine the metrics endpoint (/metrics) with the readiness and liveness probes because I don't think there will be problems with combining the three.

Testing

I've used the tenant https://ven-tlspk.venafi.cloud/. To access the API key, use the user system.admin@tlspk.qa.venafi.io and the password is visible in the page Production Accounts (private to Venafi). Then go to the settings and find the API key.

export APIKEY=...

Then, run:

venctl iam service-account agent create --name "$USER temp" \
  --vcp-region US \
  --output json \
  --owning-team $(curl -sS https://api.venafi.cloud/v1/teams -H "tppl-api-key: $APIKEY" | jq '.teams[0].id') \
  --output-file /tmp/agent-credentials.json \
  --api-key $APIKEY
make oci-push-preflight oci_preflight_image_name=ttl.sh/mael/venafi-agent oci_preflight_image_tag=v0.0.0-dev oci_platforms=linux/arm64
make helm-chart oci_preflight_image_tag=v0.0.0-dev helm_chart_version=0.0.0-dev oci_preflight_image_name=ttl.sh/mael/venafi-agent
helm push _bin/scratch/image/venafi-kubernetes-agent-0.0.0-dev.tgz oci://ttl.sh/mael/charts
helm upgrade -i -n venafi --create-namespace venafi-kubernetes-agent oci://ttl.sh/mael/charts/venafi-kubernetes-agent --version 0.0.0-dev \
  --set config.clusterName="$USER temp" --set config.clientId="$(jq -r .private_key /tmp/agent-credentials.json)"
kubectl create secret generic -n venafi agent-credentials --from-literal=privatekey.pem="$(jq -r .private_key /tmp/agent-credentials.json)" \
  --dry-run=client -o yaml | kubectl apply -f -

You should see that the pod now has a liveness and readiness probe:

$ k get -n venafi pod -l app.kubernetes.io/name=venafi-kubernetes-agent -oyaml | yq '.items[].spec.containers[] | with_entries(select(.key | test(".*Probe.*")))'  
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 8081
    scheme: HTTP
  initialDelaySeconds: 15
  periodSeconds: 20
  successThreshold: 1
  timeoutSeconds: 1
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /readyz
    port: 8081
    scheme: HTTP
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

To check that the server works, I used telepresence (can't rely on k proxy since it doesn't have a service, can't reach the pod IP since I'm on macOS)

curl -fL https://app.getambassador.io/download/tel2oss/releases/download/v2.19.1/telepresence-darwin-arm64 -o ~/bin/telepresence
telepresence helm install
telepresence connect -n venafi

Then:

$ IP=$(k get -n venafi pod -l app.kubernetes.io/name=venafi-kubernetes-agent -oyaml | yq '.items[].status.podIP')

$ curl -D/dev/stderr http://$IP:8081/healthz
HTTP/1.1 200 OK
Date: Tue, 01 Oct 2024 14:17:50 GMT
Content-Length: 0

$ curl -D/dev/stderr http://$IP:8081/readyz
HTTP/1.1 200 OK
Date: Tue, 01 Oct 2024 14:18:13 GMT
Content-Length: 0

The pod should be ready:

$ k get -n venafi pod -l app.kubernetes.io/name=venafi-kubernetes-agent
NAME                                       READY   STATUS    RESTARTS        AGE
venafi-kubernetes-agent-669d9bc4d4-cmtlh   1/1     Running   19 (9m8s ago)   4h2m

pkg/agent/run.go Outdated Show resolved Hide resolved
@inteon
Copy link
Contributor

inteon commented Oct 2, 2024

@maelvls do we also need the readiness probe?
The error message that we were seeing in OpenShift complains about missing health checks. I think that the liveness probe alone might be enough.
image

@maelvls
Copy link
Member Author

maelvls commented Oct 3, 2024

@maelvls do we also need the readiness probe? The error message that we were seeing in OpenShift complains about missing health checks. I think that the liveness probe alone might be enough.

You are right, we both need the readiness and liveness probe. I've already added both to the deployment, so we should be good to go.

@maelvls
Copy link
Member Author

maelvls commented Oct 3, 2024

By the way, how did you get this screenshot? Do you have an OpenShift cluster around? I'd be interested to test this new feature with a real OpenShift cluster.

@inteon
Copy link
Contributor

inteon commented Oct 3, 2024

By the way, how did you get this screenshot? Do you have an OpenShift cluster around? I'd be interested to test this new feature with a real OpenShift cluster.

I copied it from the Jira ticket 😄

@maelvls maelvls merged commit 21c0274 into master Oct 3, 2024
2 checks passed
@maelvls maelvls deleted the VC-36444-health-check branch October 3, 2024 08:23
@wallrj wallrj mentioned this pull request Nov 20, 2024
@wallrj
Copy link
Member

wallrj commented Nov 20, 2024

@maelvls do we also need the readiness probe? The error message that we were seeing in OpenShift complains about missing health checks. I think that the liveness probe alone might be enough.

You are right, we both need the readiness and liveness probe. I've already added both to the deployment, so we should be good to go.

Surely you were talking across purposes here. Tim said "I think that the liveness probe alone might be enough."

I do not think a readiness probe was required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants