VC-36510: Key Pair and Venafi Connection modes now use gzip compression #594

Merged: 2 commits from VC-36510-compress-requests into master on Oct 28, 2024

Conversation

@maelvls (Member) commented Oct 15, 2024

Summary:

  • This PR is a remediation for the 413s returned by our NGINX gateway because the request size limit was set too low (somewhere around 100 MB). See VC-36510 for more details.
  • On 26 September, as a first attempt at remediating this, we bumped NGINX's client_max_body_size from 256 MB to 1024 MB, and client_body_buffer_size from 128 MB to 750 MB (GitLab MR, message).
    • Before this fix, NGINX would return a 413 as soon as the Content-Length exceeded 256 MB. With the new settings, when the request body exceeds 750 MB, the body is written to a temporary file on disk rather than buffered in memory.
    • Since this change, NGINX hasn't returned any 413s, even under stress testing.
  • Side note: buffering in NGINX is still happening. Keeping buffering on puts memory and disk pressure on NGINX, and I don't see a reason to buffer these requests. The setting in question is proxy_request_buffering, which is on by default.
  • The Venafi Control Plane API doesn't publicly state that GZIP compression is supported; I have tested it and it works, but we should probably talk to Prod Engineering.
  • Compression hasn't been added for Standalone (the Jetstack Secure OAuth and Jetstack Secure API token modes). In this PR, I only added compression for the Key Pair and Venafi Connection modes.
  • Compression aims to lower the load on our intermediate NGINX (because it buffers requests) as well as lower the egress cost for customers. Sending over 100 MB over the wire looked somewhat bad; enabling compression will improve that (a minimal sketch of the idea is shown after this list).
  • To ease any pain caused by adding compression, we decided to add --disable-compression and let the Support team tell customers about it. I'm also in favor of keeping --disable-compression for development purposes (e.g., when using mitmproxy).
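For illustration, here is a minimal sketch of what enabling compression means in practice. This is not the exact code in this PR; the helper name newGzipRequest and its signature are hypothetical. The idea is simply to gzip the JSON payload and set the Content-Encoding: gzip header, unless --disable-compression is passed:

package main

import (
	"bytes"
	"compress/gzip"
	"context"
	"net/http"
)

// newGzipRequest is a hypothetical helper sketching the approach: compress the
// JSON payload with gzip and advertise it via the Content-Encoding header,
// unless compression has been disabled with --disable-compression.
func newGzipRequest(ctx context.Context, url string, payload []byte, disableCompression bool) (*http.Request, error) {
	body := payload
	if !disableCompression {
		var buf bytes.Buffer
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write(payload); err != nil {
			return nil, err
		}
		if err := zw.Close(); err != nil {
			return nil, err
		}
		body = buf.Bytes()
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if !disableCompression {
		req.Header.Set("Content-Encoding", "gzip")
	}
	return req, nil
}

func main() {}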

Testing

  • Test 1: (manual) test that the compression works in Venafi Connection mode by using ./hack/e2e/test.sh. (see results below)
  • Test 2: (manual) test that the compression works in Key Pair mode. (see results below)
  • Test 3: test that the headers are correctly set in Key Pair mode. → unit test written (a sketch of what these header tests assert is shown after this list).
  • Test 4: test that the headers are correctly set in Venafi Connection mode. → unit test written.
  • Test 5: test that no compression takes place when --disable-compression is used in Key Pair mode. → unit test written.
  • Test 6: test that no compression takes place when --disable-compression is used in Venafi Connection mode. → unit test written.
  • Test 7: (manual) test that the data is actually in the S3 bucket. @tfadeyi will check this.
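For reference, the header unit tests (Tests 3-6) assert roughly the following. This is a sketch, not the exact tests added in this PR; it posts a gzip-compressed body to a test server and checks the Content-Encoding header and the payload round-trip:

package main

import (
	"bytes"
	"compress/gzip"
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestGzipRequestHeaders is a sketch of Tests 3-6 (not the exact tests in this
// PR): it sends a gzip-compressed body to a test server and asserts that the
// Content-Encoding header is set and the payload decompresses correctly.
func TestGzipRequestHeaders(t *testing.T) {
	payload := []byte(`{"cluster_id":"kind-mael"}`)

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if got := r.Header.Get("Content-Encoding"); got != "gzip" {
			t.Errorf("expected Content-Encoding: gzip, got %q", got)
		}
		zr, err := gzip.NewReader(r.Body)
		if err != nil {
			t.Fatalf("body is not valid gzip: %v", err)
		}
		decompressed, err := io.ReadAll(zr)
		if err != nil {
			t.Fatalf("failed to decompress body: %v", err)
		}
		if !bytes.Equal(decompressed, payload) {
			t.Errorf("unexpected body after decompression: %s", decompressed)
		}
	}))
	defer srv.Close()

	// Compress the payload the same way the agent would.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		t.Fatal(err)
	}
	if err := zw.Close(); err != nil {
		t.Fatal(err)
	}

	req, err := http.NewRequest(http.MethodPost, srv.URL, &buf)
	if err != nil {
		t.Fatal(err)
	}
	req.Header.Set("Content-Encoding", "gzip")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		t.Fatal(err)
	}
	resp.Body.Close()
}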

Rollout Plan:

  • If Olu or somebody in Cloud Services is taking over the feature and releasing v1.2.0, they will assess how confident they are about releasing. Although this feature is important, it isn't critical, so they might wait until Richard is back (29th Oct) or Mael is back (4th Nov).
  • Release v1.2.0 for all customers.
  • venctl will not immediately get v1.2.0 for the reason explained in the #venctl channel. You can contact @SgtCoDFish or @tfadeyi to learn more.
  • Let the Support Team (@hawksight) know that this change might affect customers, and that they can tell them to use --disable-compression if a problem occurs.

Before: the request body weighs 49 MB:

POST https://api.venafi.cloud/v1/tlspk/upload/clusterdata/no?description=S2luZCBjbHVzdGVyIG9uIE1hZWwmIzM5O3MgQW9ydXMgaG9tZSBtYWNoaW5l&name=kind-mael HTTP/2.0
accept: application/json
content-type: application/json

authorization: Bearer REDACTED
content-length: 49000269
accept-encoding: gzip
user-agent: Go-http-client/2.0

{"agent_metadata":{...

HTTP/2.0 200 
date: Tue, 15 Oct 2024 08:42:45 GMT
content-type: application/json
www-authenticate: Bearer realm="Echo"
vary: Accept-Encoding
server: envoy
content-length: 70

{"status":"ok","organization":"756db001-280e-11ee-84fb-991f3177e2d0"}

After: the request body weighs 5 MB:

POST https://api.venafi.cloud/v1/tlspk/upload/clusterdata/no?description=S2luZCBjbHVzdGVyIG9uIE1hZWwmIzM5O3MgQW9ydXMgaG9tZSBtYWNoaW5l&name=kind-mael HTTP/2.0
authorization: Bearer REDACTED
accept: application/json
content-type: application/json
content-encoding: gzip
content-length: 5125690
user-agent: Go-http-client/2.0

�\x8b�������\xff̽ͮh\xb9\x8d...

HTTP/2.0 200 
date: Tue, 15 Oct 2024 08:42:45 GMT
content-type: application/json
www-authenticate: Bearer realm="Echo"
vary: Accept-Encoding
server: envoy
content-length: 70

{"status":"ok","organization":"756db001-280e-11ee-84fb-991f3177e2d0"}

^ note that the above request was initially missing the Content-Encoding: gzip header. I fixed this in 22def46.

(see the Manual Tests section below for how to reproduce this)

Manual Tests

Test 1

For this test, I've used the tenant https://ven-tlspk.venafi.cloud/. To access the API key, log in as the user system.admin@tlspk.qa.venafi.io; the password is visible on the Production Accounts page (private to Venafi). Then go to the settings, find the API key, and set it in the env var APIKEY.

The tenant https://ven-tlspk.venafi.cloud/ doesn't have the right tier to pull images, so I use an API key from the tenant https://glow-in-the-dark.venafi.cloud/; that's why I set the env var APIKEY_GLOW_IN_THE_DARK. Ask Atanas for access to the tenant https://glow-in-the-dark.venafi.cloud/.

export APIKEY=...
export APIKEY_GLOW_IN_THE_DARK=...

Then:

export \
 OCI_BASE=ttl.sh/maelvls \
 VEN_API_KEY=$APIKEY \
 VEN_API_KEY_PULL=$APIKEY_GLOW_IN_THE_DARK \
 VEN_API_HOST=api.venafi.cloud \
 VEN_VCP_REGION=us \
 VEN_ZONE='tlspk-bench\Default' \
 CLOUDSDK_CORE_PROJECT=jetstack-mael-valais \
 CLOUDSDK_COMPUTE_ZONE=europe-west1-b \
 CLUSTER_NAME=test-secretless

Finally, run the smoke test Venafi Connection mode:

./hack/e2e/test.sh

Result:

2024/10/18 19:18:35 Posting data to:
2024/10/18 19:18:36 Data sent successfully.

Test 2

For this test, I've decided to use mitmproxy to see exactly what is being sent on the wire.

First, copy the following script to a file called bigjsonbody.go (written in Go because bash is super slow):

///usr/bin/true; exec /usr/bin/env go run "$0" "$@"
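// The line above lets this file be executed directly as a script: with no
// shebang, the shell interprets the file, runs /usr/bin/true (a no-op), and
// then re-executes the file with `go run`; the Go compiler simply treats the
// line as a comment.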

package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"time"

	"github.com/google/uuid"
)

var (
	count = flag.Int("count", 5000, "Number of items to generate")
)

type Metadata struct {
	Name              string            `json:"name"`
	Namespace         string            `json:"namespace"`
	UID               string            `json:"uid"`
	CreationTimestamp string            `json:"creationTimestamp"`
	Labels            map[string]string `json:"labels"`
}

type Resource struct {
	Kind       string   `json:"kind"`
	APIVersion string   `json:"apiVersion"`
	Metadata   Metadata `json:"metadata"`
}

type Item struct {
	Resource Resource `json:"resource"`
}

type Data struct {
	Items []Item `json:"items"`
}

type Output struct {
	ClusterID    string    `json:"cluster_id"`
	DataGatherer string    `json:"data-gatherer"`
	Timestamp    time.Time `json:"timestamp"`
	Data         Data      `json:"data"`
}

func generateItem() Item {
	randomUID := uuid.New().String()
	return Item{
		Resource: Resource{
			Kind:       "Pod",
			APIVersion: "v1",
			Metadata: Metadata{
				Name:              "test-" + randomUID,
				Namespace:         "test",
				UID:               randomUID,
				CreationTimestamp: "2024-09-20T18:03:51Z",
				Labels: map[string]string{
					"k8s-app": "test",
				},
			},
		},
	}
}

func main() {
	flag.Parse()
	items := make([]Item, *count)

	for i := 0; i < *count; i++ {
		items[i] = generateItem()
	}

	output := Output{
		ClusterID:    "mael temp",
		DataGatherer: "k8s/pods",
		Timestamp:    time.Date(2024, time.September, 30, 16, 47, 32, 0, time.FixedZone("CET", 2*3600)),
		Data:         Data{Items: items},
	}

	outputJSON, err := json.MarshalIndent(output, "", "  ")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	fmt.Println(string(outputJSON))
}

Then:

chmod +x bigjsonbody.go

Finally:

$ ./bigjsonbody.go --count=200000 >input-data.json
$ du -h input-data.json
87M    input-data.json

For this test, I've used the tenant https://ven-tlspk.venafi.cloud/. To access the API key, log in as the user system.admin@tlspk.qa.venafi.io; the password is visible on the Production Accounts page (private to Venafi). Then go to the settings, find the API key, and set it as an env var:

export APIKEY=...

Create the service account and key pair:

venctl iam service-account agent create --name "$USER temp" \
  --vcp-region US \
  --output json \
  --owning-team $(curl -sS https://api.venafi.cloud/v1/teams -H "tppl-api-key: $APIKEY" | jq '.teams[0].id') \
  --output-file /tmp/agent-credentials.json \
  --api-key $APIKEY

Now, make sure the line 127.0.0.1 me is present in your /etc/hosts.

Then, run mitmproxy with:

curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/watch-stream.py >/tmp/watch-stream.py
mitmproxy --mode regular@9090 --ssl-insecure -s /tmp/watch-stream.py --set client_certs=$(kubectl incluster --print-client-cert >/tmp/me.pem && echo /tmp/me.pem)

Run this:

cat <<EOF >minimal-config.yaml
cluster_id: "kind-mael"
cluster_description: "Kind cluster on Mael's machine"
server: "https://api.venafi.cloud/"
venafi-cloud:
  uploader_id: "no"
  upload_path: "/v1/tlspk/upload/clusterdata"
data-gatherers: []
period: 3m
EOF

Finally, run the Agent with:

go install github.com/maelvls/kubectl-incluster@latest
export HTTPS_PROXY=http://localhost:9090 KUBECONFIG=/tmp/kube && KUBECONFIG= HTTPS_PROXY= kubectl incluster --replace-ca-cert ~/.mitmproxy/mitmproxy-ca-cert.pem --sa=venafi/venafi-kubernetes-agent | sed 's|127.0.0.1|me|' >/tmp/kube

go run . agent -c minimal-config.yaml \
  --client-id $(jq -r .client_id /tmp/agent-credentials.json) \
  --private-key-path <(jq -r .private_key /tmp/agent-credentials.json) \
  --install-namespace=venafi \
  --input-path=input-data.json \
  --one-shot

Result:

2024/10/18 19:18:35 Posting data to: https://api.venafi.cloud/
2024/10/18 19:18:36 Data sent successfully.

@maelvls (Member, Author) commented Oct 15, 2024

By the way, I don't understand why we have set a client timeout of 1 minute in Key Pair service account mode... Timeouts are a good practice in most cases, but I don't think this is one of those cases:

Client: &http.Client{Timeout: time.Minute},

I also noted that the Venafi Kubernetes Agent's User-Agent header is wrong:

user-agent: Go-http-client/2.0

It should be something like:

user-agent: venafi-kubernetes-agent/1.1.0
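A minimal sketch of how that could be fixed, assuming a hypothetical userAgentTransport wrapper (not part of this PR):

package main

import "net/http"

// userAgentTransport is a hypothetical http.RoundTripper wrapper that stamps
// every outgoing request with the agent's own version string instead of the
// default "Go-http-client/2.0".
type userAgentTransport struct {
	base http.RoundTripper
}

func (t *userAgentTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req = req.Clone(req.Context())
	req.Header.Set("User-Agent", "venafi-kubernetes-agent/1.1.0")
	return t.base.RoundTrip(req)
}

func main() {
	// Example: an HTTP client that uses the wrapper.
	client := &http.Client{Transport: &userAgentTransport{base: http.DefaultTransport}}
	_ = client
}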

@maelvls marked this pull request as ready for review October 15, 2024 16:35
@maelvls requested a review from wallrj October 15, 2024 16:36
@wallrj (Member) requested changes and left a comment Oct 16, 2024:

  1. Add some comments in the code, explaining why the collected data is compressed using GZIP... to reduce the likelihood of exceeding the maximum request body size limit imposed by the server or by load balancers and proxies in between.
  2. Show evidence that this still works with the old Jetstack TLSPK API server (does it handle requests with Content-Encoding: gzip?), or explain why that is not important.
  3. Did you consider adding a flag to allow users to turn off the gzip content encoding, in case it doesn't work with a corporate proxy?
  4. Or did you consider checking for the HTTP server error 415 Unsupported Media Type and resending the request uncompressed? (A sketch of this idea is shown after this list.)
  5. Does the API documentation explain that the API supports gzip-encoded requests?
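For reference, the fallback described in point 4 might look roughly like this. This is a hypothetical sketch; as discussed below, this approach was not adopted in this PR:

package main

import (
	"bytes"
	"net/http"
)

// postWithFallback sketches the suggestion from point 4 (not adopted): send
// the gzip-compressed body first, and if the server answers 415 Unsupported
// Media Type, resend the original body uncompressed.
func postWithFallback(client *http.Client, url string, compressed, plain []byte) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Encoding", "gzip")

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusUnsupportedMediaType {
		return resp, nil
	}
	resp.Body.Close()

	// Retry without compression.
	retry, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(plain))
	if err != nil {
		return nil, err
	}
	return client.Do(retry)
}

func main() {}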

@maelvls (Member, Author) commented Oct 17, 2024

Add some comments in the code, explaining why the collected data is compressed
using GZIP....to reduce the likelihood of exceeding the maximum request body
size limit imposed by the server or by loadbalancers and proxies in between.

Good point, I forgot to explain why Content-Encoding was added to the Key Pair
mode and the VenafiConnection mode, but not the Jetstack Secure OAuth mode nor
the Jetstack Secure API Token mode.

I've added a few comments in 22def46 to explain that.

Show evidence that this still works with the old Jetstack TLSPK API server
(does it handle requests with Content-Encoding: gzip?), or explain why that is
not important.

I haven't added Content-Encoding: gzip to the Jetstack Secure OAuth mode or
the Jetstack Secure API Token mode. I skipped these two modes because I have
heard that the 413 Entity Too Large error was only happening with the VCP API.
I don't think it's worth changing the way data is pushed to the Jetstack
Secure API ("Standalone") when we found no issue with it.

Did you consider adding a flag to allow users to turn off the gzip content
encoding, in case it doesn't work with a corporate proxy?

Good question... I haven't thought about that. I've looked around and found one
instance of a person struggling with using IIS as a reverse proxy in front of
SABnzbd; the problem seemed to be that SABnzbd's responses are GZIP-encoded; IIS
would respond "406 Not Acceptable":
sabnzbd/sabnzbd#915.

I haven't found other evidence of people struggling with GZIP-encoded requests
being a problem in proxies. I conclude that the large majority of transparent
proxies, HTTP proxies, SOCKS proxies, and the like found in companies handle
GZIP-encoded requests well.

I don't think there is a need to add a flag to disable GZIP encoding.

Did you consider checking for HTTP server error 415 Unsupported Media Type
and resending the request uncompressed?

I haven't considered this. It seems like some work to add this logic, and I'm
not sure how useful it will be.

Note: it looks like 415 Unsupported Media Type isn't the only possible error
that may show up: 406 Not Acceptable may also be one (as seen in the SABnzbd
issue I linked above).

Does the API documentation explain that the API supports gzip encoded
requests?

I had not thought of doing that. I just tried the Google search
site:developer.venafi.com gzip and didn't find anything; I also tried
site:docs.venafi.cloud gzip, and I didn't find anything either.

I also maintain ad-hoc internal documentation at
"API Documentation for /v1/tlspk/upload/clusterdata"; nothing there either.

I conclude that the VCP API doesn't officially support gzip-encoded requests... should I let the Production Engineering team know about this? I'm not sure what to do about it.

@maelvls (Member, Author) commented Oct 17, 2024

I realize that this PR is trickier than anticipated. @j-fuentes said "if it is low hanging fruit, just enable compression; if it is not low hanging fruit, discuss further actions". I'll have a chat with Jose to decide what to do next.

Update: we have decided to continue even though it is a little more difficult than anticipated. We have decided to go with --disable-compression instead of retrying in plain text on 415 errors (if they ever happen).

@maelvls force-pushed the VC-36510-compress-requests branch from 22def46 to fa29555 October 18, 2024 19:01
@maelvls (Member, Author) commented Oct 18, 2024

@tfadeyi I'm done with the PR! I have added the necessary unit tests and have run a smoke test in Venafi Connection mode as well as in Key Pair mode. I am confident that this will work and will not break customers. You can proceed with the review and with merging and releasing v1.2.0 once you feel confident.

Before releasing, can you proceed with the "Test 7"? I've sent data to the tenant https://ven-tlspk.venafi.cloud/ at around 2024/10/18 19:21:37 UTC, can you check that the bucket contains the data?

For context, we have added gzip compression for the request bodies of
data sent to the Venafi Control Plane API because we have hit 413 errors
in NGINX and want to reduce the network load caused by the Agent for
intermediate proxies and for our NGINX frontend (it buffers requests at
the moment).
@maelvls force-pushed the VC-36510-compress-requests branch from fa29555 to 179b748 October 18, 2024 19:42
@tfadeyi (Contributor) commented Oct 23, 2024

Before releasing, can you proceed with the "Test 7"? I've sent data to the tenant https://ven-tlspk.venafi.cloud/ at around 2024/10/18 19:21:37 UTC, can you check that the bucket contains the data?

I've tested end to end (pushing data from this branch of the agent and checking the blob storage) and the data is successfully uploaded to the backend, both with --disable-compression and without.
Note: the backend already compresses data before storing it.


I'm going to look into the failing unit tests.

When setting --install-namespace venafi I still see the following line at the start of the agent logs:

2024/10/23 13:57:21 error messages will not show in the pod's events because the POD_NAME environment variable is empty

Signed-off-by: Oluwole Fadeyi <oluwole.fadeyi@venafi.com>
@tfadeyi dismissed wallrj’s stale review October 28, 2024 09:48

Change request has been addressed

@tfadeyi merged commit 1f00f09 into master Oct 28, 2024
2 checks passed
@wallrj deleted the VC-36510-compress-requests branch October 29, 2024 16:10