Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(target-gh): Check asset MD5 hash by downloading the asset #328

Merged
merged 6 commits into from
Nov 18, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
87 changes: 58 additions & 29 deletions src/targets/github.ts
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ import { Octokit, RestEndpointMethodTypes } from '@octokit/rest';
import { RequestError } from '@octokit/request-error';
import { readFileSync, promises, statSync } from 'fs';
import { basename } from 'path';
import { createHash } from 'crypto';
import { BinaryLike, createHash } from 'crypto';

import { getConfiguration } from '../config';
import {
Expand Down Expand Up @@ -378,12 +378,33 @@ export class GithubTarget extends BaseTarget {
);
}

// XXX: This is a bit hacky as we rely on two things:
// 1. GitHub issuing a redirect to S3, where they store the artifacts,
// or at least pass those request headers unmodified to us
// 2. AWS S3 using the MD5 hash of the file for its ETag cache header
// when we issue a HEAD request.
const response = await this.github.request(`HEAD ${url}`, {
const remoteChecksum = await this.checksumFromUrl(url);
const localChecksum = this.checksumFromData(file);
if (localChecksum !== remoteChecksum) {
throw new Error(
`Uploaded asset checksum does not match local asset checksum for "${name} (${localChecksum} != ${remoteChecksum})`
);
}
uploadSpinner.succeed(`Uploaded asset "${name}".`);
return url;
} catch (e) {
uploadSpinner.fail(`Cannot upload asset "${name}".`);

throw e;
}
}

private async checksumFromUrl(url: string): Promise<string> {
// XXX: This is a bit hacky as we rely on various things:
// 1. GitHub issuing a redirect to AWS S3.
// 2. S3 using the MD5 hash of the file for its ETag cache header.
// 3. The file being small enough to fit in memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating MD5 hash from a streaming data is quite trivial in Node so I'd encourage you to revisit this approach in a follow up PR as it causes unnecessary memory usage. We may have rather large files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @iker-barriocanal and I talked extensively about this.

Trade-offs:

  • Octokit's request method seems to buffer the full response payload, so if we wanted to work with streams we'd need to depart from using request. I believe we're originally using Octokit in this code path to benefit from the different rate limits given to authenticated requests -- @BYK can you confirm that?
  • We don't work with streams when doing the upload. We read (with readSync!) full asset files from disk into memory before uploading. We use the same bytes in memory to calculate the "local checksum". So the assumption that files are small enough was implicitly already there.
  • If memory usage becomes a concern, then certainly we need to revisit this, but probably then not only the fallback download step.
  • We've filed issues here on GitHub and internal tracking tickets for future improvements. Perhaps none specifically about streaming -- now done in GitHub target: reduce memory consumption by streaming uploads/downloads #329

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Octokit's request method seems to buffer the full response payload, so if we wanted to work with streams we'd need to depart from using request. I believe we're originally using Octokit in this code path to benefit from the different rate limits given to authenticated requests

Octokit takes care of a bunch of things from authentication to rate limiting to parallelization to automatic retries. So I'd say stick with it as long as it can. If solving this issue requires a departure from Octokit, delay that as long as possible and file an issue with them ahead of time so they might be able to bring this into the library itself.

We don't work with streams when doing the upload. We read (with readSync!) full asset files from disk into memory before uploading. We use the same bytes in memory to calculate the "local checksum". So the assumption that files are small enough was implicitly already there.

Great catch. This is again due to Octokit limitations tho as it does not support streams or piping, only buffers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//
// Note that if assets are large (5GB) assumption 2 is not correct. See
// https://github.com/getsentry/craft/issues/322#issuecomment-964303174
let response;
try {
response = await this.github.request(`HEAD ${url}`, {
headers: {
// WARNING: You **MUST** pass this accept header otherwise you'll
// get a useless JSON API response back, instead of getting
Expand All @@ -395,32 +416,40 @@ export class GithubTarget extends BaseTarget {
Accept: DEFAULT_CONTENT_TYPE,
},
});
} catch (e) {
throw new Error(
`Cannot get asset on GitHub. Status: ${(e as any).status}\n` + e
);
}

const etag = response.headers['etag'];
if (etag && etag.length > 0) {
// ETag header comes in quotes for some reason so strip those
const remoteChecksum = response.headers['etag']?.slice(1, -1);
if (remoteChecksum) {
// XXX: This local checksum calculation only holds for simple cases:
// https://github.com/getsentry/craft/issues/322#issuecomment-964303174
const localChecksum = createHash('md5').update(file).digest('hex');
if (localChecksum !== remoteChecksum) {
logger.trace(`Checksum mismatch for "${name}"`, response.headers);
throw new Error(
`Uploaded asset MD5 checksum does not match local asset checksum for "${name} (${localChecksum} != ${remoteChecksum})`
);
}
uploadSpinner.succeed(`Uploaded asset "${name}".`);
} else {
logger.warn(`Response headers:`, response.headers);
uploadSpinner.succeed(
`Uploaded asset "${name}, but failed to find a remote checksum. Cannot verify uploaded file is as expected.`
);
}
return url;
} catch (e) {
uploadSpinner.fail(`Cannot upload asset "${name}".`);
return etag.slice(1, -1);
}

throw e;
return await this.md5FromUrl(url);
}

private async md5FromUrl(url: string): Promise<string> {
this.logger.debug('Downloading asset from GitHub to check MD5 hash: ', url);
let response;
try {
response = await this.github.request(`GET ${url}`, {
headers: {
Accept: DEFAULT_CONTENT_TYPE,
},
});
} catch (e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if catching the original exception and throwing a new one is actually good here as you obscure the original issue. In debug mode, we already log the response status etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BYK !!

In this case here, we were seeing something like "Failed: Not Found" mixed up in the logs. So the RequestError (the e) from Octokit didn't provide much context, and not knowing clearly if the problem was in HEAD (the original ETag method) or GET (the new download method added in this PR) was not a great prospect.

The original error is still printed along with a custom message. Do you think we could do better?

throw new Error(
`Cannot download asset from GitHub. Status: ${(e as any).status}\n` + e
chadwhitacre marked this conversation as resolved.
Show resolved Hide resolved
);
}
return this.checksumFromData(Buffer.from(response.data));
}

private checksumFromData(data: BinaryLike): string {
return createHash('md5').update(data).digest('hex');
}

private async handleGitHubUpload(
Expand Down