perf: avoid S3 HEAD requests when downloading job outputs #924
crowecawcaw wants to merge 4 commits into aws-deadline:mainline
Conversation
```python
# Provide a dummy etag to skip HEAD request if the method exists (added in s3transfer 0.6.0).
# For downloads from CAS, we don't need etag validation since files are content-addressed.
# Older s3transfer versions don't have this method, so we check before calling.
if hasattr(future.meta, "provide_object_etag"):
```
Is there a way to check this once and cache it (or do a version check of some sort), just to avoid looking it up on every on_queued call when it shouldn't be changing?
Each future is a distinct instance, though surely if one has this attribute, they all would. Do you think this check would be slow enough to warrant optimization though? There will be 1 future per file.
Yeah, I was wondering that as well. It might not be terribly noticeable, tbh. It seems a bit inefficient when every future will be the same, but maybe not enough to really care about a hash lookup per file.
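For illustration, a minimal sketch of the caching idea being discussed (the subscriber name here is hypothetical, and this assumes `provide_object_etag` lives on `TransferMeta`, as the diff's hasattr check suggests):

```python
from s3transfer.futures import TransferMeta
from s3transfer.subscribers import BaseSubscriber

# Checked once at import time rather than per future; the method was
# added in s3transfer 0.6.0, so older versions won't have it.
_SUPPORTS_PROVIDE_OBJECT_ETAG = hasattr(TransferMeta, "provide_object_etag")


class _EtagSkippingSubscriber(BaseSubscriber):  # hypothetical name for this sketch
    def on_queued(self, future, **kwargs):
        if _SUPPORTS_PROVIDE_OBJECT_ETAG:
            # Content-addressed files don't need etag validation.
            future.meta.provide_object_etag("dummy-etag")
```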
```python
# For downloads from CAS, we don't need etag validation since files are content-addressed.
# Older s3transfer versions don't have this method, so we check before calling.
if hasattr(future.meta, "provide_object_etag"):
    future.meta.provide_object_etag("dummy-etag")
```
Setting this value skips HeadObject requests? Can we link to docs for reference in the code comment above?
The s3transfer library actually does not have documentation: https://github.com/boto/s3transfer/
This behavior was discovered in the Python version matrix testing. In lieu of documentation, this change has unit tests which verify the behavior we want (i.e. not calling HEAD).
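As a sketch of that style of test (the download helper and stub parameters are placeholders, not the PR's actual test code): botocore's `Stubber` raises if the next client call doesn't match the queued response, so stubbing only `get_object` implicitly asserts that no `head_object` call is made.

```python
import io

import boto3
from botocore.response import StreamingBody
from botocore.stub import Stubber

s3_client = boto3.client("s3", region_name="us-west-2")
stubber = Stubber(s3_client)

body = b"file-contents"
# Queue only a GetObject response. If the code under test issued a
# HeadObject first, the Stubber would raise instead of answering it.
stubber.add_response(
    "get_object",
    {"Body": StreamingBody(io.BytesIO(body), len(body))},
    {"Bucket": "test-bucket", "Key": "Data/abc123"},
)

with stubber:
    # Hypothetical helper standing in for the job attachments download path.
    download_one_file(s3_client, "test-bucket", "Data/abc123", size=len(body))
stubber.assert_no_pending_responses()
```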
One concern from the team is that this seems to be an internal implementation detail of s3transfer. I appreciate the improvements, but we also need to be careful end to end (JA, worker running jobs) to avoid another regression.
JA is a bit fragile, as you have noticed, so we want to check everything first.
Can we do what was implemented for Incremental downloads here: https://github.com/aws-deadline/deadline-cloud/blob/mainline/src/deadline/job_attachments/_incremental_downloads/_manifest_s3_downloads.py#L533-L537
It will not depend on an internal API.
Sure, that would work. It adds some complexity by having two download paths and two download managers (s3transfer and an internal thread pool), but it does avoid touching s3transfer details. If the JA team prefers that, I can close this PR.
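For reference, the rough shape of that alternative as a sketch (function names and chunk size are assumptions, not the incremental-downloads code): plain GetObject calls on an internal thread pool, which never triggers a HEAD request and doesn't touch s3transfer internals.

```python
from concurrent.futures import ThreadPoolExecutor


def _download_one(s3_client, bucket: str, key: str, dest_path: str) -> None:
    # A single GetObject per file; the manifest already supplies the size,
    # so there is no need for a HEAD-based size probe.
    response = s3_client.get_object(Bucket=bucket, Key=key)
    with open(dest_path, "wb") as f:
        for chunk in response["Body"].iter_chunks(chunk_size=8 * 1024 * 1024):
            f.write(chunk)


def download_all(s3_client, bucket: str, key_to_path: dict, max_workers: int = 10) -> None:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(_download_one, s3_client, bucket, key, path)
            for key, path in key_to_path.items()
        ]
        for future in futures:
            future.result()  # re-raise any download error
```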
```diff
  stubber = Stubber(s3_client)
  stubber.add_client_error(
-     "head_object",
+     "get_object",
```
Nit - is this due to the small size, like at line 1797?
s3transfer calls HEAD to get the file size, which it uses to decide whether the file is small enough to fetch in a single request or whether it should use its multi-request approach. Since we provide the size, it never makes a HEAD request. The first request that fails when permissions are missing is the GET.
```python
download_logger = getLogger("deadline.job_attachments.download")


class _FileSizeSubscriber(_BaseSubscriber):
```
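For readers without the full diff, here is the gist of the subscriber approach as a sketch, assuming s3transfer's public `BaseSubscriber` hook and the `provide_transfer_size` method on `TransferMeta` (the body below is illustrative, not the exact PR code):

```python
from s3transfer.subscribers import BaseSubscriber


class _FileSizeSubscriber(BaseSubscriber):
    def __init__(self, file_size: int):
        self._file_size = file_size

    def on_queued(self, future, **kwargs):
        # Supplying the size up front means s3transfer never needs a
        # HeadObject call to choose between single and multipart download.
        future.meta.provide_transfer_size(self._file_size)
```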
Can we please run the integration tests? This code path changes both deadline job downloads and incremental downloads.
I need to check how this impacts the worker agent too; if it's in the call path, we need to run the integration tests there.
I ran the integration tests; they passed.
Closing this PR in favor of one on s3transfer itself: boto/s3transfer#363
Pull request was closed



What was the problem/requirement? (What/Why)
Job attachments uses the s3transfer library to download files from S3. When s3transfer decides whether to use a multipart download or a single download, it makes a HEAD request to get the file size. Job attachments already knows the file size from the manifest, though. For small files, the HEAD request takes about half the total download time.
What was the solution? (How)
Provide the file size to s3transfer so that it never needs to look the size up with a HEAD request.
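Concretely, the size from the manifest can be attached per download through a subscriber. A hedged sketch (bucket, key, paths, and manifest_size are placeholders, and `_FileSizeSubscriber` is the subscriber shown in the review thread above):

```python
import boto3
from s3transfer.manager import TransferManager

s3_client = boto3.client("s3")
manifest_size = 1234  # placeholder: the real size comes from the job attachments manifest

with TransferManager(s3_client) as manager:
    future = manager.download(
        bucket="example-bucket",
        key="Data/abc123",
        fileobj="/tmp/outputs/render.png",
        subscribers=[_FileSizeSubscriber(manifest_size)],
    )
    future.result()
```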
What is the impact of this change?
Roughly halves the transfer time for small files.
How was this change tested?
I created a job that generates 10k small files. It took 3m52s to download before this change and 2m04s after.
Was this change documented?
n/a
Does this PR introduce new dependencies?
No
Is this a breaking change?
No
Does this change impact security?
No
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.