
perf: avoid S3 HEAD requests when downloading job outputs #924

Closed
crowecawcaw wants to merge 4 commits into aws-deadline:mainline from crowecawcaw:smallfiles

Conversation

@crowecawcaw
Contributor

What was the problem/requirement? (What/Why)

Job attachments uses the s3transfer library to download files from S3. When s3transfer decides whether to use a multipart download or a single request, it makes a HEAD request to get the file size. Job attachments, however, already knows the file size from the manifest. For small files, the HEAD requests take about half the total download time.

What was the solution? (How)

Provide the file size to s3transfer.
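
For reference, the change hooks into s3transfer's subscriber mechanism. A minimal sketch of the idea, assuming the file size is already known from the manifest (class and variable names here are illustrative, not the PR's exact code):

from s3transfer.subscribers import BaseSubscriber

class FileSizeSubscriber(BaseSubscriber):
    def __init__(self, size_in_bytes: int):
        self._size = size_in_bytes

    def on_queued(self, future, **kwargs):
        # With the transfer size provided up front, s3transfer can choose
        # between a single GET and a multipart download without first
        # calling HeadObject.
        future.meta.provide_transfer_size(self._size)

One subscriber instance is then passed per file, e.g. transfer_manager.download(bucket, key, dest, subscribers=[FileSizeSubscriber(size)]).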

What is the impact of this change?

Roughly halves the transfer time for small files.

How was this change tested?

I created a job that generates 10k small files. It took 3m52s to download before this change and 2m04s after.

Was this change documented?

n/a

Does this PR introduce new dependencies?

No

Is this a breaking change?

No

Does this change impact security?

No

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@crowecawcaw crowecawcaw requested a review from a team as a code owner November 19, 2025 16:44
@crowecawcaw crowecawcaw requested a review from mwiebe November 19, 2025 16:44
@github-actions github-actions bot added the waiting-on-maintainers Waiting on the maintainers to review. label Nov 19, 2025
@epmog epmog added the job attachments For an issue with job attachments label Nov 19, 2025
@crowecawcaw crowecawcaw force-pushed the smallfiles branch 2 times, most recently from 328e747 to bc860f9 on November 19, 2025 17:41
@crowecawcaw crowecawcaw requested a review from a team as a code owner November 19, 2025 17:41
@crowecawcaw crowecawcaw force-pushed the smallfiles branch 2 times, most recently from b99c6fb to 3ecdb28 on November 19, 2025 18:50
Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>

# Provide a dummy etag to skip HEAD request if the method exists (added in s3transfer 0.6.0).
# For downloads from CAS, we don't need etag validation since files are content-addressed.
# Older s3transfer versions don't have this method, so we check before calling.
if hasattr(future.meta, "provide_object_etag"):
Contributor

Is there a way to just check this once and cache it (or do a version check of some sort), to avoid looking it up for every on_queued call where it shouldn't be changing?

Contributor Author

Each future is a distinct instance, though surely if one has this attribute, they all would. Do you think this check would be slow enough to warrant optimization though? There will be 1 future per file.

Contributor

Yeah, I was wondering that as well. It might not be terribly noticeable, tbh. It seems a bit inefficient when every future will be the same, but maybe not enough to really care about a hash/lookup per file.
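
For reference, one way to hoist the check would be to probe the class once at import time rather than each future's instance. A sketch, assuming every future.meta is an s3transfer TransferMeta (the flag name is illustrative):

from s3transfer.futures import TransferMeta

# Probe once at import time instead of on every on_queued call.
_SUPPORTS_OBJECT_ETAG = hasattr(TransferMeta, "provide_object_etag")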

# For downloads from CAS, we don't need etag validation since files are content-addressed.
# Older s3transfer versions don't have this method, so we check before calling.
if hasattr(future.meta, "provide_object_etag"):
    future.meta.provide_object_etag("dummy-etag")
Contributor

Setting this value skips HeadObject requests? Can we link to docs for reference in the code comment above?

Contributor Author

The s3transfer library actually does not have documentation: https://github.com/boto/s3transfer/

This behavior was discovered in the Python version matrix testing. In lieu of documentation, this change has unit tests which verify the behavior we want (i.e. not calling HEAD).
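
For context on how such a test can pin this down: botocore's Stubber enforces an exact, ordered sequence of API calls, so queueing only a get_object response makes any unexpected head_object call fail the test. A rough sketch, with placeholder bucket/key/payload values:

import io

import boto3
from botocore.response import StreamingBody
from botocore.stub import Stubber

s3_client = boto3.client("s3", region_name="us-east-1")
stubber = Stubber(s3_client)
payload = b"file contents"
stubber.add_response(
    "get_object",
    {
        "Body": StreamingBody(io.BytesIO(payload), len(payload)),
        "ContentLength": len(payload),
    },
    expected_params={"Bucket": "test-bucket", "Key": "some/key"},
)
stubber.activate()
# ... run the download path under test against s3_client, then:
stubber.assert_no_pending_responses()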

Contributor

One concern from the team is that this seems to rely on an internal implementation detail of s3transfer. I appreciate the improvements, but we also need to be careful end to end (JA, the worker running jobs) to avoid another regression.

JA is a bit fragile, as you have noticed, so we want to check everything first.

Contributor

Can we do what was implemented for Incremental downloads here: https://github.com/aws-deadline/deadline-cloud/blob/mainline/src/deadline/job_attachments/_incremental_downloads/_manifest_s3_downloads.py#L533-L537

It will not depend on an internal API.

Contributor Author

Sure, that would work. It adds some complexity, since there would be two download paths and two download managers (s3transfer and an internal thread pool), but it does avoid touching s3transfer details. If the JA team prefers that, I can close this PR.
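
For reference, the linked incremental-downloads approach amounts to issuing a plain GetObject for files whose size is already known, bypassing s3transfer entirely. A rough sketch (the function name and signature are illustrative):

def download_small_file(s3_client, bucket: str, key: str, dest_path: str) -> None:
    # GetObject never needs a preceding HeadObject; the size from the
    # manifest decides whether a file takes this path at all.
    response = s3_client.get_object(Bucket=bucket, Key=key)
    with open(dest_path, "wb") as f:
        f.write(response["Body"].read())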

stubber = Stubber(s3_client)
stubber.add_client_error(
-    "head_object",
+    "get_object",
Contributor

Nit: is this due to the small file size at line 1797?

Contributor Author

s3transfer calls HEAD to get the file size to determine whether it's small enough to fetch in a single request or whether it should use its multipart approach. Since we provide the size, it never makes a HEAD request. The first request that fails when permissions are missing is the GET.

download_logger = getLogger("deadline.job_attachments.download")


class _FileSizeSubscriber(_BaseSubscriber):
Contributor

Can we please run the integration tests? This code path changes both deadline job download and incremental downloads.

I need to check how this impacts the worker agent too; if it's in the call path, we need to run the integration tests there.

Contributor Author

I ran the integration tests; they passed.

@sakshie95 sakshie95 removed the waiting-on-maintainers Waiting on the maintainers to review. label Nov 24, 2025
@crowecawcaw
Contributor Author

Closing this PR in favor of one on s3transfer itself: boto/s3transfer#363

auto-merge was automatically disabled December 10, 2025 17:05

Pull request was closed
