enforce cpu/memory ceiling limit for prow jobs #36121
Part of #34139
/hold for discussion
We don't use the same instance sizes across clouds. Right now we use 8-core, 64 GB RAM instances on GCP (c4|c4a|c4d-highmem-8) and 16-core, 128 GB nodes on AWS (r5ad.4xlarge or r5ad.2xlarge).
The node size we should adopt consistently across clouds is the latest-generation 8-core, 32 GB instance type with local SSDs.
Our nodes are underutilised in memory, as you can see in Datadog.
While we work out how to inject small, medium, and large pod sizes via Kyverno (mutating webhooks), this change enforces a cap of 7 CPU cores and 27 GB of RAM per pod. The remaining half core and 1 GB of RAM go to the agents we run on the cluster nodes.
For reference, memory on the 8-core instance types per C4 family: c4 (30 GB) → c4d (31 GB) → c4a (32 GB).
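
To make the cap concrete, here is a rough sketch of what such a ceiling check could look like in Go. The 7-core/27 GB values are the ones proposed above; the function name, package layout, and use of the `k8s.io` API types are illustrative assumptions, not the actual code in this PR:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Per-pod ceilings proposed in this PR: 7 CPU cores and 27Gi of memory.
var (
	maxCPU    = resource.MustParse("7")
	maxMemory = resource.MustParse("27Gi")
)

// validateResourceCeiling returns an error if a container's requests or
// limits exceed the per-pod ceiling. This is only a sketch of the idea,
// not the validation code added by this PR.
func validateResourceCeiling(container string, res corev1.ResourceRequirements) error {
	for kind, rl := range map[string]corev1.ResourceList{"requests": res.Requests, "limits": res.Limits} {
		if cpu, ok := rl[corev1.ResourceCPU]; ok && cpu.Cmp(maxCPU) > 0 {
			return fmt.Errorf("container %s: cpu %s %s exceeds ceiling %s", container, kind, cpu.String(), maxCPU.String())
		}
		if mem, ok := rl[corev1.ResourceMemory]; ok && mem.Cmp(maxMemory) > 0 {
			return fmt.Errorf("container %s: memory %s %s exceeds ceiling %s", container, kind, mem.String(), maxMemory.String())
		}
	}
	return nil
}

func main() {
	// Example of an oversized request that the ceiling would reject.
	res := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("4"),
			corev1.ResourceMemory: resource.MustParse("52Gi"),
		},
	}
	if err := validateResourceCeiling("test", res); err != nil {
		fmt.Println(err)
	}
}
```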

Less than 1% of Prow jobs (54 out of 5,874) request more than 28 GB of RAM, and most of them don't need it.
Examples of misconfigured job sizing: