Move tasks to rate-limited queue near rate-limit threshold #2941
base: main
Conversation
Code Review
This pull request introduces a mechanism to handle API rate limiting by moving tasks to a rate-limited queue when the remaining API calls are below a certain threshold. This is achieved by adding a check_rate_limit_remaining method to the JobHandler and calling it in various handlers.
My review identifies a critical issue in the implementation: apply_async is used to re-queue tasks, which would lead to task duplication instead of a retry. I've suggested using celery_task.retry(), which is the correct approach because it stops the current task and re-queues it for later execution.
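A minimal sketch of the re-queueing pattern the review recommends. The threshold constant, the helper, and the "rate-limited" queue name are illustrative assumptions, not the project's actual identifiers:

```python
from typing import Optional

from celery import shared_task

RATE_LIMIT_THRESHOLD = 50  # assumed threshold, not the project's real constant


def get_rate_limit_remaining(event: dict) -> Optional[int]:
    """Hypothetical helper: how many API calls the forge still allows us."""
    return event.get("rate_limit_remaining")


@shared_task(bind=True, max_retries=None)
def run_handler_task(self, event: dict):
    remaining = get_rate_limit_remaining(event)
    if remaining is not None and remaining < RATE_LIMIT_THRESHOLD:
        # retry() stops the current task and re-queues the *same* message,
        # whereas task.apply_async() would enqueue a second, duplicate task
        # while the current one keeps running -- the issue flagged above.
        raise self.retry(countdown=60, queue="rate-limited")
    ...  # the handler's real work would go here
```

Calling retry() raises celery.exceptions.Retry, so execution does not continue past that point in the current invocation.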
Build succeeded. ✔️ pre-commit SUCCESS in 1m 42s
lbarcziova left a comment
Have you checked whether Celery provides something more native for this kind of use case of scheduling a task for later, e.g. could this be used instead of a separate queue? Or could you explain what the benefit of the separate queue would be?
The idea behind this code is not to delay the execution of a task. I just want to be sure that if any of these tasks are going to block, they will not stop the processing of our short-running tasks. I rely on the retry-after handling in ogr to do the waiting (if needed); this way the tasks will wait for exactly the amount of time they are supposed to, based on the feedback from the service. They will wait in a queue where they can rest for up to an hour (much longer than in our other queues), and they will not stop other tasks from running.
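A rough sketch of how such a dedicated queue with a much longer time limit could be wired up; the app name, queue names, routing pattern, and time limits are assumptions for illustration, not the project's actual configuration:

```python
from celery import Celery
from kombu import Queue

app = Celery("packit-worker")  # app name is illustrative

# Tasks routed to "rate-limited" may block while ogr's retry-after handling
# waits out the rate limit, without occupying workers for short-running tasks.
app.conf.task_queues = (
    Queue("short-running"),
    Queue("rate-limited"),
)
app.conf.task_routes = {
    # routing pattern is an assumption for illustration
    "task.rate_limited.*": {"queue": "rate-limited"},
}


@app.task(name="task.rate_limited.example", time_limit=3600, soft_time_limit=3300)
def rate_limited_example(event: dict):
    ...  # handler work that may wait on the service's retry-after response
```

The one-hour time_limit mirrors the "rest for up to an hour" behaviour described above; the other queues would keep their shorter limits.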
Thinking about it now, wouldn't this mess up the ordering? Let's say there is a task that would hit rate limits, so it's placed in the queue, and in the meantime another task starts when rate limits are no longer in place: it will be executed sooner than the first task. This can add up, leaving "unlucky" tasks unprocessed for a long time. That could be very confusing for users (but perhaps they are equally confused already by the current state 😅). Without any prioritization there could be a task that is unlucky enough to be put back in the queue every time, or multiple times in a row, and users may be tempted to retrigger it, so we should account for that as well.
I'm not sure. I expect that once a project is no longer rate-limited, tasks from the rate-limited queue will be executed as soon as possible, since they'll be among the oldest tasks waiting. The order can absolutely change somehow, but I think it is better to reshuffle the order than to let tasks die.
You are probably right, let's see how it behaves in practice.
👍
@majamassarini ok I see now, thanks for the explanation! Let's see how it performs like this and adjust accordingly.
00456df to 4f11e36 Compare
Build succeeded. ✔️ pre-commit SUCCESS in 1m 44s
4f11e36 to e861a79 Compare
✔️ pre-commit SUCCESS in 1m 48s
82da413 to f86c275 Compare
✔️ pre-commit SUCCESS in 1m 43s
f86c275 to a84ea00 Compare
Build succeeded. ✔️ pre-commit SUCCESS in 1m 50s
a84ea00 to 7522866 Compare
Build succeeded. ✔️ pre-commit SUCCESS in 1m 52s
7522866 to 95fd5bd Compare
/gemini review
Build succeeded. ✔️ pre-commit SUCCESS in 1m 50s
Code Review
The pull request introduces a rate-limiting mechanism for tasks by refactoring the run method in JobHandler to _run and adding a check_rate_limit_remaining method. This change correctly moves the core logic to _run and allows for pre-execution checks in run. The new constants for rate limiting are well-defined, and several handlers have been updated to reflect the run-to-_run change. Minor but important type-checking improvements were also made in packit_service/worker/handlers/koji.py for robustness. The test case in tests/unit/test_distgit.py was appropriately updated.
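A minimal sketch of the run()/_run() split described above, assuming hypothetical names for the threshold, the exception, and the helper that queries the remaining API calls:

```python
RATE_LIMIT_THRESHOLD = 50  # assumed value


class RateLimitRequeueException(Exception):
    """Signals that the task should be moved to the rate-limited queue."""


def get_remaining_api_calls(project) -> int:
    """Hypothetical helper wrapping the forge API's rate-limit endpoint."""
    raise NotImplementedError


class JobHandler:
    def __init__(self, project):
        self.project = project

    def check_rate_limit_remaining(self) -> None:
        remaining = get_remaining_api_calls(self.project)
        if remaining < RATE_LIMIT_THRESHOLD:
            raise RateLimitRequeueException(
                f"{remaining} requests remaining, below the threshold of {RATE_LIMIT_THRESHOLD}"
            )

    def run(self) -> dict:
        # pre-execution check added by this PR; concrete handlers implement _run()
        self.check_rate_limit_remaining()
        return self._run()

    def _run(self) -> dict:
        raise NotImplementedError
```

The Celery task wrapping the handler can then catch RateLimitRequeueException and re-queue the task onto the rate-limited queue, along the lines discussed earlier.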
95fd5bd to 0d647f0 Compare
✔️ pre-commit SUCCESS in 1m 46s
0d647f0 to 1311bd3 Compare
Build succeeded. ✔️ pre-commit SUCCESS in 1m 45s
/gemini review
Code Review
This pull request introduces a robust rate-limiting mechanism by moving tasks to a dedicated queue when API call limits are low, which is a great step towards improving service reliability. The implementation is solid, featuring a new rate-limited Celery queue and clear logic for re-queuing tasks. I particularly appreciate the significant improvement to the Celery auto-retry mechanism, which now only retries on transient errors, making the worker more resilient. The refactoring of run() to _run() across handlers is a clean way to integrate the new rate-limiting check, and the addition of unit tests for this new logic is commendable. I have a couple of minor suggestions to handle edge cases with timestamps.
8677cba to 5797d70 Compare
Build succeeded. ✔️ pre-commit SUCCESS in 1m 45s
# Only retry on specific exceptions that are likely to be transient:
# - PackitException: Packit-specific errors that might be retryable
# - OgrException: OGR library errors (API, network, authentication issues)
# - ConnectionError: Network connection problems
# - TimeoutError: Timeout issues
# - OSError: File system/OS errors that might be transient
# Note: RateLimitRequeueException is NOT in this list, so it won't trigger autoretry
autoretry_for = (
    PackitException,
    OgrException,
    ConnectionError,
    TimeoutError,
    OSError,
)
Why not base RateLimitRequeueException on BaseException and keep this as it was? Or are we certain that there can be no other exceptions than those listed here?
I tried with BaseException, but it was making the worker exit. Also, I don't think it makes sense to retry for every exception. I hope I didn't miss anything important, but we can always add it later.
Build succeeded. ✔️ pre-commit SUCCESS in 1m 48s
Build succeeded. ✔️ pre-commit SUCCESS in 1m 43s
328409a to 9779aff Compare
Build succeeded. ✔️ pre-commit SUCCESS in 1m 50s
Also fix mypy errors in koji.py like the following:
Argument 1 to "utcfromtimestamp" of "datetime" has incompatible type "Union[int, float, str]"; expected "float"
Co-authored-by: Nikola Forró <nforro@redhat.com>
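A small sketch of the kind of conversion that typically resolves this mypy error; the helper name is illustrative and not necessarily how koji.py does it:

```python
from datetime import datetime, timezone
from typing import Union


def to_datetime(timestamp: Union[int, float, str]) -> datetime:
    # mypy rejects passing Union[int, float, str] to utcfromtimestamp (it expects float),
    # so convert explicitly first; fromtimestamp(..., tz=timezone.utc) also avoids
    # the utcfromtimestamp deprecation in newer Python versions.
    return datetime.fromtimestamp(float(timestamp), tz=timezone.utc)
```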
Flower has an endpoint for /api/queues/length but it isn't exposed as a metric
Co-authored-by: Nikola Forró <nforro@redhat.com>
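For reference, a sketch of polling that Flower endpoint directly; the Flower URL is an assumption, and the response shape follows Flower's documented example:

```python
import requests

FLOWER_URL = "http://flower:5555"  # assumed deployment URL

# /api/queues/length reports the number of messages waiting in each active queue,
# which would include the rate-limited queue introduced by this PR.
response = requests.get(f"{FLOWER_URL}/api/queues/length", timeout=10)
response.raise_for_status()
for queue in response.json().get("active_queues", []):
    print(f"{queue['name']}: {queue['messages']} waiting")
```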
69fdfaf to 428298f Compare
Build succeeded. ✔️ pre-commit SUCCESS in 1m 46s
nforro left a comment
LGTM. To nitpick, the wording around rate limits seems a bit odd to me. For example, I would rather see something like "N requests remaining until rate limit, which is below/above the threshold of X" instead of "Rate limit remaining is low/high".
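A tiny sketch of a log message following that suggested wording (the function and variable names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)


def log_rate_limit_status(remaining: int, threshold: int) -> None:
    comparison = "below" if remaining < threshold else "above"
    logger.info(
        "%d requests remaining until rate limit, which is %s the threshold of %d",
        remaining,
        comparison,
        threshold,
    )
```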
Related to packit/deployment#681