Increase service keepAliveTimeout from 72s to 905s #7677
Conversation
Code Review
This pull request increases the keepAliveTimeout to 350 seconds to mitigate upstream disconnects. While the intention is good, the current implementation applies this change globally to all services using the common createServer function, whereas the justification is specific to usage-ingestor. I've left a comment suggesting how to scope this change correctly to avoid potential side-effects on other services.
We are using Cloudflare, which recommends a keep alive of 900 seconds. Is it sufficient to only configure this on the Node.js services, or should we increase the keep alive of Envoy for all services behind Cloudflare?
Great find. I didn't realize all of our requests were going through Cloudflare. Knowing that now, I believe we should increase the keep alive of all our services to 905_000 (905 seconds). |
```ts
trustProxy: true,
// If a connection is idle for 905 seconds or more, the connection times out.
// The default for fastify is 72_000 (72s), but this is meant for more dynamic clients.
// Requests to Hive's services are proxied through Cloudflare, which has a 900s
// keep alive timeout.
```
While Cloudflare has 900s configured, it's still only one part of the chain: cf -> cluster (envoy) -> usage-service.
Envoy has a limit of 1h by default, so adjusting this to 905s should be ok, I think? (900s -> 1h -> 905s)
Also, this change assumes all users go through CF, which might not be correct for self-hosters, so maybe it should be adjustable?
1hr seems incredibly long.
In reviewing Envoy's docs, it looks like we can adjust this, but they combined the idle and keepalive timeouts into a single idle_timeout option.
How do we feel about lowering the envoy idle_timeout to 905s and setting our services to 910s? Also I will make this adjustable.
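Making the value adjustable could look something like the sketch below. The env var name `KEEP_ALIVE_TIMEOUT_MS` and the 910s fallback are assumptions for illustration, not Hive's actual configuration:

```javascript
// Hypothetical helper: resolve the keep-alive timeout from the environment,
// falling back to 910s (sitting above a proposed 905s Envoy idle_timeout).
function resolveKeepAliveTimeout(env = process.env) {
  // KEEP_ALIVE_TIMEOUT_MS is an assumed variable name, not an existing one.
  const parsed = Number.parseInt(env.KEEP_ALIVE_TIMEOUT_MS ?? '', 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : 910_000;
}
```

Self-hosters not fronted by Cloudflare could then tune the timeout to whatever their own proxy chain expects.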
I'm still uncovering some details. It looks like we actually use Contour for our proxy and it has a 60s idle timeout
https://projectcontour.io/docs/main/configuration/#timeout-configuration
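For reference, Contour exposes a per-route idle timeout through the HTTPProxy `timeoutPolicy`; a hedged sketch of raising it to 905s (resource and host names are illustrative, not Hive's actual manifests):

```yaml
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: usage-service        # illustrative name
spec:
  virtualhost:
    fqdn: usage.example.com  # illustrative host
  routes:
    - services:
        - name: usage-service
          port: 80
      timeoutPolicy:
        idle: 905s           # overrides Contour's shorter default idle timeout
```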
Contour is just the K8s integration layer for Envoy, but it's good to know that they configure some things differently from Envoy's defaults.
Background
One of our customers was encountering upstream disconnects. This should help us reduce the likelihood that these 503s happen. Note that these errors do get retried by the hive-client, but it's much more efficient to maintain connections slightly longer. This also matches the idle timeout for a number of AWS services such as NAT Gateway.
Description
Increase keepAliveTimeout for the services and the Contour (Envoy) proxy. I also added an inline comment, because this configuration is often misunderstood or overlooked.