Increase service keepAliveTimeout from 72s to 905s #7677

Open
jdolle wants to merge 8 commits into main from usage-ingestor-keepalive

@jdolle
Collaborator

@jdolle jdolle commented Feb 11, 2026

Background

One of our customers was encountering upstream disconnects. This should help us reduce the likelihood that these 503s happen. Note that these errors do get retried by the hive-client, but it's much more efficient to maintain connections slightly longer. This also matches the idle timeout for a number of AWS services such as NAT Gateway.

Description

Increase keepAliveTimeout for the services and for the Contour (Envoy) proxy. I also added an inline comment, because this configuration is often misunderstood or overlooked.

@jdolle jdolle self-assigned this Feb 11, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @jdolle, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses customer-reported upstream disconnects by extending the keepAliveTimeout for the usage-ingestor service. The change aims to reduce 503 errors by allowing connections to persist longer, which is more efficient than relying solely on client-side retries. An explanatory comment has also been added to the configuration for clarity.

Highlights

  • Increased keepAliveTimeout: The keepAliveTimeout for the usage-ingestor service has been increased from 72 seconds (72_000 ms) to 350 seconds (350_000 ms) to enhance connection stability and reduce upstream disconnects.
  • Configuration Clarity: An inline comment was added to the keepAliveTimeout configuration in fastify.ts to provide context and explain the rationale behind the new value, preventing potential misunderstandings.

Changelog

  • .changeset/lovely-points-beg.md
    • Added a new changeset entry for the hive package.
  • packages/services/service-common/src/fastify.ts
    • Increased the keepAliveTimeout in the createServer function to 350_000 milliseconds.
    • Added a detailed inline comment explaining the rationale behind the keepAliveTimeout value.

@jdolle jdolle requested a review from n1ru4l February 11, 2026 02:46
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request increases the keepAliveTimeout to 350 seconds to mitigate upstream disconnects. While the intention is good, the current implementation applies this change globally to all services using the common createServer function, whereas the justification is specific to usage-ingestor. I've left a comment suggesting how to scope this change correctly to avoid potential side-effects on other services.

@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🚀 Snapshot Release (alpha)

The latest changes of this PR are available as alpha on npm (based on the declared changesets):

Package Version Info
hive 9.5.0-alpha-20260225224145-7a3aec61aaaaafa89989af1ed4ff9d755f36f575 npm ↗︎ unpkg ↗︎

@github-actions
Contributor

github-actions bot commented Feb 11, 2026

💻 Website Preview

The latest changes are available as preview in: https://pr-7677.hive-landing-page.pages.dev

@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🐋 This PR was built and pushed to the following Docker images:

Targets: build

Platforms: linux/amd64

Image Tag: 7a3aec61aaaaafa89989af1ed4ff9d755f36f575

@n1ru4l
Contributor

n1ru4l commented Feb 11, 2026

We are using Cloudflare, which recommends a keep-alive of 900 seconds. Is it sufficient to configure this only on the Node.js services, or does it also require adjusting the Envoy proxy configuration? (I had a look before coffee.)

Should we increase the keep-alive of Envoy for all services behind Cloudflare?

@jdolle
Collaborator Author

jdolle commented Feb 11, 2026

Great find. I didn't realize all of our requests were going through Cloudflare.

Knowing that now, I believe we should increase the keep alive of all our services to 905_000 (905 seconds).

@jdolle jdolle changed the title from "Increase usage-ingestor keepAliveTimeout from 72s to 350s" to "Increase service keepAliveTimeout from 72s to 905s" Feb 11, 2026
trustProxy: true,
// If a connection is idle for 905 seconds or more, the connection times out.
// The default for fastify is 72_000, but this is meant for more dynamic clients.
// Requests to Hive's services are proxied through Cloudflare, which has a 900s
Member

While Cloudflare has 900s configured, it's still only one part of the chain: cf -> cluster (envoy) -> usage-service.

Envoy has a limit of 1h by default, so adjusting this to 905s should be OK, I think? (900s -> 1h -> 905s)

Also, this change assumes all users go through CF, which might not be correct for self-hosters, so maybe it should be adjustable?
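
To illustrate the "adjustable" suggestion, here is a minimal sketch of how the timeout could be made configurable. It is not the code from this PR: the helper `resolveKeepAliveTimeoutMs` and the variable name `SERVER_KEEP_ALIVE_TIMEOUT_MS` are hypothetical, and the 905_000 ms default is simply the value discussed above.

```typescript
// Default taken from this thread: Cloudflare's 900s idle timeout plus a 5s margin.
const DEFAULT_KEEP_ALIVE_TIMEOUT_MS = 905_000;

// Hypothetical helper: reads an override from an env-like record and
// falls back to the default when the variable is unset or empty.
function resolveKeepAliveTimeoutMs(
  env: Record<string, string | undefined>,
  // Hypothetical variable name, not taken from the PR.
  variableName: string = "SERVER_KEEP_ALIVE_TIMEOUT_MS",
): number {
  const raw = env[variableName];
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_KEEP_ALIVE_TIMEOUT_MS;
  }
  const parsed = Number(raw);
  // Reject anything that is not a positive whole number of milliseconds.
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(
      `${variableName} must be a positive integer number of milliseconds, got "${raw}"`,
    );
  }
  return parsed;
}
```

The resolved number could then be passed as Fastify's `keepAliveTimeout` server option, which is expressed in milliseconds. Self-hosters who are not behind Cloudflare could set a shorter value without a code change.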

Collaborator Author

@jdolle jdolle Feb 12, 2026

1hr seems incredibly long.
In reviewing Envoy's docs, it looks like we can adjust this, but they combined the idle and keepalive timeouts into a single idle_timeout option.

How do we feel about lowering the envoy idle_timeout to 905s and setting our services to 910s? Also I will make this adjustable.
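
The 900s / 905s / 910s proposal follows a common rule of thumb: each hop should keep idle connections open slightly longer than the hop in front of it, so the side that sends requests is the one that closes first and never reuses a connection its upstream has already dropped. The sketch below checks a chain against that rule. It is a simplification: it models one idle timeout per hop (real proxies may have separate downstream and upstream settings), and the `Hop` type and `findRiskyHops` function are hypothetical names used only for illustration.

```typescript
interface Hop {
  name: string;
  idleTimeoutSeconds: number;
}

// Returns the links where the upstream hop does not outlive the hop in front of it.
function findRiskyHops(chain: Hop[]): string[] {
  const risky: string[] = [];
  for (let i = 1; i < chain.length; i++) {
    const downstream = chain[i - 1];
    const upstream = chain[i];
    // The upstream may close an idle connection while the downstream still
    // considers it usable, which can surface as an upstream disconnect.
    if (upstream.idleTimeoutSeconds <= downstream.idleTimeoutSeconds) {
      risky.push(`${downstream.name} -> ${upstream.name}`);
    }
  }
  return risky;
}

// Values proposed in this thread.
const proposedChain: Hop[] = [
  { name: "cloudflare", idleTimeoutSeconds: 900 },
  { name: "envoy", idleTimeoutSeconds: 905 },
  { name: "service", idleTimeoutSeconds: 910 },
];

// Envoy left at the 1h default mentioned above, service at 905s.
const envoyDefaultChain: Hop[] = [
  { name: "cloudflare", idleTimeoutSeconds: 900 },
  { name: "envoy", idleTimeoutSeconds: 3600 },
  { name: "service", idleTimeoutSeconds: 905 },
];

const proposedResult = findRiskyHops(proposedChain); // []
const envoyDefaultResult = findRiskyHops(envoyDefaultChain); // ["envoy -> service"]
```

Under this simplified model the proposed chain has no risky link, while leaving the proxy at 1h flags the proxy-to-service link, because the service would close idle connections long before the proxy does.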

Collaborator Author

I'm still uncovering some details. It looks like we actually use Contour for our proxy, and it has a 60s idle timeout:
https://projectcontour.io/docs/main/configuration/#timeout-configuration

Member

@dotansimha dotansimha Feb 25, 2026

Contour is just the K8s integration layer for Envoy, but it's good to know that they configure some things differently from Envoy's defaults.

Member

@dotansimha dotansimha left a comment

LGTM, this should align the configs across the pipeline.
Can we please test it on dev / staging cluster before merging? :)

@theguild-bot theguild-bot deployed to development February 25, 2026 22:54 Active
@theguild-bot theguild-bot temporarily deployed to staging February 25, 2026 22:54 Inactive

4 participants