Skip to content

Conversation

@shainaraskas
Copy link
Contributor

We want to recommend a higher vm.max_map_count for Elasticsearch environments - additional context in issue

What issues does this PR fix?

Part of https://github.com/elastic/docs-content-internal/issues/592

@prodsecmachine
Copy link
Collaborator

prodsecmachine commented Dec 23, 2025

Snyk checks have passed. No issues have been found so far.

Status Scanner Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@botelastic botelastic bot added the triage label Dec 23, 2025
@barkbay barkbay added the >docs Documentation label Dec 24, 2025
@botelastic botelastic bot removed the triage label Dec 24, 2025
@pebrc
Copy link
Collaborator

pebrc commented Dec 30, 2025

I see another ~30 references to the old value in the examples in the code base that we should probably address https://github.com/search?q=repo%3Aelastic%2Fcloud-on-k8s%20262144&type=code

@shainaraskas
Copy link
Contributor Author

@pebrc let me know if editing those recipes is overkill. I'll have to recreate the changes in main and backport them to the 3.x branches as well

Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some testing on GKE Autopilot with the new values recommended in this PR. We have an issue here in that only the previously approved DaemonSet with the old value will be accepted. I think for this PR it means that we have to keep all the autopilot examples on the old value for the time being and make some adjustments to the virtual memory page to call out the special case.

I am also wondering why we are backporting this change to the 2.16 branch instead of just updating the main branch/docs-content with the new values. BTW for the current documentation in docs-content we have the same issue with GKE Autopilot.

privileged: true
runAsUser: 0
command: ['/usr/local/bin/bash', '-e', '-c', 'echo 262144 > /proc/sys/vm/max_map_count']
command: ['/usr/local/bin/bash', '-e', '-c', 'echo 1048576 > /proc/sys/vm/max_map_count']
Copy link
Collaborator

@pebrc pebrc Dec 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

I think one issue we have here is that the GKE Autopilot docs point to this guide. After we merge this PR the virtual memory docs will recommend a DaemonSet that is no longer working on GKE Autopilot due to the strict whitelisting GKE does, where any change compared to the approved DaemonSet will lead to rejection (see my earlier comment).

I am not sure how to best address this. One idea is that we could duplicate the virtual memory DaemonSet example in the Autopilot docs page with the old value. We should also add an explanatory comment, saying that on GKE Autopilot only the DaemonSet spec previously approved by Google is accepted and therefore a lower value for vm.max_map.count has to be used.

We can also open a follow-up issue to engage with Google to get a newer version of the DaemonSet approved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pebrc thanks for spotting this.

I've made the following changes to this PR:

  • in the virtual memory docs, I've called out that GKE autopilot requires a different value. we can go further here and give an error message in the GKE autopilot doc if you have one.
  • reverted the autopilot recipe changes
  • reverted the max map count change in the code snippet on the gke autopilot page

we do have three more examples that refer to GKE - not sure if these should also be reverted or have a comment added:
https://github.com/elastic/cloud-on-k8s/pull/8971/files#diff-77f1bb9410a4ea3c6099f4fbd5be3691caac04e5679bddf1f34a1b1b99842a13
https://github.com/elastic/cloud-on-k8s/pull/8971/files#diff-102d7a34c4c9589e03cb65318e05792acdf4f516677aadacee82093de7d4ea35
https://github.com/elastic/cloud-on-k8s/pull/8971/files#diff-992bd46742b2dfffd13f95ce73af64b9cdd867e18b9d7ae88a762394d8538195

do you think using the updated value for the other examples is safe? I think if we expand on the autopilot doc with the error message, it might be ok, but since this is a nice-to-have thing, we could revert the examples and just keep the recommended value in the guidance. let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this setting can now be configured using a GKE ComputeClass, let's maybe remove the DaemonSet from our samples and update the documentation accordingly? #8982

Copy link
Collaborator

@pebrc pebrc Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great finding @barkbay I did not realise this was possible. It seems that custom compute classes are a relatively new addition (August 2024) and they are only supported for GKE 1.30+.

This PR targets ECK 2.16 where we still need to support Kubernetes 1.27 where this feature is not available yet.

Here is what I think we should do:

  • for this PR (2.16) if we really need to back port this vm.max_map_count change exclude the Autopilot examples.
  • for the current documentation 3.2. and newer let's switch to compute classes as illustrated by @barkbay 's PR

Copy link
Contributor Author

@shainaraskas shainaraskas Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that @barkbay's PR still has vm.max_map_count set at 262144. Given the context of this PR, is that something that can be bumped to 1048576 now that kind is ComputeClass?

regardless, will look into editing the latest docs to reflect using a ComputeClass for autopilot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you're using a ComputeClass rather than a DaemonSet to set the value, can you still use the initContainer described in this section to check the value?

https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/deploy-eck-on-gke-autopilot#k8s-autopilot-deploy-elasticsearch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that @barkbay's PR still has vm.max_map_count set at 262144. Given the context of this PR, is that something that can be bumped to 1048576 now that kind is ComputeClass?

🤦 , good catch, sorry, I was more focused on the "how" than on the "what". Here is the fix: #8986

if you're using a ComputeClass rather than a DaemonSet to set the value, can you still use the initContainer described in this section to check the value?

During my experiments this was set before Pods were scheduled, so the init container is not needed.

privileged: true
runAsUser: 0
command: ['/usr/local/bin/bash', '-e', '-c', 'echo 262144 > /proc/sys/vm/max_map_count']
command: ['/usr/local/bin/bash', '-e', '-c', 'echo 1048576 > /proc/sys/vm/max_map_count']
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pebrc thanks for spotting this.

I've made the following changes to this PR:

  • in the virtual memory docs, I've called out that GKE autopilot requires a different value. we can go further here and give an error message in the GKE autopilot doc if you have one.
  • reverted the autopilot recipe changes
  • reverted the max map count change in the code snippet on the gke autopilot page

we do have three more examples that refer to GKE - not sure if these should also be reverted or have a comment added:
https://github.com/elastic/cloud-on-k8s/pull/8971/files#diff-77f1bb9410a4ea3c6099f4fbd5be3691caac04e5679bddf1f34a1b1b99842a13
https://github.com/elastic/cloud-on-k8s/pull/8971/files#diff-102d7a34c4c9589e03cb65318e05792acdf4f516677aadacee82093de7d4ea35
https://github.com/elastic/cloud-on-k8s/pull/8971/files#diff-992bd46742b2dfffd13f95ce73af64b9cdd867e18b9d7ae88a762394d8538195

do you think using the updated value for the other examples is safe? I think if we expand on the autopilot doc with the error message, it might be ok, but since this is a nice-to-have thing, we could revert the examples and just keep the recommended value in the guidance. let me know what you think.

@shainaraskas
Copy link
Contributor Author

shainaraskas commented Dec 31, 2025

I am also wondering why we are backporting this change to the 2.16 branch instead of just updating the main branch/docs-content with the new values. BTW for the current documentation in docs-content we have the same issue with GKE Autopilot.

Fixing the issue in docs-content now (draft while we agree on language here: elastic/docs-content#4490).

Backporting was requested here, because ECK 2.16 supports the impacted versions of ES, 2.16 is still in support, and this setting is controlled at the ECK level. Not exactly sure why 2.16 was the only version targeted - @kunisen, can you provide some insight on that?

@kunisen
Copy link
Contributor

kunisen commented Jan 1, 2026

Thanks @shainaraskas and @pebrc !

Backporting was requested https://github.com/elastic/docs-content-internal/issues/592, because ECK 2.16 supports the impacted versions of ES, 2.16 is still in support, and this setting is controlled at the ECK level. Not exactly sure why 2.16 was the only version targeted - @kunisen, can you provide some insight on that?

I don't have strong opinion here and this is my thought:

  • ECK 2.16 and 3.x are the two versions we are actively maintaining, means for both code and docs. See EOL website page here.
  • Other versions, although it's great to change the value, it might be hard for us to do all of them. (It happens to ECK that being able to run all 8.16+ versions, and that could also mean ECK 1.x version right? IIRC, there's no such ECK version and stack version compatibility limitation like we have in ECE)

Based on the above thoughts, the most cost-effective and also following EOL policy way is to do 2.16 and 3.x, IMHO.


FWIW, why 2.16 is the only maintained version? is because we only maintain that minor. Per EOL policy,

  • For ECK 2.x, we maintain 2.16 (latest minor)
  • For ECK 3.x, we maintain 3.2 and 3.1 (latest 2 minors)

For the version that is under support but not under maintenance per our EOL policy, such as ECK 1.9, I didn't include it as target.


@shainaraskas @pebrc if doc team or ECK team think we should expand it to more targets, I think the support team is more than happy to have that. But that might be an overkill as you said.

I will leave the decision to you since you are more authoritative teams than us :) thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>docs Documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants