Skip to content

Conversation

@nate-kean
Copy link
Contributor

@nate-kean nate-kean commented Aug 14, 2025

Description

Fixes two problems that were preventing the CircleCI runners from passing/were causing them to time out:

  1. Problem: A mysterious configuration issue began occurring while trying to run mamba commands. This happened from no change on our end, and appears to be caused by an update to MiniForge mamba.
    Solution: Switched to micromamba. This turns out to be the preferred conda/mamba flavor for automated runners because it is extremely simple and downloads and installs very fast. It also does not suffer from the problem that we were experiencing with MiniForge mamba. We continue to use only the conda-forge channel, as micromamba comes preconfigured with just it.
  2. Problem: The "Create RAiDER environment" step was attempting to activate the RAiDER environment before it had been created, through the profile.d script.
    Solution: Delayed adding the line to activate the RAiDER environment to after it is created, so the runner only starts to attempt to activate it in the steps after when it has been created.

Motivation and Context

Fixes the problem that has been occurring in our CI builds lately that has caused them all to fail or time out. After this is merged, we will want to re-run the CI tests for recent pull requests that were failing.

How Has This Been Tested?

The ci/circleci checks in this PR are now passing.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • I have added an explanation of what your changes do and why you'd like us to include them.
  • I have written new tests for your core changes, as applicable.
  • I have successfully ran tests with your changes locally.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

nate-kean added a commit to nate-kean/RAiDER that referenced this pull request Aug 14, 2025
nate-kean added a commit to nate-kean/RAiDER that referenced this pull request Aug 14, 2025
@nate-kean nate-kean force-pushed the fix-circle-ci-timeout branch from d35a341 to 3806194 Compare August 14, 2025 20:42
@nate-kean nate-kean marked this pull request as ready for review August 14, 2025 21:24
@jlmaurer
Copy link
Collaborator

@nate-kean thanks for this!
@jhkennedy @dbekaert do you see any issues with pinning major versions? The idea is to speed up package resolving and prevent breaking changes from coming into RAiDER without being able to address them first.

@jhkennedy
Copy link
Collaborator

jhkennedy commented Aug 19, 2025

do you see any issues with pinning major versions?

@jlmaurer yes, pinning major version guarantees RAiDER will break in the future, and is generally considered a bad practice, e.g., https://iscinumpy.dev/post/bound-version-constraints

And pinning the development environment here, but not the package dependencies, will mean you won't follow the typical user install path and miss dependency issues that users will hit.

Generally, if you want a known working environment, tools like conda-lock and pixi are designed to give you a locked environment you can use for that. You should still have an unlocked testing workflow so you hit user issues in CI though.

Unfotunately, there isn't a good way to automatically open PRs to update dependencies in conda/pixi lock files like dependabot does, which would be the right way to provide a locked environment and ensure you're seeing dependency issues as they appear, but there is a workaround for pixi:
https://pixi.sh/latest/integration/ci/updates_github_actions/

It's fine short term to add major upper-bounds to the dev environment, but you it should be a very short term. Ideally, in order of importance:

  • pyproject.toml and, if provided, environment.yml would list dependencies and only exclude known incompatibilities. With core conda dependencies, pixi is the best way to provide this
  • a lockfile that provides a known working environment
  • a CI tool that opens PRs to test new dependency and upadate the lock file
  • CI that regularly tests the minimum bounds of dependencies to make sure check the lower bounds and find new incompatibilities

@nate-kean
Copy link
Contributor Author

Wow, you bring up a lot of good points I didn't consider. You have convinced me, I shouldn't have added these upper bounds where we don't need them. Thanks for your input, @jhkennedy, I will remove the upper bounds I added.

nate-kean and others added 14 commits August 21, 2025 22:03
micromamba is optimized for simplicity, and evidently, the configuration error that was causing our builds to fail with mamba does not occur with micromamba.
As an added bonus, micromamba downloads and installs about six seconds faster.
Slightly improves stability by not relying on the presence of the environment.yml file or for it to be in a specific format. conda has this functionality built in as part of the `env create` command.
This puts a lower and upper bound on all dependencies that follow the Semver versioning convention. All packages are now bound to their current major version. Exceptions include numpy where we support both v1 and v2, and any other package that already had a specific minor/patch version lower bound.

Benefits:
- Faster "Create RAiDER environment" step because the number of possibilities the conda/mamba solver has to consider is reduced
- Lower RAM use for the same reason. This allows CircleCI to use the Docker/Medium resource class instead of Docker/Large more often.
- Improved long-term stability for RAiDER: capping the major version means RAiDER won't upgrade to an untested dependency version that it turns out to be incompatible with. Capping the minor version keeps RAiDER from ever installing a package version so old that it is incompatible.

Semver definition: https://semver.org/
This reverts commit 7435ff5.

Reapply "Optimization and stability: Add version bounds"

This reverts commit 1b8fa95.

Remove new upper bounds, keep new lower bounds
@nate-kean nate-kean force-pushed the fix-circle-ci-timeout branch from 1665a91 to 2a30200 Compare August 22, 2025 03:05
@jlmaurer
Copy link
Collaborator

@nate-kean I think you can rebase on dev and hopefully that last test will pass. I think #759 and/or #760 updated the test_GUNW test.

@nate-kean
Copy link
Contributor Author

Yes, I have done that. But now the runners are taking over an hour again. I'm starting to doubt whether my changes made any difference...

@nate-kean
Copy link
Contributor Author

nate-kean commented Aug 22, 2025

New upper bounds removed. I also had to make a couple fixes to tests to make all the tests pass. The tests in question even failed for me on dev, so I suspect it may have been upstream changes in dependencies which only started applying to me after I had reinstalled/updated my environment for this PR.

But now the CircleCI runners are taking over an hour to solve their environment again sometimes. It's making me wonder if these bouts of slowdowns are even being caused by us, especially since they have historically started after changes that shouldn't affect environment creation at all, and "solved" themselves after just trying again in a week. I might split these test fixes into their own pull request and close this one, because it doesn't appear to have really fixed the issue.

I am considering a new approach now: add a lock file (conda/conda-lock) to the project, and the dependency solutions for each Python version we test will be computed ahead of time and no longer have to be done by the CircleCI runners.

This was referenced Aug 22, 2025
@nate-kean
Copy link
Contributor Author

Closing; moved progress to #762

@nate-kean nate-kean closed this Aug 23, 2025
@nate-kean nate-kean deleted the fix-circle-ci-timeout branch August 24, 2025 02:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants