Skip to content

Conversation

@fabioaraujopt
Copy link

@fabioaraujopt fabioaraujopt commented Feb 10, 2026

Problem statement

In asynchronous reconciliation flows, a terminal failure from the last async create/update operation can be unintentionally masked by a later successful observe pass.

In practice, this can make a managed resource look healthy (Synced=True / reconcile success) even though the most recent async operation actually failed and has not yet been resolved by a successful retry.

Failure pattern (generic)

A representative sequence:

  1. Controller starts async create/update.
  2. Provider/API returns a terminal error for that async operation (for example, transient 5xx, dependency not ready, or backend operation failure).
  3. Callback path records failure (LastAsyncOperation=False, reconcile error).
  4. Next reconcile runs Observe.
  5. Observe returns ResourceExists=true and ResourceUpToDate=true (resource object may exist and desired diff is empty from Terraform perspective).
  6. Previous code path unconditionally cleared async failure condition and set reconcile success.

Result: failure signal from step 3 could be overwritten before a genuinely successful apply cycle occurs.

Concrete generic example

  • Async create returns a terminal provider error (e.g. InternalServerError or equivalent).
  • Immediately after, Observe reports resource as existing and up-to-date.
  • Status transitions could show success while the prior async operation remained unresolved.

This is a status/condition precedence issue, not a provider-specific business logic issue.

Root cause

The async observe code paths were clearing failure-related conditions based only on observe success (err == nil && exists && upToDate) without verifying whether the operation tracker still contained an unresolved async error.

Additionally, in the framework recoverExternalName branch (failed create + partial state), observe returned ResourceUpToDate=true, which could also contribute to a false healthy signal during recovery.

What this PR changes

1) Guard condition clearing with operation error state

For both async clients:

  • pkg/controller/external_async_tfpluginsdk.go
  • pkg/controller/external_async_tfpluginfw.go

Condition-clearing logic now executes only when:

  • observe is successful and
  • LastOperation.Error() is nil

So unresolved async failures are preserved until a true successful cycle clears them.

2) Harden framework recovery branch

In:

  • pkg/controller/external_async_tfpluginfw.go

When recovering external name after a failed async create, observe now returns:

  • ResourceExists=true
  • ResourceLateInitialized=true
  • ResourceUpToDate=false <-- changed from true

This keeps reconcile state conservative while a failed async operation remains unresolved.

Behavior after fix

If an async create/update fails:

  • failure condition is retained,
  • observe-only success does not prematurely clear failure,
  • resource is not presented as fully healthy/synced until a successful apply/update resolves the error.

Regression coverage added

SDK async client

File:

  • pkg/controller/external_async_tfpluginsdk_test.go

New case validates:

  • prior async create failure + observe up-to-date
  • LastAsyncOperation remains failed
  • operation tracker error remains set

Framework async client

New file:

  • pkg/controller/external_async_tfpluginfw_test.go

Cases validate:

  1. prior async failure is not masked by observe up-to-date
  2. failed-create recoverExternalName path does not report ResourceUpToDate=true and preserves failure signaling

Why this is safe

  • No API surface changes.
  • No CRD/schema changes.
  • Changes are limited to async reconcile condition handling and test coverage.
  • Logic is stricter/conservative: success is only reported when there is no unresolved async error state.

Test plan

  • go test ./pkg/controller/...
  • Validate unresolved async error is preserved in SDK async observe flow
  • Validate unresolved async error is preserved in framework async observe flow
  • Validate framework failed-create recovery path stays not-up-to-date until resolution

Prevent async observe paths from clearing failure signals when the last async operation still has an unresolved error, and add regression tests for both SDK and framework clients to ensure failed async operations are not reported as healthy.
@coderabbitai
Copy link

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Walkthrough

These changes fix async operation error handling in both Terraform plugin framework and SDK implementations. The observe logic now preserves unresolved async errors instead of clearing them, while adjusting the ResourceUpToDate flag during failure recovery. Comprehensive tests validate these behaviors.

Changes

Cohort / File(s) Summary
Async TFPlugin Framework
pkg/controller/external_async_tfpluginfw.go, pkg/controller/external_async_tfpluginfw_test.go
Implementation now conditionally clears LastOperation only when no cached error exists, preserving terminal async failures. ResourceUpToDate explicitly set to false during failure recovery. New test file validates observe logic including error propagation, condition updates, and external name restoration after failed creates.
Async TFPlugin SDK
pkg/controller/external_async_tfpluginsdk.go, pkg/controller/external_async_tfpluginsdk_test.go
Mirrors framework changes with conditional LastAsyncOperation clearing based on error state. Extended tests add new "UpToDateWithUnresolvedAsyncFailure" case with prepare hooks and condition assertions to validate error preservation behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main fix: preventing async observe from masking unresolved operation failures, which is the core change across all modified files.
Description check ✅ Passed The description is directly related to the changeset, clearly outlining the three main changes: preventing LastAsyncOperation clearing, ensuring ResourceUpToDate=false on failed creates, and adding regression tests.
Configuration Api Breaking Changes ✅ Passed The custom check applies to changes in pkg/config/** files. This PR modifies only pkg/controller/** files, leaving all Configuration API files untouched.
Generated Code Manual Edits ✅ Passed Modified files do not match zz_*.go pattern; all are manually maintained source files with proper headers.
Template Breaking Changes ✅ Passed Changes to core library files are not template files (*.tmpl) as specified in custom check instructions. Comprehensive regression tests validate the behavioral changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
pkg/controller/external_async_tfpluginsdk_test.go (1)

143-151: Consider moving wantCondition and wantLastOpErr into the want struct.

These fields represent expected outcomes of the Observe call, so they'd be more at home alongside obs and err in the want struct. This would keep the args/want boundary consistent with the table-driven test conventions used elsewhere in this file and in the framework test file.

Not blocking at all — the tests are correct as-is. Just a readability nit for consistency.

♻️ Suggested restructuring
 type args struct {
-	r             Resource
-	cfg           *config.Resource
-	obj           xpresource.Managed
-	prepare       func(*terraformPluginSDKAsyncExternal, xpresource.Managed)
-	wantCondition *xpv1.Condition
-	wantLastOpErr error
+	r       Resource
+	cfg     *config.Resource
+	obj     xpresource.Managed
+	prepare func(*terraformPluginSDKAsyncExternal, xpresource.Managed)
 }
 type want struct {
-	obs managed.ExternalObservation
-	err error
+	obs           managed.ExternalObservation
+	err           error
+	lastCondition *xpv1.Condition
+	lastOpErr     error
 }

Tip

We've launched Issue Planner and it is currently in beta. Please try it out and share your feedback on Discord!


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant