Fix async observe masking unresolved operation failures #594
+255
−16
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem statement
In asynchronous reconciliation flows, a terminal failure from the last async create/update operation can be unintentionally masked by a later successful observe pass.
In practice, this can make a managed resource look healthy (
Synced=True/ reconcile success) even though the most recent async operation actually failed and has not yet been resolved by a successful retry.Failure pattern (generic)
A representative sequence:
5xx, dependency not ready, or backend operation failure).LastAsyncOperation=False, reconcile error).Observe.ObservereturnsResourceExists=trueandResourceUpToDate=true(resource object may exist and desired diff is empty from Terraform perspective).Result: failure signal from step 3 could be overwritten before a genuinely successful apply cycle occurs.
Concrete generic example
InternalServerErroror equivalent).Observereports resource as existing and up-to-date.This is a status/condition precedence issue, not a provider-specific business logic issue.
Root cause
The async observe code paths were clearing failure-related conditions based only on observe success (
err == nil && exists && upToDate) without verifying whether the operation tracker still contained an unresolved async error.Additionally, in the framework
recoverExternalNamebranch (failed create + partial state), observe returnedResourceUpToDate=true, which could also contribute to a false healthy signal during recovery.What this PR changes
1) Guard condition clearing with operation error state
For both async clients:
pkg/controller/external_async_tfpluginsdk.gopkg/controller/external_async_tfpluginfw.goCondition-clearing logic now executes only when:
LastOperation.Error()isnilSo unresolved async failures are preserved until a true successful cycle clears them.
2) Harden framework recovery branch
In:
pkg/controller/external_async_tfpluginfw.goWhen recovering external name after a failed async create, observe now returns:
ResourceExists=trueResourceLateInitialized=trueResourceUpToDate=false<-- changed from trueThis keeps reconcile state conservative while a failed async operation remains unresolved.
Behavior after fix
If an async create/update fails:
Regression coverage added
SDK async client
File:
pkg/controller/external_async_tfpluginsdk_test.goNew case validates:
LastAsyncOperationremains failedFramework async client
New file:
pkg/controller/external_async_tfpluginfw_test.goCases validate:
recoverExternalNamepath does not reportResourceUpToDate=trueand preserves failure signalingWhy this is safe
Test plan
go test ./pkg/controller/...