-
Notifications
You must be signed in to change notification settings - Fork 212
Description
We've noticed an odd failure mode with our deployment of Flux 2.4.0, which includes Kustomize controller 1.4.0. There's a chance that upon applying a change to a Kustomization object that the controller will simply stop reconciling the resource.
Observations from when it gets into this state:
.spec.interval
is completely ignored - the only reconciles come from the--max-retry-delay
controller-runtime triggers.- It seemingly does not proceed through any of the steps in the actual reconcile logic - no log lines are printed, no events emitted except the final "Reconciilation completed" message from the
defer
func, and the actual reconcile takes <10 ms, whereas before it would take ~100ms, indicating significantly less work is happening. - Updates to the backing Repository resource do not get picked up and applied onto the Kustomization object.
- The conditions are never updated.
- The finalizer, if removed, is never put back, even though that's the first thing the Reconcile function does.
Restarting the controller does unstick things, but this is a sub-optimal solution. We run over 40 clusters, and it would be tedious to manually monitor and remediate, and frankly hacky to code up some automation to automatically restart the controller when this condition is detected.
At this point, we're unsure what exactly is going on, and a detailed reading of the Reconcile function has not yielded any easy answers.
I couldn't find any similar-looking issues in the backlog, which probably indicates something rare or unique to our setup is happening. If anyone has any ideas on what to debug or check next, that would be appreciated!