Stale caches in Kubernetes controllers can lead to incorrect decisions. Kubernetes v1.36 introduces mechanisms that make controller behavior more predictable.

As long as teams rely on default dashboards, everything looks acceptable. The underlying problem in Kubernetes controllers is the local cache, which is populated through the API server's watch API. This cache speeds up processing but introduces the risk of desynchronization. As a result, the controller may make an incorrect decision, miss a necessary action, or react with a delay. The effect is most noticeable after a controller restart or while the API server is unavailable, when the cache lags behind the actual state of the cluster.

Kubernetes v1.36 takes a pragmatic approach: instead of eliminating the cache, it adds a way to check whether the cache is current. The client-go library introduces AtomicFIFO, an extension of the queue that processes events atomically, especially during batch operations (e.g., the initial list). This removes the window in which events are applied out of order and leave the cache inconsistent. In addition, the new function LastStoreSyncResourceVersion() reports the latest resource version the cache has seen. The trade-off is clear: an extra check runs before each action, which may slow down the response but reduces the risk of incorrect operations.
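The idea of applying a batch atomically and exposing the last synced version can be illustrated with a minimal sketch. This is not the client-go implementation: Store, Object, ReplaceAll and LastSyncResourceVersion are hypothetical names chosen for illustration, and resource versions are modeled as uint64 for simplicity, which is an assumption (the Kubernetes API exposes them as strings).

```go
package main

import (
	"fmt"
	"sync"
)

// Object is a hypothetical stand-in for a cached API object.
type Object struct {
	Key             string
	ResourceVersion uint64 // assumption: modeled as a number for easy comparison
}

// Store is a hypothetical cache, not the client-go type.
type Store struct {
	mu       sync.RWMutex
	items    map[string]Object
	lastSync uint64
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{items: map[string]Object{}}
}

// ReplaceAll applies a whole batch (for example an initial list) under a
// single lock, so readers see either the old snapshot or the new one,
// never a partially applied mix.
func (s *Store) ReplaceAll(batch []Object, listRV uint64) {
	next := make(map[string]Object, len(batch))
	for _, o := range batch {
		next[o.Key] = o
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items = next
	s.lastSync = listRV
}

// LastSyncResourceVersion reports the version recorded by the most recent
// batch that was applied to the store.
func (s *Store) LastSyncResourceVersion() uint64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.lastSync
}

// Len returns the number of cached objects.
func (s *Store) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.items)
}

func main() {
	s := NewStore()
	s.ReplaceAll([]Object{
		{Key: "default/a", ResourceVersion: 10},
		{Key: "default/b", ResourceVersion: 12},
	}, 12)
	fmt.Println(s.Len(), s.LastSyncResourceVersion())
}
```

Building the new map before taking the lock keeps the critical section short; the swap itself is what makes the batch appear atomic to readers.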

At the implementation level, kube-controller-manager uses this capability through feature gates such as StaleControllerConsistency. When the gate is enabled, the controller compares the resource version in its cache with the version it last wrote to the API server before taking action. If the cache is older than that write, the action is not executed. This implements the read-your-own-writes principle without synchronous requests to the API. The client-go library also adds ConsistencyStore, a structure that tracks resource versions and helps informers determine whether the cache has caught up with the current state. For example, the ReplicaSet controller tracks the versions of both the ReplicaSet itself and its associated Pods to avoid acting on outdated data.

Observability is a separate layer. kube-controller-manager adds the metric stale_sync_skips_total, which counts how many times a controller skipped a sync because its cache was outdated. In addition, client-go publishes store_resource_version for each informer, which makes it possible to compare the cache state with the API server and diagnose lag. The metrics are enabled by default but remain in alpha status, which means their interface and semantics may change.
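To make the role of a skip counter concrete, here is a minimal sketch of a sync function that reports its outcome and counts stale skips. Reconciler, Outcome and the outcome labels are hypothetical; the counter only mimics what a metric like stale_sync_skips_total would record and is not wired to any real metrics library.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Outcome labels the result of one sync attempt (hypothetical names).
type Outcome string

const (
	NoActionNeeded Outcome = "no_action_needed"
	Acted          Outcome = "acted"
	SkippedStale   Outcome = "skipped_stale"
)

// Reconciler is a hypothetical controller core with a stale-skip counter.
type Reconciler struct {
	staleSkips int64
}

// Sync refuses to act when the cache is older than the controller's last
// write, and otherwise acts only if desired and observed counts differ.
func (r *Reconciler) Sync(cacheRV, lastWrittenRV uint64, desired, observed int) Outcome {
	if cacheRV < lastWrittenRV {
		atomic.AddInt64(&r.staleSkips, 1)
		return SkippedStale
	}
	if desired == observed {
		return NoActionNeeded
	}
	return Acted
}

// StaleSkips returns how many syncs were skipped because of a stale cache.
func (r *Reconciler) StaleSkips() int64 {
	return atomic.LoadInt64(&r.staleSkips)
}

func main() {
	r := &Reconciler{}
	fmt.Println(r.Sync(100, 105, 3, 2)) // cache behind last write
	fmt.Println(r.Sync(105, 105, 3, 3)) // cache current, nothing to do
	fmt.Println(r.Sync(105, 105, 3, 2)) // cache current, action required
	fmt.Println("stale skips:", r.StaleSkips())
}
```

Returning a distinct outcome for the stale case is what lets an operator tell a skipped sync apart from a sync that had nothing to do.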

The result is more predictable controller behavior under high contention, especially for Pod-oriented controllers. However, no precise measurements of the improvement are provided. What matters is that the system now clearly distinguishes between “no action because it is not needed” and “no action because the cache is outdated.” This shrinks the class of elusive errors that previously surfaced only in production.

Future work is expected to extend to controller-runtime so that these guarantees become standard for all custom controllers. This is a logical step: staleness is not unique to built-in components but is a systemic property of any architecture built on caching and eventual consistency.

Read – Kubernetes.com
