The Kubernetes 1.35 release, scheduled for December 17th, has gift-wrapped a variety of experimental improvements designed to enhance infrastructure flexibility and security. In this overview, we focus on its alpha features, which span a broad spectrum of tasks: from watch-based route controller reconciliation and the long-awaited Gang Scheduling for AI/ML workloads to the secrets field for passing Service Account tokens, mutable volume attach limits, and proxying API server requests to work around version skew.
Note: In light of the ever-changing nature of the Kubernetes feature landscape, some of the KEPs discussed in this article may be removed from the milestone as the release date approaches.
Nodes
Node Declared Features (formerly Node Capabilities)
KEP-5328 introduces a Node Declared Features mechanism, allowing nodes to automatically declare available Kubernetes features. This provides the scheduler with a means to operate properly when component versions are out of sync (version skew). The proposal introduces a declaredFeatures field in the node status populated by kubelet, which the scheduler and admission controllers use to guarantee that the Pod ends up on a compatible node.

The mechanism itself is governed by strict rules to ensure its reliability and data relevance. The key here is the features’ temporary nature: they are not fixed as permanent attributes but exist only as long as kubelet actively confirms that they are available.
To keep Pods from getting stuck in the Failed state, the kubelet compiles the complete list of available features during its bootstrap sequence, before the first Pods can be scheduled to the node. The feature list is handled in a fully automated fashion and is reverted to its original state if manually edited by a user or controller. It remains unchanged from startup until the next restart, thus standing as “the authoritative source of truth” for the kubelet’s declared features and providing consistency for the scheduler.
Example. In-place Pod resizing allows you to modify a container’s CPU and memory resource requests and limits without recreating the Pod. However, the Pod may end up on an older node that does not support features like in-place resizing for Pods with the Guaranteed QoS class. In that case, an API request to change resources for a running Pod should be rejected. On the other hand, if a node supports the new feature, its status will include an entry that looks something like this:
declaredFeatures:
- GuaranteedQoSPodCPUResize
Allow HostNetwork Pods to Use User Namespaces
Currently, Kubernetes leaves you with a choice: you can run a Pod on the host network (hostNetwork: true), but then you can’t benefit from user namespaces (hostUsers: false) to isolate users. The API server simply disallows such a combination.
Many Kubernetes control plane components (e.g., kube-apiserver and kube-controller-manager) are usually run as static Pods with host network access rights (hostNetwork: true) to listen to host ports or interact with the network stack directly. However, due to the aforementioned limitation, these Pods cannot use user namespaces for additional isolation, thus rendering them riskier than regular Pods.
KEP-5607 allows you to circumvent this limitation. When this new feature is enabled, the API server will no longer reject Pod specifications that have both hostNetwork: true and hostUsers: false. This way, you can run containers on the host network while isolating their users with user namespaces.
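Here is a minimal sketch of a Pod spec that becomes valid once the feature gate is enabled (the Pod name and image below are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: host-network-isolated
spec:
  hostNetwork: true    # run in the host's network namespace
  hostUsers: false     # ...while still isolating users via a user namespace
  containers:
  - name: agent
    image: node-agent:latest   # placeholder image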
Note. If your container runtime does not support this feature, the API server will accept the Pod, but it will be stuck in the ContainerCreating state. With this in mind, the KEP authors intend to keep this feature as alpha until it is supported by the widely used container runtimes (such as containerd and CRI-O), at which point it will be promoted to beta.
Pod Level Resources Support With In-Place Pod Vertical Scaling
KEP-1287, introduced in Kubernetes 1.27, lets you change CPU/memory requests and limits of individual containers without restarting them. KEP-2837 (K8s 1.32) expanded the Pod API with two new fields: pod.spec.resources.requests and pod.spec.resources.limits. These fields allow you to set Pod-level requests and limits for compute resources, such as CPU and memory, in addition to the existing container-level settings. However, you still had to restart the Pod in order to apply new Pod-level settings.
The most recent KEP-5419 does away with this limitation: the user can now patch pod.spec.resources on a running Pod without having to restart it. On top of that, the KEP extends PodStatus with a Pod-level analog of the container status resource fields, so you can see how many resources are actually allocated to the Pod as a whole.
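For illustration, here is roughly what the Pod-level resources introduced by KEP-2837 look like; with KEP-5419, spec.resources of such a running Pod can be patched in place (the Pod name and values are arbitrary):

apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resources-demo
spec:
  resources:             # Pod-level requests/limits shared by all containers
    requests:
      cpu: "1"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 2Gi
  containers:
  - name: app
    image: registry.k8s.io/pause:3.10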
Restart All Containers on Container Exits
KEP-5532 is a logical evolution and extension of the functionality introduced in KEP-5307. While the earlier KEP featured a restartPolicy rules mechanism for individual containers, allowing kubelet to restart them based on exit codes, the current enhancement extends this logic to the Pod level. Now you can trigger a restart of all containers in the Pod, not just the one that failed, based on its exit code.
When a restart action is performed, the container status history is fully preserved, and the restart counts are correctly incremented for both individual containers and the Pod. The action happens in place: the IP address, sandbox, and attached volumes are preserved, while all init and regular containers are started from scratch. If, however, the Pod has restartPolicy: Never and one of the init containers crashes with an error after the RestartAllContainers action is triggered, the entire Pod will be marked as Failed.
Here’s a self-explanatory example:
apiVersion: v1
kind: Pod
metadata:
  name: my-ml-worker
spec:
  restartPolicy: Never
  initContainers:
  - name: setup-envs
    image: setup
  - name: watcher-sidecar
    image: watcher
    restartPolicy: Always
    restartPolicyRules:
    - action: RestartAllContainers # New action introduced in KEP-5532
      onExit:
        exitCodes:
          operator: In # Choosing from several exit codes
          values: [88] # A specific exit code indicating the Pod should be restarted
  containers:
  - name: main-container
    image: training-app
Scheduling
Extended Toleration Operators for Threshold-Based Placement
This new KEP extends the core/v1 Toleration API by adding two numeric comparison operators, Lt (less than) and Gt (greater than), to the standard Equal and Exists. It also updates the logic of the TaintToleration scheduler plugin, which now has to interpret taint and toleration values as numbers rather than just comparing strings. Although the data in the value fields technically remains a string, when one of the new operators is used, the system tries to parse it as a 64-bit integer (int64). If either value in a taint-toleration pair turns out not to be a valid integer, the comparison is considered unsuccessful, and the toleration is not applied. Support for floating-point numbers is intentionally omitted to prevent precision errors; instead, users are encouraged to scale values (e.g., writing 95.5% as the integer 955).
Example. Say you want to run your latency-critical Pods solely on nodes with SLA > 95%. Suppose they are also to be evicted from a node if its SLA falls below this threshold. The new KEP allows you to do so (unlike NodeAffinity):
# High-SLA on-demand node
apiVersion: v1
kind: Node
metadata:
  name: ondemand-node-1
spec:
  taints:
  - key: node.kubernetes.io/sla
    value: "950"
    effect: NoExecute
---
# Inference service requires SLA > 950 with a 30s grace period
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/sla
        operator: Gt
        value: "950"
        effect: NoExecute
        tolerationSeconds: 30
Gang Scheduling Support in Kubernetes
Kubernetes is widely used for AI, ML, and HPC jobs that require multiple processes to run simultaneously. The regular Kubernetes scheduler works on a Pod-by-Pod basis, allocating Pods to nodes sequentially. When resources are limited, there is a risk that not all the workers needed to run a job will be scheduled. Say your cluster can only accommodate five of the required ten workers: those five land on nodes and hold resources while waiting for the rest to be scheduled. This can lead to deadlocks and wasted resources.
This new Kubernetes alpha feature addresses the issue by implementing so-called gang scheduling. It enforces the “All-or-Nothing” principle: a group of Pods (gang) is scheduled to nodes only when resources are available for all members of the gang (or the minimum required quorum, defined by the minCount parameter — see the example below).

The API gets a new Workload core type that manages the lifecycle of a group of Pods as a single entity, allowing the scheduler to handle the entire job. The association logic relies on PodGroups, where group parameters such as minCount (the minimum number of Pods to start) are defined. The Pod specification is enhanced with a field to reference the parent Workload object so that the scheduler knows that this Pod is part of a group. On top of that, the Workload objects support workloads with complex internal structures, such as built-in Job and StatefulSet, as well as custom workloads like JobSet, LeaderWorkerSet, MPIJob, and TrainJob.
Example of a Workload definition:
apiVersion: scheduling/v1alpha1
kind: Workload
metadata:
  namespace: ns-1
  name: job-1
spec:
  podGroups:
  - name: "pg1"
    policy:
      gang:
        minCount: 100
Auth
Harden Kubelet Serving Certificate Validation in Kube-API Server
Currently, kube-apiserver’s TLS certificate validation is “relaxed” when it connects to the kubelet: if the root CA (--kubelet-certificate-authority) is configured, the server verifies the digital signature but does not always require the IP address or hostname in the certificate’s Subject Alternative Name (SAN) field to match the actual destination address.
This renders the cluster susceptible to Man-in-the-Middle attacks. An attacker with a valid certificate (signed by a trusted CA but issued to a different name/IP) can hijack the connection from the API server to kubelet (e.g., kubectl exec or kubectl logs).
This new alpha feature modifies the API server to validate the Common Name (CN) of the kubelet’s serving certificate as system:node:<nodename>, where nodename is the name of the Node object as reported by the kubelet.
Note: enabling this feature may disrupt existing clusters that use custom kubelet serving certificates. You will need to reissue those certificates before this feature can be enabled.
Constrained Impersonation
In Kubernetes, the impersonation mechanism has historically lacked granularity. If a user or service account was given the right to impersonate another user (e.g., someUser), it was granted all the privileges that someUser had. That violated the principle of least privilege and created serious security risks.
KEP-5284 introduces so-called constrained impersonation as well as a couple of verb prefixes:
- impersonate:<mode> — to impersonate a certain type of subject, such as a user, service account, etc.
- impersonate-on:<mode>:<verb> — to perform a specific action on a resource.
With them, all permissions are double-checked. Now, to perform an action on behalf of another user, an impersonator must have:
- The permission to constrained-impersonate the target user (this one is cluster-scoped).
- The explicit permission to perform a specific action (defined by the API verb, e.g., list); this one can be either cluster-scoped or namespace-scoped.
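A rough RBAC sketch of what these two checks could look like, assuming the verb forms listed above (the exact <mode> strings are defined by KEP-5284; the values below are illustrative only):

# 1. Cluster-scoped: allowed to constrained-impersonate service accounts
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: constrained-impersonator
rules:
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["impersonate:serviceaccount"]           # hypothetical <mode> value
---
# 2. Namespace-scoped: allowed to list pods while impersonating
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: impersonated-pod-lister
  namespace: team-a
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["impersonate-on:serviceaccount:list"]   # hypothetical verb string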

Storage
CSI Driver Opt-in for Service Account Tokens via secrets Field
This alpha feature introduces a new, more secure way of transferring Service Account tokens to CSI drivers. Currently, such tokens are passed to CSI drivers via the volume_context field in NodePublishVolumeRequest. The key issue is that volume_context is not designed to store sensitive information. This led to multiple security vulnerabilities, such as CVE-2023-2878 and CVE-2024-3744, where tokens were accidentally logged due to the volume_context field not being properly sanitized.
The new mechanism allows CSI drivers to explicitly request tokens via a dedicated secrets field specifically tailored to storing sensitive data. The CSIDriver resource specification gets a new serviceAccountTokenInSecrets field. If it is set to false (default behavior), tokens continue to be sent via volume_context. This ensures full backward compatibility with existing drivers. Enabling it will result in tokens being transferred exclusively via the secrets field.
Example:
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example-csi-driver
spec:
  # ... existing fields ...
  tokenRequests:
  - audience: "example.com"
    expirationSeconds: 3600
  # New field for opting into secrets delivery
  serviceAccountTokenInSecrets: true # defaults to false
Note: the serviceAccountTokenInSecrets field can only be set when tokenRequests is configured. Otherwise, the API server will reject the CSIDriver specs.
Mutable PersistentVolume Node Affinity
Say you want to improve the availability of your stateful workload by leveraging the new regional storage your storage provider now offers. In other words, you need a way to inform the scheduler that the volume is now available in a different zone. KEP-5381 lets you do just that by making the PersistentVolume.spec.nodeAffinity field mutable, so your volume’s node placement rules are no longer fixed and can be updated.
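As an illustration, the block that becomes mutable is the familiar node affinity section of a PersistentVolume; with KEP-5381, the zone list below could be patched on a live PV (the driver and zone names are made up):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: regional-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: regional.csi.example.com   # hypothetical CSI driver
    volumeHandle: vol-0123
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["zone-a", "zone-b"]   # now updatable once the volume spans new zones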
Apps
Consider Terminating Pods in Deployments
In some situations, the total number of running and terminating Pods in a Deployment can exceed the defined limits (i.e., spec.replicas + maxSurge for RollingUpdate). Even though terminating Pods are marked for deletion, they still use up cluster resources for a period of time set by terminationGracePeriodSeconds.
This leads to issues like triggering unnecessary node autoscaling (e.g., see #95498 or #99513) or preventing new Pods from being scheduled in clusters with limited resources until the old Pods are gone (#98656). The behavior is also inconsistent across update strategies: RollingUpdate creates new Pods right away (matching the TerminationStarted semantics), while Recreate waits for the old Pods to be fully terminated (matching TerminationComplete).
To address this, KEP-3973 adds a new .spec.podReplacementPolicy field to the Deployment specification. It allows users to control when replacement Pods are going to be created. You get two options: TerminationStarted and TerminationComplete. The first one allows the deployment to launch new Pods without waiting for the old ones to fully terminate. The second option makes the deployment wait until Pods marked with a deletionTimestamp are in a Succeeded or Failed state (i.e., fully terminated and deleted from etcd). A new status field, .status.terminatingReplicas, is also added to Deployments and ReplicaSets to keep track of Pods that are terminating.
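A minimal Deployment sketch using the new field (the field name and values come from KEP-3973; the workload itself is arbitrary):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  podReplacementPolicy: TerminationComplete   # create replacements only after old Pods are fully gone
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx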
Mutable Container Resources When Job Is Suspended
Often, the resource needs of batch workloads in Kubernetes are not known at creation time and depend on the cluster’s available capacity. This proposal optimizes cluster utilization by providing a means to modify resource requests and limits for suspended Jobs, allowing them to start with a configuration that matches actual cluster conditions.
KEP-5440 relaxes the validation of the Job’s template field so that container resource specifications (CPU/GPU and memory requests and limits, as well as extended resources like TPUs and FPGAs) become mutable for suspended Jobs (Job.Spec.Suspend=true).
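A sketch of the intended workflow, assuming a suspended Job whose resources are tuned before it is released (the names and values are arbitrary):

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-train
spec:
  suspend: true              # while suspended, the container resources below stay mutable
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: training-app          # placeholder image
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
          limits:
            cpu: "8"
            memory: 16Gi

Once the cluster’s capacity is known, you patch spec.template.spec.containers[*].resources accordingly and flip suspend to false to start the Job with the adjusted shape.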
Various
Unknown Version Interoperability Proxy
During Kubernetes cluster upgrades, different API servers may run different versions (this is known as version skew). As a result, those API servers may support different sets of built-in API resources, which can potentially lead to incomplete data and errors. For instance, a client (e.g., a controller or kubectl) might connect to an API server that is unaware of a requested resource — even though another API server in the cluster supports it — and stumble upon a 404 error as a result. This can also break system components: the Garbage Collector might mistakenly delete objects, believing their owner is gone because it hit a 404 on its request, or the Namespace Lifecycle Controller might block a namespace deletion because it cannot confirm the namespace is empty (or delete it too early, leaving stale data in etcd).
This new alpha feature enables API servers to proxy requests they don’t understand. Rather than immediately responding with a 404, an API server first checks if a peer can handle the request and forwards it accordingly, thereby rendering version differences invisible to the client. If no peer is able to serve the request, the server returns a 503 “Service Unavailable,” signaling a temporary issue (which is much safer).
To solve the issue of inconsistent resource discovery, the KEP also proposes utilizing the existing Aggregated Discovery mechanism. This ensures that every API server has a complete, consistent view of all resources available across the cluster, eliminating discovery gaps caused by version skew.
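For reference, here is a hypothetical static Pod fragment enabling peer proxying on kube-apiserver. The feature gate and peer-* flags below already exist for this mechanism in recent releases, but verify them against your Kubernetes version; the image tag, addresses, and paths are examples:

apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.35.0      # placeholder tag
    command:
    - kube-apiserver
    - --feature-gates=UnknownVersionInteroperabilityProxy=true
    - --peer-ca-file=/etc/kubernetes/pki/peer-ca.crt   # CA used to verify peer API servers
    - --peer-advertise-ip=10.0.0.11                    # address peers use to reach this instance
    - --peer-advertise-port=6443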
Integrate CSI Volume Attach Limits with Cluster Autoscaler
Currently, when simulating Pod scheduling to make scaling decisions, the Cluster Autoscaler (CAS) does not account for the volume attach limits imposed by CSI drivers. As a result, the CAS may create too few new nodes, since it assumes by default that those nodes can attach as many volumes as the pending Pods need, which is not necessarily the case.
Alternatively, the scheduler might attempt to place a Pod requiring a volume onto a newly created node on which the corresponding CSI driver has not yet been initialized. Because the driver’s information is missing, the scheduler incorrectly assumes there are no limits and schedules the Pod. As a result, the Pod becomes Pending on a node unable to attach its volume.
KEP-5030 addresses those issues by introducing changes to both components. The CAS now considers volume limits during its scaling simulation. It creates a “template” for the new node that includes the limit info. The information is obtained either from an existing node in the same node group (if there are nodes present already) or directly from the cloud provider if the group is scaling up from zero.
The scheduler now treats the lack of CSI driver information on a node as a zero attach limit for that driver’s volumes. This prevents it from placing Pods on nodes that are not yet ready to handle their volumes.
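For context, the per-driver attach limit that both components rely on is the one reported in the node’s CSINode object; a minimal illustration (the node name, driver, and numbers are made up):

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: worker-1
spec:
  drivers:
  - name: ebs.csi.aws.com          # example CSI driver
    nodeID: i-0123456789abcdef0
    allocatable:
      count: 25                    # maximum number of volumes this driver can attach on the node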
CCM: Watch-Based Route Controller Reconciliation Using Informers
Currently, the route controller in Kubernetes reconciles every 10 seconds by default, checking all nodes and all routes in the cloud provider to ensure they are aligned with the desired state. This creates a constant and often redundant load on the cloud provider’s API, since the check runs even when nothing has changed in the cluster.
The KEP’s authors propose transitioning to a watch-based model in which reconciliation is triggered by node-related events, such as additions, deletions, or updates to fields like status.addresses and spec.podCIDRs.
To ensure robustness, a full periodic reconciliation will still run, but at a much longer interval. This is necessary for self-healing and cleaning up stale routes that might have been left over from missed events (say, if a node was deleted while the controller was offline). A new route_controller_route_sync_total metric will track the number of route reconciliations as well.
Other v1.35 release highlights
While our article is specifically focused on revealing the new alpha features in Kubernetes v1.35, this release also includes many other updates. Here are some of them, as chosen by Dmitry Shurupov, our co-founder:
- In-Place Update of Pod Resources (KEP 1287) will be GA. Recently, we provided a detailed explanation of how the in-place resizing works.
- Structured Authentication Config (KEP 3331) will be GA. We have been directly involved in implementing this feature and described its background and capabilities two years ago.
- PreferSameNode Traffic Distribution (KEP 3015) will be GA. This enhancement improves network efficiency by instructing kube-proxy to route traffic to a local endpoint whenever possible.
- Pod Certificates (KEP 4317) will be moved to Beta and enabled by default. We covered this feature in our K8s v1.34 overview.
- The removal of cgroup v1 (KEP 5573) is introduced as a Beta enhancement. cgroup v2 brings a unified control group hierarchy, better resource isolation, and other new features. Its support has been stable in Kubernetes since v1.25, and it now becomes the default. The complete removal of cgroup v1 is expected no earlier than K8s v1.38.
Still, this is not the complete list of changes introduced in Kubernetes v1.35! For the most comprehensive information, refer to the official enhancements tracker and changelog.
Conclusion
Kubernetes 1.35 has ushered in many promising alpha features that enhance the platform’s flexibility, security, and efficiency in handling complex workloads. These updates — spanning from smart resource scheduling and dynamic allocation to stronger authentication and improved storage operations — are a testament to the community’s continuous effort to meet modern demands. We’re grateful for the excellent work performed by developers from all over the world and look forward to trying out the new K8s release!
Interested in other alpha features recently added to Kubernetes? Check out our deep dives into:
- Kubernetes v1.34 (August 2025).