Kubernetes Architecture

A Kubernetes cluster consists of at least one main (control) plane, and one or more worker machines, called nodes.

Both the control planes and node instances can be physical devices, virtual machines, or instances in the cloud.
In managed Kubernetes environments like AWS EKS, GCP GKE, Azure AKS the control plane is managed by the cloud provider.

The latest stable release is Kubernetes v1.36 (April 2026), with the ecosystem consolidating around containerd, the Gateway API, and native sidecar containers.

Control Plane

The control plane is also known as a master node or head node.
The control plane manages the worker nodes and the Pods in the cluster.

In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.
It is not recommended to run user workloads on master mode.
The Control plane’s components make global decisions about the cluster, as well as detect and respond to cluster events.

The control plane receives input from a CLI or UI via an API.

API Server (kube-apiserver)

API server exposes a REST interface to the Kubernetes cluster. It is the front end for the Kubernetes control plane.
All operations against pods, services, and so forth, are executed programmatically by communicating with the endpoints provided by it.

It tracks the state of all cluster components and manages the interaction between them.
It is designed to scale horizontally.
It consumes YAML/JSON manifest files.
It validates and processes the requests made via API.
Supports Declarative Validation (GA in v1.36) using ValidatingAdmissionPolicy as an in-process alternative to admission webhooks.
Supports Server-Side Sharded List and Watch (v1.36) for improved API scalability at large cluster sizes.

etcd (key-value store)

Etcd is a consistent, distributed, and highly-available key-value store.
is stateful, persistent storage that stores all of Kubernetes cluster data (cluster state and config).
is the source of truth for the cluster.
can be part of the control plane, or, it can be configured externally.
The latest major release is etcd v3.6.0 (May 2025), introducing significant memory optimizations, downgrade support, and migration to v3store. etcd v3.7.0-beta.0 was announced in May 2026.

ETCD benefits include
- Fully replicated: Every node in an etcd cluster has access to the full data store.
- Highly available: etcd is designed to have no single point of failure and gracefully tolerate hardware failures and network partitions.
- Reliably consistent: Every data ‘read’ returns the latest data ‘write’ across all clusters.
- Fast: etcd has been benchmarked at 10,000 writes per second.
- Secure: etcd supports automatic Transport Layer Security (TLS) and optional secure socket layer (SSL) client certificate authentication.
- Simple: Any application, from simple web apps to highly complex container orchestration engines such as Kubernetes, can read or write data to etcd using standard HTTP/JSON tools.

Scheduler (kube-scheduler)

The scheduler is responsible for assigning work to the various nodes. It keeps watch over the resource capacity and ensures that a worker node’s performance is within an appropriate threshold.
It schedules pods to worker nodes.
It watches api-server for newly created Pods with no assigned node, and selects a healthy node for them to run on.

If there are no suitable nodes, the pods are put in a pending state until such a healthy node appears.
It watches API Server for new work tasks.
Factors taken into account for scheduling decisions include:
- Individual and collective resource requirements.
- Hardware/software/policy constraints.
- Affinity and anti-affinity specifications.
- Data locality.
- Inter-workload interference.
- Deadlines and taints.
Workload-Aware Scheduling (introduced in v1.35, advancing in v1.36) allows the scheduler to treat groups of pods (e.g., AI training jobs) as a single scheduling unit, enabling gang scheduling and coordinated preemption decisions.

Dynamic Resource Allocation (DRA) (GA in v1.36) enables the scheduler to match resource claims (e.g., GPUs, FPGAs) with available devices using CEL expressions for flexible filtering.

Controller Manager (kube-controller-manager)

Controller manager is responsible for making sure that the shared state of the cluster is operating as expected.
It watches the desired state of the objects it manages and watches their current state through the API server.

It takes corrective steps to make sure that the current state is the same as the desired state.
It is a controller of controllers.
It runs controller processes. Logically, each controller is a separate process, but to reduce complexity, they are all compiled into a single binary and run in a single process.
Some types of controllers are:
- Node controller: Responsible for noticing and responding when nodes go down.
- Job controller: Watches for Job objects that represent one-off tasks, then creates Pods to run those tasks to completion.
- EndpointSlice controller: Populates EndpointSlice objects (joins Services & Pods). Note: The older Endpoints controller is being phased out in favor of EndpointSlice for better scalability.
- Service Account & Token controllers: Create default accounts and API access tokens for new namespaces.

Cloud Controller Manager

The cloud controller manager integrates with the underlying cloud technologies in your cluster when the cluster is running in a cloud environment.
The cloud-controller-manager only runs controllers that are specific to your cloud provider.

Cloud controller lets you link your cluster into cloud provider’s API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
The following controllers can have cloud provider dependencies:
- Node controller: For checking the cloud provider to determine if a node has been deleted in the cloud after it stops responding.
- Route controller: For setting up routes in the underlying cloud infrastructure.
- Service controller: For creating, updating, and deleting cloud provider load balancers.

Data Plane Worker Node(s)

The data plane is known as the worker node or compute node.

A virtual or physical machine that contains the services necessary to run containerized applications.
A Kubernetes cluster needs at least one worker node, but normally has many.
The worker node(s) host the Pods that are the components of the application workload.

Pods are scheduled and orchestrated to run on nodes.
Cluster can be scaled up and down by adding and removing nodes.
Node components run on every node, maintaining running pods and providing the Kubernetes runtime environment.

kubelet

A Kubelet tracks the state of a pod to ensure that all the containers are running and healthy.

provides a heartbeat message every few seconds to the control plane.
runs as an agent on each node in the cluster.
acts as a conduit between the API server and the node.

instantiates and executes Pods.
watches API Server for work tasks.
gets instructions from master and reports back to Masters.
Supports In-Place Pod Resize (GA in v1.35, Dec 2025) — allows changing CPU and memory resources of a running pod without restarting it.
Fine-Grained Kubelet API Authorization (GA in v1.36) — provides granular access control for the kubelet’s API endpoints, improving node-level security.

kube-proxy

Kube proxy is a networking component that routes traffic coming into a node from the service to the correct containers.

is a network proxy that runs on each node in a cluster.
manages IP translation and routing.
maintains network rules on nodes. These network rules allow network communication to Pods from inside or outside of cluster.
ensures each Pod gets a unique IP address.
makes possible that all containers in a pod share a single IP.

facilitates Kubernetes networking services and load-balancing across all pods in a service.
It deals with individual host sub-netting and ensures that the services are available to external parties.
kube-proxy supports three backend modes:
- iptables — the traditional default mode, being superseded.
- nftables — the modern replacement (stable in v1.36), solving iptables performance limitations and planned to become the default in v1.37+.
- IPVS — deprecated in v1.35 and will be removed in a future release. Users should migrate to nftables.
eBPF-based alternatives (e.g., Cilium) can replace kube-proxy entirely, providing higher performance through kernel-level packet processing without iptables rules.

Container Runtime

Container runtime is responsible for running containers (in Pods).
Kubernetes supports any implementation of the Kubernetes Container Runtime Interface (CRI) specifications.
To run the containers, each worker node has a container runtime engine.
It pulls images from a container image registry and starts and stops containers.

Kubernetes supports several CRI-compliant container runtimes:
- containerd — the most widely adopted runtime (used by ~80% of production clusters), default for EKS, GKE, and AKS.
- CRI-O — lightweight OCI-compliant runtime, commonly used with OpenShift.

⚠️ Important: Docker (via dockershim) is no longer supported as a container runtime since Kubernetes v1.24 (April 2022). The dockershim component was removed from the kubelet. Docker-built images continue to work with all CRI-compliant runtimes as they produce standard OCI images. For clusters previously using Docker, the recommended migration is to containerd.

Key Kubernetes Architecture Updates (2024-2026)

Native Sidecar Containers (GA in v1.33) — init containers with restartPolicy: Always that run alongside app containers for the pod’s entire lifecycle, providing first-class support for service meshes, log shippers, and observability agents.

User Namespaces (GA in v1.36) — containers get their own UID/GID space, so a container running as root no longer maps to root on the host node, significantly improving security isolation.
Gateway API — the successor to Ingress, now the recommended way to manage external traffic routing into the cluster. The CKA exam curriculum has been updated to include Gateway API.
ValidatingAdmissionPolicy (GA in v1.36) — declarative, in-process admission control using CEL expressions as an alternative to admission webhooks.

PSI Metrics (GA in v1.36) — Pressure Stall Information metrics exposed by the kubelet for node-level resource pressure monitoring.