Wednesday, April 19, 2023

Kubernetes - decommissioning a node from the cluster

A Kubernetes cluster is a group of nodes used to run containerized applications and services. The cluster consists of a control plane, which manages the overall state of the cluster, and worker nodes, which run the containerized workloads.

The control plane is responsible for managing the configuration and deployment of applications on the cluster, as well as monitoring and scaling the cluster as needed. It includes components such as the Kubernetes API server, the etcd datastore, the kube-scheduler, and the kube-controller-manager.

The worker nodes are responsible for running the containerized applications and services. Each node typically runs a container runtime, such as Docker or containerd, as well as a kubelet process that communicates with the control plane to manage the containers running on the node.
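
If you want to see this on a live cluster, kubectl can show the kubelet version and the container runtime each node is running. This is a minimal check, not part of the decommissioning procedure itself:

# The wide output adds INTERNAL-IP, OS-IMAGE, KERNEL-VERSION and CONTAINER-RUNTIME columns;
# the VERSION column is the kubelet version reported by each node.
kubectl get nodes -o wide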

In a Kubernetes cluster, applications are deployed as pods, which are the smallest deployable units in Kubernetes. Pods contain one or more containers, and each pod runs on a single node in the cluster. Kubernetes manages the deployment and scaling of the pods across the cluster, ensuring that the workload is evenly distributed and resources are utilized efficiently.
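
To see which node each pod has landed on, the wide pod listing includes a NODE column. A quick, generic example (not tied to any particular namespace):

# List pods across all namespaces together with the node each one is running on
kubectl get pods --all-namespaces -o wide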

In Kubernetes, the native scheduler is a built-in component responsible for scheduling pods onto worker nodes in the cluster. When a new pod is created, the scheduler evaluates the resource requirements of the pod, along with any constraints or preferences specified in the pod's definition, and selects a node in the cluster where the pod can be scheduled.

The native scheduler uses a combination of heuristics and policies to determine the best node for each pod. It considers factors such as the available resources on each node, the affinity and anti-affinity requirements of the pod, any node selectors or taints on the nodes, and the current state of the cluster.

The native scheduler in Kubernetes is highly configurable and can be customized to meet the specific needs of different workloads. For example, you can configure the scheduler to prioritize certain nodes in the cluster over others, or to balance the workload evenly across all available nodes.

[sachinpb@remotenode18 ~]$ kubectl get pods -n kube-system | grep kube-scheduler
kube-scheduler-remotenode18                       1/1     Running            11                  398d
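
The information the scheduler weighs for a node (allocatable CPU and memory, labels, taints) can be inspected with kubectl describe. The node name below is simply one of the workers from this cluster:

# Show the taints and the allocatable resources the scheduler takes into account for a node
kubectl describe node remotenode16 | grep -i -A 6 -E 'Taints|Allocatable'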

kubectl cordon is a command in Kubernetes that is used to mark a node as unschedulable. This means that Kubernetes will no longer schedule any new pods on the node, but will continue to run any existing pods on the node.

The kubectl cordon command is useful when you need to take a node offline for maintenance or other reasons, but you want to ensure that the existing pods on the node continue to run until they can be safely moved to other nodes in the cluster. By marking the node as unschedulable, you can prevent Kubernetes from scheduling any new pods on the node, which helps to ensure that the overall health and stability of the cluster is maintained.

[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME          STATUS                      ROLES                   AGE    VERSION
remotenode01   Ready                      worker                  270d    v1.23.4
remotenode02   Ready                      worker                  270d    v1.23.4
remotenode03   Ready                      worker                  270d    v1.23.4
remotenode04   Ready                      worker                  81d      v1.23.4
remotenode07   Ready                      worker                  389d    v1.23.4
remotenode08   Ready                      worker                  389d    v1.23.4
remotenode09   Ready                      worker                  389d    v1.23.4
remotenode14   Ready                      worker                  396d    v1.23.4
remotenode15   Ready                      worker                  81d     v1.23.4
remotenode16   Ready                      worker                 396d    v1.23.4
remotenode17   Ready                      worker                 396d    v1.23.4
remotenode18   Ready                      control-plane,master    398d   v1.23.4

[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16
node/remotenode16 cordoned
[sachinpb@remotenode18 ~]$  kubectl uncordon remotenode16
node/remotenode16 uncordoned

[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16
node/remotenode16 cordoned
[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME           STATUS                     ROLES                   AGE    VERSION
remotenode01   Ready                      worker                  270d   v1.23.4
remotenode02   Ready                      worker                  270d   v1.23.4
remotenode03   Ready                      worker                  270d   v1.23.4
remotenode04   Ready                      worker                  81d    v1.23.4
remotenode07   Ready                      worker                  389d   v1.23.4
remotenode08   Ready                      worker                  389d   v1.23.4
remotenode09   Ready                      worker                  389d   v1.23.4
remotenode14   Ready                      worker                  396d   v1.23.4
remotenode15   Ready                      worker                  81d    v1.23.4
remotenode16   Ready,SchedulingDisabled   worker                  396d   v1.23.4
remotenode17   Ready                      worker                  396d   v1.23.4
remotenode18   Ready                      control-plane,master    398d   v1.23.4

[sachinpb@remotenode18 ~]$ 
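
Under the hood, cordoning simply sets the node's spec.unschedulable field to true (Kubernetes also adds the node.kubernetes.io/unschedulable:NoSchedule taint). If you want to verify the flag directly, a jsonpath query works; this is just a quick check, not a required step:

# Prints "true" while the node is cordoned; empty output means the node is schedulable
kubectl get node remotenode16 -o jsonpath='{.spec.unschedulable}'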

After the node has been cordoned off, you can use the kubectl drain command to safely and gracefully terminate any running pods on the node and reschedule them onto other available nodes in the cluster. Once all the pods have been moved, the node can then be safely removed from the cluster.

kubectl drain is a command in Kubernetes that is used to gracefully remove a node from a cluster. This is typically used when performing maintenance on a node, such as upgrading or replacing hardware, or when decommissioning a node from the cluster.

Source

[sachinpb@remotenode18 ~]$ kubectl drain --ignore-daemonsets remotenode16
node/remotenode16 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
node/remotenode16 drained
[sachinpb@remotenode18 ~]$

By default, kubectl drain is non-destructive; you have to override its flags to change that behaviour. It runs with the following defaults:

  --delete-local-data=false
  --force=false
  --grace-period=-1 (Period of time in seconds given to each pod to terminate gracefully. If negative, the default value specified in the pod will be used.)
  --ignore-daemonsets=false
  --timeout=0s

Each of these safeguards deals with a different category of potential destruction (local data, bare pods, graceful termination, daemonsets). Drain also respects pod disruption budgets to preserve workload availability. Any non-bare pod will be recreated on a new node by its respective controller (e.g. DaemonSet controller, replication controller). It's up to you whether you want to override that behaviour (for example, you might have a bare pod if running a Jenkins job; if you override by setting --force=true, it will delete that pod and it won't be recreated). If you don't override it, the node will stay in drain mode indefinitely (--timeout=0s).

Source
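
As a sketch of what overriding those safeguards looks like (node name taken from this cluster, flag values chosen purely for illustration; on recent kubectl releases --delete-local-data has been renamed to --delete-emptydir-data):

# --ignore-daemonsets    : skip DaemonSet-managed pods instead of aborting
# --delete-emptydir-data : allow evicting pods that use emptyDir (local) volumes
# --force                : also delete "bare" pods not managed by any controller
# --grace-period=60      : give each pod 60 seconds to terminate instead of its own default
# --timeout=5m           : give up instead of waiting indefinitely
kubectl drain remotenode16 --ignore-daemonsets --delete-emptydir-data --force --grace-period=60 --timeout=5m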

When a node is drained, Kubernetes will automatically reschedule any running pods onto other available nodes in the cluster, ensuring that the workload is not interrupted. The kubectl drain command ensures that the node is cordoned off, meaning no new pods will be scheduled on it, and then gracefully terminates any running pods on the node. This helps to ensure that the pods are shut down cleanly, allowing them to complete any in-progress tasks and save any data before they are terminated.

After the pods have been rescheduled, the node can then be safely removed from the cluster. This helps to ensure that the overall health and stability of the cluster are maintained, even when individual nodes need to be taken offline for maintenance or other reasons.
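
If the goal is to decommission the node rather than just service it, the final step is to delete the node object once the drain has finished (and, if the cluster was set up with kubeadm, to reset the machine locally). A minimal sketch:

# Remove the node object from the cluster after it has been drained
kubectl delete node remotenode16

# Optionally, on the decommissioned machine itself (kubeadm-based clusters only):
# kubeadm reset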

When kubectl drain returns successfully, that indicates that all of the pods have been safely evicted. It is then safe to bring down the node. After maintenance work we can use kubectl uncordon to tell Kubernetes that it can resume scheduling new pods onto the node.

[sachinpb@remotenode18 ~]$  kubectl uncordon remotenode16
node/remotenode16 uncordoned

Let's try all the above steps and see:

1) Retrieve information from a Kubernetes cluster

[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME          STATUS                      ROLES                   AGE    VERSION
remotenode01   Ready                      worker                  270d    v1.23.4
remotenode02   Ready                      worker                  270d    v1.23.4
remotenode03   Ready                      worker                  270d    v1.23.4
remotenode04   Ready                      worker                  81d      v1.23.4
remotenode07   Ready                      worker                  389d    v1.23.4
remotenode08   Ready                      worker                  389d    v1.23.4
remotenode09   Ready                      worker                  389d    v1.23.4
remotenode14   Ready                      worker                  396d    v1.23.4
remotenode15   Ready                      worker                  81d     v1.23.4
remotenode16   Ready                      worker                 396d    v1.23.4
remotenode17   Ready                      worker                 396d    v1.23.4
remotenode18   Ready                      control-plane,master    398d   v1.23.4

--------------------------------

2) Kubernetes cordon is an operation that marks a node in your existing node pool as unschedulable, so no new pods will be scheduled onto it.

[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16
node/remotenode16 cordoned
[sachinpb@remotenode18 ~]$

[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME          STATUS                      ROLES                   AGE    VERSION
remotenode01   Ready                      worker                  270d    v1.23.4
remotenode02   Ready                      worker                  270d    v1.23.4
remotenode03   Ready                      worker                  270d    v1.23.4
remotenode04   Ready                      worker                  81d      v1.23.4
remotenode07   Ready                      worker                  389d    v1.23.4
remotenode08   Ready                      worker                  389d    v1.23.4
remotenode09   Ready                      worker                  389d    v1.23.4
remotenode14   Ready                      worker                  396d    v1.23.4
remotenode15   Ready                      worker                  81d     v1.23.4
remotenode16   Ready,SchedulingDisabled   worker                 396d    v1.23.4
remotenode17   Ready                      worker                 396d    v1.23.4
remotenode18   Ready                      control-plane,master    398d   v1.23.4

3) Drain the node in preparation for maintenance. The given node will be marked unschedulable to prevent new pods from arriving, and drain then evicts or deletes the pods running on it (with the exceptions described in the NOTE below).


[sachinpb@remotenode18 ~]$ kubectl drain remotenode16 --grace-period=2400
node/remotenode16 already cordoned
error: unable to drain node "remotenode16" due to error:cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db, continuing command...
There are pending nodes to be drained:
 remotenode16
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
[sachinpb@remotenode18 ~]$

NOTE:

The given node will be marked unschedulable to prevent new pods from arriving. Then drain deletes all pods except mirror pods (which cannot be deleted through the API server). If there are DaemonSet-managed pods, drain will not proceed without --ignore-daemonsets, and regardless it will not delete any DaemonSet-managed pods, because those pods would be immediately replaced by the DaemonSet controller, which ignores unschedulable markings. If there are any pods that are neither mirror pods nor managed by a ReplicationController, DaemonSet or Job, then drain will not delete any pods unless you use --force.
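
Before draining, it can help to see exactly which pods are running on the node and which of them are DaemonSet-managed or bare. A quick way to list them (using the same node as above):

# List every pod currently scheduled on the node, across all namespaces
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=remotenode16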

----------------------------

4) Drain the node with --ignore-daemonsets

[sachinpb@remotenode18 ~]$ kubectl drain --ignore-daemonsets remotenode16 --grace-period=2400
node/remotenode16 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
node/remotenode16 drained

----------------------

5) Uncordon will mark the node as schedulable.

[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16
node/remotenode16 uncordoned
[sachinpb@remotenode18 ~]$

-----------------

6) Retrieve information from a Kubernetes cluster

[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME          STATUS                      ROLES                   AGE    VERSION
remotenode01   Ready                      worker                  270d    v1.23.4
remotenode02   Ready                      worker                  270d    v1.23.4
remotenode03   Ready                      worker                  270d    v1.23.4
remotenode04   Ready                      worker                  81d      v1.23.4
remotenode07   Ready                      worker                  389d    v1.23.4
remotenode08   Ready                      worker                  389d    v1.23.4
remotenode09   Ready                      worker                  389d    v1.23.4
remotenode14   Ready                      worker                  396d    v1.23.4
remotenode15   Ready                      worker                  81d     v1.23.4
remotenode16   Ready                      worker                 396d    v1.23.4
remotenode17   Ready                      worker                 396d    v1.23.4
remotenode18   Ready                      control-plane,master    398d   v1.23.4

The above process can be automated by creating a Jenkins pipeline job that cordons, drains and uncordons the nodes, with the help of a Groovy script:

-------------------------Sample groovy script--------------------------------

node("Kubernetes-master-node") {
    stage("1") {
        sh 'hostname'
        sh 'cat $SACHIN_HOME/manual//hostfile'
        k8s_cordon_drain()
        k8s_uncordon()      
    }    
}

/*
 * CI - Kubernetes cluster: this function cordons and drains the worker nodes listed in the hostfile.
 */
def k8s_cordon_drain() {
  def maxTries = 3 // the maximum number of times to retry the kubectl commands
  def sleepTime = 5 // seconds to wait between retries (the Jenkins 'sleep' step takes seconds by default)
  def filename = '$SACHIN_HOME/manual/hostfile'
  def content = readFile(filename)
  def hosts = content.readLines().collect { it.split()[0] }
  println "List of Hostnames to be cordoned from K8s cluster: ${hosts}"
  hosts.each { host ->
    def command1 = "kubectl cordon $host"
    def command2 = "kubectl drain --ignore-daemonsets --grace-period=2400 $host"
    def tries = 0
    def result1 = null
    def result2 = null
    while (tries < maxTries) {
      result1 = sh(script: command1, returnStatus: true)
      if (result1 == 0) {
        println "Successfully cordoned $host"
        break
      } else {
        tries++
        println "Failed to cordoned $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."
        sleep(sleepTime)
      }
    }
    if (result1 == 0) {
      tries = 0
      while (tries < maxTries) {
        result2 = sh(script: command2, returnStatus: true)
        if (result2 == 0) {
          println "Successfully drained $host"
          break
        } else {
          tries++
          println "Failed to drain $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."
          sleep(sleepTime)
        }
      }
    }

    if (result1 != 0) {
      println "Failed to cordon $host after $maxTries attempts; drain skipped"
    } else if (result2 != 0) {
      println "Failed to drain $host after $maxTries attempts"
    }
  }
}

/*
 * CI - Kubernetes cluster: this function uncordons the worker nodes listed in the hostfile.
 */
def k8s_uncordon() {
  def maxTries = 3 // the maximum number of times to retry the kubectl commands
  def sleepTime = 5 // seconds to wait between retries (the Jenkins 'sleep' step takes seconds by default)
  def filename = '$SACHIN_HOME/manual/hostfile'
  def content = readFile(filename)
  def hosts = content.readLines().collect { it.split()[0] }
  println "List of Hostnames to be uncordoned from K8s cluster: ${hosts}"
  hosts.each { host ->
    def command1 = "kubectl uncordon $host"
    def tries = 0
    def result1 = null
    while (tries < maxTries) {
      result1 = sh(script: command1, returnStatus: true)
      if (result1 == 0) {
        println "Successfully cordoned $host"
        break
      } else {
        tries++
        println "Failed to uncordon $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."
        sleep(sleepTime)
      }
    }
    if (result1 != 0) {
      println "Failed to uncordon $host after $maxTries attempts"
    }
  }
}

------------------Jenkins Console output for pipeline job -----------------

Started by user jenkins-admin
[Pipeline] Start of Pipeline
[Pipeline] node
Running on Kubernetes-master-node in $SACHIN_HOME/workspace/test_sample4_cordon_drain
[Pipeline] {
[Pipeline] stage
[Pipeline] { (1)
[Pipeline] sh
+ hostname
kubernetes-master-node
[Pipeline] sh
+ cat $SACHIN_HOME/manual//hostfile
Remotenode16 slots=4
Remotenode17 slots=4
[Pipeline] readFile
[Pipeline] echo
List of Hostnames to be cordoned from K8s cluster: [Remotenode16, Remotenode17]
[Pipeline] sh
+ kubectl cordon Remotenode16
node/Remotenode16 cordoned
[Pipeline] echo
Successfully cordoned Remotenode16
[Pipeline] sh
+ kubectl drain --ignore-daemonsets --grace-period=2400 Remotenode16
node/Remotenode16 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
node/Remotenode16 drained
[Pipeline] echo
Successfully drained Remotenode16
[Pipeline] sh
+ kubectl cordon Remotenode17
node/Remotenode17 cordoned
[Pipeline] echo
Successfully cordoned Remotenode17
[Pipeline] sh
+ kubectl drain --ignore-daemonsets --grace-period=2400 Remotenode17
node/Remotenode17 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-hz5zh, kube-system/fuse-device-plugin-daemonset-dj72m, kube-system/kube-proxy-g87dc, kube-system/nvidia-device-plugin-daemonset-tk5x8, kube-system/rdma-shared-dp-ds-n4g5w, sys-monitor/prometheus-op-prometheus-node-exporter-gczmz
node/Remotenode17 drained
[Pipeline] echo
Successfully drained Remotenode17
[Pipeline] readFile
[Pipeline] echo
List of Hostnames to be uncordoned from K8s cluster: [Remotenode16, Remotenode17]
[Pipeline] sh
+ kubectl uncordon Remotenode16
node/Remotenode16 uncordoned
[Pipeline] echo
Successfully uncordoned Remotenode16
[Pipeline] sh
+ kubectl uncordon Remotenode17
node/Remotenode17 uncordoned
[Pipeline] echo
Successfully uncordoned Remotenode17
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: SUCCESS

-----------------------------------------------------------------

Reference:

https://kubernetes.io/docs/home/
