LINUX & HPC : Advanced Large Scale Computing at a Glance !: April 2023

Wednesday, April 19, 2023

Kubernetes - decommissioning a node from the cluster

Kubernetes cluster is a group of nodes that are used to run containerized applications and services. The cluster consists of a control plane, which manages the overall state of the cluster, and worker nodes, which run the containerized applications.

The control plane is responsible for managing the configuration and deployment of applications on the cluster, as well as monitoring and scaling the cluster as needed. It includes components such as the Kubernetes API server, the etcd datastore, the kube-scheduler, and the kube-controller-manager.

The worker nodes are responsible for running the containerized applications and services. Each node typically runs a container runtime, such as Docker or containerd, as well as a kubelet process that communicates with the control plane to manage the containers running on the node.

In a Kubernetes cluster, applications are deployed as pods, which are the smallest deployable units in Kubernetes. Pods contain one or more containers, and each pod runs on a single node in the cluster. Kubernetes manages the deployment and scaling of the pods across the cluster, ensuring that the workload is evenly distributed and resources are utilized efficiently.

In Kubernetes, the native scheduler is a built-in component responsible for scheduling pods onto worker nodes in the cluster. When a new pod is created, the scheduler evaluates the resource requirements of the pod, along with any constraints or preferences specified in the pod's definition, and selects a node in the cluster where the pod can be scheduled. The native scheduler uses a combination of heuristics and policies to determine the best node for each pod. It considers factors such as the available resources on each node, the affinity and anti-affinity requirements of the pod, any node selectors or taints on the nodes, and the current state of the cluster. The native scheduler in Kubernetes is highly configurable and can be customized to meet the specific needs of different workloads. For example, you can configure the scheduler to prioritize certain nodes in the cluster over others, or to balance the workload evenly across all available nodes.

[sachinpb@remotehostn18 ~]$ kubectl get pods -n kube-system | grep kube-scheduler
kube-scheduler-remotehost18 1/1 Running 11 398d

kubectl cordon is a command in Kubernetes that is used to mark a node as unschedulable. This means that Kubernetes will no longer schedule any new pods on the node, but will continue to run any existing pods on the node.

The kubectl cordon command is useful when you need to take a node offline for maintenance or other reasons, but you want to ensure that the existing pods on the node continue to run until they can be safely moved to other nodes in the cluster. By marking the node as unschedulable, you can prevent Kubernetes from scheduling any new pods on the node, which helps to ensure that the overall health and stability of the cluster is maintained.

[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
remotenode01 Ready worker 270d v1.23.4
remotenode02 Ready worker 270d v1.23.4
remotenode03 Ready worker 270d v1.23.4
remotenode04 Ready worker 81d v1.23.4
remotenode07 Ready worker 389d v1.23.4
remotenode08 Ready worker 389d v1.23.4
remotenode09 Ready worker 389d v1.23.4
remotenode14 Ready worker 396d v1.23.4
remotenode15 Ready worker 81d v1.23.4
remotenode16 Ready worker 396d v1.23.4
remotenode17 Ready worker 396d v1.23.4
remotenode18 Ready control-plane,master 398d v1.23.4

[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16
node/remotenode16 cordoned
[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16
node/remotenode16 uncordoned

[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16
node/remotenode16 cordoned
[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
remotenode01 Ready worker 270d v1.23.4
remotenode02 Ready worker 270d v1.23.4
remotenode03 Ready worker 270d v1.23.4
remotenode04 Ready worker
remotenode07 Ready worker 389d v1.23.4
remotenode08 Ready worker 389d v1.23.4
remotenode09 Ready worker 389d v1.23.4
remotenode14 Ready worker 396d v1.23.4
remotenode15 Ready worker 81d v1.23.4
remotenode16 Ready,SchedulingDisabled worker 396d v1.23.4

remotenode17 Ready worker 396d v1.23.4
remotenode18 Ready control-plane,master 398d v1.23.4

[sachinpb@remotenode18 ~]$

After the node has been cordoned off, you can use the kubectl drain command to safely and gracefully terminate any running pods on the node and reschedule them onto other available nodes in the cluster. Once all the pods have been moved, the node can then be safely removed from the cluster.

kubectl drain is a command in Kubernetes that is used to gracefully remove a node from a cluster. This is typically used when performing maintenance on a node, such as upgrading or replacing hardware, or when decommissioning a node from the cluster.

Source

[sachinpb@remotenode18 ~]$ kubectl drain --ignore-daemonsets remote16
node/remote16 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
node/remote16 drained
[sachinpb@remotenode18 ~]$

By default kubectl drain is non-destructive, you have to override to change that behaviour. It runs with the following defaults:

--delete-local-data=false
--force=false
--grace-period=-1 (Period of time in seconds given to each pod to terminate gracefully. If negative, the default value specified in the pod will be used.)
--ignore-daemonsets=false
--timeout=0s

Each of these safeguard deals with a different category of potential destruction (local data, bare pods, graceful termination, daemonsets). It also respects pod disruption budgets to adhere to workload availability. Any non-bare pod will be recreated on a new node by its respective controller (e.g. daemonset controller, replication controller). It's up to you whether you want to override that behaviour (for example you might have a bare pod if running jenkins job. If you override by setting --force=true it will delete that pod and it won't be recreated). If you don't override it, the node will be in drain mode indefinitely (--timeout=0s))

Source

When a node is drained, Kubernetes will automatically reschedule any running pods onto other available nodes in the cluster, ensuring that the workload is not interrupted. The kubectl drain command ensures that the node is cordoned off, meaning no new pods will be scheduled on it, and then gracefully terminates any running pods on the node. This helps to ensure that the pods are shut down cleanly, allowing them to complete any in-progress tasks and save any data before they are terminated.

After the pods have been rescheduled, the node can then be safely removed from the cluster. This helps to ensure that the overall health and stability of the cluster is maintained, even when individual nodes need to be taken offline for maintenance or other reasons

When kubectl drain returns successfully, that indicates that all of the pods have been safely evicted. It is then safe to bring down the node. After maintenance work we can use kubectl uncordon to tell Kubernetes that it can resume scheduling new pods onto the node.

[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16
node/remotenode16 uncordoned

Let's try all the above steps and see :

1) Retrieve information from a Kubernetes cluster

--------------------------------

2) Kubernetes cordon is an operation that marks or taints a node in your existing node pool as unschedulable.

[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16
node/remotenode16 cordoned
[sachinpb@remotenode18 ~]$

[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
remotenode01 Ready worker 270d v1.23.4
remotenode02 Ready worker 270d v1.23.4
remotenode03 Ready worker 270d v1.23.4
remotenode04 Ready worker 81d v1.23.4
remotenode07 Ready worker 389d v1.23.4
remotenode08 Ready worker 389d v1.23.4
remotenode09 Ready worker 389d v1.23.4
remotenode14 Ready worker 396d v1.23.4
remotenode15 Ready worker 81d v1.23.4
remotenode16 Ready,SchedulingDisabled worker 396d v1.23.4
remotenode17 Ready worker 396d v1.23.4
remotenode18 Ready control-plane,master 398d v1.23.4

3) Drain node in preparation for maintenance. The given node will be marked unschedulable to prevent new pods from arriving. Then drain deletes all pods

[sachinpb@remotenode18 ~]$ kubectl drain remotenode16 --grace-period=2400
node/remotenode16 already cordoned
error: unable to drain node "remotenode16" due to error:cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db, continuing command...
There are pending nodes to be drained:
remotenode16
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
[sachinpb@remotenode18 ~]$

NOTE:

The given node will be marked unschedulable to prevent new pods from arriving. Then drain deletes all pods except mirror pods (which cannot be deleted through the API server). If there are DaemonSet-managed pods, drain will not proceed without –ignore-daemonsets, and regardless it will not delete any DaemonSet-managed pods, because those pods would be immediately replaced by the DaemonSet controller, which ignores unschedulable markings. If there are any pods that are neither mirror pods nor managed–by ReplicationController, DaemonSet or Job–, then drain will not delete any pods unless you use –force.

----------------------------

4) Drain node with --ignore-daemonsets

[sachinpb@remotenode18 ~]$ kubectl drain --ignore-daemonsets remotenode16 --grace-period=2400
node/remotenode16 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
node/remotenode16 drained

----------------------

5) Uncordon will mark the node as schedulable.

[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16
node/remotenode16 uncordoned
[sachinpb@remotenode18 ~]$

-----------------

6) Retrieve information from a Kubernetes cluster

How to automate above process creating Jenkins pipeline job to cordon ,drain and uncordon the nodes with the help of groovy script:

-------------------------Sample groovy script--------------------------------

node("Kubernetes-master-node") {
stage("1") {
sh 'hostname'
sh 'cat $SACHIN_HOME/manual//hostfile'
k8s_cordon_drain()
k8s_uncordon()
}
}

/*
* CI -Kubernetes cluster : This function will cordon/drain the worker nodes in hostfile

*/
def k8s_cordon_drain() {
def maxTries = 3 // the maximum number of times to retry the kubectl commands
def sleepTime = 5 * 1000 // the amount of time to wait between retries (in milliseconds)
def filename = '$SACHIN_HOME/manual/hostfile'
def content = readFile(filename)
def hosts = content.readLines().collect { it.split()[0] }
println "List of Hostnames to be cordoned from K8s cluster: ${hosts}"
hosts.each { host ->
def command1 = "kubectl cordon $host"
def command2 = "kubectl drain --ignore-daemonsets --grace-period=2400 $host"
def tries = 0
def result1 = null
def result2 = null
while (tries < maxTries) {
result1 = sh(script: command1, returnStatus: true)
if (result1 == 0) {
println "Successfully cordoned $host"
break
} else {
tries++
println "Failed to cordoned $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."
sleep(sleepTime)
}
}
if (result1 == 0) {
tries = 0
while (tries < maxTries) {
result2 = sh(script: command2, returnStatus: true)
if (result2 == 0) {
println "Successfully drained $host"
break
} else {
tries++
println "Failed to drain $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."
sleep(sleepTime)
}
}
}

if (result2 != 0) {
println "Failed to drain $host after $maxTries attempts"
}
}
}

/*
* CI - Kubernetes cluster : This function will uncordon the worker nodes in hostfile

*/
def k8s_uncordon() {
def maxTries = 3 // the maximum number of times to retry the kubectl commands
def sleepTime = 5 * 1000 // the amount of time to wait between retries (in milliseconds)
def filename = '$SACHIN_HOME/manual/hostfile'
def content = readFile(filename)
def hosts = content.readLines().collect { it.split()[0] }
println "List of Hostnames to be uncordoned from K8s cluster: ${hosts}"
hosts.each { host ->
def command1 = "kubectl uncordon $host"
def tries = 0
def result1 = null
while (tries < maxTries) {
result1 = sh(script: command1, returnStatus: true)
if (result1 == 0) {
println "Successfully cordoned $host"
break
} else {
tries++
println "Failed to uncordon $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."
sleep(sleepTime)
}
}
if (result1 != 0) {
println "Failed to uncordon $host after $maxTries attempts"
}
}
}

------------------Jenkins Console output for pipeline job -----------------

Started by user jenkins-admin
[Pipeline] Start of Pipeline
[Pipeline] node
Running on Kubernetes-master-node in $SACHIN_HOME/workspace/test_sample4_cordon_drain
[Pipeline] {
[Pipeline] stage
[Pipeline] { (1)
[Pipeline] sh
+ hostname
kubernetes-master-node
[Pipeline] sh
+ cat $SACHIN_HOME/manual//hostfile
Remotenode16 slots=4
Remotenode17 slots=4
[Pipeline] readFile
[Pipeline] echo
List of Hostnames to be cordoned from K8s cluster: [Remotenode16, Remotenode17]
[Pipeline] sh
+ kubectl cordon Remotenode16
node/Remotenode16 cordoned
[Pipeline] echo
Successfully cordoned Remotenode16
[Pipeline] sh
+ kubectl drain --ignore-daemonsets --grace-period=2400 Remotenode16
node/Remotenode16 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
node/Remotenode16 drained
[Pipeline] echo
Successfully drained Remotenode16
[Pipeline] sh
+ kubectl cordon Remotenode17
node/Remotenode17 cordoned
[Pipeline] echo
Successfully cordoned Remotenode17
[Pipeline] sh
+ kubectl drain --ignore-daemonsets --grace-period=2400 Remotenode17
node/Remotenode17 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-hz5zh, kube-system/fuse-device-plugin-daemonset-dj72m, kube-system/kube-proxy-g87dc, kube-system/nvidia-device-plugin-daemonset-tk5x8, kube-system/rdma-shared-dp-ds-n4g5w, sys-monitor/prometheus-op-prometheus-node-exporter-gczmz
node/Remotenode17 drained
[Pipeline] echo
Successfully drained Remotenode17
[Pipeline] readFile
[Pipeline] echo
List of Hostnames to be uncordoned from K8s cluster: [Remotenode16, Remotenode17]
[Pipeline] sh
+ kubectl uncordon Remotenode16
node/Remotenode16 uncordoned
[Pipeline] echo
Successfully cordoned Remotenode16
[Pipeline] sh
+ kubectl uncordon Remotenode17
node/Remotenode17 uncordoned
[Pipeline] echo
Successfully cordoned Remotenode17
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: SUCCESS

-----------------------------------------------------------------

Reference:

https://kubernetes.io/docs/home/

Thursday, April 13, 2023

IBM Spectrum Symphony and LSF with Apache Hadoop

IBM Spectrum Symphony (formerly known as IBM Platform Symphony) is a high-performance computing (HPC) and grid computing software platform that enables organizations to process large amounts of data and run compute-intensive applications at scale. It provides a distributed computing infrastructure that can be used for a wide range of data-intensive workloads, such as scientific simulations, financial modeling, and big data analytics. IBM Spectrum Symphony is a parallel services middleware and cluster manager. It is widely used in banks for risk analytics, data analytics in a shared, multi-user, multi-application, multi-job environment. IBM Spectrum Symphony also works with IBM Spectrum LSF (for batch workloads) in the same cluster to allow both batch and parallel services workloads to share the same cluster.

Some of the key features of IBM Spectrum Symphony include:

Distributed computing: The platform allows organizations to distribute computing workloads across a large number of nodes, which can be located in different data centers or cloud environments.
Resource management: IBM Spectrum Symphony provides a resource management framework that allows organizations to allocate and manage compute, storage, and network resources more efficiently.
High availability: The platform is designed to provide high availability and fault tolerance, ensuring that applications can continue to run even if individual nodes or components fail.
Performance optimization: IBM Spectrum Symphony includes a range of performance optimization features, such as load balancing and data caching, which can help organizations to achieve faster processing times and better overall performance.
Support for multiple programming languages: The platform supports a wide range of programming languages, including Java, Python, and C++, which makes it easy for developers to build and deploy applications on the platform.

IBM Spectrum LSF (Load Sharing Facility) is another software platform that is often used in conjunction with IBM Spectrum Symphony to manage and optimize workloads in a distributed computing environment. LSF provides a range of features for resource management, workload scheduling, and job prioritization, which can help organizations to improve performance and efficiency.

When used together, IBM Spectrum Symphony and IBM Spectrum LSF can provide a comprehensive solution for managing and optimizing large-scale distributed computing environments. IBM Spectrum Symphony provides the distributed computing infrastructure and application management capabilities, while IBM Spectrum LSF provides the workload management and optimization features.

Some of the key features of LSF that complement IBM Spectrum Symphony include:

Advanced job scheduling: LSF provides sophisticated job scheduling capabilities, allowing organizations to prioritize and schedule jobs based on a wide range of criteria, such as resource availability, job dependencies, and user priorities.
Resource allocation: LSF can manage the allocation of resources, ensuring that jobs are run on the most appropriate nodes and that resources are used efficiently.
Job monitoring: LSF provides real-time monitoring of job progress and resource usage, allowing organizations to quickly identify and resolve issues that may impact performance.
Integration with other tools: LSF can be integrated with a wide range of other HPC tools and applications, including IBM Spectrum Symphony, providing a seamless workflow for managing complex computing workloads.

Integrating LSF with Hadoop can help organizations to optimize the use of their resources and achieve better performance when running Hadoop workloads.

Apache Hadoop ("Hadoop") is a framework for large-scale distributed data storage and processing on computer clusters that uses the Hadoop Distributed File System ("HDFS") for the data storage and MapReduce programming model for the data processing. Since MapReduce workloads might only represent a small fraction of overall workload, but typically requires their own standalone environment, MapReduce is difficult to support within traditional HPC clusters. However, HPC clusters typically use parallel file systems that are sufficient for initial MapReduce workloads, so you can run MapReduce workloads as regular parallel jobs running in an HPC cluster environment. Use the IBM Spectrum LSF integration with Apache Hadoop to submit Hadoop MapReduce workloads as regular LSF parallel jobs.

To run your Hadoop application through LSF, submit it as an LSF job. Once the LSF job starts to run, the Hadoop connector script (lsfhadoop.sh) automatically provisions an open source Hadoop cluster within LSF allocated resources, then submits actual MapReduce workloads into this Hadoop cluster. Since each LSF Hadoop job has its own resource (cluster), the integration provides a multi-tenancy environment to allow multiple users to share the common pool of HPC cluster resources. LSF is able to collect resource usage of MapReduce workloads as normal LSF parallel jobs and has full control of the job life cycle. After the job is complete, LSF shuts down the Hadoop cluster.

By default, the Apache Hadoop integration configures the Hadoop cluster with direct access to shared file systems and does not require HDFS. This allows you to use existing file systems in your HPC cluster without having to immediately invest in a new file system. Through the existing shared file system, data can be stored in common share locations, which avoids the typical data stage-in and stage-out steps with HDFS.

The general steps to integrate LSF with Hadoop:

Install and configure LSF: The first step is to install and configure LSF on the Hadoop cluster. This involves setting up LSF daemons on the cluster nodes and configuring LSF to work with the Hadoop Distributed File System (HDFS).
Configure Hadoop for LSF: Hadoop needs to be configured to use LSF as its resource manager. This involves setting the yarn.resourcemanager.scheduler.class property in the Hadoop configuration file to com.ibm.platform.lsf.yarn.LSFYarnScheduler.
Configure LSF for Hadoop: LSF needs to be configured to work with Hadoop by setting up the necessary environment variables and resource limits. This includes setting the LSF_SERVERDIR and LSF_LIBDIR environment variables to the LSF installation directory and configuring LSF resource limits to ensure that Hadoop jobs have access to the necessary resources.
Submit Hadoop jobs to LSF: Hadoop jobs can be submitted to LSF using the yarn command-line tool with the -Dmapreduce.job.submithostname and -Dmapreduce.job.queuename options set to the LSF submit host and queue, respectively.
Monitor Hadoop jobs in LSF: LSF provides a web-based user interface and command-line tools for monitoring and managing Hadoop jobs running on the cluster. This allows users to monitor job progress, resource usage, and other metrics, and to take corrective action if necessary.

LSF can be used as a standalone workload management software for Hadoop clusters, without the need for IBM Spectrum Symphony. LSF provides advanced job scheduling and resource management capabilities, which can be used to manage and optimize Hadoop workloads running on large HPC clusters. By integrating LSF with Hadoop, organizations can ensure that Hadoop jobs have access to the necessary resources and are scheduled and managed efficiently, improving overall performance and resource utilization.

In addition, IBM Spectrum Symphony provides additional capabilities beyond workload management, such as distributed computing infrastructure, data movement, and integration with other data center software. If an organization requires these additional capabilities, they may choose to use IBM Spectrum Symphony alongside LSF for even greater benefits. But LSF can be used independently as a workload manager for Hadoop clusters.

Submitting LSF jobs to a Hadoop cluster involves creating an LSF job script that launches the Hadoop job and then submitting the job to LSF using the bsub command. . LSF will then schedule the job to run on the cluster. To submit LSF jobs to a Hadoop cluster, you need to follow these general steps:

Write the Hadoop job: First, you need to write the Hadoop job that you want to run on the cluster. This can be done using any of the Hadoop APIs, such as MapReduce, Spark, or Hive.
Create the LSF job script: Next, you need to create an LSF job script that will launch the Hadoop job on the cluster. This script will typically include the Hadoop command to run the job, along with any necessary environment variables, resource requirements, and other LSF-specific settings.
Submit the LSF job: Once the job script is ready, you can submit it to LSF using the bsub command. This will add the job to the LSF queue and wait for available resources to run the job.
Monitor the job: LSF provides several tools for monitoring and managing jobs running on the cluster, such as the bjobs command and the LSF web interface. You can use

Example 1: bsub command that can be used to submit a Hadoop job to an LSF-managed Hadoop cluster:

bsub -J my_hadoop_job -oo my_hadoop_job.out -eo my_hadoop_job.err -R "rusage[mem=4096]" -q hadoop_queue hadoop jar my_hadoop_job.jar input_dir output_dir

where:

-J: Specifies a name for the job. In this case, we're using "my_hadoop_job" as the job name.

-oo: Redirects the standard output of the job to a file. In this case, we're using "my_hadoop_job.out" as the output file.

-eo: Redirects the standard error of the job to a file. In this case, we're using "my_hadoop_job.err" as the error file.

-R: Specifies resource requirements for the job. In this case, we're requesting 4 GB of memory (mem=4096) for the job.

-q: Specifies the LSF queue to submit the job to. In this case, we're using the "hadoop_queue" LSF queue.

After the bsub command options, we specify the Hadoop command to run the job (hadoop jar my_hadoop_job.jar) and the input and output directories for the job (input_dir and output_dir). This will submit the Hadoop job to LSF, which will then schedule and manage the job on the Hadoop cluster. For more details please refer these links.

Example 2: How to submit a Hadoop job using bsub command with LSF?

bsub -q hadoop -J "Hadoop Job" -n 10 -o hadoop.log -hadoop /path/to/hadoop/bin/hadoop jar /path/to/hadoop/examples.jar pi 10 1000

This command will submit a Hadoop job to the LSF scheduler and allocate resources as necessary based on the job's requirements.

where:

-q hadoop specifies that the job should be submitted to the Hadoop queue.
-J "Hadoop Job" specifies a name for the job.
-n 10 specifies the number of cores to use for the job.
-o hadoop.log specifies the name of the output log file.
-hadoop specifies that the command that follows should be executed on a Hadoop cluster.
/path/to/hadoop/bin/hadoop specifies the path to the Hadoop executable.
jar /path/to/hadoop/examples.jar pi 10 1000 specifies the command to run the Hadoop job, which in this case is the pi example program with 10 mappers and 1000 samples.

Example 3: How to submit a wordcount MapReduce job using bsub with LSF ?

bsub -q hadoop -J "MapReduce Job" -n 10 -o mapreduce.log -hadoop /path/to/hadoop/bin/hadoop jar /path/to/hadoop/examples.jar wordcount /input/data /output/data

where:

-q hadoop specifies that the job should be submitted to the Hadoop queue.

-J "MapReduce Job" specifies a name for the job.

-n 10 specifies the number of cores to use for the job.

-o mapreduce.log specifies the name of the output log file.

-hadoop specifies that the command that follows should be executed on a Hadoop cluster.

/path/to/hadoop/bin/hadoop specifies the path to the Hadoop executable.

jar /path/to/hadoop/examples.jar wordcount /input/data /output/data specifies the command to run the MapReduce job, which in this case is the wordcount example program with input data in /input/data and output data in /output/data.

Example 4: How to submit a terasort MapReduce job using bsub with LSF?

bsub -q hadoop -J "MapReduce Job" -n 20 -o mapreduce.log -hadoop /path/to/hadoop/bin/hadoop jar /path/to/hadoop/examples.jar terasort -Dmapred.map.tasks=100 -Dmapred.reduce.tasks=50 /input/data /output/data

where:

-q hadoop specifies that the job should be submitted to the Hadoop queue.

-J "MapReduce Job" specifies a name for the job.

-n 20 specifies the number of cores to use for the job.

-o mapreduce.log specifies the name of the output log file.

-hadoop specifies that the command that follows should be executed on a Hadoop cluster.

/path/to/hadoop/bin/hadoop specifies the path to the Hadoop executable.

jar /path/to/hadoop/examples.jar terasort -Dmapred.map.tasks=100 -Dmapred.reduce.tasks=50 /input/data /output/data specifies the command to run the MapReduce job, which in this case is the terasort example program with input data in /input/data and output data in /output/data, and specific configuration parameters to control the number of map and reduce tasks.

Example 5: How to submit a grep MapReduce job using bsub with LSF?

bsub -q hadoop -J "MapReduce Job" -n 30 -o mapreduce.log -hadoop /path/to/hadoop/bin/hadoop jar /path/to/hadoop/examples.jar grep -input /input/data -output /output/data -regex "example.*"

where:

-q hadoop specifies that the job should be submitted to the Hadoop queue.

-J "MapReduce Job" specifies a name for the job.

-n 30 specifies the number of cores to use for the job.

-o mapreduce.log specifies the name of the output log file.

-hadoop specifies that the command that follows should be executed on a Hadoop cluster.

/path/to/hadoop/bin/hadoop specifies the path to the Hadoop executable.

jar /path/to/hadoop/examples.jar grep -input /input/data -output /output/data -regex "example.*" specifies the command to run the MapReduce job, which in this case is the grep example program with input data in /input/data, output data in /output/data, and a regular expression pattern to search for.

Example 6: How to submit a non MapReduce hadoop job using bsub with LSF?

bsub -q hadoop -J "Hadoop Job" -n 10 -o hadoopjob.log -hadoop /path/to/hadoop/bin/hadoop fs -rm -r /path/to/hdfs/directory

where:

-q hadoop specifies that the job should be submitted to the Hadoop queue.

-J "Hadoop Job" specifies a name for the job.

-n 10 specifies the number of cores to use for the job.

-o hadoopjob.log specifies the name of the output log file.

-hadoop specifies that the command that follows should be executed on a Hadoop cluster.

/path/to/hadoop/bin/hadoop fs -rm -r /path/to/hdfs/directory specifies the command to run the Hadoop job, which in this case is to remove a directory in HDFS at /path/to/hdfs/directory.

This command will submit a non-MapReduce Hadoop job to the LSF scheduler and allocate resources as necessary based on the job's requirement

Example 7: If you have a Hadoop cluster with YARN and Spark installed, you can submit Spark jobs to the cluster using bsub as shown in the example.

bsub -q normal -J "Spark Job" -n 20 -o sparkjob.log /path/to/spark/bin/spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster /path/to/my/app.jar arg1 arg2

where:

-q normal specifies that the job should be submitted to the normal queue.

-J "Spark Job" specifies a name for the job.

-n 20 specifies the number of cores to use for the job.

-o sparkjob.log specifies the name of the output log file.

/path/to/spark/bin/spark-submit specifies the path to the spark-submit script.

--class com.example.MyApp specifies the main class of the Spark application.

--master yarn --deploy-mode cluster specifies the mode to run the application in.

/path/to/my/app.jar arg1 arg2 specifies the path to the application jar file and its arguments.

The above example does not explicitly require Hadoop to be installed or used. However, it assumes that the Spark cluster is running in YARN mode, which is typically used in a Hadoop cluster. In general, Spark can be run in various modes, including standalone, YARN, and Mesos. There are various other parameters and configurations that can be specified. Some examples include:

--num-executors: Specifies the number of executor processes to use for the job.

--executor-cores: Specifies the number of cores to allocate per executor.

--executor-memory: Specifies the amount of memory to allocate per executor.

--driver-memory: Specifies the amount of memory to allocate for the driver process.

--queue: Specifies the YARN queue to submit the job to.

--files: Specifies a comma-separated list of files to be distributed with the job.

--archives: Specifies a comma-separated list of archives to be distributed with the job.

These parameters can be used to fine-tune the resource allocation and performance of Spark jobs in a Hadoop cluster. Additionally, there are other options that can be used to configure the behavior of the Spark application itself, such as --conf to specify Spark configuration options and --jars to specify external JAR files to be used by the application

Here is an example LSF configuration file (lsf.conf) that includes settings for running Spark applications:

# LSF Configuration File

# Spark settings

LSB_JOB_REPORT_MAIL=N

LSB_DEFAULTGROUP=spark

LSB_DEFAULTJOBGROUP=spark

LSB_JOB_ACCOUNTING_INTERVAL=60

LSB_SUB_LOGLEVEL=3

LSB_JOB_PROLOGUE="/opt/spark/current/bin/load-spark-env.sh"

LSB_JOB_WRAPPER="mpirun -n 1 $LSF_BINDIR/lsb.wrapper $LSB_BINARY_NAME"

LSB_HOSTS_TASK_MODEL=cpu

An example Spark configuration file (spark-defaults.conf) that includes settings for running Spark applications using LSF:

# Spark Configuration File

# LSF settings

spark.master=yarn

spark.submit.deployMode=cluster

spark.yarn.queue=default

spark.executor.instances=2

spark.executor.memory=2g

spark.executor.cores=2

spark.driver.memory=1g

spark.driver.cores=1

spark.yarn.am.memory=1g

spark.yarn.am.cores=1

spark.yarn.maxAppAttempts=2

spark.eventLog.enabled=true

spark.eventLog.dir=hdfs://namenode:8020/spark-event-logs

spark.history.fs.logDirectory=hdfs://namenode:8020/spark-event-logs

spark.scheduler.mode=FAIR

spark.serializer=org.apache.spark.serializer.KryoSerializer

This configuration file sets several parameters for running Spark applications on a YARN cluster managed by LSF, including specifying the number of executor instances, executor memory, and executor cores, as well as setting the queue and memory allocation for the Spark ApplicationMaster.

Configure the Apache Hadoop integration

Run a Hadoop application on LSF

Using LSF as the scheduler for Hadoop can provide better resource utilization, job scheduling, queuing, integration with other workloads, and monitoring and management capabilities than the built-in YARN scheduler. This can help improve the performance, scalability, and efficiency of Hadoop clusters, especially in large, complex environments.

Better resource utilization: LSF has advanced resource allocation and scheduling algorithms that can improve resource utilization in Hadoop clusters. This can lead to better performance and reduced infrastructure costs.
Better job scheduling: LSF has more advanced job scheduling features than YARN, such as support for job dependencies, job preemption, and priority-based job scheduling. This can help optimize job execution and reduce waiting times.
Advanced queuing: LSF allows for more flexible and advanced queuing mechanisms, including job prioritization and preemption, multiple queues with different priorities, and customizable scheduling policies.
Integration with other workloads: LSF is a general-purpose job scheduler that can be used to manage a wide range of workloads, including Hadoop, MPI, and other distributed computing frameworks. This allows for better integration and coordination of workloads on the same infrastructure.
Advanced monitoring and management: LSF provides more advanced monitoring and management tools than YARN, including web-based interfaces, command-line tools, and APIs for job management, resource monitoring, and performance analysis.

LSF is a versatile job scheduler that can be used for a wide range of workloads, including batch and real-time scheduling. While LSF is often used for batch scheduling workloads, it can also be used for real-time scheduling workloads like Apache Kafka, thanks to its advanced scheduling capabilities and integration capabilities with other distributed computing frameworks.

LSF has advanced scheduling capabilities that can help optimize the allocation of resources for real-time workloads, including support for job prioritization, preemption, and multiple queues with different priorities. This can help ensure that real-time workloads are allocated the necessary resources in a timely and efficient manner.

Furthermore, LSF has integration capabilities with other distributed computing frameworks like Apache Kafka. For example, LSF can be used to manage the resource allocation and scheduling of Kafka brokers, consumers, and producers. This can help optimize the performance and scalability of Kafka clusters.

Examples for applications with real time scheduling:

A major financial services company uses Hadoop and LSF to process real-time financial data. LSF is used to manage the allocation of compute resources for Hadoop, including managing the cluster's memory, CPU, and disk resources. This setup enables the company to process real-time financial data with low latency and high throughput.
A large e-commerce company uses Hadoop and LSF to process large volumes of customer data in real-time. LSF is used to schedule and manage jobs across multiple Hadoop clusters, optimizing the allocation of resources to ensure that real-time processing is prioritized. This setup enables the company to personalize customer experiences and deliver targeted marketing campaigns in real-time.
A global telecommunications company uses Hadoop and LSF to process real-time data from its network infrastructure. LSF is used to manage job scheduling and resource allocation, ensuring that data is processed quickly and efficiently. This setup enables the company to monitor and optimize network performance in real-time, providing a better customer experience.

Overall, the combination of Hadoop and LSF can provide a powerful and flexible platform for processing both historical as well as real-time data in production environments. By leveraging the advanced resource management and scheduling capabilities of LSF, organizations can optimize performance, reduce latency, and improve the overall efficiency of their Hadoop clusters.

Reference:

https://www.sachinpbuzz.com/2019/08/spectum-lsf-multicluster-job-forwarding.html
https://www.sachinpbuzz.com/2021/08/spectrum-lsf-101-installation-and.html
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=hadoop-about-lsf-apache
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=hadoop-run-application-lsf
https://hadoop.apache.org/
https://www.edureka.co/blog/videos/hadoop-architecture/

Tuesday, April 4, 2023

Linux Test Harness : avocado and op-test framework

A Test Harness, also known as a testing framework or testing tool, is a software tool or library that provides a set of functions, APIs, or interfaces for writing, organizing, and executing tests. Test harnesses provide a structured way to write tests and automate the testing process.

Linux avocado test framework and Linux op-test framework are both open-source testing frameworks designed for testing and validating Linux-based systems. Both frameworks are widely used in the Linux community and have a strong user base. The choice between the two depends on the specific testing needs and requirements of the user.

The Linux avocado test framework is a modular and extensible testing framework that allows users to write and run tests for different levels of the Linux stack, including the kernel, user space, and applications. It provides a wide range of plugins and tools for testing, including functional, performance, and integration testing. The framework is easy to install and use and supports multiple test runners and reporting formats.

On the other hand, the Linux op-test framework is a set of Python libraries and utilities that automate the testing of hardware and firmware components in Linux-based systems. It provides a high-level Python API for interacting with hardware and firmware interfaces, as well as a set of pre-built tests for validating various hardware components such as CPU, memory, and storage. The framework is highly flexible and customizable, allowing users to create their own tests and integrate with other testing tools and frameworks.

While both frameworks are designed for testing Linux-based systems, the Linux avocado test framework provides a broad range of testing capabilities across different levels of the Linux stack, while the Linux op-test framework focuses specifically on automating hardware and firmware testing. The choice between the two depends on the specific testing needs and requirements of the user.

The Linux avocado test framework provides a plugin called "avocado-vt" which can be used to run tests that require a reboot between different test stages. This plugin enables the framework to run destructive tests, like kernel crash dump (kdump) testing, that require the system to be rebooted multiple times.

Similarly, the Linux op-test framework also provides support for testing scenarios that require system reboot. The framework includes a "reboot" library that allows users to reboot the system under test and wait for it to come back up before continuing with the test. This library can be used to test scenarios like kdump and fadump that require system reboot.

The community maintained avocado tests repository:

Avocado is a set of tools and libraries to help with automated testing. One can call it a test framework with benefits. Native tests are written in Python and they follow the unittest pattern, but any executable can serve as a test.

This repository contains a collection of miscellaneous tests and plugins for the Linux Avocado test framework that cover a wide range of functional, performance, and integration testing scenarios. The tests are designed to be modular and easy to use, and can be integrated with the Avocado test framework to extend its capabilities.

https://github.com/avocado-framework-tests/avocado-misc-tests

How to run avocado misc tests :

To run the Avocado Misc Tests, you first need to install the Linux Avocado test framework on your system. Once you have installed the framework, you can clone the Avocado Misc Tests repository from GitHub by running the following command in a terminal:

git clone https://github.com/avocado-framework-tests/avocado-misc-tests.git

git clone git@github.com:avocado-framework-tests/avocado-misc-tests.git

# git clone git@github.com:avocado-framework-tests/avocado-misc-tests.git
Cloning into 'avocado-misc-tests'...
remote: Enumerating objects: 18087, done.
remote: Counting objects: 100% (451/451), done.
remote: Compressing objects: 100% (239/239), done.
remote: Total 18087 (delta 242), reused 368 (delta 208), pack-reused 17636
Receiving objects: 100% (18087/18087), 6.15 MiB | 16.67 MiB/s, done.
Resolving deltas: 100% (11833/11833), done.
#

This repository is dedicated to host any tests written using the Avocado API. It is being initially populated with tests ported from autotest client tests repository, but it's not limited by that.

After cloning the repository, you can navigate to the avocado-misc-tests directory and run the tests using the avocado run command. For example, to run all the tests in the network category, you can run the following command:

cd avocado-misc-tests
avocado run network/

This will run all the tests in the network category. You can also run individual tests by specifying the path to the test file, like this:

avocado run network/test_network_ping.py

This will run the test_network_ping.py test in the network category.

Before running the tests, you may need to configure the Avocado framework to use the appropriate test runner, test environment, and plugins for your system. You can find more information on how to configure and use the Avocado framework in the official documentation:

https://avocado-framework.readthedocs.io/en/latest/

$ avocado run avocado-misc-tests/generic/stress.py
JOB ID : 0018adbc07c5d90d242dd6b341c87972b8f77a0b
JOB LOG : $HOME/avocado/job-results/job-2016-01-18T15.32-0018adb/job.log
TESTS : 1
(1/1) avocado-misc-tests/generic/stress.py:Stress.test: PASS (62.67 s)
RESULTS : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0
JOB HTML : $HOME/avocado/job-results/job-2016-01-18T15.32-0018adb/html/results.html
TIME : 62.67 s

There are a few more interesting things about the Avocado test framework and its usability and use cases:

Flexible test design: The Avocado test framework is designed to be flexible and adaptable to a wide range of testing scenarios. It supports various test types, including functional, integration, performance, and stress tests, and can be used to test software at different levels of abstraction, from system-level to individual components. Avocado also provides a wide range of plugins and interfaces for integrating with other tools and frameworks, making it easy to customize and extend its capabilities.
Easy to use: Avocado is designed to be easy to use, even for users who are new to testing or have limited programming experience. It uses a simple YAML-based syntax for defining tests and test plans, and provides a user-friendly command-line interface for running tests and viewing results. Avocado also includes detailed documentation and tutorials to help users get started quickly.
Scalability and distributed testing: Avocado supports distributed testing across multiple systems, making it easy to scale up testing to handle large workloads. It includes a built-in job scheduler for managing test execution across multiple systems, and can be integrated with various cloud-based services for running tests in the cloud.
Community support: Avocado is an open-source project maintained by a vibrant community of developers and testers. The community provides regular updates and bug fixes, and is actively involved in improving the usability and functionality of the framework. The Avocado community also provides support through various channels, including GitHub, mailing lists, and IRC.
Use cases: Avocado is used by various organizations and companies for testing different types of software, including operating systems, virtualization platforms, container platforms, and cloud services. It is particularly well-suited for testing complex, distributed systems that require a high degree of automation and scalability. Some of the organizations that use Avocado include Red Hat, IBM, Intel, and Huawei.

License

Except where otherwise indicated in a given source file, all original contributions to Avocado are licensed under the GNU General Public License version 2 (GPLv2) or any later version. By contributing you agree that these contributions are your own (or approved by your employer) and you grant a full, complete, irrevocable copyright license to all users and developers of the Avocado project, present and future, pursuant to the license of the project.

==============================================================

OpenPower Test Framework

===============================================================

The community maintained, op-tests repository.

https://github.com/open-power/op-test

git clone git@github.com:open-power/op-test.git

# git clone git@github.com:open-power/op-test.git
Cloning into 'op-test'...
remote: Enumerating objects: 8716, done.
remote: Counting objects: 100% (623/623), done.
remote: Compressing objects: 100% (275/275), done.
remote: Total 8716 (delta 416), reused 480 (delta 347), pack-reused 8093
Receiving objects: 100% (8716/8716), 23.89 MiB | 23.39 MiB/s, done.
Resolving deltas: 100% (6488/6488), done.
#

Pre-requisites for op-tests: [Please do not forget to do it on remote host]

1) yum install sshpass

2) pip3 install pexpect

3) echo "set enable-bracketed-paste off" > .inputrc ; export INPUTRC=$PWD/.inputrc

bind 'set enable-bracketed-paste off'

How to run testcase:

CASE 1:

./op-test -c machine.conf --run testcases.RunHostTest --host-cmd ls

Testcase: https://github.com/open-power/op-test/blob/master/testcases/RunHostTest.py

where machine.conf :

[op-test]
bmc_type=OpenBMC /EBMC_PHYP/FSP_PHYP
bmc_username=abc
bmc_ip=w39
bmc_username=root
bmc_password=0penBmc
hmc_ip=a.b.c.d
hmc_username=hmcuser
hmc_password=hmcpasswd123
host_ip=x.y.x.k
host_user=hostuser
host_password=hostpasswd123
system_name=power10
lpar_name=lpar_name_1
lpar_prof=default_profile

CASE2:

# cat ebmc-type.conf
[op-test]
bmc_type=EBMC_PHYP
bmc_username=service
bmc_password=0penBmc!
bmc_ip=A.B.C.D
hmc_ip=myhmc.com
hmc_username=hmcuser
hmc_password=hmcpassword
system_name=myhost
lpar_name=myhost-lpar7-sachinpb
lpar_prof=default_profile
lpar_gateway=A.B.C.1
lpar_subnet=255.255.255.0
lpar_hostname=myhost-lpar7.com
lpar_mac=CD:EF:GH:IJ:KL
host_ip=X.Y.Z.U
host_user=user
host_password=userpassword
dump_server_ip=
dump_server_pw=
dump_path=/mnt
linux_src_dir=
kernel_image=
initrd_image=
num_of_iterations=

./op-test -c ebmc-type.conf --run testcases.RunHostTest --host-cmd-file cmd.conf

where :

# cat cmd-script.sh
echo "welcome SACHIN P B"
hostname
uptime
date
#

OUTPUT: should have run the above script as shown below:

# ./op-test -c ebmc-type.conf --run testcases.RunHostTest --host-cmd-file cmd-script.sh
Logs in: /root/fix_kdump_FalsePositives/op-test/test-reports/test-run-20230829095648
2023-08-29 09:56:48,758:op-test:setUpLoggerFile:INFO:Preparing to set location of Log File to /root/fix_kdump_FalsePositives/op-test/test-reports/test-run-20230829095648/20230829145648758035.main.log
2023-08-29 09:56:48,758:op-test:setUpLoggerFile:INFO:Log file: /root/fix_kdump_FalsePositives/op-test/test-reports/test-run-20230829095648/20230829145648758035.main.log
2023-08-29 09:56:48,758:op-test:setUpLoggerDebugFile:INFO:Preparing to set location of Debug Log File to /root/fix_kdump_FalsePositives/op-test/test-reports/test-run-20230829095648/20230829145648758291.debug.log
[console-expect]#which whoami && whoami
/usr/bin/whoami
root
[console-expect]#echo $?
echo $?
0
[console-expect]#echo "welcome SACHIN P B"
echo "welcome SACHIN P B"
welcome SACHIN P B
[console-expect]#echo $?
echo $?
0
[console-expect]#hostname
hostname
myhost.com
[console-expect]#echo $?
echo $?
0
[console-expect]#uptime
uptime
09:58:15 up 7:50, 2 users, load average: 0.08, 0.02, 0.01
[console-expect]#echo $?
echo $?
0
[console-expect]#date
date
Tue Aug 29 09:58:15 CDT 2023
[console-expect]#echo $?
echo $?
0
ok
Ran 1 test in 7.510s
OK
2023-08-29 09:58:17,787:op-test:<module>:INFO:Exit with Result errors="0" and failures="0"

------------------------------------------------------------------------------------------------------------------

Example 2:

python3 op-test -c machine.conf --run testcases.PowerNVDump.KernelCrash_disable_radix

python3 op-test -c machine.conf --run testcases.PowerNVDump.KernelCrash_XIVE_off

python3 op-test --run-suite osdump-suite -c CR-machine.conf

python3 op-test --run testcases.RunHostTest -c CR-Machine.conf --host-cmd-file CR-Machine_command.conf --host-cmd-timeout 286400

Example 3: [Testcase file : PowerNVDump.py]

1) How to execute ONLY kdump tests :

python3 op-test --run-suite osdumpkdumpsuite -c machine.conf

2) How to execute ONLY Fadump tests :

python3 op-test --run-suite osdumpfadumpsuite -c machine.conf

3) How to run sanity tests includes basic kdump and fadump

python3 op-test --run-suite osdumpsanitysuite -c machine.conf

Example 4: [ Testcase file : OpTestKexec.py]

How to run the kexec test in op-tests framework

./op-test -c machine.conf --run testcases.OpTestKexec.OpTestKexec.test_load_unload

./op-test -c machine.conf --run testcases.OpTestKexec.OpTestKexec.test_load_and_exec

./op-test -c machine.conf --run testcases.OpTestKexec.OpTestKexec.test_syscall_load_and_exec

./op-test -c machine.conf --run testcases.OpTestKexec.OpTestKexec.test_kexec_unsigned_kernel

./op-test -c machine.conf --run testcases.OpTestKexec.OpTestKexec.test_kexec_in_loop

where machine.conf :

[op-test]
bmc_type=FSP_PHYP
bmc_username=bmcadmin
bmc_password=**********
bmc_ip=ABC-fsp.america.com
hmc_ip=HMC1.america.com
hmc_username=adminhmc
hmc_password=********
system_name=System123
lpar_name=system123-lp4_SACHINPB
lpar_prof=default_profile
lpar_gateway=9.x.y.1
lpar_subnet=255.255.255.0
lpar_hostname=System123-lp4.com
lpar_mac=A:B:C:D
host_ip=9.X.y.z
host_user=root
host_password=**********
dump_server_ip=9.m.n.c
dump_server_pw=**********
dump_path=/mnt
linux_src_dir=
kernel_image=
initrd_image=
num_of_iterations=100

===============================================================

How to analyze op-test output :

Traverse this directory path: op-test/test-reports/test-run-$date

There are are 3 log files to investigate the test failure or life-cycle of testsuite

# pwd

/root/fix_kdump_FalsePositives/op-test/test-reports/test-run-$DATE

# ls -1

$DATE.log

$DATE.main.log

$DATE.debug.log

1 ) $DATE.log ====> You could see console related commands and outputs

For example :

lssyscfg -m Serverx0 -r lpar --filter lpar_names=Serverx0-lp6 -F state

lsrefcode -m Serverx0 -r lpar --filter lpar_names=Serverx0-lp6 -F refcode

chsysstate -m Serverx0 -r lpar -n Server-lp6 -o shutdown --immed

2) $DATE.main.log

If you add any statements in log.info , that will be logged in this file

log.info("=============== Testing kdump/fadump over ssh ===============")

3) $DATE.debug.log

If you add any comments with log.debug, that will be logged in this file

log.debug("SACHIN_DEBUG: In loop1")

============================================================

Listed are some interesting things about the op-test framework and its use cases:

Testing hardware systems: The op-test framework is designed for testing hardware systems, particularly servers, using the OpenPOWER architecture. It includes a wide range of tests that cover different aspects of hardware functionality, such as power management, CPU, memory, and I/O.
Integration with OpenBMC: The op-test framework integrates with the OpenBMC project, an open-source implementation of the Baseboard Management Controller (BMC) firmware that provides out-of-band management capabilities for servers. This integration allows users to control and monitor server hardware using the OpenBMC interface, and to run tests on the hardware using the op-test framework.
UEFI and firmware testing: The op-test framework includes support for testing UEFI firmware and other low-level system components, such as the Hostboot bootloader. This allows users to test the system firmware and ensure that it is functioning correctly.
Easy to use: The op-test framework is designed to be easy to use, even for users who are not familiar with hardware testing. It uses a simple command-line interface and provides detailed documentation and tutorials to help users get started quickly.
Scalability: The op-test framework is designed to be scalable and can be used to test multiple systems in parallel. This makes it suitable for testing large server farms and data centers.
Community support: The op-test framework is an open-source project with an active community of developers and testers. The community provides regular updates and bug fixes, and is actively involved in improving the usability and functionality of the framework. The op-test community also provides support through various channels, including GitHub, mailing lists, and IRC.
Use cases: The op-test framework is used by various organizations and companies for testing hardware systems, including server manufacturers, data center operators, and cloud service providers. Some of the organizations that use the op-test framework include IBM, Google, and Rackspace.

How to contribute to op-test framework open source community :

1) mkdir kdump_xive_off_check

2) cd kdump_xive_off_check

3) git clone git@github.com:SACHIN-PB/op-test.git

Fork the repository from master : https://github.com/open-power/op-test

NOTE: In Git, forking a repository means creating a copy of the original repository into your own GitHub account.

This is typically done when you want to contribute to an open-source project or collaborate with other developers.

4) git config user.email

5) git config user.name

NOTE: To get proper username and email . Please do the following setup at /root directory

# cat .gitconfig

[user]

email = sachin@linux.XYZ.com

name = Sachin P B

6) git branch

7) git remote -v

origin git@github.com:SACHIN-PB/op-test.git (fetch)

origin git@github.com:SACHIN-PB/op-test.git (push)

8) git remote add upstream git@github.com:open-power/op-test.git

9) git remote -v

origin git@github.com:SACHIN-PB/op-test.git (fetch)

origin git@github.com:SACHIN-PB/op-test.git (push)

upstream git@github.com:open-power/op-test.git (fetch)

upstream git@github.com:open-power/op-test.git (push)

10) git checkout -b "kdump_xive_off_check"

11) git branch

12) vi testcases/PowerNVDump.py

13) git diff

14) git status

15) git add testcases/PowerNVDump.py

16) git status

17) git commit -s

18) git branch

19) git push origin kdump_xive_off_check

Enumerating objects: 7, done.

Counting objects: 100% (7/7), done.

Delta compression using up to 16 threads

Compressing objects: 100% (4/4), done.

Writing objects: 100% (4/4), 880 bytes | 880.00 KiB/s, done.

Total 4 (delta 3), reused 0 (delta 0), pack-reused 0

remote: Resolving deltas: 100% (3/3), completed with 3 local objects.

remote:

remote: Create a pull request for 'kdump_xive_off_check' on GitHub by visiting:

remote: https://github.com/SACHIN-PB/op-test/pull/new/kdump_xive_off_check

remote:

To github.com:SACHIN-PB/op-test.git

* [new branch] kdump_xive_off_check -> kdump_xive_off_check

20) Create PR using the link created at step 19 and request for the review

Example https://github.com/open-power/op-test/pull/7XYZ4:

21) You can update your PR by running these commands

git commit --amend
git push -f origin kdump_xive_off_check

======================

Reference:

1) https://github.com/open-power/op-test/blob/master/testcases/RunHostTest.py

2) https://github.com/avocado-framework-tests/avocado-misc-tests

3) https://avocado-framework.readthedocs.io/en/latest/

LINUX & HPC : Advanced Large Scale Computing at a Glance !

Wednesday, April 19, 2023

Kubernetes - decommissioning a node from the cluster

Thursday, April 13, 2023

IBM Spectrum Symphony and LSF with Apache Hadoop

Tuesday, April 4, 2023

Linux Test Harness : avocado and op-test framework

Popular Posts

Translate