tag:blogger.com,1999:blog-83064005249456207962024-03-28T10:39:13.851+05:30 LINUX & HPC : Advanced Large Scale Computing at a Glance !Newbie's Guide for Spectrum LSF, Message Passing Interface(MPI), Kubernetes, Big Data applications ,Docker, Jenkins, Spark, Hadoop, Quantum Computing, Linux Operating system and features, git, DevOps...........! Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.comBlogger59125tag:blogger.com,1999:blog-8306400524945620796.post-84978419745226959772023-09-08T10:38:00.002+05:302023-09-08T11:41:07.721+05:30Multipath setup in Linux<div><div><div>A multipath setup in Linux is a configuration that allows multiple physical paths (usually represented by multiple physical storage devices or network connections) to be used to access a single logical device or storage target. The primary goal of multipath is to enhance redundancy and fault tolerance while providing load balancing and improved performance. Multipath is commonly used in storage area networks (SANs) and environments where high availability and reliability are critical.</div><div><br /></div><div>Here are the key components and concepts of a multipath setup in Linux:</div><div><br /></div><div>Multipath Devices (Multipath): In a multipath setup, there is a logical device known as a multipath device (often referred to as a multipath or mpath). This logical device represents a single storage target, such as a disk or LUN (Logical Unit Number), even though it is accessible through multiple physical paths.</div><div><br /></div><div>Physical Paths: Physical paths are the actual connections or channels through which the storage target is accessible. These paths can be physical SCSI buses, Fibre Channel links, iSCSI connections, or any other transport mechanism. Each path is associated with a unique identifier, typically called a World Wide Name (WWN), device name, or other similar identifiers.</div><div><br /></div><div>Path Management: The multipath software in Linux (such as multipathd and multipath-tools) manages the physical paths and ensures that they are utilized effectively. It monitors the status of the paths and makes decisions about which path to use for I/O operations. It can also detect and respond to path failures or changes in path availability.</div><div><br /></div><div>Load Balancing: Multipath configurations often include load balancing mechanisms that distribute I/O requests across the available paths. This helps improve performance by distributing the workload and preventing one path from becoming a bottleneck.</div><div><br /></div><div>Redundancy and Failover: Multipath setups provide redundancy and failover capabilities. If one path fails due to hardware or network issues, the system can automatically switch to an alternate path without interrupting I/O operations. This enhances system reliability and availability.</div><div><br /></div><div>Device Mapper (DM-Multipath): In Linux, the Device Mapper subsystem is commonly used to manage multipath devices. DM-Multipath is a kernel component that works with the multipath software to create and manage multipath devices. It presents a single device to the operating system, which is actually a combination of the multiple physical paths.</div><div><br /></div><div>Configuration Files: To set up multipath in Linux, administrators configure multipath settings using configuration files. 
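<div><br /></div><div>For illustration, a minimal version of such a file might look like the sketch below. This is a hedged example only: the WWID, alias, and default options are placeholder assumptions, not values taken from the system described later in this post.</div>
<pre>
# /etc/multipath.conf -- minimal illustrative sketch
defaults {
    user_friendly_names yes
    find_multipaths     yes
}

multipaths {
    multipath {
        wwid   3600508b400105e210000900000490000
        alias  mpatha
    }
}
</pre>
<div>On RHEL-family systems, "mpathconf --enable" can generate a starting configuration, and "systemctl restart multipathd" followed by "multipath -ll" applies and verifies it (command names may vary by distribution).</div><div><br /></div>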
The main configuration file is typically located at /etc/multipath.conf (or a similar location) and defines the behavior of the multipath devices.</div><div><br /></div><div>Multipath Tools: The multipath tools package (multipath-tools or similar) includes utilities such as multipath and multipathd that are used to manage and configure multipath devices. These tools help monitor path status, configure load balancing policies, and perform other administrative tasks related to multipathing.</div><div>-----------------------------------</div></div><div><br /></div><div>For Example : This system is using multipath and LVM for storage management to provide redundancy and flexibility in managing storage devices. Both sda and sdb are part of the multipath configuration, and their partitions are managed using LVM. This setup is commonly used in enterprise environments for high availability and fault tolerance</div><div><br /></div><div><br /></div><div>There are three partitions on the multipath device mpatha (which represents both sda and sdb due to multipathing). </div></div><div><br /></div><div><br /></div><div>The lsblk command is used to list block devices on this system, displaying information about disks, partitions, and their relationships. Let's break down the lsblk output:</div><div>----------------------------------------------------------------------------------------</div><div><br /></div><div>#lsblk</div><div>NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT</div><div>sda 8:0 0 80G 0 disk</div><div>└─mpatha 253:0 0 80G 0 mpath</div><div> ├─mpatha1 253:1 0 4M 0 part</div><div> ├─mpatha2 253:2 0 1G 0 part /boot</div><div> └─mpatha3 253:3 0 79G 0 part</div><div> ├─rhel_myhost-root 253:4 0 47.7G 0 lvm /</div><div> ├─rhel_myhost-swap 253:5 0 8G 0 lvm [SWAP]</div><div> └─rhel_myhost-home 253:6 0 23.3G 0 lvm /home</div><div>sdb 8:16 0 80G 0 disk</div><div>└─mpatha 253:0 0 80G 0 mpath</div><div> ├─mpatha1 253:1 0 4M 0 part</div><div> ├─mpatha2 253:2 0 1G 0 part /boot</div><div> └─mpatha3 253:3 0 79G 0 part</div><div> ├─rhel_myhost-root 253:4 0 47.7G 0 lvm /</div><div> ├─rhel_myhost-swap 253:5 0 8G 0 lvm [SWAP]</div><div> └─rhel_myhost-home 253:6 0 23.3G 0 lvm /home</div><div>#</div><div><br /></div><div>Here's a breakdown of the information in each column:</div><div><br /></div><div>NAME: This column shows the name of the block device.</div><div>MAJ:MIN: Major and minor device numbers that uniquely identify the device to the operating system.</div><div>RM: This indicates whether the device is removable (1 for yes, 0 for no).</div><div>SIZE: The size of the device, often in gigabytes (G) or megabytes (M).</div><div>RO: This indicates whether the device is read-only (1 for yes, 0 for no).</div><div>TYPE: The type of device, which can be "disk" for physical disks or "part" for partitions. In this case, you also see "mpath" and "lvm," which are related to storage management.</div><div>MOUNTPOINT: The mount point where the device is currently mounted. 
If it's not mounted, this field will be empty.</div><div><br /></div><div>Now, let's interpret the information based on the provided output:</div><div><br /></div><div>There are two disk devices: sda and sdb.</div><div>Both sda and sdb are part of a multipath configuration, as indicated by the "mpath" type.</div><div>Each disk (sda and sdb) has three partitions (mpatha1, mpatha2, and mpatha3), and each of these partitions is used in an LVM (Logical Volume Management) setup.</div><div>The /boot partition (mpatha2) is mounted on both sda and sdb, and it contains the boot files.</div><div>The root (/) partition (rhel_myhost-root) is mounted on both sda and sdb, and it is the root filesystem.</div><div>The swap partition (rhel_myhost-swap) is also mounted on both sda and sdb and is used for swap space.</div><div>The /home partition (rhel_myhost-home) is mounted on both sda and sdb and is used for user home directories.</div><div><br /></div><div>Here's what each of these partitions is typically used for:</div><div><br /></div><div>mpatha1: This partition appears to be very small (only 4MB), and it is often used for storing bootloader-related files. Specifically, it might contain the GRUB bootloader's core files or other boot-related data. It's a common practice to allocate a small partition for bootloader files to ensure that they are easily accessible and less likely to be affected by changes or issues in the rest of the filesystem. A small partition like this is often sufficient for storing the essential boot files.</div><div><br /></div><div>mpatha2: This partition is mounted as /boot, and it contains the kernel and initial ramdisk files needed for booting the system. /boot typically holds the Linux kernel, GRUB configuration files, and other boot-related data. Having a separate /boot partition is a common practice, especially in systems that use LVM or other complex storage configurations. It ensures that essential boot files are easily accessible and are less prone to issues that might affect other partitions.</div><div><br /></div><div>mpatha3: This partition appears to be the largest and is not directly mounted as part of the root filesystem (/). Instead, it seems to be part of an LVM (Logical Volume Management) setup. It is divided into multiple logical volumes (rhel_myhost-root, rhel_myhost-swap, and rhel_myhost-home) that are used for various purposes:</div><div><br /></div><div>rhel_myhost-root: This is the root filesystem (/) where most of the operating system and software are installed.</div><div><br /></div><div>rhel_myhost-swap: This logical volume is used as swap space, which is used for virtual memory and can help improve system performance.</div><div><br /></div><div>rhel_myhost-home: This logical volume is typically used for user home directories. User data and files are stored in the /home directory, which is often mounted on a separate filesystem to isolate user data from the root filesystem.</div><div><br /></div><div>So, mpatha1 is likely used for bootloader files, mpatha2 is the /boot partition containing boot-related files, and mpatha3 represents an LVM setup with separate logical volumes for the root filesystem, swap space, and user home directories. This kind of partitioning and storage management allows for flexibility, scalability, and better system maintenance</div><div><br /></div><div><div>You can map the UUID specified in the /etc/fstab file to the corresponding device name, such as /dev/sda1, /dev/sdb1, or /dev/mpatha1. 
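<div><br /></div><div>As a quick, hedged sketch of that mapping (it complements the fuller blkid listing shown below; the UUID value here is only a placeholder):</div>
<pre>
# Show UUIDs next to device names and mount points
lsblk -f

# Resolve a specific UUID (for example, one copied from /etc/fstab) to a device node
blkid -U "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

# Or ask which device currently backs a mount point such as /boot
findmnt /boot
</pre>
<div><br /></div>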
The UUID (Universally Unique Identifier) is a unique identifier assigned to each filesystem or partition and is a more reliable way to identify devices than device names, which can change if hardware configurations are altered.</div><div><br /></div><div>To map a UUID to the corresponding device name, you can use the blkid command. </div></div><div><br /></div><div><br /></div><div><div>[root@myhost ~]# blkid</div><div>/dev/mapper/rhel_myhost-root: UUID="XXXXXXXXXXXXXXX" BLOCK_SIZE="512" TYPE="xfs"</div><div>/dev/mapper/mpatha3: UUID="XXXXXXXXX" TYPE="LVM2_member" PARTUUID="XXXXXXXX"</div><div>/dev/sda: PTUUID="XXXXXXXX" PTTYPE="dos"</div><div>/dev/mapper/mpatha: PTUUID="abc" PTTYPE="dos"</div><div>/dev/sdb: PTUUID="XXXXXXX" PTTYPE="dos"</div><div>/dev/mapper/mpatha1: PARTUUID="abc-01"</div><div>/dev/mapper/mpatha2: UUID="XXXXXXXXXXXXXXXXXXXX" BLOCK_SIZE="512" TYPE="xfs" PARTUUID="XXXXX"</div><div>/dev/mapper/rhel_myhost-swap: UUID="abc123c" TYPE="swap"</div><div>/dev/mapper/rhel_myhost-home: UUID="xyz123" BLOCK_SIZE="512" TYPE="xfs"</div><div>[root@myhost ~]#</div><div>-----------------</div><div><br /></div><div>[root@myhost ~]# cat /boot/grub2/device.map</div><div># this device map was generated by anaconda</div><div>(hd0) /dev/mapper/mpatha</div><div>[root@myhost ~]#</div><div><br /></div><div><br /></div><div>--------------</div><div><br /></div><div><br /></div><div>[root@myhost ~]# cat /etc/fstab</div><div><br /></div><div>#</div><div># /etc/fstab</div><div>/dev/mapper/rhel_myhost-root / xfs defaults 0 0</div><div>UUID=XXXXXXXXXXXXXXXXXXXXXXX /boot xfs defaults 0 0</div><div>/dev/mapper/rhel_myhost-home /home xfs defaults 0 0</div><div>/dev/mapper/rhel_myhost-swap none swap defaults 0 0</div><div>[root@myhost ~]#</div></div><div><br /></div><div>A multipath setup in Linux provides redundancy, load balancing, and fault tolerance for storage devices, ensuring that data remains accessible even if one path or connection fails. This technology is crucial in enterprise environments where continuous access to data is essential for operations.</div><div><br /></div><div>----------------------------</div><div><br /></div><div><div>The grub2-install command is a utility used in Linux to install the GRUB (Grand Unified Bootloader) bootloader onto a device, typically a hard disk or a partition.</div><div><br /></div><div>The primary purpose of the grub2-install command is to install the GRUB bootloader on a specific device. You specify the target device as an argument to the command. </div><div><br /></div><div>For example : grub2-install /dev/sda</div><div><br /></div><div>In this example, the GRUB bootloader is installed on the MBR (Master Boot Record) of /dev/sda, which is typically the primary boot device.</div><div><br /></div><div><br /></div><div>Boot Device Configuration:</div><div>When GRUB is installed on a device, it configures the bootloader to locate and load the kernel and initial ramdisk (initrd) from the designated boot device or partition. It also stores configuration information, such as the location of the kernel and the root filesystem.</div><div><br /></div><div>Device Map Configuration:</div><div>GRUB maintains a device map that associates BIOS drive numbers (e.g., (hd0), (hd1)) with actual device names (e.g., /dev/sda, /dev/sdb). 
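<div><br /></div><div>As a hedged illustration of the commands discussed in this section (the target device follows the multipath example above, the options are described further below, and exact flags can differ between distributions and firmware types):</div>
<pre>
# BIOS/MBR install onto the multipath device that device.map records as (hd0)
grub2-install --target=i386-pc /dev/mapper/mpatha

# Regenerate the GRUB configuration after kernel or disk-layout changes
grub2-mkconfig -o /boot/grub2/grub.cfg
</pre>
<div><br /></div>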
The grub2-install command updates or creates this device map, ensuring that GRUB can correctly identify the boot device.</div><div><br /></div><div>Bootloader Configuration File:</div><div>GRUB bootloader configurations are specified in the /boot/grub2/grub.cfg (or similar) file. This configuration file is automatically generated by GRUB utilities and scripts based on the system's configuration, such as the kernel and initrd locations, boot options, and menu entries.</div><div><br /></div><div>Boot Menu:</div><div>GRUB provides a boot menu during system startup, allowing users to select from available kernels and boot options. The grub2-install command ensures that the necessary components for this boot menu are installed and configured correctly.</div><div><br /></div><div>Updating GRUB Configuration:</div><div>In addition to installing GRUB, the grub2-install command also updates the bootloader's configuration to reflect changes in the system's disk layout or partitioning scheme. This includes updating device names and paths if necessary.</div><div><br /></div><div>EFI and UEFI Support:</div><div>The behavior of grub2-install can differ depending on whether the system uses BIOS or UEFI (Unified Extensible Firmware Interface) for booting. For UEFI systems, the grub2-install command installs the UEFI version of GRUB and configures it accordingly.</div><div><br /></div><div>Additional Options:</div><div>The grub2-install command supports various options to specify installation details, such as the target architecture, firmware type (BIOS or UEFI), and more. You can use the --target, --boot-directory, and other options to customize the installation.</div></div><div><br /></div><div>----------------------------------------------------------</div><div><br /></div><div><div>The PReP boot partition is a specialized partition used on PowerPC systems that follow the PReP boot standard to store firmware-specific bootloader and boot-related files. 
On the other hand, the /boot partition is a common convention on Linux systems, including PowerPC systems, to store kernel, initramfs, and bootloader configuration files, but it is not tied to any specific firmware standard and is used across various hardware architectures.</div><div><br /></div><div>PReP Boot Partition:</div><div>The PReP boot partition is a specific partition type used in the context of the PReP boot standard, which is a firmware standard for booting PowerPC-based systems.</div><div>Its primary purpose is to store the bootloader and boot-related information required to initiate the boot process on PowerPC systems adhering to the PReP standard.</div><div>It typically contains essential firmware boot files, such as Open Firmware or IEEE 1275-compliant firmware, which are necessary to start the system.</div><div><br /></div><div>/boot Partition:</div><div>The /boot partition is a common convention used on various Linux distributions, including those running on PowerPC systems.</div><div>Its purpose is to store the kernel, initramfs (initial RAM disk), bootloader configuration files, and other files required for the early stages of the boot process.</div><div>The /boot partition is part of the Linux filesystem structure and is used by the Linux bootloader (e.g., GRUB) to locate and load the kernel and initramfs during the boot process.</div><div>Firmware Dependency:</div><div><br /></div><div>PReP Boot Partition:</div><div>The PReP boot partition's usage is closely tied to the firmware standard it follows, such as Open Firmware or IEEE 1275-compliant firmware. The firmware is responsible for loading the bootloader from this partition.</div><div>It may also contain firmware-specific files and configurations.</div><div><br /></div><div>/boot Partition:</div><div>The /boot partition is not firmware-dependent and is part of the Linux filesystem. It is managed by the Linux bootloader (e.g., GRUB) and the operating system itself.</div><div>The bootloader reads the kernel and initramfs from the /boot partition during the boot process, and this partition is independent of the system's firmware.</div><div>Common Usage:</div><div><br /></div><div>PReP Boot Partition:</div><div>Commonly used on older PowerPC-based systems that adhere to the PReP standard.</div><div>It's specific to the boot process defined by the firmware standard used on these systems</div><div><br /></div><div>/boot Partition:</div><div>Widely used on various Linux distributions, including those on PowerPC systems.</div><div>It's part of the standard Linux filesystem structure and is used on many different hardware platforms.</div></div><div><br /></div><div><br /></div>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-6925086525774503612023-09-04T17:51:00.007+05:302023-09-04T17:54:06.787+05:30Openstack Framework and components <p><span style="text-align: justify;">OpenStack is an open-source cloud computing platform that provides a set of software tools and components for building and managing public and private clouds. It enables organizations to create and manage cloud infrastructure services, including compute, storage, networking, and more. 
OpenStack is designed to be highly flexible, scalable, and customizable, making it a popular choice for building cloud solutions.</span></p><p style="text-align: justify;">OpenStack is an open-source cloud computing platform that was initially launched in July 2010 as a joint project by Rackspace Hosting and NASA. Since then, it has grown into a vibrant open-source community with contributions from a wide range of organizations and individuals. Here's a brief history of OpenStack and an overview of its main components:</p><p style="text-align: justify;"><b>OpenStack History:</b></p><p style="text-align: justify;">Launch (2010): OpenStack was publicly launched in July 2010 with the release of the first two core projects, Nova (compute) and Swift (object storage). It was created to address the need for an open and flexible cloud computing platform.</p><p style="text-align: justify;">Expanding Community (2011-2012): The OpenStack community quickly expanded, with numerous companies joining the project. The community released new versions of OpenStack, including Diablo, Essex, and Folsom, each with additional core and supporting projects.</p><p style="text-align: justify;">Foundation Establishment (2012): In September 2012, the OpenStack Foundation was established to oversee the project's development and ensure its long-term governance as an open-source project.</p><p style="text-align: justify;">Maturing Ecosystem (2013-2015): OpenStack continued to evolve, with new releases like Grizzly, Havana, Icehouse, and Juno. During this period, more projects were added to the ecosystem, covering areas such as networking (Neutron), block storage (Cinder), and identity (Keystone).</p><p style="text-align: justify;">Enterprise Adoption (2016-2017): OpenStack gained significant traction among enterprises and service providers. Projects like Heat (orchestration) and Magnum (containers) were introduced to support cloud automation and container orchestration.</p><p style="text-align: justify;">Continued Growth (2018-Present): OpenStack has continued to grow and evolve, with new projects and features being added regularly. The community releases new versions of OpenStack every six months, with each version introducing enhancements and improvements.</p><p style="text-align: justify;"><b>OpenStack Releases: </b>At the time of writing, the current OpenStack release is "Xena". Austin was the first OpenStack release and is now obsolete.
For more details check the links below:</p><p style="text-align: justify;"><a href="https://docs.openstack.org/puppet-openstack-guide/latest/install/releases.html">https://docs.openstack.org/puppet-openstack-guide/latest/install/releases.html</a></p><p style="text-align: justify;"><a href="https://releases.openstack.org/">https://releases.openstack.org/</a></p><p style="text-align: justify;"><span style="font-size: x-small;"><span>Austin (2010): The first official release of OpenStack, code-named "Austin."</span><br />Bexar (2011): The second release, code-named "Bexar."<br />Cactus (2011): The third release, code-named "Cactus."<br />Diablo (2011): The fourth release, code-named "Diablo."<br />Essex (2012): The fifth release, code-named "Essex."<br />Folsom (2012): The sixth release, code-named "Folsom."<br />Grizzly (2013): The seventh release, code-named "Grizzly."<br />Havana (2013): The eighth release, code-named "Havana."<br />Icehouse (2014): The ninth release, code-named "Icehouse."<br />Juno (2014): The tenth release, code-named "Juno."<br />Kilo (2015): The eleventh release, code-named "Kilo."<br />Liberty (2015): The twelfth release, code-named "Liberty."<br />Mitaka (2016): The thirteenth release, code-named "Mitaka."<br />Newton (2016): The fourteenth release, code-named "Newton."<br />Ocata (2017): The fifteenth release, code-named "Ocata."<br />Pike (2017): The sixteenth release, code-named "Pike."<br />Queens (2018): The seventeenth release, code-named "Queens."<br />Rocky (2018): The eighteenth release, code-named "Rocky."<br />Stein (2019): The nineteenth release, code-named "Stein."<br />Train (2019): The twentieth release, code-named "Train."<br />Ussuri (2020): The twenty-first release, code-named "Ussuri."<br />Victoria (2020): The twenty-second release, code-named "Victoria."<br />Wallaby (2021): The twenty-third release, code-named "Wallaby."<br />Xena (2021): The twenty-fourth release, code-named "Xena."<br />Yoga (2022): The twenty-fifth release, code-named "Yoga."<br />Zuul (2022): The twenty-sixth release, code-named "Zuul."</span></p><p style="text-align: justify;">OpenStack's modular architecture allows organizations to choose the components that best fit their cloud computing needs, making it a versatile and customizable platform for building private, public, and hybrid clouds. OpenStack is built using a modular architecture, where each component provides a specific cloud service. These components can be combined to create a custom cloud infrastructure tailored to the organization's needs. OpenStack is composed of multiple projects, each providing a specific cloud service. </p><p></p><ol style="text-align: left;"><li>Multi-Tenancy: OpenStack supports multi-tenancy, allowing organizations to create isolated environments within the cloud infrastructure. This means that multiple users or projects can share the same cloud while maintaining security and resource separation.</li><li>Open Source: OpenStack is released under an open-source license, making it freely available for anyone to use, modify, and contribute to. This open nature has led to a vibrant community of developers and users collaborating on its development.</li><li>Integration and Compatibility: OpenStack is designed to integrate with various virtualization technologies, hardware vendors, and third-party tools. 
It can be used with different hypervisors, storage systems, and networking solutions.</li><li>Private and Public Clouds: Organizations can use OpenStack to create private clouds within their data centers or deploy public cloud services to offer cloud resources to external customers or users.</li><li>Hybrid Clouds: OpenStack can be part of a hybrid cloud strategy, where organizations combine private and public cloud resources to achieve flexibility and scalability</li></ol><p></p><p>Here are some of the main components:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSP2kqLqxtyPWASJz6MNn1-GQUUNGjAXWK_d1EsQ2nLzOpxy-obt36zDoz1b5gCnUKxa4eKHJToV5agTasciQ4K3_OlckiLjt6cclz-b7moTT6Xe_xws5_BCkPAUAHpM0BGHyQ_RKEUdZPxSZODxnYbjhi91fApf_GCn2X5gd-lO89j4VH4dngkDav5s_m/s1920/openstack3.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1080" data-original-width="1920" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSP2kqLqxtyPWASJz6MNn1-GQUUNGjAXWK_d1EsQ2nLzOpxy-obt36zDoz1b5gCnUKxa4eKHJToV5agTasciQ4K3_OlckiLjt6cclz-b7moTT6Xe_xws5_BCkPAUAHpM0BGHyQ_RKEUdZPxSZODxnYbjhi91fApf_GCn2X5gd-lO89j4VH4dngkDav5s_m/w640-h360/openstack3.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://en.wikipedia.org/wiki/OpenStack#/media/File:Openstack-map-v20221001.jpg" target="_blank">source</a></td></tr></tbody></table><p></p><ol style="text-align: left;"><li>Nova (Compute): Manages and orchestrates virtual machines (instances) on hypervisors. It provides features for creating, scheduling, and managing VMs.</li><li>Swift (Object Storage): Offers scalable and durable object storage services for storing and retrieving data, including large files and unstructured data.</li><li>Cinder (Block Storage): Manages block storage volumes that can be attached to instances. 
It provides persistent storage for VMs.</li><li>Neutron (Networking): Handles networking services, including the creation and management of networks, subnets, routers, and security groups.</li><li>Keystone (Identity): Manages identity and authentication services, including user management, role-based access control (RBAC), and token authentication.</li><li>Glance (Image Service): Stores and manages virtual machine images (VM snapshots) that can be used to create instances.</li><li>Horizon (Dashboard): A web-based user interface that provides a graphical way to manage and monitor OpenStack resources.</li><li>Heat (Orchestration): Provides orchestration and automation services for defining and managing cloud application stacks.</li><li>Ceilometer (Telemetry): Collects telemetry data, including usage and performance statistics, for billing, monitoring, and auditing.</li><li>Trove (Database-as-a-Service): Manages database instances as a service, making it easier to provision and manage databases.</li><li>Ironic (Bare Metal): Manages bare-metal servers as a service, allowing users to provision physical machines in the same way as virtual machines.</li><li>Zaqar (Messaging and Queuing): Provides messaging and queuing services for distributed applications.</li><li>Magnum (Container Orchestration): Orchestrates container platforms like Kubernetes to manage containerized applications.</li></ol><p></p><p><br /></p><p style="text-align: justify;">Postman provides a user-friendly interface for building and sending API requests, inspecting responses, and automating API testing. Internally, Postman is a comprehensive software tool that facilitates the process of sending HTTP requests to APIs, receiving responses, and performing various tasks related to API testing, monitoring, and development. It operates through a combination of user interactions and underlying components. Postman simplifies the process of sending HTTP requests to APIs by providing a user-friendly interface, generating HTTP requests based on user input, and enabling users to work with API responses. It also supports more advanced features such as scripting, automation, and test execution for comprehensive API testing and monitoring. It's widely used by developers to</p><p></p><ol style="text-align: left;"><li>Test APIs: Developers can use Postman to send requests to APIs and receive responses, making it easy to test how the API functions.</li><li>Automate Tests: Postman allows you to create and automate test scripts to ensure that your APIs are working as expected. You can set up tests to validate the response data, headers, and more.</li><li>Document APIs: You can use Postman to generate API documentation, which is useful for sharing information about how to use an API with others.</li><li>Monitor APIs: Postman can be used to monitor APIs and receive alerts when issues or errors occur.</li><li>Mock Servers: Postman provides the ability to create mock servers, which can simulate an API's behavior without the actual backend being implemented yet.</li></ol><p></p><p> Here's how Postman is involved and invoked internally when working with the examples provided:</p><p>1) User Interface (UI): Postman provides a user-friendly graphical interface where users can create, manage, and send API requests. 
Users interact with this UI to input API details, such as request URLs, headers, parameters, and request bodies.</p><p>2) Request Configuration: When you create a request in Postman, you configure various aspects of the request, including the request method (e.g., GET, POST, PUT, DELETE), request URL, headers, query parameters, request body (if applicable), and authentication settings.</p><p>3) HTTP Request Generation: Postman internally generates the corresponding HTTP request based on the user's configuration. For example, if you configure a GET request to retrieve user data, Postman generates an HTTP GET request to the specified URL with the provided headers and parameters.</p><p>4) Request Sending: When you click the "Send" button within Postman, it sends the generated HTTP request to the target API endpoint using the configured settings (e.g., URL, headers, body). This request is sent via the HTTP protocol to the specified API server.</p><p>5) API Server Interaction: The HTTP request sent by Postman is received by the API server. The server processes the request based on the HTTP method, URL, and other request details. For example, in a RESTful API, a GET request may retrieve data, while a POST request may create new data.</p><p>6) Response Reception: After the API server processes the request, it sends an HTTP response back to Postman. This response includes data (e.g., JSON or XML) and metadata (e.g., status code, headers) generated by the server.</p><p>7) Response Handling: Postman receives the HTTP response and presents it to the user within its UI. The user can inspect the response content, status code, headers, and other details. Postman also provides tools for handling response data, such as extracting values or running tests.</p><p>8)Test Execution: Users can define tests and assertions within Postman using scripts (e.g., JavaScript). When a test script is defined, Postman internally executes the script and checks the results against the specified assertions.</p><p>9) Results Reporting: Postman provides feedback to the user about the outcome of the API request and any tests that were run. Users can view whether the request was successful, the response met the expected criteria, and any potential errors or issues.</p><p>10)Automation: Postman can be integrated into automated testing pipelines, continuous integration (CI) workflows, and monitoring systems. It can be invoked programmatically to run collections of requests, automate tests, and monitor APIs at specified intervals.</p><p><b>Examples:</b> make sure you have access to a RESTful API that you want to test. Replace the URL, endpoints, and parameters with the appropriate values for your specific API.</p><p>1) GET Request to Retrieve Data . 
To retrieve data from an API using a GET request:</p><p></p><ul style="text-align: left;"><li> GET https://api.example.com/users</li></ul><p></p><p>2) GET Request with Query Parameters.To retrieve data with query parameters:</p><p></p><ul style="text-align: left;"><li>GET https://api.example.com/users?id=123&name=John</li></ul><p></p><p>3) POST Request to Create Data.To create data using a POST request with a JSON body:</p><p></p><ul style="text-align: left;"><li>POST https://api.example.com/users</li></ul><span style="font-size: xx-small;">Headers:</span><br /><span style="font-size: x-small;">Content-Type: application/json</span><br /><span style="font-size: xx-small;"><br /></span><span style="font-size: x-small;">Body (JSON):</span><br /><span style="font-size: x-small;">{</span><br /><span style="font-size: x-small;"> "name": "Alice",</span><br /><span style="font-size: x-small;"> "email": "alice@example.com"</span><br /><span style="font-size: x-small;">}</span><p></p><p>4) PUT Request to Update Data.To update data using a PUT request with a JSON body:</p><p></p><ul style="text-align: left;"><li>PUT https://api.example.com/users/123</li></ul><p></p><p style="text-align: left;"><span style="font-size: xx-small;">Headers:</span><br /><span style="font-size: x-small;">Content-Type: application/json</span><br /><span style="font-size: xx-small;"><br /></span><span style="font-size: x-small;">Body (JSON):</span><br /><span style="font-size: x-small;">{</span><br /><span style="font-size: x-small;"> "name": "Updated Name",</span><br /><span style="font-size: x-small;"> "email": "updated@example.com"</span><br /><span style="font-size: x-small;">}</span></p><p><br /></p><p>5) DELETE Request to Remove Data. To delete data using a DELETE request:</p><p></p><ul style="text-align: left;"><li> DELETE https://api.example.com/users/123</li></ul><p></p><p>6) Headers and Authentication. You can add headers, such as authorization headers, to your requests. For example, to send an API key in the headers</p><p></p><ul style="text-align: left;"><li>GET https://api.example.com/resource</li></ul><span style="font-size: xx-small;">Headers:<br />Authorization: Bearer YOUR_API_KEY</span><p></p><p>7) Handling Response Data:After sending a request, you can inspect the response data. For example, to extract a specific value from the response, you can use JavaScript-like syntax in Postman's Tests tab:</p><p style="text-align: left;"><span style="font-size: xx-small;">// Extract the value of the "name" field from the JSON response</span><br /><span style="font-size: x-small;">var jsonData = pm.response.json();</span><br /><span style="font-size: x-small;">pm.environment.set("username", jsonData.name);</span></p><p>These are just some basic examples of how to use Postman to interact with RESTful APIs. You can create collections of requests, use variables, and write more complex tests to thoroughly test and validate your APIs.</p><p>Python code example that demonstrates how to make an HTTP GET request to a RESTful API using the popular requests library. 
In this example, we'll use the JSONPlaceholder API, which provides dummy data for testing and learning purposes:</p>
<pre>
import requests

# Define the API endpoint URL
api_url = "https://jsonplaceholder.typicode.com/posts/1"

try:
    # Send an HTTP GET request to the API endpoint
    response = requests.get(api_url)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()
        # Print the response data
        print("Title:", data["title"])
        print("Body:", data["body"])
    else:
        print("HTTP Request Failed with Status Code:", response.status_code)
except requests.exceptions.RequestException as e:
    # Handle any exceptions that may occur during the request
    print("An error occurred:", e)
</pre>
<p style="text-align: justify;"><b>NOTE:</b> We define the API endpoint URL (api_url) that we want to retrieve data from; in this example, we're fetching data for a specific post using its ID. We then use a try block to send an HTTP GET request to that endpoint with requests.get(api_url).</p><p style="text-align: justify;">We check the HTTP response status code. If it's 200, the request was successful, so we parse the JSON response using response.json() and print specific fields from it (in this case, the post's title and body). If the request fails or raises an exception, we handle it and print an error message.</p><p style="text-align: justify;">OpenStack provides a set of RESTful APIs for managing cloud infrastructure resources. These APIs are used to create, manage, and interact with virtualized resources such as instances (virtual machines), volumes, networks, and more.
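<p style="text-align: justify;">Before looking at the individual endpoints, here is a hedged curl sketch of the usual first step: requesting a token from the Identity (Keystone) API. The IP address, domain, and credentials are placeholders; on success Keystone returns 201 Created and places the token in the X-Subject-Token response header, which is then passed as X-Auth-Token on later calls:</p>
<pre>
curl -i -X POST "http://<OpenStack-IP>:5000/v3/auth/tokens" \
  -H "Content-Type: application/json" \
  -d '{
        "auth": {
          "identity": {
            "methods": ["password"],
            "password": {
              "user": {
                "name": "your_username",
                "domain": {"name": "Default"},
                "password": "your_password"
              }
            }
          }
        }
      }'
# The returned X-Subject-Token value is sent as an X-Auth-Token header
# when calling the Nova, Neutron, Cinder, and other service endpoints below.
</pre>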
Here are some common API endpoint examples with respect to OpenStack:</p><p><span style="font-size: x-small;">1) Identity (Keystone) API:</span></p><p><span style="font-size: x-small;">Authentication and token management.</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>:5000/v3/</span></p><p><span style="font-size: x-small;">Compute (Nova) API:</span></p><p><span style="font-size: x-small;">2) Management of virtual machines (instances).</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>:8774/v2.1/</span></p><p><span style="font-size: x-small;">Block Storage (Cinder) API:</span></p><p><span style="font-size: x-small;">3) Management of block storage volumes.</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>:8776/v2/</span></p><p><span style="font-size: x-small;">Object Storage (Swift) API:</span></p><p><span style="font-size: x-small;">4) Storage and retrieval of objects (files and data).</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>:8080/v1/</span></p><p><span style="font-size: x-small;">Image (Glance) API:</span></p><p><span style="font-size: x-small;">5) Management of virtual machine images (VM snapshots).</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>:9292/v2/</span></p><p><span style="font-size: x-small;">Network (Neutron) API:</span></p><p><span style="font-size: x-small;">6) Management of network resources, including routers, subnets, and security groups.</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>:9696/v2.0/</span></p><p><span style="font-size: x-small;">Orchestration (Heat) API:</span></p><p><span style="font-size: x-small;">7) Orchestration of cloud resources through templates.</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>:8004/v1/</span></p><p><span style="font-size: x-small;">Telemetry (Ceilometer) API:</span></p><p><span style="font-size: x-small;">8) Collection of usage and performance data.</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>:8777/v2/</span></p><p><span style="font-size: x-small;">Dashboard (Horizon) API:</span></p><p><span style="font-size: x-small;">9) Web-based user interface for OpenStack services.</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>/dashboard/</span></p><p><span style="font-size: x-small;">Placement (Placement) API:</span></p><p><span style="font-size: x-small;">10) Management of resource placement and allocation.</span></p><p><span style="font-size: x-small;">Example: http://<OpenStack-IP>:8778/</span></p><p><span style="font-size: x-small;">These are just some examples of the core OpenStack APIs and their respective endpoint URLs.</span></p><p>--------</p><p style="text-align: justify;">To check if a user exists in your OpenStack environment, you can use the Identity (Keystone) API, which manages authentication and user-related operations. Specifically, you can make a request to the Keystone API to list users and then check if the desired user is in the list. Here are the general steps to do this:</p><p style="text-align: justify;">Step 1 :Authenticate with Keystone:</p><p style="text-align: justify;">Before making any requests to the Keystone API, you need to authenticate. Typically, this involves sending a POST request with your credentials to the Keystone authentication endpoint. 
You'll receive a token in response, which you can use to make subsequent API requests.</p><p>Step 2 : List Users:</p><p>Make a GET request to the Keystone API's user listing endpoint to retrieve a list of all users in the OpenStack environment.</p><p>Example API endpoint for listing users: http://<OpenStack-IP>:5000/v3/users</p><p>Include the authentication token in the request headers.</p><p>Step 3 : Check User Existence:</p><p>After receiving the list of users, you can iterate through the user data and check if the desired user exists by comparing usernames, IDs, or other unique identifiers.</p><p>Here's a Python example using the requests library to check if a user exists in Keystone:</p>
<pre>
import requests

# Keystone authentication endpoint
auth_url = "http://<OpenStack-IP>:5000/v3/auth/tokens"
# Keystone user listing endpoint
users_url = "http://<OpenStack-IP>:5000/v3/users"

# Replace with your OpenStack credentials
auth_data = {
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "name": "your_username",
                    "domain": {"name": "your_domain"},
                    "password": "your_password"
                }
            }
        }
    }
}

# Authenticate and get a token
response = requests.post(auth_url, json=auth_data)
if response.status_code == 201:
    token = response.headers["X-Subject-Token"]

    # List all users
    headers = {"X-Auth-Token": token}
    response = requests.get(users_url, headers=headers)

    if response.status_code == 200:
        users = response.json()["users"]

        # Check if the user exists
        target_user = "desired_username"
        user_exists = any(user["name"] == target_user for user in users)
        if user_exists:
            print(f"User {target_user} exists.")
        else:
            print(f"User {target_user} does not exist.")
    else:
        print("Failed to list users.")
else:
    print("Authentication failed.")
</pre>
<p>This example demonstrates how to authenticate with Keystone, list users, and check if a specific user exists by comparing usernames. Replace the placeholders with your OpenStack-specific values and adjust the code as needed for your environment.</p><p>-----------------------</p><p><b>OpenStack service overview: </b></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNJMD_tpOHiYXGyNKQlooOJehoUT1K1P1BZ9NRwgpA2nIh9L6F7dJkTx9KBESo848O1-W68TbCGOmVvs5GyLkk6IyffytuJm4pmpU35Ku4jlXTJjmkEuyVqPi9uFyIR-NgzTtHTruwoG7stMhB-VvfOb0kiw7w2YTi5B2TZSTBhlqM2dzRCw2MwGd4xqDz/s1051/openstack2.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="337" data-original-width="1051" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNJMD_tpOHiYXGyNKQlooOJehoUT1K1P1BZ9NRwgpA2nIh9L6F7dJkTx9KBESo848O1-W68TbCGOmVvs5GyLkk6IyffytuJm4pmpU35Ku4jlXTJjmkEuyVqPi9uFyIR-NgzTtHTruwoG7stMhB-VvfOb0kiw7w2YTi5B2TZSTBhlqM2dzRCw2MwGd4xqDz/w640-h206/openstack2.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://docs.openstack.org/security-guide/introduction/introduction-to-openstack.html">source</a></td></tr></tbody></table><p style="text-align: justify;">Nova, Cinder, Swift, and Neutron: together, these OpenStack services provide a comprehensive cloud computing platform. Nova manages compute resources, Cinder offers block storage, Swift provides object storage, and Neutron handles networking, enabling organizations to build and manage private and public clouds tailored to their specific needs.</p><p style="text-align: justify;">Nova (OpenStack Compute): Nova is the core compute service in OpenStack. It manages the creation, scheduling, and management of virtual machines (VMs) in a cloud environment. Nova is hypervisor-agnostic, supporting various virtualization technologies, and it provides features for live migration, scaling, and resource management.</p><p style="text-align: justify;">Cinder (OpenStack Block Storage): Cinder is the block storage service in OpenStack. It offers block-level storage volumes that can be attached to VMs. Users can create, manage, and snapshot these volumes, making it suitable for data persistence in applications like databases.</p><p style="text-align: justify;">Swift (OpenStack Object Storage): Swift is the object storage service in OpenStack. It is designed for the storage of large amounts of unstructured data, such as images, videos, and backups. Swift provides scalable, redundant, and highly available storage with easy-to-use APIs.</p><p style="text-align: justify;">Neutron (OpenStack Networking): Neutron is the networking service in OpenStack. It enables users to create and manage networks, subnets, routers, and security groups for VMs.
Neutron supports various network configurations, including flat networks, VLANs, and overlay networks, allowing for flexibility in network design.</p><p style="text-align: justify;">--------</p><p style="text-align: justify;">Key Differences between <b>Cinder and swift </b>: The object storage and block storage serve different purposes and have distinct access methods. Object storage is well-suited for handling unstructured data and large-scale content distribution, while block storage is preferred for applications requiring direct control over data blocks and high performance. Organizations often choose between these storage types based on their specific use cases and storage needs.</p><p style="text-align: justify;">Access Level: Object storage uses a higher-level access method, where data is accessed and managed as whole objects using unique identifiers. Block storage provides lower-level access, treating data as raw blocks.</p><p style="text-align: justify;">Use Cases: Object storage is ideal for storing large amounts of unstructured data and content distribution, while block storage is suited for applications requiring direct control over storage blocks.</p><p style="text-align: justify;">Scalability: Object storage is known for its horizontal scalability and ease of expansion, whereas block storage scalability may require more planning and management.</p><p style="text-align: justify;">Data Management: Object storage systems often manage data redundancy and durability internally, while block storage may rely on external solutions or the application to manage data redundancy.</p><p style="text-align: justify;">Data Retrieval: Object storage is optimized for read-heavy workloads and large-scale data distribution, while block storage is designed for high performance and low-latency access.</p><p style="text-align: justify;">------------</p><p style="text-align: justify;"><b>Ceph:</b></p><p style="text-align: justify;">Ceph is an open-source, distributed storage system designed for both object and block storage. It is known for its flexibility, scalability, and ability to provide a unified storage platform. Ceph is often used in cloud computing environments, data centers, and high-performance computing (HPC) clusters.</p><p style="text-align: justify;">Key components and features of Ceph include:</p><p style="text-align: justify;">Object Storage (RADOS Gateway): Ceph provides object storage capabilities through its RADOS (Reliable Autonomic Distributed Object Store) Gateway. This allows users to store and retrieve objects using a RESTful API compatible with Amazon S3 and Swift.</p><p style="text-align: justify;">Block Storage (RBD): Ceph's RADOS Block Device (RBD) allows users to create block storage volumes that can be attached to virtual machines or used as raw block storage. RBD is often integrated with virtualization platforms like KVM.</p><p style="text-align: justify;">Scalability: Ceph scales seamlessly from a few nodes to thousands of nodes by distributing data across OSDs (Object Storage Daemons) and MONs (Monitor Daemons). This scalability makes it suitable for large-scale storage deployments.</p><p style="text-align: justify;">Data Redundancy: Ceph replicates data across multiple OSDs to ensure redundancy and high availability. It uses a CRUSH algorithm to distribute data efficiently.</p><p style="text-align: justify;">Self-Healing: Ceph can automatically detect and recover from hardware failures or data inconsistencies. 
It continuously monitors data integrity.</p><p style="text-align: justify;">Unified Storage: Ceph provides a unified storage platform that combines object, block, and file storage, allowing users to access data in various ways, depending on their requirements.</p><p style="text-align: justify;">Community and Ecosystem: Ceph has a vibrant open-source community and a wide ecosystem of tools and projects that integrate with it. This includes interfaces for OpenStack integration.</p><p style="text-align: justify;">-------------------------</p><p style="text-align: justify;"><b>Neutron,</b> the networking component of OpenStack, plays a crucial role in creating and managing networking resources within a cloud infrastructure. </p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2EVZWgg8uDMxQKehbPl92fT3CKp5c2oEvX5Lf0iAldHNl-nKtbR-UDeuch9J5A7rol8dS4o3hD4b5ZiIuq8BZ7nNnnm4xSO0LEbU7eb9LqWKvCTfv7bbwayboGhGRrjTQUHCgM5GQvOKzpyK5tkhUZtsDaNktxKd85NNUi_0hUzm02-jA9dPvnDnnGHve/s723/OpenStack-HA-neutron.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="468" data-original-width="723" height="414" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2EVZWgg8uDMxQKehbPl92fT3CKp5c2oEvX5Lf0iAldHNl-nKtbR-UDeuch9J5A7rol8dS4o3hD4b5ZiIuq8BZ7nNnnm4xSO0LEbU7eb9LqWKvCTfv7bbwayboGhGRrjTQUHCgM5GQvOKzpyK5tkhUZtsDaNktxKd85NNUi_0hUzm02-jA9dPvnDnnGHve/w640-h414/OpenStack-HA-neutron.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.researchgate.net/figure/OpenStack-HA-Architecture_fig1_308188705" target="_blank"><span style="font-size: xx-small;">source</span></a></td></tr></tbody></table><p style="text-align: justify;">Here are some interesting factors and capabilities related to Neutron:</p><p style="text-align: justify;">Network Abstraction: Neutron abstracts network resources, allowing users to create and manage virtual networks, subnets, routers, and security groups through APIs or the dashboard. This abstraction simplifies complex networking tasks and provides a consistent interface.</p><p style="text-align: justify;">Multi-Tenancy: Neutron supports multi-tenancy, enabling the isolation of network resources between different projects or tenants. This ensures that one tenant's network activities do not impact another's.</p><p style="text-align: justify;">Pluggable Architecture: Neutron follows a pluggable architecture, allowing users to integrate with various networking technologies and solutions. This includes support for different plugins and drivers, enabling compatibility with a wide range of network devices and services.</p><p style="text-align: justify;">Software-Defined Networking (SDN): Neutron can be used in conjunction with SDN controllers and solutions to provide advanced network automation, programmability, and flexibility. SDN allows for the dynamic configuration of network services and policies.</p><p style="text-align: justify;">Networking Interfaces: Neutron allows the creation of various types of networking interfaces for virtual machines, including:</p><p style="text-align: justify;">Port: Neutron manages ports, which represent virtual interfaces connected to a network. 
VMs attach to ports to access the network.</p><p style="text-align: justify;">Router: Routers connect different subnets and provide inter-subnet routing. Neutron manages router interfaces and routing rules.</p><p style="text-align: justify;">Floating IPs: Floating IPs provide external network access to VMs. Neutron can assign floating IPs dynamically or statically.</p><p style="text-align: justify;">Bonding and Teaming: Neutron can manage bonded network interfaces (NIC bonding) for redundancy and increased network bandwidth. This is especially useful for ensuring high availability and load balancing of VMs.</p><p style="text-align: justify;">Security Groups: Neutron's security groups feature allows users to define firewall rules and policies to control incoming and outgoing traffic to VMs. It enhances network security within the cloud environment.</p><p style="text-align: justify;">L3 and L2 Services: Neutron supports Layer 3 (routing) and Layer 2 (bridging) services. This flexibility enables complex network topologies and scenarios.</p><p style="text-align: justify;">Interoperability: Neutron integrates with various network technologies, including VLANs, VXLANs, GRE tunnels, and more. It provides interoperability with physical network infrastructure and external networks.</p><p style="text-align: justify;">Communication Between VMs: Neutron ensures that VMs can communicate with each other within the same network or across networks using routing. It manages the routing tables and connectivity.</p><p style="text-align: justify;">Load Balancing as a Service (LBaaS): Neutron offers LBaaS, allowing users to create and manage load balancers to distribute traffic among multiple VMs or instances.</p><p style="text-align: justify;">High Availability (HA): Neutron can be configured for high availability, ensuring network services remain operational even in the event of network node failures.</p><p style="text-align: justify;">---------------------------------------------------------</p><p style="text-align: justify;">Containerization in OpenStack involves deploying and managing containers within an OpenStack cloud environment. 
This allows users to run containerized applications and microservices alongside traditional virtual machines (VMs).</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgu07jT57k8S7Ohe2Gw_2ah5-Z91yi33wVGDzGs0PGBOmjdmdCqpskN-MlEhoX_IGvoHMwpv3QoakeyGFjrdybbV_ntBdDYt1v6eXiEmZLxOYbfi1uiqsoftYGD6LNpsRrf_OgTQTTdlBuq4VChrXtgGNqcxisG9H6DM505TB8TP6l5fq9d4CvV-trFKbtL/s1137/openstack6.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="635" data-original-width="1137" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgu07jT57k8S7Ohe2Gw_2ah5-Z91yi33wVGDzGs0PGBOmjdmdCqpskN-MlEhoX_IGvoHMwpv3QoakeyGFjrdybbV_ntBdDYt1v6eXiEmZLxOYbfi1uiqsoftYGD6LNpsRrf_OgTQTTdlBuq4VChrXtgGNqcxisG9H6DM505TB8TP6l5fq9d4CvV-trFKbtL/w640-h358/openstack6.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.openstack.org/use-cases/containers/leveraging-containers-and-openstack/" target="_blank"><span style="font-size: x-small;">source</span></a></td></tr></tbody></table><p style="text-align: justify;">Here's a step-by-step explanation of the design and components involved in containerization within OpenStack:</p><p style="text-align: justify;">1. Container Orchestration Framework: OpenStack supports various container orchestration frameworks, with Kubernetes being one of the most popular choices. Kubernetes helps manage the deployment, scaling, and operation of application containers. It serves as the foundation for container orchestration in an OpenStack environment.</p><p style="text-align: justify;">2. Container Runtime: Containers are run using a container runtime, such as Docker or containerd. This runtime manages the execution of containerized applications and provides isolation between containers. In an OpenStack-based containerization setup, a container runtime is installed on each compute node in the OpenStack cluster.</p><p style="text-align: justify;">3. OpenStack Components:</p><p style="text-align: justify;"></p><ul><li>Nova (Compute Service): Nova is responsible for managing compute resources, including VMs and, in a containerized environment, bare metal servers. It can provision servers specifically for running containers alongside traditional VMs.</li><li>Neutron (Networking Service): Neutron handles networking and connectivity for containers. It ensures that containers can communicate with each other, VMs, and external networks.</li><li>Cinder (Block Storage Service): Cinder provides block-level storage for containers when persistent storage is required. Containers can use Cinder volumes for data storage.</li></ul><p></p><p style="text-align: justify;">4. Magnum (Container Orchestration Service): OpenStack Magnum is a dedicated service for managing container orchestration clusters, such as Kubernetes, within the OpenStack cloud. It simplifies the deployment and management of container orchestration platforms.</p><p style="text-align: justify;">5. Heat (Orchestration Service): Heat is an orchestration service in OpenStack that enables the automated deployment and scaling of infrastructure resources, including containers. It allows users to define templates describing the desired container infrastructure and then deploys and manages the resources accordingly.</p><p style="text-align: justify;">6. 
Glance (Image Service): Glance is responsible for storing and managing container images. Containers are typically built from base images, and Glance helps manage these images within the OpenStack environment.</p><p style="text-align: justify;">7. Keystone (Identity Service): Keystone provides authentication and authorization services for containerized applications and services. It ensures that only authorized users and services can access containers and container orchestration platforms.</p><p style="text-align: justify;">8. Container Networking and Storage Plugins: In an OpenStack-based containerization environment, specialized networking and storage plugins are often used to integrate container networking and storage with OpenStack services. These plugins enable efficient communication and data storage for containers.</p><p style="text-align: justify;">9. User Interface: Users interact with the containerization platform through the OpenStack dashboard (Horizon) or through the command-line interface (CLI). They can deploy and manage containers, container orchestration clusters, and associated resources.</p><p style="text-align: justify;">10. Monitoring and Logging: Containerized applications generate logs and require monitoring for performance and resource usage. OpenStack can be integrated with monitoring and logging solutions like Prometheus, Grafana, and ELK (Elasticsearch, Logstash, and Kibana) to provide insights into containerized workloads.</p><p style="text-align: justify;">11. External Services Integration: Containers often need to interact with external services and APIs. OpenStack allows for integration with external services through the use of network configurations, load balancers, and other relevant components.</p><p style="text-align: justify;">In summary, containerization in OpenStack involves a combination of OpenStack services, container orchestration frameworks like Kubernetes, container runtimes, and specialized plugins to provide a seamless environment for deploying and managing containerized applications alongside traditional VMs within an OpenStack cloud infrastructure. This setup offers flexibility, scalability, and isolation for running containerized workloads in a cloud environment.</p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-6010611881101772742023-07-30T23:22:00.013+05:302023-08-07T16:31:02.498+05:30Watsonx AI and data platform with Foundation Models<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIPMtF7y-I30M2v_3Irk3JSD8oQ4h3ls8fThdofYDf-bzO47eR_2Wk23LjrkGbXgeP7CTARS-6J2uk4fIWO8x2l3aA08wd8FSnZB738hMNX2Qzy_yRUznVoLvjLqFSLQLqtW-Prw9LpWdzh1DDsNQd6GgMKI1gmBHdfSgxvxp36qUnSINrxWMQN8l4-sZu/s942/1.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="490" data-original-width="942" height="119" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIPMtF7y-I30M2v_3Irk3JSD8oQ4h3ls8fThdofYDf-bzO47eR_2Wk23LjrkGbXgeP7CTARS-6J2uk4fIWO8x2l3aA08wd8FSnZB738hMNX2Qzy_yRUznVoLvjLqFSLQLqtW-Prw9LpWdzh1DDsNQd6GgMKI1gmBHdfSgxvxp36qUnSINrxWMQN8l4-sZu/w244-h119/1.png" width="244" /></a></div><div style="text-align: justify;">We are witnessing a fundamental shift in AI driven by self-supervision and by the ability to create foundation models that power generative AI. 
Several exciting new Foundation Model capabilities have been announced at IBM Think 2023. Watsonx is a new platform for foundation models and generative AI, offering a studio, data store, and governance toolkit. Let’s take a look at what this platform intends to provide.</div><p></p><p style="text-align: justify;">Why can't we build and reuse AI models? More data, more problems? Learn how AI foundation models change the game for training AI/ML from IBM Research AI VP Sriram Raghavan and Darío Gil, SVP and Director of IBM Research, as they demystify the technology and share a set of principles to guide your generative AI business strategy. Experience watsonx, IBM’s new data and AI platform for generative AI, learn about the breakthroughs that IBM Research is bringing to this platform and to the world of computing, and explore foundation models, an emerging approach to machine learning and data representation. Even in the age of big data, when AI/ML is more prevalent, training the next generation of AI tools such as NLP models requires enormous amounts of data, and applying AI models to new or different domains can be tricky. A foundation model can consolidate data from several sources so that one model may then be used for various activities. But how will foundation models be used for things beyond natural language processing? Don't miss this episode to explore how foundation models are a paradigm shift in how AI gets done. </p><p style="text-align: justify;">You can bring your own data and AI models to watsonx or choose from a library of tools and technologies. You can train or influence training (if you want), and then you can tune, so that you have transparency and control over governing data and AI models. You can prompt it too. Instead of only one model, you can have a family of models. Foundation models trained with your own data become a more valuable asset. Watsonx is a new integrated data and AI platform that helps you become a value creator. It consists of three primary parts. First, watsonx.data is a massive curated data repository, with a data management system, that is ready to be tapped to train and fine-tune models. Watsonx.ai is an enterprise studio to train, validate, tune and deploy traditional ML and foundation models that provide generative AI capabilities. Watsonx.governance is a powerful set of tools to ensure your AI is executing responsibly. They work together seamlessly throughout the entire lifecycle of foundation models. Watsonx is built on top of Red Hat OpenShift. The lifecycle consists of:</p><p style="text-align: justify;">STEP 1: preparing our data [Acquire, filter and pre-process, version & tag]. After each data set is filtered and processed, it receives a data card. The data card holds the name and version of the pile, and specifies its content and the filters that have been applied to it. We can have multiple data piles; they co-exist in watsonx.data, and access to the different versions of data maintained for different purposes is managed seamlessly.</p><p style="text-align: justify;">STEP 2: using the data to train, validate and tune the model, and to deploy applications and solutions. So we move from watsonx.data to watsonx.ai and start by picking a model architecture from the five families that IBM provides. These are the bedrock model families, and they range from encoder-only and encoder-decoder to decoder-only and other novel architectures.</p><p style="text-align: justify;">What Are Foundation Models? 
Foundation models are AI neural networks trained on massive unlabeled datasets to handle a wide variety of jobs from translating text to analyzing medical images.<span style="text-align: left;"> We're witnessing a transition in AI. Systems that execute specific tasks in a single domain are giving way to broad AI that learns more generally and works across domains and problems. Foundation models, trained on large, unlabeled datasets and fine-tuned for an array of applications, are driving this shift. </span>The models are pre-trained to support a range of natural language processing (NLP) type tasks including question answering, content generation and summarization, text classification and extraction. Future releases will provide access to a greater variety of IBM-trained proprietary foundation models for efficient domain and task specialization.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSoK4Rvkr8GH_BnRuygccde1wWTuy_gzvvoGTTilFUdVf3GopeQCD4bScbj0tHcuolT_8zsBjzFfaajInWKf4bxWY2cyeEyT-16LUnPenWbgPku264dTEB4Ijhf5Du-IomGUnAKyMcHh_c3CMRkp-myZvAbyAbBX0vrusgtASjnLz4tW0yHzG8c7E7nrDv/s1580/Data_tasks.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1080" data-original-width="1580" height="438" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSoK4Rvkr8GH_BnRuygccde1wWTuy_gzvvoGTTilFUdVf3GopeQCD4bScbj0tHcuolT_8zsBjzFfaajInWKf4bxWY2cyeEyT-16LUnPenWbgPku264dTEB4Ijhf5Du-IomGUnAKyMcHh_c3CMRkp-myZvAbyAbBX0vrusgtASjnLz4tW0yHzG8c7E7nrDv/w640-h438/Data_tasks.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://blogs.nvidia.com/blog/2023/03/13/what-are-foundation-models/"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><p style="text-align: justify;">Foundation models are trained with massive amounts of data that allow for generative AI capabilities with a broad set of raw data that can be applied to different tasks, such as natural language processing. Instead of one model built solely for one task, foundation models can be adapted across a wide variety of different scenarios, summarizing documents, generating stories, answering questions, writing code, solving math problems, synthesizing audio. A year after the group defined foundation models, other tech watchers coined a related term — generative AI. It’s an umbrella term for transformers, large language models, diffusion models and other neural networks capturing people’s imaginations because they can create text, images, music, software and more.</p><p>IBM has planned to offer a suite of foundation models, for example smaller encoder based models, but also encoder-decoder or just decoder based models. 
</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsJCBAIyXMSMy594XZLBD8yrRoQ0YdQ21LZnCAHQtOg2SGoWMEaZq-rAPmMPf4Ug77zlYng2P4YWWmNm5XBTrz-__XNoDAu7aFCaZQh2rKrI__31PknvaK4czrg7L5fFWZL2igj-oe6OyhqiuIFzdk0pxBpJgl0qiFLyT9xNWy4-6DraUgUeUVJAavAtlI/s1718/IBM-foundation_models.png" style="margin-left: auto; margin-right: auto;"><span style="color: black;"><img border="0" data-original-height="924" data-original-width="1718" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsJCBAIyXMSMy594XZLBD8yrRoQ0YdQ21LZnCAHQtOg2SGoWMEaZq-rAPmMPf4Ug77zlYng2P4YWWmNm5XBTrz-__XNoDAu7aFCaZQh2rKrI__31PknvaK4czrg7L5fFWZL2igj-oe6OyhqiuIFzdk0pxBpJgl0qiFLyT9xNWy4-6DraUgUeUVJAavAtlI/w640-h344/IBM-foundation_models.png" title="source" width="640" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.youtube.com/watch?v=FrDnPTPgEmk"><span style="color: black;">source</span></a><br /><br /></td></tr></tbody></table><div style="text-align: justify;">Recognizing that one size doesn’t fit all, we’re building a family of language and code foundation models of different sizes and architectures. Each model family has a geology-themed code name —<b>Granite, Sandstone, Obsidian, and Slate</b> — which brings together cutting-edge innovations from IBM Research and the open research community. Each model can be customized for a range of enterprise tasks. While Foundation Models are in general good in performing multiple tasks, they have been trained with generic data. To optimize them, fine tuning with domain specific or proprietary data can be done.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><p>Watsonx is our enterprise-ready AI and data platform designed to multiply the impact of AI across your business. The platform comprises three powerful products: the watsonx.ai studio for new foundation models, generative AI and machine learning; the watsonx.data fit-for-purpose data store, built on an open lakehouse architecture; and the watsonx.governance toolkit, to accelerate AI workflows that are built with responsibility, transparency and explainability. 
It consists of Watsonx.data, Watsonx.ai and Watsonx.governance</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgohHDLYFk30mLNh8s-_Pl3mswfON0cvTZoN_zKVXHnQQDcO-T1f3LLxSA06vA3qVcixTd3TrKlwrnQxUK2d8k8bhSJ4X5Jo1reiWeKhiBbC1vgCOGGLcuKqzezaCpceoAu9Y4G5nWAngcrIwtl0hdU3cXSvHPA5QipxoEDfV9kk0dQbJmUIg-2s1SV6KAj/s1183/WatsonX-3-parts.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="885" data-original-width="1183" height="478" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgohHDLYFk30mLNh8s-_Pl3mswfON0cvTZoN_zKVXHnQQDcO-T1f3LLxSA06vA3qVcixTd3TrKlwrnQxUK2d8k8bhSJ4X5Jo1reiWeKhiBbC1vgCOGGLcuKqzezaCpceoAu9Y4G5nWAngcrIwtl0hdU3cXSvHPA5QipxoEDfV9kk0dQbJmUIg-2s1SV6KAj/w640-h478/WatsonX-3-parts.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.youtube.com/watch?v=FrDnPTPgEmk"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><br /><div><br /></div><div><div><b>Watsonx.data: An open, hybrid and governed data store</b></div><div>It makes it possible for enterprises to scale analytics and AI with a fit-for-purpose data store, built on an open lakehouse architecture, supported by querying, governance and open data formats to access and share data. With watsonx.data, you can connect to data in minutes, quickly get trusted insights and reduce your data warehouse costs. Now available as a service on IBM Cloud and AWS and as containerized software.</div><div><br /></div></div><div><p><b>Watsonx.ai Studio: </b>an AI studio that combines the capabilities of IBM Watson Studio with the latest generative AI capabilities that leverage the power of foundation models. It provides access to high-quality, pre-trained, and proprietary IBM foundation models built with a rigorous focus on data acquisition, provenance, and quality. watsonx.ai is user-friendly. It’s not just for data scientists & developers, but also for business users. It provides a simple, natural language interface for different tasks. Watsonx.ai Studio ships with a new playground, including easy-to-use prompt tuning. With watsonx.ai, you can train, validate, tune and deploy AI models.</p><p><b>Watsonx.governance</b>: IBM has described watsonx.governance as a tool for building responsible, transparent and explainable AI workflows. According to IBM, watsonx.governance will also enable customers to direct, manage and monitor AI activities, map them to regulatory requirements, and address ethical issues. The more AI is embedded into daily workflows, the more you need proactive governance to drive responsible, ethical decisions across the business. 
Watsonx.governance allows you to direct, manage, and monitor your organization’s AI activities, and employs software automation to strengthen your ability to mitigate risk, manage regulatory requirements and address ethical concerns without the excessive costs of switching your data science platform—even for models developed using third-party tools.</p></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWEgdvsCHpgW33AGfor_mfK3C85yDmXJqlVNkv3zYLhV8sgauytlg0VdgfCkzxigNU_osiNK28JACl3UVuuyJjzcoPXBLGKuFxRELoXmkIprStgdt2NpFOVilD83zkrmqyzP0xiE3_udnyZ4qsKn380cRRxBEXrEJHi3qbKxNjDRCQI7GH-dX5rfETIlnb/s1365/watsonX-openshift-clouds.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="857" data-original-width="1365" height="402" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWEgdvsCHpgW33AGfor_mfK3C85yDmXJqlVNkv3zYLhV8sgauytlg0VdgfCkzxigNU_osiNK28JACl3UVuuyJjzcoPXBLGKuFxRELoXmkIprStgdt2NpFOVilD83zkrmqyzP0xiE3_udnyZ4qsKn380cRRxBEXrEJHi3qbKxNjDRCQI7GH-dX5rfETIlnb/w640-h402/watsonX-openshift-clouds.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.youtube.com/watch?v=FrDnPTPgEmk"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><p><br /></p></div><div style="text-align: justify;">IBM plans to provide Foundation Models as a Service with the capabilities of IBM’s first AI-optimized, cloud-native supercomputer Vela as a Service. The stack utilizes Red Hat OpenShift, so that it could also be run on multiple clouds or on-premises. It is based on popular open source frameworks and communities like PyTorch, Ray and Hugging Face. </div><div style="text-align: justify;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYUhutgsdcPCMzRNeXQSddq71qJRMnnym8PCjurkIKi7CqTVwVx0FPDUo0euYCRp7rxz-Sei-iDGrTANB9zz8Expl0Am9PQNNvLfCAdDC44x_lQkZqqoWOSsWskGgi9oBat2sPefpRRNVQfWrLkUan5jDr2cSC2Y2Ur6Uo5EEUu9_4OqUDtZwvLiu-CN-1/s762/IBM-vela.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="762" data-original-width="732" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYUhutgsdcPCMzRNeXQSddq71qJRMnnym8PCjurkIKi7CqTVwVx0FPDUo0euYCRp7rxz-Sei-iDGrTANB9zz8Expl0Am9PQNNvLfCAdDC44x_lQkZqqoWOSsWskGgi9oBat2sPefpRRNVQfWrLkUan5jDr2cSC2Y2Ur6Uo5EEUu9_4OqUDtZwvLiu-CN-1/w614-h640/IBM-vela.png" width="614" /></a><span style="text-align: justify;"> </span></div><div><p>Why we built an AI supercomputer in the cloud?. Introducing Vela, IBM’s first AI-optimized, cloud-native supercomputer.</p><p style="text-align: justify;">IBM built Vela supercomputer designed specifically for training so-called “foundation” AI models such as GPT-3. According to IBM, this new supercomputer should become the basis for all its own research and development activities for these types of AI models.IBM’s Vela supercomputer uses x86-based standard hardware. In the Vela system, each node’s hardware consists of a pair of “regular” Intel Xeon Scalable processors. To this are added eight 80GB Nvidia A100 GPUs per node. Furthermore, each node within the supercomputer is connected to several 100 Gbps Ethernet network interfaces. 
Each Vela node also has 1.5TB of DRAM internal memory and four 3.2TB NVMe drives for storage. In addition, IBM has also built a new workload-scheduling system for the Vela, the MultiCluster App Dispatcher (MCAD) system. This should handle cloud-based job scheduling for training foundation AI models.</p><p>Multi-Cluster Application Dispatcher:</p><p style="text-align: justify;">The multi-cluster-app-dispatcher is a Kubernetes controller providing mechanisms for applications to manage batch jobs in a single or multi-cluster environment. The multi-cluster-app-dispatcher (MCAD) controller is capable of (i) providing an abstraction for wrapping all resources of the job/application and treating them holistically, (ii) queuing job/application creation requests and applying different queuing policies, e.g., First In First Out, Priority, (iii) dispatching the job to one of multiple clusters, where an MCAD queuing agent runs, using configurable dispatch policies, and (iv) auto-scaling pod sets, balancing job demands and cluster availability.</p><p>What is prompt-tuning?</p><p style="text-align: justify;">Prompt-tuning is an efficient, low-cost way of adapting an AI foundation model to new downstream tasks without retraining the model and updating its weights. Redeploying an AI model without retraining it can cut computing and energy use by at least 1,000 times, saving thousands of dollars. With prompt-tuning, you can rapidly spin up a powerful model for your particular needs. It also lets you move faster and experiment.</p><p style="text-align: justify;">In prompt-tuning, the best cues, or front-end prompts, are fed to your AI model to give it task-specific context. The prompts can be extra words introduced by a human, or AI-generated numbers introduced into the model's embedding layer. Like crossword puzzle clues, both prompt types guide the model toward a desired decision or prediction. Prompt-tuning allows a company with limited data to tailor a massive model to a narrow task. It also eliminates the need to update the model’s billions (or trillions) of weights, or parameters. Prompt-tuning originated with large language models but has since expanded to other foundation models, like transformers that handle other sequential data types, including audio and video. Prompts may be snippets of text, streams of speech, or blocks of pixels in a still image or video. We don’t touch the model. It’s frozen. </p><p style="text-align: justify;"><a href="https://blog.hootsuite.com/ai-art-prompts/">For Example: How do AI art generators work?</a></p><p style="text-align: justify;"> AI art generators don’t know what an owl looks like in the wild. They don’t know what a sunset looks like in a physical sense. They can only understand details about features, patterns, and relationships within the datasets they’ve been trained on. Prompting for a “beautiful face” is not very helpful. It is more effective to prompt for specific features such as symmetry, big lips, and green eyes. Even if the bot doesn’t understand beauty, it can recognize the features you describe as beautiful and generate something relatively accurate. To get the best results from your AI art generator prompt, you’ll need to give clear and detailed instructions. An effective AI art prompt should include specific descriptions, shapes, colors, textures, patterns, and artistic styles. 
This allows the neural networks used by the generator to create the best possible visuals.</p><p style="text-align: justify;">T5 (Text-to-Text Transfer Transformer) is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks; it uses the complete (encoder plus decoder) transformer architecture. T5 provides a simple way to train a single model on a wide variety of text tasks. FLAN stands for Finetuned LAnguage Net. FLAN has already been fine-tuned by Google, so you can try your multiple tasks on this already pre-tuned model. If you fine-tune it further, you may destroy that fine-tuning by overwriting it. Flan-UL2 is an encoder-decoder model based on the T5 architecture. It uses the same configuration as the UL2 model released earlier last year. It was fine-tuned using the “Flan” prompt tuning and dataset collection. With its impressive 20 billion parameters, Flan-UL2 is a remarkable encoder-decoder model with exceptional performance. UL2 20B: An Open Source Unified Language Learner. In “Unifying Language Learning Paradigms”, we present a novel language pre-training paradigm called Unified Language Learner (UL2) that improves the performance of language models universally across datasets and setups. UL2 frames different objective functions for training language models as denoising tasks, where the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers that samples from a varied set of such objectives, each with different configurations. We demonstrate that models trained using the UL2 framework perform well in a variety of language domains, including prompt-based few-shot learning and models fine-tuned for downstream tasks. Additionally, we show that UL2 excels in generation, language understanding, retrieval, long-text understanding and question answering tasks.</p><p style="text-align: justify;"><b>Retrieval Augmented Generation (RAG): </b></p><p style="text-align: justify;">Foundation models are usually trained offline, making the model agnostic to any data that is created after the model was trained. Additionally, foundation models are trained on very general domain corpora, making them less effective for domain-specific tasks. You can use Retrieval Augmented Generation (RAG) to retrieve data from outside a foundation model and augment your prompts by adding the relevant retrieved data in context. For more information about RAG model architectures, see Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.</p><p style="text-align: justify;">With RAG, the external data used to augment your prompts can come from multiple data sources, such as document repositories, databases, or APIs. The first step is to convert your documents and any user queries into a compatible format to perform a relevancy search. To make the formats compatible, a document collection, or knowledge library, and user-submitted queries are converted to numerical representations using embedding language models. Embedding is the process by which text is given a numerical representation in a vector space. RAG model architectures compare the embeddings of user queries with the vectors in the knowledge library. The original user prompt is then appended with relevant context from similar documents within the knowledge library. This augmented prompt is then sent to the foundation model. 
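</p><p style="text-align: justify;">To make this flow concrete, here is a minimal, self-contained sketch of the retrieve-and-augment loop described above. It is an illustration only: the embed() function is a toy bag-of-words hash embedding standing in for a real embedding language model, the in-memory list stands in for a vector database, and the final print() is a placeholder for the call to the foundation model (it is not a watsonx API).</p><p style="text-align: left;"><span style="background-color: #cccccc;">import hashlib, math<br /><br />def embed(text, dim=64):<br />    # Toy embedding: hash each word into a fixed-size, L2-normalised vector.<br />    vec = [0.0] * dim<br />    for word in text.lower().split():<br />        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0<br />    norm = math.sqrt(sum(v * v for v in vec)) or 1.0<br />    return [v / norm for v in vec]<br /><br />def cosine(a, b):<br />    return sum(x * y for x, y in zip(a, b))<br /><br /># 1) Build the knowledge library: embed documents into an in-memory "vector database".<br />documents = [<br />    "watsonx.data is a fit-for-purpose data store built on an open lakehouse architecture.",<br />    "Prompt-tuning adapts a frozen foundation model with task-specific prompts.",<br />    "IBM Spectrum LSF schedules batch workloads on HPC clusters.",<br />]<br />vector_db = [(doc, embed(doc)) for doc in documents]<br /><br /># 2) Embed the user question and retrieve the most similar document.<br />question = "What is watsonx.data?"<br />q_vec = embed(question)<br />context = max(vector_db, key=lambda item: cosine(q_vec, item[1]))[0]<br /><br /># 3) Augment the original prompt with the retrieved context and send it to the model.<br />prompt = "Answer using this context:\n" + context + "\n\nQuestion: " + question<br />print(prompt)  # placeholder for the actual foundation model call</span></p><p style="text-align: justify;">In a production setup the embeddings would come from an embedding language model and the knowledge library would live in a real vector database. 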
You can update knowledge libraries and their relevant embeddings asynchronously.</p><p style="text-align: justify;"><span style="font-size: x-small;">Pre-requisites: data sets </span></p><p style="text-align: justify;"><span style="font-size: x-small;">1) Training data set ( contains question and answer )</span></p><p style="text-align: justify;"><span style="font-size: x-small;">2) Test data set </span></p><p style="text-align: justify;"><span style="font-size: x-small;">embedding generation -----> storing it in vector data base ---> giving a user question ----> convert it into embedding --->sending it to vector database ---> getting an answer ---> Finally creating a prompt ---->sending it to foundation model Flan-UL2 (encoder-decoder Model) ---> getting an answer </span> </p></div><div>Be a value creator. You can build foundation models using watsonx Platform on your data and that will be under your control. It will become your most valuable asset. Don’t outsource that and don’t reduce your AI strategy to an API call. One model will not rule them all. Build responsibly, transparently and put governance into the heart of your AI lifecycle.</div><div><br /></div><div><span style="font-size: xx-small;">Reference:</span></div><div><span style="font-size: xx-small;">https://www.youtube.com/watch?v=FrDnPTPgEmk</span></div><div><span style="font-size: xx-small;">https://www.ibm.com/products/watsonx-ai</span></div><div><span style="font-size: xx-small;">https://www.ibm.com/products/watsonx-governance</span></div><div><span style="font-size: xx-small;">https://www.ibm.com/products/watsonx-data</span></div><div><span style="font-size: xx-small;"><br /></span></div>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-85000183758699832562023-04-19T17:50:00.011+05:302023-04-21T14:13:00.033+05:30Kubernetes - decommissioning a node from the cluster<p style="text-align: justify;"> Kubernetes cluster is a group of nodes that are used to run containerized applications and services. The cluster consists of a control plane, which manages the overall state of the cluster, and worker nodes, which run the containerized applications.</p><p style="text-align: justify;">The control plane is responsible for managing the configuration and deployment of applications on the cluster, as well as monitoring and scaling the cluster as needed. It includes components such as the Kubernetes API server, the etcd datastore, the kube-scheduler, and the kube-controller-manager.</p><p style="text-align: justify;">The worker nodes are responsible for running the containerized applications and services. Each node typically runs a container runtime, such as Docker or containerd, as well as a kubelet process that communicates with the control plane to manage the containers running on the node.</p><p style="text-align: justify;">In a Kubernetes cluster, applications are deployed as pods, which are the smallest deployable units in Kubernetes. Pods contain one or more containers, and each pod runs on a single node in the cluster. Kubernetes manages the deployment and scaling of the pods across the cluster, ensuring that the workload is evenly distributed and resources are utilized efficiently.</p><p style="text-align: justify;">In Kubernetes, the native scheduler is a built-in component responsible for scheduling pods onto worker nodes in the cluster. 
When a new pod is created, the scheduler evaluates the resource requirements of the pod, along with any constraints or preferences specified in the pod's definition, and selects a node in the cluster where the pod can be scheduled. The native scheduler uses a combination of heuristics and policies to determine the best node for each pod. It considers factors such as the available resources on each node, the affinity and anti-affinity requirements of the pod, any node selectors or taints on the nodes, and the current state of the cluster. The native scheduler in Kubernetes is highly configurable and can be customized to meet the specific needs of different workloads. For example, you can configure the scheduler to prioritize certain nodes in the cluster over others, or to balance the workload evenly across all available nodes.</p><p style="text-align: justify;"><span style="background-color: #cccccc;">[sachinpb@remotehostn18 ~]$ kubectl get pods -n kube-system | grep kube-scheduler<br />kube-scheduler-remotehost18 1/1 Running 11 398d</span></p><p><b>kubectl cordon</b> is a command in Kubernetes that is used to mark a node as unschedulable. This means that Kubernetes will no longer schedule any new pods on the node, but will continue to run any existing pods on the node.</p><p style="text-align: justify;">The kubectl cordon command is useful when you need to take a node offline for maintenance or other reasons, but you want to ensure that the existing pods on the node continue to run until they can be safely moved to other nodes in the cluster. By marking the node as unschedulable, you can prevent Kubernetes from scheduling any new pods on the node, which helps to ensure that the overall health and stability of the cluster is maintained.</p><span style="background-color: #cccccc;">[sachinpb@remotenode18 ~]$ kubectl get nodes<br />NAME STATUS ROLES AGE VERSION<br />remotenode01 Ready worker 270d v1.23.4<br />remotenode02 Ready worker 270d v1.23.4<br />remotenode03 Ready worker 270d v1.23.4<br />remotenode04 Ready worker 81d v1.23.4<br />remotenode07 Ready worker 389d v1.23.4<br />remotenode08 Ready worker 389d v1.23.4<br />remotenode09 Ready worker 389d v1.23.4<br />remotenode14 Ready worker 396d v1.23.4<br />remotenode15 Ready worker 81d v1.23.4<br />remotenode16 Ready worker 396d v1.23.4<br />remotenode17 Ready worker 396d v1.23.4<br />remotenode18 Ready control-plane,master 398d v1.23.4</span><div><span style="background-color: #cccccc;"><br />[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16<br />node/remotenode16 cordoned<br />[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16<br />node/remotenode16 uncordoned</span></div><div><span style="background-color: #cccccc;"><br />[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16<br />node/remotenode16 cordoned<br />[sachinpb@remotenode18 ~]$ kubectl get nodes<br />NAME STATUS ROLES AGE VERSION<br />remotenode01 Ready worker 270d v1.23.4<br />remotenode02 Ready worker 270d v1.23.4<br />remotenode03 Ready worker 270d v1.23.4<br />remotenode04 Ready worker<br />remotenode07 Ready worker 389d v1.23.4<br />remotenode08 Ready worker 389d v1.23.4<br />remotenode09 Ready worker 389d v1.23.4<br />remotenode14 Ready worker 396d v1.23.4<br />remotenode15 Ready worker 81d v1.23.4<br />remotenode16 Ready,</span><span style="background-color: #cccccc;">SchedulingDisabled</span><span style="background-color: #cccccc;"> worker 396d v1.23.4</span></div><div><span style="background-color: #cccccc;">remotenode17 Ready worker 396d v1.23.4<br 
/>remotenode18 Ready control-plane,master 398d v1.23.4<br /></span></div><p><span style="background-color: #cccccc;">[sachinpb@remotenode18 ~]$</span> </p><p>After the node has been cordoned off, you can use the kubectl drain command to safely and gracefully terminate any running pods on the node and reschedule them onto other available nodes in the cluster. Once all the pods have been moved, the node can then be safely removed from the cluster.</p><p>kubectl drain is a command in Kubernetes that is used to gracefully remove a node from a cluster. This is typically used when performing maintenance on a node, such as upgrading or replacing hardware, or when decommissioning a node from the cluster.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhesegxzoUM7HGome1UZFUxTdl_cD4fDZw6LeQcz-Ncyfdy62xSKdFGufaZgmoq9ssZbGH5flrSf-mwrKnJKAFpmVyAJDTz5xgHcToVFKyAF5DObZYE2bXnM6666abT28AxTeqwZfPwBzxcGwxtLxYIlhBlXGUUPSB6hyrfGw-3t5fTvVQ-JN3bep_-Bg/s917/Kubectl_drain_flowchart.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="757" data-original-width="917" height="528" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhesegxzoUM7HGome1UZFUxTdl_cD4fDZw6LeQcz-Ncyfdy62xSKdFGufaZgmoq9ssZbGH5flrSf-mwrKnJKAFpmVyAJDTz5xgHcToVFKyAF5DObZYE2bXnM6666abT28AxTeqwZfPwBzxcGwxtLxYIlhBlXGUUPSB6hyrfGw-3t5fTvVQ-JN3bep_-Bg/w640-h528/Kubectl_drain_flowchart.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://kubernetes.io/images/docs/kubectl_drain.svg" target="_blank"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><br /><p style="text-align: left;"><span style="background-color: #cccccc;">[sachinpb@remotenode18 ~]$ kubectl drain --ignore-daemonsets remote16<br />node/remote16 already cordoned<br />WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db<br />node/remote16 drained<br />[sachinpb@remotenode18 ~]$</span></p><p>By default kubectl drain is non-destructive, you have to override to change that behaviour. It runs with the following defaults:</p><p style="text-align: left;"><span style="background-color: #cccccc;"> --delete-local-data=false</span><br /><span style="background-color: #cccccc;"> --force=false</span><br /><span style="background-color: #cccccc;"> --grace-period=-1 <span style="font-size: xx-small;">(</span></span><span style="font-size: xx-small;">Period of time in seconds given to each pod to terminate gracefully. If negative, the default value specified in the pod will be used.<span style="background-color: #cccccc;">)</span></span><br /><span style="background-color: #cccccc;"> --ignore-daemonsets=false</span><br /><span style="background-color: #cccccc;"> --timeout=0s</span></p><p style="text-align: justify;">Each of these safeguard deals with a different category of potential destruction (local data, bare pods, graceful termination, daemonsets). It also respects pod disruption budgets to adhere to workload availability. Any non-bare pod will be recreated on a new node by its respective controller (e.g. 
daemonset controller, replication controller). It's up to you whether you want to override that behaviour (for example you might have a bare pod if running jenkins job. If you override by setting --force=true it will delete that pod and it won't be recreated). If you don't override it, the node will be in drain mode indefinitely (--timeout=0s))</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGVb84NGCKCnhxczjWTJUgYZvx7s5oi6tykIj-SG1AueJr0OtUeD_S7e3Db7zw3A4ESCrqbUob_LgJBoSQQH2yvST5GyQxoq_bJhL92RTRckGXv1hjcGIoTqHplKjQyJLA74h_hBt2PLw1RhJ4GPZFP0NOhui58V60HnAom6p98JOhMwGjBOPAzixPJw/s851/Example-drainb.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="513" data-original-width="851" height="386" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGVb84NGCKCnhxczjWTJUgYZvx7s5oi6tykIj-SG1AueJr0OtUeD_S7e3Db7zw3A4ESCrqbUob_LgJBoSQQH2yvST5GyQxoq_bJhL92RTRckGXv1hjcGIoTqHplKjQyJLA74h_hBt2PLw1RhJ4GPZFP0NOhui58V60HnAom6p98JOhMwGjBOPAzixPJw/w640-h386/Example-drainb.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://man.archlinux.org/man/community/kubectl/kubectl-drain.1.en" target="_blank"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><p><span style="background-color: #cccccc;"></span></p><p style="text-align: justify;">When a node is drained, Kubernetes will automatically reschedule any running pods onto other available nodes in the cluster, ensuring that the workload is not interrupted. The kubectl drain command ensures that the node is cordoned off, meaning no new pods will be scheduled on it, and then gracefully terminates any running pods on the node. This helps to ensure that the pods are shut down cleanly, allowing them to complete any in-progress tasks and save any data before they are terminated.</p><p style="text-align: justify;">After the pods have been rescheduled, the node can then be safely removed from the cluster. This helps to ensure that the overall health and stability of the cluster is maintained, even when individual nodes need to be taken offline for maintenance or other reasons</p><p style="text-align: justify;">When kubectl drain returns successfully, that indicates that all of the pods have been safely evicted. It is then safe to bring down the node. 
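</p><p style="text-align: justify;">The same cordon step can also be driven through the Kubernetes API instead of the kubectl CLI. The following is a minimal sketch using the official Kubernetes Python client; the node name is illustrative, and a full drain would additionally need to evict the pods while honouring DaemonSets and pod disruption budgets, which kubectl drain handles for you.</p><p style="text-align: left;"><span style="background-color: #cccccc;"># Minimal sketch with the official Kubernetes Python client (pip install kubernetes).<br />from kubernetes import client, config<br /><br />config.load_kube_config()   # use the local kubeconfig, just like kubectl<br />v1 = client.CoreV1Api()<br /><br />node = "remotenode16"       # illustrative node name<br /># Equivalent of "kubectl cordon": mark the node unschedulable.<br />v1.patch_node(node, {"spec": {"unschedulable": True}})<br /><br /># List the pods still running on the node before starting maintenance.<br />pods = v1.list_pod_for_all_namespaces(field_selector="spec.nodeName=" + node)<br />print(str(len(pods.items)) + " pods still running on " + node)<br /><br /># Patching "unschedulable" back to False is the equivalent of "kubectl uncordon".</span></p><p style="text-align: justify;">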
After maintenance work we can use kubectl uncordon to tell Kubernetes that it can resume scheduling new pods onto the node.</p><p style="text-align: justify;"><span style="background-color: #cccccc; text-align: left;">[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16</span><br style="text-align: left;" /><span style="background-color: #cccccc; text-align: left;">node/remotenode16 uncordoned</span></p><p style="text-align: left;"><span style="background-color: white;">Let's try all the above steps and see :</span></p><p>1) Retrieve information from a Kubernetes cluster</p><p style="text-align: left;"><span style="background-color: #eeeeee;">[sachinpb@remotenode18 ~]$ kubectl get nodes<br />NAME STATUS ROLES AGE VERSION<br />remotenode01 Ready worker 270d v1.23.4<br />remotenode02 Ready worker 270d v1.23.4<br />remotenode03 Ready worker 270d v1.23.4<br />remotenode04 Ready worker 81d v1.23.4<br />remotenode07 Ready worker 389d v1.23.4<br />remotenode08 Ready worker 389d v1.23.4<br />remotenode09 Ready worker 389d v1.23.4<br />remotenode14 Ready worker 396d v1.23.4<br />remotenode15 Ready worker 81d v1.23.4<br />remotenode16 Ready worker 396d v1.23.4<br />remotenode17 Ready worker 396d v1.23.4<br />remotenode18 Ready control-plane,master 398d v1.23.4</span></p><p>--------------------------------</p><p>2) Kubernetes cordon is an operation that marks or taints a node in your existing node pool as unschedulable.</p><p style="text-align: left;"><span style="background-color: #eeeeee;">[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16<br />node/remotenode16 cordoned<br />[sachinpb@remotenode18 ~]$</span></p><p style="text-align: left;"><span style="background-color: #cccccc;">[sachinpb@remotenode18 ~]$ kubectl get nodes<br />NAME STATUS ROLES AGE VERSION<br />remotenode01 Ready worker 270d v1.23.4<br />remotenode02 Ready worker 270d v1.23.4<br />remotenode03 Ready worker 270d v1.23.4<br />remotenode04 Ready worker 81d v1.23.4<br />remotenode07 Ready worker 389d v1.23.4<br />remotenode08 Ready worker 389d v1.23.4<br />remotenode09 Ready worker 389d v1.23.4<br />remotenode14 Ready worker 396d v1.23.4<br />remotenode15 Ready worker 81d v1.23.4<br />remotenode16 Ready,SchedulingDisabled worker 396d v1.23.4<br />remotenode17 Ready worker 396d v1.23.4<br />remotenode18 Ready control-plane,master 398d v1.23.4</span></p><p>3) Drain node in preparation for maintenance. The given node will be marked unschedulable to prevent new pods from arriving. 
Then drain deletes all pods</p><p><br /></p><p style="text-align: left;"><span style="background-color: #cccccc;">[sachinpb@remotenode18 ~]$ kubectl drain remotenode16 --grace-period=2400<br />node/remotenode16 already cordoned<br />error: unable to drain node "remotenode16" due to error:cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db, continuing command...<br />There are pending nodes to be drained:<br /> remotenode16<br />cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db<br />[sachinpb@remotenode18 ~]$</span></p><p>NOTE:</p><p style="text-align: justify;">The given node will be marked unschedulable to prevent new pods from arriving. Then drain deletes all pods except mirror pods (which cannot be deleted through the API server). If there are DaemonSet-managed pods, drain will not proceed without –ignore-daemonsets, and regardless it will not delete any DaemonSet-managed pods, because those pods would be immediately replaced by the DaemonSet controller, which ignores unschedulable markings. If there are any pods that are neither mirror pods nor managed–by ReplicationController, DaemonSet or Job–, then drain will not delete any pods unless you use –force.</p><p>----------------------------</p><p>4) Drain node with --ignore-daemonsets </p><p style="text-align: left;"><span style="background-color: #cccccc;">[sachinpb@remotenode18 ~]$ kubectl drain --ignore-daemonsets remotenode16 --grace-period=2400<br />node/remotenode16 cordoned<br />WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db<br />node/remotenode16 drained</span></p><p>----------------------</p><p>5) Uncordon will mark the node as schedulable.</p><p style="text-align: left;"><span style="background-color: #cccccc;">[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16<br />node/remotenode16 uncordoned<br />[sachinpb@remotenode18 ~]$</span></p><p>-----------------</p><p>6) Retrieve information from a Kubernetes cluster</p><p style="text-align: left;"><span style="background-color: #cccccc;">[sachinpb@remotenode18 ~]$ kubectl get nodes<br />NAME STATUS ROLES AGE VERSION<br />remotenode01 Ready worker 270d v1.23.4<br />remotenode02 Ready worker 270d v1.23.4<br />remotenode03 Ready worker 270d v1.23.4<br />remotenode04 Ready worker 81d v1.23.4<br />remotenode07 Ready worker 389d v1.23.4<br />remotenode08 Ready worker 389d v1.23.4<br />remotenode09 Ready worker 389d v1.23.4<br />remotenode14 Ready worker 396d v1.23.4<br />remotenode15 Ready worker 81d v1.23.4<br />remotenode16 Ready worker 396d v1.23.4<br />remotenode17 Ready worker 396d v1.23.4<br />remotenode18 Ready control-plane,master 398d v1.23.4</span></p><p style="text-align: left;"><span><b style="background-color: white;">How to automate above process creating Jenkins pipeline job to 
cordon ,drain and uncordon the nodes with the help of groovy script</b><span style="background-color: white;">:</span></span></p><p style="text-align: left;"><span style="background-color: white;">-------------------------<span style="font-size: medium;">Sample groovy script</span>--------------------------------</span></p><p style="text-align: left;"></p><blockquote><blockquote><span style="font-size: x-small;"><b>node("Kubernetes-master-node") {<br /> stage("1") {<br /> sh 'hostname'<br /> sh 'cat $SACHIN_HOME/manual//hostfile'<br /> k8s_cordon_drain()<br /> k8s_uncordon() <br /> } <br />}<br /><br />/*<br />* CI -Kubernetes cluster : This function will cordon/drain the worker nodes in hostfile <br /><br />*/<br />def k8s_cordon_drain() {<br /> def maxTries = 3 // the maximum number of times to retry the kubectl commands<br /> def sleepTime = 5 * 1000 // the amount of time to wait between retries (in milliseconds)<br /> def filename = '$SACHIN_HOME/manual/hostfile'<br /> def content = readFile(filename)<br /> def hosts = content.readLines().collect { it.split()[0] }<br /> println "List of Hostnames to be cordoned from K8s cluster: ${hosts}"<br /> hosts.each { host -><br /> def command1 = "kubectl cordon $host"<br /> def command2 = "kubectl drain --ignore-daemonsets --grace-period=2400 $host"<br /> def tries = 0<br /> def result1 = null<br /> def result2 = null<br /> while (tries < maxTries) {<br /> result1 = sh(script: command1, returnStatus: true)<br /> if (result1 == 0) {<br /> println "Successfully cordoned $host"<br /> break<br /> } else {<br /> tries++<br /> println "Failed to cordoned $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."<br /> sleep(sleepTime)<br /> }<br /> }<br /> if (result1 == 0) {<br /> tries = 0<br /> while (tries < maxTries) {<br /> result2 = sh(script: command2, returnStatus: true)<br /> if (result2 == 0) {<br /> println "Successfully drained $host"<br /> break<br /> } else {<br /> tries++<br /> println "Failed to drain $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."<br /> sleep(sleepTime)<br /> }<br /> }<br /> }<br /><br /> if (result2 != 0) {<br /> println "Failed to drain $host after $maxTries attempts"<br /> }<br /> }<br />}<br /><br />/*<br />* CI - Kubernetes cluster : This function will uncordon the worker nodes in hostfile <br /><br />*/<br />def k8s_uncordon() {<br /> def maxTries = 3 // the maximum number of times to retry the kubectl commands<br /> def sleepTime = 5 * 1000 // the amount of time to wait between retries (in milliseconds)<br /> def filename = '$SACHIN_HOME/manual/hostfile'<br /> def content = readFile(filename)<br /> def hosts = content.readLines().collect { it.split()[0] }<br /> println "List of Hostnames to be uncordoned from K8s cluster: ${hosts}"<br /> hosts.each { host -><br /> def command1 = "kubectl uncordon $host"<br /> def tries = 0<br /> def result1 = null<br /> while (tries < maxTries) {<br /> result1 = sh(script: command1, returnStatus: true)<br /> if (result1 == 0) {<br /> println "Successfully cordoned $host"<br /> break<br /> } else {<br /> tries++<br /> println "Failed to uncordon $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."<br /> sleep(sleepTime)<br /> }<br /> }<br /> if (result1 != 0) {<br /> println "Failed to uncordon $host after $maxTries attempts"<br /> }<br /> }<br />}</b></span></blockquote></blockquote><p></p><p><span style="background-color: white;"></span></p><ol style="text-align: left;"></ol><ol style="text-align: 
left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol 
style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><ol style="text-align: left;"></ol><p></p><p><span style="background-color: white;"></span></p><p></p><p style="text-align: left;">------------------Jenkins Console output for pipeline job -----------------</p><p style="text-align: left;"><b style="font-size: x-small;">Started by user jenkins-admin</b><br /><b style="font-size: x-small;">[Pipeline] Start of Pipeline</b><br /><b style="font-size: x-small;">[Pipeline] node</b><br /><b style="font-size: x-small;">Running on Kubernetes-master-node in $SACHIN_HOME/workspace/test_sample4_cordon_drain</b><br /><b style="font-size: x-small;">[Pipeline] {</b><br /><b style="font-size: x-small;">[Pipeline] stage</b><br /><b style="font-size: x-small;">[Pipeline] { (1)</b><br /><b style="font-size: x-small;">[Pipeline] sh</b><br /><b style="font-size: x-small;">+ hostname</b><br /><span style="font-size: xx-small;"><b>kubernetes-master-node</b></span><br /><b style="font-size: x-small;">[Pipeline] sh</b><br /><b style="font-size: x-small;">+ cat $SACHIN_HOME/manual//hostfile</b><br /><b style="font-size: x-small;">Remotenode16 slots=4</b><br /><b style="font-size: x-small;">Remotenode17 slots=4</b><br /><b style="font-size: x-small;">[Pipeline] readFile</b><br /><b style="font-size: x-small;">[Pipeline] echo</b><br /><b style="font-size: x-small;">List of Hostnames to be cordoned from K8s cluster: [Remotenode16, Remotenode17]</b><br /><b style="font-size: x-small;">[Pipeline] sh</b><br /><b style="font-size: x-small;">+ kubectl cordon Remotenode16</b><br /><b style="font-size: x-small;">node/Remotenode16 cordoned</b><br /><b style="font-size: x-small;">[Pipeline] echo</b><br /><b style="font-size: x-small;">Successfully cordoned Remotenode16</b><br /><b style="font-size: x-small;">[Pipeline] sh</b><br /><b style="font-size: x-small;">+ kubectl drain --ignore-daemonsets --grace-period=2400 Remotenode16</b><br /><b style="font-size: x-small;">node/Remotenode16 already cordoned</b><br /><b style="font-size: x-small;">WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db</b><br /><b style="font-size: x-small;">node/Remotenode16 drained</b><br /><b style="font-size: x-small;">[Pipeline] echo</b><br /><b style="font-size: x-small;">Successfully drained Remotenode16</b><br /><b style="font-size: x-small;">[Pipeline] sh</b><br /><b style="font-size: x-small;">+ kubectl cordon Remotenode17</b><br /><b style="font-size: x-small;">node/Remotenode17 cordoned</b><br /><b style="font-size: x-small;">[Pipeline] echo</b><br /><b style="font-size: x-small;">Successfully cordoned Remotenode17</b><br /><b style="font-size: x-small;">[Pipeline] sh</b><br /><b style="font-size: x-small;">+ kubectl drain --ignore-daemonsets --grace-period=2400 Remotenode17</b><br /><b style="font-size: x-small;">node/Remotenode17 already cordoned</b><br /><b style="font-size: x-small;">WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-hz5zh, kube-system/fuse-device-plugin-daemonset-dj72m, kube-system/kube-proxy-g87dc, 
kube-system/nvidia-device-plugin-daemonset-tk5x8, kube-system/rdma-shared-dp-ds-n4g5w, sys-monitor/prometheus-op-prometheus-node-exporter-gczmz</b><br /><b style="font-size: x-small;">node/Remotenode17 drained</b><br /><b style="font-size: x-small;">[Pipeline] echo</b><br /><b style="font-size: x-small;">Successfully drained Remotenode17</b><br /><b style="font-size: x-small;">[Pipeline] readFile</b><br /><b style="font-size: x-small;">[Pipeline] echo</b><br /><b style="font-size: x-small;">List of Hostnames to be uncordoned from K8s cluster: [Remotenode16, Remotenode17]</b><br /><b style="font-size: x-small;">[Pipeline] sh</b><br /><b style="font-size: x-small;">+ kubectl uncordon Remotenode16</b><br /><b style="font-size: x-small;">node/Remotenode16 uncordoned</b><br /><b style="font-size: x-small;">[Pipeline] echo</b><br /><b style="font-size: x-small;">Successfully cordoned Remotenode16</b><br /><b style="font-size: x-small;">[Pipeline] sh</b><br /><b style="font-size: x-small;">+ kubectl uncordon Remotenode17</b><br /><b style="font-size: x-small;">node/Remotenode17 uncordoned</b><br /><b style="font-size: x-small;">[Pipeline] echo</b><br /><b style="font-size: x-small;">Successfully cordoned Remotenode17</b><br /><b style="font-size: x-small;">[Pipeline] }</b><br /><b style="font-size: x-small;">[Pipeline] // stage</b><br /><b style="font-size: x-small;">[Pipeline] }</b><br /><b style="font-size: x-small;">[Pipeline] // node</b><br /><b style="font-size: x-small;">[Pipeline] End of Pipeline</b><br /><b style="font-size: x-small;">Finished: SUCCESS</b></p><p><span style="background-color: white;"></span></p><p><span style="background-color: white;"></span></p><p><b><span style="font-size: x-small;">-----------------------------------------------------------------</span></b></p><p><b><span style="font-size: x-small;">Reference:</span></b></p><p><a href="https://kubernetes.io/docs/home/"><span style="font-size: xx-small;">https://kubernetes.io/docs/home/</span></a></p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-21654828773004152792023-04-13T17:14:00.012+05:302023-04-29T23:30:15.195+05:30IBM Spectrum Symphony and LSF with Apache Hadoop<p style="text-align: justify;">IBM Spectrum Symphony (formerly known as IBM Platform Symphony) is a high-performance computing (HPC) and grid computing software platform that enables organizations to process large amounts of data and run compute-intensive applications at scale. It provides a distributed computing infrastructure that can be used for a wide range of data-intensive workloads, such as scientific simulations, financial modeling, and big data analytics. IBM Spectrum Symphony is a parallel services middleware and cluster manager. It is widely used in banks for risk analytics, data analytics in a shared, multi-user, multi-application, multi-job environment. 
IBM Spectrum Symphony also works with IBM Spectrum LSF (for batch workloads) in the same cluster to allow both batch and parallel services workloads to share the same cluster.</p><p style="text-align: justify;">Some of the key features of IBM Spectrum Symphony include:</p><p style="text-align: justify;"></p><ol><li>Distributed computing: The platform allows organizations to distribute computing workloads across a large number of nodes, which can be located in different data centers or cloud environments.</li><li>Resource management: IBM Spectrum Symphony provides a resource management framework that allows organizations to allocate and manage compute, storage, and network resources more efficiently.</li><li>High availability: The platform is designed to provide high availability and fault tolerance, ensuring that applications can continue to run even if individual nodes or components fail.</li><li>Performance optimization: IBM Spectrum Symphony includes a range of performance optimization features, such as load balancing and data caching, which can help organizations to achieve faster processing times and better overall performance.</li><li>Support for multiple programming languages: The platform supports a wide range of programming languages, including Java, Python, and C++, which makes it easy for developers to build and deploy applications on the platform.</li></ol><p></p><div style="text-align: justify;">IBM Spectrum LSF (Load Sharing Facility) is another software platform that is often used in conjunction with IBM Spectrum Symphony to manage and optimize workloads in a distributed computing environment. LSF provides a range of features for resource management, workload scheduling, and job prioritization, which can help organizations to improve performance and efficiency.</div><div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">When used together, IBM Spectrum Symphony and IBM Spectrum LSF can provide a comprehensive solution for managing and optimizing large-scale distributed computing environments. IBM Spectrum Symphony provides the distributed computing infrastructure and application management capabilities, while IBM Spectrum LSF provides the workload management and optimization features.</div></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><div>Some of the key features of LSF that complement IBM Spectrum Symphony include:</div><div><ol><li>Advanced job scheduling: LSF provides sophisticated job scheduling capabilities, allowing organizations to prioritize and schedule jobs based on a wide range of criteria, such as resource availability, job dependencies, and user priorities.</li><li>Resource allocation: LSF can manage the allocation of resources, ensuring that jobs are run on the most appropriate nodes and that resources are used efficiently.</li><li>Job monitoring: LSF provides real-time monitoring of job progress and resource usage, allowing organizations to quickly identify and resolve issues that may impact performance.</li><li>Integration with other tools: LSF can be integrated with a wide range of other HPC tools and applications, including IBM Spectrum Symphony, providing a seamless workflow for managing complex computing workloads.</li></ol><div><div>Integrating LSF with Hadoop can help organizations to optimize the use of their resources and achieve better performance when running Hadoop workloads. 
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGlODtHlSDBHNPT0XUXKUnp8PBx4hm-tVFOzNhS6hedXcFq-0TZLaHEpFjKloPU9RkyFifUVuadd0fH0Z1sooLXB2u8BtkJhKM9PdBcQjeLbpllYOcON6yaWiUqH9apGI4HE0GLo4OWKpBpfadIUibOZGkw-cK-U1E8nHDm5_hvhSjIYnq84wUkwKjsg/s567/spectrum_LSF.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="167" data-original-width="567" height="94" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGlODtHlSDBHNPT0XUXKUnp8PBx4hm-tVFOzNhS6hedXcFq-0TZLaHEpFjKloPU9RkyFifUVuadd0fH0Z1sooLXB2u8BtkJhKM9PdBcQjeLbpllYOcON6yaWiUqH9apGI4HE0GLo4OWKpBpfadIUibOZGkw-cK-U1E8nHDm5_hvhSjIYnq84wUkwKjsg/w318-h94/spectrum_LSF.png" width="318" /></a><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5cWphiGmWydCMzfJtIylMUfljpYlLTLiMqOuGoWccTNL0XZIFTIsoI25RqTodRqC53BnRlPMI_Rw8oWaapaeQqfkgYlD_FKImJbU1ppzvTTaMqd6zrYLfNTFpMgWT4sEf9RMVx0z98xJ3HhqLPm1fxsVTY84pCDkI5Xmvg7qIZojAxHRbD2PNAHRihw/s2000/hadoopelephant_rgb.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1473" data-original-width="2000" height="236" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5cWphiGmWydCMzfJtIylMUfljpYlLTLiMqOuGoWccTNL0XZIFTIsoI25RqTodRqC53BnRlPMI_Rw8oWaapaeQqfkgYlD_FKImJbU1ppzvTTaMqd6zrYLfNTFpMgWT4sEf9RMVx0z98xJ3HhqLPm1fxsVTY84pCDkI5Xmvg7qIZojAxHRbD2PNAHRihw/w325-h236/hadoopelephant_rgb.png" width="325" /></a></div></div><div><div>Apache Hadoop ("Hadoop") is a framework for large-scale distributed data storage and processing on computer clusters that uses the Hadoop Distributed File System ("HDFS") for data storage and the MapReduce programming model for data processing. Since MapReduce workloads might represent only a small fraction of the overall workload but typically require their own standalone environment, MapReduce is difficult to support within traditional HPC clusters. However, HPC clusters typically use parallel file systems that are sufficient for initial MapReduce workloads, so you can run MapReduce workloads as regular parallel jobs in an HPC cluster environment. Use the IBM Spectrum LSF integration with Apache Hadoop to submit Hadoop MapReduce workloads as regular LSF parallel jobs.</div><div>To run your Hadoop application through LSF, submit it as an LSF job. Once the LSF job starts to run, the Hadoop connector script (lsfhadoop.sh) automatically provisions an open source Hadoop cluster within LSF allocated resources, then submits actual MapReduce workloads into this Hadoop cluster. Since each LSF Hadoop job has its own resource (cluster), the integration provides a multi-tenancy environment to allow multiple users to share the common pool of HPC cluster resources. LSF is able to collect resource usage of MapReduce workloads as normal LSF parallel jobs and has full control of the job life cycle. After the job is complete, LSF shuts down the Hadoop cluster.</div><div><br /></div><div>By default, the Apache Hadoop integration configures the Hadoop cluster with direct access to shared file systems and does not require HDFS. This allows you to use existing file systems in your HPC cluster without having to immediately invest in a new file system. 
Through the existing shared file system, data can be stored in common share locations, which avoids the typical data stage-in and stage-out steps with HDFS.</div></div><div><br /></div></div><div>The general steps to integrate LSF with Hadoop:</div><div><ol><li>Install and configure LSF: The first step is to install and configure LSF on the Hadoop cluster. This involves setting up LSF daemons on the cluster nodes and configuring LSF to work with the Hadoop Distributed File System (HDFS).</li><li>Configure Hadoop for LSF: Hadoop needs to be configured to use LSF as its resource manager. This involves setting the yarn.resourcemanager.scheduler.class property in the Hadoop configuration file to com.ibm.platform.lsf.yarn.LSFYarnScheduler.</li><li>Configure LSF for Hadoop: LSF needs to be configured to work with Hadoop by setting up the necessary environment variables and resource limits. This includes setting the LSF_SERVERDIR and LSF_LIBDIR environment variables to the LSF installation directory and configuring LSF resource limits to ensure that Hadoop jobs have access to the necessary resources.</li><li>Submit Hadoop jobs to LSF: Hadoop jobs can be submitted to LSF using the yarn command-line tool with the -Dmapreduce.job.submithostname and -Dmapreduce.job.queuename options set to the LSF submit host and queue, respectively.</li><li>Monitor Hadoop jobs in LSF: LSF provides a web-based user interface and command-line tools for monitoring and managing Hadoop jobs running on the cluster. This allows users to monitor job progress, resource usage, and other metrics, and to take corrective action if necessary.</li></ol></div></div><div><div><br /></div></div><div><div><b>LSF can be used as standalone workload management software for Hadoop clusters, without the need for IBM Spectrum Symphony. </b>LSF provides advanced job scheduling and resource management capabilities, which can be used to manage and optimize Hadoop workloads running on large HPC clusters. By integrating LSF with Hadoop, organizations can ensure that Hadoop jobs have access to the necessary resources and are scheduled and managed efficiently, improving overall performance and resource utilization.</div><div><br /></div><div>IBM Spectrum Symphony does, however, provide additional capabilities beyond workload management, such as distributed computing infrastructure, data movement, and integration with other data center software. If an organization requires these capabilities, it may choose to use IBM Spectrum Symphony alongside LSF for even greater benefits, but LSF can be used independently as a workload manager for Hadoop clusters.</div></div><div><div><br /></div><div>Running a Hadoop job under LSF involves creating an LSF job script that launches the Hadoop job and then submitting that script to LSF using the bsub command. LSF will then schedule the job to run on the cluster. To submit Hadoop jobs through LSF, you need to follow these general steps:</div></div><div><div><ol><li>Write the Hadoop job: First, you need to write the Hadoop job that you want to run on the cluster. This can be done using any of the Hadoop APIs, such as MapReduce, Spark, or Hive.</li><li>Create the LSF job script: Next, you need to create an LSF job script that will launch the Hadoop job on the cluster. 
This script will typically include the Hadoop command to run the job, along with any necessary environment variables, resource requirements, and other LSF-specific settings.</li><li>Submit the LSF job: Once the job script is ready, you can submit it to LSF using the bsub command. This will add the job to the LSF queue and wait for available resources to run the job.</li><li>Monitor the job: LSF provides several tools for monitoring and managing jobs running on the cluster, such as the bjobs command and the LSF web interface. You can use these tools to check the status of the job and its resource usage, and to take corrective action if necessary.</li></ol></div></div><div><div><b>Example 1: bsub command that can be used to submit a Hadoop job to an LSF-managed Hadoop cluster:</b></div><div><br /></div><div><span style="background-color: #cccccc;">bsub -J my_hadoop_job -oo my_hadoop_job.out -eo my_hadoop_job.err -R "rusage[mem=4096]" -q hadoop_queue hadoop jar my_hadoop_job.jar input_dir output_dir</span></div><div><br /></div><div>where:</div><div><span style="font-size: x-small;">-J: Specifies a name for the job. In this case, we're using "my_hadoop_job" as the job name.</span></div><div><span style="font-size: x-small;"><br /></span></div><div><span style="font-size: x-small;">-oo: Redirects the standard output of the job to a file. In this case, we're using "my_hadoop_job.out" as the output file.</span></div><div><span style="font-size: x-small;"><br /></span></div><div><span style="font-size: x-small;">-eo: Redirects the standard error of the job to a file. In this case, we're using "my_hadoop_job.err" as the error file.</span></div><div><span style="font-size: x-small;"><br /></span></div><div><span style="font-size: x-small;">-R: Specifies resource requirements for the job. In this case, we're requesting 4 GB of memory (mem=4096) for the job.</span></div><div><span style="font-size: x-small;"><br /></span></div><div><span style="font-size: x-small;">-q: Specifies the LSF queue to submit the job to. In this case, we're using the "hadoop_queue" LSF queue.</span></div><div><br /></div><div>After the bsub command options, we specify the Hadoop command to run the job (hadoop jar my_hadoop_job.jar) and the input and output directories for the job (input_dir and output_dir). This will submit the Hadoop job to LSF, which will then schedule and manage the job on the Hadoop cluster. For more details, please refer to the IBM documentation links later in this post.</div></div>
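<div><br /></div><div><div>The steps above mention an LSF job script, while Example 1 passes everything on the bsub command line. The same submission can also be expressed as a job script with embedded #BSUB directives; the following is only a sketch, and the script name (my_hadoop_job.sh) and the Hadoop path are illustrative assumptions:</div><div><br /></div><div><span style="background-color: #cccccc; font-size: x-small;">#!/bin/bash<br />#BSUB -J my_hadoop_job<br />#BSUB -q hadoop_queue<br />#BSUB -oo my_hadoop_job.out<br />#BSUB -eo my_hadoop_job.err<br />#BSUB -R "rusage[mem=4096]"<br /><br /># Environment required by the job (path is illustrative)<br />export HADOOP_HOME=/path/to/hadoop<br /><br /># The actual Hadoop command that LSF runs on the allocated resources<br />$HADOOP_HOME/bin/hadoop jar my_hadoop_job.jar input_dir output_dir</span></div><div><br /></div><div><span style="font-size: x-small;">Submitting the script with "bsub &lt; my_hadoop_job.sh" lets LSF read the embedded #BSUB lines as if the same options had been passed on the bsub command line.</span></div></div><div><br /></div><div><div><b>Example 2: How to submit a Hadoop job using bsub command with LSF?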
</b></div><div><br /></div><div><span style="background-color: #eeeeee;">bsub -q hadoop -J "Hadoop Job" -n 10 -o hadoop.log -hadoop /path/to/hadoop/bin/hadoop jar /path/to/hadoop/examples.jar pi 10 1000</span></div><div><br /></div><div>This command will submit a Hadoop job to the LSF scheduler and allocate resources as necessary based on the job's requirements.</div><div><br /></div><div>where:</div><div><span style="font-size: x-small;">-q hadoop specifies that the job should be submitted to the Hadoop queue.<br />-J "Hadoop Job" specifies a name for the job.<br />-n 10 specifies the number of cores to use for the job.<br />-o hadoop.log specifies the name of the output log file.<br />-hadoop specifies that the command that follows should be executed on a Hadoop cluster.<br />/path/to/hadoop/bin/hadoop specifies the path to the Hadoop executable.<br />jar /path/to/hadoop/examples.jar pi 10 1000 specifies the command to run the Hadoop job, which in this case is the pi example program with 10 mappers and 1000 samples.</span></div></div><div><br /></div><div><b>Example 3: How to submit a wordcount MapReduce job using bsub with LSF ?</b></div><div><br /></div><div><span style="background-color: #cccccc;">bsub -q hadoop -J "MapReduce Job" -n 10 -o mapreduce.log -hadoop /path/to/hadoop/bin/hadoop jar /path/to/hadoop/examples.jar wordcount /input/data /output/data</span></div><div><br /></div><div>where:</div><div><span style="font-size: x-small;">-q hadoop specifies that the job should be submitted to the Hadoop queue.</span></div><div><span style="font-size: x-small;">-J "MapReduce Job" specifies a name for the job.</span></div><div><span style="font-size: x-small;">-n 10 specifies the number of cores to use for the job.</span></div><div><span style="font-size: x-small;">-o mapreduce.log specifies the name of the output log file.</span></div><div><span style="font-size: x-small;">-hadoop specifies that the command that follows should be executed on a Hadoop cluster.</span></div><div><span style="font-size: x-small;">/path/to/hadoop/bin/hadoop specifies the path to the Hadoop executable.</span></div><div><span style="font-size: x-small;">jar /path/to/hadoop/examples.jar wordcount /input/data /output/data specifies the command to run the MapReduce job, which in this case is the wordcount example program with input data in /input/data and output data in /output/data.</span></div><div><br /></div><div><div><div><b>Example 4: How to submit a terasort MapReduce job using bsub with LSF?</b></div><div><br /></div><div><span style="background-color: #cccccc;">bsub -q hadoop -J "MapReduce Job" -n 20 -o mapreduce.log -hadoop /path/to/hadoop/bin/hadoop jar /path/to/hadoop/examples.jar terasort -Dmapred.map.tasks=100 -Dmapred.reduce.tasks=50 /input/data /output/data</span></div><div><span style="font-size: x-small;">where:</span></div><div><span style="font-size: x-small;">-q hadoop specifies that the job should be submitted to the Hadoop queue.</span></div><div><span style="font-size: x-small;">-J "MapReduce Job" specifies a name for the job.</span></div><div><span style="font-size: x-small;">-n 20 specifies the number of cores to use for the job.</span></div><div><span style="font-size: x-small;">-o mapreduce.log specifies the name of the output log file.</span></div><div><span style="font-size: x-small;">-hadoop specifies that the command that follows should be executed on a Hadoop cluster.</span></div><div><span style="font-size: x-small;">/path/to/hadoop/bin/hadoop specifies the path to the Hadoop 
executable.</span></div><div><span style="font-size: x-small;">jar /path/to/hadoop/examples.jar terasort -Dmapred.map.tasks=100 -Dmapred.reduce.tasks=50 /input/data /output/data specifies the command to run the MapReduce job, which in this case is the terasort example program with input data in /input/data and output data in /output/data, and specific configuration parameters to control the number of map and reduce tasks.</span></div></div><div><br /></div><div><div><b>Example 5: How to submit a grep MapReduce job using bsub with LSF?</b></div><div><br /></div><div><span style="background-color: #cccccc;">bsub -q hadoop -J "MapReduce Job" -n 30 -o mapreduce.log -hadoop /path/to/hadoop/bin/hadoop jar /path/to/hadoop/examples.jar grep -input /input/data -output /output/data -regex "example.*"</span></div><div><div><span style="font-size: x-small;">where:</span></div><div><span style="font-size: x-small;">-q hadoop specifies that the job should be submitted to the Hadoop queue.</span></div><div><span style="font-size: x-small;">-J "MapReduce Job" specifies a name for the job.</span></div><div><span style="font-size: x-small;">-n 30 specifies the number of cores to use for the job.</span></div><div><span style="font-size: x-small;">-o mapreduce.log specifies the name of the output log file.</span></div><div><span style="font-size: x-small;">-hadoop specifies that the command that follows should be executed on a Hadoop cluster.</span></div><div><span style="font-size: x-small;">/path/to/hadoop/bin/hadoop specifies the path to the Hadoop executable.</span></div><div><span style="font-size: x-small;">jar /path/to/hadoop/examples.jar grep -input /input/data -output /output/data -regex "example.*" specifies the command to run the MapReduce job, which in this case is the grep example program with input data in /input/data, output data in /output/data, and a regular expression pattern to search for.</span></div></div></div><div><br /></div><div><div><b>Example 6: How to submit a non MapReduce hadoop job using bsub with LSF?</b></div><div><b style="background-color: #cccccc;"><br /></b></div><div><span style="background-color: #cccccc;">bsub -q hadoop -J "Hadoop Job" -n 10 -o hadoopjob.log -hadoop /path/to/hadoop/bin/hadoop fs -rm -r /path/to/hdfs/directory</span></div><div><br /></div><div><span style="font-size: x-small;">where:</span></div><div><span style="font-size: x-small;">-q hadoop specifies that the job should be submitted to the Hadoop queue.</span></div><div><span style="font-size: x-small;">-J "Hadoop Job" specifies a name for the job.</span></div><div><span style="font-size: x-small;">-n 10 specifies the number of cores to use for the job.</span></div><div><span style="font-size: x-small;">-o hadoopjob.log specifies the name of the output log file.</span></div><div><span style="font-size: x-small;">-hadoop specifies that the command that follows should be executed on a Hadoop cluster.</span></div><div><span style="font-size: x-small;">/path/to/hadoop/bin/hadoop fs -rm -r /path/to/hdfs/directory specifies the command to run the Hadoop job, which in this case is to remove a directory in HDFS at /path/to/hdfs/directory.</span></div><div><span style="font-size: x-small;">This command will submit a non-MapReduce Hadoop job to the LSF scheduler and allocate resources as necessary based on the job's requirement</span></div></div><div><br /></div><div><div><br /></div><div><b>Example 7: If you have a Hadoop cluster with YARN and Spark installed, you can submit Spark jobs to the cluster using 
bsub as shown in the example.</b></div><div><br /></div><div><span style="background-color: #cccccc;">bsub -q normal -J "Spark Job" -n 20 -o sparkjob.log /path/to/spark/bin/spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster /path/to/my/app.jar arg1 arg2</span></div><div><span style="font-size: x-small;">where:</span></div><div><span style="font-size: x-small;">-q normal specifies that the job should be submitted to the normal queue.</span></div><div><span style="font-size: x-small;">-J "Spark Job" specifies a name for the job.</span></div><div><span style="font-size: x-small;">-n 20 specifies the number of cores to use for the job.</span></div><div><span style="font-size: x-small;">-o sparkjob.log specifies the name of the output log file.</span></div><div><span style="font-size: x-small;">/path/to/spark/bin/spark-submit specifies the path to the spark-submit script.</span></div><div><span style="font-size: x-small;">--class com.example.MyApp specifies the main class of the Spark application.</span></div><div><span style="font-size: x-small;">--master yarn --deploy-mode cluster specifies the mode to run the application in.</span></div><div><span style="font-size: x-small;">/path/to/my/app.jar arg1 arg2 specifies the path to the application jar file and its arguments.</span></div><div><br /></div><div>The above example does not explicitly require Hadoop to be installed or used. However, it assumes that the Spark cluster is running in YARN mode, which is typically used in a Hadoop cluster. In general, Spark can be run in various modes, including standalone, YARN, and Mesos. There are various other parameters and configurations that can be specified. Some examples include:</div></div><div><div><span style="font-size: x-small;">--num-executors: Specifies the number of executor processes to use for the job.</span></div><div><span style="font-size: x-small;">--executor-cores: Specifies the number of cores to allocate per executor.</span></div><div><span style="font-size: x-small;">--executor-memory: Specifies the amount of memory to allocate per executor.</span></div><div><span style="font-size: x-small;">--driver-memory: Specifies the amount of memory to allocate for the driver process.</span></div><div><span style="font-size: x-small;">--queue: Specifies the YARN queue to submit the job to.</span></div><div><span style="font-size: x-small;">--files: Specifies a comma-separated list of files to be distributed with the job.</span></div><div><span style="font-size: x-small;">--archives: Specifies a comma-separated list of archives to be distributed with the job.</span></div><div><span style="font-size: x-small;"><br /></span></div><div>These parameters can be used to fine-tune the resource allocation and performance of Spark jobs in a Hadoop cluster. 
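For example, extending Example 7 with some of these tuning options might look like the following sketch (the paths, class name, queue, and values are illustrative):<br /><br /><span style="background-color: #cccccc; font-size: x-small;">bsub -q normal -J "Spark Job" -n 20 -o sparkjob.log /path/to/spark/bin/spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster --num-executors 10 --executor-cores 2 --executor-memory 4g --driver-memory 2g --queue default /path/to/my/app.jar arg1 arg2</span><br /><br />Here the 10 executors with 2 cores each line up with the 20 slots requested from LSF through -n 20.<br /><br />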
Additionally, there are other options that can be used to configure the behavior of the Spark application itself, such as --conf to specify Spark configuration options and --jars to specify external JAR files to be used by the application</div></div><div><br /></div><div><div>Here is an example LSF configuration file (lsf.conf) that includes settings for running Spark applications:</div><div><span style="background-color: #eeeeee; font-size: x-small;"># LSF Configuration File</span></div><div><span style="background-color: #eeeeee; font-size: x-small;"># Spark settings</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">LSB_JOB_REPORT_MAIL=N</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">LSB_DEFAULTGROUP=spark</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">LSB_DEFAULTJOBGROUP=spark</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">LSB_JOB_ACCOUNTING_INTERVAL=60</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">LSB_SUB_LOGLEVEL=3</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">LSB_JOB_PROLOGUE="/opt/spark/current/bin/load-spark-env.sh"</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">LSB_JOB_WRAPPER="mpirun -n 1 $LSF_BINDIR/lsb.wrapper $LSB_BINARY_NAME"</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">LSB_HOSTS_TASK_MODEL=cpu</span></div><div><br /></div><div><br /></div><div>An example Spark configuration file (spark-defaults.conf) that includes settings for running Spark applications using LSF:</div><div># Spark Configuration File</div><div># LSF settings</div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.master=yarn</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.submit.deployMode=cluster</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.yarn.queue=default</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.executor.instances=2</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.executor.memory=2g</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.executor.cores=2</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.driver.memory=1g</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.driver.cores=1</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.yarn.am.memory=1g</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.yarn.am.cores=1</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.yarn.maxAppAttempts=2</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.eventLog.enabled=true</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.eventLog.dir=hdfs://namenode:8020/spark-event-logs</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.history.fs.logDirectory=hdfs://namenode:8020/spark-event-logs</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.scheduler.mode=FAIR</span></div><div><span style="background-color: #eeeeee; font-size: x-small;">spark.serializer=org.apache.spark.serializer.KryoSerializer</span></div><div><br /></div><div>This configuration file sets several parameters for running 
Spark applications on a YARN cluster managed by LSF, including specifying the number of executor instances, executor memory, and executor cores, as well as setting the queue and memory allocation for the Spark ApplicationMaster.</div></div><div><br /></div><div><a href="https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=hadoop-configure-apache-integration" target="_blank">Configure the Apache Hadoop integration</a></div><div><br /></div><div><a href="https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=hadoop-run-application-lsf" target="_blank">Run a Hadoop application on LSF</a></div></div><div><br /></div><div><div>Using LSF as the scheduler for Hadoop can provide better resource utilization, job scheduling, queuing, integration with other workloads, and monitoring and management capabilities than the built-in YARN scheduler. This can help improve the performance, scalability, and efficiency of Hadoop clusters, especially in large, complex environments.</div><div><ol><li>Better resource utilization: LSF has advanced resource allocation and scheduling algorithms that can improve resource utilization in Hadoop clusters. This can lead to better performance and reduced infrastructure costs.</li><li>Better job scheduling: LSF has more advanced job scheduling features than YARN, such as support for job dependencies, job preemption, and priority-based job scheduling. This can help optimize job execution and reduce waiting times.</li><li>Advanced queuing: LSF allows for more flexible and advanced queuing mechanisms, including job prioritization and preemption, multiple queues with different priorities, and customizable scheduling policies.</li><li>Integration with other workloads: LSF is a general-purpose job scheduler that can be used to manage a wide range of workloads, including Hadoop, MPI, and other distributed computing frameworks. This allows for better integration and coordination of workloads on the same infrastructure.</li><li>Advanced monitoring and management: LSF provides more advanced monitoring and management tools than YARN, including web-based interfaces, command-line tools, and APIs for job management, resource monitoring, and performance analysis.</li></ol><div><div>LSF is a versatile job scheduler that can be used for a wide range of workloads, including batch and real-time scheduling. While LSF is often used for batch scheduling workloads, it can also be used for real-time scheduling workloads like Apache Kafka, thanks to its advanced scheduling capabilities and integration capabilities with other distributed computing frameworks.</div><div><br /></div><div>LSF has advanced scheduling capabilities that can help optimize the allocation of resources for real-time workloads, including support for job prioritization, preemption, and multiple queues with different priorities. This can help ensure that real-time workloads are allocated the necessary resources in a timely and efficient manner.</div><div><br /></div><div>Furthermore, LSF has integration capabilities with other distributed computing frameworks like Apache Kafka. For example, LSF can be used to manage the resource allocation and scheduling of Kafka brokers, consumers, and producers. This can help optimize the performance and scalability of Kafka clusters.</div></div></div></div><div><br /></div><div><div>Examples for applications with real time scheduling:</div><div><ol><li>A major financial services company uses Hadoop and LSF to process real-time financial data. 
LSF is used to manage the allocation of compute resources for Hadoop, including managing the cluster's memory, CPU, and disk resources. This setup enables the company to process real-time financial data with low latency and high throughput.</li><li>A large e-commerce company uses Hadoop and LSF to process large volumes of customer data in real-time. LSF is used to schedule and manage jobs across multiple Hadoop clusters, optimizing the allocation of resources to ensure that real-time processing is prioritized. This setup enables the company to personalize customer experiences and deliver targeted marketing campaigns in real-time.</li><li>A global telecommunications company uses Hadoop and LSF to process real-time data from its network infrastructure. LSF is used to manage job scheduling and resource allocation, ensuring that data is processed quickly and efficiently. This setup enables the company to monitor and optimize network performance in real-time, providing a better customer experience.</li></ol></div><div><br /></div><div>Overall, the combination of Hadoop and LSF can provide a powerful and flexible platform for processing both historical as well as real-time data in production environments. By leveraging the advanced resource management and scheduling capabilities of LSF, organizations can optimize performance, reduce latency, and improve the overall efficiency of their Hadoop clusters.</div></div><div><br /></div><div><span style="font-size: x-small;"><b>Reference:</b></span></div><div><div><ul><li><span style="font-size: xx-small;">https://www.sachinpbuzz.com/2019/08/spectum-lsf-multicluster-job-forwarding.html</span></li><li><span style="font-size: xx-small;">https://www.sachinpbuzz.com/2021/08/spectrum-lsf-101-installation-and.html</span></li><li><span style="font-size: xx-small;">https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=hadoop-about-lsf-apache</span></li><li><span style="font-size: xx-small;">https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=hadoop-run-application-lsf</span></li><li><span style="font-size: xx-small;"><a href="https://hadoop.apache.org/">https://hadoop.apache.org/</a></span></li><li><span style="font-size: xx-small;"><a href="https://www.edureka.co/blog/videos/hadoop-architecture/">https://www.edureka.co/blog/videos/hadoop-architecture/</a></span></li></ul></div></div></div>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-77286413895798121442023-04-04T18:31:00.025+05:302023-08-31T10:02:39.023+05:30Linux Test Harness : avocado and op-test framework<p style="text-align: justify;">A Test Harness, also known as a testing framework or testing tool, is a software tool or library that provides a set of functions, APIs, or interfaces for writing, organizing, and executing tests. Test harnesses provide a structured way to write tests and automate the testing process. <br /></p><p style="text-align: justify;">Linux avocado test framework and Linux op-test framework are both open-source testing frameworks designed for testing and validating Linux-based systems. Both frameworks are widely used in the Linux community and have a strong user base. 
The choice between the two depends on the specific testing needs and requirements of the user.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivk3LOMZyjAOBj_xmF9JHAlhl4jZ6XI60m4g7mR1do_OfveY44GoRWCnF0LaF8HWAvsFIxzi17d0AWXLWA7ZRsv_LlVG3jzmORXyl3NUJXkuNsJ4u5aCrftpOM1dwrjcMvxjTTiX39p4rl-90AbW_ndufiMw3u7n5XBbHHn_oWdyvNBgNHRqdTy1fUcw/s200/avocado.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="200" data-original-width="200" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivk3LOMZyjAOBj_xmF9JHAlhl4jZ6XI60m4g7mR1do_OfveY44GoRWCnF0LaF8HWAvsFIxzi17d0AWXLWA7ZRsv_LlVG3jzmORXyl3NUJXkuNsJ4u5aCrftpOM1dwrjcMvxjTTiX39p4rl-90AbW_ndufiMw3u7n5XBbHHn_oWdyvNBgNHRqdTy1fUcw/s1600/avocado.png" width="200" /></a></div><p></p><p style="text-align: justify;">The Linux avocado test framework is a modular and extensible testing framework that allows users to write and run tests for different levels of the Linux stack, including the kernel, user space, and applications. It provides a wide range of plugins and tools for testing, including functional, performance, and integration testing. The framework is easy to install and use and supports multiple test runners and reporting formats.</p><p style="text-align: justify;">On the other hand, the Linux op-test framework is a set of Python libraries and utilities that automate the testing of hardware and firmware components in Linux-based systems. It provides a high-level Python API for interacting with hardware and firmware interfaces, as well as a set of pre-built tests for validating various hardware components such as CPU, memory, and storage. The framework is highly flexible and customizable, allowing users to create their own tests and integrate with other testing tools and frameworks.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: justify;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdIGq0WLjWvhjSV60WyVCgdl9C4iaJHURw_dlXZnZoiaB2uTVu0eR6j9d-oRW5Toe0XJW6SpNxCKdchsuq0a6h7cZqZpRZME1yYQXpRFH_RvfLGfa8pIeeZ7clyNINvH9Rl3nBmzIQ3sIGGRSOLJXLb9zu0cgdwksRpzsVJBmN0YftdhYoPhzgyWMnCQ/s279/openpower.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="180" data-original-width="279" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdIGq0WLjWvhjSV60WyVCgdl9C4iaJHURw_dlXZnZoiaB2uTVu0eR6j9d-oRW5Toe0XJW6SpNxCKdchsuq0a6h7cZqZpRZME1yYQXpRFH_RvfLGfa8pIeeZ7clyNINvH9Rl3nBmzIQ3sIGGRSOLJXLb9zu0cgdwksRpzsVJBmN0YftdhYoPhzgyWMnCQ/s1600/openpower.png" width="279" /></a></div><p></p></blockquote></blockquote><p style="text-align: justify;">While both frameworks are designed for testing Linux-based systems, the Linux avocado test framework provides a broad range of testing capabilities across different levels of the Linux stack, while the Linux op-test framework focuses specifically on automating hardware and firmware testing. The choice between the two depends on the specific testing needs and requirements of the user.</p><p style="text-align: justify;">The Linux avocado test framework provides a plugin called "avocado-vt" which can be used to run tests that require a reboot between different test stages. 
This plugin enables the framework to run destructive tests, like kernel crash dump (kdump) testing, that require the system to be rebooted multiple times.</p><p style="text-align: justify;">Similarly, the Linux op-test framework also provides support for testing scenarios that require system reboot. The framework includes a "reboot" library that allows users to reboot the system under test and wait for it to come back up before continuing with the test. This library can be used to test scenarios like kdump and fadump that require system reboot.</p><p style="text-align: justify;"><b>The community maintained avocado tests repository</b>:</p><p style="text-align: justify;">Avocado is a set of tools and libraries to help with automated testing. One can call it a test framework with benefits. Native tests are written in Python and they follow the unittest pattern, but any executable can serve as a test.</p><p style="text-align: justify;">This repository contains a collection of miscellaneous tests and plugins for the Linux Avocado test framework that cover a wide range of functional, performance, and integration testing scenarios. The tests are designed to be modular and easy to use, and can be integrated with the Avocado test framework to extend its capabilities.</p><p style="text-align: justify;"><a href="https://github.com/avocado-framework-tests/avocado-misc-tests">https://github.com/avocado-framework-tests/avocado-misc-tests</a></p><p style="text-align: justify;"><b>How to run avocado misc tests :</b></p><p style="text-align: justify;">To run the Avocado Misc Tests, you first need to install the Linux Avocado test framework on your system. Once you have installed the framework, you can clone the Avocado Misc Tests repository from GitHub by running the following command in a terminal:</p><p style="text-align: justify;">git clone https://github.com/avocado-framework-tests/avocado-misc-tests.git </p><p style="text-align: justify;">git clone git@github.com:avocado-framework-tests/avocado-misc-tests.git</p><p style="text-align: justify;"><span style="background-color: #cccccc; font-size: x-small;"># git clone git@github.com:avocado-framework-tests/avocado-misc-tests.git<br />Cloning into 'avocado-misc-tests'...<br />remote: Enumerating objects: 18087, done.<br />remote: Counting objects: 100% (451/451), done.<br />remote: Compressing objects: 100% (239/239), done.<br />remote: Total 18087 (delta 242), reused 368 (delta 208), pack-reused 17636<br />Receiving objects: 100% (18087/18087), 6.15 MiB | 16.67 MiB/s, done.<br />Resolving deltas: 100% (11833/11833), done.<br />#</span></p><p style="text-align: justify;">This repository is dedicated to host any tests written using the Avocado API. It is being initially populated with tests ported from autotest client tests repository, but it's not limited by that.</p><p style="text-align: justify;">After cloning the repository, you can navigate to the avocado-misc-tests directory and run the tests using the avocado run command. For example, to run all the tests in the network category, you can run the following command:</p><p style="text-align: justify;"><span style="background-color: #cccccc;">cd avocado-misc-tests<br />avocado run network/</span></p><p style="text-align: justify;">This will run all the tests in the network category. 
You can also run individual tests by specifying the path to the test file, like this:</p><p style="text-align: justify;"><span style="background-color: #cccccc;">avocado run network/test_network_ping.py</span></p><p style="text-align: justify;">This will run the test_network_ping.py test in the network category.</p><p style="text-align: justify;">Before running the tests, you may need to configure the Avocado framework to use the appropriate test runner, test environment, and plugins for your system. You can find more information on how to configure and use the Avocado framework in the official documentation: </p><p style="text-align: justify;">https://avocado-framework.readthedocs.io/en/latest/</p><p style="text-align: justify;"><span style="background-color: #cccccc; font-size: x-small;">$ avocado run avocado-misc-tests/generic/stress.py<br />JOB ID : 0018adbc07c5d90d242dd6b341c87972b8f77a0b<br />JOB LOG : $HOME/avocado/job-results/job-2016-01-18T15.32-0018adb/job.log<br />TESTS : 1<br /> (1/1) avocado-misc-tests/generic/stress.py:Stress.test: PASS (62.67 s)<br />RESULTS : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0<br />JOB HTML : $HOME/avocado/job-results/job-2016-01-18T15.32-0018adb/html/results.html<br />TIME : 62.67 s</span></p><p style="text-align: justify;">There are a few more interesting things about the Avocado test framework and its usability and use cases:</p><p style="text-align: justify;"></p><p style="text-align: justify;"><span style="background-color: #cccccc; font-size: x-small;"></span></p><p></p><p></p><ol><li>Flexible test design: The Avocado test framework is designed to be flexible and adaptable to a wide range of testing scenarios. It supports various test types, including functional, integration, performance, and stress tests, and can be used to test software at different levels of abstraction, from system-level to individual components. Avocado also provides a wide range of plugins and interfaces for integrating with other tools and frameworks, making it easy to customize and extend its capabilities.</li><li>Easy to use: Avocado is designed to be easy to use, even for users who are new to testing or have limited programming experience. It uses a simple YAML-based syntax for defining tests and test plans, and provides a user-friendly command-line interface for running tests and viewing results. Avocado also includes detailed documentation and tutorials to help users get started quickly.</li><li>Scalability and distributed testing: Avocado supports distributed testing across multiple systems, making it easy to scale up testing to handle large workloads. It includes a built-in job scheduler for managing test execution across multiple systems, and can be integrated with various cloud-based services for running tests in the cloud.</li><li>Community support: Avocado is an open-source project maintained by a vibrant community of developers and testers. The community provides regular updates and bug fixes, and is actively involved in improving the usability and functionality of the framework. The Avocado community also provides support through various channels, including GitHub, mailing lists, and IRC.</li><li>Use cases: Avocado is used by various organizations and companies for testing different types of software, including operating systems, virtualization platforms, container platforms, and cloud services. It is particularly well-suited for testing complex, distributed systems that require a high degree of automation and scalability. 
Some of the organizations that use Avocado include Red Hat, IBM, Intel, and Huawei.</li></ol><p style="text-align: justify;"><span style="font-size: x-small;"><b>License</b></span></p><p style="text-align: justify;"><span style="font-size: x-small;">Except where otherwise indicated in a given source file, all original contributions to Avocado are licensed under the GNU General Public License version 2 (GPLv2) or any later version. </span><span style="font-size: small;">By contributing you agree that these contributions are your own (or approved by your employer) and you grant a full, complete, irrevocable copyright license to all users and developers of the Avocado project, present and future, pursuant to the license of the project.</span></p><p style="text-align: justify;">================</p><p style="text-align: justify;">The community-maintained op-test repository:</p><p style="text-align: justify;"><a href="https://github.com/open-power/op-test">https://github.com/open-power/op-test</a></p><p style="text-align: justify;">git clone git@github.com:open-power/op-test.git</p><p style="text-align: justify;"><span style="background-color: #cccccc;"># git clone git@github.com:open-power/op-test.git<br />Cloning into 'op-test'...<br />remote: Enumerating objects: 8716, done.<br />remote: Counting objects: 100% (623/623), done.<br />remote: Compressing objects: 100% (275/275), done.<br />remote: Total 8716 (delta 416), reused 480 (delta 347), pack-reused 8093<br />Receiving objects: 100% (8716/8716), 23.89 MiB | 23.39 MiB/s, done.<br />Resolving deltas: 100% (6488/6488), done.<br />#</span></p><div style="text-align: justify;"><b><span style="font-size: x-small;">Pre-requisites for op-tests: </span></b></div><div style="text-align: justify;"><span style="font-size: x-small;">1) yum install sshpass </span></div><div style="text-align: justify;"><span style="font-size: x-small;">2) pip3 install pexpect</span></div><div style="text-align: justify;"><span style="font-size: x-small;">3) echo "set enable-bracketed-paste off" > .inputrc ; export INPUTRC=$PWD/.inputrc (alternatively, run: bind 'set enable-bracketed-paste off')</span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">How to run a testcase:</div><p style="text-align: justify;"><b> ./op-test -c machine.conf --run testcases.RunHostTest --host-cmd ls</b></p><p style="text-align: justify;"><span style="font-size: x-small;"><b>Testcase: </b>https://github.com/open-power/op-test/blob/master/testcases/RunHostTest.py</span></p><p style="text-align: justify;"><span style="font-size: x-small;"><b>where machine.conf:</b></span></p><p style="text-align: justify;"><span style="background-color: #cccccc;"><span style="font-size: x-small;">[op-test]</span><br /><span style="font-size: x-small;">bmc_type=OpenBMC/EBMC_PHYP/FSP_PHYP</span><br /><span style="font-size: small;">bmc_ip=w39</span><br /><span style="font-size: small;">bmc_username=root</span><br /><span style="font-size: small;">bmc_password=0penBmc</span><br /><span style="font-size: small;">hmc_ip=a.b.c.d</span><br /><span style="font-size: small;">hmc_username=hmcuser</span><br /><span style="font-size: small;">hmc_password=hmcpasswd123</span><br /><span style="font-size: small;">host_ip=x.y.x.k</span><br /><span style="font-size: 
small;">host_user=hostuser</span><br /><span style="font-size: small;">host_password=hostpasswd123</span><br /><span style="font-size: small;">system_name=power10</span><br /><span style="font-size: small;">lpar_name=lpar_name_1</span><br /><span style="font-size: small;">lpar_prof=default_profile</span></span></p><p style="text-align: justify;"><span><span style="background-color: white; font-size: small;">CASE2:</span></span></p><p style="text-align: justify;"><span><span style="background-color: white; font-size: small;">./op-test -c machine_ltcever7x0.config --run testcases.RunHostTest --host-cmd-file cmd.conf</span></span></p><p style="text-align: justify;"><span><span style="background-color: white; font-size: small;">where : </span></span></p><p style="text-align: justify;"><span style="font-size: x-small;"># cat cmd.conf<br />echo "welcome SACHIN P B"<br />hostname<br />uptime<br />date<br />#</span></p><p style="text-align: justify;"><span style="font-size: x-small;">OUTPUT:</span></p><p style="text-align: justify;"><span style="font-size: x-small;">----------------------------</span></p><p style="text-align: justify;"><span style="font-size: x-small;"># ./op-test -c machine_ltcever7x0.config --run testcases.RunHostTest --host-cmd-file cmd.conf</span><br /><span style="font-size: small;">Logs in: /root/fix_kdump_FalsePositives/op-test/test-reports/test-run-20230829095648</span><br /><span style="font-size: small;">2023-08-29 09:56:48,758:op-test:setUpLoggerFile:INFO:Preparing to set location of Log File to /root/fix_kdump_FalsePositives/op-test/test-reports/test-run-20230829095648/20230829145648758035.main.log</span><br /><span style="font-size: small;">2023-08-29 09:56:48,758:op-test:setUpLoggerFile:INFO:Log file: /root/fix_kdump_FalsePositives/op-test/test-reports/test-run-20230829095648/20230829145648758035.main.log</span><br /><span style="font-size: small;">2023-08-29 09:56:48,758:op-test:setUpLoggerDebugFile:INFO:Preparing to set location of Debug Log File to /root/fix_kdump_FalsePositives/op-test/test-reports/test-run-20230829095648/20230829145648758291.debug.log</span><br /><span style="font-size: small;">[console-expect]#which whoami && whoami</span><br /><span style="font-size: small;">/usr/bin/whoami</span><br /><span style="font-size: small;">root</span><br /><span style="font-size: small;">[console-expect]#echo $?</span><br /><span style="font-size: small;">echo $?</span><br /><span style="font-size: small;">0</span><br /><span style="font-size: small;">[console-expect]#echo "welcome SACHIN P B"</span><br /><span style="font-size: small;">echo "welcome SACHIN P B"</span><br /><span style="font-size: small;">welcome SACHIN P B</span><br /><span style="font-size: small;">[console-expect]#echo $?</span><br /><span style="font-size: small;">echo $?</span><br /><span style="font-size: small;">0</span><br /><span style="font-size: small;">[console-expect]#hostname</span><br /><span style="font-size: small;">hostname</span><br /><span style="font-size: small;">myhost.com</span><br /><span style="font-size: small;">[console-expect]#echo $?</span><br /><span style="font-size: small;">echo $?</span><br /><span style="font-size: small;">0</span><br /><span style="font-size: small;">[console-expect]#uptime</span><br /><span style="font-size: small;">uptime</span><br /><span style="font-size: small;"> 09:58:15 up 7:50, 2 users, load average: 0.08, 0.02, 0.01</span><br /><span style="font-size: small;">[console-expect]#echo $?</span><br /><span style="font-size: small;">echo 
$?</span><br /><span style="font-size: small;">0</span><br /><span style="font-size: small;">[console-expect]#date</span><br /><span style="font-size: small;">date</span><br /><span style="font-size: small;">Tue Aug 29 09:58:15 CDT 2023</span><br /><span style="font-size: small;">[console-expect]#echo $?</span><br /><span style="font-size: small;">echo $?</span><br /><span style="font-size: small;">0</span><br /><span style="font-size: small;">ok</span><br /><span style="font-size: small;">Ran 1 test in 7.510s</span><br /><span style="font-size: small;">OK</span><br /><span style="font-size: small;">2023-08-29 09:58:17,787:op-test:&lt;module&gt;:INFO:Exit with Result errors="0" and failures="0"</span></p><p style="text-align: justify;"><span style="font-size: x-small;">------------------------------------------------------------------------------------------------------------------</span></p><p style="text-align: justify;"><span style="font-size: x-small;">Example 2: </span></p><p style="text-align: justify;"><span style="font-size: x-small;">python3 op-test -c machine.conf --run testcases.PowerNVDump.KernelCrash_disable_radix</span></p><p style="text-align: justify;"><span style="font-size: small;">python3 op-test -c machine.conf --run testcases.PowerNVDump.KernelCrash_XIVE_off</span></p><p style="text-align: justify;"><span style="font-size: x-small;">python3 op-test --run-suite osdump-suite -c CR-machine.conf</span></p><p style="text-align: justify;"><span style="font-size: x-small;">python3 op-test --run testcases.RunHostTest -c CR-Machine.conf --host-cmd-file CR-Machine_command.conf --host-cmd-timeout 286400</span></p><div style="text-align: justify;">-------------------</div><div style="text-align: justify;"><b>How to analyze op-test output:</b></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Traverse this directory path: op-test/test-reports/test-run-$date</div><div style="text-align: justify;">There are 3 log files to investigate a test failure or the life cycle of a test suite:</div><div style="text-align: justify;"><div><span style="font-size: xx-small;"># pwd</span></div><div><span style="font-size: xx-small;">/root/fix_kdump_FalsePositives/op-test/test-reports/test-run-$DATE</span></div><div><span style="font-size: xx-small;">#</span></div></div><div style="text-align: justify;"><div><span style="font-size: xx-small;"># ls -1</span></div><div><span style="font-size: xx-small;">$DATE.log</span></div><div><span style="font-size: xx-small;">$DATE.main.log</span></div><div><span style="font-size: xx-small;">$DATE.debug.log</span></div><div><span style="font-size: xx-small;">#</span></div><div><br /></div><div>1) $DATE.log ====> You can see console-related commands and their outputs </div><div><br /></div><div><span style="font-size: xx-small;">For example:</span></div><div><div><span style="font-size: xx-small;">lssyscfg -m Serverx0 -r lpar --filter lpar_names=Serverx0-lp6 -F state</span></div><div><span style="font-size: xx-small;">lsrefcode -m Serverx0 -r lpar --filter lpar_names=Serverx0-lp6 -F refcode</span></div><div><span style="font-size: xx-small;">chsysstate -m Serverx0 -r lpar -n Server-lp6 -o shutdown --immed</span></div><div><br /></div><div>2) $DATE.main.log</div><div>If you add any statements with log.info, they will be logged in this 
file </div><div><br /></div><div><span style="font-size: xx-small;">log.info("=============== Testing kdump/fadump over ssh ===============")</span></div><div><br /></div><div>3) $DATE.debug.log</div><div>If you add any comments with log.debug, that will be logged in this file </div><div><br /></div><div><span style="font-size: xx-small;">log.debug("SACHIN_DEBUG: In loop1")</span></div></div><div><br /></div></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><div>Listed are some interesting things about the op-test framework and its use cases:</div><div><ol><li>Testing hardware systems: The op-test framework is designed for testing hardware systems, particularly servers, using the OpenPOWER architecture. It includes a wide range of tests that cover different aspects of hardware functionality, such as power management, CPU, memory, and I/O.</li><li>Integration with OpenBMC: The op-test framework integrates with the OpenBMC project, an open-source implementation of the Baseboard Management Controller (BMC) firmware that provides out-of-band management capabilities for servers. This integration allows users to control and monitor server hardware using the OpenBMC interface, and to run tests on the hardware using the op-test framework.</li><li>UEFI and firmware testing: The op-test framework includes support for testing UEFI firmware and other low-level system components, such as the Hostboot bootloader. This allows users to test the system firmware and ensure that it is functioning correctly.</li><li>Easy to use: The op-test framework is designed to be easy to use, even for users who are not familiar with hardware testing. It uses a simple command-line interface and provides detailed documentation and tutorials to help users get started quickly.</li><li>Scalability: The op-test framework is designed to be scalable and can be used to test multiple systems in parallel. This makes it suitable for testing large server farms and data centers.</li><li>Community support: The op-test framework is an open-source project with an active community of developers and testers. The community provides regular updates and bug fixes, and is actively involved in improving the usability and functionality of the framework. The op-test community also provides support through various channels, including GitHub, mailing lists, and IRC.</li><li>Use cases: The op-test framework is used by various organizations and companies for testing hardware systems, including server manufacturers, data center operators, and cloud service providers. Some of the organizations that use the op-test framework include IBM, Google, and Rackspace.</li></ol><div><b>How to contribute to op-test framework open source community :</b></div></div><div><br /></div><div><div>1) mkdir kdump_xive_off_check<br /><br />2) cd kdump_xive_off_check<br /><br />3) git clone git@github.com:SACHIN-PB/op-test.git</div><div><br /></div><div> <span style="font-size: x-small;"> Fork the repository from master : <b>https://github.com/open-power/op-test</b></span></div><div><span style="font-size: x-small;"><b><br /></b></span></div><div><span style="font-size: x-small;"> NOTE: In Git, forking a repository means creating a copy of the original repository into your own GitHub account. 
</span></div><div><span style="font-size: x-small;"> This is typically done when you want to contribute to an open-source project or collaborate with other developers.</span><br /><br />4) git config user.email<br /><br />5) git config user.name</div><div><br /></div><div><span style="font-size: x-small;">NOTE: To get proper username and email . Please do the following setup at /root directory</span></div><div><div><span style="background-color: #cccccc; font-size: x-small;"># cat .gitconfig</span></div><div><span style="background-color: #cccccc; font-size: x-small;">[user]</span></div><div><span style="background-color: #cccccc; font-size: x-small;"> email = sachin@linux.XYZ.com</span></div><div><span style="background-color: #cccccc; font-size: x-small;"> name = Sachin P B</span></div><div><span style="background-color: #cccccc; font-size: x-small;">#</span></div><br />6) git branch<br /><br />7) git remote -v</div><div><span style="font-size: x-small;"> <span style="background-color: #cccccc;"> origin git@github.com:SACHIN-PB/op-test.git (fetch)</span></span></div><div><span style="background-color: #cccccc; font-size: x-small;"> origin git@github.com:SACHIN-PB/op-test.git (push)</span></div><div><br /></div><div>8) git remote add upstream git@github.com:open-power/op-test.git</div><div><br /></div><div>9) git remote -v</div><div><span style="background-color: #cccccc;"> <span style="font-size: x-small;"> origin git@github.com:SACHIN-PB/op-test.git (fetch)</span></span></div><div><span style="background-color: #cccccc; font-size: x-small;"> origin git@github.com:SACHIN-PB/op-test.git (push)</span></div><div><span style="background-color: #cccccc; font-size: x-small;"> upstream git@github.com:open-power/op-test.git (fetch)</span></div><div><span style="background-color: #cccccc; font-size: x-small;"> upstream git@github.com:open-power/op-test.git (push)</span></div><div><br /></div><div>10) git checkout -b "kdump_xive_off_check"</div><div><br /></div><div>11) git branch</div><div><br /></div><div>12) vi testcases/PowerNVDump.py</div><div><br /></div><div>13) git diff</div><div><br /></div><div>14) git status</div><div><br /></div><div>15) git add testcases/PowerNVDump.py</div><div><br /></div><div>16) git status</div><div><br /></div><div>17) git commit -s</div><div><br /></div><div>18) git branch</div><div><br /></div><div>19) git push origin kdump_xive_off_check</div><div><span style="background-color: #cccccc; font-size: x-small;">Enumerating objects: 7, done.</span></div><div><span style="background-color: #cccccc; font-size: x-small;">Counting objects: 100% (7/7), done.</span></div><div><span style="background-color: #cccccc; font-size: x-small;">Delta compression using up to 16 threads</span></div><div><span style="background-color: #cccccc; font-size: x-small;">Compressing objects: 100% (4/4), done.</span></div><div><span style="background-color: #cccccc; font-size: x-small;">Writing objects: 100% (4/4), 880 bytes | 880.00 KiB/s, done.</span></div><div><span style="background-color: #cccccc; font-size: x-small;">Total 4 (delta 3), reused 0 (delta 0), pack-reused 0</span></div><div><span style="background-color: #cccccc; font-size: x-small;">remote: Resolving deltas: 100% (3/3), completed with 3 local objects.</span></div><div><span style="background-color: #cccccc; font-size: x-small;">remote:</span></div><div><span style="background-color: #cccccc; font-size: x-small;">remote: Create a pull request for 'kdump_xive_off_check' on GitHub by visiting:</span></div><div><span 
style="background-color: #cccccc; font-size: x-small;">remote: https://github.com/SACHIN-PB/op-test/pull/new/kdump_xive_off_check</span></div><div><span style="background-color: #cccccc; font-size: x-small;">remote:</span></div><div><span style="background-color: #cccccc; font-size: x-small;">To github.com:SACHIN-PB/op-test.git</span></div><div><span style="background-color: #cccccc; font-size: x-small;"> * [new branch] kdump_xive_off_check -> kdump_xive_off_check</span></div><div><span style="background-color: #cccccc; font-size: x-small;">#</span></div></div><div><br /></div><div>20) Create PR using the link created at step 19 and request for the review</div><div><span style="font-size: x-small;"><span style="background-color: white;">Example https://github.com/open-power/op-test/pull/7XYZ4:</span></span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRhdLelO5Vgi4y5LrU1Ss-zba6AuMNfN4OjxprZRVR3L1et4R9yM5LnN-qk21AwZOknmIFngCBbtootsC1OuHJ0CtMtF9CLtE6Y6PeroOJ0lsFA0poPVqO9ZbadnjR13D3iHJLlc_lMEN5HKNxv9PFKFT5YqRVXmaXlospLhNaDNET2DGykUG3ucuCOw/s1380/xive-pr-snapshot.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="857" data-original-width="1380" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRhdLelO5Vgi4y5LrU1Ss-zba6AuMNfN4OjxprZRVR3L1et4R9yM5LnN-qk21AwZOknmIFngCBbtootsC1OuHJ0CtMtF9CLtE6Y6PeroOJ0lsFA0poPVqO9ZbadnjR13D3iHJLlc_lMEN5HKNxv9PFKFT5YqRVXmaXlospLhNaDNET2DGykUG3ucuCOw/w640-h398/xive-pr-snapshot.png" width="640" /></a></div></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">21) You can update your PR by running these commands </div><div style="text-align: justify;"><span style="background-color: #cccccc;">git commit --amend<br />git push -f origin kdump_xive_off_check</span></div><div style="text-align: justify;">======================</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span style="font-size: xx-small;">Reference:</span></div><div style="text-align: justify;"><span style="font-size: xx-small;">1) https://github.com/open-power/op-test/blob/master/testcases/RunHostTest.py</span></div><div style="text-align: justify;"><span style="font-size: xx-small;">2) https://github.com/avocado-framework-tests/avocado-misc-tests</span></div><div style="text-align: justify;"><span style="font-size: xx-small;">3) </span><span style="text-align: left;"><span style="font-size: xx-small;">https://avocado-framework.readthedocs.io/en/latest/</span></span></div>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-8306474358691344022023-03-29T17:45:00.007+05:302023-03-30T10:41:19.511+05:30Linux security and kernel Lockdown - kernel image access prevention feature<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgrm28BXMRIaIXqa2q1PM7pRWgojlDooeiuFa5P6q0E43zGXinO7biurqyL66Aurct8upTPnhD65geqAvZ_Hqddl4XE3MgRI8mJsN_VRMqz4BkV6LQRFsTyYeRnvjbt8i_iUarhLVsdR9XTKOQOSbWsaL76Y_Bss4XBfqIDf9n7KicD9LM286Tm_5Smg/s1200/ld.jpg" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="431" data-original-width="1200" height="115" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgrm28BXMRIaIXqa2q1PM7pRWgojlDooeiuFa5P6q0E43zGXinO7biurqyL66Aurct8upTPnhD65geqAvZ_Hqddl4XE3MgRI8mJsN_VRMqz4BkV6LQRFsTyYeRnvjbt8i_iUarhLVsdR9XTKOQOSbWsaL76Y_Bss4XBfqIDf9n7KicD9LM286Tm_5Smg/w254-h115/ld.jpg" width="254" /></a></div><div style="text-align: justify;"><div>Linux has a long history of security-focused development and has been used in many high-security environments, such as military and government organizations. Linux is highly customizable, which allows administrators to tailor security configurations to their specific needs. For example, security modules like SELinux and AppArmor can be configured to enforce highly granular access control policies. Many Linux distributions include security-focused features, such as hardening patches and secure boot support, by default.The open-source nature of Linux allows for community-driven development and auditing, which can help to uncover security vulnerabilities and improve the overall security of the system. Linux containers, such as Docker and Kubernetes, have become increasingly popular in recent years and offer a more secure alternative to traditional virtualization solutions. Linux is widely used in cloud environments and has many built-in features for secure cloud deployments, such as network isolation and encryption. It is constantly being updated and improved with new security features and bug fixes, making it one of the most secure operating systems available.</div></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The Kernel Lockdown feature is designed to prevent both direct and indirect access to a running kernel image, attempting to protect against unauthorized modification of the kernel image and to prevent access to security and cryptographic data located in kernel memory, whilst still permitting driver modules to be loaded. This is security feature, the Linux Security Module (LSM, nicknamed “lockdown”). It does promise to bring additional security to one of the most widely-used and hardened kernels on the market. The lockdown feature’s aim is to restrict various pieces of kernel functionality. There are two modes available to the lockdown module: Integrity and Confidentiality. When in Integrity mode, kernel features which would allow userland code to modify the running kernel are disabled. When in Confidentiality mode, userland code to extract confidential information from the kernel will be disabled.First off, it will restrict access to kernel features that may allow arbitrary code execution by way of code supplied by any application or service outside of the kernel (aka “userland”). The new feature will also block processes from reading/writing to /dev/mem and /dev/kmem memory, as well as block access to opening /dev/port (as a means to prevent raw ioport access). Other features include:</div><div style="text-align: justify;"><div><ol><li>Enforcing kernel module signatures.</li><li>Prevents even the root account from modifying the kernel code.</li><li>Kexec reboot (in case secure boot being enabled does not keep the secure boot mode in new kernel).</li><li>Lockdown of hardware that could potentially generate direct memory addressing (DMA).</li><li>Lockdown of KDADDIO, KDDELIO, KDENABIO and KDDISABIO console ioctls.</li></ol><div>where</div></div><div><ul><li><span style="font-size: x-small;">The KDADDIO, KDDELIO, KDENABIO, and KDDISABIO console ioctls are used to manage console input/output (I/O) on Linux systems. 
Here's a brief overview of each of these ioctls:</span></li><li><span style="font-size: x-small;">KDADDIO: This ioctl is used to add a new input/output device to the console. When a new device is added using KDADDIO, it can be used to send input to or receive output from the console.</span></li><li><span style="font-size: x-small;">KDDELIO: This ioctl is used to remove an input/output device from the console. When a device is removed using KDDELIO, it is no longer able to send input to or receive output from the console.</span></li><li><span style="font-size: x-small;">KDENABIO: This ioctl is used to enable input/output from a specific device on the console. When a device is enabled using KDENABIO, it can be used to send input to or receive output from the console.</span></li><li><span style="font-size: x-small;">KDDISABIO: This ioctl is used to disable input/output from a specific device on the console. When a device is disabled using KDDISABIO, it is no longer able to send input to or receive output from the console.</span></li></ul><div><span style="font-size: x-small;">NOTE: The "KD" in these console ioctls stands for Keyboard Display. The term "keyboard display" is used to refer to the console on a computer system, which includes the keyboard and screen used to interact with the system</span></div></div></div><p></p><p>If a prohibited or restricted feature is accessed or used, the kernel will emit a message that looks like:</p><p> <span style="background-color: #cccccc;">Lockdown: X: Y is restricted, see man kernel_lockdown.7</span></p><p>where X indicates the process name and Y indicates what is restricted. On an EFI-enabled x86 or arm64 machine, lockdown will be automatically enabled if the system boots in EFI Secure Boot mode.</p><p>Coverage: When lockdown is in effect, a number of features are disabled or have their use restricted. This includes special device files and kernel services that allow direct access of the kernel image:</p><p style="text-align: left;"><span style="background-color: #cccccc;"> /dev/mem<br /> /dev/kmem<br /> /dev/kcore<br /> /dev/ioports<br /> BPF<br /> kprobes</span></p><p>and the ability to directly configure and control devices, so as to prevent the use of a device to access or modify a kernel image:The use of module parameters that directly specify hardware parameters to drivers through the kernel command line or when loading a module.</p><p>The term "lockdown" refers to a set of security features in the Linux kernel that are designed to prevent even privileged users, such as the root user, from bypassing certain security restrictions. These features are intended to provide an additional layer of protection against malicious software and unauthorized access to sensitive information.</p><p>There are two main components to the lockdown feature:</p><p><b>Integrity measurement:</b> This feature prevents changes to the kernel's security settings, such as disabling secure boot or loading unsigned kernel modules, even by users with root privileges.</p><p><b>Confidentiality protection:</b> This feature prevents user space processes from accessing certain sensitive information, such as kernel memory or hardware resources, even if the processes are running with root privileges.</p><p>The lockdown feature is a powerful tool for enhancing the security of Linux systems, particularly in high-security environments or those where data privacy is a top concern. 
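</p><p style="text-align: justify;">To see whether lockdown is currently in effect on a running system, the active mode can usually be read from securityfs, and a mode can be requested at boot with the lockdown= kernel parameter. The snippet below is only a minimal sketch, assuming a kernel built with the lockdown LSM (CONFIG_SECURITY_LOCKDOWN_LSM) and securityfs mounted at /sys/kernel/security; the bracketed value marks the active mode and the output shown is purely illustrative:</p><p><span style="background-color: #cccccc;"># cat /sys/kernel/security/lockdown<br />none [integrity] confidentiality<br /># grep -o "lockdown=[a-z]*" /proc/cmdline<br />lockdown=integrity</span></p><p>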
However, it can also limit the flexibility of the system, so it's important to carefully consider the trade-offs before enabling this feature.</p><p>---------------------------------------------------------</p><p>The lockdown feature and SELinux are both security features in the Linux kernel, but they serve different purposes and work independently of each other.</p><p style="text-align: justify;"><b>SELinux is a mandatory access control (MAC)</b> system that enforces a set of security policies to determine what processes and users can access specific resources, such as files or network ports. It operates by labeling resources with a security context and assigning labels to users and processes. The security policies defined in SELinux are enforced by the kernel and can prevent unauthorized access and other security breaches.</p><p style="text-align: justify;">The lockdown feature, on the other hand, is designed to prevent even privileged users, including those with root privileges, from bypassing certain security restrictions. It achieves this by restricting access to certain kernel features and preventing modifications to the kernel's security settings.</p><p style="text-align: justify;">When both SELinux and the lockdown feature are enabled, they work together to provide a comprehensive security solution. SELinux enforces mandatory access controls to restrict access to resources, while the lockdown feature ensures that even privileged users cannot bypass certain security restrictions. This can help to prevent security breaches caused by malicious software or unauthorized access to sensitive information.</p><p style="text-align: justify;">The combination of SELinux and the lockdown feature provides a powerful security solution for Linux systems. It's important to carefully configure and manage these features to ensure that they do not interfere with normal system operations or cause unintended consequences.</p><p style="text-align: justify;">The idea of effectively rendering the root account less capable of working with a system (on a kernel level), might be considered (to some) a disservice to Linux (and Linux administrators). However, in the realm of business, absolute security is a necessity — especially on machines that house sensitive business/customer data. When the root account is under a form of strict lockdown, malicious code would be significantly more challenging to run rampant on a system. This could lead to fewer data breaches. And because the kernel developers are making the lockdown feature “optional,” it is possible for enterprise admins to enable the feature on production machines that store such sensitive data. Conversely, on standard desktop machines (or developer machines) the feature can remain disabled.</p><p style="text-align: justify;">Linux kernel has several security features built into it to protect against various types of security threats. Some of the additional security features of Linux kernel include:</p><p style="text-align: justify;">1) <b>AppArmor</b>: AppArmor is a mandatory access control (MAC) system that restricts the capabilities of individual applications or processes. It can be used to enforce security policies that limit the actions of individual applications, such as restricting access to certain files or network resources.</p><p style="text-align: justify;">2) <b>Control Groups (cgroups):</b> cgroups provide a way to organize and manage system resources, such as CPU, memory, and I/O bandwidth, among different processes. 
This helps to prevent individual processes from monopolizing system resources, which can improve system performance and stability.</p><p style="text-align: justify;">3) <b>Kernel SamePage Merging (KSM)</b>: KSM allows multiple identical memory pages to be merged into a single page, reducing memory usage and improving system performance. However, this feature also presents a potential security risk, as it could allow an attacker to create a malicious page that looks like a legitimate page, thereby bypassing memory protection measures.</p><p style="text-align: justify;">4) <b>Executable Space Protection</b>: Executable Space Protection is a security feature that prevents execution of code from memory pages that are marked as data or stack. This helps to prevent buffer overflow and other types of attacks that rely on executing code in memory regions that are not intended for code execution.</p><p style="text-align: justify;">5) <b>Secure Boot</b>: Secure Boot is a security feature that ensures that only trusted software is executed during system boot-up. It uses cryptographic signatures to verify the authenticity of boot loaders and other critical components of the system, preventing unauthorized or malicious software from running at boot time.</p><p style="text-align: justify;">6) <b>Address Space Layout Randomization</b> (ASLR): This feature randomizes the memory layout of user space programs, making it more difficult for attackers to exploit vulnerabilities in the program's code.</p><p style="text-align: justify;">7) <b>Seccomp</b>: This feature provides a mechanism for filtering system calls that can be made by a process, allowing administrators to restrict the system calls that can be made by certain programs.</p><p style="text-align: justify;">8) <b>Trusted Platform Module (TPM)</b>: This is a hardware-based security feature that provides a secure storage area for cryptographic keys and other sensitive data. It can be used to enhance the security of system booting, disk encryption, and other security-related functions.</p><p style="text-align: justify;">9) <b>SELinux </b>similar to<b> </b>AppArmor. These are two popular security modules that provide mandatory access control (MAC) enforcement in the Linux kernel. They use security policies to define what resources, such as files and network ports, can be accessed by which processes and users.</p><p style="text-align: justify;">-----------------------------</p><p style="text-align: justify;">The Linux kernel includes a variety of cryptography algorithms that can be used to provide secure communication, storage, and other security-related functions. Here are some of the key cryptography algorithms and features in the Linux kernel:</p><p style="text-align: justify;">1) Advanced Encryption Standard (AES): AES is a symmetric encryption algorithm that is widely used for data encryption. The Linux kernel includes an implementation of AES that can be used by applications and other kernel subsystems.</p><p style="text-align: justify;">2) RSA: RSA is an asymmetric encryption algorithm that is used for digital signatures and key exchange. The Linux kernel includes an implementation of RSA that can be used by applications and other kernel subsystems.</p><p style="text-align: justify;">3) SHA: SHA (Secure Hash Algorithm) is a family of cryptographic hash functions that are used for digital signatures, data integrity checking, and other security-related functions. 
The Linux kernel includes implementations of several SHA algorithms, including SHA-1, SHA-256, and SHA-512.</p><p style="text-align: justify;">4) Random Number Generation: Random number generation is a critical component of many cryptographic functions. The Linux kernel includes several sources of entropy that are used to generate high-quality random numbers for use in cryptography algorithms.</p><p style="text-align: justify;">5) Cryptographic API: The Linux kernel includes a Cryptographic API that provides a standard interface for using cryptographic functions in kernel modules and applications. The API includes support for a wide range of cryptographic algorithms and features, including those listed above.</p><p style="text-align: justify;">6) Filesystem Encryption: The Linux kernel includes support for encrypting filesystems using the dm-crypt subsystem. This allows for encrypted storage of sensitive data and can be used to protect against data theft in the event of a system breach.</p><p style="text-align: justify;">7) IPSec: IPSec is a protocol suite for securing IP communications, including VPNs and other types of network connections. The Linux kernel includes support for IPSec, which can be used to secure network communications between Linux systems.</p><p style="text-align: justify;">Linux kernel has a wide range of built-in security features that can help to protect against various types of security threats. Linux-based security offers a wide range of benefits, including customizability, community-driven development, and a long history of use in high-security environments. These factors have helped to make Linux a popular choice for organizations looking to enhance the security of their systems and data.</p><p style="text-align: justify;"><span style="font-size: xx-small;">Reference:</span></p><p style="text-align: justify;"><span style="font-size: xx-small;">https://man7.org/linux/man-pages/man7/kernel_lockdown.7.html</span></p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-7252682689246008792023-03-28T22:53:00.008+05:302023-03-29T08:00:37.210+05:30Non-Uniform Memory Access Architecture<p style="text-align: justify;">Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the distance between the CPU and the memory. In a NUMA system, each CPU has access to its own local memory as well as remote memory, which can cause performance issues if not managed properly.</p><p>In the Non-Uniform Memory Access (NUMA) architecture, the path from processor to memory is non-uniform. This organization enables the construction of systems with a large number of processors, and hence the association with very large systems. A NUMA system with cache-coherent memory running a single OS image is still an SMP system. 
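</p><p style="text-align: justify;">On a running Linux system, the NUMA layout can be inspected before any tuning is attempted. The commands below are a small sketch, assuming the numactl package is installed; the node counts and CPU ranges shown are only illustrative, and ./my_app is just a placeholder workload:</p><p><span style="background-color: #cccccc;"># lscpu | grep -i numa<br />NUMA node(s):        2<br />NUMA node0 CPU(s):   0-7<br />NUMA node1 CPU(s):   8-15<br /># numactl --hardware<br /># numactl --cpunodebind=0 --membind=0 ./my_app</span></p><p style="text-align: justify;">numactl --hardware prints each node's CPUs, memory size and the inter-node distance table, while the --cpunodebind/--membind options pin a workload and its allocations to one node so that its memory accesses stay local.</p><p>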
A general representation of the NUMA system is shown below.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXvrDzrw84D9P5o5YiOslMysGP89p64UQWazyZO3Kn-qL9xIkC1BF-QzHpJ-9H2aOAfOctewyF8PkX58J42fpV-T2qJUsJCu_O_8L38Mmw7bW7qzej-wzy4wVphfz0GWhLVeaFpkQV_ZjhekJK2o9YbvAfyhwY7ZqJKCsHQ1Ygrykz8GUk4lQu-ujchg/s885/numa2.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="275" data-original-width="885" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXvrDzrw84D9P5o5YiOslMysGP89p64UQWazyZO3Kn-qL9xIkC1BF-QzHpJ-9H2aOAfOctewyF8PkX58J42fpV-T2qJUsJCu_O_8L38Mmw7bW7qzej-wzy4wVphfz0GWhLVeaFpkQV_ZjhekJK2o9YbvAfyhwY7ZqJKCsHQ1Ygrykz8GUk4lQu-ujchg/w640-h198/numa2.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.qdpma.com/SystemArchitecture/NUMA.html" target="_blank">NUMA Architecture</a></td></tr></tbody></table><br /><p>The system is comprised of multiple nodes each with 2-4 processors, a memory controller, memory and perhaps IO. There might be a separate node controller, or the MC and NC could be integrated. The nodes could be connected by a shared bus, or may implement a cross-bar.</p><p style="text-align: justify;"> By classifying memory location bases on signal path length from the processor to the memory, latency and bandwidth bottlenecks can be avoided. This is done by redesigning the whole system of processor and chipset. AMD Opteron family was introduced featuring integrated memory controllers with each CPU owning designated memory banks. Each CPU has now its own memory address space. A NUMA optimized operating system such as ESXi allows workload to consume memory from both memory addresses spaces while optimizing for local memory access. Let’s use an example of a two CPU system to clarify the distinction between local and remote memory access within a single system.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigd0vHPoURvxxByUQ1A2-7YF91h-Sg07j7BxXORAb3xONstiimkaC60DuHfEWo4OY23kuc8AcivVoCDXn5K2R4DJ2pHOIuM2xD3LNDdCxulZ42sOIL9LH8FpEK92GrZU4G5roFJTRse1IELJiDsK851FQuUE8LJeYp2JdKWAJ8DWXi3mI4QtuZNW5Urw/s963/mem_NUma.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="412" data-original-width="963" height="274" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigd0vHPoURvxxByUQ1A2-7YF91h-Sg07j7BxXORAb3xONstiimkaC60DuHfEWo4OY23kuc8AcivVoCDXn5K2R4DJ2pHOIuM2xD3LNDdCxulZ42sOIL9LH8FpEK92GrZU4G5roFJTRse1IELJiDsK851FQuUE8LJeYp2JdKWAJ8DWXi3mI4QtuZNW5Urw/w640-h274/mem_NUma.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://frankdenneman.nl/2016/07/07/numa-deep-dive-part-1-uma-numa/" target="_blank"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><p style="text-align: justify;">The memory connected to the memory controller of the CPU1 is considered to be local memory. Memory connected to another CPU socket (CPU2)is considered to be foreign or remote for CPU1. 
Remote memory access has additional latency overhead to local memory access, as it has to traverse an interconnect (point-to-point link) and connect to the remote memory controller. As a result of the different memory locations, this system experiences “non-uniform” memory access time.</p><p><b>HP Prema:</b></p><p>A node comprises four processors, memory, IO and a pair of node controllers. There are three IOH devices in the system. The processors and memory on the CPU board are connected to the XNC board with the node controllers.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOTfWZY0fiiiyDpPz3od_ob5CQLxV3heK1jk_z4-K_9njM_Wt2o1Z3fwYhB5KO6IgcQymgowgJ2FL5RiazvOA-cQhqv99oWBBDH11xkB1EQOEAmwNrHn-vlc8EIQAnGtydpr2QSgxmzXoa42EUTO_hiUiMvL-9a-Mhqb9-bhVUVwWeQ2nHMj8j05pKnQ/s888/HP_prema.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="361" data-original-width="888" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOTfWZY0fiiiyDpPz3od_ob5CQLxV3heK1jk_z4-K_9njM_Wt2o1Z3fwYhB5KO6IgcQymgowgJ2FL5RiazvOA-cQhqv99oWBBDH11xkB1EQOEAmwNrHn-vlc8EIQAnGtydpr2QSgxmzXoa42EUTO_hiUiMvL-9a-Mhqb9-bhVUVwWeQ2nHMj8j05pKnQ/w640-h260/HP_prema.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.qdpma.com/SystemArchitecture/NUMA.html" target="_blank"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><br /><p>The "isolcpus" kernel parameter is used to isolate one or more CPUs from the kernel scheduler. This is typically used for running real-time or high-performance applications that require dedicated CPU resources. However, it does not have any direct relationship with NUMA nodes.</p><p>vi /etc/default/grub</p><p>Find the line that starts with "GRUB_CMDLINE_LINUX_DEFAULT" and add "numa=off" to the end of the line:</p><p>GRUB_CMDLINE_LINUX_DEFAULT="quiet splash numa=off"</p><p>Regenerate the GRUB configuration file by running the following command:</p><p>update-grub</p><p>grub2-mkconfig -o /boot/grub2/grub.cfg</p><p>Reboot the system for the changes to take effect</p><p>Note that after disabling NUMA, the system will treat all memory as a single, uniform memory pool. This may not always improve performance</p><p>---------------</p><p>Certainly, here's an example of how to use the "isolcpus" command to isolate CPU cores from the kernel scheduler:</p><p>Find out the number of available CPU cores by running:</p><p>cat /proc/cpuinfo | grep processor | wc -l</p><p>To isolate one or more CPU cores from the kernel scheduler, append the "isolcpus" parameter to the kernel boot command in the GRUB configuration file. For example, to isolate CPU core 0, edit the GRUB configuration file by running:</p><p>/etc/default/grub</p><p>Add the "isolcpus" parameter followed by the CPU core number(s) to the end of the "GRUB_CMDLINE_LINUX_DEFAULT" line. For example:</p><p>GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=0"</p><p>If you want to isolate multiple cores, separate them with commas. For example:</p><p>GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=0,2"</p><p>save and update-grub</p><p>Reboot the system for the changes to take effect.</p><p>After isolating the specified CPU cores with the "isolcpus" parameter, you can assign them to a specific process using the "taskset" command. 
For example, to run a process on CPU core 0, run:</p><p><br /></p><p>taskset -c 0 <command></p><p>Note that isolating CPU cores can affect system performance, so it's important to test your application's performance before and after isolating CPU cores to see if it has any effect on performance.</p><p style="text-align: justify;"><span style="background-color: #cccccc;"># cat /proc/cpuinfo | grep processor | wc -l<br />16<br /># taskset -c 0 hostname<br />host1<br /># taskset -c 18 hostname<br />taskset: failed to set pid 11896's affinity: Invalid argument<br /># taskset -c 16 hostname<br />taskset: failed to set pid 12131's affinity: Invalid argument<br /># taskset -c 15 hostname<br />host1<br /># taskset -c 11,15 hostname<br />host1<br /> #</span></p><p>on PPC arch : /etc/grub.conf</p><p>Find the line that starts with "append" and add "isolcpus" parameter followed by the CPU core number(s) to the end of the line. For example</p><p>append="quiet splash isolcpus=0,2"</p><p>Ater isolating the specified CPU cores with the "isolcpus" parameter, you can assign them to a specific process using the "taskset" command</p><p>check your kernel supports isolcpus:</p><p>grep -i isolcpus /boot/config-$(uname -r)</p><p>This command searches for the "isolcpus" parameter in the kernel configuration file for the currently running kernel.</p><p>If the output of the command shows a line that looks like this:</p><p>CONFIG_ISOLCPU_PROC=y</p><p>then your kernel supports isolating CPU cores with the "isolcpus" parameter.</p><p>Some Linux distributions may not include the kernel configuration file in the /boot directory. In that case, you may need to install the "kernel-devel" or "kernel-source" package to access the kernel configuration file</p><p>NOTE: Another way of making CPU offline:<br />echo 0 > /sys/devices/system/cpu/cpu7/online (to offline cpu)<br />change cpu number with required cpu</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPEXiJolsmRGuuUf4EVo7RfeIkZ7uExXmSCMNhSYRrDvE2eBlm8zmH4pD1tL-o0ENZmMpL70BG5U0C_W-Lc_c5K_CqRlG-PL3r5YG9D9DYnnbChLkvmLNssN2Gv3bq062d60pD7A5o2VOqZjBgBv1NFDX_aCXy1DOeCJ-LfIV-cTNwsWFiu0Uw5FsuJg/s827/numa3.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="475" data-original-width="827" height="368" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPEXiJolsmRGuuUf4EVo7RfeIkZ7uExXmSCMNhSYRrDvE2eBlm8zmH4pD1tL-o0ENZmMpL70BG5U0C_W-Lc_c5K_CqRlG-PL3r5YG9D9DYnnbChLkvmLNssN2Gv3bq062d60pD7A5o2VOqZjBgBv1NFDX_aCXy1DOeCJ-LfIV-cTNwsWFiu0Uw5FsuJg/w640-h368/numa3.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.qdpma.com/SystemArchitecture/NUMA.html" target="_blank"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><p><b>Simultaneous multithreading (SMT)</b> is a processor design that combines hardware multithreading with superscalar processor technology. 
Simultaneous multithreading can use multiple threads to issue instructions each cycle.</p><p>Example: How enable SMT and check on power architecture(PPC): </p><p style="text-align: left;"><span style="background-color: #cccccc;"># cat smt.sh<br />while [ 1 ]<br />do<br /> ppc64_cpu --smt=off<br /> ppc64_cpu --smt<br /> ppc64_cpu --smt=on<br /> ppc64_cpu --smt<br /> ppc64_cpu --smt=2<br /> ppc64_cpu --smt<br /> ppc64_cpu --smt=4<br /> ppc64_cpu --smt<br />done</span></p><p>-----------------------------------END--------------------------------------------------------------</p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-28756929741098693892023-02-23T22:51:00.018+05:302023-02-25T22:47:31.743+05:30High Performance Network Adapters and protocols <p style="text-align: justify;">High performance network adapters are designed to provide fast and efficient data transfer between servers, storage systems, and other devices in a data center or high-performance computing environment. They typically offer advanced features such as high bandwidth, low latency, RDMA support, and offload capabilities for tasks such as encryption and compression. These adapters are often used in high-performance computing, cloud computing, and data center environments to support large-scale workloads and high-speed data transfer. Some examples of high-performance network adapters include:</p><p style="text-align: justify;"></p><ul><li>Mellanox ConnectX-6 and ConnectX-6 Dx</li><li>Intel Ethernet Converged Network Adapter X710 and X722</li><li>Broadcom BCM957810A1008G Network Adapter</li><li>QLogic QL45212HLCU-CK Ethernet Adapter</li><li>Solarflare XtremeScale X2522/X2541 Ethernet Adapter</li><li>Chelsio T6 and T6E-CR Unified Wire Adapters</li></ul><p></p><p style="text-align: justify;">High-performance network adapters typically use specialized protocols that are designed to provide low-latency and high-bandwidth communication between systems. Some examples of these protocols include:</p><p style="text-align: justify;"></p><ol><li>Remote Direct Memory Access (RDMA): A protocol that allows data to be transferred directly between the memory of one system and another, without involving the CPU of either system.</li><li>RoCE (RDMA over Converged Ethernet): An extension of RDMA that allows RDMA traffic to be carried over Ethernet networks.</li><li>iWARP: A protocol that provides RDMA capabilities over standard TCP/IP networks.</li><li>InfiniBand: A high-speed interconnect technology that provides extremely low-latency and high-bandwidth communication between systems.</li></ol><p></p><p style="text-align: justify;">These protocols are typically used in high-performance computing (HPC) environments, where low-latency and high-bandwidth communication is critical for achieving maximum performance. They are also used in other applications that require high-speed data transfer, such as machine learning, data analytics, and high-performance storage systems. 
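</p><p style="text-align: justify;">To check whether a Linux host actually exposes RDMA-capable hardware for these protocols, a few standard userspace tools can be used. This is a minimal sketch, assuming the rdma-core and infiniband-diags packages are installed; the device name mlx5_0 is only an example:</p><p><span style="background-color: #cccccc;"># rdma link show<br /># ibv_devinfo | grep -E "hca_id|link_layer|state"<br /># ibstat mlx5_0</span></p><p style="text-align: justify;">rdma link show lists RDMA links and their state, ibv_devinfo reports the verbs devices visible to applications, and ibstat shows per-port state, rate and GUIDs for a given adapter.</p><p style="text-align: justify;">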
Some examples of adapter features include:</p><p style="text-align: justify;"></p><ul><li>Advanced offloading capabilities: High-performance adapters can offload CPU-intensive tasks such as packet processing, encryption/decryption, and compression/decompression, freeing up server resources for other tasks.</li><li>Low latency: Many high-performance adapters are designed to minimize latency, which is especially important for applications that require fast response times, such as high-frequency trading, real-time analytics, and scientific computing.</li><li>Scalability: Some adapters support features such as RDMA and SR-IOV, which allow multiple virtual machines to share a single adapter while maintaining high performance and low latency.</li><li>Security: Many high-performance adapters have hardware-based security features such as secure boot, secure firmware updates, and hardware-based encryption/decryption, which can help protect against attacks and data breaches.</li><li>Management and monitoring: High-performance adapters often come with tools for monitoring and managing network traffic, analyzing performance, and troubleshooting issues.</li></ul><p></p><p style="text-align: justify;"><span style="background-color: #f7f7f8; color: #374151; text-align: justify; white-space: pre-wrap;">A network adapter, also known as a network interface card (NIC), is a hardware component that allows a computer or other device to connect to a network. It typically includes a connector for a cable or antenna, as well as the necessary electronics to transmit and receive data over the network. Network adapters can be internal, installed inside the computer or device, or external, connected via USB or other ports. They are used for wired or wireless connections and support different types of networks such as Ethernet, WiFi, Bluetooth, and cellular networks.</span></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNjd6DyhrVIZ1rQlqXHq0iMBpQG3TWD7i-OcmWIL_0Pd9GEtj54DzdpiJNZxr7ylt4HI0qGDgyo0H0g0Dqp7rjQgUb7s7XZaDNYO3_uQdI1wV5SHMNFQnhdOWgv69gFrO2N2aHD4_u7092u8X5ix2RShMuTAM6Ob9h391wJ9DxhVOS28XHupO7pgECMQ/s1165/networkCard1.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="682" data-original-width="1165" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNjd6DyhrVIZ1rQlqXHq0iMBpQG3TWD7i-OcmWIL_0Pd9GEtj54DzdpiJNZxr7ylt4HI0qGDgyo0H0g0Dqp7rjQgUb7s7XZaDNYO3_uQdI1wV5SHMNFQnhdOWgv69gFrO2N2aHD4_u7092u8X5ix2RShMuTAM6Ob9h391wJ9DxhVOS28XHupO7pgECMQ/w640-h374/networkCard1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.howtogeek.com/764894/what-is-a-network-adapter/" target="_blank"><span style="font-size: xx-small;">source</span></a></td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><span style="background-color: #f7f7f8; color: #374151; text-align: justify; white-space: pre-wrap;"></span></div><span style="background-color: #f7f7f8; color: #374151; text-align: justify; white-space: pre-wrap;"><br /></span><p></p><span style="color: #374151;"><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">A host bus adapter (HBA) is a hardware component that connects a server or other device to a storage area network (SAN). 
It is responsible for managing the flow of data between the server and the storage devices on the SAN. HBAs typically include a connector for a Fibre Channel or iSCSI cable, as well as the necessary electronics to transmit and receive data over the SAN. They are used to connect servers to storage devices such as disk arrays, tape libraries, and other storage systems.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><br /></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">Common Network Protocols Used in Distributed Storage:
<ol><li><span style="background-color: #f7f7f8; white-space: pre-wrap;"><span>InfiniBand (IB): used for the front-end storage network in the DPC scenario.</span></span></li><li><span style="background-color: #f7f7f8; white-space: pre-wrap;"><span>RoCE: used for the back-end storage network.</span></span></li><li><span style="background-color: #f7f7f8; white-space: pre-wrap;"><span>TCP/IP: used for the service network.</span></span></li></ol></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
Network adapters are used to connect a computer or device to a network, while host bus adapters are used to connect a computer or device to a storage area network. Network adapters are used for data transmission over networks, while host bus adapters are used for data transmission over storage area networks. There are several network adapters that are commonly used in servers, and the best option will depend on the specific needs of the server and the network it will be connecting to. Some popular options include:</span></div><div style="text-align: justify;"><ol><li><span style="background-color: #f7f7f8; white-space: pre-wrap;">Intel Ethernet Converged Network Adapter X520-DA2: This is a 10 Gigabit Ethernet adapter that is designed for use in data center environments. It supports both copper and fiber connections and is known for its high performance and reliability.</span></li><li><span style="background-color: #f7f7f8; white-space: pre-wrap;">Mellanox ConnectX-4 Lx EN: This is another 10 Gigabit Ethernet adapter that is designed for use in data centers. It supports both copper and fiber connections and is known for its low latency and high throughput.</span></li><li><span style="background-color: #f7f7f8; white-space: pre-wrap;">Broadcom BCM57416 NetXtreme-E: This is a 25 Gigabit Ethernet adapter that is designed for use in data centers. It supports both copper and fiber connections and is known for its high performance and reliability.</span></li><li><span style="background-color: #f7f7f8; white-space: pre-wrap;">Emulex LPe1605A: This is a 16 Gbps Fibre Channel host bus adapter (HBA) that is designed for use in storage area networks (SANs). It supports both N_Port ID Virtualization (NPIV) and N_Port Virtualization (NPV) and is known for its high performance and reliability.</span></li></ol><span style="background-color: #f7f7f8; white-space: pre-wrap;"></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
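</span></div><div style="text-align: justify;">On a Linux server, adapters such as the ones listed above can usually be identified from the PCI bus with standard tools (pciutils). The commands below are only a sketch; the class strings matched by grep cover the common cases:</div><div><span style="background-color: #cccccc; font-size: x-small;"># lspci -nn | grep -i ethernet</span></div><div><span style="background-color: #cccccc; font-size: x-small;"># lspci -nn | grep -i "fibre channel"</span></div><div><span style="background-color: #cccccc; font-size: x-small;"># lspci -nn | grep -i infiniband</span></div><div style="text-align: justify;">The first command lists Ethernet NICs, the second lists Fibre Channel HBAs, and the third lists InfiniBand adapters, each with its PCI vendor and device IDs.</div><div><br /></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">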
IBM produces a wide range of servers for various types of environments, here are a few examples of IBM servers:</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8;"><ol style="white-space: pre-wrap;"><li><span>IBM Power Systems: These servers are designed for high-performance computing and big data workloads, and are based on the Power architecture. They support IBM's AIX, IBM i, and Linux operating systems.</span></li><li><span>IBM System x: These servers are designed for general-purpose computing and are based on the x86 architecture. They support a wide range of operating systems, including Windows and Linux.</span></li><li><span>IBM System z: These servers are designed for mainframe computing and support IBM's z/OS and z/VM operating systems.</span></li><li><span>IBM BladeCenter: These servers are designed for blade server environments and support a wide range of operating systems, including Windows and Linux.</span></li><li><span>IBM Storage: These servers are designed for storage and data management workloads, and support a wide range of storage protocols and operating systems.</span></li><li><span>IBM Cloud servers: IBM Cloud servers are designed for cloud-based computing and are based on the x86 architecture. They support a wide range of operating systems, including Windows and Linux.</span></li></ol><span style="white-space: pre-wrap;"><br /></span></span><span style="background-color: #f7f7f8; white-space: pre-wrap;">Emulex Corporation Device e228 is a network adapter produced by Emulex Corporation. It is an Ethernet controller, which means it is responsible for controlling the flow of data packets over an Ethernet network. The Emulex Corporation Device e228 is part of the Emulex OneConnect family of network adapters, which are designed for use in data center environments. These adapters are known for their high performance, low latency, and high throughput. They also provide advanced features such as virtualization support, Quality of Service (QoS) and offloads (TCP/IP, iSCSI, and FCoE) to improve network performance. It supports 10Gbps Ethernet and can be used in both copper and fiber connections. This adapter is typically used in servers and storage systems that require high-speed network connections and advanced features to support data-intensive applications. The "be2net" kernel driver is a Linux device driver that is used to control the Emulex Corporation Device e228 network adapter. A kernel driver is a type of low-level software that interfaces with the underlying hardware of a device, such as a network adapter. It provides an interface between the hardware and the operating system, allowing the operating system to communicate with and control the device. The "be2net" driver is specifically designed to work with the Emulex Corporation Device e228 network adapter, and is responsible for managing the flow of data packets between the device and the operating system. It provides the necessary functionality for the operating system to access the adapter's features and capabilities, such as configuring network settings, monitoring link status and performance, and offloading network processing tasks. The be2net driver is typically included with the Linux operating system and it's loaded automatically when the device is detected. 
It's also available as a separate package, that can be installed and configured manually.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8;"><span style="white-space: pre-wrap;"><br /></span></span><span style="background-color: #f7f7f8; white-space: pre-wrap;">The Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] is a network adapter produced by Mellanox Technologies. It is an Ethernet controller, which means it is responsible for controlling the flow of data packets over an Ethernet network. This adapter is part of the Mellanox ConnectX-5 Ex family of network adapters, which are designed for use in data center environments. These adapters are known for their high performance, low latency, and high throughput. They support 100 Gbps Ethernet, RoCE v2 and InfiniBand protocols and provide advanced features such as virtualization support, Quality of Service (QoS), and offloads to improve network performance. It's worth noting that the Mellanox ConnectX-5 Ex Virtual Function is a specific type of adapter that is designed to be used in virtualized environments. It allows multiple virtual machines to share a single physical adapter, thus providing better flexibility and resource utilization. This adapter is typically used in servers, storage systems, and other high-performance computing devices that require high-speed network connections and advanced features to support data-intensive applications such as big data analytics, machine learning, and high-performance computing.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
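</span></div><div style="text-align: justify;">To see which kernel driver is actually bound to an adapter (for example be2net for the Emulex Device e228, or mlx5_core for ConnectX-5 family adapters) and which driver and firmware versions are in use, the following sketch can be used; eth0 is only a placeholder interface name:</div><div><span style="background-color: #cccccc; font-size: x-small;"># lspci -nnk | grep -iA3 ethernet</span></div><div><span style="background-color: #cccccc; font-size: x-small;"># ethtool -i eth0</span></div><div><span style="background-color: #cccccc; font-size: x-small;"># modinfo be2net | head</span></div><div style="text-align: justify;">lspci -nnk shows the "Kernel driver in use" for each PCI device, ethtool -i reports the driver name, driver version and firmware version of an interface, and modinfo prints details of the driver module itself.</div><div><br /></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">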
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;">The Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] and the Emulex Corporation Device e228 are both network adapters, but there are some key differences between them:
Speed and protocol support: The Mellanox ConnectX-5 Ex supports 100 Gbps Ethernet, RoCE v2 and InfiniBand protocols, while the Emulex Device e228 supports 10 Gbps Ethernet. This means that the Mellanox adapter is capable of higher data transfer speeds and can support multiple protocols for different types of networks.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;">Advanced features: Both adapters offer advanced features such as virtualization support, Quality of Service (QoS), and offloads. However, the Mellanox ConnectX-5 Ex also supports features like hardware-based time stamping, hardware-based packet filtering and dynamic rate scaling.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
Target market: Both adapters are designed for data center environments, but the Mellanox ConnectX-5 Ex is geared more towards high-performance computing and big data analytics, while the Emulex Device e228 is geared more towards general-purpose data center use.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;">Virtualization: Mellanox ConnectX-5 Ex Virtual Function is a specific type of adapter that is designed to be used in virtualized environments, allowing multiple virtual machines to share a single physical adapter, thus providing better flexibility and resource utilization. Emulex Device e228 supports virtualization, but it does not have a specific virtual function version. </span><span style="background-color: #f7f7f8; white-space: pre-wrap;">In summary, the Mellanox ConnectX-5 Ex is a high-speed, high-performance network adapter that offers advanced features and support for multiple protocols, while the Emulex Device e228 is a lower-speed, general-purpose network adapter that is geared towards data center environments.</span></div><div style="text-align: justify;"><span style="white-space: pre-wrap;"><br /></span></div><div style="text-align: justify;"><div><span style="background-color: #f7f7f8; white-space: pre-wrap;">Mellanox Technologies produces networking equipment, including network adapters. Some of the Mellanox adapters that support CNA (Converged Network Adapter) are:</span></div><div><ol><li><span style="background-color: #f7f7f8; white-space: pre-wrap;">Mellanox ConnectX-5 CNA: This adapter supports both Ethernet and Fibre Channel over Ethernet (FCoE) on a single adapter, and provides high-performance, low-latency data transfer.</span></li><li><span style="background-color: #f7f7f8; white-space: pre-wrap;">Mellanox ConnectX-6 CNA: This adapter supports 100 GbE and 200 GbE speeds and provides hardware offloads for RoCE, iWARP and TCP/IP, in addition to supporting FC and FCoE protocols.</span></li><li><span style="background-color: #f7f7f8; white-space: pre-wrap;">Mellanox ConnectX-5 EN CNA: This adapter supports both Ethernet and InfiniBand protocols, providing high-performance, low-latency data transfer for data center and high-performance computing environments.</span></li><li><span style="background-color: #f7f7f8; white-space: pre-wrap;">Mellanox ConnectX-6 Lx CNA: This adapter supports 25 GbE and 50 GbE speeds, and provides hardware offloads for RoCE, iWARP, and TCP/IP, in addition to supporting FC and FCoE protocols.</span></li></ol></div><span style="background-color: #f7f7f8; white-space: pre-wrap;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhO76T3xl6vgW75sB6IIiG8qlOeSdCWWfw-oQIDJloPid0lwWWVjYILV8JFAqhLyJj5C8DJ7QJUsz2rG7vS9TA6-XSl4EsaT0Gi1ebPoTcw2833GHFBnDTxM3ajBrCRR3BNz_cQQky3Ke5TnAjxq9MlgkMe8VO2U_uiKc9hosAN3lImsNa1CjdHB7M3nA/s697/connectx_6_nvidia.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="452" data-original-width="697" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhO76T3xl6vgW75sB6IIiG8qlOeSdCWWfw-oQIDJloPid0lwWWVjYILV8JFAqhLyJj5C8DJ7QJUsz2rG7vS9TA6-XSl4EsaT0Gi1ebPoTcw2833GHFBnDTxM3ajBrCRR3BNz_cQQky3Ke5TnAjxq9MlgkMe8VO2U_uiKc9hosAN3lImsNa1CjdHB7M3nA/w400-h260/connectx_6_nvidia.png" width="400" /></a></div><br /></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><b>Slingshot</b> is a high-performance network fabric developed by the company Cray, now owned by Hewlett Packard Enterprise. It is designed to provide low-latency and high-bandwidth communication between nodes in high-performance computing systems, such as supercomputers and data centers. 
It is based on a packet-switched network architecture, with each node connected to a network switch. It supports a range of network topologies, including fat-tree, hypercube, and dragonfly. The fabric is designed to be scalable, with support for thousands of nodes. It uses a range of advanced features to optimize performance, including adaptive routing, congestion control, and quality-of-service (QoS) mechanisms. It also includes support for features such as remote direct memory access (RDMA) and messaging passing interface (MPI) offload, which can further improve application performance. Overall, Slingshot is designed to provide high-performance, low-latency communication for demanding HPC workloads, making it a popular choice for large-scale scientific simulations, data analytics, and other compute-intensive applications.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><br /></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><b>RDMA Types</b>
As discussed before , there are three types of RDMA networks: Infiniband, RDMA over Converged Ethernet (RoCE), and iWARP.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBi0DBA3nhsaUo7rQMLY_wriQ-T-e7r7oYQ-oVtBHRHxpSFjHwMh16ko0HNWsRcYH3EYo7eLbgD-MElIN1zgbThdU9oYD7nPRumReP9SGzAUsWJKP7NeBqE5-yt9T1slbElMvtMvtb3cwXhqzG_CDe7pk2pgT2Uyeo71ocH27fgEXmV73MXO4w3_atUw/s657/rdma_traditinal4.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="360" data-original-width="657" height="219" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBi0DBA3nhsaUo7rQMLY_wriQ-T-e7r7oYQ-oVtBHRHxpSFjHwMh16ko0HNWsRcYH3EYo7eLbgD-MElIN1zgbThdU9oYD7nPRumReP9SGzAUsWJKP7NeBqE5-yt9T1slbElMvtMvtb3cwXhqzG_CDe7pk2pgT2Uyeo71ocH27fgEXmV73MXO4w3_atUw/w400-h219/rdma_traditinal4.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://support.huawei.com/enterprise/en/doc/EDOC1100203339"><span style="font-size: xx-small;">source</span></a></td></tr></tbody></table>
The InfiniBand network is specially designed for RDMA to ensure reliable transmission at the hardware level. The technology is advanced, but the cost is high. RoCE and iWARP are both Ethernet-based RDMA technologies, which enable RDMA with high speed, ultra-low latency, and extremely low CPU usage to be deployed on the most widely used Ethernet.</span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjNVcqx4FwAtAjpNP1tSd825rIP95Bc4nqy8v7GeZohKQFSjrBTEZ0FwMCpl74zs8Iv-Np8WDz3B5nlkRSw1w1VIhNU7-3W2qjJdeKCdafOGcPbYG7rACLMxRgBTE3VXZWOaQ8twsX2ctM2-z2AyMYN8hVYFOkNRZ88qTCAsXGGQUTENckC5vkAWi23A/s962/traditional-vs-rdma.png" style="margin-left: 1em; margin-right: 1em; text-align: center; white-space: normal;"><img border="0" data-original-height="584" data-original-width="962" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjNVcqx4FwAtAjpNP1tSd825rIP95Bc4nqy8v7GeZohKQFSjrBTEZ0FwMCpl74zs8Iv-Np8WDz3B5nlkRSw1w1VIhNU7-3W2qjJdeKCdafOGcPbYG7rACLMxRgBTE3VXZWOaQ8twsX2ctM2-z2AyMYN8hVYFOkNRZ88qTCAsXGGQUTENckC5vkAWi23A/w400-h243/traditional-vs-rdma.png" width="400" /></a></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><a href="https://core.vmware.com/resource/basics-remote-direct-memory-access-rdma-vsphere#section2" style="text-align: center; white-space: normal;" target="_blank"><span style="font-size: xx-small;">source</span></a></span></div><div style="text-align: justify;"><div>The three RDMA networks have the following characteristics:</div><div><ol><li>InfiniBand: RDMA is considered at the beginning of the design to ensure reliable transmission at the hardware level and provide higher bandwidth and lower latency. However, the cost is high because IB NICs and switches must be supported.</li><li>RoCE: RDMA based on Ethernet consumes less resources than iWARP and supports more features than iWARP. You can use common Ethernet switches that support RoCE NICs.</li><li>iWARP: TCP-based RDMA network, which uses TCP to achieve reliable transmission. Compared with RoCE, on a large-scale network, a large number of TCP connections of iWARP occupy a large number of memory resources. Therefore, iWARP has higher requirements on system specifications than RoCE. 
You can use common Ethernet switches that support iWARP NICs.</li></ol></div></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjE-BNB-09wba2InnV0kyNvh1awO1ulI8eBAL81POEVcgX8X9CxxKVPLIQ4qLMLpb6Cj-CVoKmLwYS3hqmOHKCnaDAgJb3AtO-ER-3w-yJ73TMxlGOlZ8qTpKpM2Ws29t4BVCwsBFSc_VGKCJbFKPwgX1xczdgeXSDEtfsSloTMf2_AfdmpGFvcfvx9pQ/s1377/comparison_ROCE_infi.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="297" data-original-width="1377" height="138" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjE-BNB-09wba2InnV0kyNvh1awO1ulI8eBAL81POEVcgX8X9CxxKVPLIQ4qLMLpb6Cj-CVoKmLwYS3hqmOHKCnaDAgJb3AtO-ER-3w-yJ73TMxlGOlZ8qTpKpM2Ws29t4BVCwsBFSc_VGKCJbFKPwgX1xczdgeXSDEtfsSloTMf2_AfdmpGFvcfvx9pQ/w640-h138/comparison_ROCE_infi.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="background-color: white; color: #494949; font-family: LT_regular, Arial; font-size: 14px; text-align: left;"><a href="https://support.huawei.com/enterprise/en/doc/EDOC1100203339" target="_blank">Comparison between RoCE and InfiniBand</a></span></td></tr></tbody></table><br /></span></div><div style="text-align: justify;"><b style="background-color: #f7f7f8; white-space: pre-wrap;">Infiniband is a high-performance, low-latency interconnect technology</b><span style="background-color: #f7f7f8; white-space: pre-wrap;"> used to connect servers, storage, and other data center equipment. It uses a switched fabric topology and supports both data and storage traffic. InfiniBand adapters are specialized network interface cards (NICs) that are designed to work with InfiniBand networks. Here are a few examples of InfiniBand adapters:</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><div style="white-space: normal;"><span style="white-space: pre-wrap;">
</span><span style="white-space: pre-wrap;">Mellanox ConnectX-4/5: These adapters support both 40 Gb/s and 100 Gb/s InfiniBand and provide high-performance, low-latency data transfer for data center and high-performance computing environments.</span></div><div style="white-space: normal;"><span style="white-space: pre-wrap;">
Mellanox ConnectX-6: This adapter supports up to 200 Gb/s (HDR) InfiniBand as well as 200 Gb/s Ethernet, providing hardware offloads for RDMA (RoCE) and TCP/IP; it does not implement iWARP or native Fibre Channel.</span></div><div style="white-space: normal;"><span style="white-space: pre-wrap;">
</span><span style="white-space: pre-wrap;">Intel Omni-Path Architecture (OPA) 100 Series: This adapter supports the 100 Gb/s Omni-Path fabric (a distinct interconnect from InfiniBand, though often compared with it) and provides high-performance, low-latency data transfer for data center and high-performance computing environments.</span></div><div style="white-space: normal;"><span style="white-space: pre-wrap;">
QLogic InfiniPath HTX: This older HyperTransport-based adapter supports 10 Gb/s (SDR) InfiniBand and provides low-latency data transfer for high-performance computing environments.</span></div><div style="white-space: normal;"><span style="white-space: pre-wrap;">
</span><span style="white-space: pre-wrap;">Mellanox ConnectX-4 Lx: This is an Ethernet-only adapter (10/25/40/50 GbE) rather than an InfiniBand adapter; it provides hardware offloads for RoCE and TCP/IP and is aimed at cost-sensitive data center deployments.</span></div><div><span style="white-space: pre-wrap;"><br /></span></div></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
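On a Linux host you can check which of these adapters is present and what link rate each port has actually negotiated: the kernel exposes this under /sys/class/infiniband. Below is a minimal C sketch that reads the rate attribute; the default device name mlx5_0 is only an example, so replace it with whatever ls /sys/class/infiniband (or the ibv_devices utility) reports on your system.
-----------------------------------
/* ib_rate.c - print the negotiated rate of an InfiniBand/RoCE port.
 * Build: gcc -o ib_rate ib_rate.c
 * The default device name "mlx5_0" is only an example. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *dev  = (argc > 1) ? argv[1] : "mlx5_0";   /* example device name */
    const char *port = (argc > 2) ? argv[2] : "1";
    char path[256], line[128];

    snprintf(path, sizeof(path), "/sys/class/infiniband/%s/ports/%s/rate", dev, port);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return EXIT_FAILURE; }

    if (fgets(line, sizeof(line), f))        /* e.g. "100 Gb/sec (4X EDR)" */
        printf("%s port %s rate: %s", dev, port, line);
    fclose(f);
    return EXIT_SUCCESS;
}
-----------------------------------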
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;"><b>RoCE (RDMA over Converged Ethernet)</b> is a network protocol that enables low-latency, high-throughput data transfer over standard Ethernet. It is built on Remote Direct Memory Access (RDMA), which lets applications on clustered servers and storage arrays move data directly between each other's memory without involving the host CPU. Because RoCE runs over ordinary Ethernet networks and devices, it is simpler to set up and manage than RDMA over a dedicated InfiniBand fabric. RoCE is designed for data center environments and is particularly well suited to high-performance computing and big data analytics applications that require high-speed, low-latency data transfer. Some features of RoCE are:</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
<ul><li><span>Low-latency: RoCE allows for very low-latency data transfer, which is critical for high-performance computing and big data analytics applications.</span></li><li><span>High-throughput: RoCE allows for high-bandwidth data transfer, which is necessary for handling large amounts of data.</span></li><li><span>RDMA support: RoCE is based on the RDMA protocol, which allows for direct memory access over a network, resulting in low-latency and high-bandwidth data transfer.</span></li><li><span>Converged Ethernet: RoCE uses standard Ethernet networks and devices, making it simpler to set up and manage than traditional RDMA over InfiniBand.</span></li><li><span>Quality of Service (QoS) support: RoCE can provide Quality of Service (QoS) features, which allow guaranteed bandwidth and low latency for critical applications.</span></li><li><span>Virtualization support: RoCE can be used with virtualized environments, allowing multiple virtual machines to share a single physical adapter, thus providing better flexibility and resource utilization.</span></li></ul></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><b>RoCE Overview</b>
RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory-to-memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by the hardware which enables lower latency, higher throughput, and better performance compared to software-based protocols.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><br /></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><b>Infiniband RDMA to RoCE : </b></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
Both InfiniBand RDMA and RoCE implement remote memory access over the network. Each currently has its own advantages and disadvantages, and both are used in HPC cluster architectures and large-scale data centers.
Comparing the two, InfiniBand generally delivers better performance, but it is a dedicated network technology: it cannot reuse an organization's existing IP/Ethernet infrastructure, tooling, and operational experience, which drives up the cost of operation and maintenance. Carrying RDMA over conventional Ethernet is therefore a natural path for deploying RDMA at scale. To preserve RDMA performance while still allowing routing at the network layer, many data center switches and NICs support RoCEv2 for carrying high-performance distributed applications.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
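One practical consequence of this convergence is that the programming interface is the same on either fabric: applications use the verbs API from libibverbs (part of the rdma-core package) whether the device underneath speaks InfiniBand or RoCE. The following minimal sketch simply enumerates the RDMA-capable devices the verbs layer can see; build it with gcc list_rdma_devs.c -libverbs.
-----------------------------------
/* list_rdma_devs.c - enumerate RDMA (verbs) devices, whether InfiniBand or RoCE.
 * Build: gcc -o list_rdma_devs list_rdma_devs.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) { perror("ibv_get_device_list"); return 1; }

    printf("Found %d RDMA device(s)\n", num);
    for (int i = 0; i < num; i++)
        printf("  %s (node GUID 0x%016llx, network byte order)\n",
               ibv_get_device_name(list[i]),
               (unsigned long long)ibv_get_device_guid(list[i]));

    ibv_free_device_list(list);
    return 0;
}
-----------------------------------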
<b>CNA (Converged Network Adapter) </b>is a type of network adapter that supports multiple protocols, such as Ethernet and Fibre Channel over Ethernet (FCoE), on a single adapter. A CNA typically includes both a NIC and a Host Bus Adapter (HBA) <b>to support both data and storage traffic</b>. </span><span style="background-color: #f7f7f8; white-space: pre-wrap;">When using a CNA with SRIOV (Single Root I/O Virtualization) and ROCE (RDMA over Converged Ethernet), multiple virtual functions (VFs) can be created on the CNA, each with its own MAC address, VLAN ID, and other network attributes. Each VF can be assigned to a different virtual machine (VM) or a container, and each VM or container can have its own network configuration and parameters.
Each VF can be configured to support ROCE, allowing for low-latency, high-throughput data transfer over Ethernet networks. This can be particularly useful in high-performance computing and big data analytics environments, where low-latency and high-bandwidth data transfer is critical.</span></div><div style="text-align: justify;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIOo4m0WkDtm2fUic4cRGG5GndhLaLkkkfh_8q_rQuIG4nwNhR_G4p_MdDZY7HL_ECkJ9rDUCcNf33XuYPZKksUOaedFpv9QnPYrGabmJPd9svz0k6DQ2T-MSmznpa7QuyAe2UzuZAhInL-3Q-M6Ak_04Ctn4UjOXu8H_Z4_LyH-UKYNhPecKioqjglQ/s867/converged_network_adapter.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="542" data-original-width="867" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIOo4m0WkDtm2fUic4cRGG5GndhLaLkkkfh_8q_rQuIG4nwNhR_G4p_MdDZY7HL_ECkJ9rDUCcNf33XuYPZKksUOaedFpv9QnPYrGabmJPd9svz0k6DQ2T-MSmznpa7QuyAe2UzuZAhInL-3Q-M6Ak_04Ctn4UjOXu8H_Z4_LyH-UKYNhPecKioqjglQ/w640-h400/converged_network_adapter.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://en.vcenter.ir/network/converged-network-adapter-cna/" target="_blank"><span face=""Open Sans", sans-serif" style="background-color: white; color: #444444; font-size: 15px; text-align: start;">converged network </span><span face="Open Sans, sans-serif" style="color: #04a3ed;"><span style="background-color: white; border-color: initial; border-image: initial; border-style: initial; box-sizing: border-box; font-size: 15px; text-align: start; transition-duration: 0.3s; transition-property: all; transition-timing-function: ease-in-out;">adapter</span></span><span face=""Open Sans", sans-serif" style="background-color: white; color: #444444; font-size: 15px; text-align: start;"> (CNA)</span></a></td></tr></tbody></table></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;"><b>SRIOV with ROCE on a CNA </b>can provide the following benefits:
Improved resource utilization: By allowing multiple VMs or containers to share a single physical adapter, SRIOV with ROCE on a CNA can improve resource utilization and reduce costs.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;">Improved network performance: ROCE allows for low-latency, high-throughput data transfer over Ethernet networks, which can improve network performance in high-performance computing and big data analytics environments.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
Fine-grained control of network resources: SRIOV with ROCE on a CNA allows for fine-grained control of network resources, allowing each VM or container to have its own network configuration.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGyEIB3CEN8xKiU4zXnBysyPmOtvkeT4wq8GQyrT_iv4DLGGqOxZuLTqLPsHpaZAhdk4wzFWxZWqUfCg61hg8_6v9TbkY1T7G8scJuXDRXviA5vuRJvgKs3NiKxX9xSWQtahB9mauMoBaEweeV1CkJ2eOX7Ohq3oI0ByFbhRW-J7alu5cnScQ-SKm6nw/s709/SRIOV.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="709" data-original-width="446" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGyEIB3CEN8xKiU4zXnBysyPmOtvkeT4wq8GQyrT_iv4DLGGqOxZuLTqLPsHpaZAhdk4wzFWxZWqUfCg61hg8_6v9TbkY1T7G8scJuXDRXviA5vuRJvgKs3NiKxX9xSWQtahB9mauMoBaEweeV1CkJ2eOX7Ohq3oI0ByFbhRW-J7alu5cnScQ-SKm6nw/w402-h640/SRIOV.png" width="402" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><h3 class="storytitle" style="background-color: white; color: sienna; font-family: "Trebuchet MS", Verdana, "Courier New", sans-serif; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px; padding: 0px 15px; text-align: left;"><span style="font-size: xx-small;"><a href="https://infrastructureadventures.wordpress.com/2010/12/08/io-virtualization-overview-cna-sr-iov-vn-tag-and-vepa/" target="_blank">I/O Virtualization Overview: CNA, SR-IOV</a></span></h3></td></tr></tbody></table>
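The VFs themselves are created through the PCI device's standard sysfs interface; once a VF is assigned to a VM or container, the adapter driver exposes the RoCE capability on it. Below is a minimal sketch of that first step. The sriov_totalvfs and sriov_numvfs attributes are standard kernel files, but the interface name ens1f0 and the VF count are only examples; run it as root, and note that a non-zero VF count usually has to be reset to 0 before it can be changed.
-----------------------------------
/* set_vfs.c - enable N SR-IOV virtual functions on a physical NIC (run as root).
 * Build: gcc -o set_vfs set_vfs.c
 * Usage: ./set_vfs ens1f0 4     (interface name and VF count are examples) */
#include <stdio.h>
#include <stdlib.h>

static int read_int(const char *path)
{
    int v = -1;
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%d", &v); fclose(f); }
    return v;
}

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s <iface> <num_vfs>\n", argv[0]); return 1; }
    char path[256];

    /* How many VFs can this adapter expose at most? */
    snprintf(path, sizeof(path), "/sys/class/net/%s/device/sriov_totalvfs", argv[1]);
    printf("%s supports up to %d VFs\n", argv[1], read_int(path));

    /* Request the VFs; the driver creates one PCI function per VF.
     * If a non-zero count is already set, write 0 first before changing it. */
    snprintf(path, sizeof(path), "/sys/class/net/%s/device/sriov_numvfs", argv[1]);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fprintf(f, "%s\n", argv[2]);
    fclose(f);
    printf("Requested %s VFs on %s\n", argv[2], argv[1]);
    return 0;
}
-----------------------------------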
</span><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbDM00Xp1WAib165CPxgNP5f8YreM0NycYjb-2UhYPlwZ2kOPwBmJKKHlV1_6uorAWsE9DXCfDPpZWIhDksclaskCgQ-tHg07yxTo8IkWed2VnzAfHBFLREMeO3fs_aXPGvNjNVCP22HWtPQYPzEj6joun9nqv4bnFiZ_xNlO18ofrLKmM59JstHC3zw/s1027/table.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="440" data-original-width="1027" height="274" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbDM00Xp1WAib165CPxgNP5f8YreM0NycYjb-2UhYPlwZ2kOPwBmJKKHlV1_6uorAWsE9DXCfDPpZWIhDksclaskCgQ-tHg07yxTo8IkWed2VnzAfHBFLREMeO3fs_aXPGvNjNVCP22HWtPQYPzEj6joun9nqv4bnFiZ_xNlO18ofrLKmM59JstHC3zw/w640-h274/table.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span face=""Open Sans", Arial, Helvetica, sans-serif" style="background-color: white; color: #19191a; font-size: 14px; text-align: justify;"><a href="https://community.fs.com/blog/roce-vs-infiniband-vs-tcp-ip.html" target="_blank">Differences between RoCE, Infiniband RDMA, and TCP/IP.</a></span></td></tr></tbody></table>
The fastest network adapter available today depends on the specific application and the network infrastructure. Generally, there are different types of network adapters that support different speeds and protocols, and each one is suitable for different use cases.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;">For example, in data center environments, 100 GbE (Gigabit Ethernet) adapters are currently among the fastest options, providing high-bandwidth, low-latency data transfer. These adapters use technologies such as SFP28 and QSFP28 connectors and support both copper and fiber cabling. Mellanox ConnectX-6 and Marvell FastLinQ are examples of 100 GbE adapters (Intel's Omni-Path is a separate 100 Gb/s fabric rather than Ethernet).</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
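Whichever Ethernet adapter is installed, the speed it has actually negotiated can be read back from sysfs (the same value ethtool reports). A small sketch, with eth0 used purely as an example interface name:
-----------------------------------
/* link_speed.c - print the negotiated speed of an Ethernet interface in Mb/s.
 * Build: gcc -o link_speed link_speed.c */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *iface = (argc > 1) ? argv[1] : "eth0";   /* example interface name */
    char path[256];
    long speed = -1;

    snprintf(path, sizeof(path), "/sys/class/net/%s/speed", iface);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }
    fscanf(f, "%ld", &speed);                /* e.g. 100000 for a 100 GbE link */
    fclose(f);

    printf("%s link speed: %ld Mb/s\n", iface, speed);
    return 0;
}
-----------------------------------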
For High-Performance Computing (HPC) and Artificial Intelligence (AI) applications, InfiniBand adapters are considered the fastest option, providing low-latency, high-bandwidth data transfer. Mellanox ConnectX-6 HDR adapters support 200 Gb/s InfiniBand; Intel's OPA 100 series, by contrast, is a 100 Gb/s Omni-Path fabric rather than InfiniBand.
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;">For storage, Fibre Channel (FC) and Fibre Channel over Ethernet (FCoE) adapters are the usual high-performance option, providing low-latency, high-bandwidth data transfer. Emulex Gen 6 and QLogic Gen 6 (32 Gb/s) Fibre Channel HBAs are examples of these adapters.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
Supercomputer systems like Summit and Sierra, operated by Oak Ridge National Laboratory and Lawrence Livermore National Laboratory respectively, use a high-performance interconnect technology called InfiniBand for their internal communication. Mellanox Technologies provides the InfiniBand adapters for these supercomputers.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">Summit and Sierra use Mellanox EDR 100 Gb/s InfiniBand, with dual-rail ConnectX-5 generation host channel adapters in each node. These adapters are designed to handle the high-bandwidth and low-latency requirements of large-scale supercomputing applications.
The Mellanox ConnectX-4 Lx, sometimes mentioned alongside these HCAs, is a different product: an Ethernet-only adapter supporting 10/25/40/50 GbE with hardware offloads for RoCE and TCP/IP; it is not an InfiniBand or Fibre Channel host bus adapter.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;">Frontier is an exascale supercomputer developed for Oak Ridge National Laboratory by HPE Cray; it debuted as the world's fastest supercomputer in 2022. Frontier uses a high-performance interconnect technology called Slingshot, developed by Cray (now part of HPE), for its internal communication. Slingshot is a next-generation interconnect that provides low-latency, high-bandwidth, high-message-rate data transfer.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">In terms of network adapters, Frontier does not use Mellanox InfiniBand: its nodes are equipped with HPE Slingshot network interface cards, which present an Ethernet-compatible edge while using HPE's own switch and NIC silicon.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
</span><span style="background-color: #f7f7f8; white-space: pre-wrap;">A host bus adapter (HBA) is a specialized type of network interface card (NIC) that connects a host computer to a storage area network (SAN). HBAs provide a bridge between the host computer and the storage devices, allowing the host to access and manage the storage devices as if they were locally attached.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">
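On Linux, every FC HBA port the kernel has discovered appears as a directory under /sys/class/fc_host. The short sketch below walks that directory and prints each port's WWPN, state and negotiated speed; the attribute names are the standard fc_host ones and no particular HBA model is assumed.
-----------------------------------
/* fc_hosts.c - list Fibre Channel HBA ports with their WWPN, state and speed.
 * Build: gcc -o fc_hosts fc_hosts.c */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

static void print_attr(const char *host, const char *attr)
{
    char path[512], line[128] = "n/a\n";
    snprintf(path, sizeof(path), "/sys/class/fc_host/%s/%s", host, attr);
    FILE *f = fopen(path, "r");
    if (f) { fgets(line, sizeof(line), f); fclose(f); }
    printf("  %-12s %s", attr, line);
}

int main(void)
{
    DIR *d = opendir("/sys/class/fc_host");
    if (!d) { perror("/sys/class/fc_host (no FC HBA present?)"); return 1; }

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strncmp(e->d_name, "host", 4) != 0)   /* skip "." and ".." */
            continue;
        printf("%s:\n", e->d_name);
        print_attr(e->d_name, "port_name");       /* the WWPN */
        print_attr(e->d_name, "port_state");
        print_attr(e->d_name, "speed");           /* e.g. "16 Gbit" */
    }
    closedir(d);
    return 0;
}
-----------------------------------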
Here are a few key things to know to get familiar with storage HBAs:
<ul><li><span>Protocols: HBAs support different storage protocols such as Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), and iSCSI. FC and FCoE are commonly used in enterprise environments, while iSCSI is more commonly used in smaller, SMB environments.
Speed: HBAs are available in different speeds, such as 8 Gb/s, 16 Gb/s, and 32 Gb/s. Higher speeds provide faster data transfer and improved performance.
</span></li><li><span>Multi-Path Support: HBAs often support multi-path I/O, which allows multiple paths to the storage devices to be used for failover and load balancing.
Compatibility: HBAs are designed to work with specific operating systems, so it's important to check the compatibility of the HBA with the operating system you are using.
</span></li><li><span>Management and Monitoring: Many HBAs include management and monitoring software that allows administrators to view and configure the HBA's settings, such as Fibre Channel zoning, and to monitor the performance of the HBA and the storage devices it is connected to.
Driver and Firmware: HBAs require a driver and firmware to work properly, so it's important to ensure that the HBA has the latest driver and firmware updates installed.
</span></li><li><span>Vendor Support: It's important to consider the vendor support of the HBA, as well as the warranty and technical support options available, as these can be critical factors when choosing an HBA.
Architecture: Some HBAs are based on an ASIC (Application-Specific Integrated Circuit) while others use an FPGA (Field-Programmable Gate Array); each approach has its own pros and cons.</span></li></ul></span><span style="background-color: #f7f7f8; white-space: pre-wrap;">Power10 is the latest generation of IBM's Power Architecture, designed for high-performance computing and big data workloads and intended to deliver significant performance and efficiency improvements over its predecessor, Power9. </span><span style="background-color: #f7f7f8; white-space: pre-wrap;">Some of the key features of the Power10 architecture include:
<ul><li><span>Higher core count: Power10 processors have a higher core count than Power9 processors, which allows for more parallel processing and improved performance.
</span></li><li><span>Improved memory bandwidth: Power10 processors have more memory bandwidth than Power9 processors, which allows for faster data transfer between the processor and memory.
Enhanced security features: Power10 processors include enhanced security features, such as hardware-enforced memory encryption and real-time threat detection, to protect against cyber-attacks.
</span></li><li><span>Improved energy efficiency: Power10 processors are designed to be more energy efficient than Power9 processors, which can help to reduce power consumption and cooling costs.
Optimized for AI workloads: Power10 processors are optimized for AI workloads and have better support for deep learning and other AI-related tasks.
</span></li><li><span>More flexible and open: the Power10 architecture supports more operating systems and offers more open interfaces and standard protocols for connecting to other devices. (Note that the IBM Power System AC922, often cited in HPC contexts, is a Power9-based server; Power10-based servers include the IBM Power E1080.)
</span></li></ul></span><span style="background-color: #f7f7f8; white-space: pre-wrap;"></span><span style="background-color: #f7f7f8; white-space: pre-wrap;"></span><span style="background-color: #f7f7f8; white-space: pre-wrap;">AI workloads refer to tasks that involve the use of artificial intelligence and machine learning algorithms, such as:</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><ul><li><span>Natural Language Processing (NLP): This includes tasks such as speech recognition, text-to-speech, and machine translation.
</span></li><li><span>Computer Vision: This includes tasks such as image recognition, object detection, and facial recognition.
Predictive analytics: This includes tasks such as forecasting, anomaly detection, and fraud detection.
</span></li><li><span>Robotics: This includes tasks such as navigation, object manipulation, and decision making.
Recommender Systems: This includes tasks such as personalized product recommendations, content recommendations, and sentiment analysis.
</span></li><li><span>Generative Models: These include tasks such as image and video generation, text generation and music generation.
Reinforcement learning: These include tasks such as game playing, decision making and control systems.
</span></li><li><span>Deep Learning: These include tasks such as Image and speech recognition, natural language processing and predictive analytics.
</span></li></ul></span></div></span><span style="color: #374151;"><div style="text-align: justify;"><br /></div></span><span style="color: #374151;"><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">These are just a few examples of AI workloads, there are many more possible applications of AI in various industries such as healthcare, finance, transportation, and manufacturing. As AI technology continues to advance, new possibilities for AI workloads will continue to emerge.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><br /></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">OpenMPI and UCX are both middleware for high-performance computing, but they are not directly connected to adapter design. However, they can utilize hardware-specific features and optimizations of network adapters to improve performance.</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><br /></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">MPI (Message Passing Interface) and AI (Artificial Intelligence) are interrelated because MPI can be used to distribute the computational workload of AI applications across multiple nodes in a distributed computing environment. Many AI algorithms, such as deep learning, machine learning, and neural networks, require a significant amount of computational resources, memory, and data storage. These algorithms can be parallelized and run in a distributed environment using MPI, which allows them to take advantage of the computing power of multiple nodes. MPI can be used to distribute the data and the workload of AI applications across multiple nodes, enabling parallel processing and reducing the time required to complete the computation. This can significantly improve the performance of AI applications and enable researchers to train and optimize more complex models. Moreover, MPI can be integrated with other libraries, such as OpenMP, CUDA, and UCX, to further improve the performance of AI applications. For example, CUDA is a parallel computing platform that enables programmers to use GPUs (Graphics Processing Units) for general-purpose processing, and MPI can be used to distribute the workload across multiple GPUs and nodes. In summary, MPI provides a scalable and efficient way to distribute the computational workload of AI applications across multiple nodes, enabling researchers and developers to build and run more complex models and achieve better performance.
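As a concrete illustration, data-parallel training usually keeps a replica of the model on every MPI rank, lets each rank compute gradients on its own shard of the data, and then averages those gradients with MPI_Allreduce before the next update. A minimal sketch of that communication step (the gradient values here are invented purely for the example):
-----------------------------------
/* allreduce_avg.c - average per-rank "gradients" with MPI_Allreduce,
 * the core communication step of data-parallel training.
 * Build: mpicc -o allreduce_avg allreduce_avg.c
 * Run:   mpirun -np 4 ./allreduce_avg */
#include <stdio.h>
#include <mpi.h>

#define NPARAMS 8   /* toy model size */

int main(int argc, char **argv)
{
    int rank, size;
    double grad[NPARAMS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pretend each rank computed a gradient on its own shard of the data. */
    for (int i = 0; i < NPARAMS; i++)
        grad[i] = rank + 0.1 * i;

    /* Sum the gradients across all ranks in place, then divide to average. */
    MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++)
        grad[i] /= size;

    if (rank == 0)
        printf("averaged grad[0] = %f (over %d ranks)\n", grad[0], size);

    MPI_Finalize();
    return 0;
}
-----------------------------------
The same pattern scales to real models by enlarging the buffer or issuing one Allreduce per layer; the MPI library decides how to map the reduction onto the underlying interconnect.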
The choice of MPI communication method that is best suited for AI workloads depends on the specific characteristics of the workload and the system architecture. However, some general guidelines can help in selecting the appropriate MPI communication method for AI workloads. For AI workloads that involve large amounts of data, non-blocking point-to-point communication and collective communication methods are generally preferred. Non-blocking point-to-point communication methods, such as MPI_Isend and MPI_Irecv, allow the application to continue processing while the communication is in progress, which can help reduce the overall communication time. Collective communication methods, such as MPI_Allreduce and MPI_Allgather, can also be highly effective in AI workloads, as they enable efficient data sharing and synchronization among multiple nodes. These methods can be used to distribute the workload of an AI application across multiple nodes, enabling parallel processing and reducing the time required to complete the computation. Additionally, the choice of MPI communication method may also depend on the underlying system architecture. For example, on a system with a high-speed interconnect, such as InfiniBand, the use of MPI communication methods that take advantage of the RDMA (Remote Direct Memory Access) capabilities of the interconnect, such as UCX, can provide significant performance benefits. The best MPI communication method for AI workloads depends on the specific characteristics of the workload and the system architecture. However, non-blocking point-to-point communication and collective communication methods are generally preferred for AI workloads that involve large amounts of data, and the use of RDMA-enabled MPI communication methods can provide significant performance benefits on high-speed interconnects.
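A short sketch of the non-blocking pattern described above: each rank posts MPI_Irecv and MPI_Isend to exchange a buffer with its ring neighbours, performs some local work while the transfers are in flight, and only then waits for completion with MPI_Waitall. The ring topology is just a device to keep the example self-contained:
-----------------------------------
/* isend_overlap.c - overlap local computation with non-blocking point-to-point MPI.
 * Build: mpicc -o isend_overlap isend_overlap.c
 * Run:   mpirun -np 4 ./isend_overlap */
#include <stdio.h>
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf[N], recvbuf[N], local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;            /* ring neighbours */
    int left  = (rank - 1 + size) % size;

    for (int i = 0; i < N; i++)
        sendbuf[i] = rank;

    /* Post the receive and the send, then keep computing while they progress. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int i = 0; i < N; i++)               /* "useful work" overlapped with comms */
        local += 0.5 * i;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received from rank %d (first element %.0f), local=%.1f\n",
           rank, left, recvbuf[0], local);

    MPI_Finalize();
    return 0;
}
-----------------------------------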
</span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;"><br /></span></div><div style="text-align: justify;"><span style="background-color: #f7f7f8; white-space: pre-wrap;">The mapping of adapters in supercomputers and network adapters is an important aspect of designing and building a supercomputer. In general, supercomputers use high-performance network adapters that can handle large amounts of data at high speeds, with low latency and high bandwidth. The choice of network adapter depends on the specific requirements of the supercomputer, such as the type and size of data being processed, the number of nodes in the system, and the desired performance characteristics. Some of the common network adapters used in supercomputers include InfiniBand adapters, Ethernet adapters, and Omni-Path adapters. These adapters are typically integrated with the server hardware, either as separate network interface cards (NICs) or as part of the motherboard design. </span>These adapters provide low-latency, high-bandwidth interconnects between nodes in a cluster, enabling parallel computing and large-scale data processing. In addition to high-performance interconnects, HPC also relies on specialized hardware accelerators like GPUs, FPGAs, and ASICs to offload compute-intensive tasks from the CPU and improve overall system performance. These accelerators are often used in combination with high-performance network adapters to enable faster data transfer and processing in HPC environments.</div></span>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-90395374152151046602023-02-23T20:13:00.000+05:302023-02-23T20:13:39.115+05:30IBM offloaded Watson Health assets to investment firm Francisco Partners<p style="text-align: justify;">IBM Watson is a question-answering computer system capable of answering questions posed in natural language, developed in IBM's DeepQA project by a research team led by principal investigator David Ferrucci. Watson was named after IBM's founder and first CEO, industrialist Thomas J. Watson. IBM’s then- CEO Ginni Rometty called the project a “moon shot,” but her replacement was less enthused about the business. The computer system was initially developed to answer questions on the quiz show Jeopardy!.</p><p style="text-align: justify;">IBM launched Watson Health in early 2015 and made a series of acquisitions that cost $4 billion. They included Merge Healthcare, Truven Health Analytics, Phytel, and Explorys. IBM sold Watson Health for $1B, which is 25% of what it paid to acquire four strong businesses. The assets involved include Health Insights, MarketScan, Clinical Development, Social Program Management, Micromedex, and imaging software products. IBM offloaded Watson Health this year because it doesn't have the requisite vertical expertise in the healthcare sector.</p><p style="text-align: justify;">Talking at stock market analyst Bernstein's 38th Annual Strategic Decisions Conference, the big boss was asked to outline the context for selling the healthcare data and analytics assets of the business to private equity provider Francisco Partners for $1 billion in January.</p><p><b> The Watson brand will be carrier for AI.</b></p><p style="text-align: justify;">It's a question of verticals versus horizontals. IBM believes that they are best positioned to take these technologies.They will always have an industry lens but through their consulting team. 
They want to work on technologies that are horizontal across all industries."</p><p style="text-align: justify;">Verticals should belong to people who really have all of the domain expertise, they have credibility in that vertical. And healthcare companies, people in medical devices, they will have the credibility to carry out how AI is applied to health in depth i.e AI as applied to healthcare, to financial services, to compliance, in that case, regulatory compliance, is going to be a massive market.</p><p style="text-align: justify;">To succeed in health, they need doctors and nurse practitioners to speak to the buyers of Watson Health. That's not the IBM go-to-market field force, so there's a misalignment. Ditto in Promontory, that needed ex-regulators and accountants to go talk to people worrying about financial compliance. So, that's a little bit different than IBM. IBM still sells Watson solutions in financial services, advertising, business automation, and video streaming and hosting. As for AI in the enterprise where inflation, labor costs and the world undergoing a "demographic shift" means that "there are fewer people with the skills" and so AI and automation will be "applied to more and more domains." Trend is going to reverse in the next few decades."</p><p style="text-align: justify;">IBM’s Watson Health is being sold for parts. The technology giant has agreed to sell the division’s data and analytics assets to private equity firm Francisco Partners. Terms weren’t disclosed, although Bloomberg values the deal at more than $1 billion. Launched in 2015, Watson Health’s goal was to revolutionize medicine through AI. After years of pricey expansion — it spent more than $4 billion on acquisitions, per Axios — and reports of ineffectiveness, the unit scaled back its ambitions.</p><p style="text-align: justify;">Once viewed as a flagship of AI in medcine and life science, IBM Watson Health couldn't live up to its ambitious promises to transform everything from drug discovery to cancer care. It would be interesting to see how the new firm who bought this giant from IBM will transform its data and analytic assets and realize their full potential.</p><p><br /></p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-46447640877209788262023-02-23T19:54:00.001+05:302023-02-23T19:54:38.409+05:30BIG MPI<p style="text-align: justify;">In order to describe a structured region of memory, the routines in the MPI standard use a (count, datatype) pair. The C specification for this convention uses an int type for the count. Since C int types are nearly always 32 bits large and signed, counting more than 2 power 31 elements poses a challenge. Instead of changing the existing MPI routines, and all consumers of those routines, the MPI Forum asserts that users can build up large datatypes from smaller types. To evaluate this hypothesis and to provide a user-friendly solution to the large-count issue, we have developed BigMPI, a library on top of MPI that maps large count MPI-like functions to MPI-3 standard features. 
BigMPI demonstrates a way to perform such a construction, reveals shortcomings of the MPI standard, and uncovers bugs in MPI implementations</p><p style="text-align: justify;">References:</p><p style="text-align: justify;">https://www.mcs.anl.gov/papers/P5210-1014.pdf</p><p style="text-align: justify;">https://github.com/jeffhammond/BigMPI</p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-42598022513644588322023-02-23T19:50:00.003+05:302023-02-25T09:44:18.503+05:30MPI [ Message Passing Interfaces] - behind the scenes <p style="text-align: justify;">Parallel computing is accomplished by splitting up large and complex tasks across multiple processors. In order to organize and orchestrate parallel processing, our program must consider automatically decomposing the problem at hand and allowing the processors to communicate with each other when necessary while performing their work. This introduces a new overhead, the synchronization and the communication itself.</p><p style="text-align: justify;">Computing parallelism can be roughly classified as Distributed Memory (DM) or Shared Memory(SM) class. In Distributed Memory (DM), each processor has its own memory which are connected through a network that can exchange data, thus, limiting the DM performance and scalability. In Shared</p><p style="text-align: justify;">Memory (SM), each processor can access all of the memory, resulting in automatic distribution of proce durally iterations over several processors - autoparallelization, explicit distribution of work over the pro cessors by compiler directives or function calls to threading libraries. If this overhead is not accounted. It can create several issues like bottlenecks in the parallel computer design and load imbalances.</p><p style="text-align: justify;">MPI is an API for message passing between entities with separated memory spaces - processes. The standard doesn't care where those processes run - it could be on networked computers (clusters), it could be on a big shared memory machine or it could be any other architecture that provide the same semantics (e.g. IBM Blue Gene)</p><p style="text-align: justify;">OpenMPI is a widely used message passing interface (MPI) library for parallel computing. It provides an abstraction layer that allows application developers to write parallel code without worrying about the underlying hardware details. However, OpenMPI also provides support for hardware-specific optimizations, including those for network adapters. For example, it supports the use of high-speed interconnects such as InfiniBand and RoCE, and it can take advantage of hardware offload capabilities such as Remote Direct Memory Access (RDMA).</p><p style="text-align: justify;">UCX (Unified Communication X) is another middleware library for communication in distributed systems. It is designed to be highly scalable and to support a wide range of hardware platforms, including network adapters. UCX provides a portable API that allows applications to take advantage of hardware-specific features of network adapters, such as RDMA and network offloading. UCX can also integrate with other system-level libraries such as OpenMPI and hwloc to optimize performance on specific hardware configurations.</p><p style="text-align: justify;">Hwloc (Hardware Locality) is a library for topology discovery and affinity management in parallel computing. 
It provides a portable API for discovering the hierarchical structure of the underlying hardware, including network adapters, and it allows applications to optimize performance by binding threads and processes to specific hardware resources. Hwloc can be used in conjunction with OpenMPI and UCX to optimize communication and data movement on high-performance computing systems.</p><p style="text-align: justify;">TCP/IP is a family of networking protocols. IP is the lower-level protocol that's responsible for getting packets of data from place to place across the Internet. TCP sits on top of IP and adds virtual circuit/connection semantics. With IP alone you can only send and receive independent packets of data that are not organized into a stream or connection. It's possible to use virtually any physical transport mechanism to move IP packets around. For local networks it's usually Ethernet, but you can use anything. There's even an RFC specifying a way to send IP packets by carrier pigeon.</p><p style="text-align: justify;">Sockets is a semi-standard API for accessing the networking features of the operating system. Your program can call various functions named socket, bind, listen, connect, etc., to send/receive data, connect to other computers, and listen for connections from other computers. You can theoretically use any family of networking protocols through the sockets API--the protocol family is a parameter that you pass in--but these days you pretty much always specify TCP/IP. (The other option that's in common use is local Unix sockets.)</p><p style="text-align: justify;">When you are interested to write a parallel programming application, you should probably not be looking at TCP/IP or sockets as those things are going to be much lower level than you want. You'll probably want to look at something like MPI or any of the PGAS languages like UPC, Co-array Fortran, Global Arrays, Chapel, etc. They're going to be far easier to use than essentially writing your own networking layer.</p><p style="text-align: justify;">When you use one of these higher level libraries, you get lots of nice abstractions like collective operations, remote memory access, and other features that make it easier to just write your parallel code instead of dealing with all of the OS stuff underneath. It also makes your code portable between different machines/architectures.</p><p style="text-align: justify;">MPI is free to use any available communication path(s) for MPI messages in the new communicator; the socket is only used for the initial handshaking. </p><p style="text-align: justify;">A common problem is the one of two processes each opening connections to each other. The socket code assume that the sockets are bidirectional, thus only one socket is needed by each pair of connected processes, not one socket for each member of the pair. Then it should refactor the states and state machine into a clear set of VC connection states and connection states.</p><p style="text-align: justify;">There are three related objects used during a connection event. They are the connection itself (a structure specific to the communication method, sockets in the case of this note), the virtual connection, and the process group to which the virtual connection belongs</p><p style="text-align: justify;">If a socket connection between two processes is established, there are always two sides: The connecting side and the accepting side. The connecting side sends an active message to the accepting side. 
This sides first accepts the connection. However, if both processes try to connect to each other (head-to-head situation), the processes n have both, a connecting and an accepting connection. In this situation, one of the connections is refused/discarded - while the other connection is established. This is decided on the accepting side.</p><p style="text-align: justify;"><b>State machines for establishing a connection:</b></p><p style="text-align: justify;"><b>Connect side :</b></p><p style="text-align: justify;">The connecting side tries to establish a connection by sending an active message to the remote side. If the connection is accepted, the pg_id is send to the remote side. Now, the process waits, until the connection is finally accepted or refused. For this decision, the remote side requires the gp_id . Based on the answer from the remote side (ack = yes or ack = no) the connection is connected or closed.</p><p style="text-align: justify;"><b>Accept side:</b></p><p style="text-align: justify;">The accept side receives a connection request on the listening socket. In the first instance, it accepts the connection an allocates the required structures (conn, socket). Then, the connection waits for the pg_id of the remote side to assign the socket-connection to a VC. The decision, if a connection is accepted or refused, is based on the following steps:</p><p></p><ol style="text-align: left;"><li style="text-align: justify;">The VC has to aktive connection (vc->conn == NULL) : The new connection is accepted</li><li style="text-align: justify;">The VC has an aktive connection</li><li style="text-align: justify;">If my_pg_id < remote_pg_id: accept and discard other connection</li><li style="text-align: justify;">If my_pg_id > remote_pg_id: refuse </li></ol><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The answer is send to the remote note.</div><p></p><p style="text-align: justify;">Data Types Required by the MPI Standard:</p><div class="separator" style="clear: both; text-align: justify;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhCNqKDMp6H_AKO7dELIGc0Z8QhggYRcsKlXsegHvfOTyjkPnlmzeeQEWAKwt2BqIlTSVMgXfofg83vdPY2Cwzscs-SJokiLvJef7roXikiKIiN1Mg3t6RpiYoypBkfV1MqFJad2Oezla4WOu9ZboxV4GTkdJjdPDyfEYRy2snROk4SZ0-LFjr5Wz8AwQ=s892" style="margin-left: 1em; margin-right: 1em;"><img alt="Source" border="0" data-original-height="818" data-original-width="892" height="586" src="https://blogger.googleusercontent.com/img/a/AVvXsEhCNqKDMp6H_AKO7dELIGc0Z8QhggYRcsKlXsegHvfOTyjkPnlmzeeQEWAKwt2BqIlTSVMgXfofg83vdPY2Cwzscs-SJokiLvJef7roXikiKIiN1Mg3t6RpiYoypBkfV1MqFJad2Oezla4WOu9ZboxV4GTkdJjdPDyfEYRy2snROk4SZ0-LFjr5Wz8AwQ=w640-h586" title="source" width="640" /></a></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">MPI point-to-point communication sends messages between two different MPI processes. 
One process performs a send operation while the other performs a matching read</div><div style="text-align: justify;"><br /></div><div class="separator" style="clear: both; text-align: justify;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhlwlNmkDG399kN6yldRd4nc1KAxZ8THJZjqe2iV4ZxH1s72dl4UHhsBAZenjxOsVGDzpQXiXlQ7Hnss9Z5ko58bGsL1xnoUOs1fg9PzZoevRvcQ36iNqBFalti1AQv2zQ13dob0xssTf1euhrZ3UQ_Zbe-W9e3ktdSvh34V10ieHF5btfiunotJp-evA=s1151" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="287" data-original-width="1151" height="100" src="https://blogger.googleusercontent.com/img/a/AVvXsEhlwlNmkDG399kN6yldRd4nc1KAxZ8THJZjqe2iV4ZxH1s72dl4UHhsBAZenjxOsVGDzpQXiXlQ7Hnss9Z5ko58bGsL1xnoUOs1fg9PzZoevRvcQ36iNqBFalti1AQv2zQ13dob0xssTf1euhrZ3UQ_Zbe-W9e3ktdSvh34V10ieHF5btfiunotJp-evA=w400-h100" width="400" /></a></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">MPI collectives: MPI provides a set of routines for communication patterns that involve all the processes of a certain communicator, so-called collectives. The two main advantages of using collectives are:</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">1) Less programming effort. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">2) Performance optimization, as the implementations are usually efficient, especially if optimized for specific architectures</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">For collective communication, significant gains in performance can be achieved by implementing topology- and performance-aware collectives.</div><div style="text-align: justify;"><br /></div><div>Three common blocking collectives are Barrier(), Bcast() and Reduce().</div><div><br /></div><div><ol style="text-align: left;"><li>Allreduce(). Combination of reduction and a broadcast so that the output is available for all processes.</li><li>Scatter(). Split a block of data available in a root process and send different fragments to each process.</li><li>Gather(). Send data from different processes and aggregate it in a root process.</li><li>Allgather(). Similar to Gather() but the output is aggregated in buffers of all the processes.</li><li>Alltoall(). All processes scatter data to all processes.</li></ol></div><div><br /></div><div><br /></div><div><br /></div><p><span style="font-size: xx-small;">Reference:</span></p><p><span style="font-size: xx-small;">https://www.sciencedirect.com/topics/computer-science/point-to-point-communication</span></p><p><span style="font-size: xx-small;">https://wiki.mpich.org/mpich/index.php/Establishing_Socket_Connections</span></p><p><span style="font-size: xx-small;">https://aist-itri.github.io/gridmpi/publications/cluster04-slide.pdf</span></p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-47794702280667884842023-02-23T19:36:00.008+05:302023-02-23T19:36:59.531+05:30High performance computing<p> High-Performance Computing (HPC or supercomputer) is omnipresent in today’s society. For example, every time you watch Netflix, the recommendation algorithm leverages HPC resources remotely to offer you personalized suggestions. HPC stands for High-Performance Computing. 
The ability to carry out large scale computations to solve complex problems, that either need to process a lot of data, or to have a lot of computing power at their disposal. Basically, any computing system that doesn’t fit on a desk can be described as HPC. </p><p>HPC systems are actually networks of processors. The key principle of HPC lies in the possibility to run massively parallel code to benefit from a large acceleration in runtime. A common HPC capability is around 100,000 cores. Most HPC applications are complex tasks which require the processors to exchange their results. Therefore, HPC systems need very fast memories and a low-latency, high-bandwidth communication systems (>100Gb/s) between the processors as well as between the processors and the associated memories.</p><p>We can differentiate two types of HPC systems: the homogeneous machines and the hybrid ones. Homogeneous machines only have CPUs while the hybrids have both GPUs and CPUs. Tasks are mostly run on GPUs while CPUs oversee the computation. </p><p><br /></p><p>They have more computing power since GPUs can handle millions of threads simultaneously and are also more energy efficient. GPUs have faster memories, require less data transfer and are capable to exchange with other GPUs, which is the most energy-intensive part of the machine.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGyWip56XdMWTq2R6FoNmZQo2EnNQSZH4GOmim-zngB2Os1iVYqOJRqy1BYK4zYQjQZTPDXMHI_pnr3OISxL-MFuYq_HNb8a4XAfH4Qc-ot47lqyCqd1boVMmgIIlxJiyi_vH-B24Ae_G8k6El2YVhwRkRX1rk3uDFuUqEK1MTCyY5BoHbMKVLIPvh1g/s1055/HPC-computes.PNG" style="margin-left: auto; margin-right: auto;"><span style="color: black;"><img border="0" data-original-height="635" data-original-width="1055" height="386" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGyWip56XdMWTq2R6FoNmZQo2EnNQSZH4GOmim-zngB2Os1iVYqOJRqy1BYK4zYQjQZTPDXMHI_pnr3OISxL-MFuYq_HNb8a4XAfH4Qc-ot47lqyCqd1boVMmgIIlxJiyi_vH-B24Ae_G8k6El2YVhwRkRX1rk3uDFuUqEK1MTCyY5BoHbMKVLIPvh1g/w640-h386/HPC-computes.PNG" width="640" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://medium.com/quantonation/a-beginners-guide-to-high-performance-computing-ae70246a7af"><span style="color: black;">source</span></a></td></tr></tbody></table><br /><div>High Performance Computing used to be strictly defined with high speed network to allow strong interconnections between cores. The rise of AI applications led to an architecture based on more independent clusters but still massively parallel.</div><div><br /></div><div>HPC systems also include the software stack. That can be divided into three categories. First the user environment encompasses the applications known as workflows. Then the middleware linking applications and their implementation on the hardware. It includes the runtimes and frameworks. Last, the Operating system, at system level with the job scheduler, management software for load balancing and data availability. Its role is to assign tasks to the processors and organize the exchange of data between the processors and the memories to ensure the best performance.</div><div><br /></div><div><br /></div><div><div><b>HPC applications</b></div><div>HPC provides many benefits and value when used for commercial and industrial applications. 
Applications that can be classified in five categories:</div><div><br /></div><div>- Fundamental research aims to improve scientific theories to better understand natural or other phenomena. HPC enables more advanced simulations leading to breakthrough discoveries.</div><div><br /></div><div>- Design simulation allows industries to digitally improve the design of their products and test their properties. It enables companies to limit prototyping and testing, making the designing process quicker and less expensive.</div><div><br /></div><div>- Behavior prediction enables companies to predict the behavior of a quantity which they can’t impact but depend on, such as the weather or the stock market trends. HPC simulations are more accurate and can look farther into the future thanks to their superior computing abilities. It is especially important for predictive maintenance and weather forecasts.</div><div><br /></div><div>- Optimization is a major HPC use case. It can be found in most professional fields, from portfolio optimization to process optimization, to most manufacturing challenges faced by the industry.</div><div><br /></div><div>HPC is more and more used for data analysis. Business models, industrial processes and companies are being built on the ability to connect, analyze and leverage data, making supercomputers a necessity in analyzing massive amounts of data.</div><div><br /></div></div><div>The 5 fields of HPC Applications.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYdpHYIrQsvpI3IW_-akjnfvFDDO6jIHgRC7baiQjU1EiJJi24nIZ3SAOuWTJDufpbWG4QJlW-4bmTaFuyQB_VHivSRfGAuFsdVULNzLy_eq6WfLVn_kxS-fbPBv_3HbQj1qwYzdQ_Ngayyejl3QysOiaKOWePOnrJUBQL0YePSAxG5-UuMnnQYNu4qQ/s635/1.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="280" data-original-width="635" height="141" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYdpHYIrQsvpI3IW_-akjnfvFDDO6jIHgRC7baiQjU1EiJJi24nIZ3SAOuWTJDufpbWG4QJlW-4bmTaFuyQB_VHivSRfGAuFsdVULNzLy_eq6WfLVn_kxS-fbPBv_3HbQj1qwYzdQ_Ngayyejl3QysOiaKOWePOnrJUBQL0YePSAxG5-UuMnnQYNu4qQ/s320/1.PNG" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBd9ZBl6tJvePqBS_HB9h5CXjaKrQLOgOKSK281dA7TO9ZgKRPEhRx-hLk_1APgCtid446AueyMY6HfQ7smy8tLZv6HIRHnSn-veB_1OMcoXLEXBmqfzKtmQUoTFoMcHxA5-OEfoaQSdXWxpOg-oARMLxctgKCEfPpQGWFiCOa2WT6zQRapQqkjQNN3w/s630/2.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="266" data-original-width="630" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBd9ZBl6tJvePqBS_HB9h5CXjaKrQLOgOKSK281dA7TO9ZgKRPEhRx-hLk_1APgCtid446AueyMY6HfQ7smy8tLZv6HIRHnSn-veB_1OMcoXLEXBmqfzKtmQUoTFoMcHxA5-OEfoaQSdXWxpOg-oARMLxctgKCEfPpQGWFiCOa2WT6zQRapQqkjQNN3w/s320/2.PNG" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrB27Mg2X8Dhct3p9CMjheS7Fsc4Hxg46hd897oXMAItCmnnlGCwORDq3AARd6R-k84gAWdfu5_ahM1XravJ3KW94OvabgldJ56z23NoxOPdLyqgoVG3vKYEeO1qHaLXmRFCjsVnWwNEdj5GBReAxU5TsX7iInJKbBZhkNd7iF9HKL87-S7qUwVu2tIQ/s640/3.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="272" data-original-width="640" height="136" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrB27Mg2X8Dhct3p9CMjheS7Fsc4Hxg46hd897oXMAItCmnnlGCwORDq3AARd6R-k84gAWdfu5_ahM1XravJ3KW94OvabgldJ56z23NoxOPdLyqgoVG3vKYEeO1qHaLXmRFCjsVnWwNEdj5GBReAxU5TsX7iInJKbBZhkNd7iF9HKL87-S7qUwVu2tIQ/s320/3.PNG" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-f0TNkT1qmtTfnq8psAhsmKJ_Y-ynOQXQ4fJg50jRwb_FTt22xiLtMRqQF3qFwwaA7WB5_Xo1XqBak_3CXJrLlMrDnA94olIQCjwc-Usb4WpdYaOfL0dLo9pL08m1TuVbbkejOgBtu8r3A0nQuEFs9c5KU8zuWkIlaLokFXlGm464-l19Sc1M2sd4Dg/s627/4.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="267" data-original-width="627" height="136" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-f0TNkT1qmtTfnq8psAhsmKJ_Y-ynOQXQ4fJg50jRwb_FTt22xiLtMRqQF3qFwwaA7WB5_Xo1XqBak_3CXJrLlMrDnA94olIQCjwc-Usb4WpdYaOfL0dLo9pL08m1TuVbbkejOgBtu8r3A0nQuEFs9c5KU8zuWkIlaLokFXlGm464-l19Sc1M2sd4Dg/s320/4.PNG" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL01t3V4lrS86Y5WWPRy-Eo3yd8d8Z4WhLEBVhwOdb-Um6GwY1aKC9No9MVU-krSSsc048eRPaXTu4B8DBHa47zPjnpcQD-RguP0wqAwxskFbyddY6xqnctoh9p7-L1LUik1uwEQGJT9p7eZ39s66w5QzJlUnQrW461w6Gvfejia32bXmigknsZJbnrw/s633/5.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="273" data-original-width="633" height="138" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL01t3V4lrS86Y5WWPRy-Eo3yd8d8Z4WhLEBVhwOdb-Um6GwY1aKC9No9MVU-krSSsc048eRPaXTu4B8DBHa47zPjnpcQD-RguP0wqAwxskFbyddY6xqnctoh9p7-L1LUik1uwEQGJT9p7eZ39s66w5QzJlUnQrW461w6Gvfejia32bXmigknsZJbnrw/s320/5.PNG" width="320" /></a></div><br /><div><br /></div><div><div>Another major application for HPC is in the fields of medical and material advancements. For instance, HPC can be deployed to:</div><div><br /></div><div>Combat cancer: Machine learning algorithms will help supply medical researchers with a comprehensive view of the U.S. cancer population at a granular level of detail.</div><div><br /></div><div>Identify next-generation materials: Deep learning could help scientists identify materials for better batteries, more resilient building materials and more efficient semiconductors.</div><div><br /></div><div>Understand patterns of disease: Using a mix of artificial intelligence (AI) techniques, researchers will identify patterns in the function, cooperation, and evolution of human proteins and cellular systems.</div></div><div><br /></div><div>HPC needs are skyrocketing. A lot of sectors are beginning to understand the economic advantage that HPC represents and therefore are developing HPC applications. </div><div><br /></div><div>Industrial companies in the field of aerospace, automotive, energy or defence are working on developing digital twins of a machine or a prototype to test certain properties. This requires a lot of data and computing power in order to accurately represent the behavior of the real machine. 
This will, moving forward, render prototypes and physical testing less and less standard.</div><div><br /></div><div>The HPC dynamics and industrial landscape</div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGbci252mAd5GyLKFbbUMw6E1NnUrifNx4YpDtHGVRwFnP2ghB820x924gkUCRfY-qznaD1RJN94XhOMtnzq5h2eOPx3Hzfx-Vu5FraTvPQsBqQ0riPeH5sfZ7xlTLlI5ZB4_XlKWe2ILaH-NHomJaJudxDV6ECKF5rrAjfr8Xx7QZzMODmDGTe7OJJg/s1089/hpc-landscape.PNG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="499" data-original-width="1089" height="294" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGbci252mAd5GyLKFbbUMw6E1NnUrifNx4YpDtHGVRwFnP2ghB820x924gkUCRfY-qznaD1RJN94XhOMtnzq5h2eOPx3Hzfx-Vu5FraTvPQsBqQ0riPeH5sfZ7xlTLlI5ZB4_XlKWe2ILaH-NHomJaJudxDV6ECKF5rrAjfr8Xx7QZzMODmDGTe7OJJg/w640-h294/hpc-landscape.PNG" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://medium.com/quantonation/a-beginners-guide-to-high-performance-computing-ae70246a7af">source</a></td></tr></tbody></table><br /><div><div>The limits of a model :</div><div>Unfortunately, supercomputers are revealing some limits. First of all, some problems are not currently solvable by a supercomputer. The race to the exascale (a supercomputer able to realize 10^18 floating point operations per second) is not necessarily going to solve this issue. Some problems or simulations might remain unsolvable, or at least, unsolvable in an acceptable length of time. For example, in the case of digital twins or molecular simulation, calculations have to be greatly simplified in order for current computers to be able to make them in an acceptable length of time (for product or drug design).</div><div><br /></div><div>Moreover, a second very important challenge is the power consumption. The consumption of computing and data centers represents 1% of power consumption in the world and this is bound to significantly increase. It shows that this model is unsustainable in the long term, especially since exascale supercomputers will most surely consume more than current ones. Not only is it technically unsustainable, it is also financially so. Indeed, a supercomputer can cost as much as 10mUSD per year in electricity consumption.</div><div><br /></div><div>The new chips revolution</div><div>CPUs and GPUs are not the only solutions to tackle the two previously stated issues.</div><div><br /></div><div>Although most efforts are focused on developing higher-performance CPU and GPU-powered supercomputers in order to reach the exascale, new technologies, in particular “beyond Silicon”, are emerging. Innovative chip technologies could act as accelerators like GPUs did in the 2010s and significantly increase the computing power. Moreover, some technologies, such as quantum processors for example, would be able to solve new categories of problems that are currently beyond our reach.</div><div><br /></div><div>In addition, 70% of the energy consumption in a HPC is accounted for by the processors. Creating new chips, more powerful and more energy efficient would enable us to solve both problems at once. GPUs were the first step towards this goal. Indeed, for some applications, GPUs can replace up to 200 or 300 CPUs. 
Although one GPU individually consumes a bit more than a CPU (roughly 400W against 300W), overall, a hybrid supercomputer will consume less than a homogeneous supercomputer of equal performance.</div><div><br /></div><div>The model needs to be reinvented to include disruptive technologies. Homogeneous supercomputers should disappear, and this shift is already underway. In 2016, only 10 of the supercomputers in the Top500 were hybrid. By 2020, within only four years, that number had risen to 333 out of 500, including 6 in the top 10.</div><div><br /></div><div>At Quantonation, we are convinced that innovative chips integrated in heterogeneous supercomputing architectures, as well as optimized software and workflows, will be key enablers to face societal challenges by significantly increasing sustainability and computing power. We trust that these teams are ready to face the challenge and be part of the future of compute:</div></div><div><br /></div><div><div><ol style="text-align: left;"><li>Pasqal’s neutral atoms quantum computer, highly scalable and energy efficient;</li><li>Lighton’s Optical Processing Unit, a special-purpose light-based AI chip fitted for tasks such as Natural Language Processing;</li><li>ORCA Computing’s fiber-based photonic systems for simulation and fault-tolerant quantum computing;</li><li>Quandela’s photonic qubit sources that will fuel the next generation of photonic devices;</li><li>QuBit Pharmaceuticals’ software suites leveraging HPC and quantum computing resources to accelerate drug discovery;</li><li>Multiverse Computing’s solutions using disruptive mathematics to solve finance’s most complex problems on a range of classical and quantum technologies.</li></ol></div></div><div><br /></div><div><br /></div><div><div style="text-align: justify;">IBM Cloud HPC provides IaaS for building HPC environments using IBM’s Virtual Private Cloud (VPC). It enables you to create your own configuration for Compute Instances, high-performance Storage, and Networking components such as Public Gateways, Load Balancers and Routers. Multiple connectivity options are available up to 80 Gbps, and IBM Cloud offers the highest level of security and encryption with FIPS 140-2 Level 4. Also available is IBM Code Engine, a fully managed serverless platform to run containers, applications or batch jobs.</div><div style="text-align: justify;">– Spectrum Computing provides intelligent dynamic hybrid cloud capabilities which enable organizations to use cloud resources according to defined policies. Spectrum LSF and Symphony allow you to burst workloads to the cloud, dynamically provision cloud resources and intelligently move data to manage egress costs. They also enable auto scaling to take full advantage of consumption-based pricing, so you pay for cloud resources only when they are needed.</div><div style="text-align: justify;">– Spectrum Scale is an enterprise-grade High Performance File System (HPFS) that delivers scalable capacity and performance to handle demanding data analytics, content repositories and HPC workloads. The Spectrum Scale architecture allows it to handle tens of thousands of clients, billions of files and petabytes of data written and retrieved as files or objects with low latency. 
Optionally, IBM Aspera can be used for high-speed data movement using the FASP protocol.</div><div><br /></div><div><br /></div><div><br /></div><div><div><b>Use Cases</b></div><div>– Financial Services: Monte Carlo simulation, risk modeling, actuarial sciences</div><div>– Health and Life Sciences: Genome analysis, drug discovery, bio-sequencing, clinical treatments, molecular modeling</div><div>– Automotive: Vehicle drag coefficient analysis, crash simulation, engine combustion analysis, air flow modeling</div><div>– Aerospace: Structural, fluid dynamics, thermal, electromagnetic and turbine flow analysis</div><div>– Electronic Design Automation (EDA): Integrated Circuit (IC) and Printed Circuit Board (PCB) design and analysis</div><div>– Oil and Gas: Subsurface terrain modeling, reservoir simulation, seismic analysis</div><div>– Transportation: Routing logistics, supply chain optimization</div><div>– Energy & Utility: Severe storm prediction, climate, weather and wind modelling</div><div>– Education/Research: High energy physics, computational chemistry </div></div><div><br /></div><p>The reign and modern challenges of the Message Passing Interface (MPI)</p><p>“All good, but why do you guys doing numerical linear algebra and parallel computing always use the Message Passing Interface to communicate between the processors?”</p><p>MPI began about 25 years ago and has since then been, undoubtedly, the “King” of HPC. What were the characteristics of MPI that made it the de facto language of HPC?</p><p>MPI-3 has added several interfaces to enable more powerful communication scheduling, for example nonblocking collective operations and neighborhood collective operations. </p><p>Much of the big data community moved from single nodes to parallel and distributed computing to process larger amounts of data using relatively short-lived programs and scripts. Thus, programmer productivity only played a minor role in MPI/HPC, while it was one of the major requirements for big data analytics. While MPI codes are often orders of magnitude faster than many big data codes, they also take much longer to develop, and that is most often a good trade-off.</p><p>MPI I/O was introduced nearly two decades ago to improve the handling of large datasets in parallel settings. It is successfully used in many large applications and I/O libraries such as HDF-5. </p><p>MPI predates the time when the use of accelerators became commonplace. However, when these accelerators are used in distributed-memory settings such as computer clusters, MPI is the common way to program them. The current model, often called MPI+X (e.g., MPI+CUDA), combines traditional MPI with accelerator programming models (e.g., CUDA, OpenACC, OpenMP, etc.) in a simple way. In this model, MPI communication is performed by the CPU. Yet, this can be inefficient and inconvenient, and recently a programming model called distributed CUDA (dCUDA) has been proposed to perform communication from within a CUDA compute kernel [3]. This allows the powerful GPU warp scheduler to be used for communication latency hiding. In general, integrating accelerators and communication functions is an interesting research topic.</p><p>Programming at the transport layer, where every exchange of data has to be implemented with lovingly hand-crafted sends and receives or gets and puts, is an incredibly awkward fit for numerical application developers, who want to think in terms of distributed arrays, data frames, trees, or hash tables. 
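To make this concrete, below is a minimal point-to-point sketch in mpi4py (a hypothetical example, assuming mpi4py and an MPI runtime are installed, launched with something like mpirun -n 2 python toy_send_recv.py); even this tiny exchange requires the programmer to manage ranks, tags, and matching sends and receives by hand:</p><p><span style="font-size: x-small;">from mpi4py import MPI<br /><br />comm = MPI.COMM_WORLD<br />rank = comm.Get_rank()<br /><br />if rank == 0:<br />    # hand-crafted send of a small Python object to rank 1<br />    data = {"step": 1, "values": [1.0, 2.0, 3.0]}<br />    comm.send(data, dest=1, tag=11)<br />elif rank == 1:<br />    # the matching hand-crafted receive<br />    data = comm.recv(source=0, tag=11)<br />    print("rank 1 received", data)</span>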
</p><p>“Everyone uses MPI” has made it nearly impossible for even made-in-HPC-land tools like Chapel or UPC to make any headway, much less quite different systems like Spark or Flink, meaning that HPC users are largely stuck with using an API which was a big improvement over anything else available 25 years ago.</p><p>Chapel is a modern programming language designed for productive parallel computing at scale. Chapel's design and implementation have been undertaken with portability in mind, permitting Chapel to run on multicore desktops and laptops, commodity clusters, and the cloud, in addition to the high-end supercomputers for which it was originally undertaken.</p><p><span style="font-size: xx-small;">Reference:</span></p><p><span style="font-size: xx-small;">https://medium.com/quantonation/a-beginners-guide-to-high-performance-computing-ae70246a7af</span></p><p><span style="font-size: xx-small;">https://github.com/ljdursi/mpi-tutorial/blob/master/presentation/presentation.md</span></p><p><span style="font-size: xx-small;">https://github.com/chapel-lang/chapel</span></p><p><span style="font-size: xx-small;">https://blog.xrds.acm.org/2017/02/message-passing-interface-mpi-reign-modern-challenges/</span></p><p>================</p></div>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-12862361603124207112023-02-23T19:24:00.001+05:302023-02-28T11:58:57.866+05:30Test log Analytics with Elasticsearch and Kibana <p>Software Test Analytics is the process of collecting and analyzing data from software testing activities to improve the quality and efficiency of the testing process. This can include metrics such as test coverage, defect density, and test execution time, as well as data on test automation and test case management. The goal of software test analytics is to identify trends and patterns in the data that can be used to make informed decisions about how to improve the testing process, such as where to focus testing efforts, which tests to automate, and how to optimize test case design.</p><p>Test log analysis is the process of collecting, analyzing, and interpreting data from test logs in order to identify patterns, trends, and issues that can help improve the quality and performance of software systems. This can include data such as test results, error messages, performance metrics, and other relevant information. The goal of test log analysis is to help identify and resolve issues that may be impacting the performance or functionality of the system, and to improve the overall quality of the software. Common techniques used in test log analysis include statistical analysis, machine learning, and data visualization.</p><p>Insights that can be gained from analyzing test case output logs are listed below: </p><p>1) Identifying which test cases are passing and which are failing, and the reasons for the failures. This can help you to focus your testing efforts on the areas of the application that need the most attention.</p><p>2) Understanding the performance of the application under test. This can include metrics such as response time, memory usage, and CPU usage, which can help you to identify and fix performance bottlenecks.</p><p>3) Identifying patterns in the test case data that indicate potential issues with the application under test. 
For example, if a large number of test cases are failing in a particular module, it could indicate a problem with that module that needs to be investigated.</p><p>4) Identifying areas of the application that are not being adequately tested. This can help you to create new test cases to cover these areas, and to ensure that the application is thoroughly tested before release.</p><p>5) Identifying where automation can improve the testing process. With the help of logs, you can automate test cases which are repetitive, time-consuming or prone to human errors.</p><p>6) To get the most value out of test case output logs, it's important to have a systematic and automated way of collecting and analyzing the data. This can include using tools such as log analyzers, data visualization tools, and automated reporting tools.</p><p><b>Open-source frameworks that can be used for test result analytics: </b></p><p>ELK Stack: ELK stands for Elasticsearch, Logstash, and Kibana. Elasticsearch is a search engine, Logstash is a log aggregator, and Kibana is a data visualization tool. By using the ELK stack, you can collect, store, and analyze large volumes of test result data in real-time.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXIeweInxo1zhZd_iEMh2WbT_wlmAppLwwS9LV_D9ntFZnqvahYkAS54Y9u_PmeU6OSa7PkLaDg0aXo6dPaaShiJ1QktR1sWviR_csJxnVq3xhH_4Y3lDtYfo8YjKQP814UuKppdRuQrL2yKRaz3vbC75VCsr2d8-cw2JDst1bl3GTqdK3VFdXBA4lWA/s1000/elastci_stack.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="400" data-original-width="1000" height="100" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXIeweInxo1zhZd_iEMh2WbT_wlmAppLwwS9LV_D9ntFZnqvahYkAS54Y9u_PmeU6OSa7PkLaDg0aXo6dPaaShiJ1QktR1sWviR_csJxnVq3xhH_4Y3lDtYfo8YjKQP814UuKppdRuQrL2yKRaz3vbC75VCsr2d8-cw2JDst1bl3GTqdK3VFdXBA4lWA/w249-h100/elastci_stack.png" width="249" /></a></div><br /><p></p><p>The ELK stack (Elasticsearch, Logstash, and Kibana) can be integrated with a CI (Continuous Integration) framework in several ways to show Test analytics.</p><p>1.<span style="white-space: pre;"> </span>Logstash: Logstash can be used to collect and parse log files generated by the CI framework. You can configure Logstash to read log files from the CI server and to parse the data into a format that can be indexed by Elasticsearch.</p><p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinjQTmZxAy0_ywTKbJDIkqJJF6PIxVwGQWw0Ube12DubQ6g4nKU5wMSfb0Pq2_QQQRnO6VwfCrhCogU2JaRfygc82JcCV9QoD7kek6Cf6U7n69WH2RQKhRSg7WNbfa-wggw3oocLWAvyCjwjMj0xOkEPnz0NtpNhUblfsWh4ksU_nHqLU2dgRk4jSsyA/s1000/Kibana_Logo.webp" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1000" data-original-width="1000" height="111" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinjQTmZxAy0_ywTKbJDIkqJJF6PIxVwGQWw0Ube12DubQ6g4nKU5wMSfb0Pq2_QQQRnO6VwfCrhCogU2JaRfygc82JcCV9QoD7kek6Cf6U7n69WH2RQKhRSg7WNbfa-wggw3oocLWAvyCjwjMj0xOkEPnz0NtpNhUblfsWh4ksU_nHqLU2dgRk4jSsyA/w111-h111/Kibana_Logo.webp" width="111" /></a>2.<span style="white-space: pre;"> </span>Kibana: Kibana can be used to visualize the data collected by Logstash. 
You can create a dashboard in Kibana that displays metrics such as build time, test execution time, and pass/fail rate.</p><p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEit0SUwrbEWYD6mQX8v-6_Y1Yd1tgrMJ31TDYnRsLCOpqnWai-TGsu2x_00DTDjZMESx6OQ8t7IHcvkAcBn3N_xNxHOkqDJ-GenpU3ZcflniYmeHjwfgnlNnKCvENKF82JQZ0yYF3LjIYSyzTUdbGb7GNoqCDqQXhHrYUnaGQxsIgubJ_GYapxTF993Dg/s3422/elasticsearch_logo.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em; text-align: center;"><img border="0" data-original-height="1781" data-original-width="3422" height="101" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEit0SUwrbEWYD6mQX8v-6_Y1Yd1tgrMJ31TDYnRsLCOpqnWai-TGsu2x_00DTDjZMESx6OQ8t7IHcvkAcBn3N_xNxHOkqDJ-GenpU3ZcflniYmeHjwfgnlNnKCvENKF82JQZ0yYF3LjIYSyzTUdbGb7GNoqCDqQXhHrYUnaGQxsIgubJ_GYapxTF993Dg/w193-h101/elasticsearch_logo.png" width="193" /></a></p><p>3.<span style="white-space: pre;"> </span>Elasticsearch: Elasticsearch can be used to store and index the data collected by Logstash. You can use Elasticsearch to search and analyze the data, and to create complex queries and visualizations.</p><p>4.<span style="white-space: pre;"> </span>Integrate with CI/CD tool: You can integrate ELK stack with your CI/CD tool, for example, Jenkins, Travis or CircleCI. You can configure the CI tool to send log files to Logstash, or directly to Elasticsearch.</p><p>-------------------</p><p>After Installation and configuring the ELK stack and Jenkins, the next step for analytics on test reports logs would be to start analyzing and visualizing the data. Here are some steps you can take:</p><p>1.<span style="white-space: pre;"> </span>Verify data collection: Ensure that the data is being collected correctly and that the logs are being indexed in Elasticsearch.</p><p>2.<span style="white-space: pre;"> </span>Create visualizations: Use Kibana to create visualizations such as line charts, bar charts, and pie charts to represent the data in a meaningful way. These visualizations can be used to represent metrics such as build time, test execution time, and pass/fail rate.</p><p>3.<span style="white-space: pre;"> </span>Create dashboards: Use Kibana to create dashboards that display multiple visualizations in a single view. These dashboards can be used to monitor the build and test results in real-time and to analyze the data over time.</p><p>4.<span style="white-space: pre;"> </span>Define alerts: Set up alerts in Kibana to notify you when certain conditions are met, such as a high number of test failures.</p><p>5.<span style="white-space: pre;"> </span>Analyze the data: Use Elasticsearch to create complex queries and to analyze the data in more detail. This can be used to identify patterns and trends in the data that can help improve the testing process.</p><p>6.<span style="white-space: pre;"> </span>Improve test coverage: Use the data to identify areas of the application that are not being adequately tested and to focus your testing efforts on those areas.</p><p>7.<span style="white-space: pre;"> </span>Identify and fix defects: Use the data to identify the root cause of test failures and to fix defects in the application</p><p>------------------------</p><p>We can ingest data with Python on Elastic search instead of using logstash which is good for realtime data ingestion. 
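For instance, a batch of test-log documents can be pushed from Python in a single call with the bulk helper. The sketch below is only illustrative (the host URL, index name and document fields are placeholders); the package prerequisites and a single-document indexing example follow right after it:</p><p style="text-align: left;"><span style="background-color: #eeeeee;">from elasticsearch import Elasticsearch, helpers<br /><br /># placeholder Elasticsearch endpoint<br />es = Elasticsearch("http://localhost:9200")<br /><br /># a small batch of test-log documents (placeholder fields)<br />test_logs = [<br />    {"test_name": "login_test", "status": "passed", "duration_sec": 3.2},<br />    {"test_name": "checkout_test", "status": "failed", "duration_sec": 7.9},<br />]<br /><br /># wrap each document in a bulk action targeting one index<br />actions = [{"_index": "test_logs", "_source": doc} for doc in test_logs]<br />helpers.bulk(es, actions)</span></p><p>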
Prerequisites: Get elasticsearch packages</p><p style="text-align: left;">python -m pip install elasticsearch<br />python -m pip install elasticsearch-async</p><p><b>Some commonly used elasticsearch APIs listed below :</b></p><p></p><ol style="text-align: left;"><li>es.index: Used to index a document into an index.</li><li>es.search: Used to search for documents in an index based on a query.</li><li>es.get: Used to retrieve a document from an index by its ID.</li><li>es.delete: Used to delete a document from an index by its ID.</li><li>es.update: Used to update a document in an index.</li><li>es.count: Used to count the number of documents that match a query without returning the actual documents.</li><li>es.exists: Used to check if a document exists in an index.</li><li>es.bulk: Used to execute multiple index, update, or delete requests in a single HTTP request.</li><li>es.create: Used to create a new index.</li><li>es.delete_index: Used to delete an existing index.</li><li>es.get_mapping: Used to retrieve the mapping of a document type in an index.</li><li>es.put_mapping: Used to define the mapping of a document type in an index.</li><li>es.cluster.health: Used to retrieve information about the health of the Elasticsearch cluster.</li><li>es.cluster.state: Used to retrieve the current state of the Elasticsearch cluster.</li><li>es.nodes.info: Used to retrieve information about the nodes in the Elasticsearch cluster.</li><li>es.nodes.stats: Used to retrieve statistics about the nodes in the Elasticsearch cluster.</li><li>es.termvectors: Used to retrieve information about the terms in a document.</li><li>es.mtermvectors: Used to retrieve information about the terms in multiple documents.</li><li>es.explain: Used to explain how a particular document matches a query.</li><li>es.mget: Used to retrieve multiple documents from an index by their IDs</li></ol><p></p><p><b>Example</b> : Python code to use the Elasticsearch Python client to index a test log report into an Elasticsearch index:</p><p>Pre-requisite: Elasticsearch Python client installed (pip install elasticsearch)</p><p>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++</p><p># cat my_test_dataToES.py</p><p>#Let me create a sample test report data and push it to ElasticsearchDB</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote><span style="background-color: #eeeeee;">from datetime import datetime<br />from elasticsearch import Elasticsearch<br /># create an Elasticsearch client instance<br />es = Elasticsearch("http://sachin.bengaluru.com:9200")<br /><br /># create an Elasticsearch index to store the test log report<br />es.indices.create(index='test_logs_sachin', ignore=400)<br /><br /># define the test log report as a Python dictionary<br />test_log = {<br /> 'test_name': 'login_test',<br /> 'status': 'failed',<br /> 'error_message': 'Invalid credentials',<br /> 'timestamp': datetime.now()<br />}<br /># index the test log report into the Elasticsearch index<br />es.index(index='test_logs_sachin', body=test_log)<br />#</span><br /><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote><p>Execute the python script </p><p style="text-align: left;"><span style="background-color: #cccccc;"># python3 my_test_dataToES.py</span></p><p style="text-align: left;"><span style="background-color: #cccccc;">#</span></p><p 
style="text-align: left;"><span style="background-color: #cccccc;">+++++++++++++++++++++++++++++++++++++++++++++++++++++++++</span></p><p style="text-align: left;"><span style="background-color: #cccccc;">Check the kibana dashboard and create ==> Create index pattern</span></p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1ur2DcD1CMsnmXsXKWetLWykkoTVo9rIhbgKdrCkILkh5KQWoqToNO0aa4DbBPtxzWu3qnyFUMNjmgHE5ggl9A7xqm8nN_GLQc2Uv3BIGik9m6EGt0hit_pn-q3jhi3Lu65l4DwxdXcs2CMSmFFrOLDTFZm8McAjA7HqDqwF9pdLmjuoT8tdJsiWzkQ/s1548/new_index_ES.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="757" data-original-width="1548" height="312" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1ur2DcD1CMsnmXsXKWetLWykkoTVo9rIhbgKdrCkILkh5KQWoqToNO0aa4DbBPtxzWu3qnyFUMNjmgHE5ggl9A7xqm8nN_GLQc2Uv3BIGik9m6EGt0hit_pn-q3jhi3Lu65l4DwxdXcs2CMSmFFrOLDTFZm8McAjA7HqDqwF9pdLmjuoT8tdJsiWzkQ/w640-h312/new_index_ES.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p>Visualizing testing results can help you to quickly and easily identify patterns and trends in the data, and can provide valuable insights into the performance and quality </p><p>1) Test Execution Progress: Creating a graph or chart that shows the progress of test execution over time can help you to identify trends in test case pass/fail rates, and to identify areas of the application that are not being adequately tested.</p><p>2) Test Case Pass/Fail Rates: Creating a graph or chart that shows the pass/fail rate for each test case can help you to quickly identify which test cases are passing and which are failing. This can help you to focus your testing efforts on the areas of the application that need the most attention.</p><p>3) Defect Density: Creating a graph or chart that shows the number of defects per unit of code can help you to identify areas of the application that are prone to defects and to identify patterns in the types of defects that are being found.</p><p>4) Test Execution Time: Creating a graph or chart that shows the execution time for each test case can help you to identify performance bottlenecks and to optimize test case design.</p><p>5) Test Automation: Creating a graph or chart that shows the percentage of test cases that are automated can help you to identify areas of the application that can benefit from test automation.</p><p>6) Test coverage : Creating a graph or chart that shows how much of the application is being tested by your test suite can help you identify the areas that are not being covered and focus on increasing test coverage.</p><p>-----------------------------------------------</p><p>NOTE: </p><p>Arrays: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html">https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html</a></p><p>In Elasticsearch, there is no dedicated array data type. Any field can contain zero or more values by default, however, all values in the array must be of the same data type. 
For instance:</p><p>an array of strings: [ "one", "two" ]</p><p>an array of integers: [ 1, 2 ]</p><p>an array of arrays: [ 1, [ 2, 3 ]] which is the equivalent of [ 1, 2, 3 ]</p><p>an array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]</p><p>Arrays of objects: Arrays of objects do not work as you would expect: you cannot query each object independently of the other objects in the array. If you need to be able to do this then you should use the nested data type instead of the object data type. This is explained in more detail in Nested.</p><p><b>Detailed insights from test result logs :</b></p><p>1) Root Cause Analysis: By analyzing the logs of failed test cases, you can identify the root cause of the failure, such as an issue with the application under test, a problem with the test case design, or an environment issue. This can help you to quickly fix the problem and prevent similar issues in the future.</p><p>2) Correlation Analysis: By analyzing the logs of multiple test cases, you can identify patterns and correlations between test results, such as the relationship between test execution time and the number of defects found. This can help you to identify areas of the application that are prone to defects and to optimize test case design.</p><p>3) Regression Analysis: By analyzing the logs of test cases that have been executed over time, you can identify trends in test case pass/fail rates, and to identify areas of the application that are not being adequately tested. This can help you to focus your testing efforts on the areas of the application that need the most attention.</p><p>4) Log Parsing: By parsing the logs, you can extract relevant information such as test case name, status, execution time, error messages, and stack trace. This information can be further analyzed to identify trends and patterns that can help improve the testing process.</p><p>5) Anomaly Detection: By analyzing the logs, you can identify anomalies or unexpected behavior in the test results. This can help you to identify potential issues with the application under test and to quickly fix them before they become major problems.</p><p><b>Machine Learning:</b> You can use machine learning techniques such as clustering, classification, or prediction to analyze test results logs. This can help you to identify patterns and insights that would be difficult to discover manually.</p><p>Natural Language Processing: By using NLP techniques, you can extract useful information from unstructured test result logs. This information can be used to identify patterns and insights that would be difficult to discover manually.</p><p>These techniques can be implemented using machine learning libraries and frameworks such as scikit-learn, TensorFlow, or PyTorch. It's also important to have a good understanding of the data, cleaning and preprocessing it before training the model.</p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-45914219880699806532023-02-20T21:14:00.016+05:302023-02-26T12:25:48.017+05:30HPC workloads on Kubernetes cluster with Volcano batch scheduler <p style="text-align: justify;">Kubernetes is primarily designed for managing containerized workloads, and it can also be used for managing High-Performance Computing (HPC) clusters. 
However, using Kubernetes for HPC clusters may require additional customization and configuration.</p><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><div>Performance: Kubernetes is not specifically designed for high-performance computing workloads, and it may not provide the same level of performance as specialized HPC schedulers. To improve performance, you may need to customize the Kubernetes scheduling policies and resource management.</div><div><br /></div><div>Networking: HPC workloads often require high-bandwidth, low-latency networking between compute nodes. Kubernetes networking is designed for containerized workloads, and it may not provide the required level of performance for HPC workloads. You may need to use specialized network plugins or customize the Kubernetes networking configuration to achieve the desired performance.</div><div><br /></div><div>Storage: HPC workloads often require large amounts of high-performance storage. Kubernetes provides built-in support for persistent volumes, but it may not be optimized for HPC workloads. You may need to use specialized storage solutions or customize the Kubernetes storage configuration.</div><div><br /></div><div>Resource Management: HPC clusters often require strict resource management to ensure that workloads are efficiently utilizing compute resources. Kubernetes provides a range of resource management features, such as quotas and limits, but you may need to customize the configuration to meet the specific needs of your HPC workloads.</div></div><p style="text-align: justify;">----------------------</p><p style="text-align: justify;">Kubernetes Batch: Kubernetes Batch refers to the built-in Job support (the batch API group) included in the core Kubernetes distribution. It provides basic batch job management functionality, such as job creation, monitoring, and cleanup.</p><p style="text-align: justify;">In addition to Kubernetes Batch, there are several other batch schedulers that can be used with Kubernetes for managing batch jobs, including:</p><p style="text-align: justify;"></p><ul><li>Apache Airflow: Apache Airflow is a popular open-source platform for creating, scheduling, and monitoring workflows. It can be used to manage batch jobs in Kubernetes using the KubernetesExecutor.</li><li>Apache Spark: Apache Spark is a distributed computing framework that includes a built-in scheduler for managing batch jobs. Spark can be run on Kubernetes using the Spark operator, which provides support for managing Spark jobs as Kubernetes native resources.</li><li>HTCondor: HTCondor is a widely-used batch scheduler in the high-performance computing (HPC) community. It can be used with Kubernetes through the HTCondor-Kubernetes integration, which allows HTCondor to manage Kubernetes pods as HTCondor jobs.</li><li>Slurm: Slurm is another popular batch scheduler in the HPC community. It can be used with Kubernetes through the Slurm-Kubernetes integration, which allows Slurm to manage Kubernetes pods as Slurm jobs.</li><li>Volcano: Volcano is a newer batch scheduler designed specifically for Kubernetes. It provides advanced job scheduling and resource management capabilities, such as backfill scheduling and intelligent resource allocation.</li></ul><p></p><p style="text-align: justify;">---------------------------------------------------------------</p><div><b>Volcano is a Kubernetes native batch system and job scheduler designed to handle deep learning, machine learning, and high-performance computing (HPC) workloads. 
It was originally developed at Huawei Cloud and released as an open-source project in 2019.</b></div><div><br /></div><div>Here are some of the features that make Volcano special for HPC workloads:</div><div><ul style="text-align: left;"><li>Resource Management: Volcano has advanced resource management capabilities that allow it to efficiently schedule and manage HPC workloads across a large number of nodes. It can allocate GPUs, CPUs, memory, and other resources required for running HPC workloads in a distributed computing environment.</li><li>Performance Optimization: Volcano is designed to optimize the performance of HPC workloads by minimizing resource contention and reducing the time it takes to start and complete jobs. It uses advanced scheduling algorithms to allocate resources and optimize job execution times.</li><li>Custom Schedulers: Volcano allows users to create custom schedulers and customize job scheduling policies to meet their specific requirements. This makes it easier for users to configure the scheduler to meet the specific needs of their HPC workloads.</li><li>Job Prioritization: Volcano supports job prioritization based on various factors such as job dependencies, job age, and user-defined priorities. This ensures that higher priority jobs are executed first, and that resources are allocated efficiently across the cluster.</li><li>Workflow Support: Volcano supports complex workflow management, allowing users to define dependencies between jobs and execute them in a specific order. This is particularly useful for HPC workloads that require a series of jobs to be executed in a specific sequence.</li></ul></div><div>Overall, Volcano is a powerful job scheduler that is optimized for HPC workloads in Kubernetes environments. Its advanced resource management, performance optimization, and custom scheduling capabilities make it an ideal choice for running complex deep learning, machine learning, and HPC workloads in Kubernetes.</div><div><br /></div><div>------------------</div><p style="text-align: justify;">Integrating the Volcano scheduler into a Kubernetes cluster involves the following steps:</p><p style="text-align: justify;">Install the Volcano components: The first step is to install the Volcano components on the Kubernetes cluster. This can be done using the Volcano Helm chart, which includes all the necessary components such as the Volcano scheduler, admission controllers, and CRDs (Custom Resource Definitions).</p><p style="text-align: justify;">Install Volcano components with Helm chart:</p><p style="text-align: justify;"># Add the Volcano Helm repository<br /><span style="font-size: xx-small;">helm repo add volcano https://volcano.sh/charts<br />helm repo update</span><br /><br /># Install the Volcano components<br /><span style="font-size: xx-small;">helm install volcano volcano/volcano</span></p><p style="text-align: justify;"><span style="font-size: x-small;"><b>NOTE: Helm is a package manager for Kubernetes</b></span></p><p style="text-align: justify;">Configure the scheduler: Once the Volcano components are installed, the next step is to configure the Volcano scheduler. 
This involves setting the scheduling policies, priority classes, and other parameters that govern how the scheduler allocates resources to jobs.</p><p style="text-align: justify;">Configure the scheduler with YAML file:</p><p style="text-align: justify;"><span style="font-size: x-small;">apiVersion: scheduling.volcano.sh/v1beta1<br />kind: Queue<br />metadata:<br /> name: default<br />spec:<br /> # Set the scheduling policy for the queue<br /> schedulingPolicy:<br /> type: "PriorityPolicy"<br /> priorityPolicy:<br /> defaultPriority: 50<br /> # Set the resource limits for the queue<br /> resources:<br /> limits:<br /> cpu: "16"<br /> memory: "64Gi"<br /> requests:<br /> cpu: "2"<br /> memory: "8Gi"</span></p><p style="text-align: justify;">Apply the YAML file with the following command:</p><p style="text-align: justify;"><span style="font-size: x-small;">kubectl apply -f queue.yaml</span></p><p style="text-align: justify;">Define the job templates: After configuring the scheduler, the next step is to define the job templates that will be used to create batch jobs. This involves specifying the Docker image, command, arguments, and resource requirements for each job template.</p><p style="text-align: justify;">Define job templates with YAML file:</p><p style="text-align: justify;"><span style="font-size: x-small;">apiVersion: batch.volcano.sh/v1alpha1<br />kind: Job<br />metadata:<br /> name: pi<br />spec:<br /> template:<br /> spec:<br /> containers:<br /> - name: pi<br /> image: perl<br /> command: ["perl"]<br /> args: ["-Mbignum=bpi", "-wle", "print bpi(2000)"]<br /> resources:<br /> limits:<br /> cpu: "2"<br /> memory: "8Gi"<br /> requests:<br /> cpu: "1"<br /> memory: "4Gi"<br /> # Set the queue name to use for this job<br /> queue: default</span></p><p style="text-align: justify;">Apply the YAML file with the following command:</p><p style="text-align: justify;"><span style="font-size: x-small;">kubectl apply -f job.yam</span></p><p style="text-align: justify;">---------------------</p><p style="text-align: justify;">Create batch jobs: Once the job templates are defined, batch jobs can be created using the Kubernetes API or the kubectl command-line tool. When a batch job is created, the Volcano scheduler will allocate resources to the job based on the scheduling policies and resource requirements defined in the job template. Create batch jobs with kubectl command:</p><p style="text-align: justify;"><span style="font-size: x-small;">kubectl create job pi --image=perl -- perl -Mbignum=bpi -wle 'print bpi(2000)'</span></p><p style="text-align: justify;">Monitor and manage batch jobs: Finally, the batch jobs can be monitored and managed using the Kubernetes API or the kubectl command-line tool. 
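As an illustration of the API route, here is a short sketch using the official Kubernetes Python client (hypothetical: it assumes the kubernetes package is installed, a kubeconfig is available, the Volcano CRDs are registered under batch.volcano.sh/v1alpha1, and the namespace is a placeholder):</p><p style="text-align: justify;"><span style="font-size: x-small;">from kubernetes import client, config<br /><br />config.load_kube_config()  # use the current kubeconfig context<br />api = client.CustomObjectsApi()<br /><br /># list Volcano jobs (vcjob) in the "default" namespace through the CRD API<br />vcjobs = api.list_namespaced_custom_object(<br />    group="batch.volcano.sh", version="v1alpha1",<br />    namespace="default", plural="jobs")<br /><br />for item in vcjobs.get("items", []):<br />    name = item["metadata"]["name"]<br />    phase = item.get("status", {}).get("state", {}).get("phase", "Unknown")<br />    print(name, phase)</span></p><p style="text-align: justify;">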
This includes viewing the status of running jobs, scaling up or down the number of replicas, and deleting completed jobs.</p><p style="text-align: justify;"><br /></p><p style="text-align: justify;">Monitor and manage batch jobs with kubectl command:</p><p style="text-align: justify;"># View the status of all batch jobs</p><p style="text-align: justify;"><span style="font-size: x-small;">kubectl get jobs.batch.volcano.sh</span></p><p style="text-align: justify;"><br /></p><p style="text-align: justify;"># View the logs for a specific batch job</p><p style="text-align: justify;">kubectl logs job/pi</p><p style="text-align: justify;"><br /></p><p style="text-align: justify;"># Scale up or down the number of replicas for a batch job</p><p style="text-align: justify;">kubectl scale job/pi --replicas=10</p><p style="text-align: justify;"><br /></p><p style="text-align: justify;"># Delete a completed batch job</p><p style="text-align: justify;">kubectl delete job/pi</p><p style="text-align: justify;">-----------------------------</p><p style="text-align: justify;">kubectl get jobs.batch.volcano.sh</p><div style="text-align: justify;">This command will display a table with information about all the Volcano jobs currently running, including the job name, namespace, queue, status, and creation time.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">If you want to see more details about a specific job, you can use the following command</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><div>kubectl describe job.batch.volcano.sh job-name</div><div><br /></div></div><p style="text-align: justify;">--------------------------------</p><p style="text-align: justify;">The rdma/hca_shared resource in Kubernetes refers to the HCA (Host Channel Adapter) Shared Device Plugin. This device plugin enables Kubernetes to detect and utilize InfiniBand and RDMA network interfaces on the host nodes, allowing containers to access RDMA resources through a standard Kubernetes API. By using the HCA Shared Device Plugin in a Kubernetes cluster, applications can take advantage of RDMA technology for high-performance networking and low-latency communication. This is especially useful for applications that require high-throughput data transfer or require low-latency communication, such as big data analytics, high-performance computing, and machine learning workloads. So, rdma/hca_shared configuration in Kubernetes is used to enable RDMA support in the cluster, which can improve the performance of certain types of applications that require fast, low-latency networking.</p><p style="text-align: justify;">--------------------Add resources to a Kubernetes cluster using YAML file-----</p><p style="text-align: justify;">To add Nvidia GPU resources to a Kubernetes cluster using YAML file, you will need to modify the spec section of the deployment or pod YAML file to include the necessary configuration options. 
Here's an example YAML file for adding Nvidia GPU resources to a Kubernetes deployment:</p><p style="text-align: justify;"><span style="font-size: x-small;">apiVersion: apps/v1<br />kind: Deployment<br />metadata:<br /> name: my-deployment<br />spec:<br /> replicas: 1<br /> selector:<br /> matchLabels:<br /> app: my-app<br /> template:<br /> metadata:<br /> labels:<br /> app: my-app<br /> spec:<br /> containers:<br /> - name: my-container<br /> image: my-image<br /> resources:<br /> limits:<br /> nvidia.com/gpu: 1<br />---------------</span></p><p style="text-align: justify;">In the example above, the resources section specifies that the container needs one Nvidia GPU resource. You can adjust the number of GPUs by changing the value of nvidia.com/gpu. You can also specify the GPU type by using the nvidia.com/gpu-type resource limit.</p><p style="text-align: justify;">You can apply this YAML file to your Kubernetes cluster using the kubectl apply -f command </p><p style="text-align: justify;">The most commonly used Kubernetes commands:</p><p style="text-align: justify;"></p><ol><li><span style="font-size: x-small;">kubectl create: creates a resource from a file or from stdin.</span></li><li><span style="font-size: x-small;">kubectl apply: applies changes to an existing resource.</span></li><li><span style="font-size: x-small;">kubectl get: retrieves information about one or more resources.</span></li><li><span style="font-size: x-small;">kubectl describe: provides detailed information about a specific resource or a set of resources.</span></li><li><span style="font-size: x-small;">kubectl delete: removes one or more resources from the cluster.</span></li><li><span style="font-size: x-small;">kubectl logs: displays logs from a specific pod or container.</span></li><li><span style="font-size: x-small;">kubectl exec: runs a command inside a container in a specific pod.</span></li><li><span style="font-size: x-small;">kubectl port-forward: forwards one or more local ports to a pod.</span></li><li><span style="font-size: x-small;">kubectl rollout: manages rolling updates of a deployment.</span></li></ol><div><div>You can use the kubectl logs command with the -f option to stream logs in real-time, similar to the tail -f command. Here's the syntax for using kubectl logs with the -f option:</div><div><br /></div><div>kubectl logs -f <pod-name> <container-name></div><div>kubectl logs -f my-pod my-container</div><div><br /></div><div>This will start streaming the logs from the my-container container in the my-pod pod in real-time. 
You can stop the log streaming by pressing Ctrl+C.</div></div><div><br /></div><div><br /></div><div><div>---------------------------------------------------------------------------------------------</div><div><br /></div><div><div>Version of Volcano scheduler installed on K8s cluster is : </div><div><br /></div><div style="text-align: left;"><span style="font-size: x-small;">$ kubectl get deployment -n volcano-system volcano-scheduler -o=jsonpath='{.spec.template.spec.containers[0].image}' | cut -d':' -f2<br />1.7<br />$</span></div><div><br /></div><div>----------details ----------------</div><div>$ kubectl describe deployment -n volcano-system volcano-scheduler</div><div><span style="font-size: xx-small;">Name: volcano-scheduler</span></div><div><span style="font-size: xx-small;">Namespace: volcano-system</span></div><div><span style="font-size: xx-small;">CreationTimestamp: Fri, 03 Feb 2023 00:38:14 -0500</span></div><div><span style="font-size: xx-small;">Labels: app=volcano-scheduler</span></div><div><span style="font-size: xx-small;">Annotations: deployment.kubernetes.io/revision: 1</span></div><div><span style="font-size: xx-small;">Selector: app=volcano-scheduler</span></div><div><span style="font-size: xx-small;">Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable</span></div><div><span style="font-size: xx-small;">StrategyType: RollingUpdate</span></div><div><span style="font-size: xx-small;">MinReadySeconds: 0</span></div><div><span style="font-size: xx-small;">RollingUpdateStrategy: 25% max unavailable, 25% max surge</span></div><div><span style="font-size: xx-small;">Pod Template:</span></div><div><span style="font-size: xx-small;"> Labels: app=volcano-scheduler</span></div><div><span style="font-size: xx-small;"> Service Account: volcano-scheduler</span></div><div><span style="font-size: xx-small;"> Containers:</span></div><div><span style="font-size: xx-small;"> volcano-scheduler:</span></div><div><span style="font-size: xx-small;"> Image: sachin-power-hpc-docker-local/volcanosh/vc-scheduler:1.7</span></div><div>----------------</div></div><div>NOTE: This command describes the deployment object for the Volcano Scheduler in the volcano-system namespace and provides detailed information about the deployment, including the image tag and version.</div><div><br /></div><div>You can get the version of Kubernetes installed in your cluster by running the following command:</div><div><br /></div><div style="text-align: left;"><span style="font-size: x-small;">$ kubectl version --short | grep Server<br />Server Version: v1.23.4<br />$</span></div><div><p>After creating the job - we can check the job name as shown in Step 2 </p><p>Step 1 : Submit Job - spec defines scheduler=volcano</p><p style="text-align: left;"><span style="font-size: xx-small;">[spb@k8s-masterNode]#kubectl create -f openmpi-example.yaml</span><br /><span style="font-size: x-small;">job.batch.volcano.sh/nj-ompi-job created</span></p><p>Step 2: get the name of job submitted by volcano scheduler . 
So, It's vcjob </p><p style="text-align: left;"><span style="font-size: xx-small;">[spb@k8s-masterNode]#kubectl get vcjob -o custom-columns=:.metadata.name</span><br /><span style="font-size: x-small;">nj-ompi-job</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#kubectl get vcjob -o custom-columns=NAME:.metadata.name</span><br /><span style="font-size: x-small;">NAME</span><br /><span style="font-size: x-small;">nj-ompi-job</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#</span></p><p>Step 3 : Just observe - how the vcjob changes it's state from Pending ---> Running--->Completed</p><p style="text-align: left;"><span style="font-size: xx-small;">[spb@k8s-masterNode]#kubectl get vcjob</span><br /><span style="font-size: x-small;">NAME STATUS MINAVAILABLE RUNNINGS AGE</span><br /><span style="font-size: x-small;">nj-ompi-job Pending 2 5s</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#kubectl get vcjob</span><br /><span style="font-size: x-small;">NAME STATUS MINAVAILABLE RUNNINGS AGE</span><br /><span style="font-size: x-small;">nj-ompi-job Running 2 2 14s</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#kubectl get vcjob</span><br /><span style="font-size: x-small;">NAME STATUS MINAVAILABLE RUNNINGS AGE</span><br /><span style="font-size: x-small;">nj-ompi-job Running 2 3 37s</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#kubectl get vcjob</span><br /><span style="font-size: x-small;">NAME STATUS MINAVAILABLE RUNNINGS AGE</span><br /><span style="font-size: x-small;">nj-ompi-job Completed 2 90s</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#</span></p><p><br /></p><p>Step 4 : Based on the spec defined in yaml file - you can dispaly the vcjob </p><p style="text-align: left;"><span style="font-size: xx-small;">[spb@k8s-masterNode]#kubectl get vcjob -o custom-columns='NAME:.metadata.name,MinAvailable:.spec.minAvailable,SCHED:.spec.schedulerName'</span><br /><span style="font-size: x-small;">NAME MinAvailable SCHED</span><br /><span style="font-size: x-small;">nj-ompi-job 2 volcano</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#</span></p><p>Step 5 : Get the status of VC job </p><p style="text-align: left;"><span style="font-size: xx-small;">[spb@k8s-masterNode]#kubectl get vcjob nj-ompi-job -n kube-system -o jsonpath='{.status.conditions[?(@.status=="Completed")].status}'</span><br /><span style="font-size: x-small;">Completed</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#kubectl get vcjob nj-ompi-job -n kube-system -o jsonpath='{.status.conditions[-1:].status}'</span><br /><span style="font-size: x-small;">Completed</span><br /><span style="font-size: x-small;">[spb@k8s-masterNode]#</span></p></div><div><div>Get the status of a specific job:</div><div><span style="font-size: xx-small;">kubectl get job.batch.volcano.sh job-name -o yaml</span></div><div><br /></div><div>This command will display the job status in YAML format, which can be useful for debugging errors or issues with the job</div><div><br /></div><div>Get the logs for a specific pod in the job</div><div><span style="font-size: xx-small;">kubectl logs job-name-pod-name</span></div><div><br /></div><div>Get detailed information about the job and its pods:</div><div><span 
style="font-size: xx-small;">kubectl describe job.batch.volcano.sh job-name</span></div><div><span style="font-size: xx-small;">kubectl describe pod job-name-pod-name</span></div><div><br /></div><div>View the events for a specific job:</div><div><br /></div><div><span style="font-size: xx-small;">kubectl get events --field-selector involvedObject.name=job-name</span></div><div><br /></div><div><b>Example:</b></div><div><br /></div><div><div><span style="font-size: xx-small;">$ kubectl get events --field-selector involvedObject.name=smpi-build-0 -n smpici</span></div><div><span style="font-size: xx-small;">LAST SEEN TYPE REASON OBJECT MESSAGE</span></div><div><span style="font-size: xx-small;">20m Normal Scheduled pod/smpi--build-0 Successfully assigned smpici/smpi-build-0 to workerNode</span></div><div><span style="font-size: xx-small;">20m Normal Pulling pod/smpi-build-0 Pulling image "sachin-power-hpc-docker-local/master:rhel_8.6"</span></div><div><span style="font-size: xx-small;">20m Normal Pulled pod/smpi-build-0 Successfully pulled image "sachin-power-hpc-docker-local/master:rhel_8.6" in 2.374936782s</span></div><div><span style="font-size: xx-small;">20m Normal Created pod/smpi--build-0 Created container build</span></div><div><span style="font-size: xx-small;">20m Normal Started pod/smpi-build-0 Started container build</span></div><div><span style="font-size: xx-small;">$</span></div></div><div><br /></div><div>View the resource usage of a specific pod:</div><div><span style="font-size: xx-small;">kubectl top pod job-name-pod-name</span></div><div><br /></div><div>This command will display the CPU and memory usage of a specific pod in the job, which can help with diagnosing performance issues or resource constraints.</div></div><div><br /></div><div><b>Example : </b></div><div><div><span style="font-size: xx-small;">$ kubectl get pods -n smpici</span></div><div><span style="font-size: xx-small;">NAME READY STATUS RESTARTS AGE</span></div><div><span style="font-size: xx-small;">smpi-37-build-0 1/1 Running 0 16m</span></div><div><span style="font-size: xx-small;">$</span></div><div><span style="font-size: xx-small;">$ kubectl top pod smpi-weekly-ibm-smpi-37-build-0 -n smpici</span></div><div><span style="font-size: xx-small;">NAME CPU(cores) MEMORY(bytes)</span></div><div><span style="font-size: xx-small;">smpi-37-build-0 2m 208Mi</span></div><div><span style="font-size: xx-small;">$</span></div></div><div><br /></div><div><b>Example :</b> How to <b>delete the pod </b>that completed successfully</div><div><br /></div><div><div><span style="font-size: xx-small;">$ kubectl get pods -n smpici --field-selector=status.phase==Succeeded</span></div><div><span style="font-size: xx-small;">NAME READY STATUS RESTARTS AGE</span></div><div><span style="font-size: xx-small;">smpi-30-k8s-mpimaster-0 0/1 Completed 0 5d2h</span></div><div><span style="font-size: xx-small;">$</span></div><div><span style="font-size: xx-small;">$ kubectl delete pods -n smpici --field-selector=status.phase==Succeeded</span></div><div><span style="font-size: xx-small;">pod "smpi-30-k8s-mpimaster-0" deleted</span></div><div><span style="font-size: xx-small;">$</span></div></div><div><br /></div><div>-------------------------Vela, IBM’s first AI-optimized, cloud-native supercomputer.----------------------</div><div><br /></div><div><div><br /></div><div>IBM has announced the release of Vela, a new security technology and AI supercomputer. 
The technology is designed to assist organizations in detecting and responding to cyber threats in real time.</div><div><br /></div><div>It is also IBM’s first AI-optimized, cloud-native supercomputer, designed for developing and training large-scale AI models. It combines various security tools and technologies to provide a unified view of a company’s security posture.</div><div><br /></div><div>Vela is intended to assist organizations in improving their security posture by providing them with the information they need to quickly identify and respond to threats. The technology analyzes massive amounts of security data and provides actionable insights to security teams using artificial intelligence and machine learning algorithms. It integrates various security tools and technologies and employs artificial intelligence and machine learning algorithms to provide security teams with actionable insights.</div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCv5D0uTl1jdb3pVf0XznGXPwXNOshW0yglPkYXuLvZ7rmsD7U4YD36kEDDxFiguNGYlkWOpfwsHjxT-FWL1eyxWL2X5Boc381IUpiPhHcf6PqZviYjScj8R1RJG6XTVAxoIbyw05pPALHFcUstWjK2jYuw7-pocWL8Z7pjINMiQqfUp5wGffl-40_nQ/s1008/cloud_native_hpc-AI-stack.png" style="margin-left: auto; margin-right: auto;"><img alt="https://research.ibm.com/blog/AI-supercomputer-Vela-GPU-cluster" border="0" data-original-height="542" data-original-width="1008" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCv5D0uTl1jdb3pVf0XznGXPwXNOshW0yglPkYXuLvZ7rmsD7U4YD36kEDDxFiguNGYlkWOpfwsHjxT-FWL1eyxWL2X5Boc381IUpiPhHcf6PqZviYjScj8R1RJG6XTVAxoIbyw05pPALHFcUstWjK2jYuw7-pocWL8Z7pjINMiQqfUp5wGffl-40_nQ/w640-h344/cloud_native_hpc-AI-stack.png" title="https://research.ibm.com/blog/AI-supercomputer-Vela-GPU-cluster" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://research.ibm.com/blog/AI-supercomputer-Vela-GPU-cluster" target="_blank"><span style="font-size: xx-small;">source</span></a></td></tr></tbody></table><div><br /></div></div><div><br /></div></div><div>A comparison of the traditional HPC software stack and Cloud-native AI stack are shown in diagram. 
But the downside of cloud native stack with virtualization, historically, is that it reduces node performance.</div><div><br /></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrizcpkDXspbB0JsI2zHFis2Hfx8tDTn_H5RpxG92rcvoyPHCPh0E7dnso3V83OIPtbKC0GI7aC9cQz9ZcR63oA-RUkt7BduFwJlQAR4C2G8Gxj-CNg3vKpuHnBhZHz_WQBAerCNwmqaqlRk30CykvuUBGXJE7p_droyOdTcfaKuFri-a_Eco0WLrTwA/s955/vela.png" style="font-size: 20px; margin-left: auto; margin-right: auto;"><span face="IBM Plex Sans, Helvetica Neue, Arial, sans-serif" style="color: #121619;"><img border="0" data-original-height="718" data-original-width="955" height="482" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrizcpkDXspbB0JsI2zHFis2Hfx8tDTn_H5RpxG92rcvoyPHCPh0E7dnso3V83OIPtbKC0GI7aC9cQz9ZcR63oA-RUkt7BduFwJlQAR4C2G8Gxj-CNg3vKpuHnBhZHz_WQBAerCNwmqaqlRk30CykvuUBGXJE7p_droyOdTcfaKuFri-a_Eco0WLrTwA/w640-h482/vela.png" width="640" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://research.ibm.com/blog/AI-supercomputer-Vela-GPU-cluster" target="_blank"><span face="IBM Plex Sans, Helvetica Neue, Arial, sans-serif" style="color: #121619; font-size: xx-small;">source</span></a></td></tr></tbody></table><div><br /></div><p></p><p style="text-align: justify;"><br /></p><p style="text-align: justify;">This work includes configuring the bare-metal host for virtualization with support for Virtual Machine Extensions (VMX), single-root IO virtualization (SR-IOV), and huge pages. With AI-optimized processor, the IBM AIU [Artificial Intelligence Unit] - The era of cloud-native AI supercomputing has only just begun. The IBM AIU is not a graphics processor. It was specifically designed and optimized to accelerate matrix and vector computations used by deep learning models. The AIU can solve computationally complex problems and perform data analysis at speeds far beyond the capability of a CPU. Deploying AI to classify cats and dogs in photos is a fun academic exercise. But it won’t solve the pressing problems we face today. For AI to tackle the complexities of the real world — things like predicting the next Hurricane /natural calamities or whether we’re heading into a recession — we need enterprise-quality, industrial-scale hardware. IBM AIU takes us one step closer. </p><p style="text-align: justify;">There are many real-world examples of Kubernetes clusters being used to run HPC workloads. Here are a few examples:</p><p style="text-align: justify;"></p><ul><li>The National Energy Research Scientific Computing Center (NERSC), which is part of the US Department of Energy, uses Kubernetes to manage its HPC resources. NERSC uses Kubernetes to manage both traditional HPC workloads and machine learning workloads, and has reported significant improvements in resource utilization and efficiency since adopting Kubernetes.</li><li>Argonne National Laboratory's Theta supercomputer, which is one of the fastest supercomputers in the world, uses Kubernetes to manage its containerized workloads. 
Theta uses Kubernetes to run a variety of scientific simulations, including simulations of earthquakes and climate models.</li><li style="text-align: justify;">Oak Ridge National Laboratory's Summit supercomputer, formerly the world's most powerful supercomputer, also uses Kubernetes to manage its containerized workloads, using it to run scientific simulations and other HPC jobs. Summit uses IBM Spectrum MPI as its primary MPI implementation for running parallel workloads, and is capable of running a wide variety of MPI-based applications, including simulations, data analysis, and machine learning.</li><li>The University of Cambridge's High Performance Computing Service uses Kubernetes to manage its HPC resources. The service uses Kubernetes to run a variety of HPC workloads, including simulations of fluid dynamics and other scientific applications.</li></ul><div style="text-align: justify;">One of the most commonly used container runtime engines on Summit is Singularity, an open-source container platform designed specifically for HPC workloads. Singularity is optimized for running MPI workloads and provides a number of features tailored to HPC, including support for InfiniBand networking and the ability to run on a wide range of HPC systems. In addition to Singularity, Summit also supports other container runtime engines, including Docker and Shifter; however, Singularity is the preferred container runtime engine for running HPC workloads on Summit, and is used by many researchers and organizations to run MPI workloads.</div><div style="text-align: justify;"><div><br /></div><div><div>There are a number of emerging container runtime technologies being developed for use in HPC environments. One example is Charliecloud, an open-source container runtime designed specifically for HPC workloads. Charliecloud provides a lightweight and secure environment for running containerized workloads on HPC systems, and is optimized for performance and flexibility.</div><div><br /></div><div>Another emerging technology is udocker, a user-space container engine that provides a simple and lightweight way to run containerized workloads on HPC systems. Udocker is designed for environments where users may not have root access to the system, and provides features relevant to HPC workloads, including support for MPI and InfiniBand networking.</div></div><div><br /></div><div><div>Frontier is the exascale supercomputer at Oak Ridge National Laboratory and the successor to Summit. </div><div>Alongside MPI, systems such as Frontier are also targets for alternative parallel programming models that aim to improve performance, scalability, and flexibility at extreme scale. One example is Unified Parallel C (UPC), a partitioned global address space (PGAS) extension of the C language that is designed to simplify the development of parallel algorithms for HPC workloads. 
It provides a number of features specifically tailored to HPC, including support for shared memory programming and optimized communication primitives.</div></div><div><br /></div><div>AWS provides a wide range of compute, storage, and network resources that are optimized for running HPC workloads in the cloud. By leveraging these services, researchers and engineers can easily and cost-effectively run complex simulations and other HPC workloads, while also benefiting from the scalability and flexibility of cloud computing. Some examples:</div><div><div><ul><li>Amazon EC2: This is AWS's flagship compute service, which allows users to rent virtual machines (instances) with a variety of different configurations and capabilities. EC2 provides a wide range of instance types that are optimized for different workloads, including HPC workloads. These instance types offer high-performance CPUs, GPUs, and FPGAs, as well as high-speed network connectivity.</li><li>Amazon Elastic File System (EFS): This is a fully managed cloud file storage service that is designed for HPC workloads. EFS provides a scalable and highly available file system that can be accessed from multiple instances simultaneously, which is important for parallel computing workloads.</li><li>Amazon S3: This is a highly scalable and durable object storage service that can be used to store and retrieve large data sets for HPC workloads. S3 provides a simple API that can be used to access data from anywhere, and supports a variety of data formats and access patterns.</li><li>AWS ParallelCluster: This is an open-source HPC cluster management tool that can be used to deploy and manage HPC clusters on AWS. ParallelCluster provides a simple interface for configuring and launching HPC clusters, and supports a variety of different schedulers and software packages.</li><li>Amazon FSx for Lustre: This is a fully managed file system service that provides high-performance Lustre file systems for HPC workloads. FSx for Lustre provides scalable performance and high availability, and can be used to store and manage large data sets for parallel computing workloads.</li></ul><div>To run weather data analysis using HPC cloud services on AWS, you would need to follow these general steps:</div></div></div><div><div><ul><li>Provision the necessary compute resources: Using services such as Amazon EC2, you would need to launch instances that are optimized for your specific workload. For weather data analysis, you may require high-performance CPUs, GPUs, or FPGAs, as well as high-speed network connectivity to move data in and out of the instances.</li><li>Store the data: You would need to store the weather data in a highly scalable and durable storage system. Amazon S3 is a popular choice for storing large data sets in the cloud.</li><li>Configure the software stack: Once the compute resources and data storage are set up, you would need to configure the software stack. This would involve installing and configuring the necessary software packages, including any libraries or tools that are required for weather data analysis.</li><li>Submit the analysis jobs: You would then submit the analysis jobs to the HPC cluster using a batch scheduler such as Slurm or AWS ParallelCluster. 
The scheduler will manage the allocation of compute resources and ensure that the jobs are executed in a timely and efficient manner.</li><li>Retrieve the results: Once the analysis jobs are completed, you would retrieve the results from the storage system and perform any post-processing or analysis that is necessary.</li></ul><div>Additionally, there are a number of third-party tools and services available that are specifically designed for weather data analysis in the cloud, such as the Unidata Cloud-Ready Data Services, which provides a suite of cloud-based tools and services for working with weather data. There are some configuration details for a typical HPC software stack that might be used for the above example -weather data analysis :</div></div></div><div><div><ul><li>Operating System: Many HPC workloads run on Linux-based operating systems, such as CentOS or Ubuntu. These operating systems are generally lightweight and optimized for high-performance computing. Linux distributions, such as CentOS and Ubuntu, are commonly used in HPC environments, particularly in academic and research settings where open source software is often preferred. RHEL (Red Hat Enterprise Linux) and SLES (SUSE Linux Enterprise Server) are actually commonly used operating systems in HPC environments, particularly in the commercial sector. Both RHEL and SLES are enterprise-grade operating systems that offer long-term support and stability, which are important features in HPC environments where system uptime and reliability are critical.</li><li>Cluster Management Software: There are a variety of cluster management software packages available for HPC workloads, including Slurm, Torque, and LSF. These tools provide a way to manage the allocation of compute resources and schedule jobs on the cluster.</li><li>MPI Library: For parallel computing workloads, you would typically need to install a Message Passing Interface (MPI) library, such as OpenMPI or MPICH. These libraries allow multiple processes to communicate with each other and coordinate their work on the cluster.</li><li>Compiler Toolchain: A compiler toolchain is necessary for building and executing code on the cluster. Commonly used compilers include GCC, Clang, and Intel Compiler.</li><li>Data Processing Libraries: For weather data analysis, you would typically need to install a variety of data processing libraries, such as NetCDF, HDF5, and GRIB. These libraries allow you to read, write, and manipulate weather data in a variety of formats.</li><li>Visualization Tools: Once the weather data analysis is complete, you may want to visualize the results using tools such as Matplotlib, Paraview, or Visit. These tools allow you to create visualizations and animations that can help you better understand the results of your analysis.</li></ul></div></div><div>Steps to be followed in AWS to run HPC workloads :</div><div><div><ul><li>Choose an AWS instance: Select an Amazon EC2 instance type that suits your computational needs. EC2 provides a range of instance types with varying CPU, memory, and storage capacities. For a weather data analysis, you may need a high-memory instance type with multiple CPUs and GPUs, depending on the size of your data.</li><li>Install software dependencies: Install the necessary software dependencies on your EC2 instance, such as the operating system, compilers, libraries, and analysis tools. 
You can either install these manually or use a configuration management tool like Ansible or Chef to automate the process.</li><li>Move data to AWS: Transfer your weather data from your on-premises environment to the AWS cloud using AWS Storage services such as Amazon S3 or EFS. You can also use AWS Direct Connect to establish a dedicated network connection between your on-premises environment and your AWS resources.</li><li>Configure your software: Set up your weather analysis software, including the input/output paths, data formats, and algorithm parameters. You can run your analysis either on a single EC2 instance or on a cluster of instances using HPC job schedulers like Slurm or LSF.</li><li>Monitor and optimize performance: Monitor the performance of your weather analysis to identify bottlenecks and optimize performance. You can use AWS CloudWatch to monitor CPU, memory, and network utilization, as well as application-level metrics like response times and error rates.</li><li>Generate reports and visualizations: Once your analysis is complete, you can generate reports and visualizations using tools like Matplotlib, Paraview, or Visit. You can save the results to Amazon S3 or EFS, or use other AWS services like AWS Lambda to trigger alerts or actions based on the results.</li></ul><div>Some of the commonly used HPC workloads on AWS cloud :</div></div></div><div><ul><li>Genomics and bioinformatics: AWS offers a range of genomic and bioinformatics tools and services that can handle large-scale data processing and analysis tasks. These include tools for sequence alignment, variant calling, gene expression analysis, and genome assembly.</li><li>Computational chemistry and materials science: AWS provides a range of high-performance computing (HPC) resources, such as GPU-enabled instances, that can handle complex simulations and calculations. These resources can be used to run molecular dynamics simulations, quantum chemistry calculations, and other computational chemistry and materials science workflows.</li><li>Financial modeling and simulation: AWS provides a range of compute and storage resources that can be used for financial modeling and simulation tasks. These resources can be used to run Monte Carlo simulations, backtesting, portfolio optimization, and other financial analysis workflows.</li><li>Weather and climate modeling: AWS provides a range of weather and climate data services, such as Amazon Forecast, that can be used to generate forecasts and predictions. These services can be used to run weather and climate simulations, analyze large datasets, and generate forecasts for various applications.</li><li>Machine learning and AI: AWS provides a range of machine learning and AI services, such as Amazon SageMaker and Amazon Rekognition, that can be used to train and deploy machine learning models. These services can be used for natural language processing, image and video analysis, and other AI workflows.</li></ul><div><div>Area of research in HPC is the development of container-based cloud environments that are designed to support scientific workloads. These environments aim to provide a flexible and scalable platform for running scientific applications, while also addressing challenges such as data movement, job scheduling, and resource management. As mentioned earlier , Singularity is designed to support the needs of scientific and HPC workloads, and provides features such as secure image management, container mobility, and seamless integration with HPC schedulers. 
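<div><br /></div><div>As a rough illustration of that MPI and scheduler integration, here is a minimal sketch of the common "hybrid" launch model, in which the host's mpirun starts one process per rank and each rank executes inside the container. The image name, registry path, hostfile, and application binary below are hypothetical placeholders, and the sketch assumes the MPI library inside the container is compatible with the host MPI:</div><div><br /></div><div><span style="font-size: xx-small;"># Build a local SIF image from a (hypothetical) registry image</span></div><div><span style="font-size: xx-small;">$ singularity pull mympi_app.sif docker://registry.example.com/mympi_app:latest</span></div><div><span style="font-size: xx-small;"># Launch 8 ranks across the hosts listed in ./hosts; each rank runs inside the container</span></div><div><span style="font-size: xx-small;">$ mpirun -np 8 --hostfile hosts singularity exec mympi_app.sif /opt/app/hello_mpi</span></div><div><span style="font-size: xx-small;"># Under a scheduler such as LSF, the same launch line can be wrapped in a job submission</span></div><div><span style="font-size: xx-small;">$ bsub -n 8 -R "span[ptile=4]" mpirun singularity exec mympi_app.sif /opt/app/hello_mpi</span></div><div><br /></div>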
Another area of research is the development of new programming models and frameworks that are designed to simplify the development of HPC applications in container-based environments. For example, the Big Data and Extreme Computing (BDEC) project is developing a framework that provides a high-level abstraction layer for managing and orchestrating distributed scientific workflows. The project has the potential to make it easier and more efficient to manage large-scale scientific workflows in a variety of domains.</div><div><br /></div></div></div></div><p></p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-36674264917554387742022-06-10T09:37:00.024+05:302022-07-18T17:04:55.889+05:30Frontier supercomputer powered by AMD is the fastest and first exascale machine<p style="text-align: justify;">Exascale computing is the next milestone in the development of supercomputers. Able to process information much faster than today’s most powerful supercomputers, exascale computers will give scientists a new tool for addressing some of the biggest challenges facing our world, from climate change to understanding cancer to designing new kinds of materials. </p><p style="text-align: justify;">One way scientists measure computer performance is in floating point operations per second (FLOPS). These involve simple arithmetic like addition and multiplication problems. Their performance in FLOPS has so many zeros - researchers instead use prefixes like Giga, Tera, Exa. where “Exa” means 18 zeros. That means an exascale computer can perform more than 1,000,000,000,000,000,000 FLOPS, or 1 exaFLOP. DOE is deploying the United States’ first exascale computers: Frontier at ORNL and Aurora at Argonne National Laboratory and El Capitan at Lawrence Livermore National Laboratory.</p><p style="text-align: justify;">Exascale supercomputers will allow scientists to create more realistic Earth system and climate models. They will help researchers understand the nanoscience behind new materials. Exascale computers will help us build future fusion power plants. They will power new studies of the universe, from particle physics to the formation of stars. And these computers will help ensure the safety and security of the United States by supporting tasks such as the maintenance of US nuclear deterrent.</p><p style="text-align: justify;">For decades, the performance maximization has been the chief concern of both the hardware architects and the software developers. Due to end of performance scaling by increasing CPUs clock frequencies (i.e Moore's Law), Industries transit from single-core to multi-core and many-core architectures. As a result, the hardware acceleration and the use of co-processors together with CPU are becoming a popular choice to gain the performance boost while keeping the power budget low. This includes both the new customized hardware for particular application domain such as Tensor Processing Unit (TPU), Vision Processing Unit (VPU) and Neural Processing Unit (NPU); and the modifications in existing platforms such as Intel Xeon Phi co-processors, general purpose GPUs and Field Programmable Gate Array (FPGA)s. Such accelerators together with main processors and memory, constitute a heterogeneous system. However, this heterogeneity has raised unprecedented difficulties posed to performance and energy optimization of modern heterogeneous HPC platforms. 
</p><p style="text-align: justify;">The focus of maximizing the performance of HPC in terms of completing the hundreds of trillion Floating Point Operations Per Second (FLOPS) has led the supercomputers to consume an enormously high amount of energy in terms of electricity and for cooling down purposes. As a consequence, current HPC systems are already consuming Megawatts of energy. Energy efficiency is becoming an equally important design concern with performance in ICT. Current HPC systems are already consuming Megawatts of energy. For example, the world’s powerful supercomputer like Summit consumes around 13 Megawatts of power which is roughly equivalent to the power draw of roughly over 10000 households. Because of such high power consumption, future HPC systems are highly likely to be power constrained. For example, DOE aims to deploy this exascale supercomputer capable of performing 1 million trillion ( or <span face="arial, sans-serif" style="background-color: white; color: #5f6368; font-size: 14px; font-weight: bold;">10<span style="position: relative; top: -0.4em; vertical-align: baseline;">18</span></span>) floating-point operations per second in a power envelope of 20-30 megawatts. Initial target was to deliver a double precision exaflops of compute capability for 20 megawatts of power and other target was 2 exaflops for 29 megawatts of power when it’s running at full power. Taking into consideration the above-mentioned factors, HPE Cray <span style="text-align: left;">designed </span><span style="text-align: left;">Frontier supercomputer powered by AMD </span><span style="text-align: left;">for </span><span style="text-align: left;">growing accelerated computational needs and power constraints.</span></p><p>The Frontier supercomputer, built at the Department of Energy's Oak Ridge National Laboratory in Tennessee, has now become the world's first known supercomputer to demonstrate a processor speed of 1.1 exaFLOPS (1.1 quintillion floating point operations per second, or FLOPS). The Frontier supercomputer's exascale performance is enabled by world's most advanced pieces of technology from HPE and AMD.</p><p>Frontier supercomputer powered by AMD is the first exascale machine meaning it can process more than a quintillion calculations per second with an HPL score of 1.102 Exaflop/s. Based on the latest HPE Cray EX235a architecture and equipped with AMD EPYC 64C 2GHz processors, the system has 8,730,112 total cores and a power efficiency rating of 52.23 gigaflops/watt. It relies on gigabit ethernet for data transfer. </p><p>Exascale is the next level of computing performance. By solving calculations five times faster than today’s top supercomputers—exceeding a quintillion [ or <span face="arial, sans-serif" style="background-color: white; color: #5f6368; font-size: 14px; font-weight: bold;">10<span style="position: relative; top: -0.4em; vertical-align: baseline;">18 </span></span> ] calculations per second—exascale systems will enable scientists to develop new technologies for energy, medicine, and materials. The Oak Ridge Leadership Computing Facility will be home to one of America’s first exascale systems, Frontier, which will help guide researchers to new discoveries at exascale.</p><p>It's based on HPE Cray’s new EX architecture and Slingshot interconnect with optimized 3rd Gen AMD EPYC™ CPUs for HPC and AI, and AMD Instinct™ 250X accelerators. It delivers linepack (double precision floating point – FP64) compute performance of 1.1 EFLOPS (ExaFLOPS). 
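<br /><br />As a quick consistency check using only the figures quoted in this post: 1.102 exaflops is about 1,102,000,000 gigaflops, and 1,102,000,000 gigaflops ÷ 52.23 gigaflops/watt ≈ 21,100,000 watts, i.e. roughly 21 megawatts drawn during the HPL run. That agrees with the 21.1-megawatt figure reported below, sits inside the DOE's 20-30 megawatt exascale power target mentioned above, and is well under the roughly 29 megawatts Frontier draws at its very peak.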
</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbfnDNMCmvp-4e1RHC472_xc6gP_QorWKRmvum7T480JmtdG1Fr0HVNn2NT0qBHJgFPQNGzmwcBaFIZLhbu7043s2T3X7T5lMORaEaT16FdfMsEqVYKjhRnbY5qOBdiT5QLaFQhyS4c5p5LGvzRqTFkDQIs720ZZG6Q7I09uSVF2c7zQ6qfgZ64V3uWw/s1179/frontier_supercomp.PNG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="549" data-original-width="1179" height="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbfnDNMCmvp-4e1RHC472_xc6gP_QorWKRmvum7T480JmtdG1Fr0HVNn2NT0qBHJgFPQNGzmwcBaFIZLhbu7043s2T3X7T5lMORaEaT16FdfMsEqVYKjhRnbY5qOBdiT5QLaFQhyS4c5p5LGvzRqTFkDQIs720ZZG6Q7I09uSVF2c7zQ6qfgZ64V3uWw/w640-h298/frontier_supercomp.PNG" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet.pdf" target="_blank"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><br /><p>The Frontier test and development system (TDS) secured the first place in the Green500 list, delivering 62.68 gigaflops/watt power-efficiency from a single cabinet of optimised 3rd Gen AMD EPYC processors and AMD Instinct MI250x accelerators. It could lead to breakthroughs in medicine, astronomy, and more. </p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/etVzy1z_Ptg" width="320" youtube-src-id="etVzy1z_Ptg"></iframe></div><br /><p></p><p>The HPE/AMD system delivers 1.102 Linpack exaflops of computing power in a 21.1-megawatt power envelope, an efficiency of 52.23 gigaflops per watt. Frontier only uses about 29 megawatts at its very peak. 
During a test, Frontier ran at 1.1 exaflops and could go as high as 2 exaflops.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3PmfFXPukdNTKylG-WmNZJINknNq6LQoU_tFcVmKLOpb1tZv1Jr_mrkx9TG3ZyUX-vLbDgcwxPEXUFKzxPrzVMc9cGa2y3ARYnjOQsJA4Rxkapbeog54Hc2hwNk31zF0UC-R5VUwAXHkLYcX51yV6fucvy5TXcUYr7eSisgHrEX5D9-e0hlZ1IiobxA/s1181/Fronntier_hardware.PNG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="653" data-original-width="1181" height="354" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3PmfFXPukdNTKylG-WmNZJINknNq6LQoU_tFcVmKLOpb1tZv1Jr_mrkx9TG3ZyUX-vLbDgcwxPEXUFKzxPrzVMc9cGa2y3ARYnjOQsJA4Rxkapbeog54Hc2hwNk31zF0UC-R5VUwAXHkLYcX51yV6fucvy5TXcUYr7eSisgHrEX5D9-e0hlZ1IiobxA/w640-h354/Fronntier_hardware.PNG" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet.pdf" target="_blank"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><br /><p><br /></p><p>Node diagram:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj92uRqXnFM_0sQicA7u_kVyt0Zw2FdfnEDidyG5p1JejmT1adi0YzEhdmXILtSo4lFV7eG_r3hdFTe_LBYyDmemteTb-IPnbrAk8PiqPUU5WC5VLUWyDiw-PvrK9JINnMnwnHe-ShKlhJmurnLBz_ro4Uk7Syv8WdW8ThWUzbuTDG8Ip6V9AjL3yxDCg/s1075/Frontier_arch.PNG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="709" data-original-width="1075" height="422" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj92uRqXnFM_0sQicA7u_kVyt0Zw2FdfnEDidyG5p1JejmT1adi0YzEhdmXILtSo4lFV7eG_r3hdFTe_LBYyDmemteTb-IPnbrAk8PiqPUU5WC5VLUWyDiw-PvrK9JINnMnwnHe-ShKlhJmurnLBz_ro4Uk7Syv8WdW8ThWUzbuTDG8Ip6V9AjL3yxDCg/w640-h422/Frontier_arch.PNG" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.olcf.ornl.gov/frontier/" target="_blank"><span style="font-size: xx-small;">Source</span></a></td></tr></tbody></table><p style="text-align: justify;">These are HPE Cray EX systems has 74 cabinets of this — 9,408 nodes. Each node has one CPU and four GPUs. The GPUs are the [AMD] MI250Xs. The CPUs are an AMD Epyc CPU. It’s all wired together with the high-speed Cray interconnect, called Slingshot. And it’s a water-cooled system. Recently good efforts towards using computational fluid dynamics to model the water flow in the cooling system. These are incredibly instrumented machines with liquid cooling dynamically adjust to the workloads. There’s sensors that are monitoring temperatures where even down to the individual components on the individual node-boards, so they can adjust the cooling levels up and down to make sure that the system stays at a safe temperature. It was estimated to provide over 60 gigaflops-per-watt for the single cabinet run.</p><p style="text-align: justify;">This is the datacenter where they formerly had the Titan supercomputer. So they removed that supercomputer and refurbished this datacenter. That needed more power and more cooling. So they brought in 40 megawatts of power to the datacenter and have 40 megawatts of cooling available. 
Frontier really only uses about 29 megawatts of that at its very peak. And this Supercomputer is even a little bit quieter than Summit because they’re going to liquid-cooled with no fans and no rear doors where exchanging heat with the room. It’s 100 percent liquid cooled, and the [fan] noise generated from storage systems that are also HPE and are air-cooled.</p><p style="text-align: justify;">At OLCF [Oak Ridge Leadership Computing Facility], they have the Center for Accelerated Application Readiness, we call it CAAR. Its vehicle for application readiness. That group supports eight apps for the OLCF and 12 apps for the Exascale Computing Project. Frontier was OLCF-5, the next system will be OLCF-6.</p><p style="text-align: justify;">The result was confirmed in a benchmarking test called High-Performance Linpack (HPL). As impressive as that sounds, the ultimate limits of Frontier are even more staggering, with the supercomputer theoretically capable of a peak performance of 2 quintillion calculations per second. Among all these massively powerful supercomputers, only Frontier has achieved true exascale performance, at least where it counts, according to TOP500. S<span style="text-align: left;">ome of the most exciting things are the work in artificial intelligence and those workloads. Plan for research teams to develop better treatments for different diseases, how to improve efficacies of treatments, and these systems are capable of digesting just incredible amounts of data. Thousands of laboratory reports or pathology reports, can draw inferences across these reports that no human being could ever do but that a supercomputer can do. </span><span style="text-align: left;">They still have Summit here, a previous Top500 number-one system, an IBM/Nvidia machine. </span><span style="text-align: left;">It’s highly utilized at this point. </span><span style="text-align: left;">They will at least run it for a year and overlap with Frontier </span><span style="text-align: left;"> so that we can make sure that Frontier is up and stable and give people time to transition their data and their applications over to the new system.</span></p><p style="text-align: justify;">With the Linpack exaflops milestone achieved by the Frontier supercomputer at Oak Ridge National Laboratory, the United States is turning its attention to the next crop of exascale machines, some 5-10x more performant than Frontier. At least one such system is being planned for the 2025-2030 timeline, and the DOE is soliciting input from the vendor community to inform the design and procurement process. That can solve scientific problems 5 to 10 times faster – or solve more complex problems, such as those with more physics or requirements for higher fidelity – than the current state-of-the-art systems. These future systems will include associated networks and data hierarchies. A capable software stack will meet the requirements of a broad spectrum of applications and workloads, including large-scale computational science campaigns in modeling and simulation, machine intelligence, and integrated data analysis. They expect these systems to operate within a power envelope of 20–60 MW. These systems must be sufficiently resilient to hardware and software failures, in order to minimize requirements for user intervention. 
This could include the successor to Frontier (aka OLCF-6), the successor to Aurora (aka ALCF-5), the successor to Crossroads (aka ATS-5), the successor to El Capitan (aka ATS-6) as well as a future NERSC system (possibly NERSC-11). Note that of the “predecessor systems,” only Frontier has been installed so far. A key thrust of the DOE supercomputing strategy is the creation of an Advanced Computing Ecosystem (ACE) that enables “integration with other DOE facilities, including light source, data, materials science, and advanced manufacturing. The next generation of supercomputers will need to be capable of being integrated into an ACE environment that supports automated workflows, combining one or more of these facilities to reduce the time from experiment and observation to scientific insight.</p><p style="text-align: justify;"><span style="text-align: left;">The original CORAL contract called for three pre-exascale systems (~100-200 petaflops each) with at least two different architectures to manage risk. Only two systems – Summit at Oak Ridge and Sierra at Livermore – were completed in the intended timeframe, using nearly the same heterogeneous IBM-Nvidia architecture. CORAL-2 took a similar tack, calling for two or three exascale-class systems with at least two distinct architectures. The program is procuring two systems – Frontier and El Capitan – both based on a similar heterogenous HPE AMD+AMD architecture. The redefined Aurora – which is based on the heterogenous HPE Intel+Intel architecture – becomes the “architecturally-diverse” third system (although it technically still belongs to the first CORAL contract).</span></p><p style="text-align: justify;"><span style="text-align: left;">-----</span></p><p><span style="font-size: xx-small;">Reference:</span></p><p><span style="font-size: xx-small;">https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet.pdf</span></p><p><span style="font-size: xx-small;">https://www.researchgate.net/figure/Power-consumption-by-top-10-supercomputers-over-time-based-on-the-results-published_fig2_350176475</span></p><p><span style="font-size: xx-small;">https://www.youtube.com/watch?v=HvJGsF4t2Tc</span></p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-89080287352700012142022-02-14T19:13:00.010+05:302022-09-15T11:26:02.843+05:30Open MPI with hierarchical collectives (HCOLL) Algorithms <p><span style="text-align: justify;">MPI, an acronym for Message Passing Interface, is a library specification for parallel computing architectures, which allows for communication of information between various nodes and clusters. Today, MPI is the most common protocol used in high performance computing (HPC).</span></p><p style="text-align: justify;">The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. 
Open MPI offers advantages for system and software vendors, application developers and computer science researchers.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjwbL9Q_J12vCEP2OlByMRIcDXFI8UiE-ec97iC9cWTREUyJwBH1V_NPE_1nOvp4S38N9fJEQYm1uK_Zzfd1OS7ZZ-21mVKBC8AOr3AO-TBLFu8agPWb-4e6xr-5BgI1mlrLSYIfI_kznDVy5R--2OjqbFJywIXBlV4lc3o_QXmXXXAEP5e86sWtoipCw=s924" style="margin-left: auto; margin-right: auto;"><img alt="https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi/" border="0" data-original-height="513" data-original-width="924" height="357" src="https://blogger.googleusercontent.com/img/a/AVvXsEjwbL9Q_J12vCEP2OlByMRIcDXFI8UiE-ec97iC9cWTREUyJwBH1V_NPE_1nOvp4S38N9fJEQYm1uK_Zzfd1OS7ZZ-21mVKBC8AOr3AO-TBLFu8agPWb-4e6xr-5BgI1mlrLSYIfI_kznDVy5R--2OjqbFJywIXBlV4lc3o_QXmXXXAEP5e86sWtoipCw=w640-h357" title="source" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi/"><span style="font-size: xx-small;">source</span></a></td></tr></tbody></table><p style="text-align: justify;">Open MPI is developed in a true open source fashion by a consortium of research, academic, and industry partners. Latest version of Open MPI: Version 4.1.</p><p>Download OpenMPI from link <span style="font-size: xx-small;"><a href="https://www.open-mpi.org/software/ompi/v4.1/">https://www.open-mpi.org/software/ompi/v4.1/</a></span></p><p>Example <a href="https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz]">https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz</a> </p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEj2P1T6h2smqy5SNfPLR8uYhttaJo8La-4Nw1gkI73aif48nMxPjN-NiuCEvcQrauo80ifN307sSws0ZdCZ8rMhC6zlJSTpZhvq-BPchgnTLu6MpdcxfJoMUYgCeWvR_J2fX-d-HthS6O9FdZxuMkYXeLoGcoeW7Yr2HJcuUiSnzWGmlCLn1yPMZuRM8w=s916" style="margin-left: auto; margin-right: auto;"><img alt="source" border="0" data-original-height="791" data-original-width="916" height="552" src="https://blogger.googleusercontent.com/img/a/AVvXsEj2P1T6h2smqy5SNfPLR8uYhttaJo8La-4Nw1gkI73aif48nMxPjN-NiuCEvcQrauo80ifN307sSws0ZdCZ8rMhC6zlJSTpZhvq-BPchgnTLu6MpdcxfJoMUYgCeWvR_J2fX-d-HthS6O9FdZxuMkYXeLoGcoeW7Yr2HJcuUiSnzWGmlCLn1yPMZuRM8w=w640-h552" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/setup-mpi" target="_blank"><span style="font-size: xx-small;">source</span></a></td></tr></tbody></table><p>NOTE: NVIDIA Mellanox HPC-X is a comprehensive software package that includes MPI and SHMEM communications libraries. HPC-X uses 'hcoll' library for collective communication and 'hcoll' is enabled by default in HPC-X on Azure HPC VMs and can be controlled at runtime by using the parameter[-mca coll_hcoll_enable 1]</p><p><b>How to install UCX :</b></p><p>Unified Communication X (UCX) is a framework of communication APIs for HPC. 
It is optimized for MPI communication over InfiniBand and works with many MPI implementations such as OpenMPI and MPICH.</p><p style="text-align: left;"></p><ul style="text-align: left;"><li>wget https://github.com/openucx/ucx/releases/download/v1.4.0/ucx-1.4.0.tar.gz</li><li>tar -xvf ucx-1.4.0.tar.gz</li><li>cd ucx-1.4.0</li><li>./configure --prefix=<ucx-install-path> </li><li>make -j 8 && make install</li></ul><p></p><p><b>Optimizing MPI collectives and hierarchical communication algorithms (HCOLL):</b></p><p style="text-align: justify;">MPI Collective communication primitives offer a flexible, portable way to implement group communication operations. They are widely used across various scientific parallel applications and have a significant impact on the overall application performance. Refer configuration parameters to optimize collective communication performance using HPC-X and HCOLL library for collective communication.</p><p style="text-align: justify;">As an example, if you suspect your tightly coupled MPI application is doing an excessive amount of collective communication, you can try enabling hierarchical collectives (HCOLL). To enable those features, use the following parameters.</p><p><br /></p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="background-color: #cccccc;">-mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=<MLX device>:<Port></span></p></blockquote><p><b>HCOLL :</b></p><p>Scalable infrastructure: Designed and implemented with current and emerging “extreme-scale” systems in mind</p><p style="text-align: left;"></p><ul style="text-align: left;"><li>Scalable communicator creation, memory consumption, runtime interface</li><li>Asynchronous execution</li><li>Blocking and non-blocking collective routines</li><li>Easily integrated into other packages</li><li>Successfully integrated into OMPI – “hcoll” component in “coll” framework</li><li>Successfully integrated in Mellanox OSHMEM</li><li>Experimental integration in MPICH</li><li>Host level hierarchy awareness</li><li>Socket groups, UMA groups</li><li>Exposes Mellanox and InfiniBand specific capabilities</li></ul><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiG4zDYh53UVC4w5S058ZkG7TKRuVV5cwvVbKMVJMnSyySxB6J1oYcmmgYFup7ein5ZPUZ27h643IUNXtZe9dGwcW7C7uHA6Jyl2gpnB5y_VcF-Zo8K9x3MgyFKs2oXdnOxs8Az13SXG9mp2cv3y2Zz8Ai1x9rLWKuOvNWUtQRrGCZylNttC0HjmqE3yA=s728" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="728" height="422" src="https://blogger.googleusercontent.com/img/a/AVvXsEiG4zDYh53UVC4w5S058ZkG7TKRuVV5cwvVbKMVJMnSyySxB6J1oYcmmgYFup7ein5ZPUZ27h643IUNXtZe9dGwcW7C7uHA6Jyl2gpnB5y_VcF-Zo8K9x3MgyFKs2oXdnOxs8Az13SXG9mp2cv3y2Zz8Ai1x9rLWKuOvNWUtQRrGCZylNttC0HjmqE3yA=w640-h422" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/bureddy-mug-18.pdf" target="_blank"><span style="font-size: xx-small;">source</span></a></td></tr></tbody></table><p></p><p><b>How to build OpenMPI with HCOLL</b></p><p>Install UCX as described above and build with HCOLL as shown below </p><p><b><span style="font-size: x-small;">Steps:</span></b></p><p></p><ol style="text-align: left;"><li>./configure --with-lsf=/LSF_HOME/10.1/ 
--with-lsf-libdir=/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/ --disable-man-pages --enable-mca-no-build=btl-uct --enable-mpi1-compatibility --prefix $MY_HOME/openmpi-4.1.1/install --with-ucx=/ucx-install_dir CPPFLAGS=-I/ompi/opal/mca/hwloc/hwloc201/hwloc/include --cache-file=/dev/null --srcdir=. --disable-option-checking</li><li>make </li><li>make install</li></ol><p></p><p>---------------------------<b>Set Test Environment</b>------------------------------------------------</p><p></p><ol style="text-align: left;"><li> export PATH=$MY_HOME/openmpi-4.1.1/install/bin:$PATH</li><li> export LD_LIBRARY_PATH=$MY_HOME/openmpi- 4.1.1/install/lib:/opt/mellanox/hcoll/lib:/opt/mellanox/sharp/lib:$LD_LIBRARY_PATH</li><li> export OPAL_PREFIX=$MY_HOME/openmpi-4.1.1/install</li></ol><div>NOTE: It may be necessary to explicitly pass LD_LIBRARY_PATH as mentioned in (3)</div><p></p><p>-------------- <b>How to run mpi testcase without HCOLL</b>--------------------------------------</p><p>1) Use these --mca option to disable HCOLL</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #cccccc;">--mca coll_hcoll_enable 0 </span></p><p style="text-align: left;"><span style="background-color: #cccccc;">--mca coll_hcoll_priority 0 </span></p></blockquote><p>2) Add <span style="background-color: #cccccc;">--mca coll_base_verbose 10</span> to get more details </p><p>3) Add <span style="background-color: #cccccc;">-x LD_LIBRARY_PATH</span> to get the proper path as shown below</p><p><br /></p><p>-----------------------------<b>Execute Testcase</b> ----------------------------------</p><p><b>Testcase source: </b> https://github.com/jeffhammond/BigMPI/tree/master/test</p><p>$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs -mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allreduce_uniform_count</p><p style="text-align: left;"><span style="background-color: #cccccc;">--------------------------------------------------------------------------<br /><br />INT_MAX : 2147483647<br />UINT_MAX : 4294967295<br />SIZE_MAX : 18446744073709551615<br />----------------------:-----------------------------------------<br /> : Count x Datatype size = Total Bytes<br />TEST_UNIFORM_COUNT : 2147483647<br />V_SIZE_DOUBLE_COMPLEX : 2147483647 x 16 = 32.0 GB<br />V_SIZE_DOUBLE : 2147483647 x 8 = 16.0 GB<br />V_SIZE_FLOAT_COMPLEX : 2147483647 x 8 = 16.0 GB<br />V_SIZE_FLOAT : 2147483647 x 4 = 8.0 GB<br />V_SIZE_INT : 2147483647 x 4 = 8.0 GB<br />----------------------:-----------------------------------------<br />Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):<br />Rank 2: PASSED<br />Rank 3: PASSED<br />Rank 0: PASSED<br />Rank 1: PASSED<br />--------------------- Adjust count to fit in memory: 2147483647 x 50.0% = 1073741823<br />Root : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation<br />Peer : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation<br />Total : payload 34359738336 32.0 GB = 32.0 GB root + 32.0 GB x 0 local peers<br />---------------------<br />Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):<br />Rank 0: PASSED<br />Rank 2: PASSED<br />Rank 3: PASSED<br />Rank 1: PASSED<br />---------------------<br />Results from MPI_Iallreduce(int x 2147483647 = 8589934588 
or 8.0 GB):<br />Rank 2: PASSED<br />Rank 0: PASSED<br />Rank 3: PASSED<br />Rank 1: PASSED<br />--------------------- Adjust count to fit in memory: 2147483647 x 50.0% = 1073741823<br />Root : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation<br />Peer : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation<br />Total : payload 34359738336 32.0 GB = 32.0 GB root + 32.0 GB x 0 local peers<br />---------------------<br />Results from MPI_Iallreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):<br />Rank 2: PASSED<br />Rank 0: PASSED<br />Rank 3: PASSED<br />Rank 1: PASSED<br />[smpici@host01 BigCount]$</span></p><p>=====================Example for A data integrity issue (DI issue)====</p><p>There is end-to-end data integrity checks to detect data corruption. If any DI issue observed , it will be critical (high priority/ high severity defect)</p><p>DI issue with HCOLL ---let's see an example for DI issue.</p><p>$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs --mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98 test_allgatherv_uniform_count </p><p><br /></p><p><span style="background-color: #cccccc;">Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE<br />Rank 2: ERROR: DI in 805306368 of 2147483644 slots ( 37.5 % wrong)<br />Rank 0: ERROR: DI in 805306368 of 2147483644 slots ( 37.5 % wrong)<br />Rank 3: ERROR: DI in 805306368 of 2147483644 slots ( 37.5 % wrong)<br />Rank 1: ERROR: DI in 805306368 of 2147483644 slots ( 37.5 % wrong)</span></p><p><span style="background-color: #cccccc;"><br /></span></p><p>---------------Lets run the same testcase without HCOLL-------------------------------------------</p><p><br /></p><p>$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs --mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allgatherv_uniform_count </p><p style="text-align: left;"><span style="background-color: #cccccc;">Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE<br />Rank 0: PASSED<br />Rank 2: PASSED<br />Rank 3: PASSED<br />Rank 1: PASSED<br /><br />Results from MPI_Iallgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE<br />Rank 3: PASSED<br />Rank 2: PASSED<br />Rank 0: PASSED<br />Rank 1: PASSED</span></p><p style="text-align: left;"><span style="background-color: #cccccc;">=========================</span></p><p>How to enable and disable HCOLL to find Mellanox defect and use mca coll_base_verbose 10 to get more debug information </p><p><b>CASE 1:</b> Enable HCOLL and run bigcount with allreduce</p><p><span style="font-size: x-small;">[smpici@myhostn01 collective-big-count]$ /nfs_smpi_ci/sachin/ompi_4-0.x/openmpi-4.0.7/install/bin/mpirun --timeout 500 --np 4 --npernode 1 -host myhost01:1,myhost02:1,myhost03:1,myhost04:1 -mca coll_hcoll_np 0 -mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98 -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x BIGCOUNT_ENABLE_NONBLOCKING="0" -x HCOLL_RCACHE=^ucs /nfs_smpi_ci/sachin/bigmpi_ompi/ibm-tests/tests/big-mpi/BigCountUpstream/ompi-tests-public/collective-big-count/test_allreduce_uniform_count</span></p><p 
style="text-align: left;"><span style="font-size: x-small;">--------------------------------------------------------------------------<br /><br />Total Memory Avail. : 567 GB<br />Percent memory to use : 6 %<br />Tolerate diff. : 10 GB<br />Max memory to use : 34 GB<br />----------------------:-----------------------------------------<br />INT_MAX : 2147483647<br />UINT_MAX : 4294967295<br />SIZE_MAX : 18446744073709551615<br />----------------------:-----------------------------------------<br /> : Count x Datatype size = Total Bytes<br />TEST_UNIFORM_COUNT : 2147483647<br />V_SIZE_DOUBLE_COMPLEX : 2147483647 x 16 = 32.0 GB<br />V_SIZE_DOUBLE : 2147483647 x 8 = 16.0 GB<br />V_SIZE_FLOAT_COMPLEX : 2147483647 x 8 = 16.0 GB<br />V_SIZE_FLOAT : 2147483647 x 4 = 8.0 GB<br />V_SIZE_INT : 2147483647 x 4 = 8.0 GB<br />----------------------:-----------------------------------------<br />---------------------<br />Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):<br />Rank 2: PASSED<br />Rank 3: PASSED<br />Rank 0: PASSED<br />Rank 1: PASSED<br />--------------------- Adjust count to fit in memory: 2147483647 x 50.0% = 1073741823<br />Root : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation<br />Peer : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation<br />Total : payload 34359738336 32.0 GB = 32.0 GB root + 32.0 GB x 0 local peers<br />---------------------<br />Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):<br />--------------------------------------------------------------------------<br />The user-provided time limit for job execution has been reached:<br /><br /> Timeout: 500 seconds<br /><br />The job will now be aborted. Please check your code and/or<br />adjust/remove the job execution time limit (as specified by --timeout<br />command line option or MPIEXEC_TIMEOUT environment variable).<br />--------------------------------------------------------------------------<br />[smpici@myhostn01 collective-big-count]$</span></p><p><b>CASE 2: </b>Disable HCOLL and run bigcount with allreduce</p><p style="text-align: left;"><span style="font-size: x-small;">[user1@myhostn01 collective-big-count]$ /nfs_smpi_ci/sachin/ompi_4-0.x/openmpi-4.0.7/install/bin/mpirun --np 4 --npernode 1 -host myhost01:1,myhost02:1,myhost03:1,myhost04:1 --mca coll_base_verbose 10 -mca coll_hcoll_np 0 -mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x BIGCOUNT_ENABLE_NONBLOCKING="0" -x HCOLL_RCACHE=^ucs /nfs_smpi_ci/sachin/bigmpi_ompi/ibm-tests/tests/big-mpi/BigCountUpstream/ompi-tests-public/collective-big-count/test_allreduce_uniform_count<br /><br />[myhostn01:1984671] coll:base:comm_select: component disqualified: self (priority -1 < 0)<br />[myhostn01:1984671] coll:sm:comm_query (0/MPI_COMM_WORLD): intercomm, comm is too small, or not all peers local; disqualifying myself<br />[myhostn01:1984671] coll:base:comm_select: component not available: sm<br />[myhostn01:1984671] coll:base:comm_select: component disqualified: sm (priority -1 < 0)<br />[myhostn01:1984671] coll:base:comm_select: component not available: sync<br />[myhostn01:1984671] coll:base:comm_select: component disqualified: sync (priority -1 < 0)<br />[myhostn01:1984671] coll:base:comm_select: component available: tuned, priority: 30<br />[myhostn01:1984671] coll:base:comm_select: component not available: hcoll<br /><br 
/>----------------------:-----------------------------------------<br />Total Memory Avail. : 567 GB<br />Percent memory to use : 6 %<br />Tolerate diff. : 10 GB<br />Max memory to use : 34 GB<br />----------------------:-----------------------------------------<br />INT_MAX : 2147483647<br />UINT_MAX : 4294967295<br />SIZE_MAX : 18446744073709551615<br />----------------------:-----------------------------------------<br /> : Count x Datatype size = Total Bytes<br />TEST_UNIFORM_COUNT : 2147483647<br />V_SIZE_DOUBLE_COMPLEX : 2147483647 x 16 = 32.0 GB<br />V_SIZE_DOUBLE : 2147483647 x 8 = 16.0 GB<br />V_SIZE_FLOAT_COMPLEX : 2147483647 x 8 = 16.0 GB<br />V_SIZE_FLOAT : 2147483647 x 4 = 8.0 GB<br />V_SIZE_INT : 2147483647 x 4 = 8.0 GB<br />----------------------:-----------------------------------------<br />---------------------<br />Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):<br />Rank 2: PASSED<br />Rank 0: PASSED<br />Rank 3: PASSED<br />Rank 1: PASSED<br />--------------------- Adjust count to fit in memory: 2147483647 x 50.0% = 1073741823<br />Root : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation<br />Peer : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation<br />Total : payload 34359738336 32.0 GB = 32.0 GB root + 32.0 GB x 0 local peers<br />---------------------<br />Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):<br />Rank 0: PASSED<br />Rank 3: PASSED<br />Rank 2: PASSED<br />Rank 1: PASSED<br />[user1@myhostn01 collective-big-count]$</span></p><p style="text-align: left;"><span style="background-color: #cccccc;">=================================================================</span></p><p style="text-align: justify;">This post briefly shows features for optimal collective communication performance and highlights the general recommendations. The real application performance depends on your application characteristics, runtime configuration, transport protocols, processes per node (ppn) configuration... etc.</p><p style="text-align: justify;"><br /></p><p style="text-align: left;"><span style="font-size: x-small;">Reference:</span><br /><span style="font-size: x-small;">http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/bureddy-mug-18.pdf</span><br /><span style="font-size: x-small;">https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/setup-mpi</span></p><p><br /></p><p><br /></p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-3490728427850127092022-01-28T20:30:00.004+05:302022-02-06T20:14:39.127+05:30HPC Clusters in a Multi-Cloud Environment<p style="text-align: justify;">High performance computing (HPC) is the ability to process data and perform complex calculations at high speeds. One of the best-known types of HPC solutions is the supercomputer. A supercomputer contains thousands of compute nodes that work together to complete one or more tasks. This is called parallel processing. HPC solutions have three main components: Compute , Network and Storage. To build a high performance computing architecture, compute servers are networked together into a cluster. Software programs and algorithms are run simultaneously on the servers in the cluster. The cluster is networked to the data storage to capture the output. 
Together, these components operate seamlessly to complete a diverse set of tasks.</p><p style="text-align: justify;">To operate at maximum performance, each component must keep pace with the others. For example, the storage component must be able to feed and ingest data to and from the compute servers as quickly as it is processed. Likewise, the networking components must be able to support the high-speed transportation of data between compute servers and the data storage. If one component cannot keep up with the rest, the performance of the entire HPC infrastructure suffers.</p><p style="text-align: justify;">Containers give HPC the portability that Hybrid Cloud demands .Containers are ready-to-execute packages of software. Container technology provides hardware abstraction, wherein the container is not tightly coupled with the server. Abstraction between the hardware and software stacks provides ease of access, ease of use, and the agility that bare metal environments lack.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-CYAtwgrjmXQ/YTyB23hrx0I/AAAAAAAAXDc/xBXbvQ4Qm0MIJTX3_4xOa4gQn13lbGEPgCLcBGAsYHQ/s819/HPC_cloud.PNG" style="margin-left: auto; margin-right: auto;"><img alt="Source" border="0" data-original-height="610" data-original-width="819" height="476" src="https://1.bp.blogspot.com/-CYAtwgrjmXQ/YTyB23hrx0I/AAAAAAAAXDc/xBXbvQ4Qm0MIJTX3_4xOa4gQn13lbGEPgCLcBGAsYHQ/w640-h476/HPC_cloud.PNG" title="Source" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-size: xx-small;"><a href="https://www.theubercloud.com/cloud-hpc#cloud_hpc_technology" target="_blank">Source</a></span></td></tr></tbody></table><p style="text-align: justify;">Software containers and Kubernetes are important tools for building, deploying, running and managing modern enterprise applications at scale and delivering enterprise software faster and more reliably to the end user while using resources more efficiently and reducing costs. Recently, high performance computing (HPC) is moving closer to the enterprise and can therefore benefit from an HPC container and Kubernetes ecosystem, with new requirements to quickly allocate and deallocate computational resources to HPC workloads so that planning of compute capacity no longer required in advance. The HPC community is picking up the concept and applying it to batch jobs and interactive applications.</p><p style="text-align: justify;">In a multi-cloud environment, an enterprise utilizes multiple public cloud services, most often from different cloud providers. For example, an organization might host its web front-end application on AWS and host its Exchange servers on Microsoft Azure. Since all cloud providers are not created equal, organizations adopt a multi-cloud strategy to deliver best of breed IT services, to prevent lock-in to a single cloud provider, or to take advantages of cloud arbitrage and choose providers for specific services based on which provider is offering the lowest price at that time. Although it is similar to a hybrid cloud, multi-cloud specifically indicates more than one public cloud provider service and need not include a private cloud component at all. 
Enterprises adopt a multi-cloud strategy so as not to ‘keep all their eggs in a single basket’, for geographic or regulatory governance demands, for business continuity, or to take advantage of features specific to a particular provider.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-dPuIK63ycY8/YTIZtb-zcAI/AAAAAAAAXCw/-93QUpbsiDUbLPcW5Za_UmnMTmb5SzZggCLcBGAsYHQ/s1311/Hybrid-multi_cloud.PNG" style="margin-left: auto; margin-right: auto;"><img alt="source" border="0" data-original-height="147" data-original-width="1311" height="72" src="https://1.bp.blogspot.com/-dPuIK63ycY8/YTIZtb-zcAI/AAAAAAAAXCw/-93QUpbsiDUbLPcW5Za_UmnMTmb5SzZggCLcBGAsYHQ/w640-h72/Hybrid-multi_cloud.PNG" title="Source" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.vmware.com/topics/glossary/content/hybrid-cloud-vs-multi-cloud" target="_blank"><span style="font-size: xx-small;">source</span></a></td></tr></tbody></table><br /><p style="text-align: left;"><b>Multi-cloud </b>is the use of multiple cloud computing and storage services in a single network architecture. <span style="text-align: justify;">This refers to the distribution of cloud assets, software, applications, and more across several cloud environments. With a typical multi-cloud architecture utilizing two or more public clouds as well as private clouds, a multi-cloud environment aims to eliminate the reliance on any single cloud provider or instance.</span></p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgPB8hPluUffiNH6SbkN1CIiQcDuOml_eIvPFG3a8BbwroQkilxB5CMWv77mZdY6NxyiD1N7yltUO54Vh_wtFTbHbaPRlvJdfY2TI4LkWmK1RnGcxsmfTDVZTITLgn03qGpSHp3Pg34EObu73cw3CCtGgjjsjhashrdi8zmgmm_ZY5rd0lafQSaMg9qWQ=s955" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="689" data-original-width="955" height="231" src="https://blogger.googleusercontent.com/img/a/AVvXsEgPB8hPluUffiNH6SbkN1CIiQcDuOml_eIvPFG3a8BbwroQkilxB5CMWv77mZdY6NxyiD1N7yltUO54Vh_wtFTbHbaPRlvJdfY2TI4LkWmK1RnGcxsmfTDVZTITLgn03qGpSHp3Pg34EObu73cw3CCtGgjjsjhashrdi8zmgmm_ZY5rd0lafQSaMg9qWQ=s320" width="320" /></a></div><p></p><p style="text-align: justify;">Multi-cloud is the use of two or more cloud computing services from any number of different cloud vendors. A multi-cloud environment could be all-private, all-public or a combination of both. Companies use multi-cloud environments to distribute computing resources and minimize the risk of downtime and data loss. They can also increase the computing power and storage available to a business. Innovations in the cloud in recent years have resulted in a move from single-user private clouds to multi-tenant public clouds and hybrid clouds — a heterogeneous environment that leverages different infrastructure environments like the private and public cloud.</p><p style="text-align: justify;">A multi-cloud platform combines the best services that each platform offers. This allows companies to customize an infrastructure that is specific to their business goals. A multi-cloud architecture also provides lower risk. If one web service host fails, a business can continue to operate with other platforms in a multi-cloud environment versus storing all data in one place. 
Examples of public Cloud Providers: </p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><ul style="text-align: left;"><li><span style="text-align: justify;">IBM Cloud [<a href="https://www.ibm.com/cloud">https://www.ibm.com/cloud</a>]</span></li><li><span style="text-align: justify;">AWS [<a href="https://aws.amazon.com/">https://aws.amazon.com/</a>]</span></li><li><span style="text-align: justify;">Microsoft Azure [<a href="https://azure.microsoft.com/en-us/free/">https://azure.microsoft.com/en-us/free/</a>]</span></li><li><span style="text-align: justify;">Google Cloud Platform [<a href="https://cloud.google.com/">https://cloud.google.com/</a>]</span></li><li><span style="text-align: justify;">Openstack (private cloud)</span></li><li><span style="text-align: justify;">Rackspace</span></li><li>VMware Cloud </li></ul></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote></blockquote></blockquote><p style="text-align: justify;"><b>Hybrid-cloud </b>A hybrid cloud architecture is mix of on-premises, private, and public cloud services with orchestration between the cloud platforms. Hybrid cloud management involves unique entities that are managed as one across all environments. Hybrid cloud architecture allows an enterprise to move data and applications between private and public environments based on business and compliance requirements. For example, customer data can live in a private environment. But heavy processing can be sent to the public cloud without ever having customer data leave the private environment. 
Hybrid cloud computing allows instant transfer of information between environments, allowing enterprises to experience the benefits of both environments.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjiEIoMvyQoojaOH4lHFonZYFdrqxiO6Y9KTrRoyj9RdDiMsnDCEuVALIBqZVj3ns4mlRq7d7lLECItvij4VYCq0OfH-jxmdIUZIqxBVmaLAy_WQFwJRF8BcZE6YdYLHz6qhOErgxWidrTAetcK9Ih0F547__z4Cn4jAuJOmP3Z9XRIGGQEFCklabTqBQ=s934" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="597" data-original-width="934" height="256" src="https://blogger.googleusercontent.com/img/a/AVvXsEjiEIoMvyQoojaOH4lHFonZYFdrqxiO6Y9KTrRoyj9RdDiMsnDCEuVALIBqZVj3ns4mlRq7d7lLECItvij4VYCq0OfH-jxmdIUZIqxBVmaLAy_WQFwJRF8BcZE6YdYLHz6qhOErgxWidrTAetcK9Ih0F547__z4Cn4jAuJOmP3Z9XRIGGQEFCklabTqBQ=w400-h256" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><span style="text-align: justify;"><br /></span></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div class="separator" style="clear: both; text-align: left;"><span style="text-align: justify;">Hybrid cloud architecture works well for the following industries:</span></div></blockquote><p style="text-align: justify;">• Finance: Financial firms are able to significantly reduce their space requirements in a hybrid cloud architecture when trade orders are placed on a private cloud and trade analytics live on a public cloud.</p><p style="text-align: justify;">• Healthcare: When hospitals send patient data to insurance providers, hybrid cloud computing ensures HIPAA compliance.</p><p style="text-align: justify;">• Legal: Hybrid cloud security allows encrypted data to live off-site in a public cloud while connected o a law firm’s private cloud. This protects original documents from threat of theft or loss by natural disaster.</p><p style="text-align: justify;">• Retail: Hybrid cloud computing helps companies process resource-intensive sales data and analytics.</p><p style="text-align: justify;">The hybrid cloud strategy could be applied to move workloads dynamically to the most appropriate IT environment based on cost, performance and security. Utilize on-premises resources for existing workloads, and use public or hosted clouds for new workloads. 
Run internal business systems and data on premises while customer-facing systems run on infrastructure as a service (iaaS), public or hosted clouds.</p><p><span style="font-size: x-small;"><b>Reference:</b></span></p><p style="text-align: left;"><span style="font-size: small;">https://www.hpcwire.com/2019/09/19/kubernetes-containers-and-hpc</span><br /><span style="font-size: small;">https://www.hpcwire.com/2020/03/19/kubernetes-and-hpc-applications-in-hybrid-cloud-environments-part-ii</span><br /><span style="font-size: small;">https://www.hpcwire.com/2021/09/02/kubernetes-based-hpc-clusters-in-azure-and-google-cloud-multi-cloud-environment</span><br /><span style="font-size: small;">https://www-stage.avinetworks.com/</span><br /><span style="font-size: small;">https://www.vmware.com/topics/glossary/content/hybrid-cloud-vs-multi-cloud</span></p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-14531221662273487232021-08-22T12:57:00.042+05:302022-10-16T16:58:48.821+05:30Spectrum LSF 10.1 Installation and Applying Patch | FP | interim FIX on Linux Platform<p style="text-align: justify;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-DMnOo6l0PiY/YSIAsoFAJEI/AAAAAAAAW-0/FrKy6KJa324faRlCCpkWEKOCPd8thkGwACLcBGAsYHQ/s483/logo.PNG" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="131" data-original-width="483" height="54" src="https://1.bp.blogspot.com/-DMnOo6l0PiY/YSIAsoFAJEI/AAAAAAAAW-0/FrKy6KJa324faRlCCpkWEKOCPd8thkGwACLcBGAsYHQ/w200-h54/logo.PNG" width="200" /></a></div><div style="text-align: justify;">IBM Spectrum LSF (LSF, originally Platform Load Sharing Facility) is a workload management platform, job scheduler, for distributed high performance computing (HPC) by IBM. In January, 2012, Platform Computing was acquired by IBM. The product is now called IBM® Spectrum LSF.</div><p></p><p style="text-align: justify;">IBM® Spectrum LSF is a complete workload management solution for demanding HPC environments that takes your job requirements, finds the best resources to run the job, and monitors its progress. Jobs always run according to host load and site policies.<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-5Fb6zokt8ag/YSH-OSapFQI/AAAAAAAAW-o/RJQY4_OGjhEb7v0nWsd3zS-Xk2Zm-AK2gCLcBGAsYHQ/s720/LSF-setup.PNG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="578" data-original-width="720" height="321" src="https://1.bp.blogspot.com/-5Fb6zokt8ag/YSH-OSapFQI/AAAAAAAAW-o/RJQY4_OGjhEb7v0nWsd3zS-Xk2Zm-AK2gCLcBGAsYHQ/w400-h321/LSF-setup.PNG" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-size: xx-small;"><a href="https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=overview-lsf-introduction" target="_blank">LSF cluster (source)</a></span></td></tr></tbody></table><p></p><p></p><ul style="text-align: left;"><li style="text-align: justify;">Cluster is a group of computers (hosts) running LSF that work together as a single unit, combining computing power, workload, and resources. A cluster provides a single-system image for a network of computing resources. Hosts can be grouped into a cluster in a number of ways. 
A cluster can contain 1) All the hosts in a single administrative group 2) All the hosts on a subnetwork.</li><li style="text-align: justify;">Job is a unit of work that is running in the LSF system or job is a command or set of commands submitted to LSF for execution. LSF schedules, controls, and tracks the job according to configured policies.</li><li style="text-align: justify;">Queue is a cluster-wide container for jobs. All jobs wait in queues until they are scheduled and dispatched to hosts.</li><li>Resources are the objects in your cluster that are available to run work. </li></ul><p></p><p><b>Spectrum LSF 10.1 base Installation and applying FP /PTF/FIX</b></p><p>Plan your installation and install a new production IBM Spectrum LSF cluster on UNIX or Linux hosts. The following diagram illustrates an example directory structure after the LSF installation is complete.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-xZNuEazMA7w/YSHGbKlKO3I/AAAAAAAAW-g/pGfx5QFTdA8GhRfBOH8pZ5ixxIX2XlutgCLcBGAsYHQ/s762/LSF_installation_dir_structure.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="762" data-original-width="719" height="640" src="https://1.bp.blogspot.com/-xZNuEazMA7w/YSHGbKlKO3I/AAAAAAAAW-g/pGfx5QFTdA8GhRfBOH8pZ5ixxIX2XlutgCLcBGAsYHQ/w604-h640/LSF_installation_dir_structure.jpg" title="LSF Installation directory Structure" width="604" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=linux-example-installation-directory-structure" target="_blank"><span style="font-size: x-small;">Source</span></a></td></tr></tbody></table><br /><p>Plan your installation to determine the required parameters for the install.config file.</p><p><span style="background-color: #eeeeee;">a ) lsf10.1_lsfinstall.tar.Z</span></p><p>The standard installer package. Use this package in a heterogeneous cluster with a mix of systems other than x86-64. Requires approximately 1 GB free space.</p><p><span style="background-color: #eeeeee;">b) lsf10.1_lsfinstall_linux_x86_64.tar.Z </span></p><p><span style="background-color: #eeeeee;"> lsf10.1_lsfinstall_linux_ppc64le.tar.Z</span></p><p>Use this smaller installer package in a homogeneous x86-64 or ppc cluster accordingly . </p><p>------------------------</p><p>Get the LSF distribution packages for all host types you need and put them in the same directory as the extracted LSF installer script. Copy that package to <span style="background-color: #eeeeee;">LSF_TARDIR</span> path mentioned in Step 3.</p><p>For example:</p><p>Linux 2.6 kernel glibc version 2.3, the distribution package is <b><span style="font-size: x-small;">lsf10.1_linux2.6-glibc2.3-x86_64.tar.Z</span></b>.</p><p>Linux kernel glibc version 3.x, the distribution package is<span style="font-size: x-small;"> <b>lsf10.1_lnx310-lib217-ppc64le.tar.Z</b></span></p><p>------------------------</p><p>LSF uses entitlement files to determine which feature set is enabled or disabled based on the edition of the product. 
Copy entitlement configuration file to <span style="background-color: #eeeeee;">LSF_ENTITLEMENT_FILE </span><span style="background-color: white;">path mentioned in step 3.</span></p><p>The following LSF entitlement configuration files are available for each edition:</p><p><b>LSF Standard Edition ===> lsf_std_entitlement.dat</b></p><p>LSF Express Edition ===> lsf_exp_entitlement.dat</p><p>LSF Advanced Edition ==> lsf_adv_entitlement.dat</p><p>-------------------------</p><p><b>Step 1</b> : Get the LSF installer script package that you selected and extract it.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="background-color: #f3f3f3;"><b># zcat lsf10.1_lsfinstall_linux_x86_64.tar.Z | tar xvf -</b></span></p></blockquote><p><b>Step 2 </b>: Go to extracted directory :</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #f3f3f3;"><b>cd lsf10.1_lsfinstall</b></span></p></blockquote><p><b>Step 3 </b>: Configure install.config as per the plan</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;"><span style="font-size: small;"><b> cat install.config</b></span><br /><span style="font-size: small;"> LSF_TOP="/nfs_shared_dir/LSF_HOME"</span><br /><span style="font-size: small;"> LSF_ADMINS="lsfadmin"</span><br /><span style="font-size: small;"> LSF_CLUSTER_NAME="x86-64_cluster2"</span><br /><span style="font-size: small;"> LSF_MASTER_LIST="myhost1"</span><br /><span style="font-size: small;"> LSF_TARDIR="/nfs_shared_dir/conf_lsf/lsf_distrib/"</span><br /><span style="font-size: small;"> LSF_ENTITLEMENT_FILE="/nfs_shared_dir/conf_lsf/lsf_std_entitlement.dat"</span><br /><span style="font-size: small;"> LSF_ADD_SERVERS="myhost1 myhost2 myhost3 myhost4 myhost5 myhost6 myhost7 myhost8"</span></span><br /><span style="color: #f3f3f3; font-size: small;"> ENABLE_DYNAMIC_HOSTS="Y"</span></p></blockquote><p><b>Step 4</b>: Start LSF 10.1 base installation </p><p> <b>./lsfinstall -f install.config</b></p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee; font-size: x-small;">Logging installation sequence in /root/LSF_new/lsf10.1_lsfinstall/Install.log</span><br /><span style="background-color: #eeeeee; font-size: x-small;">International Program License Agreement</span><br /><span style="background-color: #eeeeee; font-size: x-small;">Part 1 - General TermsBY DOWNLOADING, INSTALLING, COPYING, ACCESSING, CLICKING </span><br /><span style="background-color: #eeeeee; font-size: x-small;"> "ACCEPT" BUTTON, OR OTHERWISE USING THE PROGRAM,</span><br /><span style="background-color: #eeeeee; font-size: x-small;">LICENSEE AGREES TO THE TERMS OF THIS AGREEMENT. IF YOU ARE</span><br /><span style="background-color: #eeeeee; font-size: x-small;">ACCEPTING THESE TERMS ON BEHALF OF LICENSEE, YOU REPRESENT</span><br /><span style="background-color: #eeeeee; font-size: x-small;">AND WARRANT THAT YOU HAVE FULL AUTHORITY TO BIND LICENSEE</span><br /><span style="background-color: #eeeeee; font-size: x-small;">TO THESE TERMS. 
IF YOU DO NOT AGREE TO THESE TERMS</span><br /><span style="background-color: #eeeeee; font-size: x-small;">* DO NOT DOWNLOAD, INSTALL, COPY, ACCESS, CLICK ON AN</span><br /><span style="background-color: #eeeeee; font-size: x-small;">"ACCEPT" BUTTON, OR USE THE PROGRAM; AND</span><br /><span style="background-color: #eeeeee; font-size: x-small;">* PROMPTLY RETURN THE UNUSED MEDIA, DOCUMENTATION, AND</span><br /><span style="background-color: #eeeeee; font-size: x-small;">Press Enter to continue viewing the license agreement, or</span><br /><span style="background-color: #eeeeee; font-size: x-small;">enter "1" to accept the agreement, "2" to decline it, "3"</span><br /><span style="background-color: #eeeeee; font-size: x-small;">to print it, "4" to read non-IBM terms, or "99" to go back</span><br /><span style="background-color: #eeeeee; font-size: x-small;">to the previous screen.</span><br /><span style="background-color: #eeeeee; font-size: x-small;">1</span><br /><span style="background-color: #eeeeee; font-size: x-small;">Checking the LSF TOP directory /<span style="font-size: medium;">nfs_shared_dir</span>/LSF_HOME ...</span><br /><span style="background-color: #eeeeee; font-size: x-small;">... Done checking the LSF TOP directory /<span style="font-size: medium;">nfs_shared_dir</span>i/LSF_HOME ...</span><br /><span style="background-color: #eeeeee; font-size: x-small;">You are installing IBM Spectrum LSF - 10.1 Standard Edition</span><br /><span style="background-color: #eeeeee; font-size: x-small;">Searching LSF 10.1 distribution tar files in /<span style="font-size: medium;">nfs_shared_dir</span>/conf_lsf/lsf_distrib Please wait ...</span><br /><span style="background-color: #eeeeee; font-size: x-small;"> 1) linux3.10-glibc2.17-x86_64</span><br /><span style="background-color: #eeeeee; font-size: x-small;">Press 1 or Enter to install this host type: 1</span><br /><span style="background-color: #eeeeee; font-size: x-small;">Installing linux3.10-glibc2.17-x86_64 ...</span><br /><span style="background-color: #eeeeee; font-size: x-small;">Please wait, extracting lsf10.1_lnx310-lib217-x86_64 may take up to a few minutes ...</span><br /><span style="background-color: #eeeeee; font-size: x-small;">lsfinstall is done.</span><br /><span style="background-color: #eeeeee; font-size: x-small;">After installation, remember to bring your cluster up to date </span><span style="background-color: #eeeeee; font-size: x-small;">by applying the latest updates and bug fixes.</span></p></blockquote><p>NOTE: You can do LSF installation as non-root user. That will be similar but with one extra prompt for multi-node cluster(yes/no)</p><div><div><b>Step 5 </b>: This step required only if installation was done by root .</div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><p style="text-align: left;"><b> </b><span style="background-color: #eeeeee;">chown -R lsfadmin:lsfadmin $LSF_TOP</span></p></div></blockquote><div><p><b>Step 6 </b>: check the binary files </p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><p style="text-align: left;"><span style="background-color: #eeeeee;">cd LSF_TOP/10.1/linux3.10-glibc2.17-x86_64/bin</span></p></div></blockquote><div><p><b>Step 7 </b>: By default, only root can start the LSF daemons. Any user can not submit jobs to your cluster. 
To make the cluster available to other users, you must manually change the ownership and setuid bit for the lsadmin and badmin binary files to root, and the file permission mode to -rwsr-xr-x (4755) so that the user ID bit for the owner is setuid.</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"></p></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><span style="background-color: #eeeeee;"><span> chown root lsadmin</span><br /><span> chown root badmin</span><br /><span> chmod 4755 lsadmin</span><br /><span> chmod 4755 badmin</span><br /><span> ls -alsrt lsadmin</span><br /><span> ls -alsrt badmin</span></span></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p><span><span face="Slack-Lato, appleLogo, sans-serif" style="background-color: #f8f8f8; color: #1d1c1d; font-variant-ligatures: common-ligatures;">chown root </span><span face="Slack-Lato, appleLogo, sans-serif" style="background-color: #f8f8f8; color: #1d1c1d; font-variant-ligatures: common-ligatures;">$LSF_SERVERDIR/eauth</span><span face="Slack-Lato, appleLogo, sans-serif" style="background-color: #f8f8f8; color: #1d1c1d; font-variant-ligatures: common-ligatures;"> </span></span></p><p><span><span face="Slack-Lato, appleLogo, sans-serif" style="background-color: #f8f8f8; color: #1d1c1d; font-variant-ligatures: common-ligatures;">chmod u+s $LSF_SERVERDIR/eauth</span> </span></p></blockquote><p>OR </p><p> <span style="background-color: black;"><span style="color: #eeeeee;"> <b>./hostsetup --top="LSF_HOME" --setuid</b></span> </span></p><div><p><b>Step 8 </b>: C<span face="-apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji"" style="background-color: white; color: #24292e; font-size: 14px;">onfigure /etc/lsf.sudoers </span></p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p><span style="background-color: #eeeeee;">[root@myhost1]# cat /etc/lsf.sudoers<br />LSF_STARTUP_USERS="lsfadmin"<br />LSF_STARTUP_PATH="/nfs_shared_dir/</span><span style="background-color: #eeeeee;">LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/etc"<br />LSF_EAUTH_KEY="testKey1"</span></p></blockquote></blockquote><p style="text-align: justify;">NOTE: This lsf.sudoers file is not installed by default. This file is located in /etc. lsf.sudoers file is used to set the parameter LSF_EAUTH_KEY to configure a key for eauth to encrypt and decrypt user authentication data. All the nodes/hosts should have this file . Customers need to configure LSF_EAUTH_KEY in /etc/lsf.sudoers on each side of multi-cluster. 
</p></div><div><p><b>Step 9 </b>: check <span face="Slack-Lato, appleLogo, sans-serif" style="background-color: #f8f8f8; color: #1d1c1d; font-size: 15px; font-variant-ligatures: common-ligatures;">$LSF_SERVERDIR/eauth and copy </span>lsf.sudoers to all hosts in the cluster</p><p> ls $LSFTOP/10.1/linux3.10-glibc2.17-x86_64/etc/</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;"><br /><span style="font-size: small;">scp /etc/lsf.sudoers myhost</span><span style="font-size: small;">02:/etc/lsf.sudoers</span><br /><span style="font-size: small;">scp /etc/lsf.sudoers myhost</span><span style="font-size: small;">03:/etc/lsf.sudoers</span><br /><span style="font-size: small;">scp /etc/lsf.sudoers myhost</span><span style="font-size: small;">04:/etc/lsf.sudoers</span><br /><span style="font-size: small;">scp /etc/lsf.sudoers myhost</span><span style="font-size: small;">05:/etc/lsf.sudoers</span><br /><span style="font-size: small;">scp /etc/lsf.sudoers myhost</span><span style="font-size: small;">06:/etc/lsf.sudoers</span><br /><span style="font-size: small;">scp /etc/lsf.sudoers myhost</span><span style="font-size: small;">07:/etc/lsf.sudoers</span><br /><span style="font-size: small;">scp /etc/lsf.sudoers myhost</span><span style="font-size: small;">08:/etc/lsf.sudoers</span></span></p></blockquote><div><p><b>Step 10 </b>: Start LSF as lsfadmin and check base Installation using lsid command.</p><p><b>Step 11 </b>: Check binary type with lsid -V</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><b><span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-size: 12px; font-variant-ligatures: none; white-space: pre-wrap;">$ lsid -V<br />IBM Spectrum LSF 10.1.0.0 build 403338, May 27 2016<br />Copyright International Business Machines Corp. 1992, 2016.<br />US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.</span><span style="font-size: 12px; font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>binary type: linux3.10-glibc2.17-x86_64<br /></b></pre></blockquote><p>NOTE: Download required FP and interim fixes from https://www.ibm.com/support/fixcentral/ </p><div><p><b>Step 12 : </b>Before applying PTF12 and interim patches , bring down the LSF daemons. 
Use the following commands to shut down the original LSF daemons:</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;"> badmin hshutdown all</span><br /><span style="background-color: #eeeeee;"> lsadmin resshutdown all</span><br /><span style="background-color: #eeeeee;"> lsadmin limshutdown all</span></p></blockquote></blockquote><p>Deactivate all queues to make sure that no new jobs can be dispatched during the upgrade:</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="background-color: #eeeeee;">badmin qinact all</span> </p></blockquote></blockquote><div><p><b>Step 13</b>: Then, become root to apply FP12 and the interim patches. </p><p>Set the LSF environment: . LSF_TOP/conf/profile.lsf</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><p style="text-align: left;"><span style="background-color: #eeeeee;">. /nfs_shared_dir/LSF_HOME/conf/profile.lsf</span></p></div></blockquote><div><p><b>Step 14</b>: Apply FP12 on the LSF base installation. The patchinstall script is available in the $LSF_TOP/10.1/install directory.</p><p> <span style="background-color: #eeeeee;"># cd $LSF_TOP/10.1/install</span></p><p>It is recommended to run a check on the patch before installing it:</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><p style="text-align: left;"><span style="background-color: #eeeeee;">$ patchinstall -c</span></p></div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><p style="text-align: left;"><span style="background-color: #eeeeee;"> ./patchinstall /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z</span></p></div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: left;"><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><span style="font-size: xx-small;">[root@myhost7 install]# ./patchinstall /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z<br />Logging patch installation sequence in /nfs_shared_dir/LSF_HOME/10.1/install/patch.log<br />Checking the LSF installation directory /nfs_shared_dir/LSF_HOME ...<br />Done checking the LSF installation directory /nfs_shared_dir/LSF_HOME.<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Checking the patch history directory ...<br />Done checking the patch history directory /nfs_shared_dir/LSF_HOME/patch.<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Checking the backup directory ...<br />Done checking the backup 
directory /nfs_shared_dir/LSF_HOME/patch/backup.<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Installing package "/root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z"...<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Checking the package definition for /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z ...<br />Done checking the package definition for /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z.<br />.<br />.<br />Finished backing up files to "/nfs_shared_dir/LSF_HOME/patch/backup/LSF_linux3.10-glibc2.17-x86_64_600488".<br />Done installing /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z.</span></pre></div></blockquote><div><p><b>Step 15</b>: Apply interim fix1</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><p style="text-align: left;"><span style="background-color: #eeeeee;">./patchinstall /root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z</span></p></div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><span style="font-size: xx-small;">Logging patch installation sequence in /nfs_shared_dir/LSF_HOME/10.1/install/patch.log <br />Installing package "/root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z"...<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Checking the package definition for /root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z ...<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Are you sure you want to update your cluster with this patch? 
(y/n) [y] y<br />Y<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Backing up existing files ...<br />Finished backing up files to "/nfs_shared_dir/LSF_HOME/patch/backup/LSF_linux3.10-glibc2.17-x86_64_600505".<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Done installing /root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z.<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Exiting..</span>.</pre></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote><b>Step 16</b>: Apply interim fix2<div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><p style="text-align: left;"><span style="background-color: #eeeeee;"> ./patchinstall /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z</span></p></div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><span style="font-size: xx-small;">[root@myhost7 install]# ./patchinstall /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Installing package "/root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z"...<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Checking the package definition for /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z ...<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Backing up existing files ...<br />Finished backing up files to "/nfs_shared_dir/LSF_HOME/patch/backup/LSF_linux3.10-glibc2.17-x86_64_600625".<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Done installing /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z.<span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-variant-ligatures: none; white-space: pre-wrap;"><br /></span></span>Exiting...</span> </pre></blockquote><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; 
margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><b>Step 17</b>: As a root user , Setbit for new command bctrld</pre><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="background-color: #eeeeee;"> cd LSF_TOP/10.1/linux3.10-glibc2.17-x86_64/bin<br /> chown root bctrld<br /> chmod 4755 bctrld</span> </p></blockquote></blockquote><div><p><b>Step 18 </b>: Check lsf.shared file for multi cluster setup.</p></div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><span style="background-color: #eeeeee;">Begin Cluster</span></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><span style="background-color: #eeeeee;">ClusterName Servers</span></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><span style="background-color: #eeeeee;">CLUSTER1 (cloudhost)</span></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><span style="background-color: #eeeeee;">CLUSTER2 (myhost1)</span></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><span style="background-color: #eeeeee;">CLUSTER3 (remotehost2)</span></blockquote><p style="text-align: left;"> <span style="background-color: #eeeeee;">End Cluster</span></p></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"></blockquote></blockquote><div><div><p><b>Step 19 </b>: Switch back to user lsfadmin. 
Use the following commands to start LSF using the newer daemons.</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;"> <span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, "Courier New", monospace; font-variant-ligatures: none; white-space: pre-wrap;">lsadmin limstartup all</span><br /><span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, "Courier New", monospace; font-variant-ligatures: none; white-space: pre-wrap;"> lsadmin resstartup all</span><br /><span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, "Courier New", monospace; font-variant-ligatures: none; white-space: pre-wrap;"> badmin hstartup all</span></span></p></blockquote></blockquote><div><p>Use the following command to reactivate all LSF queues after upgrading: badmin qact all</p><p><b>Step 20</b> : Modify Conf files as per requirement add queues, clusters...etc . Then run badmin reconfig or lsadmin reconfig as explained in LSF configuration section below. Restart LSF as "lsfadmin" user .</p><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; word-break: normal;"><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; color: #1d1c1d; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; font-family: Monaco, Menlo, Consolas, "Courier New", monospace; font-size: 12px; font-variant-ligatures: none; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; white-space: pre-wrap; word-break: normal;"><b>$ lsid
IBM Spectrum LSF Standard 10.1.0.12, Jun 10 2021
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
My cluster name is CLUSTER2
My master name is myhost1</b>
</pre></pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; word-break: normal;"><span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-size: 12px; font-variant-ligatures: none; white-space: pre-wrap;"><b>$ lsclusters -w
CLUSTER_NAME STATUS MASTER_HOST ADMIN HOSTS SERVERS
CLUSTER1 ok cloudhost lsfadmin 7 7
CLUSTER2 ok myhost1 lsfadmin 8 8
CLUSTER3 ok remotehost2 lsfadmin 8 8</b>
</span></span></pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><span style="color: #1d1c1d;"><span><b>$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV<br />myhost1 ok - 20 0 0 0 0 0
myhost2 ok - 20 0 0 0 0 0
myhost3 ok - 19 0 0 0 0 0<br /><span style="font-family: Monaco, Menlo, Consolas, Courier New, monospace;"></span>myhost4 ok - 44 4 4 0 0 0<br />myhost5 ok - 44 4 4 0 0 0<br />myhost6 ok - 20 0 0 0 0 0
myhost7 ok - 20 0 0 0 0 0
myhost8 ok - 19 0 0 0 0 0</b></span></span></pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><span style="color: #1d1c1d;"><span>Spectrum LSF Cluster Installation and FP12 upgradation completed successfully as per the details copied above.</span></span></pre><p>You must run hostsetup as root to use --boot="y" option to modify the system scripts to automatically start and stop LSF daemons at system startup or shutdown. . The default is --boot="n".</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><p><span style="background-color: #eeeeee;">1. Log on to each LSF server host as root. Start with the LSF master host.</span></p></div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><p><span style="background-color: #eeeeee;">2. Run hostsetup on each LSF server host. For example:</span></p></div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><p><span style="background-color: #eeeeee;"># cd $LSF_TOP/10.1/install</span></p></div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><p><span style="background-color: #eeeeee;"># ./hostsetup --top="$LSF_TOP" --boot="y"</span></p></div></blockquote></blockquote><div><p>NOTE: For more details on hostsetup usage, enter hostsetup -h.</p><p>In case of multi-cluster environment, reinstalling master cluster would show status=disk after issuing bclusters command. </p><p><br /></p></div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><span style="background-color: #eeeeee; font-size: x-small;">[smpici@c656f7n06 ~]$ bclusters<br />[Job Forwarding Information ]<br /><br />LOCAL_QUEUE JOB_FLOW REMOTE CLUSTER STATUS<br /> Queue1 send CLUSTER1 disc<br /> Queue2 send CLUSTER2 disc<br /> Queue3 send CLUSTER3 disc</span></blockquote></blockquote></blockquote><div><p>where status=disc means communication between the two clusters is not established. The disc status might occur because no jobs are waiting to be dispatched, or because the remote master cannot be located.</p><p>Possible solution is to cleanup all the LSF daemons on all clusters. Note : lsfshutdown leaves some of the daemons on Master node. 
So you need to manually kill all the LSF daemons on all master nodes.</p><p>Later, bclusters should show the status as shown below:</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><span style="background-color: #eeeeee; font-size: x-small;">[smpici@c656f7n06 ~]$ bclusters<br />[Job Forwarding Information ]<br /><br />LOCAL_QUEUE JOB_FLOW REMOTE CLUSTER STATUS<br />Queue1 send CLUSTER1 ok<br />Queue2 send CLUSTER2 ok<br />Queue3 send CLUSTER3 ok</span></blockquote></blockquote></blockquote><div><p> </p><p style="text-align: justify;">Singularity is a containerization solution (OS-level virtualization) designed for high-performance computing cluster environments. It allows a user on an HPC resource to run an application using a different operating system than the one provided by the cluster. For example, the application may require Ubuntu but the cluster OS is CentOS. If you use LSF_EAUTH_KEY in a container-based environment, you may hit an eauth setuid issue. The LSF client invokes "eauth -c" inside the container as the job owner; eauth is a setuid program, so it can read the lsf.sudoers file to get the key. If eauth loses the setuid permission, it cannot read the lsf.sudoers file and falls back to the default key to encrypt the user authentication data. When the request reaches a server, the server calls "eauth -s", which runs as root on the host; it reads the configured key, tries to decrypt the user data with it, and the decryption fails. In other words, only the default key works in a Singularity environment. </p><p style="text-align: justify;">That can be resolved by disabling LSF_AUTH_QUERY_COMMANDS in the configuration file as shown below. Since LSF 10.1 Fix Pack 12, LSF introduced LSF_AUTH_QUERY_COMMANDS in lsf.conf. The default value is Y. Basically, it adds extra user authentication for batch query commands. By default, each cluster has its own default eauth key. If a user runs a query command such as bhosts against a remote cluster, the local key is used to encrypt the user data, but the command talks to the remote daemon directly. Because the remote daemon uses its own key to decrypt the data, the authentication fails. Job operation information is exchanged through the mbatchd-to-mbatchd communication channel and does not go through this kind of authentication. Logically, a user should only query the local mbatchd to see a job's status (the remote job status is sent back to the submission cluster). 
</p><div><div>Modified files :</div><div><br /></div><div>1) Add LSF_AUTH_QUERY_COMMANDS=N to `lsf.conf file` </div><div>2) Removed : LSF_EAUTH_KEY="testKey1" from `/etc/lsf.sudoers`</div><div><br /></div><div>========================================================================</div><div><br /></div><div>Step 1 : Check cluster names </div></div></div><div><br /></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><div><div style="text-align: left;"><span style="background-color: #eeeeee;">[sachinpb@cluster1_master sachinpb]$ lsclusters -w</span></div></div></div><div><div><div style="text-align: left;"><span style="background-color: #eeeeee;">CLUSTER_NAME STATUS MASTER_HOST ADMIN HOSTS SERVERS</span></div></div></div><div><div><div style="text-align: left;"><span style="background-color: #eeeeee;">cluster1 ok cluster1_master sachinpb 5 5</span></div></div></div><div><div><div style="text-align: left;"><span style="background-color: #eeeeee;">cluster2 ok cluster2_master sachinpb 8 8</span></div></div></div><div><div><div style="text-align: left;"><span style="background-color: #eeeeee;">[sachinpb@cluster1_master sachinpb]$ </span></div></div></div></blockquote><div><div><div><br /></div><div>Step 2: Submit job to remote cluster (Job forwarding queue=Forwarding_queue) </div><div><br /></div></div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><div><div><span style="background-color: #cccccc;">[sachinpb@cluster1_master sachinpb]$ bsub -n 10 -R "span[ptile=2]" -q Forwarding_queue -R - sleep 1000</span></div></div></div><div><div><div><span style="background-color: #cccccc;">Job <35298> is submitted to queue <Forwarding_queue>.</span></div></div></div><div><div><div><span style="background-color: #cccccc;">[sachinpb@cluster1_master sachinpb]$</span></div></div></div></blockquote><div><div><div><br /></div><div>Step 3 : Check status of job from submission cluster:</div><div><br /></div></div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><div><div><span style="background-color: #cccccc;">[sachinpb@cluster1_master sachinpb]$ bjobs</span></div></div></div><div><div><div><span style="background-color: #cccccc;">JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME</span></div></div></div><div><div><div><span style="background-color: #cccccc;">35298 sachinpb RUN x86_c656f8 cluster1_master cluster2_master sleep 1000 Oct 11 03:30</span></div></div></div><div><div><div><span style="background-color: #cccccc;"> cluster2_master@cluster2</span></div></div></div><div><div><div><span style="background-color: #cccccc;"> cluster2_master@cluster2</span></div></div></div><div><div><div><span style="background-color: #cccccc;"> cluster2_master@cluster2</span></div></div></div><div><div><div><span style="background-color: #cccccc;"> cluster2_master@cluster2</span></div></div></div><div><div><div><span style="background-color: #cccccc;"> cluster2_master@cluster2</span></div></div></div><div><div><div><span style="background-color: #cccccc;"> cluster2_master@cluster2</span></div></div></div><div><div><div><span style="background-color: #cccccc;"> cluster2_master@cluster2</span></div></div></div><div><div><div><span style="background-color: #cccccc;"> cluster2_master@cluster2</span></div></div></div><div><div><div><span style="background-color: #cccccc;"> cluster2_master@cluster2</span></div></div></div><div><div><div><span 
style="background-color: #cccccc;">[sachinpb@cluster1_master sachinpb]$ bjobs -o 'forward_cluster' 35298</span></div></div></div><div><div><div><span style="background-color: #cccccc;">FORWARD_CLUSTER</span></div></div></div><div><div><div><span style="background-color: #cccccc;">cluster2</span></div></div></div><div><div><div><span style="background-color: #cccccc;">[sachinpb@cluster1_master sachinpb]$ bjobs -o 'dstjobid' 35298</span></div></div></div><div><div><div><span style="background-color: #cccccc;">DSTJOBID</span></div></div></div><div><div><div><span style="background-color: #cccccc;">8589</span></div></div></div><div><div><div><span style="background-color: #cccccc;">[sachinpb@cluster1_master sachinpb]$</span></div></div></div></blockquote><div><div><div><br /></div><div>Step 4 : Get list of compute nodes on remote cluster by issuing bjobs -m command from submission cluster as shown below:</div></div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><div><div><span style="background-color: #cccccc;">[sachinpb@cluster1_master sachinpb]$ bjobs -m cluster2 -o 'EXEC_HOST' 8589</span></div></div></div><div><div><div><span style="background-color: #cccccc;">EXEC_HOST</span></div></div></div><div><div><div><span style="background-color: #cccccc;">computeNode04:computeNode04:computeNode06:computeNode06:computeNode05:computeNode05:computeNode03:computeNode03:computeNode07:computeNode07</span></div></div></div><div><div><div><span style="background-color: #cccccc;">[sachinpb@cluster1_master sachinpb]$</span></div></div></div></blockquote><div><p>======================= <b>LSF configuration section</b> ===========================</p><p>After you change any configuration file, use the lsadmin reconfig and badmin reconfig commands to reconfigure your cluster. Log on to the host as root or the LSF administrator (in our case "lsfadmin")</p><p>Run lsadmin reconfig to restart LIM and checks for configuration errors. If no errors are found, you are prompted to either restart the lim daemon on management host candidates only, or to confirm that you want to restart the lim daemon on all hosts. If unrecoverable errors are found, reconfiguration is canceled. 
Run the badmin reconfig command to reconfigure the mbatchd daemon and checks for configuration errors.</p><p></p><ul style="text-align: left;"><li>lsadmin reconfig to reconfigure the lim daemon</li><li>badmin reconfig to reconfigure the mbatchd daemon without restarting</li><li>badmin mbdrestart to restart the mbatchd daemon</li><li>bctrld restart sbd to restart the sbatchd daemon</li></ul><p></p><p>More details about cluster reconfiguration commands as shown in the table copied below :</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-glsfqkh9IO8/YRXoDAjiynI/AAAAAAAAV6k/26maEWM4LhUtod3l-vwmoZXnDOQw2bn_gCLcBGAsYHQ/s817/lsf_config.PNG" style="margin-left: auto; margin-right: auto;"><img alt="https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=cluster-commands-reconfigure-your" border="0" data-original-height="616" data-original-width="817" height="482" src="https://1.bp.blogspot.com/-glsfqkh9IO8/YRXoDAjiynI/AAAAAAAAV6k/26maEWM4LhUtod3l-vwmoZXnDOQw2bn_gCLcBGAsYHQ/w640-h482/lsf_config.PNG" title="https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=cluster-commands-reconfigure-your" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=cluster-commands-reconfigure-your" target="_blank"><span style="font-size: x-small;">Source</span></a></td></tr></tbody></table><div><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><span style="font-size: xx-small;"><br /></span></pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;">How to resolve some known eauth related issues - commands like bhosts, bjobs ...etc fail with error "User permission denied".
<b>Problem/solution 1: </b>
Example 1:
[smpici@host1 ~]$ bhosts
User permission denied
Example 2:
mpirun --timeout 30 hello_world
Jan 11 02:42:52 2022 1221079 3 10.1 lsb_pjob_send_requests: lsb_pjob_getAckReturn failed on host <host1>, lsberrno <0>
[host1:1221079] [[64821,0],0] ORTE_ERROR_LOG: The specified application failed to start in file ../../../../../../opensrc/ompi/orte/mca/plm/lsf/plm_lsf_module.c at line 347
--------------------------------------------------------------------------
The LSF process starter (lsb_launch) failed to start the daemons on
the nodes in the allocation.
Returned : -1
lsberrno : (282) Failed while executing tasks
This may mean that one or more of the nodes in the LSF allocation is
not setup properly.
Then, please check the clocks on the nodes. If the clocks differ between nodes,
you need to configure chrony on all nodes as shown below:
systemctl enable chronyd.service
systemctl stop chronyd.service
systemctl start chronyd.service
systemctl status chronyd.service
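To verify that the clocks are actually in sync, a quick check (assuming chrony is in use, as above) is to compare the offset reported on each node:
chronyc tracking | grep "System time"
chronyc sources -v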
The root cause of the problem was that the system clocks on the compute nodes and the launch nodes were out of sync.
After the clocks were synchronized, the LSF commands worked. In the worst case, if the hosts cannot be kept time-synchronized,
please configure LSF_EAUTH_TIMEOUT=0 in lsf.conf on each side of multi-cluster</pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><b>Problem/solution 2:</b>
In a multi-cluster job forwarding setup, the remote cluster shows the unavail state.
Check that the port numbers defined in lsf.conf are the same on both clusters; otherwise, the remote cluster is reported as unavail.
[lsfadmin@ conf]$ lsclusters -w
CLUSTER_NAME STATUS MASTER_HOST ADMIN HOSTS SERVERS
cluster1 ok c1_master lsfadmin 40 40
cluster2 unavail unknown unknown -
Default port numbers :
[root@c1_master ~]# cat $LSF_HOME/conf/lsf.conf | grep PORT
LSF_LIM_PORT=7869
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6882
LSB_QUERY_PORT=6891
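A quick way to confirm the ports match and are reachable (a sketch; c2_master is an example host name for the remote cluster's master, and the port values are the defaults listed above):
# on each cluster's master host, compare the definitions:
grep -E 'LSF_LIM_PORT|LSB_MBD_PORT|LSB_SBD_PORT' $LSF_ENVDIR/lsf.conf
# from cluster1's master, check that the remote LIM and mbatchd ports answer:
nc -zv c2_master 7869
nc -zv c2_master 6881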
Useful tip: how long jobs remain visible after they are DONE or EXIT
CLEAN_PERIOD_DONE=seconds
Controls the amount of time during which successfully finished jobs are kept in mbatchd core memory.
This applies to DONE and PDONE (post job execution processing) jobs.
If CLEAN_PERIOD_DONE is not defined, the clean period for DONE jobs is defined by CLEAN_PERIOD in lsb.params.
If CLEAN_PERIOD_DONE is defined, its value must be less than CLEAN_PERIOD, otherwise it will be ignored and a warning message will appear.
# bparams -a | grep CLEAN_PERIOD
CLEAN_PERIOD = 3600
CLEAN_PERIOD_DONE = not configured
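If you want finished jobs to remain visible for longer, add these parameters to the Begin Parameters ... End Parameters section of lsb.params and apply the change with badmin reconfig (a sketch with example values; lsb.params normally lives under $LSF_ENVDIR/lsbatch/cluster_name/configdir/):
CLEAN_PERIOD = 7200          # keep finished jobs for 2 hours
CLEAN_PERIOD_DONE = 3600     # keep DONE jobs for 1 hour (must be less than CLEAN_PERIOD)
# apply the change:
badmin reconfig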
<div style="text-align: justify;"><b>Problem/solution 3:</b></div><div style="text-align: justify;"></div><div style="text-align: justify;">Regarding job forwarding setup between clusters over public/private IP address. </div><div style="text-align: justify;">I can't ssh from cluster1 Master (f2n01-10.x.x.x) to cluster2 with public IP on cluster2 Master(9.x.x.x).</div><div style="text-align: justify;">Do we need to open these ports by setting the firewall rules ?. From an LSF point of view the ports for </div><div style="text-align: justify;">the lim and mbd need to be opened up. Issue commands lsclusters and bclusters commands. </div><div style="text-align: justify;">The status reported should be OK in both. You should try this from both clusters. </div></pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><b>Problem/solution 4: </b></pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;">Test working of blaunch with LSF job submission. The blaunch command works only under LSF. </pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;">It can be used only to launch tasks on remote hosts that are part of a job allocation. </pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;">It cannot be used as a stand-alone command. The call to blaunch is made under the bsub environment.</pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><div style="text-align: justify;">You cannot run blaunch directly from the command line. 
</div><div style="text-align: justify;">blaunch is not to be used outside of the job execution environment provided by bsub.</div><div style="text-align: justify;">Most MPI implementations and many distributed applications use the rsh and ssh commands as their task launching</div><div style="text-align: justify;">mechanism. The blaunch command provides a drop-in replacement for the rsh and ssh commands as a transparent </div><div style="text-align: justify;">method for launching parallel applications within LSF.The following are some examples of blaunch usage:</div></pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;">Submit a parallel job:
bsub -n 4 blaunch myjob
Submit a job to an application profile
bsub -n 4 -app pjob blaunch myjob
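A minimal sanity check, following the same pattern as the examples above: run hostname once per allocated slot so you can see where the tasks land (the -I option simply makes the job interactive so the output comes back to your terminal):
bsub -n 4 -I blaunch hostname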
-----------------Example -----------------------
[sachinpb@xyz]$ cat show_ulimits.c
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
int main()
{
    struct rlimit old_lim;
    if (getrlimit(RLIMIT_CORE, &old_lim) == 0)
        printf("current limits -> soft limit= %ld \t"
               " hard limit= %ld \n", old_lim.rlim_cur, old_lim.rlim_max);
    else
        fprintf(stderr, "%s\n", strerror(errno));
    // abort();
    return 0;
}
[sachinpb@xyz]$
-------------------------------------------------
[submission_node]$ gcc -o show_ulimits show_ulimits.c
[submission_node]$ ls
generate_core.c show_ulimits show_ulimits.c test_limit
[submission_node]$ ./show_ulimits
current limits -> soft limit= 0 hard limit= 0
[submission_node]$
--------------------------------------------------------
[submission_node]$ bsub -o /shared-dir/sachin_test_lsf_out_%J -n 4 -R "span[ptile=2]" -q x86_test_q -m "HOSTA HOSTB" blaunch /shared_dir/core_file_test/show_limits
Job <2511> is submitted to queue <x86_test_q>.
[submission_node]$ bjobs 2511
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2511 sachinpb DONE x86_test_q bsub_host HOSTA show_limits Oct 29 02:17
HOSTA
HOSTB
HOSTB
[submission_node]$
[submission_node]$ cat /nfs_smpi_ci/sachin_test_lsf_out_2511
Sender: LSF System <sachinpb@HOSTA>
Subject: Job 2511: <blaunch /nfs_smpi_ci/core_file_test/test_limit> in cluster <x86-64_pok-cluster2> Done
Job <blaunch /nfs_smpi_ci/core_file_test/test_limit> was submitted from host <bsub_host> by user <sachinpb> in cluster <x86-64_pok-cluster2> at Thu Oct 29 02:17:45 2020
Job was executed on host(s) <2*HOSTA>, in queue <x86_test_q>, as user <sachinpb> in cluster <x86-64_pok-cluster2> at Thu Oct 29 02:23:52 2020
<2*HOSTB>
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
blaunch /nfs_smpi_ci/core_file_test/test_limit
------------------------------------------------------------
Successfully completed.
The output (if any) follows:
current limits -> soft limit= -1 hard limit= -1
current limits -> soft limit= -1 hard limit= -1
current limits -> soft limit= -1 hard limit= -1
current limits -> soft limit= -1 hard limit= -1
[submission_node]$
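Note: the -1 values printed inside the job correspond to RLIM_INFINITY, i.e. the core file size is unlimited in the job's execution environment, whereas the interactive run on the submission node reported 0. If the job should run with a specific core file size limit instead, one option (a sketch; the value is only an example) is to set it at submission time with the bsub -C option, which sets a per-process core file size limit in KB:
bsub -C 0 -n 4 blaunch ./show_ulimits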
-----------------------------------------------------------------------------------
<b>Problem/solution 5: </b>
When job submission fails with "User permission denied".
Check the setuid bits on the LSF binaries, as shown below:
[smpici@c690f2n01 big-mpi]$ bsub
bsub> sleep 100
bsub> User permission denied. Job not submitted.
$LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/bin
chown root lsadmin
chown root badmin
chmod 4755 lsadmin
chmod 4755 badmin
chown root bctrld
chmod 4755 bctrld
$LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/etc
chown root eauth
chmod u+s eauth
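After setting the bits, you can verify them; the owner permissions should show "rws" (a quick check using the same paths as above):
ls -l $LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/bin/badmin
ls -l $LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/etc/eauth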
Alternatively, run the hostsetup script to apply these settings instead of doing them manually:
./hostsetup --top="LSF_HOME" --setuid </pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;">--------------------------------------------------</pre><pre class="c-mrkdwn__pre" data-stringify-type="pre" style="--saf-0: rgba(var(--sk_foreground_low,29,28,29),0.13); border-radius: 4px; border: 1px solid var(--saf-0); box-sizing: inherit; counter-reset: list-0 0 list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; line-height: 1.50001; margin-bottom: 4px; margin-top: 4px; overflow-wrap: break-word; padding: 8px; tab-size: 4; text-align: left; word-break: normal;"><span style="font-size: xx-small;">References:
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=migrate-install-unix-linux<br />https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=iul-if-you-install-lsf-as-non-root-user</span><span style="color: #1d1c1d; font-family: Monaco, Menlo, Consolas, Courier New, monospace;"><span style="font-size: 12px; font-variant-ligatures: none; white-space: pre-wrap;"></span></span></pre></div></div>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-46455855978946433582021-07-23T10:08:00.018+05:302021-11-25T10:51:17.992+05:30Spectrum scale :High-performance storage GPFS cluster Installation and setup <p style="text-align: justify;">IBM Spectrum Scale(formerly GPFS) is a scale-out high performance global parallel file system (cluster file system) that provides concurrent access to a single file system or set of file systems from multiple nodes. Enterprises and organizations are creating, analyzing and keeping more data than ever before. Islands of data are being created all over the organization and in the cloud creating complexity, difficult to manage systems and increasing costs. Those that can deliver insights faster while managing rapid infrastructure growth are the leaders in their industry. In delivering those insights, an organization’s underlying information architecture must support the hybrid cloud, big data and artificial intelligence (AI) workloads along with traditional applications while ensuring security, reliability, data efficiency and high performance. IBM Spectrum Scale™ meets these challenges as a parallel high-performance solution with global file and object data access for managing data at scale with the distinctive ability to perform archive and analytics in place.</p><p style="text-align: left;">Manually installing the IBM Spectrum Scale software packages on POWER nodes myhost1, myhost2 and myhost3</p><p style="text-align: left;">The following packages are required for IBM Spectrum Scale Standard Edition on Red Hat Enterprise Linux:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li><span style="background-color: #eeeeee;">gpfs.base*.rpm</span></li><li><span style="background-color: #eeeeee;">gpfs.gpl*.noarch.rpm</span></li><li><span style="background-color: #eeeeee;">gpfs.msg.en_US*.noarch.rpm</span></li><li><span style="background-color: #eeeeee;">gpfs.gskit*.rpm</span></li><li><span style="background-color: #eeeeee;">gpfs.license*.rpm</span></li></ol><p style="text-align: left;"></p><p style="text-align: left;">Step 1:Download spectrum scale 5.1.1.1 SE package from fix central and Install RPM packages on all nodes:</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="background-color: #cccccc;"> rpm -ivh gpfs.base*.rpm gpfs.gpl*rpm gpfs.license.std*.rpm gpfs.gskit*rpm gpfs.msg*rpm gpfs.docs*rpm</span></p></blockquote></blockquote></blockquote><p style="text-align: left;"><br /></p><p style="text-align: left;">Step 2 : Verify installed GPFS packages</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p></p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: 
left;"><blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #cccccc;"> [root@myhost1 ]# rpm -qa | grep gpfs</span><br /><span style="background-color: #cccccc;"> gpfs.docs-5.1.1-1.noarch</span><br /><span style="background-color: #cccccc;"> gpfs.license.std-5.1.1-1.ppc64le</span><br /><span style="background-color: #cccccc;"> gpfs.bda-integration-1.0.3-1.noarch</span><br /><span style="background-color: #cccccc;"> gpfs.base-5.1.1-1.ppc64le</span><br /><span style="background-color: #cccccc;"> gpfs.gplbin-4.18.0-305.el8.ppc64le-5.1.1-1.ppc64le</span><br /><span style="background-color: #cccccc;"> gpfs.gskit-8.0.55-19.ppc64le</span><br /><span style="background-color: #cccccc;"> gpfs.msg.en_US-5.1.1-1.noarch</span><br /><span style="background-color: #cccccc;"> gpfs.gpl-5.1.1-1.noarch</span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p></p></blockquote></blockquote><p style="text-align: left;">Step 3 : Build GPL (5.1.1.1) module by issuing command mmbuildgpl on all nodes in cluster.</p><p style="text-align: left;">Step 4 : Verify GPFS packages installed on all nodes with GPL module built properly.</p><p style="text-align: left;"> Export the path for GPFS commands. </p><p style="text-align: left;"> export PATH=$PATH:/usr/lpp/mmfs/bin</p><p style="text-align: left;">Step 5 : Use the mmcrcluster command to create a GPFS cluster</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="background-color: #cccccc;">mmcrcluster -N NodeFile -C smpi_gpfs_power8</span></p></blockquote></blockquote><p style="text-align: left;"> where NodeFile has following entries</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #cccccc;">#cat NodeFile</span><br /><span style="background-color: #cccccc;"> myhost2:quorum</span><br /><span style="background-color: #cccccc;"> myhost1:quorum-manager</span><br /><span style="background-color: #cccccc;"> myhost3:quorum-manager</span></p></blockquote></blockquote></blockquote></blockquote><p style="text-align: left;">Step 6: Use the mmchlicense command to designate licenses as needed. This command controls the type of GPFS license associated with the nodes in the cluster. -- accept indicates that you accept the applicable licensing terms. </p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #cccccc;">mmchlicense server --accept -N serverLicense</span></p></blockquote></blockquote><p style="text-align: left;">Step 7: mmgetstate command. 
Displays the state of the GPFS™ daemon on one or more nodes.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #cccccc;">mmgetstate -a</span></p></blockquote></blockquote><p style="text-align: left;">Step 8: mmlslicense command displays information about the IBM Spectrum Scale node licensing designation or about disk and cluster capacity.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><span style="background-color: #cccccc;"> mmlslicense -L</span></p></blockquote></blockquote><p style="text-align: left;">Step 9: The mmcrnsd command is used to create cluster-wide names for NSDs used by GPFS. This is the first GPFS step in preparing disks for use by a GPFS file system.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #cccccc;">mmcrnsd -F NSD_Stanza_smpi_gpfs_power -v no</span></p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> where NSD_Stanza_smpi_gpfs_power has</p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"></p></blockquote></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><span style="background-color: #cccccc;">#cat NSD_Stanza_smpi_gpfs_power</span><br /><span style="background-color: #cccccc;">%nsd:</span><br /><span style="background-color: #cccccc;"> device=/dev/sda</span><br /><span style="background-color: #cccccc;"> nsd=nsd1</span><br /><span style="background-color: #cccccc;"> servers=myhost2</span><br /><span style="background-color: #cccccc;"> usage=dataAndMetadata</span><br /><span style="background-color: #cccccc;"> failureGroup=-1</span><br /><span style="background-color: #cccccc;"> pool=system</span><br /><br /><span style="background-color: #cccccc;">%nsd:</span><br /><span style="background-color: #cccccc;"> device=/dev/sdb</span><br /><span style="background-color: #cccccc;"> nsd=nsd2</span><br /><span style="background-color: #cccccc;"> servers=myhost1</span><br /><span style="background-color: #cccccc;"> usage=dataAndMetadata</span><br /><span style="background-color: #cccccc;"> failureGroup=-1</span><br /><span style="background-color: #cccccc;"> pool=system</span><br /><br /><span style="background-color: #cccccc;">%nsd:</span><br /><span style="background-color: #cccccc;"> device=/dev/sda</span><br /><span style="background-color: #cccccc;"> 
nsd=nsd3</span><br /><span style="background-color: #cccccc;"> servers=myhost3</span><br /><span style="background-color: #cccccc;"> usage=dataAndMetadata</span><br /><span style="background-color: #cccccc;"> failureGroup=-1</span><br /><span style="background-color: #cccccc;"> pool=system</span></blockquote></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p></p></blockquote></blockquote></blockquote></blockquote><p style="text-align: left;">Step 10: Use the mmlsnsd command to display the current information for the NSDs belonging to the GPFS cluster.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #cccccc;">mmlsnsd -X</span></p></blockquote></blockquote><p style="text-align: left;">Step 11: Use the mmcrfs command to create a GPFS file system</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #cccccc;">mmcrfs smpi_gpfs -F NSD_Stanza_smpi_gpfs_power</span></p></blockquote></blockquote><p style="text-align: left;">Step 12: The mmmount command mounts the specified GPFS file system on one or more nodes in the cluster.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #cccccc;">mmmount smpi_gpfs -a</span></p></blockquote></blockquote><p style="text-align: left;">Step 13 : Use the mmlsfs command to list the attributes of a file system.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #eeeeee;">mmlsfs all</span></p></blockquote></blockquote><p style="text-align: left;">Step 14: The mmlsmount command reports if a file system is in use at the time the command is issued.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #eeeeee;">mmlsmount all</span></p></blockquote></blockquote><p style="text-align: left;">step 15: How to change the mount point from /gpfs to /my_gpfs </p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"> <span style="background-color: #eeeeee;">mmchfs gpfs -T /my_gpfs</span></p></blockquote></blockquote><p>Step 16: GPFS auto start and auto mount setup</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #cccccc;">[root@myhost1 ~]# systemctl status gpfs.service<br />● gpfs.service - General Parallel File System<br /> 
Loaded: loaded (/usr/lib/systemd/system/gpfs.service; disabled; vendor preset: disabled)<br /> Active: active (running) since Tue 2021-07-20 03:27:04 EDT; 3 days ago<br /> Process: 96622 ExecStart=/usr/lpp/mmfs/bin/mmremote startSubsys systemd $STARTSUBSYS_ARGS (code=exited, status=0/SUCCESS)<br /> Main PID: 96656 (runmmfs)<br /> CGroup: /system.slice/gpfs.service<br /> ├─96656 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/runmmfs<br /> └─97093 /usr/lpp/mmfs/bin/mmfsd<br /></span></p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #cccccc;">[root@myhost1 ~]# systemctl is-active gpfs.service<br />active<br />[root@myhost1 ~]# systemctl is-enabled gpfs.service<br />disabled<br />[root@myhost1 ~]# systemctl is-failed gpfs.service<br />active<br />[root@myhost1 ~]# systemctl enable gpfs.service<br />Created symlink from /etc/systemd/system/multi-user.target.wants/gpfs.service to /usr/lib/systemd/system/gpfs.service.<br />[root@myhost1 ~]# systemctl is-enabled gpfs.service<br />enabled<br />[root@myhost1 ~]# ls -alsrt /etc/systemd/system/multi-user.target.wants/gpfs.service<br />0 lrwxrwxrwx 1 root root 36 Jul 23 05:43 /etc/systemd/system/multi-user.target.wants/gpfs.service -> /usr/lib/systemd/system/gpfs.service</span> </p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #cccccc;">[root@myhost1 ~]# mmgetstate -a<br /> Node number Node name GPFS state<br />-------------------------------------------<br /> 1 myhost2 active<br /> 2 myhost1 active<br /> 3 myhost3 active</span> </p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #cccccc;">[root@myhost1 ~]# mmchfs smpi_gpfs -A yes<br />mmchfs: Propagating the cluster configuration data to all<br /> affected nodes. This is an asynchronous process.</span> </p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #cccccc;">[root@myhost1 ~]# mmlsfs smpi_gpfs -A<br />flag value description<br />------------------- ------------------------ -----------------------------------<br /> -A yes Automatic mount option</span> </p></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #cccccc;">[root@myhost1 ~]# mmchconfig autoload=yes<br />mmchconfig: Command successfully completed<br />mmchconfig: Propagating the cluster configuration data to all<br /> affected nodes. 
This is an asynchronous process.<br />[root@myhost1 ~]#</span></p></blockquote></blockquote><p style="text-align: left;">Step 17: Troubleshoot when GPFS node went to inactive state or when disk goes down .</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmlscluster<br />GPFS cluster information<br />========================<br />GPFS cluster name: my_spectrumScale_cluster<br />GPFS cluster id: 9784093264651231821<br />GPFS UID domain: my_spectrumScale_cluster<br />Remote shell command: /usr/bin/ssh<br />Remote file copy command: /usr/bin/scp<br />Repository type: CCR<br /><br />Node Daemon node name IP address Admin node name Designation<br />---------------------------------------------------------------------<br />1 myhost2 10.x.y.1 myhost2 quorum<br />2 myhost1 10.x.y.2 myhost1 quorum-manager<br />3 myhost3 10.x.y.3 myhost3 quorum-manager</span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmgetstate -a</span><br /><span style="background-color: #eeeeee;">Node number Node name GPFS state</span><br /><span style="background-color: #eeeeee;">-------------------------------------------</span><br /><span style="background-color: #eeeeee;">1 myhost1 active</span><br /><span style="background-color: #eeeeee;">2 myhost2 </span><b style="background-color: red;"> down</b><br /><span style="background-color: #eeeeee;">3 myhost3 active</span></span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmstartup -a<br />Tue Jul 20 03:27:03 EDT 2021: mmstartup: Starting GPFS ...<br />myhost2: The GPFS subsystem is already active.<br />myhost3: The GPFS subsystem is already active.</span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmgetstate -a<br /><br />Node number Node name GPFS state<br />-------------------------------------------<br />1 myhost1 active<br />2 myhost2 active<br />3 myhost3 active<br />[root@myhost1 ~]#</span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; 
text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmunmount smpi_gpfs -a<br />Tue Jul 20 04:12:04 EDT 2021: mmunmount: Unmounting file systems ...<br />[root@myhost1 ~]# </span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmlsdisk smpi_gpfs</span><br /><span style="background-color: #eeeeee;">disk driver sector failure holds holds storage</span><br /><span style="background-color: #eeeeee;">name type size group metadata data status availability pool</span><br /><span style="background-color: #eeeeee;">------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------</span><br /><span style="background-color: #eeeeee;">nsd1 nsd 512 -1 Yes Yes ready up system</span><br /><span style="background-color: #eeeeee;">nsd2 nsd 512 -1 Yes Yes ready </span><b style="background-color: red;">down</b><span style="background-color: #eeeeee;"> system</span><br /><span style="background-color: #eeeeee;">nsd3 nsd 512 -1 Yes Yes ready up system</span><br /><span style="background-color: #eeeeee;">[root@myhost1 ~]#</span></span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmchdisk smpi_gpfs start -d nsd2<br />mmnsddiscover: Attempting to rediscover the disks. 
This may take a while ...<br />mmnsddiscover: Finished.<br />myhost1: Rediscovered nsd server access to nsd2.<br />Scanning file system metadata, phase 1 ...<br />100 % complete on Tue Jul 20 04:24:14 2021<br />Scan completed successfully.<br />Scanning file system metadata, phase 2 ...<br />100 % complete on Tue Jul 20 04:24:14 2021<br />Scan completed successfully.<br />Scanning file system metadata, phase 3 ...<br />Scan completed successfully.<br />Scanning file system metadata, phase 4 ...<br />100 % complete on Tue Jul 20 04:24:14 2021<br />Scan completed successfully.<br />Scanning file system metadata, phase 5 ...<br />100 % complete on Tue Jul 20 04:24:14 2021<br />Scan completed successfully.<br />Scanning user file metadata ...<br />100.00 % complete on Tue Jul 20 04:24:25 2021 ( 500736 inodes with total 26921 MB data processed)<br />Scan completed successfully.</span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmmount smpi_gpfs -a<br />Tue Jul 20 04:24:42 EDT 2021: mmmount: Mounting file systems ...<br />[root@myhost1 ~]#</span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmlsdisk smpi_gpfs<br />disk driver sector failure holds holds storage<br />name type size group metadata data status availability pool<br />------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------<br />nsd1 nsd 512 -1 Yes Yes ready up system<br />nsd2 nsd 512 -1 Yes Yes ready up system<br />nsd3 nsd 512 -1 Yes Yes ready up system<br />[root@myhost1 ~]#</span></p></blockquote></blockquote></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: left;"><span style="background-color: #eeeeee;">[root@myhost1 ~]# mmgetstate -a<br />Node number Node name GPFS state<br />-------------------------------------------<br />1 myhost1 active<br />2 myhost2 active<br />3 myhost3 active<br />[root@myhost1 ~]#</span></p></blockquote></blockquote></blockquote><p style="text-align: justify;">Step 18: Steps to permanently uninstall GPFS</p><p style="text-align: justify;">- Unmount all GPFS file systems on all nodes by issuing the mmumount all -a command.</p><p style="text-align: justify;">- Issue the mmdelfs command for each file system in the cluster to remove GPFS file systems.</p><p style="text-align: justify;">- Issue the mmdelnsd command for each NSD in the cluster to remove the NSD volume ID from the device.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; 
margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: justify;"><span style="background-color: #eeeeee;">mmdelfs smpi_gpfs</span><br /><span style="background-color: #eeeeee;">mmdelnsd nsd1</span><br /><span style="background-color: #eeeeee;">mmdelnsd nsd2</span><br /><span style="background-color: #eeeeee;">mmdelnsd nsd3</span></p></blockquote></blockquote></blockquote><p style="text-align: justify;">- Issue the mmshutdown -a command to shutdown GPFS on all nodes.</p><p style="text-align: justify;">- Uninstall GPFS from each node</p><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><p style="text-align: justify;"><span style="background-color: #eeeeee;">rpm -qa | grep gpfs | xargs rpm -e --nodeps</span></p></blockquote></blockquote></blockquote><p style="text-align: justify;">Remove the /var/mmfs and /usr/lpp/mmfs directories.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p style="text-align: justify;"><span style="background-color: #eeeeee;">rm -rf /var/mmfs</span><br /><span style="background-color: #eeeeee;">rm -rf /usr/lpp/mmfs</span></p></blockquote></blockquote></blockquote><p> ------------------------------------------------------------------------------------------</p><p style="text-align: justify;">The Quick Start automatically deploys a highly available IBM Spectrum Scale cluster on the Amazon Web Services (AWS) Cloud. This Quick Start deploys IBM Spectrum Scale into a virtual private cloud (VPC) that spans two Availability Zones in your AWS account. You can build a new VPC for IBM Spectrum Scale, or deploy the software into your existing VPC. The deployment and configuration tasks are automated by AWS CloudFormation templates that you can customize during launch.</p><p style="text-align: justify;">IBM's container-native storage solution for OpenShift is designed for enterprise customers who need global hybrid cloud data access. These storage services meet the strict requirements for mission critical data. IBM Spectrum® Fusion provides a streamlined way for organizations to discover, secure, protect and manage data from the edge, to the core data center, to the public cloud.</p><h4 style="text-align: justify;">Spectrum Fusion</h4><p style="text-align: justify;">IBM launched a containerized derivative of its Spectrum Scale parallel file system called Spectrum Fusion. The rationale is that customers need to store and analyze more data at edge sites, while operating in a hybrid and multi-cloud world that requires data availability across all these locations. The ESS arrays provide Edge storage capacity and a containerized Spectrum Fusion can run in any of the locations mentioned. It’s clear that to build, deploy and manage applications requires advanced capabilities that help provide rapid availability to data across the entire enterprise – from the edge to the data center to the cloud. </p><p style="text-align: justify;">Spectrum Fusion combines Spectrum Scale functionality with unspecified IBM data protection software. It will appear first in a hyperconverged infrastructure (HCI) system that integrates compute, storage and networking. 
This will be equipped with Red Hat Open Shift to support virtual machine and containerized workloads for cloud, edge and containerized data centres.</p><p style="text-align: justify;">Spectrum Fusion will integrate with Red Hat Advanced Cluster Manager (ACM) for managing multiple Red Hat OpenShift clusters, and it will support tiering. Spectrum Fusion provides customers with a streamlined way to discover data from across the enterprise as it has a global index of the data it stores. It will manage a single copy of data only – i.e. there is no need to create duplicate data when moving application workloads across the enterprise. Spectrum Fusion will integrate with IBM’s Cloud Satellite, a managed distribution cloud that deploys and runs apps across the on-premises, edge and cloud environments. </p><p style="text-align: left;"><span style="font-size: x-small;">References:</span><br /><span style="font-size: x-small;">https://www.ibm.com/in-en/products/spectrum-scale</span><br /><span style="font-size: x-small;">https://aws.amazon.com/quickstart/architecture/ibm-spectrum-scale</span><br /><span style="font-size: x-small;">https://www.ibm.com/in-en/products/spectrum-fusion</span><br /><span style="font-size: x-small;">https://www.ibm.com/docs/en/spectrum-scale/5.0.4?topic=installing-spectrum-scale-linux-nodes-deploying-protocols</span></p>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-40886898500324721382020-12-07T16:16:00.027+05:302022-09-06T12:49:09.381+05:30Overview Of MPI Reduction Operations in HPC cluster<div style="text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-0KrrMFR2yyA/X89No_IrAdI/AAAAAAAAVUs/PkOdrTK61QQUyOwnFdWphARRKos-tiO5gCLcBGAsYHQ/s373/message_passing_model.PNG" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="261" data-original-width="373" src="https://1.bp.blogspot.com/-0KrrMFR2yyA/X89No_IrAdI/AAAAAAAAVUs/PkOdrTK61QQUyOwnFdWphARRKos-tiO5gCLcBGAsYHQ/s320/message_passing_model.PNG" width="320" /></a></div>Message Passing Interface[ MPI ] is a de facto standard framework for distributed computing in many HPC applications. MPI collective operations involve a group of processes communicating by message passing in an isolated context, known as a communicator. Each process is identified by its rank, an integer number ranging from 0 to P − 1, where P is the size of the communicator. All processes place the same call (SPMD fashion i.e Single Program Multiple Data) depending on the process.<br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">MPI Reductions are among the most useful MPI operations and form an important class of computational operations.
The operation can be
either user-specified or from the list of pre-defined operations. Usually, the
predefined operations are largely sufficient for any application. </div><div style="text-align: justify;"> </div><div style="text-align: justify;">Consider a
system where you have N processes. The goal of the game is to compute the dot
product of two N-vectors in parallel. The dot product of two vectors u and v is
u⋅v = u1v1 + u2v2 + ... + uNvN. As you can imagine, this is highly
parallelizable. If you have N processes, each process i can compute the
intermediate value ui×vi. Then, the program needs to find a way to sum all of
these values. This is where the reduction comes into play. We can ask MPI to sum
all those values and store them either on only one process (for instance process
0) or to redistribute the value to every process.</div><div style="text-align: justify;"> </div><div style="text-align: justify;">MPI reduction operations fall
into three categories: </div><div style="text-align: justify;"> </div><div style="text-align: justify;">1) Global Reduction Operations: </div><ul style="text-align: left;"><li>MPI_REDUCE, </li><li>MPI_IREDUCE, </li><li>MPI_ALLREDUCE and </li><li>MPI_IALLREDUCE. </li></ul><div style="text-align: justify;">2) Combined Reduction and Scatter Operations: </div><ul style="text-align: left;"><li>MPI_REDUCE_SCATTER, </li><li>MPI_IREDUCE_SCATTER, </li><li>MPI_REDUCE_SCATTER_BLOCK and </li><li>MPI_IREDUCE_SCATTER_BLOCK. </li></ul><div style="text-align: justify;"> </div><div style="text-align: justify;">3) Scan Operations: </div><ul style="text-align: left;"><li>MPI_SCAN, </li><li>MPI_ISCAN, </li><li>MPI_EXSCAN, and </li><li>MPI_IEXSCAN. </li></ul><div style="text-align: justify;"> </div><div style="text-align: justify;">The primary idea of these operations is to collectively compute on
a set of input data elements to generate a combined output. MPI_REDUCE is a
collective function where each process provides some input data (e.g., an array
of double-precision floating-point numbers). This input data is combined through
an MPI operation, as specified by the “op” parameter. Most applications use MPI
predefined operations such as summations or maximum value identification,
although some applications also utilize reductions based on user-defined
function handlers. The MPI operator “op” is always assumed to be associative.
All predefined operations are also assumed to be commutative. Applications,
however, may define their own operations that are associative but not
commutative. The “canonical” evaluation order of a reduction is determined by
the ranks of the processes in the group. However, an MPI implementation can take
advantage of associativity, or associativity and commutativity of the
operations, in order to change the order of evaluation. Doing so may change the
result of the reduction for operations that are not strictly associative and
commutative, such as floating-point addition. </div><div style="text-align: justify;"> </div><div style="text-align: justify;">The following predefined operations
are supplied for MPI_REDUCE and related functions MPI_ALLREDUCE,
MPI_REDUCE_SCATTER, and MPI_SCAN. </div><div style="text-align: justify;"> </div><div style="text-align: justify;">These operations are invoked by placing the
following in op </div><ul style="text-align: left;"><li><u><b>[ Name] Meaning </b></u></li><li>[ MPI_MAX] maximum </li><li>[ MPI_MIN] minimum </li><li>[ MPI_SUM]
sum </li><li>[ MPI_PROD] product </li><li>[ MPI_LAND] logical and </li><li>[ MPI_BAND] bit-wise and </li><li>[
MPI_LOR] logical or </li><li>[ MPI_BOR] bit-wise or </li><li>[ MPI_LXOR] logical xor </li><li>[ MPI_BXOR]
bit-wise xor </li><li>[ MPI_MAXLOC] max value and location </li><li>[ MPI_MINLOC] min value and
location </li></ul><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Ac7YZAbGDdA/X89JtmqNQLI/AAAAAAAAVUU/zfXqdKiw0kAnMqm10OXGRa6tXDnK5i3sgCLcBGAsYHQ/s985/Example1_MPI_reduce.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="426" data-original-width="985" height="389" src="https://1.bp.blogspot.com/-Ac7YZAbGDdA/X89JtmqNQLI/AAAAAAAAVUU/zfXqdKiw0kAnMqm10OXGRa6tXDnK5i3sgCLcBGAsYHQ/w685-h389/Example1_MPI_reduce.PNG" width="685" /></a></div><br /><div style="text-align: center;"><a href="https://1.bp.blogspot.com/-Mf5yausWGfs/X8397iQXN6I/AAAAAAAAVTI/OylC21G2b8g9vV0k-UV_DuCl_LI-z4njQCLcBGAsYHQ/s974/MPI_reduce_example.PNG"><img border="0" data-original-height="306" data-original-width="974" height="146" src="https://1.bp.blogspot.com/-Mf5yausWGfs/X8397iQXN6I/AAAAAAAAVTI/OylC21G2b8g9vV0k-UV_DuCl_LI-z4njQCLcBGAsYHQ/w508-h146/MPI_reduce_example.PNG" width="508" /></a></div><br /><p></p><p><span style="font-size: x-small;"><b> Example 1:</b></span> Get the memory on each node and perform MPI_SUM operation to calculate average Memory on the cluster.</p><p><span></span></p><a name='more'></a><span style="font-family: arial;"><span style="font-size: small;"><b>cat reduce_sum.c</b></span><span style="font-size: small;"><br /></span></span><div style="margin-left: 120px; text-align: left;"><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>#include <stdio.h></span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>#include <stdlib.h></span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>#include <mpi.h></span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>#include <assert.h></span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>#include <time.h></span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>#include <string.h></span></span></span></span></span></div><div style="margin-left: 120px; text-align: left;"><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>//Get meminfo function to calc size of RAM</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>char *get_meminfo()</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>{</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> FILE *in=NULL;</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> char temp[256];</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> char 
*ram_size=(char*)malloc(256);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> assert(ram_size != NULL);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> in=popen("cat /proc/meminfo | grep MemTotal | cut -c15-26", "r");</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> fgets(temp, 255, in);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> strcpy(ram_size, temp);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> //printf("In function %s", ram_size);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> pclose(in);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> return ram_size;</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>}</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>int main(int argc, char** argv) {</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> MPI_Init(NULL, NULL);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> int world_rank;</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> int world_size;</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> MPI_Comm_size(MPI_COMM_WORLD, &world_size);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> // Get the size of RAM locally</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: 
georgia;"><span><span> char *size_of_RAM=NULL;</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> int local_val;</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> size_of_RAM=get_meminfo();</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> local_val=atoi(size_of_RAM);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> // Print the SIZE of memory on each process</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> printf("Size of Local Memory for process %d , RAM_size_local= %d\n",</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> world_rank, local_val);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> // Reduce all of the local meminfo into the global meminfo</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> int global_val;</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span><span><b> MPI_Reduce(&local_val, &global_val, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);</b></span></span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> // Print the result</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> if (world_rank == 0) {</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> printf("Total global sum = %d, avg = %d\n", global_val,</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> global_val / (world_size));</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> }</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> // Clean up</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> 
free(size_of_RAM);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> MPI_Barrier(MPI_COMM_WORLD);</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span> MPI_Finalize();</span></span></span></span></span><br /><span style="font-family: trebuchet;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span><span>}</span></span></span></span></span><br /></div><p><span style="font-size: xx-small;"><span style="font-family: trebuchet;"><b>Compile :</b></span></span><br /> mpicc -o reduce_sum reduce_sum.c<br />--------------------------------------------------------------<br />$mpirun -np 6 -host exechost03:1,exechost04:1,exechost05:1,exechost06:1,exechost07:1,exechost08:1 reduce_sum<i> </i></p><p><i>Size of Local Memory for process 3 , RAM_size_local= 32742748<br />Size of Local Memory for process 4 , RAM_size_local= 32742748<br />Size of Local Memory for process 2 , RAM_size_local= 65666080<br />Size of Local Memory for process 1 , RAM_size_local= 65666088<br />Size of Local Memory for process 5 , RAM_size_local= 65633576<br />Size of Local Memory for process 0 , RAM_size_local= 65669284<br />Total global sum = 328120524, avg = 54686754</i><br />$<br /></p><span><!--more--></span><p><b><span style="font-size: x-small;">Example 2:</span> </b>Get the memory on each node and perform MPI_MIN operation.</p><p><span style="font-size: x-small;"><b> <span style="font-family: arial;"><span style="font-size: small;">cat reduce_min.c</span></span></b></span><span style="font-family: arial;"><span style="font-size: small;"><br /></span></span></p><p style="margin-left: 80px; text-align: left;"><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>#include <stdio.h></span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>#include <stdlib.h></span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>#include <mpi.h></span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>#include <assert.h></span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>#include <time.h></span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>#include <string.h></span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>//Get meminfo function to calc size of 
RAM</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>char *get_meminfo()</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>{</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> FILE *in=NULL;</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> char temp[256];</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> char *ram_size=(char*)malloc(256);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> assert(ram_size != NULL);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> in=popen("cat /proc/meminfo | grep MemTotal | cut -c15-26", "r");</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> fgets(temp, 255, in);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> strcpy(ram_size, temp);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> //printf("In function %s", ram_size);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> pclose(in);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> return ram_size;</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>}</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>int main(int argc, char** argv) {</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> MPI_Init(NULL, NULL);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span 
style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> int world_rank;</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> int world_size;</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> MPI_Comm_size(MPI_COMM_WORLD, &world_size);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> // Get the size of RAM locally</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> char *size_of_RAM=NULL;</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> int local_val;</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> size_of_RAM=get_meminfo();</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> local_val=atoi(size_of_RAM);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> // Print the SIZE of memory on each process</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> printf("Size of Local Memory for process %d , RAM_size_local= %d\n",</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> world_rank, local_val);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> // Reduce all of the local meminfo into the global meminfo</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> int global_val;</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: 
x-small;"><span style="font-family: georgia;"><span> <span><b> MPI_Reduce(&local_val, &global_val, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);</b></span></span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> // Print the result</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> if (world_rank == 0) {</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> printf("Total global min val = %d\n", global_val);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> }</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> // Clean up</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> free(size_of_RAM);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> MPI_Barrier(MPI_COMM_WORLD);</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span> MPI_Finalize();</span></span></span></span></span><br /><span style="font-family: arial;"><span style="font-size: small;"><span style="font-size: x-small;"><span style="font-family: georgia;"><span>}</span></span></span></span></span><br /></p><p>--------------</p><p><span style="font-size: xx-small;"><b>Compile :</b></span><br /> mpicc -o reduce_min reduce_min.c<br />$mpirun -np 6 -host exechost03:1,exechost04:1,exechost05:1,exechost06:1,exechost07:1,exechost08:1 reduce_min<br /><i>Size of Local Memory for process 3 , RAM_size_local= 32742748<br />Size of Local Memory for process 1 , RAM_size_local= 65666088<br />Size of Local Memory for process 2 , RAM_size_local= 65666080<br />Size of Local Memory for process 4 , RAM_size_local= 32742748<br />Size of Local Memory for process 5 , RAM_size_local= 65633576<br />Size of Local Memory for process 0 , RAM_size_local= 65669284<br />Total global min val = <b>32742748</b></i><br />$<span style="font-size: x-small;"><b> </b></span></p><p><span style="font-size: x-small;"><b>Another example where you can find Minimum Free memory available and Maximum PPN value to calculate max_bytes_per_rank to avoid OOM</b> .<b> </b></span></p><p><span style="font-size: x-small;"><b>cat reduce_ppn_Memory.c</b></span><br 
/></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: arial;">#include <stdio.h></span><br /><span style="font-family: arial;">#include <stdlib.h></span><br /><span style="font-family: arial;">#include <mpi.h></span><br /><span style="font-family: arial;">#include <assert.h></span><br /><span style="font-family: arial;">#include <time.h></span><br /><span style="font-family: arial;">#include <string.h></span><br /><span style="font-family: arial;">#include <sys/sysinfo.h></span><br /><br /><span style="font-family: arial;">int main(int argc, char** argv) {</span><br /><br /><span style="font-family: arial;"> MPI_Init(NULL, NULL);</span><br /><span style="font-family: arial;"> const double gigabyte = 1024 * 1024 *1024;</span><br /><span style="font-family: arial;"> double local_free_mem, global_min_memory, max_bytes_per_rank;</span><br /><span style="font-family: arial;"> int len, local_ppn, global_max_ppn, world_rank, world_size;</span><br /><span style="font-family: arial;"> char name[MPI_MAX_PROCESSOR_NAME];</span><br /><span style="font-family: arial;"> struct sysinfo si;</span><br /><span style="font-family: arial;"> sysinfo (&si);</span><br /><br /><span style="font-family: arial;"> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);</span><br /><span style="font-family: arial;"> MPI_Comm_size(MPI_COMM_WORLD, &world_size);</span><br /><br /><span style="font-family: arial;"> MPI_Comm shared_comm;</span><br /><span style="font-family: arial;"> MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &shared_comm);</span><br /><span style="font-family: arial;"> MPI_Get_processor_name(name, &len);</span><br /><span style="font-family: arial;"> MPI_Comm_size(shared_comm, &local_ppn);</span><br /><span style="font-family: arial;"> MPI_Comm_free(&shared_comm);</span><br /><br /><span style="font-family: arial;"> /* Get the size of Free RAM locally */</span><br /><span style="font-family: arial;"> local_free_mem = (si.freeram * si.mem_unit) / gigabyte;</span><br /><br /><span style="font-family: arial;"> /* Print the SIZE of memory on each process */</span><br /><span style="font-family: arial;"> printf("Size of Local Memory for process %d is %f GB and local_ppn is %d on %s \n", world_rank, local_free_mem, local_ppn, name);</span><br /><br /><span style="font-family: arial;"> /* Reduce all of the local meminfo & ppn into the global meminfo & ppn */</span><br /><span style="font-family: arial;"> /* Get Minimum Memory available and Maximum value of PPN on all the nodes */</span><br /><br /><span style="font-family: arial;"> MPI_Reduce(&local_free_mem, &global_min_memory, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);</span><br /><span style="font-family: arial;"> MPI_Reduce(&local_ppn, &global_max_ppn, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);</span><br /><br /><span style="font-family: arial;"> /* Print the result Minimum Free memory available and Maximum PPN value */</span><br /><br /><span style="font-family: arial;"> if (world_rank == 0) {</span><br /><span style="font-family: arial;"> printf("After MPI reduction, Minimum Free memory available is %f GB\n", global_min_memory);</span><br /><span style="font-family: arial;"> printf("After MPI reduction, Maximum PPN value is %d\n", global_max_ppn);</span><br /><br /><span style="font-family: arial;"> /* max_bytes_per_rank calculation based on 50% of minimum free memory available */</span><br /><span style="font-family: arial;"> /* divided by max value of PPN on the nodes in cluster. 
This will help to avoid OOM Failure */</span><br /><span style="font-family: arial;"> /* in heterogeneous cluster where some of the nodes installed with lesser memory (Example 32GB RAM) */</span><br /><br /><span style="font-family: arial;"> max_bytes_per_rank = (global_min_memory * 0.5) / global_max_ppn;</span><br /><span style="font-family: arial;"> printf("Maximum bytes per rank is %f GB \n", max_bytes_per_rank);</span><br /><span style="font-family: arial;"> }</span><br /><br /><span style="font-family: arial;"> MPI_Barrier(MPI_COMM_WORLD);</span><br /><span style="font-family: arial;"> MPI_Finalize();</span><br /><span style="font-family: arial;">}</span><br /></p><p>---------------------------------------------------------------------------------------------<span style="font-size: xx-small;"><b> </b></span></p><p><span style="font-size: xx-small;"><b>Compile:</b></span><br /><i>[sachinpb@exechost04]$ mpicc -o reduce_ppn_Memory reduce_ppn_Memory.c<br />[sachinpb@exechost04]$</i><br /><b><br /><span style="font-size: xx-small;">Run executable with 5 procs on 4 nodes :</span></b><i> </i></p><p><i>[sachinpb@exechost04]$ mpirun -np 5 -host exechost04:1,exechost05:1,exechost06:1,exechost07:2 reduce_ppn_Memory<br />Size of Local Memory for process 0 is 59.349953 GB and local_ppn is 1 on exechost04<br />Size of Local Memory for process 2 is 29.662674 GB and local_ppn is 1 on exechost06<br />Size of Local Memory for process 1 is 59.596630 GB and local_ppn is 1 on exechost05<br />Size of Local Memory for process 3 is 29.628567 GB and local_ppn is 2 on exechost07<br />Size of Local Memory for process 4 is 29.628567 GB and local_ppn is 2 on exechost07<br /><br />After MPI reduction, Minimum Free memory available is 29.628567 GB<br />After MPI reduction, Maximum PPN value is 2<br />Maximum bytes per rank is 7.407142 GB</i><br />------------------------------------------------------------------------------------------------------<br /><span style="font-size: xx-small;"><b>Run executable with 13 procs on 8 nodes :</b></span><i> </i></p><p><i>[sachinpb@exechost04]$ mpirun -np 13 -host exechost03:1,exechost04:4,exechost05:1,exechost06:1,exechost07:1,exechost08:3,exechost09:1,exechost10:1 reduce_ppn_Memory<br />Size of Local Memory for process 7 is 29.717995 GB and local_ppn is 1 on exechost07<br />Size of </i><i><i>Local </i>Memory for process 6 is 29.657806 GB and local_ppn is 1 on exechost06<br />Size of </i><i><i>Local </i>Memory for process 5 is 59.594044 GB and local_ppn is 1 on exechost05<br />Size of </i><i><i>Local </i>Memory for process 11 is 60.412170 GB and local_ppn is 1 on exechost09<br />Size of </i><i><i>Local </i>Memory for process 12 is 46.994701 GB and local_ppn is 1 on exechost10<br />Size of </i><i><i>Local </i>Memory for process 4 is 59.661568 GB and local_ppn is 1 on exechost03<br />Size of </i><i><i>Local </i>Memory for process 1 is 59.044613 GB and local_ppn is 4 on exechost04<br />Size of </i><i><i>Local </i>Memory for process 0 is 59.044613 GB and local_ppn is 4 on exechost04<br />Size of </i><i><i>Local </i>Memory for process 3 is 59.044613 GB and local_ppn is 4 on exechost04<br /> Size of </i><i><i>Local </i>Memory for process 2 is 59.044613 GB and local_ppn is 4 on exechost04<br />Size of </i><i><i>Local </i>Memory for process 9 is 60.254044 GB and local_ppn is 3 on exechost08<br /> Size of </i><i><i>Local </i>Memory for process 10 is 60.254044 GB and local_ppn is 3 on exechost08<br /> Size of </i><i><i>Local </i>Memory for process 8 is 60.254044 GB and local_ppn is 3 on 
exechost08<br /><br />After MPI reduction, Minimum Free memory available is 29.657806 GB<br />After MPI reduction, Maximum PPN value is 4<br />Maximum bytes per rank is 3.707226 GB</i><br /><br /><span></span></p><!--more--><p></p><div style="text-align: justify;"><b>MPI_Allreduce</b><br /><br />Many parallel applications need the reduced result to be available on all processes rather than only on the root process. Just as MPI_Allgather complements MPI_Gather, MPI_Allreduce reduces the values and distributes the result to all processes. Allreduce is a collective operation that combines vectors owned by the processes into a result vector, which is then distributed back to all of the processes. In MPI, this operation is invoked by calling the MPI_Allreduce() function. The function prototype is the following:</div><div style="text-align: justify;"> <div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-tQDU8EOUjHA/X84GAWom4jI/AAAAAAAAVTs/Wx-RFOiyQr0sjEYDGfoXnDxiYvt4fthtgCLcBGAsYHQ/s383/All_reduce_prototype.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="237" data-original-width="383" src="https://1.bp.blogspot.com/-tQDU8EOUjHA/X84GAWom4jI/AAAAAAAAVTs/Wx-RFOiyQr0sjEYDGfoXnDxiYvt4fthtgCLcBGAsYHQ/s320/All_reduce_prototype.PNG" width="320" /></a></div><br /></div><div style="text-align: justify;">The example above shows MPI_Reduce, in which the reduction operation takes place on only one process (in this case process 0). There, the reception buffer (result) is valid only on process 0; the other processes will not have a valid value stored in result. Sometimes you might want the result of the reduction to be stored on all processes, in which case MPI_Reduce is not suited. In such a case, you can use MPI_Allreduce to store the result on every process. So, if we had used MPI_Allreduce instead of MPI_Reduce in the example, all processes would have a valid value in result and could use this value after the communication. MPI_Allreduce is the equivalent of doing an MPI_Reduce followed by an MPI_Bcast.<br /><br /><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-90zVwRAfeNI/X84GiFvNyVI/AAAAAAAAVT4/h_99R1WRofsYqkcGIiIDKu-ie7hoMBQMACLcBGAsYHQ/s654/mpi_allreduce.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="277" data-original-width="654" height="185" src="https://1.bp.blogspot.com/-90zVwRAfeNI/X84GiFvNyVI/AAAAAAAAVT4/h_99R1WRofsYqkcGIiIDKu-ie7hoMBQMACLcBGAsYHQ/w436-h185/mpi_allreduce.PNG" width="436" /></a></div><br />One of the more common applications of the reduction operation is the <i>inner product</i> computation. Typically, you have two vectors <i>x</i> and <i>y</i> that have the same distribution, that is, all processes store equal parts of <i>x</i> and <i>y</i>. For example:
<pre>local_inprod = 0;
for (i=0; i<localsize; i++)
  local_inprod += x[i]*y[i];
MPI_Allreduce( &local_inprod, &global_inprod, 1, MPI_DOUBLE ... )
</pre>
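<p>The following is a minimal, self-contained sketch of this inner-product pattern. The file name allreduce_inprod.c, the local vector length, and the dummy vector contents are illustrative assumptions, and MPI_SUM over MPI_COMM_WORLD is assumed to complete the call shown in the snippet. Every rank computes a partial dot product over its local slice, and MPI_Allreduce makes the global sum available on every rank, not just rank 0.</p>
<pre>/* allreduce_inprod.c -- hypothetical example; values are dummy data for illustration */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Each process owns an equal part of x and y (filled with dummy values here) */
    const int localsize = 4;
    double x[4], y[4];
    for (int i = 0; i < localsize; i++) {
        x[i] = world_rank + i;
        y[i] = 1.0;
    }

    /* Local partial inner product */
    double local_inprod = 0.0;
    for (int i = 0; i < localsize; i++)
        local_inprod += x[i] * y[i];

    /* Sum the partial results; every rank receives the global value */
    double global_inprod = 0.0;
    MPI_Allreduce(&local_inprod, &global_inprod, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d of %d: global inner product = %f\n", world_rank, world_size, global_inprod);

    MPI_Finalize();
    return 0;
}
</pre>
<p>It can be compiled and run like the earlier examples (mpicc and mpirun); unlike MPI_Reduce, every rank should print the same global value.</p>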
<p name="switchToTextMode">
</p><p>
</p><p style="text-align: justify;">Recently one more application for the Allreduce operation has emerged. Distributed training of Deep Neural Networks(DNN) uses Allreduce operation to synchronize neural network parameters between separate training processes after each step of gradient descent optimization. This new application involves data sets of medium and big sizes which depends on a particular neural network model bringing new requirements to the Allreduce performance. </p><p style="text-align: justify;"><b>MPI_SCAN</b> is used to perform a prefix reduction on data distributed across the group. Parallel prefix, also known as scanning operation calculates the reduction of all parts of the local data stored in the process. The operation returns, in the receive buffer of the process with rank i, the reduction of the values in the send buffers of processes with ranks 0,...,i (inclusive). The type of operations supported, their semantics, and the constraints on send and receive buffers are as for MPI_REDUCE. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Qp0tEIE_m5o/X89JTwd6hpI/AAAAAAAAVUM/1eNgVooC2BgpFhhZqUcg38aR_I7CPCuIgCLcBGAsYHQ/s1012/example_MPI_scan.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="442" data-original-width="1012" height="296" src="https://1.bp.blogspot.com/-Qp0tEIE_m5o/X89JTwd6hpI/AAAAAAAAVUM/1eNgVooC2BgpFhhZqUcg38aR_I7CPCuIgCLcBGAsYHQ/w540-h296/example_MPI_scan.PNG" width="540" /></a></div><p>The diagram below shows the schematic diagram of reduction operation and parallel prefix operation</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-wm6o9o8oT00/X89LDFv_0KI/AAAAAAAAVUg/91_nWolDeJQTScOaxderDGs626PqC9a1wCLcBGAsYHQ/s1042/example_reduction_scan.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="533" data-original-width="1042" height="283" src="https://1.bp.blogspot.com/-wm6o9o8oT00/X89LDFv_0KI/AAAAAAAAVUg/91_nWolDeJQTScOaxderDGs626PqC9a1wCLcBGAsYHQ/w484-h283/example_reduction_scan.PNG" width="484" /></a></div><p style="text-align: justify;"><b>MPI_Exscan </b>is an exclusive scan: it performs a prefix reduction across all MPI processes in the given communicator, excluding the calling MPI process. <b> </b>The operation returns, in the recvbuf of the process with rank i, the reduction (calculated according to the function op) of the values in the sendbufs of processes with ranks 0, ..., i-1. Compare this with the functionality of MPI_Scan, which calculates over the range 0, ..., i (inclusive). The type of operations supported, their semantics, and the constraints on send and receive buffers are as for MPI_Reduce. The value in recvbuf on process 0 is undefined and unreliable as recvbuf is not significant for process 0. The value of recvbuf on process 1 is always the value in sendbuf on process 0. 
</p><p style="text-align: justify;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-m2qljpABG8Q/X8-w_M9yNpI/AAAAAAAAVU4/kG6T_brTUYITME2lcJk1RRvP32xIl_IcACLcBGAsYHQ/s831/MPI_exscan.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="603" data-original-width="831" height="339" src="https://1.bp.blogspot.com/-m2qljpABG8Q/X8-w_M9yNpI/AAAAAAAAVU4/kG6T_brTUYITME2lcJk1RRvP32xIl_IcACLcBGAsYHQ/w517-h339/MPI_exscan.PNG" width="517" /></a></div><br /><span style="font-size: x-small;">-----------</span><p></p><p style="margin-left: 80px; text-align: justify;"><span style="font-size: x-small;">int main(int argc, char* argv[])<br />{<br /> MPI_Init(&argc, &argv);<br /> <br /> // Get my rank<br /> int my_rank;<br /> MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);<br /> <br /> // Get the sum of all ranks up to the one before mine and print it<br /> int total;<br /> MPI_Exscan(&my_rank, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);<br /> <br /> // The result on MPI process 0 is undefined, do not print it<br /> if(my_rank == 0)<br /> {<br /> printf("[MPI process 0] Total = undefined.\n");<br /> }<br /> else<br /> {<br /> printf("[MPI process %d] Total = %d.\n", my_rank, total);<br /> }<br /> <br /> MPI_Finalize();<br /> <br /> return EXIT_SUCCESS;<br />} </span></p><p style="text-align: justify;"><span style="font-size: x-small;">-----------</span><br /></p><p style="text-align: justify;">MPI_IN_PLACE tells MPI_Allreduce to use a single buffer as both the sending and receiving buffer.The "in place'' operations are provided to reduce unnecessary memory motion by both the MPI implementation and by the user. By allowing the 'in place'[MPI_IN_PLACE ] option, the receive buffer in many of the collective calls becomes a send-and-receive buffer(i.e send and receive buffer are one same buffer). Collective communication applies to intracommunicators. If the operation is rooted (e.g., broadcast, gather, scatter), then the transfer is unidirectional. Non-rooted operations, such as all-to-all, will often occur as part of an exchange, where it makes sense to communicate in both directions at once (i.e transfer is bidirectional). Note that the ``in place'' option does not apply to intercommunicators since in the intercommunicator case there is no communication from a process to itself. <br /></p><p style="text-align: justify;">An intra-communicator refers to a single group, an inter-communicator refers to a pair of groups. The intra-communicator is simply the group of all processes which share that communicator. Collective communications can be performed with an intra-communicator. They cannot be performed on an inter-communicator.The "in place" option for
intracommunicators is specified by passing the value MPI_IN_PLACE to the
argument sendbuf at the root. In such a case, the input data is taken
at the root from the receive buffer, where it will be replaced by the
output data (i.e., use if-else to pass MPI_IN_PLACE as sendbuf only on the
root, not on the other ranks).<br /></p><p style="text-align: left;"><span style="font-size: x-small;"><b>For example :</b></span><br />-------------------------<br /><span style="font-size: x-small;">if ( rank == root ) <br /> rc = MPI_Reduce( MPI_IN_PLACE, buffer, size * count, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD );<br />else <br /> rc = MPI_Reduce( buffer, NULL, size * count, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD );</span><br /><span style="font-size: xx-small;"><br />NOTE:
For gather and most other collectives, MPI_IN_PLACE should be passed as
the sendbuf. For scatter/scatterv, MPI_IN_PLACE should be passed as the
recvbuf. </span><br /></p><p style="text-align: justify;">On the other side, inter-communication is a point-to-point communication between processes in different groups. Inter-communicators are more likely to be used by parallel library designers than application developers. If comm is an intercommunicator, then the result of the reduction of the data provided by processes in group A is stored at each process in group B, and vice versa. Both groups should provide count and datatype arguments that specify the same type signature. <br /></p><p>I hope this blog helped in understanding MPI reduction operations and running sample code on your HPC cluster. </p><p>--------------------------------------------------------------------------------------------------------------------</p><div style="text-align: left;"><b><span style="font-size: xx-small;">Reference:</span></b></div><div style="margin-left: 40px; text-align: left;"><span style="font-size: xx-small;">https://www.open-mpi.org/doc/v4.0/man3/MPI_Reduce.3.php</span></div><div style="margin-left: 40px; text-align: left;"><span style="font-size: xx-small;">https://www.programmersought.com/article/96223924595 </span></div><div style="margin-left: 40px; text-align: left;"><span style="font-size: xx-small;">https://www.rookiehpc.com/mpi/docs/</span><br /></div><div style="margin-left: 40px; text-align: left;"><span style="font-size: xx-small;">http://www.rc.usf.edu/tutorials/classes/tutorial/mpi/chapter8.html</span></div>Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-14310534291883034932020-10-20T05:41:00.024+05:302020-10-23T17:23:48.715+05:30Ansible Concepts: Run first Command and Playbook on Linux cluster<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-yntuv8-zYI0/X44rPkXEETI/AAAAAAAAVP4/IQ2obDrVT1ApR8epaVMry9Kz6H36aGe1ACLcBGAsYHQ/s554/ansible_logo.PNG" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="393" data-original-width="554" height="177" src="https://1.bp.blogspot.com/-yntuv8-zYI0/X44rPkXEETI/AAAAAAAAVP4/IQ2obDrVT1ApR8epaVMry9Kz6H36aGe1ACLcBGAsYHQ/w320-h177/ansible_logo.PNG" width="320" /></a></div><p style="text-align: justify;"><span style="font-family: arial;">Ansible is a configuration management and orchestration tool that automates cloud provisioning, configuration management, application deployment, intra-service orchestration, and many other IT needs. The open source product is maintained by Ansible Inc. It was first released in 2012. Red Hat acquired Ansible in 2015. Red Hat Ansible Engine and Red Hat Ansible Tower are commercial products. Ansible can be run directly from the command line without setting up any configuration files. You only need to install Ansible on the control server or node. It communicates and performs the required tasks using SSH. No other installation is required. This is different from other orchestration tools like Chef and Puppet where you have to install software both on the control and client nodes.It uses no agents and no additional custom security infrastructure, so it's easy to deploy - and most importantly, it uses a very simple language (YAML, in the form of Ansible Playbooks)Ansible uses configuration files called playbooks for a series of tasks. The playbooks are written in YAML syntax. 
That allow you to describe your automation jobs in a way that approaches plain English.</span></p><p style="text-align: justify;"> <br /></p><p></p><p><span style="font-size: medium;"><b>Architecture:</b></span></p><p><span style="font-size: medium;"><b></b></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-size: medium;"><b><a href="https://1.bp.blogspot.com/-9Gz34W0oZuM/X45N4RR05dI/AAAAAAAAVQE/MveqA60c_VU7Ep7EhwwTke--kkBe-2iZQCLcBGAsYHQ/s785/ansible_arch.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="464" data-original-width="785" height="339" src="https://1.bp.blogspot.com/-9Gz34W0oZuM/X45N4RR05dI/AAAAAAAAVQE/MveqA60c_VU7Ep7EhwwTke--kkBe-2iZQCLcBGAsYHQ/w526-h339/ansible_arch.PNG" width="526" /></a></b></span></div><span style="font-size: medium;"><b><br /></b> The Ansible Automation engine consists of:<br /><br /><b>Control node</b><br /><br />Any machine with Ansible installed. You can run commands and playbooks, invoking /usr/bin/ansible or /usr/bin/ansible-playbook, from any control node. You can use any computer that has Python installed on it as a control node - laptops, shared desktops, and servers can all run Ansible. However, you cannot use a Windows machine as a control node. You can have multiple control nodes.</span><p></p><p><br /><span style="font-size: medium;"><img border="0" data-original-height="310" data-original-width="326" height="263" src="https://1.bp.blogspot.com/-IczPhLWCw5Y/X2Xl7G8NXtI/AAAAAAAAVMY/5HFpdYJMjQAurikLGNRLUXtqNXSl9lNrQCLcBGAsYHQ/w352-h263/ansible_architecture.PNG" width="352" /><br /><b>Managed nodes</b><br /><br />The network devices (and/or servers) you manage with Ansible. Managed nodes are also sometimes called “hosts”. Ansible is not installed on managed nodes.<br /><br /><b>Inventory</b><br /><br />A list of managed nodes. An inventory file is also sometimes called a “hostfile”. Your inventory can specify information like IP address for each managed node. An inventory can also organize managed nodes, creating and nesting groups for easier scaling. Inventories can be of two types static and dynamic, dynamic inventory can be covered while you go through Ansible thoroughly.<br /><b><br />Modules</b><br /><br />The units of code Ansible executes. Each module has a particular use, from administering users on a specific type of database to managing VLAN interfaces on a specific type of network device. You can invoke a single module with a task, or invoke several different modules in a playbook.<br /><br /><b>Tasks</b><br /><br />The units of action in Ansible. You can execute a single task once with an ad-hoc command.<br /><br /><b>Playbooks<br /></b><br />Ordered lists of tasks, saved so you can run those tasks in that order repeatedly. Playbooks can include variables as well as tasks. Playbooks are written in YAML and are easy to read, write, share and understand. </span></p><p><span style="font-size: medium;"><b>CMDB(Configuration Management Database.) :</b><br /><br />It is a repository that acts as a data warehouse for IT installations. 
It holds data relating to a collection of IT assets (commonly referred to as configuration items (CI)), as well as to describe relationships between such assets.<br /><br /><b>Cloud:</b><br />A network of remote servers on which you can store, manage and process your data, these servers are hosted on internet, storing the data remotely rather than local servers, just launch your resources and instances on cloud, connect them to your servers and you’ve the wisdom of operating your task remotely. <br /></span></p><p><span style="font-size: medium;"><b> </b></span><br /><a href="https://1.bp.blogspot.com/-lmYyqwylJ3w/X2XqNBIOwEI/AAAAAAAAVMk/957PweJBN_AR88jsfvXeGc8tDOTsQkNEwCLcBGAsYHQ/s712/ansible_master_slave.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="712" height="460" src="https://1.bp.blogspot.com/-lmYyqwylJ3w/X2XqNBIOwEI/AAAAAAAAVMk/957PweJBN_AR88jsfvXeGc8tDOTsQkNEwCLcBGAsYHQ/w640-h460/ansible_master_slave.PNG" width="640" /></a><br /></p><p style="text-align: justify;"><span style="font-family: arial;">Ansible works by connecting to your nodes and pushing out small programs, called "Ansible modules" to them. These programs are written to be resource models of the desired state of the system. Ansible then executes these modules (over SSH by default), and removes them when finished.<br /><br />Your library of modules can reside on any machine, and there are no servers, daemons, or databases required. Typically you'll work with your favorite terminal program, a text editor, and probably a version control system to keep track of changes to your content. Passwords are supported, but SSH keys with ssh-agent are one of the best ways to use Ansible. </span></p><p style="text-align: justify;"><span style="font-family: arial;">By default, Ansible represents what machines it manages using a very simple INI file that puts all of your managed machines in groups of your own choosing. The Ansible inventory file defines the hosts and groups of hosts upon which commands, modules, and tasks in a playbook operate. It resides under the /etc/ansible directory. If necessary, you can also create project-specific inventory files in alternate locations. </span></p><p><span style="font-family: arial;"><i><b>How to install Ansible on RHEL8 machine:</b></i></span></p><p><span style="font-family: arial;">Install instructions for Ansible Engine on RHEL on IBM Power (little endian).<br /></span><span style="font-size: x-small;"><span style="font-family: arial;">RHEL 8: (POWER8, POWER9) </span><br />subscription-manager repos --enable="ansible-2.9-for-rhel-8-ppc64le-rpms"<br />yum install ansible </span></p><p>Verify installed version of ansible :</p><p><span style="font-size: x-small;">[root@myhost123 example]# ansible --version<br />ansible 2.10.2<br /> config file = /root/sachin/example/ansible.cfg<br /> configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']<br /> ansible python module location = /root/.local/lib/python3.6/site-packages/ansible<br /> executable location = /usr/bin/ansible<br /> python version = 3.6.8 (default, Dec 5 2019, 16:11:43) [GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]</span><br /></p><p style="text-align: justify;"><span style="font-family: arial;"><span style="font-family: inherit;">As explained in above section, Ansible Engine consists of Inventory, API, Modules and Plugins. A
user writes playbooks, i.e. sets of tasks; the playbook is then matched against the
inventory to find the listed hosts or IP addresses where the
tasks must be executed. Ansible copies the required modules to the managed
nodes and, using Python API calls and plugins, completes the given
tasks. Once the tasks are completed/executed, all the modules are
removed from the Managed Nodes. On Linux, Ansible executes the modules on
managed hosts using SSH</span><b><i> <br /></i></b></span></p><p><span style="font-family: arial;"><b><i>How to use ANSIBLE for ad-hoc parallel task execution:</i></b><br />Once you have an instance available, you can talk to it right away, without any additional setup:</span></p><p>ansible 'hosts' -m module_name <br /></p><p>Eg: ansible 'localhost' -m shell -a 'id'<br /><br />ansible all -m ping <br />ansible hostname.com -m yum -a "name=httpd state=installed"<br />ansible hostname.com -a "/usr/sbin/reboot"</p><p><i><b>Examples: <br /></b></i></p><p><span style="font-size: x-small;">CASE 1:[root@myhost123 example]# ansible all -m ping<br />myhost123 | SUCCESS => {<br /> "ansible_facts": {<br /> "discovered_interpreter_python": "/usr/bin/python"<br /> },<br /> "changed": false,<br /> "ping": "pong"<br />}</span><br />---------------------<br /><span style="font-size: x-small;">CASE 2: # ansible 'localhost' -m shell -a 'id'<br />localhost | CHANGED | rc=0 >><br />uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023<br /></span><br />-----------------------------<br /><span style="font-size: x-small;">CASE 3: # ansible myhost123 -m yum -a "name=httpd state=installed"<br /><br />myhost123 | CHANGED => {<br /> "ansible_facts": {<br /> "discovered_interpreter_python": "/usr/bin/python"<br /> },<br /> "changed": true,<br /> "msg": "",<br /> "rc": 0,<br /> "results": [<br /> "Installed: mod_http2-1.11.3-3.module+el8.2.0+7758+84b4ca3e.1.ppc64le",<br /> "Installed: httpd-2.4.37-21.module+el8.2.0+5008+cca404a3.ppc64le",<br /> "Installed: httpd-filesystem-2.4.37-21.module+el8.2.0+5008+cca404a3.noarch",<br /> "Installed: apr-util-1.6.1-6.el8.ppc64le",<br /> "Installed: apr-util-openssl-1.6.1-6.el8.ppc64le",<br /> "Installed: apr-1.6.3-9.el8.ppc64le",<br /> "Installed: redhat-logos-httpd-81.1-1.el8.noarch",<br /> "Installed: httpd-tools-2.4.37-21.module+el8.2.0+5008+cca404a3.ppc64le",<br /> "Installed: apr-util-bdb-1.6.1-6.el8.ppc64le"<br /> ]<br />}</span><br /></p><p>--------------------------------------------------------------<br /></p><p><span style="font-family: arial;"><i><b>How To Setup Ansible Master-Slave and Install Apache Web Server</b></i><br />Let’s see the capabilities of Ansible in this example of simple web server setup. We will have the following components:</span></p><ol style="text-align: left;"><li><span style="font-family: arial;">Control Node – It is the node that will have Ansible installed and it will control the other nodes.</span></li><li><span style="font-family: arial;">Load Balancer – A nginx based load balancer will be installed on this node.</span></li><li><span style="font-family: arial;">Web Server 1 and Server 2 – These nodes will have Apache installed with a simple hello world web page. The load balancer will alternate traffic between these two nodes.<br /></span></li></ol><p><span style="font-family: arial;">We will first install Ansible on the control node. Then, we will use the control node to set up the load balancer and application nodes.</span></p><p></p><p>--------------------------------------------------------------</p><p><i><b>How to create playbooks ? 
Example Hello world <br /></b></i></p><p><span style="font-size: x-small;"> [root@myhost123 example]# cat HelloWorld.yml<br />---<br />- name: This is a hello-world example<br /> hosts: all<br /> tasks:<br /> - name: Create a file called '/tmp/output.txt' with the content 'hello world'.<br /> copy:<br /> content: hello world<br /> dest: /tmp/output.txt</span></p><p><span style="font-size: x-small;">...<br />[root@myhost123 example]#<br />---------------------------------------------------------------</span></p><p><span style="font-size: x-small;">Run Playbook:<br /><br /><span style="font-size: xx-small;">[root@myhost123 example]# <span style="font-size: small;">ansible-playbook HelloWorld.yml</span><br />PLAY [This is a hello-world example] ********************************************************************<br />TASK [Gathering Facts] **********************************************************************************<br />ok: [myhost123]<br />TASK [Create a file called '/tmp/output.txt' with the content 'hello world'.] *************************<br />ok: [myhost123]<br />PLAY RECAP **********************************************************************************************<br />myhost123 : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0<br />[root@myhost123 example]#</span></span></p><p><span style="font-size: x-small;">Verify output:<br /><br /><span style="font-size: xx-small;">[root@myhost123 example]# cat /tmp/output.txt<br />hello world</span></span><i><b><br /></b></i>----------------------------------------------------------------------------------</p><p><span style="font-family: arial;"> All YAML files (regardless of their association with Ansible or not) can optionally begin with --- and end with ... This is part of the YAML format and indicates the start and end of a document.</span></p><span style="font-size: xx-small;">Reference:</span><br /><span style="font-size: xx-small;">https://docs.ansible.com/index.html</span><br /><span style="font-size: xx-small;">https://linuxhint.com/ansible-tutorial-beginners/ </span><br />Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-34340495800989653072020-05-24T05:15:00.004+05:302020-07-31T16:20:03.879+05:30RHEL8 - Next generation of Linux container Capabilities - podman, buildah, skopeo .....!<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-8Na-VoegDYI/Xsmo6QKWLII/AAAAAAAAU2Y/EeenRTlj4g8v-qHdq6CSvNDU7d0JfcJwQCLcBGAsYHQ/s1600/podman_logo.PNG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="310" data-original-width="701" height="141" src="https://1.bp.blogspot.com/-8Na-VoegDYI/Xsmo6QKWLII/AAAAAAAAU2Y/EeenRTlj4g8v-qHdq6CSvNDU7d0JfcJwQCLcBGAsYHQ/s320/podman_logo.PNG" width="320" /></a></div>
<div style="text-align: justify;">
Container technology has been creating a lot of buzz in recent times. As enterprises move from virtualization to containers, many have adopted containers for cloud application deployment. Containers leverage key capabilities available within Linux: they depend on kernel features such as control groups, namespaces, and SELinux to manage resources and isolate the applications running inside them. It's not just containers that generally work best with Linux, but also the tools used to manage their lifecycles. Today, Kubernetes is the leading container orchestration platform, and it was built on Linux concepts and uses Linux tooling and application programming interfaces (APIs) to manage containers.</div>
<br />
<div style="text-align: justify;">
Red Hat OpenShift is a leading hybrid cloud, enterprise Kubernetes application platform, trusted by 1,700+ organizations. It is much easier to use, and it even has a web interface for configuration. Red Hat developed container tools for single hosts and for clusters, standardizing on Kubernetes. Popular alternatives include the managed Kubernetes services AWS EKS (Amazon Elastic Kubernetes Service)/Fargate, Azure AKS, and Google Cloud Platform's GKE, as well as Apache Mesos, Docker Swarm, Nomad, OpenStack, Rancher, and Docker Compose.</div>
<br />
In RHEL 8, the Docker package is not included and is not supported by Red Hat. It has been replaced by the new suite of tools in the Container Tools module, as listed below: <br />
<br />
<ul style="text-align: left;">
<li> The <b>podman</b> container engine replaced docker engine</li>
<li> The <b>buildah</b> utility replaced docker build</li>
<li> The <b>skopeo</b> utility replaced docker push </li>
</ul>
<br />
Red Hat Quay is a distributed, highly available container registry for the entire enterprise. Unlike other container tools implementations, the tools described here do not center around the monolithic Docker container engine and docker command. Instead, they provide a set of command-line tools that can operate without a container engine. These include:<br />
<br />
<ul style="text-align: left;">
<li><b>podman</b> - client tool for directly managing pods and container images (run, stop, start, ps, attach, exec, and so on)</li>
<li><b>buildah</b> - client tool for building, pushing and signing container images</li>
<li><b>skopeo</b> - client tool for copying, inspecting, deleting, and signing images</li>
<li><b>runc</b> - Container runtime client providing container run and build features to podman and buildah with OCI-format containers </li>
<li><b>crictl</b> - For troubleshooting and working directly with CRI-O container engines</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: justify;">
Because these tools are compatible with the Open Container Initiative (OCI), they can be used to manage the same Linux containers that are produced and managed by Docker and other OCI-compatible container engines. However, they are especially suited to running directly on Red Hat Enterprise Linux in single-node use cases. Each tool in this scenario can be more lightweight and focused on a subset of features. And because no container-engine daemon is required, these tools run without the overhead of having to work with a daemon process.<br />
<br />
For a multi-node container platform, there is OpenShift. Instead of relying on the single-node, daemonless tools, OpenShift requires a daemon-based container engine such as the CRI-O Container Engine. Also, podman stores its data in the same directory structure used by Buildah, Skopeo, and CRI-O, which will allow podman to eventually work with containers being actively managed by CRI-O in OpenShift. <br />
<br />
In a nutshell, you get Podman with RHEL in a single node use case (orchestrate yourself) and CRI-O as part of the highly automated OpenShift 4 software stack as shown in diagram.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-LyrZmb_rFko/XuNQcakoUZI/AAAAAAAAU6Q/JhnY_Wf3jm0JkxI4Q3zGKUV7eyqs2UdoACLcBGAsYHQ/s1600/cloudNative-Buildah-openshift.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="482" data-original-width="1511" height="204" src="https://1.bp.blogspot.com/-LyrZmb_rFko/XuNQcakoUZI/AAAAAAAAU6Q/JhnY_Wf3jm0JkxI4Q3zGKUV7eyqs2UdoACLcBGAsYHQ/s640/cloudNative-Buildah-openshift.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://crunchtools.com/podman-and-cri-o-in-rhel-8-and-openshift-4/">source</a></td></tr>
</tbody></table>
<br />
<b>What is CRI-O? </b><br />
CRI-O is an implementation of the Kubernetes CRI (Container Runtime Interface) that enables the use of OCI (Open Container Initiative) compatible runtimes. It allows Kubernetes to use any OCI-compliant runtime as the container runtime for running pods. Today it supports runc and Kata Containers as the container runtimes, but any OCI-conformant runtime can in principle be plugged in. CRI-O supports OCI container images and can pull from any container registry. It is a lightweight alternative to using Docker, Moby or rkt as the runtime for Kubernetes. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Why CRI-O ? </b></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-zGCOcCS9f0Y/XsmqA1Q9ZQI/AAAAAAAAU2k/bz_OXlFu8GYcuBe_S3uRYCFKblEYG_jPACPcBGAYYCw/s1600/criO-logo.PNG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="217" data-original-width="573" height="75" src="https://1.bp.blogspot.com/-zGCOcCS9f0Y/XsmqA1Q9ZQI/AAAAAAAAU2k/bz_OXlFu8GYcuBe_S3uRYCFKblEYG_jPACPcBGAYYCw/s200/criO-logo.PNG" width="200" /></a></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
CRI-O is an open source, community-driven container engine. Its primary goal is to replace the Docker service as the container engine for Kubernetes implementations, such as OpenShift Container Platform. The CRI-O container engine provides a stable, more secure, and performant platform for running Open Container Initiative (OCI) compatible runtimes. You can use the CRI-O container engine to launch containers and pods by engaging OCI-compliant runtimes like runc [the default OCI runtime] or Kata Containers.</div>
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-UjNqZhCQ6OI/XsmWPefZwaI/AAAAAAAAU10/PLH_J8mFwykYbk0M-1SYqQsY8yUpVWD1ACLcBGAsYHQ/s1600/runtime_CRI-types.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="413" data-original-width="1169" height="226" src="https://1.bp.blogspot.com/-UjNqZhCQ6OI/XsmWPefZwaI/AAAAAAAAU10/PLH_J8mFwykYbk0M-1SYqQsY8yUpVWD1ACLcBGAsYHQ/s640/runtime_CRI-types.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>CRI RUNTIMES</b> <a href="https://events19.linuxfoundation.org/wp-content/uploads/2018/07/OSLS_-Container-runtimes-and-standards.pdf"><span style="font-size: xx-small;">source</span></a></td></tr>
</tbody></table>
<div style="text-align: justify;">
CRI-O is not supported as a stand-alone container engine. You must use CRI-O as a container engine for a Kubernetes installation, such as OpenShift Container Platform. To run containers without Kubernetes or OpenShift Container Platform, use podman. CRI-O’s purpose is to be the container engine that implements the Kubernetes Container Runtime Interface (CRI) for OpenShift Container Platform and Kubernetes, replacing the Docker service. The scope of CRI-O is tied to the Container Runtime Interface (CRI): CRI extracted and standardized exactly what a Kubernetes service (kubelet) needs from its container engine. There is little need for direct command-line contact with CRI-O. A set of container-related command-line tools is available to provide full access to CRI-O for testing and monitoring: crictl, runc, podman, buildah, and skopeo. Some Docker features are included in other tools instead of in CRI-O. For example, podman offers exact command-line compatibility with many docker command features and extends those features to managing pods as well. No container engine is needed to run containers or pods with podman. Features for building, pushing, and signing container images, which are also not required in a container engine, are available in the buildah command.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://computingforgeeks.com/docker-vs-cri-o-vs-containerd/" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="494" data-original-width="781" height="404" src="https://1.bp.blogspot.com/-VDuIxmZn-Xw/Xsw_1dJ9yTI/AAAAAAAAU3g/G5ABm9K4fCMtSmPy84_4wcrec8K0XLyLwCLcBGAsYHQ/s640/kubernetes_CRIO_process.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://computingforgeeks.com/docker-vs-cri-o-vs-containerd/">Kubernetes and CRI-O process</a></td></tr>
</tbody></table>
The following are the components of CRI-O (a brief inspection sketch follows this list):<br />
<ul>
<li>OCI-compatible runtime – the default is runc; other OCI-compliant runtimes, e.g. Kata Containers, are supported as well.</li>
<li>containers/storage – library used for managing layers and creating root filesystems for the containers in a pod.</li>
<li>containers/image – library used for pulling images from registries.</li>
<li>networking (CNI) – used for setting up networking for the pods. The Flannel, Weave and OpenShift-SDN CNI plugins have been tested.</li>
<li>container monitoring (conmon) – utility within CRI-O that is used to monitor the containers.</li>
<li>security – provided by several core Linux capabilities.</li>
</ul>
</div>
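<div style="text-align: justify;">
To see where these components live on a CRI-O node, the following is a rough, hedged sketch; the file paths and keys shown are the commonly used defaults and may differ on your installation:<br />
<i># CRI-O's main configuration (TOML); the default OCI runtime is usually set here<br />
grep -A2 default_runtime /etc/crio/crio.conf<br />
# CNI network configuration directory consumed by CRI-O<br />
ls /etc/cni/net.d/<br />
# the conmon monitor binary and the crio engine itself<br />
which conmon<br />
crio --version</i><br />
</div>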
<div style="text-align: justify;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-dtUPb_ESejo/XyPwH5ZhN8I/AAAAAAAAVFQ/hhxxXHZuZMMzjClWJRIYKgGeS_-n1ZfDwCLcBGAsYHQ/s1600/Runtime_kubernetes.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="471" data-original-width="1039" height="290" src="https://1.bp.blogspot.com/-dtUPb_ESejo/XyPwH5ZhN8I/AAAAAAAAVFQ/hhxxXHZuZMMzjClWJRIYKgGeS_-n1ZfDwCLcBGAsYHQ/s640/Runtime_kubernetes.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><a href="https://events19.linuxfoundation.org/wp-content/uploads/2017/11/How-Container-Runtime-Matters-in-Kubernetes_-OSS-Kunal-Kushwaha.pdf">Runtime in Kubernetes </a></b></td></tr>
</tbody></table>
where the OCI runtime works as the low-level runtime, and the high-level runtime provides inputs to the OCI runtime as per the OCI specs.<br />
<br />
How do Podman, CRI-O and Kata Containers relate to this ecosystem?<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-nX7REwQECnc/XyPxkgTAebI/AAAAAAAAVFc/DSALl8Rg7SU2N4T8V0a3z9gh9hvQhJi6wCLcBGAsYHQ/s1600/CRIO_runtime.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="425" data-original-width="1036" height="262" src="https://1.bp.blogspot.com/-nX7REwQECnc/XyPxkgTAebI/AAAAAAAAVFc/DSALl8Rg7SU2N4T8V0a3z9gh9hvQhJi6wCLcBGAsYHQ/s640/CRIO_runtime.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://merlijn.sebrechts.be/blog/2020-01-docker-podman-kata-cri-o/">source</a></td></tr>
</tbody></table>
An OCI runtime is relatively simple. You give it the root filesystem of the container and a json file describing core properties of the container, and the runtime spins up the container and connects it to an existing network using a pre-start hook.<br />
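A hedged sketch of driving an OCI runtime (runc) by hand follows; normally a container engine does this for you, and the image used here to populate the root filesystem is only illustrative:<br />
<i># prepare an OCI bundle: a rootfs plus a config.json<br />
mkdir -p /tmp/mycontainer/rootfs<br />
cd /tmp/mycontainer<br />
podman export $(podman create registry.access.redhat.com/ubi8/ubi) | tar -C rootfs -xf -<br />
# generate a default config.json describing the container<br />
runc spec<br />
# run the container from the bundle (typically as root)<br />
runc run mycontainer</i><br />
<br />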
<br />
The actions listed below are the job of a high-level container runtime. On top of these, the high-level container runtime implements the CRI so that Kubernetes has an easy way to drive the runtime.<br />
<ul>
<li> Actually creating the network of a container.</li>
<li> Managing container images.</li>
<li> Preparing the environment of a container.</li>
<li> Managing local/persistent storage.</li>
</ul>
runc is the default OCI runtime for most tools, such as Docker and Podman.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-DdRMRBKjqEc/XyP2XKeM3NI/AAAAAAAAVFo/FbwGeWTuGAg83HhJk6q5U_MTTzudV4s7wCLcBGAsYHQ/s1600/CRIO-workflow.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="732" data-original-width="835" height="560" src="https://1.bp.blogspot.com/-DdRMRBKjqEc/XyP2XKeM3NI/AAAAAAAAVFo/FbwGeWTuGAg83HhJk6q5U_MTTzudV4s7wCLcBGAsYHQ/s640/CRIO-workflow.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://www.redhat.com/en/blog/introducing-cri-o-10">source</a></td></tr>
</tbody></table>
<b>What CRI-O isn’t:</b><br />
<br />
Building images, for example, is out of scope for CRI-O; that is left to tools like Docker’s build command, Buildah, or OpenShift’s Source-to-Image (S2I). Once an image is built, CRI-O will happily consume it. <br />
</div>
<div style="text-align: justify;">
<b>What is Podman? </b></div>
<div style="text-align: justify;">
Podman is a daemonless container engine for developing, managing, and running OCI containers on your Linux system, developed by Red Hat, whose engineers have paid special attention to keeping the same nomenclature when executing Podman commands. Containers can be run either as root or in rootless mode. It is a replacement for Docker for local development of containerized applications. Podman commands map 1:1 to Docker commands, including their arguments; you could alias docker to podman and never notice that a completely different tool is managing your local containers (see the hedged example below). The Podman approach is simply to interact directly with the image registry, with the container and image storage, and with the Linux kernel through the runc container runtime process (not a daemon). Podman allows you to do all of the Docker commands without the daemon dependency. <br />
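A hedged illustration of that 1:1 mapping (the alias and image below are just examples):<br />
<i># let existing docker muscle memory drive podman<br />
alias docker=podman<br />
docker pull registry.access.redhat.com/ubi8/ubi<br />
docker run --rm registry.access.redhat.com/ubi8/ubi cat /etc/redhat-release<br />
docker ps -a</i><br />
<br />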
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-73mj0t4jpXE/XsnBJn9Oc4I/AAAAAAAAU3A/mG6BNw__uocf9qCryi12dcoEMpqknqQEwCLcBGAsYHQ/s1600/podman_arch.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="803" data-original-width="948" height="542" src="https://1.bp.blogspot.com/-73mj0t4jpXE/XsnBJn9Oc4I/AAAAAAAAU3A/mG6BNw__uocf9qCryi12dcoEMpqknqQEwCLcBGAsYHQ/s640/podman_arch.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: large;">Podman workflow</span><b><br /></b></td></tr>
</tbody></table>
<br />
One of the core features of Podman is its focus on security. There is no daemon involved in using Podman; it uses the traditional fork-exec model instead and heavily utilizes user namespaces and network namespaces. As a result, Podman is more isolated and, in general, more secure to use than Docker. You can even be root in a container without granting the container or Podman any root privileges on the host, and a user in a container will not be able to do any root-level tasks on the host machine. Running rootless, Podman and Buildah can do most things people want to do with containers, but there are times when root is still required. The nicest feature is running Podman and containers as a non-root user: you never have to give a user root privileges on the host, whereas in the client/server model (like Docker employs) you must open a socket to a privileged daemon running as root to launch containers. There you are at the mercy of the security mechanisms implemented in the daemon rather than the security mechanisms implemented in the host operating system, which is a dangerous proposition.</div>
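<div style="text-align: justify;">
A small, hedged sketch of what this looks like in practice for an ordinary (non-root) user; it assumes rootless support is set up on the host (RHEL 8.1 or later):<br />
<i># run a container with no sudo; inside, the process typically reports uid 0<br />
podman run --rm registry.access.redhat.com/ubi8/ubi id<br />
# show how the container's root is mapped to your unprivileged user via user namespaces<br />
podman unshare cat /proc/self/uid_map</i><br />
</div>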
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<i>How containers run with container Engine ?</i></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-hvpVAm-5ZB8/Xsmum_DYTUI/AAAAAAAAU2s/Im_n0FGhwyQm9s98hYs-0h-Uk24MKmW1gCLcBGAsYHQ/s1600/How_container_run_with_Engine.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="892" height="344" src="https://1.bp.blogspot.com/-hvpVAm-5ZB8/Xsmum_DYTUI/AAAAAAAAU2s/Im_n0FGhwyQm9s98hYs-0h-Uk24MKmW1gCLcBGAsYHQ/s640/How_container_run_with_Engine.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://www.redhat.com/en/blog/why-red-hat-investing-cri-o-and-podman">source</a></td></tr>
</tbody></table>
<div style="text-align: justify;">
<b>Podman can now ease the transition to Kubernetes and CRI-O :</b></div>
<div style="text-align: justify;">
<b> </b> </div>
<div style="text-align: justify;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: justify;">
On a basic level, Kubernetes is often viewed as the application that runs your containers, but Kubernetes really is a huge bundle of utilities and APIs that describe how a group of microservices, running in containers on a group of servers, can coordinate, work together, and share services and resources. Kubernetes only supplies the APIs for orchestration, scheduling, and resource management. To have a complete container orchestration platform, you'll need the OS underneath, a container registry, container networking, container storage, logging and monitoring, and a way to integrate continuous integration/continuous delivery (CI/CD). Red Hat OpenShift provides such a supported Kubernetes for cloud-native applications, with enterprise security, on multi-cloud environments.</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
A group of seals is called a pod :) – Podman manages pods. The Pod concept was introduced by Kubernetes, and Podman pods are similar to the Kubernetes definition. Podman can now capture the YAML description of local pods and containers and then help users transition to a more sophisticated orchestration environment like Kubernetes. Check this developer and user workflow (a command sketch follows the list):</div>
<ul style="text-align: left;">
<li>Create containers/pods locally using Podman on the command line.</li>
<li>Verify these containers/pods locally or in a localized container runtime (on a different physical machine).</li>
<li>Snapshot the container and pod descriptions using Podman and help users re-create them in Kubernetes.</li>
<li>Users
add sophistication and orchestration (where Podman cannot) to the
snapshot descriptions and leverage advanced functions of Kubernetes.</li>
</ul>
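<div style="text-align: justify;">
A hedged sketch of that workflow (the pod and image names below are only examples):<br />
<i># create a local pod and add a container to it<br />
podman pod create --name demo-pod -p 8080:80<br />
podman run -d --pod demo-pod registry.access.redhat.com/ubi8/ubi sleep 1d<br />
# snapshot the pod description as Kubernetes-compatible YAML<br />
podman generate kube demo-pod > demo-pod.yaml<br />
# replay the YAML later, e.g. on another machine or after removing the original pod<br />
podman play kube demo-pod.yaml</i><br />
</div>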
<div style="text-align: justify;">
<i>How containers run in kubernetes cluster?</i></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This container stack within Red Hat Enterprise Linux and Red Hat Enterprise Linux CoreOS serves as part of the foundation for OpenShift. As can be seen in the drawing below, the CRI-O stack in OpenShift shares many of its underlying components with Podman. This allows Red Hat engineers to leverage knowledge gained in experiments conducted in Podman for new capabilities in OpenShift.</div>
<div style="text-align: justify;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-oSxMhVJPgCw/XsmvCojf0xI/AAAAAAAAU20/5nHejmTrjbUYbFKaWlM43KIvXlP3d3PWACLcBGAsYHQ/s1600/how-container-run_kebenetes.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="395" data-original-width="964" height="262" src="https://1.bp.blogspot.com/-oSxMhVJPgCw/XsmvCojf0xI/AAAAAAAAU20/5nHejmTrjbUYbFKaWlM43KIvXlP3d3PWACLcBGAsYHQ/s640/how-container-run_kebenetes.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://www.redhat.com/en/blog/why-red-hat-investing-cri-o-and-podman">source</a></td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-jykMrQ6nurA/XsmmJgDqPNI/AAAAAAAAU2M/njYJf6AS_oUaTse2t7uzeyMy6EhkX9TjwCLcBGAsYHQ/s1600/pod_architecture_podman.PNG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="647" data-original-width="1066" height="388" src="https://1.bp.blogspot.com/-jykMrQ6nurA/XsmmJgDqPNI/AAAAAAAAU2M/njYJf6AS_oUaTse2t7uzeyMy6EhkX9TjwCLcBGAsYHQ/s640/pod_architecture_podman.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>Pod-Architecture</b> <a href="https://developers.redhat.com/blog/2019/01/15/podman-managing-containers-pods/"><span style="font-size: xx-small;">source</span></a></td></tr>
</tbody></table>
<div style="text-align: justify;">
<br />
Every Podman pod includes an “infra” container. This container does nothing but sleep. Its purpose is to hold the namespaces associated with the pod and allow podman to connect other containers to the pod. This allows you to start and stop containers within the pod while the pod stays running, whereas if the primary container controlled the pod, this would not be possible. Most of the attributes that make up the pod are actually assigned to the “infra” container: port bindings, cgroup-parent values, and kernel namespaces are all assigned to it. This is critical to understand, because once the pod is created these attributes are assigned to the “infra” container and cannot be changed. </div>
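<div style="text-align: justify;">
As a hedged illustration (the pod name and port are arbitrary), the infra container shows up as soon as a pod is created, and pod-level attributes such as the port binding belong to it:<br />
<i>podman pod create --name web-pod -p 8080:80<br />
# the listing includes the pod's infra container even before any workload container is added<br />
podman ps -a --pod</i><br />
</div>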
<div style="text-align: justify;">
<br />
In the above diagram, notice the box above each container, conmon; this is the container monitor. It is a small C program whose job is to watch the primary process of the container and, if the container dies, save the exit code. It also holds open the tty of the container so that it can be attached to later. This is what allows podman to run in detached mode (backgrounded): podman can exit while conmon continues to run. Each container has its own instance of conmon.<br />
<br />
<br />
<b>Buildah :</b> The buildah command allows you to build container images either from the command line or using Dockerfiles. These images can then be pushed to any container registry and can be used by any container engine, including Podman, CRI-O, and Docker. Buildah specializes in building OCI images. Buildah’s commands replicate all of the commands that are found in a Dockerfile. Buildah’s goal is also to provide a lower-level coreutils interface to build container images, allowing people to build containers without requiring a Dockerfile. Buildah’s other goal is to allow you to use other scripting languages to build container images without requiring a daemon. The buildah command can be used as a separate command, but it is incorporated into other tools as well; for example, the podman build command uses buildah code to build container images. Buildah is also often used to securely build containers while running inside a locked-down container managed by a tool like Podman, OpenShift/Kubernetes, or Docker. Buildah allows you to have a Kubernetes cluster without any Docker daemon for both runtime and builds. So, when should you use Buildah and when should you use Podman? With Podman you can run, build (it calls Buildah under the covers for this), modify, and troubleshoot containers in your Kubernetes cluster. With the two projects together, you have a well-rounded solution for your OCI container image and container needs. Buildah and Podman are easily installable via <i>yum install buildah podman</i>. <br />
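A minimal, hedged sketch of a daemonless Buildah build without a Dockerfile (the base image, package, and image name are only illustrative):<br />
<i># start a working container from a base image<br />
ctr=$(buildah from registry.access.redhat.com/ubi8/ubi)<br />
# run a command inside it, set image metadata, then commit it as a new image<br />
buildah run "$ctr" -- dnf -y install httpd<br />
buildah config --port 80 --cmd "/usr/sbin/httpd -DFOREGROUND" "$ctr"<br />
buildah commit "$ctr" my-httpd:latest<br />
# the committed image can then be run by podman (or pushed for CRI-O/Docker to use)<br />
podman run -d -p 8080:80 localhost/my-httpd:latest</i><br />
<br />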
<br />
A quick and easy way to summarize the difference between the two projects is that the buildah run command emulates the RUN command in a Dockerfile, while the podman run command emulates the docker run command in functionality. Buildah is an efficient way to create OCI images, while Podman allows you to manage and maintain those images and containers in a production environment using familiar container CLI commands. Together they form a strong foundation to support your OCI container image and container needs.<br />
<br />
<b>skopeo</b>: The skopeo command is a tool for copying container images between different types of container storage. It can copy images from one container registry to another, to and from a host, and to other container environments and registries. Skopeo can inspect images in container image registries, fetch images and image layers, and use signatures to create and verify images. </div>
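<div style="text-align: justify;">
A hedged sketch of common skopeo operations (the destination registry below is a placeholder, and copies to a remote registry will normally require credentials):<br />
<i># inspect a remote image's metadata without pulling it<br />
skopeo inspect docker://registry.access.redhat.com/rhscl/mariadb-102-rhel7<br />
# copy an image from a registry into the local containers-storage used by podman<br />
skopeo copy docker://registry.access.redhat.com/ubi8/ubi containers-storage:localhost/ubi8:latest<br />
# copy an image between registries (registry.example.com is illustrative)<br />
skopeo copy docker://registry.access.redhat.com/ubi8/ubi docker://registry.example.com/myproject/ubi8</i><br />
</div>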
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Running containers as root or rootless :</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Running
the container tools such as podman, skopeo, or buildah as a user with
superuser privilege (root user) is the best way to ensure that your
containers have full access to any feature available on your system.
However, with the feature called "Rootless Containers," generally
available as of RHEL 8.1, you can work with containers as a regular
user.</div>
<br />
<div style="text-align: justify;">
Although container engines, such as Docker, let you run
docker commands as a regular (non-root) user, the docker daemon that
carries out those requests runs as root. So, effectively, regular users
can make requests through their containers that harm the system, without
there being clarity about who made those requests. By setting up
rootless container users, system administrators limit potentially
damaging container activities from regular users, while still allowing
those users to safely run many container features under their own
accounts.</div>
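<div style="text-align: justify;">
A hedged sketch of the usual rootless prerequisites on RHEL 8.1+; the UID/GID range below is only an example, and an administrator may already have provisioned it:<br />
<i># each rootless user needs a range of subordinate UIDs and GIDs<br />
grep "^$(whoami):" /etc/subuid /etc/subgid<br />
# if no entries exist, an administrator can add them, for example:<br />
sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $(whoami)<br />
# then containers run entirely under the unprivileged account<br />
podman run --rm registry.access.redhat.com/ubi8/ubi whoami</i><br />
</div>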
Also, note that Docker is a daemon-based container engine which allows us to deploy applications inside containers, as shown in the docker-workflow diagram below. With the release of RHEL 8 and CentOS 8, the docker package has been removed from the default package repositories; docker has been replaced with podman and buildah. If you are comfortable with docker, deploy most of your applications inside docker containers, and do not want to switch to podman, there is a way to install and use the community version of Docker on CentOS 8 and RHEL 8 systems by using the official Docker repository for CentOS 7/RHEL 7, which is compatible (a hedged installation sketch appears after the diagram). <br />
<div style="text-align: justify;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-RWOSQjIugq0/Xsvgen-DUkI/AAAAAAAAU3U/GzUiOVjoiHA_vhzpuoXeCdZ9pMV0BDpIQCLcBGAsYHQ/s1600/docker_arch.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="656" data-original-width="1179" height="356" src="https://1.bp.blogspot.com/-RWOSQjIugq0/Xsvgen-DUkI/AAAAAAAAU3U/GzUiOVjoiHA_vhzpuoXeCdZ9pMV0BDpIQCLcBGAsYHQ/s640/docker_arch.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: large;">Docker workflow</span></td></tr>
</tbody></table>
<br />
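As a hedged sketch (assuming you accept running the unsupported upstream Docker CE packages, and that dnf-plugins-core is installed for config-manager), the usual steps look like this:<br />
<i>sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo<br />
# --nobest is often needed because of containerd version constraints on RHEL 8/CentOS 8<br />
sudo dnf install -y docker-ce --nobest<br />
sudo systemctl enable --now docker</i><br />
<br />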
<b>NOTE: </b>Technology Preview features provide early access to upcoming product innovations, enabling you to test functionality and provide feedback during the development process. RHEL 8.2 provides access to technology previews of containerized versions of Buildah, a tool for building container images that comply with the Open Container Image (OCI) specification, and Skopeo, a tool that facilitates the movement of container images. Red Hat is adding Udica, a tool that makes it easier to create customized, container-centric SELinux security policies that reduce the risk that a process might “break out” of a container. RHEL 8.2 also introduces enhancements to the Red Hat Universal Base Image, which now supports OpenJDK and .NET 3.0, in addition to making it easier to access the source code associated with a given image via a single command. RHEL 8.2 also adds management and monitoring capabilities via updates to Red Hat Insights, which make it easier to define and monitor policies created by the IT organization and to reduce drift from the baselines initially defined by the IT team.<br />
----------------------------------------------------------------------------------------------------------------------------------<br />
<u><b>Podman installation on RHEL and small demo to illustrate with DB application:</b></u><br />
<b>Step 1:</b> <b><span style="font-family: "georgia" , "times new roman" , serif;"><i>yum -y install podman</i></span></b><br />
This command will install Podman and also its dependencies: atomic-registries, runC, skopeo-containers, and SELinux policies. Check this as shown below :<br />
<i>[root@IBMPOWER_sachin]# rpm -qa | grep podman<br />podman-1.6.4-18.el7_8.x86_64<br />[root@IBMPOWER_sachin]# rpm -qa | grep skopeo<br />skopeo-0.1.40-7.el7_8.x86_64<br />[root@IBMPOWER_sachin]# rpm -qa | grep runc<br />runc-1.0.0-67.rc10.el7_8.x86_64<br />[root@IBMPOWER_sachin]# </i><br />
<b>Step 2 :</b> Command-line examples to create container and run RHEL container <br />
<i>[root@IBMPOWER_sachin script]# <b><span style="font-family: "georgia" , "times new roman" , serif;">podman run -it rhel sh</span></b><br />Trying to pull registry.access.redhat.com/rhel...<br />Getting image source signatures<br />Copying blob feaa73091cc9 done<br />Copying blob e20f387c7bf5 done<br />Copying config 1a9b6d0a58 done<br />Writing manifest to image destination<br />Storing signatures<br />sh-4.2#</i><br />
<i>[root@IBMPOWER_sachin ~]# <span style="font-family: "georgia" , "times new roman" , serif;"><b>podman images</b></span><br />REPOSITORY TAG IMAGE ID CREATED SIZE<br />registry.access.redhat.com/rhel latest 1a9b6d0a58f8 2 weeks ago 215 MB<br />[root@IBMPOWER_sachin ~]#</i><br />
<b>Step 3 :</b> Install a containerized service for setting up a MariaDB database :<br />
Run a MariaDB persistent container - MariaDB 10.2 with some custom variables and try to let its “data” be persistent.<br />
<i>[root@IBMPOWER_sachin~]#<b><span style="font-family: "georgia" , "times new roman" , serif;"> </span></b></i><br />
<i><b><span style="font-family: "georgia" , "times new roman" , serif;">podman pull registry.access.redhat.com/rhscl/mariadb-102-rhel7</span></b></i><br />
<i>Trying to pull registry.access.redhat.com/rhscl/mariadb-102-rhel7...</i><br />
<i>Getting image source signatures</i><br />
<i>Copying blob 8574a8f8c7e5 done</i><br />
<i>Copying blob f60299098adf done</i><br />
<i>Copying blob 82a8f4ea76cb done</i><br />
<i>Copying blob a3ac36470b00 done</i><br />
<i>Copying config 66a314da15 done</i><br />
<i>Writing manifest to image destination</i><br />
<i>Storing signatures</i><br />
<i>66a314da15d608d89f7b589f6668f9bc0c2fa814ec9c690481a7a057206338bd</i><br />
<i>[root@IBMPOWER_sachin ~]#</i><br />
<i>[root@IBMPOWER_sachin ~]# <b><span style="font-family: "georgia" , "times new roman" , serif;">podman images</span></b></i><br />
<i>REPOSITORY TAG IMAGE ID CREATED SIZE</i><br />
<i>registry.access.redhat.com/rhscl/mariadb-102-rhel7 latest 66a314da15d6 11 days ago 453 MB</i><br />
<i>registry.access.redhat.com/rhel latest 1a9b6d0a58f8 2 weeks ago 215 MB</i><br />
<i>[root@IBMPOWER_sachin ~]#</i><br />
<br />
After you pull an image to your local system and before you run it, it is a good idea to investigate that image. Reasons for investigating an image before you run it include:<br />
<ul>
<li> Understanding what the image does</li>
<li> Checking what software is inside the image </li>
</ul>
Example: Get information about the “user ID running inside the container”, "ExposedPorts" and the “persistent volume location to attach“ ....etc as shown here:<i><b><br />podman inspect registry.access.redhat.com/rhscl/mariadb-102-rhel7 | grep User<br />podman inspect registry.access.redhat.com/rhscl/mariadb-102-rhel7 | grep -A1 ExposedPorts<br />podman inspect registry.access.redhat.com/rhscl/mariadb-102-rhel7 | grep -A1 Volume</b><br /> </i><br />
<b>Step 4 : </b>Set up a folder that will handle MariaDB’s data once we start our container:<br />
<i>[root@IBMPOWER_sachin ~]# <b>mkdir /root/mysql-data</b></i><br />
<i>[root@IBMPOWER_sachin ~]# <b>chown 27:27 /root/mysql-data</b></i><br />
<b>Step 5: </b>Run the container<br />
<i>[root@IBMPOWER_sachin ~]# <b><span style="font-family: "georgia" , "times new roman" , serif;"> </span></b></i><br />
<div style="text-align: justify;">
<i><b><span style="font-family: "georgia" , "times new roman" , serif;">podman run -d -v /root/mysql-data:/var/lib/mysql/data:Z -e MYSQL_USER=user -e MYSQL_PASSWORD=pass -e MYSQL_DATABASE=db -p 3306:3306 registry.access.redhat.com/rhscl/mariadb-102-rhel7</span></b></i></div>
<i>fd2d30f8ec72734a2eee100f89f35574739c7a6a30281be77998de466635b3b0</i><br />
<i>[root@IBMPOWER_sachin ~]# podman container list</i><br />
<i>CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES</i><br />
<i>fd2d30f8ec72 registry.access.redhat.com/rhscl/mariadb-102-rhel7:latest run-mysqld 9 seconds ago Up 9 seconds ago 0.0.0.0:3306->3306/tcp wizardly_jang</i><br />
<i>[root@IBMPOWER_sachin ~]# </i><br />
<i><b>Step 6:</b> check logs</i><br />
<i>[root@ ]# <b><span style="font-family: "georgia" , "times new roman" , serif;">podman logs fd2d30f8ec72 | head</span></b></i><br />
<i>=> sourcing 20-validate-variables.sh ...</i><br />
<i>=> sourcing 25-validate-replication-variables.sh ...</i><br />
<i>=> sourcing 30-base-config.sh ...</i><br />
<i>---> 11:03:27 Processing basic MySQL configuration files ...</i><br />
<i>=> sourcing 60-replication-config.sh ...</i><br />
<i>=> sourcing 70-s2i-config.sh ...</i><br />
<i>---> 11:03:27 Processing additional arbitrary MySQL configuration provided by s2i ...</i><br />
<i>=> sourcing 40-paas.cnf ...</i><br />
<i>=> sourcing 50-my-tuning.cnf ...</i><br />
<i>---> 11:03:27 Initializing database ...</i><br />
<b>Step 7: </b>That started and initialized its database . Lets create some table and check <br />
<i>[root@IBMPOWER_sachin ~]# <b><span style="font-family: "georgia" , "times new roman" , serif;">podman exec -it fd2d30f8ec72 /bin/bash</span></b></i><br />
<i>bash-4.2$ <b>mysql --user=user --password=pass -h 127.0.0.1 -P 3306 -t</b></i><br />
<i>Welcome to the MariaDB monitor. Commands end with ; or \g.</i><br />
<i>Your MariaDB connection id is 8</i><br />
<i>Server version: 10.2.22-MariaDB MariaDB Server</i>
<i>Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.</i>
<i>Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.</i>
<i>MariaDB [(none)]> show databases;</i><br />
<i>+--------------------+</i><br />
<i>| Database |</i><br />
<i>+--------------------+</i><br />
<i>| db |</i><br />
<i>| information_schema |</i><br />
<i>| test |</i><br />
<i>+--------------------+</i><br />
<i>3 rows in set (0.00 sec)</i>
<i>MariaDB [(none)]> use test;</i><br />
<i>Database changed</i><br />
<i>MariaDB [test]> show tables;</i><br />
<i>Empty set (0.00 sec)</i>
<i>MariaDB [test]> CREATE TABLE hpc_team (username VARCHAR(20), date DATETIME);</i><br />
<i>Query OK, 0 rows affected (0.00 sec)</i>
<i>MariaDB [test]> INSERT INTO hpc_team (username, date) VALUES ('Aboorva', Now());</i><br />
<i>Query OK, 1 row affected (0.00 sec)</i>
<i>MariaDB [test]> INSERT INTO hpc_team (username, date) VALUES ('Nysal', Now());</i><br />
<i>Query OK, 1 row affected (0.00 sec)</i>
<i>MariaDB [test]> INSERT INTO hpc_team (username, date) VALUES ('Sachin', Now());</i><br />
<i>Query OK, 1 row affected (0.00 sec)</i>
<i>MariaDB [test]> select * from hpc_team;</i><br />
<i>+----------+---------------------+</i><br />
<i>| username | date |</i><br />
<i>+----------+---------------------+</i><br />
<i>| Aboorva | 2020-05-26 11:12:41 |</i><br />
<i>| Nysal | 2020-05-26 11:12:55 |</i><br />
<i>| Sachin | 2020-05-26 11:13:08 |</i><br />
<i>+----------+---------------------+</i><br />
<i>3 rows in set (0.00 sec)</i>
<i>MariaDB [test]> quit</i><br />
<i>Bye</i><br />
<i>bash-4.2$</i><br />
<i>bash-4.2$ ls</i><br />
<i>aria_log.00000001 db ib_buffer_pool ib_logfile1 ibtmp1 mysql performance_schema test</i><br />
<i>aria_log_control fd2d30f8ec72.pid ib_logfile0 ibdata1 multi-master.info mysql_upgrade_info tc.log</i><br />
<i>bash-4.2$ cd test/</i><br />
<i>bash-4.2$ ls -alsrt</i><br />
<i>total 108</i><br />
<i> 4 drwxr-xr-x 6 mysql mysql 4096 May 26 11:03 ..</i><br />
<i> 4 -rw-rw---- 1 mysql mysql 483 May 26 11:12 hpc_team.frm</i><br />
<i> 4 drwx------ 2 mysql mysql 4096 May 26 11:12 .</i><br />
<i>96 -rw-rw---- 1 mysql mysql 98304 May 26 11:13 hpc_team.ibd</i><br />
<i>bash-4.2$</i><br />
<b>Step 8:</b> Check DB folder from host machine :<br />
<i>[root@IBMPOWER_sachin mysql-data]# cd test/</i><br />
<i>[root@IBMPOWER_sachin test]# ls -alsrt</i><br />
<i>total 108</i><br />
<i> 4 drwxr-xr-x 6 27 27 4096 May 26 07:03 ..</i><br />
<i> 4 -rw-rw---- 1 27 27 483 May 26 07:12 hpc_team.frm</i><br />
<i> 4 drwx------ 2 27 27 4096 May 26 07:12 .</i><br />
<i>96 -rw-rw---- 1 27 27 98304 May 26 07:13 hpc_team.ibd</i><br />
<i>[root@IBMPOWER_sachin test]#</i><br />
<br />
<b>Step 9: </b>We can set up our systemd unit file for handling the database. We’ll use a unit file as shown below:<i><br /></i><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: x-small;">cat <b>/etc/systemd/system/mariadb-service.service</b><br /><b>[Unit]<br />Description=Custom MariaDB Podman Container<br />After=network.target<br />[Service]<br />Type=simple<br />TimeoutStartSec=5m<br />ExecStartPre=-/usr/bin/podman rm "mariadb-service"<br />ExecStart=/usr/bin/podman run --name mariadb-service -v /root/mysql-data:/var/lib/mysql/data:Z -e MYSQL_USER=user -e MYSQL_PASSWORD=pass -e MYSQL_DATABASE=db -p 3306:3306 --net host registry.access.redhat.com/rhscl/mariadb-102-rhel7<br />ExecReload=-/usr/bin/podman stop "mariadb-service"<br />ExecReload=-/usr/bin/podman rm "mariadb-service"<br />ExecStop=-/usr/bin/podman stop "mariadb-service"<br />Restart=always<br />RestartSec=30<br />[Install]<br />WantedBy=multi-user.target</b></span></span><br />
<a name='more'></a><b><span style="font-size: x-small;"><span style="font-family: "georgia" , "times new roman" , serif;">where:</span></span></b><br />
<b><span style="font-size: x-small;">1) /usr/bin/podman run --name mariadb-service says we want to run a container that will be named mariadb-service.</span></b><br />
<b><span style="font-size: x-small;">2) –v /root/mysql-data:/var/lib/mysql/data:Z says we want to map the just-created data directory to the one inside the container. The Z option informs Podman to map correctly the SELinux context for avoiding permissions issues.</span></b><br />
<b><span style="font-size: x-small;">3) –e MYSQL_USER=user -e MYSQL_PASSWORD=pass -e MYSQL_DATABASE=db identifies the additional environment variables to use with our MariaDB container. We’re defining the username, the password, and the database name to use.</span></b><br />
<b><span style="font-size: x-small;">4) --net host maps the container’s network to the RHEL host.</span></b><br />
<b><span style="font-size: x-small;">5) registry.access.redhat.com/rhscl/mariadb-102-rhel7 specifies the container image to use.</span></b><br />
<b>Step 10:</b> reload the systemd catalog and start the service:<i><br /></i>[root@IBMPOWER_sachin system]#<i> </i><span style="font-family: "georgia" , "times new roman" , serif;"><b>podman ps</b></span><i><br />CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES</i>[root@IBMPOWER_sachin system]#<i> </i><span style="font-family: "georgia" , "times new roman" , serif;"><b>systemctl daemon-reload</b></span><i><br /></i>[root@IBMPOWER_sachin system]#<i> </i><span style="font-family: "georgia" , "times new roman" , serif;"><b>systemctl start mariadb-service</b></span><i><br /></i>[root@IBMPOWER_sachin system]#<i> </i><span style="font-family: "georgia" , "times new roman" , serif;"><b>systemctl status mariadb-service</b></span><i><br /><b><span style="font-size: xx-small;">● mariadb-service.service - Custom MariaDB Podman Container<br /> Loaded: loaded (/etc/systemd/system/mariadb-service.service; disabled; vendor preset: disabled)<br /> Active: active (running) since Wed 2020-05-27 06:42:31 EDT; 10s ago<br /> Process: 7405 ExecStartPre=/usr/bin/podman rm mariadb-service (code=exited, status=0/SUCCESS)<br /> Main PID: 7434 (podman)<br /> CGroup: /system.slice/mariadb-service.service<br /> └─7434 /usr/bin/podman run --name mariadb-service -v /root/mysql-data:/var/lib/mysql/data:Z -e MYSQL_USER=user -e MYSQL_PASSWORD=pass -e MYSQL_DATABASE=db -p ...<br />May 27 06:42:33 IBMPOWER_sachin podman[7434]: 2020-05-27 10:42:33 140527454361344 [Note] InnoDB: Buffer pool(s) load completed at 200527 10:42:33<br />May 27 06:42:33 IBMPOWER_sachin podman[7434]: 2020-05-27 10:42:33 140527961991360 [Note] Plugin 'FEEDBACK' is disabled.<br />May 27 06:42:33 IBMPOWER_sachin podman[7434]: 2020-05-27 10:42:33 140527961991360 [Note] Server socket created on IP: '::'.<br />May 27 06:42:33 IBMPOWER_sachin podman[7434]: 2020-05-27 10:42:33 140527961991360 [Warning] 'user' entry 'root@fd2d30f8ec72' ignored in --skip-name-resolve mode.<br />May 27 06:42:33 IBMPOWER_sachin podman[7434]: 2020-05-27 10:42:33 140527961991360 [Warning] 'user' entry '@fd2d30f8ec72' ignored in --skip-name-resolve mode.<br />May 27 06:42:33 IBMPOWER_sachin podman[7434]: 2020-05-27 10:42:33 140527961991360 [Warning] 'proxies_priv' entry '@% root@fd2d30f8ec72' ignored in --skip-name-resolve mode.<br />May 27 06:42:33 IBMPOWER_sachin podman[7434]: 2020-05-27 10:42:33 140527961991360 [Note] Reading of all Master_info entries succeded<br />May 27 06:42:33 IBMPOWER_sachin podman[7434]: 2020-05-27 10:42:33 140527961991360 [Note] Added new Master_info '' to hash table<br />May 27 06:42:33 IBMPOWER_sachin podman[7434]: 2020-05-27 10:42:33 140527961991360 [Note] /opt/rh/rh-mariadb102/root/usr/libexec/mysqld: ready for connections.May 27 06:42:33 IBMPOWER_sachin podman[7434]: Version: '10.2.22-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server </span></b></i><br />
<b><span style="font-size: x-small;">[root@IBMPOWER_sachin system]#<span style="font-family: "georgia" , "times new roman" , serif;"> podman ps</span><i><br />CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES<br />364b98ec614b registry.access.redhat.com/rhscl/mariadb-102-rhel7:latest run-mysqld 24 seconds ago Up 24 seconds ago mariadb-service</i></span></b><br />
<b><span style="font-size: x-small;">[root@IBMPOWER_sachin system]#<i> </i><span style="font-family: "georgia" , "times new roman" , serif;">systemctl stop mariadb-service</span><i><br />[root@IBMPOWER_sachin system]# </i><span style="font-family: "georgia" , "times new roman" , serif;"><i> </i>podman ps</span><i><br />CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES</i></span></b><br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: small;">NOTE: You check the running instance of container in step 10 . There was no container listed. Then you start container with systemd -systemctl tool and check the status [podman ps]. It should list the running instance of the container as per unit file configuration. Then you stop the container and check the status [podman ps] as shown above.</span><br />
<i>------------------------------------------------------------------------------------------------------------------------------------</i></div>
<b>Finally</b>, a list of upstream projects Red Hat participates in to bring innovation:<br />
<ul style="text-align: left;">
<li>Open Container Initiative : https://github.com/opencontainers</li>
<li>podman : https://github.com/containers/libpod</li>
<li>buildah : https://github.com/containers/buildah</li>
<li>skopeo : https://github.com/containers/skopeo</li>
<li>cri-o: https://github.com/cri-o/cri-o</li>
<li>runc : https://github.com/opencontainers/runc</li>
<li>containers/storage : https://github.com/containers/storage</li>
<li>containers/image : https://github.com/containers/image</li>
<li>Linux Kernel: https://www.kernel.org/</li>
<li>SELinux : https://selinuxproject.org/page/Main_Page</li>
<li>containers/udica : https://github.com/containers/udica</li>
<li>CRIU : https://www.criu.org/Main_Page</li>
<li>OpenSCAP : https://www.open-scap.org/</li>
<li>OKD : https://www.okd.io/</li>
</ul>
<h3 style="text-align: left;">
<span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><i>Conclusion : </i></span></h3>
<div style="text-align: justify;">
Investing in these technologies and others is strategic because it gives OpenShift the ability to add new capabilities like rootless containers, rootless builds, custom SELinux policies, user namespaces, and more. This can lead to better technology for everyone. Red Hat is pursuing a more modular approach to enabling organizations to build and deploy containers. There is no need to include tools for building containerized applications on instances of Linux where containers run. That approach reduces the overall size of the container platform in the server environment by replacing the Docker engine with the lighter-weight CRI-O container engine optimized for Kubernetes clusters. RHEL 8 has made CRI-O its default runtime for containers, and the pace of that transition is expected to increase as organizations continue to migrate from RHEL 7. Tools for building containers that are themselves deployed as containers should make it easier to incorporate these tools within existing DevOps workflows, and developers should be able to deploy them on any preferred platform for building applications. Those containerized applications are core to a Red Hat strategy that is keenly focused on RHEL being a foundation for Red Hat OpenShift, the application development and deployment platform based on Kubernetes that can be deployed on-premises or in the cloud. With tools to build those containerized applications being readily accessible, Red Hat is bringing its ultimate hybrid cloud computing ambitions into reality.</div>
<br />
<br />
<span style="font-size: x-small;"><b>Reference:</b></span><br />
<span style="font-size: xx-small;">https://www.redhat.com/en/blog/working-container-storage-library-and-tools-red-hat-enterprise-linux<br />https://opensource.com/article/19/5/shortcomings-rootless-containers<br />https://mkdev.me/en/posts/dockerless-part-3-moving-development-environment-to-containers-with-podman </span><br />
<span style="font-size: xx-small;">https://events19.linuxfoundation.org/wp-content/uploads/2017/11/How-Container-Runtime-Matters-in-Kubernetes_-OSS-Kunal-Kushwaha.pdf </span><br />
<span style="font-size: xx-small;">https://hpcw.github.io/ </span><br />
<span style="font-size: xx-small;">NOTE: contents updated as per the information available till May 2020. Later there will be more updates. </span></div>
Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com2tag:blogger.com,1999:blog-8306400524945620796.post-9527860729283894892020-03-25T05:49:00.002+05:302020-11-23T10:04:22.359+05:30 IBM Supercomputer Summit identified possible drug compounds for COVID19 vaccine <div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-QgohGynzeXo/XnqtnSll8GI/AAAAAAAAUrQ/zVH-_1myNeASAK8BUk2ZHZ3peuAaXzeRgCLcBGAsYHQ/s1600/drug.PNG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="426" data-original-width="714" height="190" src="https://1.bp.blogspot.com/-QgohGynzeXo/XnqtnSll8GI/AAAAAAAAUrQ/zVH-_1myNeASAK8BUk2ZHZ3peuAaXzeRgCLcBGAsYHQ/s320/drug.PNG" width="320" /></a></div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">America
is coming together to fight COVID-19, and that means unleashing
the full capacity of our world-class supercomputers to rapidly advance
scientific research for treatments and a vaccine. IBM announced an
initiative to increase access to
high-performance computing for groups researching and fighting the novel
coronavirus, also known as COVID-19. Scientists have enlisted the help
of a supercomputer to fight back
against the rapid spread of the novel coronavirus. Researchers from the
Oak Ridge National Laboratory just published the results of a project in
which they tasked the massive IBM supercomputer known as Summit with
finding the most effective existing drugs that could combat COVID-19.
Summit [described as the Formula One of supercomputers] can perform mathematical calculations at speeds that “boggle the mind”, i.e. it is capable of performing over 200 quadrillion calculations per second. That’s not a typo. This computation speed accelerates the process of discovery. <span style="font-size: xx-small;">More details on system specifications are available at <a href="http://www.sachinpbuzz.com/2019/07/ibms-summit-sierra-most-powerful.html?showComment=1580542016462">link</a></span></span></span></span></div>
<div style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"></span></span></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Supercomputers
can solve calculations and run experiments that, if done
on traditional computing systems or by hand, would take months or years.
In traditional computing systems and data centers, each computer
functions and does calculations independently. By contrast,
high-performance computers can work together and pass calculations
between one another to process information more quickly. Such computers
are also especially good for conducting research in areas like
epidemiology and molecular modeling because the systems mirror the
interconnectivity that exists in nature. </span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></span></div>
<div style="text-align: justify;">
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">IBM
partnered with the White House Office of Science and Technology Policy
and the Department of Energy to create the COVID-19 High Performance
Computing Consortium. The effort, which IBM started just last week, is
expected to harness powerful high-performance computing, or
“supercomputing,” resources that will massively increase the speed and
capacity of coronavirus-related research. The COVID-19 High-Performance
Computing Consortium includes the Seattle area’s powerhouses of cloud
computing, Amazon Web Services and Microsoft, as well as IBM and Google
Cloud. There are also academic partners (MIT and Rensselaer Polytechnic
Institute), federal agency partners (NASA and the National Science
Foundation) and five Department of Energy labs (Argonne, Lawrence
Livermore, Los Alamos, Oak Ridge and Sandia). Among the resources being
brought to bear is the world’s most powerful supercomputer, the Oak
Ridge Summit, which packs a 200-petaflop punch. The system will harness
16 supercomputing systems from IBM, national
laboratories, several universities, Amazon, Google, Microsoft and
others. Computing power will be provided via remote access to
researchers whose projects are approved by the consortium’s leadership
board, which will be comprised of tech industry leaders and White House
and Energy Department officials. The group plans to begin accepting
research proposals through an online portal.</span></span></span></div>
<div style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-UrN2t2hEaWg/XnqsKxjXYJI/AAAAAAAAUrE/IIoLPSz-1bQb4ziiiMXHYsQ6B4Zly2pwQCLcBGAsYHQ/s1600/summit_SC.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="463" data-original-width="749" height="394" src="https://1.bp.blogspot.com/-UrN2t2hEaWg/XnqsKxjXYJI/AAAAAAAAUrE/IIoLPSz-1bQb4ziiiMXHYsQ6B4Zly2pwQCLcBGAsYHQ/s640/summit_SC.PNG" width="640" /></a></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">US Department of Energy’s Oak Ridge National Laboratory(ORNL) has deployed the world’s most powerful and smartest supercomputer, the IBM built Summit, in the fight against COVID-19. Researchers from ORNLwere granted emergency computation time on Summit, using it to perform simulations with unprecedented speed. In just 2 days Summit identified and studied 77 small-molecule drug potential compounds to fight against the COVID-19 (new Coronavirus). A task that – using a traditional wet-lab approach – would have taken years. The researchers at Oak Ridge National Laboratory used Summit to perform simulations of more than 8,000 possible compounds to screen for those that have the most opportunity to have an impact on the disease, by binding to the main “spike” protein of the coronavirus, rendering it unable to infect host cells. Starting with over 8,000 compounds, Summit’s incredible power shortened the time of the experiment dramatically, ruling out the vast majority of possible medications before settling on 77 drugs which it ranked based on how effective they would likely be at halting the virus in the human body.</span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br />The paper, which was published in the journal ChemRxiv, focuses on the method the virus uses to bind to cells. Like other viruses, the novel coronavirus uses a spike protein to inject cells. Using Summit with an algorithm to investigate which drugs could bind to the protein and prevent the virus from doing its duty, the researchers now have a list of 77 drugs that show promise. They ranked the compounds of interest that could have value in experimental studies of the virus<a href="https://chemrxiv.org/articles/Repurposing_Therapeutics_for_the_Wuhan_Coronavirus_nCov-2019_Supercomputer-Based_Docking_to_the_Viral_S_Protein_and_Human_ACE2_Interface/11871402/3">[source]</a></span></span></span><br />
<br />
<b><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Whats gene mutation and Features, Evaluation with respect to COVID19 ?</span></span></span></b><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"></span></span></span><br />
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><a href="https://1.bp.blogspot.com/-yQbSd47ti8w/XnvlX0_Y89I/AAAAAAAAUrs/Vbe7iY2UNtg9it2spAYl0fw59XtxmzLLQCEwYBhgLKs4DAMBZVoAZDtFOMudHG4aqBTveGcsyMblRscFLFC_ChuyEZRFPfeg7vKCDxyBskrfhynf9zs6xVH_korfZXaI5tUhawsU8lzGmB0NzlsmBKYmnw_ylE0UF3svYs3l1EENrJGSYfRcFP7uirK4m5LMJCwlNhTfSaPl4uYmO4Pj19Zaq5b4halBsG9hJPsEBZoHPOrvlDbb7JtQAJJ2CR8Ro2lJBbmi0k9Jr_CNhDL13qKfoAaxEsd-BhxyoDfeitESgEY6-_JUxMfn0mbOxNPc_Wb-G_I8gjvxq_O_x563deeLuGtwHAnsixKSvNu8gih08wYrdupLbL-6i9D_1VMAh4C6YAohX1nXfgtRll5fsvil0mDsX-xz8U-XwjRU6DYkpVP2lV1qvFzVTJzaTQCRQcpApKSyxE3yDNvx8zCk7nNqiTvPM9CS77THwwcUC__9rxrY7u-ywXxqSLxMfhCUv-fsCwPX0adgY44oUPtDeEDiuDEkNj3VgUt5uLPhyWBAV0PZuad6BO8TQgnICHYIWkU_nY2zvbDRVMe2BsdJ_w2tsGzjhl4k7S6aXMOGmiizcXIvwQ4jyCJjqr3_uvUgKm_zAXt1s2RQqrpABzOzJMMTR7_MF/s1600/corona_virus.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"></a><a href="https://1.bp.blogspot.com/-yQbSd47ti8w/XnvlX0_Y89I/AAAAAAAAUrs/Vbe7iY2UNtg9it2spAYl0fw59XtxmzLLQCEwYBhgLKs4DAMBZVoAZDtFOMudHG4aqBTveGcsyMblRscFLFC_ChuyEZRFPfeg7vKCDxyBskrfhynf9zs6xVH_korfZXaI5tUhawsU8lzGmB0NzlsmBKYmnw_ylE0UF3svYs3l1EENrJGSYfRcFP7uirK4m5LMJCwlNhTfSaPl4uYmO4Pj19Zaq5b4halBsG9hJPsEBZoHPOrvlDbb7JtQAJJ2CR8Ro2lJBbmi0k9Jr_CNhDL13qKfoAaxEsd-BhxyoDfeitESgEY6-_JUxMfn0mbOxNPc_Wb-G_I8gjvxq_O_x563deeLuGtwHAnsixKSvNu8gih08wYrdupLbL-6i9D_1VMAh4C6YAohX1nXfgtRll5fsvil0mDsX-xz8U-XwjRU6DYkpVP2lV1qvFzVTJzaTQCRQcpApKSyxE3yDNvx8zCk7nNqiTvPM9CS77THwwcUC__9rxrY7u-ywXxqSLxMfhCUv-fsCwPX0adgY44oUPtDeEDiuDEkNj3VgUt5uLPhyWBAV0PZuad6BO8TQgnICHYIWkU_nY2zvbDRVMe2BsdJ_w2tsGzjhl4k7S6aXMOGmiizcXIvwQ4jyCJjqr3_uvUgKm_zAXt1s2RQqrpABzOzJMMTR7_MF/s1600/corona_virus.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="751" data-original-width="900" height="267" src="https://1.bp.blogspot.com/-yQbSd47ti8w/XnvlX0_Y89I/AAAAAAAAUrs/Vbe7iY2UNtg9it2spAYl0fw59XtxmzLLQCEwYBhgLKs4DAMBZVoAZDtFOMudHG4aqBTveGcsyMblRscFLFC_ChuyEZRFPfeg7vKCDxyBskrfhynf9zs6xVH_korfZXaI5tUhawsU8lzGmB0NzlsmBKYmnw_ylE0UF3svYs3l1EENrJGSYfRcFP7uirK4m5LMJCwlNhTfSaPl4uYmO4Pj19Zaq5b4halBsG9hJPsEBZoHPOrvlDbb7JtQAJJ2CR8Ro2lJBbmi0k9Jr_CNhDL13qKfoAaxEsd-BhxyoDfeitESgEY6-_JUxMfn0mbOxNPc_Wb-G_I8gjvxq_O_x563deeLuGtwHAnsixKSvNu8gih08wYrdupLbL-6i9D_1VMAh4C6YAohX1nXfgtRll5fsvil0mDsX-xz8U-XwjRU6DYkpVP2lV1qvFzVTJzaTQCRQcpApKSyxE3yDNvx8zCk7nNqiTvPM9CS77THwwcUC__9rxrY7u-ywXxqSLxMfhCUv-fsCwPX0adgY44oUPtDeEDiuDEkNj3VgUt5uLPhyWBAV0PZuad6BO8TQgnICHYIWkU_nY2zvbDRVMe2BsdJ_w2tsGzjhl4k7S6aXMOGmiizcXIvwQ4jyCJjqr3_uvUgKm_zAXt1s2RQqrpABzOzJMMTR7_MF/s320/corona_virus.PNG" width="320" /></a></span></span></span></div>
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">Mutation is a mundane aspect of existence for many viruses, and the novel coronavirus is no exception. This new virus COVID19 which is the acronym of "coronavirus disease 2019" seems to be very contagious and has quickly spread globally.The CoVs have become the major pathogens of emerging respiratory disease outbreaks. They are a large family of single-stranded RNA viruses (+ssRNA) that can be isolated in different animal species. For reasons yet to be explained, these viruses can cross species barriers and can cause, in humans, illness ranging from the common cold to more severe diseases such as MERS and SARS. The potential for these viruses to grow to become a pandemic worldwide seems to be a serious public health risk.</span></span></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">A mutation is an alteration in the nucleotide sequence of the genome of an organism, virus, or extrachromosomal DNA. Mutations result from errors during DNA replication, mitosis, and meiosis or other types of damage to DNA. The RNA viral genome can be double-stranded (as in DNA) or single-stranded. In some of these viruses, replication occurs quickly, and there are no mechanisms to check the genome for accuracy. This error-prone process often results in mutations. You can think of COVID19 and probably next mutated version COVID20 :) . A gene mutation is a permanent alteration in the DNA sequence that makes up a gene, such that the sequence differs from what is found in most people. Replication errors and DNA damage are actually happening in the cells of our bodies all the time.In most cases, however, they don't cause cancer, or even mutations. That's because they are usually detected and fixed by DNA proofreading and repair mechanisms. </span></span></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">CoVs are positive-stranded RNA viruses with a crown-like appearance under an electron microscope (coronam is the Latin term for crown) due to the presence of spike glycoproteins on the envelope. The subfamily Orthocoronavirinae of the Coronaviridae family (order Nidovirales) classifies into four genera of CoVs: Alphacoronavirus (alphaCoV), Betacoronavirus (betaCoV), Deltacoronavirus (deltaCoV), and Gammacoronavirus (gammaCoV). Furthermore, the betaCoV genus divides into five sub-genera or lineages.[2] Genomic characterization has shown that probably bats and rodents are the gene sources of alphaCoVs and betaCoVs. On the contrary, avian species seem to represent the gene sources of deltaCoVs and gammaCoVs. Members of this large family of viruses can cause respiratory, enteric, hepatic, and neurological diseases in different animal species, including camels, cattle, cats, and bats. To date, seven human CoVs (HCoVs) — capable of infecting humans — have been identified. </span></span></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">SARS-CoV-2 belongs to the betaCoVs category. It has round or elliptic and often pleomorphic form, and a diameter of approximately 60–140 nm. Like other CoVs, it is sensitive to ultraviolet rays and heat. Furthermore, these viruses can be effectively inactivated by lipid solvents. Its single-stranded RNA genome contains 29891 nucleotides, encoding for 9860 amino acids. Although its origins are not entirely understood, these genomic analyses suggest that SARS-CoV-2 probably evolved from a strain found in bats. The potential amplifying mammalian host, intermediate between bats and humans, is, however, not known. Since the mutation in the original strain could have directly triggered virulence towards humans, it is not certain that this intermediary exists. Because the first cases of the CoVID-19 disease were linked to direct exposure to the Huanan Seafood Wholesale Market of Wuhan, the animal-to-human transmission was presumed as the main mechanism. Nevertheless, subsequent cases were not associated with this exposure mechanism. Therefore, it was concluded that the virus could also be transmitted from human-to-human, and symptomatic people are the most frequent source of COVID-19 spread. The possibility of transmission before symptoms develop seems to be infrequent, although it cannot be excluded. Moreover, there are suggestions that individuals who remain asymptomatic could transmit the virus. This data suggests that the use of isolation is the best way to contain this epidemic. As with other respiratory pathogens, including flu and rhinovirus, the transmission is believed to occur through respiratory droplets from coughing and sneezing.</span></span></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-5X8Twh-7VwA/XoRrxGYHjQI/AAAAAAAAUs0/9QTxPwcJPp07tbjup33RTCvBcmUS7bTgQCLcBGAsYHQ/s1600/novel-corona-virus.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="327" data-original-width="681" height="306" src="https://1.bp.blogspot.com/-5X8Twh-7VwA/XoRrxGYHjQI/AAAAAAAAUs0/9QTxPwcJPp07tbjup33RTCvBcmUS7bTgQCLcBGAsYHQ/s640/novel-corona-virus.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://www.ijbs.com/v16p1753.htm">source</a></td></tr>
</tbody></table>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br />The novel SARS-CoV-2 coronavirus [COVID19] that emerged in the city of Wuhan, China, last year and has since caused a large scale COVID-19 epidemic and spread all over the world is the product of natural evolution. The scientists analyzed the genetic template for spike proteins, armatures on the outside of the virus that it uses to grab and penetrate the outer walls of human and animal cells. More specifically, they focused on two important features of the spike protein: the receptor-binding domain (RBD), a kind of grappling hook that grips onto host cells, and the cleavage site, a molecular can opener that allows the virus to crack open and enter host cells. A recent scientific article suggested that the novel coronavirus responsible for the Covid-19 epidemic has mutated into a more "aggressive" form. The genetic material of the virus is RNA. Unlike with human DNA, when viruses copy their genetic material, it does not proofread its work. Because RNA viruses essentially operate without a spell-check, they often make mistakes. These "mistakes" are mutations, and viruses mutate rapidly compared to other organisms. Mutations that are harmful to the viruses are less likely to survive and are eliminated through natural selection. Sadly, this new virus doesn’t have that deletion. When mutations occur that help a virus spread or survive better, they are unlikely to make a big difference in the course of an outbreak. Still, a common perception is that the continuous acquisition of mutations will cause our future coronavirus vaccines to be ineffective. While virus evolution may confer vaccine resistance, this process often takes many years for the right mutations to accumulate. A virologist at the Charité University had sequenced the virus from a German patient infected with COVID-19 in Italy. The genome looked similar to that of a virus found in a patient in Munich; both shared three mutations not seen in early sequences from China But he thought it was just as likely that a Chinese variant carrying the three mutations had taken independent routes to both countries. Like all viruses, SARS-CoV-2 evolves over time through random mutations, only some of which are caught and corrected by the virus’s error correction machinery. Scientists will also be scouring the genomic diversity for mutations that might change how dangerous the pathogen is or how fast it spreads. There, too, caution is warranted.<br /> <b> </b></span></span></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><b> </b><b>Why a supercomputer is needed to fight the coronavirus [COVID19]:</b></span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><b> </b><br />Viruses infect cells by binding to them and using a ‘spike’ to inject their genetic material into the host cell. To understand new biological compounds, like viruses, researchers in wet labs grow the micro-organism and see how it reacts in real-life to the introduction of new compounds. This is a slow process without powerful computers that can perform digital simulations to narrow down the range of potential variables. Computer simulations can examine how different variables react with different viruses. Each of these individual variables can comprise billions of unique data points. When these data points are compounded with multiple simulations, this can become a very time-intensive process if a conventional computing system is used.</span></span></span><br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;">These promising compounds could now play a role in developing new treatments or even a highly-effective vaccine that would keep the virus from taking root inside a person’s body. Right now, our best defense against the virus is social distancing, but a vaccine or treatment to ease symptoms and shorten recovery time would go a long way toward getting us on track for a return to normalcy. Researchers used the supercomputer to screen 8,000 compounds to identify the 77 most likely to bind to the main “spike” protein in the coronavirus and render it incapable of attaching to host cells in the human body. Those 77 compounds can now be experimented on with the aim of developing a coronavirus treatment. The supercomputer made it possible to avoid the lengthy process of experimenting on all 8,000 of those compounds</span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><br />The results from Summit don’t mean that a cure or treatment for the new coronavirus has been found. But scientists hope that the computational findings will inform future studies and provide a focused framework for wet-labs to further investigate the compounds. Only then will we know if any of them have the needed characteristics to attack and kill the virus. Going forward, the researchers plan to run the experiment again with a new, more accurate model of the protein spike that the virus uses. It’s possible that the new model will change which drugs are most effective against the virus and hopefully shorten the road to a treatment option. It will still be many months before we have a vaccine available, but scientists are hard at work on those solutions. IBM said Summit would continue to be used for "providing ground-breaking technology for the betterment of humankind".<br /><b><span style="font-size: xx-small;">Reference:</span></b><br /><span style="font-size: xx-small;">https://www.ibm.com/blogs/nordic-msp/ibm-supercomputer-summit-attacks-coronavirus<br />https://nypost.com/2020/03/23/supercomputer-finds-77-drugs-that-could-halt-coronavirus-spread<br />https://www.mercurynews.com/2020/03/22/ibm-partners-with-white-house-to-direct-supercomputing-power-for-coronavirus-research</span></span></span></span><br />
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-family: "georgia" , "times new roman" , serif;"><span style="font-size: xx-small;">https://www.ncbi.nlm.nih.gov/books/NBK554776/ </span></span></span></span></div>
<ul style="text-align: left;">
</ul>
<ol style="text-align: left;">
</ol>
</div>
Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0tag:blogger.com,1999:blog-8306400524945620796.post-79191326634524337652020-01-10T14:45:00.000+05:302020-01-10T17:40:42.904+05:30what's next in computing - Quantum logic - IBM Q<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Many of the world's biggest mysteries and potentially greatest opportunities remain beyond the grasp of classical computers. To continue the pace of progress, we need to augment the classical approach with a new platform, one that follows its own set of rules. That is quantum computing. The importance of quantum computing is simultaneously understated and widely over-hyped. Although it won't replace conventional computers, quantum innovation represents <b>a new computing paradigm</b>. As quantum computing technology advances, clients are becoming increasingly curious about how it might impact their business, and the intersection of industry and technology will be critical for clients to identify potential applications of quantum computing. The IBM Q Network is a collaboration of Fortune 500 companies, academic institutions, and research labs working together to advance quantum computing. IBM works with the sponsors, champions, and stakeholders who can act as influencers and drive the initial conversations. Quantum sponsors are frequently found in a CIO or innovation group that focuses on new and emerging technology; they tend to be interested in discussing specific industry use cases where there is high potential to leverage quantum for future business advantage. Mercedes-Benz and its parent company Daimler are working with International Business Machines Corp.'s quantum-computing division with the goal of deploying this next generation of computing power in selected use cases. For certain classes of problems, quantum computers are expected to be dramatically faster than even today's supercomputers. There are three basic types of quantum computers: quantum annealers, analog quantum simulators, and universal (gate-based) quantum computers. Quantum computers operate in a very different way from classical computers: they take advantage of the unusual phenomena of quantum mechanics, for example that subatomic particles can appear to exist in more than one state at any given time.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In September 2019, IBM became the first company to have a <b>fleet of quantum computers</b>. IBM's 14th quantum computer is its most powerful so far, a model with 53 qubits, the fundamental data-processing elements at the heart of the system. IBM is competing with companies like Google, Microsoft, Honeywell, Rigetti Computing, IonQ, Intel, and NTT in the race to make useful quantum computers. Another company, D-Wave, uses a different approach called <b>annealing </b>that already has some customers, while AT&amp;T and others are pursuing the even more distant realm of quantum networking. IBM's systems are housed at the IBM Quantum Computation Center in New York. The Q System One is the first quantum system to consolidate thousands of components into a glass-enclosed, air-tight environment built specifically for business use. <span style="font-size: xx-small;"><a href="https://www.youtube.com/watch?v=QRaEvXF4YBg">Click here for more info</a> </span></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-LbX3uTT3q-I/Xhg38er9GAI/AAAAAAAAUeM/mMmnR9S55z4BpRhkGw6SeauTdn0wQohBQCLcBGAsYHQ/s1600/IBM_q_123.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1195" data-original-width="1593" height="480" src="https://1.bp.blogspot.com/-LbX3uTT3q-I/Xhg38er9GAI/AAAAAAAAUeM/mMmnR9S55z4BpRhkGw6SeauTdn0wQohBQCLcBGAsYHQ/s640/IBM_q_123.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Multiple IBM Q systems are housed at the IBM Quantum Computing Center in New York</td></tr>
</tbody></table>
<div style="text-align: justify;">
While traditional computers store information as either 0s or 1s, quantum computers use quantum bits, or qubits, which can represent and store information as 0 and 1 simultaneously. That means quantum computers have the potential to sort through a vast number of possible solutions in a fraction of a second. Qubits are kept at an extremely cold temperature, around <b>1/100th the temperature of outer space</b>. Temperature here is measured on the kelvin scale, and zero kelvin is "<b>absolute zero</b>". IBM keeps its qubits at about <b>0.015 kelvin</b>, while the brisk air on a freezing cold winter day is at roughly 273 kelvin. Qubits are kept this cold to prolong their fragile quantum state: the longer qubits can be kept in a quantum state, the more operations can be performed on them while taking advantage of superposition and entanglement. Why 53 qubits, then? The number stems from a hexagonally derived lattice of qubits that is advantageous for minimizing unwanted interactions. Quantum computing remains a highly experimental field, limited by the difficult physics of the ultra-small and by the need to keep the machines refrigerated to within a hair's breadth of absolute zero so that outside disturbances do not ruin the calculations.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-UJQBt9AT-7k/XhhBIdht-kI/AAAAAAAAUeY/ELBIFO-e-oEoM2go7ACDlCW6aLxMQwEKQCLcBGAsYHQ/s1600/QC_chip.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="501" data-original-width="949" height="336" src="https://1.bp.blogspot.com/-UJQBt9AT-7k/XhhBIdht-kI/AAAAAAAAUeY/ELBIFO-e-oEoM2go7ACDlCW6aLxMQwEKQCLcBGAsYHQ/s640/QC_chip.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://www.cnet.com/news/ibm-new-53-qubit-quantum-computer-is-its-biggest-yet/"><span class="caption">A close-up view of the IBM Q quantum computer. The processor is in the silver-colored cylinder.</span></a></td></tr>
</tbody></table>
</div>
<div style="text-align: justify;">
Rigetti is racing against similar projects at Google, Microsoft, IBM,
and Intel. Every Bay Area startup will tell you it is doing something
momentously difficult, but Rigetti is biting off more than most – it's
working on quantum computing. All venture-backed startups face the
challenge of building a business, but this one has to do it by making
progress on one of tech's thorniest problems.</div>
<div style="text-align: justify;">
<span class="lede">Within the next </span>five years, Google will produce a viable quantum computer. That's the stake the company has just planted. "The field of quantum computing will soon achieve a historic milestone," They call this milestone "quantum supremacy." the world's biggest tech companies are already jockeying for their own
form of commercial supremacy as they anticipate a quantum breakthrough.
Both Google and IBM now say they will offer access to true quantum
computing over the internet (call it quantum cloud computing).after years spent developing quantum technologies, IBM is also trying to
prevent Google, a relative newcomer to the field, from stealing its
quantum mindshare. And it's still unclear whether the claims made by
these two companies will hold up. The future of quantum computing, like
the quantum state itself, remains uncertain. Rigetti is now entering the fray. The company launched its
own cloud platform, called Forest, where developers can write code for
simulated quantum computers, and some partners get to access the
startup's existing quantum hardware.</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-QNXMSaDCCcM/XfMcXSCia9I/AAAAAAAAUX8/2m9xN48LgWEczCipgSBLO0gSBLIgV5T9wCLcBGAsYHQ/s1600/q_compute.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="628" data-original-width="891" height="225" src="https://1.bp.blogspot.com/-QNXMSaDCCcM/XfMcXSCia9I/AAAAAAAAUX8/2m9xN48LgWEczCipgSBLO0gSBLIgV5T9wCLcBGAsYHQ/s320/q_compute.PNG" width="320" /></a></td></tr>
<tr align="justify"><td class="tr-caption" style="text-align: center;"><a href="https://www.wired.com/story/quantum-computing-factory-taking-on-google-ibm/?mbid=BottomRelatedStories">source</a></td></tr>
</tbody></table>
<div style="text-align: justify;">
Quantum computers can work in unison with current computing infrastructure to solve complex problems that were previously thought impractical or impossible. This can be paradigm-shifting. For example, the difficulty of factoring large numbers into their primes is the basis of modern cryptography. For the size of numbers used in modern public-private key encryption, this calculation would take on the order of trillions of years on a conventional computer; on a future quantum computer, it could take only minutes. As getting more power out of classical computers for a fixed amount of space, time, and resources becomes more challenging, completely new approaches like quantum computing become ever more interesting as we aim to tackle more complicated problems. Quantum computing could be a way to revive the rate of progress we have come to depend on in conventional computers, at least in some areas: "If you can successfully apply it to problems, it could give you an exponential increase in computing power that you can't get through traditional chip designs." That is because IBM sees a future beyond traditional computing. For decades, computing power has doubled roughly every two years, a pattern known as Moore's Law. Those advances have relied on making transistors ever smaller, thereby enabling each computer chip to have more calculation power, and IBM has repeatedly invented new ways to shrink transistors. IBM grew to its current size by leveraging the continued scaling of conventional computing, but that approach is finite, and its end is in sight. "Underlying Moore's Law is scaling, the ability to pack more and more transistors into a smaller and smaller space. At some point ... you're going to reach atomic dimensions and that's the end of that approach." The specter of a world in which silicon chips stop improving exponentially is part of what motivated the effort IBM has dubbed the IBM Q Network. We don't lack computing power today, but Moore's Law is visibly going into saturation.</div>
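<div style="text-align: justify;">As a rough back-of-the-envelope illustration of that claim, the Python sketch below compares the scaling of naive classical factoring with the approximate operation count of Shor's algorithm for a 2048-bit RSA modulus. The numbers are deliberately crude assumptions: real classical attacks use the general number field sieve (sub-exponential, but still utterly infeasible at this size), real quantum machines would need heavy error correction, and all constants are ignored. The point is only the exponential-versus-polynomial gap.</div>
<pre>
# Crude scaling comparison for factoring an n-bit RSA modulus. Naive trial
# division needs on the order of 2^(n/2) divisions; Shor's algorithm needs
# roughly n^3 elementary quantum operations. Constants and real-world
# overheads (error correction, the better classical GNFS algorithm) are all
# ignored; this only illustrates exponential vs. polynomial growth.
import math

n = 2048                                   # bits in a modern RSA modulus
ops_per_second = 1e12                      # a generous 10^12 operations/second
seconds_per_year = 3.15e7

# log10 of 2^(n/2) operations, converted to years; logs avoid float overflow.
log10_classical_ops = (n / 2) * math.log10(2)
log10_years = log10_classical_ops - math.log10(ops_per_second * seconds_per_year)
print("naive trial division: about 10^%.0f years" % log10_years)

quantum_ops = n ** 3                       # Shor's algorithm, order of magnitude
print("Shor's algorithm: about %.0e elementary operations" % quantum_ops)
</pre>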
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
A quantum system in a definite state can still behave randomly. This is a counter-intuitive idea of quantum physics. Quantum computers exist now because we have recently figured out how to control what has been in the world this whole time: the quantum phenomena of superposition, entanglement, and interference. These new ingredients in computing expand what is possible to design into algorithms. The word qubit has two meanings, one physical and one conceptual. Physically, it refers to the individual devices that are used to carry out calculations in quantum computers. Conceptually, a qubit is like a bit in a regular computer. It’s the basic unit of data in a quantum circuit.</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-y5spts28gYY/XhhYSvkwSRI/AAAAAAAAUe0/7QAq-kwRWsAs8oGYqForgKpkmT38VfJ0ACEwYBhgL/s1600/bit_qbit.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="517" data-original-width="935" height="352" src="https://1.bp.blogspot.com/-y5spts28gYY/XhhYSvkwSRI/AAAAAAAAUe0/7QAq-kwRWsAs8oGYqForgKpkmT38VfJ0ACEwYBhgL/s640/bit_qbit.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://blog.sintef.com/digital-en/diving-deep-into-quantum-computing/">Classical bits, that can only be 0 or 1, qubits can exist in a superposition of these states</a></td></tr>
</tbody></table>
<br />
<div style="text-align: justify;">
A <b>superposition</b> is a weighted sum or difference of two or more states; think, for example, of the state of the air when two or more musical tones sound at once. A "weighted sum or difference" means that some parts of the superposition are more or less prominently represented, such as when a violin is played more loudly than the other instruments in a string quartet. Ordinary, or "classical," superpositions commonly occur in macroscopic phenomena involving waves. Quantum theory predicts that a computer with n qubits can exist in a superposition of all 2^n of its distinct logical states, 000...0 through 111...1. This is exponentially more than a classical superposition: playing n musical tones at once can only produce a superposition of n states. A set of n coins, each of which might be heads or tails, can be described as a probabilistic mixture of 2^n states, but it actually is in only one of them; we just don't know which. Quantum computers, by contrast, are capable of holding their data in superpositions of 2^n distinct logical states. For this reason, quantum superposition is more powerful than classical probabilism, and quantum computers holding their data in superposition can solve some problems exponentially faster than any known classical algorithm. A more technical difference is that while probabilities must be positive (or zero), the weights in a superposition can be positive, negative, or even complex numbers.</div>
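<div style="text-align: justify;">The 2^n growth can be made concrete with a few lines of plain NumPy (no quantum SDK is assumed here): start n qubits in the all-zeros state, apply a Hadamard gate to each one, and look at the resulting statevector. This is only a classical simulation of the underlying linear algebra, which is exactly why such simulations become intractable as n grows.</div>
<pre>
# Minimal statevector sketch with plain NumPy. Applying a Hadamard gate to each
# of n qubits turns |00...0> into an equal superposition of all 2^n basis
# states, so the state description grows exponentially with the qubit count.
import numpy as np

n = 4                                             # number of qubits
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)      # single-qubit Hadamard gate

state = np.zeros(2 ** n)
state[0] = 1.0                                    # start in |0000>
H_n = H
for _ in range(n - 1):                            # n-fold tensor product of H
    H_n = np.kron(H_n, H)
state = H_n @ state

print(len(state))                                 # 16 amplitudes = 2^n states
print(np.round(state, 3))                         # each amplitude is 0.25
print(np.allclose(np.abs(state) ** 2, 1 / 2 ** n))  # uniform probabilities
</pre>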
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<b>Entanglement</b> is a property of most quantum superpositions and does not occur in classical superpositions; it is a core concept of quantum computing. In an entangled state, the whole system is in a definite state, even though the parts are not. Observing one of two entangled particles causes it to behave randomly, but tells the observer how the other particle would act if a similar observation were made on it. Because entanglement involves a correlation between individually random behaviors of the two particles, it cannot be used to send a message. Therefore, the term "instantaneous action at a distance," sometimes used to describe entanglement, is a misnomer. There is no action (in the sense of something that can be used to exert a controllable influence or send a message), only correlation, which, though uncannily perfect, can only be detected afterward when the two observers compare notes. The ability of quantum computers to exist in entangled states is responsible for much of their extra computing power.<br />
<br />
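Continuing with plain NumPy (again, no quantum hardware or SDK assumed), the toy snippet below samples measurements of a Bell pair. Each individual outcome is random, yet the two halves always agree, which is exactly the correlation-without-signalling behaviour described above.<br />
<pre>
# Toy sampling from an entangled Bell pair. Each single-qubit outcome looks
# like a fair coin flip, but the two outcomes are always perfectly correlated.
import numpy as np

rng = np.random.default_rng(0)
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)        # (|00> + |11>) / sqrt(2)
probs = np.abs(bell) ** 2                          # probabilities of 00,01,10,11

outcomes = rng.choice(["00", "01", "10", "11"], size=10, p=probs)
print(outcomes)                                    # only "00" and "11" appear
print(all(o[0] == o[1] for o in outcomes))         # first qubit always matches second
</pre>
<br />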
One important factor is noise: physical qubits are much more sensitive to noise than transistors in regular circuits. The ability to hold a quantum state is called coherence. The longer the coherence time, the more operations researchers can perform in a quantum circuit before resetting it, and the more sophisticated the algorithms that can be run on it. To reduce errors, quantum computers need qubits that have long coherence times, and physicists need to be able to control quantum states more tightly, with simpler electrical or optical systems than are standard today. A quantum computer will need about 200 or so perfect qubits to perform chemical simulations that are impossible on classical computers. Because qubits are so prone to error, though, these systems are likely to require redundancy, with tens or perhaps hundreds of faulty qubits doing the work of one ideal qubit that gives the right answer. These so-far-theoretical ideal qubits are often called "logical qubits" or "error-corrected qubits." So it doesn't make sense to increase the number of qubits before you improve your error rates.</div>
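<div style="text-align: justify;">The redundancy idea can be illustrated with a purely classical analogy: a repetition code, sketched below in Python. Real quantum error correction (for example, the surface code) is far more subtle, since qubits cannot simply be copied, but the sketch shows why many noisy physical components can stand in for one much more reliable logical one, and why lowering the physical error rate matters so much.</div>
<pre>
# Classical repetition-code analogy for logical vs. physical error rates.
# Each physical copy flips independently with probability p; a majority vote
# recovers the logical bit unless more than half the copies are corrupted.
import random

def logical_error_rate(p, copies, trials=100_000):
    errors = 0
    for _ in range(trials):
        flips = sum(random.random() < p for _ in range(copies))
        if flips > copies // 2:            # majority corrupted: logical error
            errors += 1
    return errors / trials

p = 0.01                                    # 1% physical error rate
for copies in (1, 3, 7, 15):
    print(copies, "copies ->", logical_error_rate(p, copies))
</pre>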
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
A leap from bits to qubits: this two-letter change could mean entirely new horizons for healthcare. Quantum computing might bring dramatically faster drug design, in silico clinical trials with virtual humans simulated 'live', full-speed whole-genome sequencing and analytics, the movement of hospitals to the cloud, the achievement of predictive health, or the security of medical data via quantum uncertainty. Quantum computing could enable exponential speedups for certain classes of problems by exploiting superposition and entanglement in the manipulation of quantum bits (qubits). One example is in the artificial intelligence space: optimisation algorithms that exploit superposition could speed up the optimisation problems at the heart of training, eventually leading to better and faster learning algorithms.</div>
<br />
<div style="text-align: justify;">
<b>Quantum encryption</b>, as its name suggests, relies on the quantum properties of photons, atoms, and other small units of matter to secure information. In this case, the physicists used a quantum property of photons known as polarization, which more or less describes the orientation of a photon. For an intercontinental quantum-encrypted teleconference, they assigned photons two different polarizations, to represent 1's and 0's. In this way, a beam of light becomes a cryptographic key that can be used to scramble a digital message. If implemented the way physicists first envisioned it back in the 1980s, quantum encryption would be unbreakable. The protocol is a bit complicated, but it essentially involves the sender transmitting photons to the recipient to form a key, and both parties sharing part of the key publicly. If someone had tried to intercept it, the recipient's key would not match the sender's key in a specific statistical way, set by rules in quantum mechanics, and the sender would immediately know the key was compromised. Physicists also see quantum encryption as an important tool for when quantum computers finally become functional. These quantum computers, or more likely the ones to follow a few decades later, could break the best encryption algorithms of today, but no computer could crack a properly quantum-encrypted message. Key words: properly encrypted. When physicists started to actually build quantum networks, they couldn't achieve their vision of perfect quantum encryption. It turns out that sending photons thousands of miles across the world through free space, optical fiber, and relay stations, all without corrupting their polarization, is extremely challenging technically. Quantum signals die after about 100 miles of transmission through optical fiber, and no one yet knows how to amplify a quantum signal without destroying its quantum state. The best quantum memories today can only store a key for a matter of minutes before the information disappears. So physicist Jian-Wei Pan's group had to incorporate conventional telecom technology to propagate their quantum signals: at several points in their network, they had to convert quantum information (polarizations) into classical information (voltages and currents) and then back into quantum form. This isn't ideal, because the absolute security of a quantum key relies on its quantum-ness. Anytime the key gets converted into classical information, normal hacking rules apply.<br />
<br />
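The key-forming protocol sketched in words above (send photons in random bases, then publicly compare bases and a sample of bits) can be imitated with a short classical simulation. The sketch below is a toy BB84-style model in Python, not real photons, and the roughly 25% error rate it reports for an intercept-and-resend eavesdropper is the textbook expectation rather than a measured result.<br />
<pre>
# Toy BB84-style key exchange, simulated classically. The sender encodes random
# bits in random bases; the receiver measures in random bases; both keep only
# the positions where their bases happened to match (sifting). An eavesdropper
# who measures and resends in a random basis scrambles about 25% of the sifted
# bits, which the parties detect by publicly comparing a sample of the key.
import random

N = 2000
send_bits  = [random.randint(0, 1) for _ in range(N)]
send_bases = [random.choice("+x") for _ in range(N)]   # '+' rectilinear, 'x' diagonal
eavesdrop  = True                                       # set False to compare

recv_bits, recv_bases = [], []
for bit, basis in zip(send_bits, send_bases):
    if eavesdrop:                       # intercept-and-resend attack
        eve_basis = random.choice("+x")
        if eve_basis != basis:
            bit = random.randint(0, 1)  # wrong basis randomises the polarisation
        basis = eve_basis               # photon is re-sent in Eve's basis
    recv_basis = random.choice("+x")
    recv_bases.append(recv_basis)
    recv_bits.append(bit if recv_basis == basis else random.randint(0, 1))

kept = [i for i in range(N) if send_bases[i] == recv_bases[i]]   # sifting
errors = sum(send_bits[i] != recv_bits[i] for i in kept)
print("sifted key length:", len(kept))
print("error rate: %.1f%%" % (100.0 * errors / len(kept)))
</pre>
<br />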
Quantum computers working with classical systems have the potential to solve complex real-world problems such as simulating chemistry, modelling financial risk, and optimizing supply chains. One of the areas where this technology is being applied is chemistry simulation, for example to understand how materials behave and how chemicals interact. One particularly interesting problem is designing the chemical composition of more effective batteries, which could be used in the next generation of electric vehicles. Exxon Mobil plans to use quantum computing to better understand catalytic and molecular interactions that are too difficult to calculate with classical computers. Potential applications include more predictive environmental models and highly accurate quantum chemistry calculations to enable the discovery of new materials for more efficient carbon capture.<br />
<br />
JP Morgan Chase is focusing on use cases for quantum computing in the financial industry, including trading strategies, portfolio optimization, asset pricing and risk analysis.<br />
<br />
<b>Accelerating drug discovery through quantum-computing molecule comparison:</b> Molecular comparison is an important process in early-phase drug design and discovery. Today, it often takes pharmaceutical companies more than 10 years and billions of dollars to discover a new drug and bring it to market. Improving the front end of the process with quantum computing could dramatically cut costs and time to market, make it easier to repurpose pre-approved drugs for new applications, and empower computational chemists to make discoveries faster, potentially leading to cures for a range of diseases.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-7t3ron7a-uk/XhhmBRvqgmI/AAAAAAAAUe8/qAIWzOW8pio9HpmnGSFCelShkX_2G2rqgCEwYBhgL/s1600/qc_drug.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="701" data-original-width="953" height="470" src="https://1.bp.blogspot.com/-7t3ron7a-uk/XhhmBRvqgmI/AAAAAAAAUe8/qAIWzOW8pio9HpmnGSFCelShkX_2G2rqgCEwYBhgL/s640/qc_drug.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://blog.sintef.com/digital-en/diving-deep-into-quantum-computing/">Use case for quantum computers is the design of new materials and drugs</a></td></tr>
</tbody></table>
<br />
<b>Revolutionizing the molecule comparison process:</b> Quantum computing has the potential to change the very definition of molecular comparison by enabling pharmaceutical and material-science companies to develop methods to analyze larger-scale molecules. Today, companies can run hundreds of millions of comparisons on classical computers; however, they are limited to molecules up to the size that a classical computer can actually handle. As quantum computers become more readily available, it will be possible to compare much larger molecules, which opens the door to more pharmaceutical advancements and cures for a range of diseases.</div>
<div style="text-align: justify;">
Discovering new battery materials could unlock a billion-dollar opportunity for the automotive industry. In this <b>use case</b>, the actual behavior of a battery would be simulated on a quantum computer, which is currently not possible with existing computing power. Daimler joins other automotive companies experimenting with quantum computing's potential applications. Ford Motor Co. is researching how the technology could quickly optimize driving routes and improve the structure of batteries for electric vehicles. Volkswagen AG is developing a quantum-computing-based traffic-management system that could be offered as a commercial service; it is also interested in developing more advanced batteries. Today, battery development and testing is a physical process that requires experts to build prototypes first, because there is no adequate simulation software. A quantum computer could help Mercedes-Benz find new materials or combinations of materials that result in better electrochemical performance and longer battery life cycles. Some of those innovations could include organic batteries, which could be safer, more energy efficient, and environmentally friendly.<br />
<br />
</div>
<div style="text-align: justify;">
<b>Full-Scale Fault Tolerance</b><br />
<br />
Full-scale fault tolerance is still likely decades away. A universal fault-tolerant quantum computer is the grand
challenge of
quantum computing. It is a device that can properly perform universal
quantum operations using unreliable components. Today's quantum
computers are not fault-tolerant. Achieving full-scale fault tolerance
will require makers of quantum technology to overcome additional
technical constraints, including problems related to scale and
stability. But once they arrive, we expect fault-tolerant quantum
computers to affect a broad array of industries. They have the potential
to vastly reduce trial and error and improve automation in the
specialty-chemicals market, enable tail-event defensive trading and
risk-driven high-frequency trading strategies in finance, and even
promote in silico drug discovery, which has major implications for
personalized medicine.<br />
<br />
Now is the right time for business leaders to prepare for quantum. The conditions are in place to experiment with and expand this fundamentally new technology. Organizations that seek to be at the forefront of this
transformational shift will seize competitive advantage. Rigetti—like Google, IBM, and Intel—preaches the idea that this advance
will bring about a wild new phase of the cloud computing revolution.
Data centers stuffed with quantum processors will be rented out to
companies freed to design chemical processes and drugs more quickly, or
deploy powerful new forms of machine learning. Over the last decade, banks and government institutions in multiple
countries including the US, China, and Switzerland have dabbled in
quantum encryption products, but Christensen suspects that the
technology will be niche for a while longer. Because the technology is
so new, the costs and benefits aren't clear yet. The IBM Q Network is working with 45 clients, including startups, academic institutions, and Fortune 500 companies. Large enterprise clients are investing in the emerging technology now so they will be prepared when a commercial-grade quantum computer, capable of error correction and of solving large-scale problems, comes to market. With all this promise, it's little surprise that the projected value creation numbers get very big over time.</div>
<br />
-----------------<br />
<b>Reference:</b><br />
<br />
<span style="font-size: xx-small;">https://www.ibm.com/thought-leadership/innovation_explanations/article/dario-gil-quantum-computing.html</span><br />
<span style="font-size: xx-small;">Dario Gil, IBM Research : https://www.youtube.com/watch?v=yy6TV9Dntlw</span><br />
<span style="font-size: xx-small;">https://www.youtube.com/watch?v=lypnkNm0B4A</span><br />
<span style="font-size: xx-small;">https://fortune.com/longform/business-quantum-computing/</span><br />
<span style="font-size: xx-small;">https://www.wired.com/2017/03/race-sell-true-quantum-computers-begins-really-exist/?mbid=BottomRelatedStories</span><br />
<span style="font-size: xx-small;">https://www.wired.com/story/quantum-computing-factory-taking-on-google-ibm/?mbid=BottomRelatedStories</span><br />
<span style="font-size: xx-small;">https://www.wired.com/story/why-this-intercontinental-quantum-encrypted-video-hangout-is-a-big-deal/?mbid=BottomRelatedStories</span><br />
<span style="font-size: xx-small;">https://www.accenture.com/ro-en/success-biogen-quantum-computing-advance-drug-discovery </span><br />
<br /></div>
Sachin P Bhttp://www.blogger.com/profile/13393800234271237966noreply@blogger.com0