Newbie's Guide to Spectrum LSF, Message Passing Interface (MPI), Kubernetes, Big Data applications, Docker, Jenkins, Spark, Hadoop, Quantum Computing, the Linux operating system and its features, Git, DevOps, and more!
Monday, September 4, 2023
Openstack Framework and components
OpenStack is an open-source cloud computing platform that provides a set of software tools and components for building and managing public and private clouds. It enables organizations to create and manage cloud infrastructure services, including compute, storage, networking, and more, and it is designed to be highly flexible, scalable, and customizable, making it a popular choice for building cloud solutions. OpenStack was initially launched in July 2010 as a joint project by Rackspace Hosting and NASA, and it has since grown into a vibrant open-source community with contributions from a wide range of organizations and individuals. Here's a brief history of OpenStack and an overview of its main components:
OpenStack History:
Launch (2010): OpenStack was publicly launched in July 2010 with the release of the first two core projects, Nova (compute) and Swift (object storage). It was created to address the need for an open and flexible cloud computing platform.
Expanding Community (2011-2012): The OpenStack community quickly expanded, with numerous companies joining the project. The community released new versions of OpenStack, including Diablo, Essex, and Folsom, each with additional core and supporting projects.
Foundation Establishment (2012): In September 2012, the OpenStack Foundation was established to oversee the project's development and ensure its long-term governance as an open-source project.
Maturing Ecosystem (2013-2015): OpenStack continued to evolve, with new releases like Grizzly, Havana, Icehouse, and Juno. During this period, more projects were added to the ecosystem, covering areas such as networking (Neutron), block storage (Cinder), and identity (Keystone).
Enterprise Adoption (2016-2017): OpenStack gained significant traction among enterprises and service providers. Projects like Heat (orchestration) and Magnum (containers) were introduced to support cloud automation and container orchestration.
Continued Growth (2018-Present): OpenStack has continued to grow and evolve, with new projects and features being added regularly. The community releases new versions of OpenStack every six months, with each version introducing enhancements and improvements.
OpenStack Releases: At the time of writing, the current OpenStack release is "Xena". Austin was the first OpenStack release and is now obsolete. For more details, check the links below:
https://docs.openstack.org/puppet-openstack-guide/latest/install/releases.html
https://releases.openstack.org/
Austin (2010): The first official release of OpenStack, code-named "Austin."
Bexar (2011): The second release, code-named "Bexar."
Cactus (2011): The third release, code-named "Cactus."
Diablo (2011): The fourth release, code-named "Diablo."
Essex (2012): The fifth release, code-named "Essex."
Folsom (2012): The sixth release, code-named "Folsom."
Grizzly (2013): The seventh release, code-named "Grizzly."
Havana (2013): The eighth release, code-named "Havana."
Icehouse (2014): The ninth release, code-named "Icehouse."
Juno (2014): The tenth release, code-named "Juno."
Kilo (2015): The eleventh release, code-named "Kilo."
Liberty (2015): The twelfth release, code-named "Liberty."
Mitaka (2016): The thirteenth release, code-named "Mitaka."
Newton (2016): The fourteenth release, code-named "Newton."
Ocata (2017): The fifteenth release, code-named "Ocata."
Pike (2017): The sixteenth release, code-named "Pike."
Queens (2018): The seventeenth release, code-named "Queens."
Rocky (2018): The eighteenth release, code-named "Rocky."
Stein (2019): The nineteenth release, code-named "Stein."
Train (2019): The twentieth release, code-named "Train."
Ussuri (2020): The twenty-first release, code-named "Ussuri."
Victoria (2020): The twenty-second release, code-named "Victoria."
Wallaby (2021): The twenty-third release, code-named "Wallaby."
Xena (2021): The twenty-fourth release, code-named "Xena."
Yoga (2022): The twenty-fifth release, code-named "Yoga."
Zed (2022): The twenty-sixth release, code-named "Zed."
OpenStack is built using a modular architecture in which each project provides a specific cloud service. These components can be combined to create a custom cloud infrastructure tailored to an organization's needs, which makes OpenStack a versatile and customizable platform for building private, public, and hybrid clouds.
- Multi-Tenancy: OpenStack supports multi-tenancy, allowing organizations to create isolated environments within the cloud infrastructure. This means that multiple users or projects can share the same cloud while maintaining security and resource separation.
- Open Source: OpenStack is released under an open-source license, making it freely available for anyone to use, modify, and contribute to. This open nature has led to a vibrant community of developers and users collaborating on its development.
- Integration and Compatibility: OpenStack is designed to integrate with various virtualization technologies, hardware vendors, and third-party tools. It can be used with different hypervisors, storage systems, and networking solutions.
- Private and Public Clouds: Organizations can use OpenStack to create private clouds within their data centers or deploy public cloud services to offer cloud resources to external customers or users.
- Hybrid Clouds: OpenStack can be part of a hybrid cloud strategy, where organizations combine private and public cloud resources to achieve flexibility and scalability
Here are some of the main components:
- Nova (Compute): Manages and orchestrates virtual machines (instances) on hypervisors. It provides features for creating, scheduling, and managing VMs.
- Swift (Object Storage): Offers scalable and durable object storage services for storing and retrieving data, including large files and unstructured data.
- Cinder (Block Storage): Manages block storage volumes that can be attached to instances. It provides persistent storage for VMs.
- Neutron (Networking): Handles networking services, including the creation and management of networks, subnets, routers, and security groups.
- Keystone (Identity): Manages identity and authentication services, including user management, role-based access control (RBAC), and token authentication.
- Glance (Image Service): Stores and manages virtual machine images (VM snapshots) that can be used to create instances.
- Horizon (Dashboard): A web-based user interface that provides a graphical way to manage and monitor OpenStack resources.
- Heat (Orchestration): Provides orchestration and automation services for defining and managing cloud application stacks.
- Ceilometer (Telemetry): Collects telemetry data, including usage and performance statistics, for billing, monitoring, and auditing.
- Trove (Database-as-a-Service): Manages database instances as a service, making it easier to provision and manage databases.
- Ironic (Bare Metal): Manages bare-metal servers as a service, allowing users to provision physical machines in the same way as virtual machines.
- Zaqar (Messaging and Queuing): Provides messaging and queuing services for distributed applications.
- Magnum (Container Orchestration): Orchestrates container platforms like Kubernetes to manage containerized applications.
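Most of these services expose REST APIs that can also be driven programmatically. As a minimal sketch (my own illustration, not part of the original post), assuming the openstacksdk Python package is installed and a cloud named "mycloud" is defined in clouds.yaml, listing resources from a few of these services could look like this:
# Minimal sketch using the openstacksdk library (assumed installed via `pip install openstacksdk`).
# "mycloud" is a hypothetical entry in clouds.yaml holding the auth URL and credentials.
import openstack

# Establish an authenticated connection (Keystone handles the token exchange)
conn = openstack.connect(cloud="mycloud")

# Nova: list compute instances
for server in conn.compute.servers():
    print("server:", server.name, server.status)

# Glance: list available images
for image in conn.image.images():
    print("image:", image.name)

# Neutron: list networks
for network in conn.network.networks():
    print("network:", network.name)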
Postman is a comprehensive tool for sending HTTP requests to APIs, receiving responses, and performing API testing, monitoring, and development. It provides a user-friendly interface for building and sending API requests, inspecting responses, and automating API tests: users describe a request in the UI, Postman generates the corresponding HTTP request, sends it, and presents the response for inspection. It also supports more advanced features such as scripting, automation, and test execution for comprehensive API testing and monitoring. It's widely used by developers to:
- Test APIs: Developers can use Postman to send requests to APIs and receive responses, making it easy to test how the API functions.
- Automate Tests: Postman allows you to create and automate test scripts to ensure that your APIs are working as expected. You can set up tests to validate the response data, headers, and more.
- Document APIs: You can use Postman to generate API documentation, which is useful for sharing information about how to use an API with others.
- Monitor APIs: Postman can be used to monitor APIs and receive alerts when issues or errors occur.
- Mock Servers: Postman provides the ability to create mock servers, which can simulate an API's behavior without the actual backend being implemented yet.
Here's how Postman is involved and invoked internally when working with the examples provided:
1) User Interface (UI): Postman provides a user-friendly graphical interface where users can create, manage, and send API requests. Users interact with this UI to input API details, such as request URLs, headers, parameters, and request bodies.
2) Request Configuration: When you create a request in Postman, you configure various aspects of the request, including the request method (e.g., GET, POST, PUT, DELETE), request URL, headers, query parameters, request body (if applicable), and authentication settings.
3) HTTP Request Generation: Postman internally generates the corresponding HTTP request based on the user's configuration. For example, if you configure a GET request to retrieve user data, Postman generates an HTTP GET request to the specified URL with the provided headers and parameters.
4) Request Sending: When you click the "Send" button within Postman, it sends the generated HTTP request to the target API endpoint using the configured settings (e.g., URL, headers, body). This request is sent via the HTTP protocol to the specified API server.
5) API Server Interaction: The HTTP request sent by Postman is received by the API server. The server processes the request based on the HTTP method, URL, and other request details. For example, in a RESTful API, a GET request may retrieve data, while a POST request may create new data.
6) Response Reception: After the API server processes the request, it sends an HTTP response back to Postman. This response includes data (e.g., JSON or XML) and metadata (e.g., status code, headers) generated by the server.
7) Response Handling: Postman receives the HTTP response and presents it to the user within its UI. The user can inspect the response content, status code, headers, and other details. Postman also provides tools for handling response data, such as extracting values or running tests.
8) Test Execution: Users can define tests and assertions within Postman using scripts (e.g., JavaScript). When a test script is defined, Postman internally executes the script and checks the results against the specified assertions.
9) Results Reporting: Postman provides feedback to the user about the outcome of the API request and any tests that were run. Users can view whether the request was successful, the response met the expected criteria, and any potential errors or issues.
10) Automation: Postman can be integrated into automated testing pipelines, continuous integration (CI) workflows, and monitoring systems. It can be invoked programmatically to run collections of requests, automate tests, and monitor APIs at specified intervals.
Examples: Make sure you have access to a RESTful API that you want to test, and replace the URLs, endpoints, and parameters with the appropriate values for your specific API.
1) GET Request to Retrieve Data. To retrieve data from an API using a GET request:
- GET https://api.example.com/users
2) GET Request with Query Parameters. To retrieve data with query parameters:
- GET https://api.example.com/users?id=123&name=John
3) POST Request to Create Data. To create data using a POST request with a JSON body:
- POST https://api.example.com/users
Content-Type: application/json
Body (JSON):
{
"name": "Alice",
"email": "alice@example.com"
}
4) PUT Request to Update Data. To update data using a PUT request with a JSON body:
- PUT https://api.example.com/users/123
Headers:
Content-Type: application/json
Body (JSON):
{
"name": "Updated Name",
"email": "updated@example.com"
}
5) DELETE Request to Remove Data. To delete data using a DELETE request:
- DELETE https://api.example.com/users/123
6) Headers and Authentication. You can add headers, such as authorization headers, to your requests. For example, to send an API key in the headers:
- GET https://api.example.com/resource
Authorization: Bearer YOUR_API_KEY
7) Handling Response Data. After sending a request, you can inspect the response data. For example, to extract a specific value from the response, you can use JavaScript-like syntax in Postman's Tests tab:
// Extract the value of the "name" field from the JSON response
var jsonData = pm.response.json();
pm.environment.set("username", jsonData.name);
These are just some basic examples of how to use Postman to interact with RESTful APIs. You can create collections of requests, use variables, and write more complex tests to thoroughly test and validate your APIs.
Here is a Python code example that demonstrates how to make an HTTP GET request to a RESTful API using the popular requests library. In this example, we'll use the JSONPlaceholder API, which provides dummy data for testing and learning purposes:
import requests

# Define the API endpoint URL
api_url = "https://jsonplaceholder.typicode.com/posts/1"

try:
    # Send an HTTP GET request to the API endpoint
    response = requests.get(api_url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()
        # Print the response data
        print("Title:", data["title"])
        print("Body:", data["body"])
    else:
        print("HTTP Request Failed with Status Code:", response.status_code)
except requests.exceptions.RequestException as e:
    # Handle any exceptions that may occur during the request
    print("An error occurred:", e)
NOTE: We define the API endpoint URL (api_url) that we want to retrieve data from; in this example, we're fetching data for a specific post using its ID. Inside a try block we send an HTTP GET request to the endpoint using requests.get(api_url) and then check the HTTP response status code. If it's 200, the request was successful, so we parse the JSON response with response.json() and print specific fields from it (here, the post's title and body). If the request fails or an exception occurs, we handle it and print an error message.
OpenStack provides a set of RESTful APIs for managing cloud infrastructure resources. These APIs are used to create, manage, and interact with virtualized resources such as instances (virtual machines), volumes, networks, and more. Here are some common API endpoint examples with respect to OpenStack:
1) Identity (Keystone) API: Authentication and token management.
Example: http://<OpenStack-IP>:5000/v3/
2) Compute (Nova) API: Management of virtual machines (instances).
Example: http://<OpenStack-IP>:8774/v2.1/
3) Block Storage (Cinder) API: Management of block storage volumes.
Example: http://<OpenStack-IP>:8776/v2/
4) Object Storage (Swift) API: Storage and retrieval of objects (files and data).
Example: http://<OpenStack-IP>:8080/v1/
5) Image (Glance) API: Management of virtual machine images (VM snapshots).
Example: http://<OpenStack-IP>:9292/v2/
6) Network (Neutron) API: Management of network resources, including routers, subnets, and security groups.
Example: http://<OpenStack-IP>:9696/v2.0/
7) Orchestration (Heat) API: Orchestration of cloud resources through templates.
Example: http://<OpenStack-IP>:8004/v1/
8) Telemetry (Ceilometer) API: Collection of usage and performance data.
Example: http://<OpenStack-IP>:8777/v2/
9) Dashboard (Horizon): Web-based user interface for OpenStack services.
Example: http://<OpenStack-IP>/dashboard/
10) Placement API: Management of resource placement and allocation.
Example: http://<OpenStack-IP>:8778/
These are just some examples of the core OpenStack APIs and their respective endpoint URLs.
--------
To check if a user exists in your OpenStack environment, you can use the Identity (Keystone) API, which manages authentication and user-related operations. Specifically, you can make a request to the Keystone API to list users and then check if the desired user is in the list. Here are the general steps:
Step 1: Authenticate with Keystone. Before making any requests to the Keystone API, you need to authenticate. Typically, this involves sending a POST request with your credentials to the Keystone authentication endpoint. You'll receive a token in response, which you can use to make subsequent API requests.
Step 2: List users. Make a GET request to the Keystone API's user listing endpoint to retrieve a list of all users in the OpenStack environment, including the authentication token in the request headers.
Example API endpoint for listing users: http://<OpenStack-IP>:5000/v3/users
Step 3: Check user existence. After receiving the list of users, iterate through the user data and check if the desired user exists by comparing usernames, IDs, or other unique identifiers.
Here's a Python example using the requests library to check if a user exists in Keystone:
import requests

# Keystone authentication endpoint
auth_url = "http://<OpenStack-IP>:5000/v3/auth/tokens"

# Keystone user listing endpoint
users_url = "http://<OpenStack-IP>:5000/v3/users"

# Replace with your OpenStack credentials
auth_data = {
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "name": "your_username",
                    "domain": {"name": "your_domain"},
                    "password": "your_password"
                }
            }
        }
    }
}

# Authenticate and get a token
response = requests.post(auth_url, json=auth_data)
if response.status_code == 201:
    token = response.headers["X-Subject-Token"]

    # List all users
    headers = {"X-Auth-Token": token}
    response = requests.get(users_url, headers=headers)
    if response.status_code == 200:
        users = response.json()["users"]

        # Check if the user exists
        target_user = "desired_username"
        user_exists = any(user["name"] == target_user for user in users)
        if user_exists:
            print(f"User {target_user} exists.")
        else:
            print(f"User {target_user} does not exist.")
    else:
        print("Failed to list users.")
else:
    print("Authentication failed.")
This example demonstrates how to authenticate with Keystone, list users, and check if a specific user exists by comparing usernames. Replace the placeholders with your OpenStack-specific values and adjust the code as needed for your environment.
-----------------------
OpenStack service overview:
Nova, Cinder, Swift, and Neutron: these OpenStack services together provide a comprehensive cloud computing platform. Nova manages compute resources, Cinder offers block storage, Swift provides object storage, and Neutron handles networking, enabling organizations to build and manage private and public clouds tailored to their specific needs.
Nova (OpenStack Compute): Nova is the core compute service in OpenStack. It manages the creation, scheduling, and management of virtual machines (VMs) in a cloud environment. Nova is hypervisor-agnostic, supporting various virtualization technologies, and it provides features for live migration, scaling, and resource management.
Cinder (OpenStack Block Storage): Cinder is the block storage service in OpenStack. It offers block-level storage volumes that can be attached to VMs. Users can create, manage, and snapshot these volumes, making it suitable for data persistence in applications like databases.
Swift (OpenStack Object Storage): Swift is the object storage service in OpenStack. It is designed for the storage of large amounts of unstructured data, such as images, videos, and backups. Swift provides scalable, redundant, and highly available storage with easy-to-use APIs.
Neutron (OpenStack Networking): Neutron is the networking service in OpenStack. It enables users to create and manage networks, subnets, routers, and security groups for VMs. Neutron supports various network configurations, including flat networks, VLANs, and overlay networks, allowing for flexibility in network design.
--------
Key Differences between Cinder and Swift: Object storage and block storage serve different purposes and have distinct access methods. Object storage is well-suited for handling unstructured data and large-scale content distribution, while block storage is preferred for applications requiring direct control over data blocks and high performance. Organizations often choose between these storage types based on their specific use cases and storage needs.
Access Level: Object storage uses a higher-level access method, where data is accessed and managed as whole objects using unique identifiers. Block storage provides lower-level access, treating data as raw blocks.
Use Cases: Object storage is ideal for storing large amounts of unstructured data and content distribution, while block storage is suited for applications requiring direct control over storage blocks.
Scalability: Object storage is known for its horizontal scalability and ease of expansion, whereas block storage scalability may require more planning and management.
Data Management: Object storage systems often manage data redundancy and durability internally, while block storage may rely on external solutions or the application to manage data redundancy.
Data Retrieval: Object storage is optimized for read-heavy workloads and large-scale data distribution, while block storage is designed for high performance and low-latency access.
------------
Ceph:
Ceph is an open-source, distributed storage system designed for both object and block storage. It is known for its flexibility, scalability, and ability to provide a unified storage platform. Ceph is often used in cloud computing environments, data centers, and high-performance computing (HPC) clusters.
Key components and features of Ceph include:
Object Storage (RADOS Gateway): Ceph provides object storage capabilities through its RADOS (Reliable Autonomic Distributed Object Store) Gateway. This allows users to store and retrieve objects using a RESTful API compatible with Amazon S3 and Swift.
Block Storage (RBD): Ceph's RADOS Block Device (RBD) allows users to create block storage volumes that can be attached to virtual machines or used as raw block storage. RBD is often integrated with virtualization platforms like KVM.
Scalability: Ceph scales seamlessly from a few nodes to thousands of nodes by distributing data across OSDs (Object Storage Daemons) and MONs (Monitor Daemons). This scalability makes it suitable for large-scale storage deployments.
Data Redundancy: Ceph replicates data across multiple OSDs to ensure redundancy and high availability. It uses a CRUSH algorithm to distribute data efficiently.
Self-Healing: Ceph can automatically detect and recover from hardware failures or data inconsistencies. It continuously monitors data integrity.
Unified Storage: Ceph provides a unified storage platform that combines object, block, and file storage, allowing users to access data in various ways, depending on their requirements.
Community and Ecosystem: Ceph has a vibrant open-source community and a wide ecosystem of tools and projects that integrate with it. This includes interfaces for OpenStack integration.
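As an illustration only (not part of the original post), and assuming the python-rados bindings that ship with Ceph are installed and a pool named "mypool" already exists, writing and reading an object through librados could look roughly like this:
# Rough sketch using Ceph's python-rados bindings (assumed available on a Ceph client node).
# The pool name "mypool" and the default ceph.conf path are assumptions for illustration.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()                          # contact the MONs and open a cluster handle
try:
    ioctx = cluster.open_ioctx("mypool")   # I/O context bound to one pool
    try:
        ioctx.write_full("hello-object", b"stored via RADOS")  # write an object
        print(ioctx.read("hello-object"))                       # read it back
    finally:
        ioctx.close()
finally:
    cluster.shutdown()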
-------------------------
Neutron, the networking component of OpenStack, plays a crucial role in creating and managing networking resources within a cloud infrastructure.
Here are some interesting factors and capabilities related to Neutron:
Network Abstraction: Neutron abstracts network resources, allowing users to create and manage virtual networks, subnets, routers, and security groups through APIs or the dashboard. This abstraction simplifies complex networking tasks and provides a consistent interface.
Multi-Tenancy: Neutron supports multi-tenancy, enabling the isolation of network resources between different projects or tenants. This ensures that one tenant's network activities do not impact another's.
Pluggable Architecture: Neutron follows a pluggable architecture, allowing users to integrate with various networking technologies and solutions. This includes support for different plugins and drivers, enabling compatibility with a wide range of network devices and services.
Software-Defined Networking (SDN): Neutron can be used in conjunction with SDN controllers and solutions to provide advanced network automation, programmability, and flexibility. SDN allows for the dynamic configuration of network services and policies.
Networking Interfaces: Neutron allows the creation of various types of networking interfaces for virtual machines, including:
Port: Neutron manages ports, which represent virtual interfaces connected to a network. VMs attach to ports to access the network.
Router: Routers connect different subnets and provide inter-subnet routing. Neutron manages router interfaces and routing rules.
Floating IPs: Floating IPs provide external network access to VMs. Neutron can assign floating IPs dynamically or statically.
Bonding and Teaming: Neutron can manage bonded network interfaces (NIC bonding) for redundancy and increased network bandwidth. This is especially useful for ensuring high availability and load balancing of VMs.
Security Groups: Neutron's security groups feature allows users to define firewall rules and policies to control incoming and outgoing traffic to VMs. It enhances network security within the cloud environment.
L3 and L2 Services: Neutron supports Layer 3 (routing) and Layer 2 (bridging) services. This flexibility enables complex network topologies and scenarios.
Interoperability: Neutron integrates with various network technologies, including VLANs, VXLANs, GRE tunnels, and more. It provides interoperability with physical network infrastructure and external networks.
Communication Between VMs: Neutron ensures that VMs can communicate with each other within the same network or across networks using routing. It manages the routing tables and connectivity.
Load Balancing as a Service (LBaaS): Neutron offers LBaaS, allowing users to create and manage load balancers to distribute traffic among multiple VMs or instances.
High Availability (HA): Neutron can be configured for high availability, ensuring network services remain operational even in the event of network node failures.
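To make the network abstraction concrete, here is a small hypothetical sketch (using the same openstacksdk assumption as the earlier example; names and CIDR are illustrative only) that creates a network, a subnet, and a router through Neutron's API:
# Hypothetical Neutron example via openstacksdk; resource names and CIDR are illustrative.
import openstack

conn = openstack.connect(cloud="mycloud")

# Create an isolated tenant network
net = conn.network.create_network(name="demo-net")

# Attach a subnet with an address range to that network
subnet = conn.network.create_subnet(
    name="demo-subnet",
    network_id=net.id,
    ip_version=4,
    cidr="192.168.50.0/24",
)

# Create a router that can later be attached to the subnet for inter-subnet routing
router = conn.network.create_router(name="demo-router")
print(net.id, subnet.id, router.id)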
---------------------------------------------------------
Containerization in OpenStack involves deploying and managing containers within an OpenStack cloud environment. This allows users to run containerized applications and microservices alongside traditional virtual machines (VMs).
Here's a step-by-step explanation of the design and components involved in containerization within OpenStack:
1. Container Orchestration Framework: OpenStack supports various container orchestration frameworks, with Kubernetes being one of the most popular choices. Kubernetes helps manage the deployment, scaling, and operation of application containers. It serves as the foundation for container orchestration in an OpenStack environment.
2. Container Runtime: Containers are run using a container runtime, such as Docker or containerd. This runtime manages the execution of containerized applications and provides isolation between containers. In an OpenStack-based containerization setup, a container runtime is installed on each compute node in the OpenStack cluster.
3. OpenStack Components:
- Nova (Compute Service): Nova is responsible for managing compute resources, including VMs and, in a containerized environment, bare metal servers. It can provision servers specifically for running containers alongside traditional VMs.
- Neutron (Networking Service): Neutron handles networking and connectivity for containers. It ensures that containers can communicate with each other, VMs, and external networks.
- Cinder (Block Storage Service): Cinder provides block-level storage for containers when persistent storage is required. Containers can use Cinder volumes for data storage.
4. Magnum (Container Orchestration Service): OpenStack Magnum is a dedicated service for managing container orchestration clusters, such as Kubernetes, within the OpenStack cloud. It simplifies the deployment and management of container orchestration platforms.
5. Heat (Orchestration Service): Heat is an orchestration service in OpenStack that enables the automated deployment and scaling of infrastructure resources, including containers. It allows users to define templates describing the desired container infrastructure and then deploys and manages the resources accordingly.
6. Glance (Image Service): Glance is responsible for storing and managing container images. Containers are typically built from base images, and Glance helps manage these images within the OpenStack environment.
7. Keystone (Identity Service): Keystone provides authentication and authorization services for containerized applications and services. It ensures that only authorized users and services can access containers and container orchestration platforms.
8. Container Networking and Storage Plugins: In an OpenStack-based containerization environment, specialized networking and storage plugins are often used to integrate container networking and storage with OpenStack services. These plugins enable efficient communication and data storage for containers.
9. User Interface: Users interact with the containerization platform through the OpenStack dashboard (Horizon) or through the command-line interface (CLI). They can deploy and manage containers, container orchestration clusters, and associated resources.
10. Monitoring and Logging: Containerized applications generate logs and require monitoring for performance and resource usage. OpenStack can be integrated with monitoring and logging solutions like Prometheus, Grafana, and ELK (Elasticsearch, Logstash, and Kibana) to provide insights into containerized workloads.
11. External Services Integration: Containers often need to interact with external services and APIs. OpenStack allows for integration with external services through the use of network configurations, load balancers, and other relevant components.
In summary, containerization in OpenStack involves a combination of OpenStack services, container orchestration frameworks like Kubernetes, container runtimes, and specialized plugins to provide a seamless environment for deploying and managing containerized applications alongside traditional VMs within an OpenStack cloud infrastructure. This setup offers flexibility, scalability, and isolation for running containerized workloads in a cloud environment.
Sunday, July 30, 2023
Watsonx AI and data platform with Foundation Models
Why can't we build and reuse AI models? More data, more problems? Learn how AI foundation models change the game for training AI/ML from IBM Research AI VP Sriram Raghavan and Darío Gil, SVP and Director of IBM Research, as they demystify the technology and share a set of principles to guide your generative AI business strategy. Experience watsonx, IBM's new data and AI platform for generative AI, learn about the breakthroughs that IBM Research is bringing to this platform and to the world of computing, and explore foundation models, an emerging approach to machine learning and data representation. Even in the age of big data, when AI/ML is more prevalent, training the next generation of AI tools like NLP requires enormous amounts of data, and applying AI models to new or different domains can be tricky. A foundation model can consolidate data from several sources so that one model can then be used for various activities. But how will foundation models be used for things beyond natural language processing? Don't miss this episode to explore how foundation models are a paradigm shift in how AI gets done.
You can bring your own data and AI models to watsonx or choose from a library of tools and technologies. You can train or influence training (if you want), then you can tune, so that you have transparency and control over governing data and AI models. You can prompt it too. Instead of only one model, you can have a family of models. Foundation models trained with your own data become a more valuable asset. Watsonx is a new integrated data and AI platform designed to make you a value creator. It consists of three primary parts: watsonx.data is a massive curated data repository, with a data management system, that is ready to be tapped to train and fine-tune models; watsonx.ai is an enterprise studio to train, validate, tune, and deploy traditional ML and foundation models that provide generative AI capabilities; and watsonx.governance is a powerful set of tools to ensure your AI is executing responsibly. They work together seamlessly throughout the entire lifecycle of foundation models. Watsonx is built on top of Red Hat OpenShift. The lifecycle consists of:
STEP 1: Preparing our data [acquire, filter and pre-process, version and tag]. After being filtered and processed, each data set receives a data card. The data card carries the name and version of the pile and specifies its content and the filters that have been applied to it. We can have multiple data piles; they co-exist in watsonx.data, and access to the different versions of data maintained for different purposes is managed seamlessly.
STEP 2: Using the data to train, validate, and tune the model, and to deploy applications and solutions. Here we move from watsonx.data to watsonx.ai and start by picking a model architecture from the five families that IBM provides. These are the bedrock model families, and they range from encoder-only, encoder-decoder, and decoder-only to other novel architectures.
What Are Foundation Models? Foundation models are AI neural networks trained on massive unlabeled datasets to handle a wide variety of jobs, from translating text to analyzing medical images. We're witnessing a transition in AI. Systems that execute specific tasks in a single domain are giving way to broad AI that learns more generally and works across domains and problems. Foundation models, trained on large, unlabeled datasets and fine-tuned for an array of applications, are driving this shift. The models are pre-trained to support a range of natural language processing (NLP) tasks including question answering, content generation and summarization, text classification and extraction. Future releases will provide access to a greater variety of IBM-trained proprietary foundation models for efficient domain and task specialization.
Foundation models are trained with massive amounts of data that allow for generative AI capabilities with a broad set of raw data that can be applied to different tasks, such as natural language processing. Instead of one model built solely for one task, foundation models can be adapted across a wide variety of different scenarios, summarizing documents, generating stories, answering questions, writing code, solving math problems, synthesizing audio. A year after the group defined foundation models, other tech watchers coined a related term — generative AI. It’s an umbrella term for transformers, large language models, diffusion models and other neural networks capturing people’s imaginations because they can create text, images, music, software and more.
IBM has planned to offer a suite of foundation models, for example smaller encoder based models, but also encoder-decoder or just decoder based models.
Watsonx is our enterprise-ready AI and data platform designed to multiply the impact of AI across your business. The platform comprises three powerful products: the watsonx.ai studio for new foundation models, generative AI and machine learning; the watsonx.data fit-for-purpose data store, built on an open lakehouse architecture; and the watsonx.governance toolkit, to accelerate AI workflows that are built with responsibility, transparency and explainability. It consists of Watsonx.data, Watsonx.ai and Watsonx.governance
Watsonx.ai Studio: an AI studio that combines the capabilities of IBM Watson Studio with the latest generative AI capabilities that leverage the power of foundation models. It provides access to high-quality, pre-trained, and proprietary IBM foundation models built with a rigorous focus on data acquisition, provenance, and quality. watsonx.ai is user-friendly: it's not just for data scientists and developers but also for business users, and it provides a simple, natural language interface for different tasks, along with a new playground that includes easy-to-use prompt tuning. With watsonx.ai, you can train, validate, tune, and deploy AI models.
Watsonx.governance: IBM describes watsonx.governance as a toolkit for building responsible, transparent, and explainable AI workflows. According to IBM, watsonx.governance will also enable customers to direct, manage, and monitor AI activities, map to regulatory requirements, and address ethical issues. The more AI is embedded into daily workflows, the more you need proactive governance to drive responsible, ethical decisions across the business. Watsonx.governance allows you to direct, manage, and monitor your organization's AI activities, and employs software automation to strengthen your ability to mitigate risk, manage regulatory requirements, and address ethical concerns without the excessive costs of switching your data science platform, even for models developed using third-party tools.
Why build an AI supercomputer in the cloud? Introducing Vela, IBM's first AI-optimized, cloud-native supercomputer.
IBM built the Vela supercomputer specifically for training so-called "foundation" AI models such as GPT-3. According to IBM, this new supercomputer should become the basis for all of its own research and development activities for these types of AI models. IBM's Vela supercomputer uses x86-based standard hardware: each node consists of a pair of "regular" Intel Xeon Scalable processors plus eight 80GB NVIDIA A100 GPUs. Furthermore, each node within the supercomputer is connected to several 100 Gbps Ethernet network interfaces, and each Vela node also has 1.5TB of DRAM and four 3.2TB NVMe drives for storage. In addition, IBM has built a new workload-scheduling system for Vela, the MultiCluster App Dispatcher (MCAD), which handles cloud-based job scheduling for training foundation AI models.
Multi-Cluster Application Dispatcher:
The multi-cluster-app-dispatcher (MCAD) is a Kubernetes controller providing mechanisms for applications to manage batch jobs in a single- or multi-cluster environment. The MCAD controller is capable of (i) providing an abstraction for wrapping all resources of the job/application and treating them holistically, (ii) queuing job/application creation requests and applying different queuing policies, e.g., First In First Out or Priority, (iii) dispatching the job to one of multiple clusters, where an MCAD queuing agent runs, using configurable dispatch policies, and (iv) auto-scaling pod sets, balancing job demands and cluster availability.
What is prompt-tuning?
Prompt-tuning is an efficient, low-cost way of adapting an AI foundation model to new downstream tasks without retraining the model and updating its weights. Redeploying an AI model without retraining it can cut computing and energy use by at least 1,000 times, saving thousands of dollars. With prompt-tuning, you can rapidly spin up a powerful model for your particular needs. It also lets you move faster and experiment.
In prompt-tuning, the best cues, or front-end prompts, are fed to your AI model to give it task-specific context. The prompts can be extra words introduced by a human, or AI-generated numbers introduced into the model's embedding layer. Like crossword puzzle clues, both prompt types guide the model toward a desired decision or prediction. Prompt-tuning allows a company with limited data to tailor a massive model to a narrow task. It also eliminates the need to update the model’s billions (or trillions) of weights, or parameters. Prompt-tuning originated with large language models but has since expanded to other foundation models, like transformers that handle other sequential data types, including audio and video. Prompts may be snippets of text, streams of speech, or blocks of pixels in a still image or video. We don’t touch the model. It’s frozen.
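As a rough sketch of what this looks like in code (my own illustration, not IBM's implementation), the Hugging Face PEFT library exposes prompt tuning as a small set of trainable virtual tokens prepended to a frozen model; the base checkpoint and token count below are arbitrary assumptions:
# Illustrative prompt-tuning setup with Hugging Face transformers + peft (both assumed installed).
# The base checkpoint "t5-small" and 20 virtual tokens are arbitrary choices for the sketch.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # frozen foundation model
tokenizer = AutoTokenizer.from_pretrained("t5-small")

config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,   # sequence-to-sequence task
    num_virtual_tokens=20,             # only these soft-prompt embeddings are trained
)
model = get_peft_model(base, config)
model.print_trainable_parameters()     # shows how tiny the trainable footprint is
The base model's billions of weights stay untouched; only the virtual-token embeddings are updated during training, which is what makes prompt-tuning so cheap compared with full fine-tuning.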
For Example: How do AI art generators work?
AI art generators don’t know what an owl looks like in the wild. They don’t know what a sunset looks like in a physical sense. They can only understand details about features, patterns, and relationships within the datasets they’ve been trained on. Prompting for a “beautiful face” is not very helpful. It is more effective to prompt for specific features such as symmetry, big lips, and green eyes. Even if the bot doesn’t understand beauty, it can recognize the features you describe as beautiful and generate something relatively accurate. To get the best results from your AI art generator prompt, you’ll need to give clear and detailed instructions. An effective AI art prompt should include specific descriptions, shapes, colors, textures, patterns, and artistic styles. This allows the neural networks used by the generator to create the best possible visuals.
T5 (Text-to-Text Transfer Transformer) is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks; it uses the complete transformer architecture and provides a simple way to train a single model on a wide variety of text tasks. FLAN (Fine-tuned LAnguage Net) models have already been instruction-tuned by Google, so you can try your tasks on an already-tuned model; if you fine-tune further, you may degrade that tuning by overwriting it. Flan-UL2 is an encoder-decoder model based on the T5 architecture. It uses the same configuration as the UL2 model released earlier last year and was fine-tuned using the "Flan" prompt tuning and dataset collection. With its impressive 20 billion parameters, Flan-UL2 is a remarkable encoder-decoder model with exceptional performance. UL2 20B: An Open Source Unified Language Learner. In "Unifying Language Learning Paradigms", Google presents a novel language pre-training paradigm called Unified Language Learner (UL2) that improves the performance of language models universally across datasets and setups. UL2 frames different objective functions for training language models as denoising tasks, where the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers that samples from a varied set of such objectives, each with different configurations. Models trained using the UL2 framework perform well in a variety of language domains, including prompt-based few-shot learning and fine-tuning for downstream tasks, and UL2 excels in generation, language understanding, retrieval, long-text understanding, and question-answering tasks.
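For context, loading such an instruction-tuned encoder-decoder model for inference typically looks like the sketch below (assuming the Hugging Face transformers library; "google/flan-t5-small" is used here only as a lightweight stand-in, since the full Flan-UL2 checkpoint "google/flan-ul2" needs far more memory):
# Sketch: running an instruction-tuned encoder-decoder model with transformers.
# "google/flan-t5-small" is a small stand-in; the same pattern applies to Flan-UL2.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

prompt = "Answer the question: What does an encoder-decoder model do?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)   # encoder reads the prompt, decoder generates
print(tokenizer.decode(outputs[0], skip_special_tokens=True))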
Retrieval Augmented Generation (RAG):
Foundation models are usually trained offline, making the model agnostic to any data that is created after the model was trained. Additionally, foundation models are trained on very general domain corpora, making them less effective for domain-specific tasks. You can use Retrieval Augmented Generation (RAG) to retrieve data from outside a foundation model and augment your prompts by adding the relevant retrieved data in context. For more information about RAG model architectures, see Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
With RAG, the external data used to augment your prompts can come from multiple data sources, such as document repositories, databases, or APIs. The first step is to convert your documents and any user queries into a compatible format to perform relevancy search. To make the formats compatible, a document collection, or knowledge library, and user-submitted queries are converted to numerical representations using embedding language models. Embedding is the process by which text is given numerical representation in a vector space. RAG model architectures compare the embeddings of user queries within the vector of the knowledge library. The original user prompt is then appended with relevant context from similar documents within the knowledge library. This augmented prompt is then sent to the foundation model. You can update knowledge libraries and their relevant embeddings asynchronously.
Pre-requisites: data sets
1) Training data set (contains questions and answers)
2) Test data set
embedding generation ---> storing the embeddings in a vector database ---> taking a user question ---> converting it into an embedding ---> querying the vector database ---> retrieving the most relevant context ---> creating a prompt with that context ---> sending it to the foundation model Flan-UL2 (encoder-decoder model) ---> getting an answer
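A minimal sketch of that retrieval step (purely illustrative; embed() below is a hypothetical stand-in for whichever embedding model and vector database you actually use):
# Toy RAG retrieval: embeddings are compared by cosine similarity and the best match
# is pasted into the prompt sent to the foundation model. embed() is a placeholder for
# a real embedding model; the "vector database" here is just an in-memory list.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)

knowledge_library = [
    "LSF is a workload manager for HPC clusters.",
    "Kubernetes schedules containers across worker nodes.",
]
index = [(doc, embed(doc)) for doc in knowledge_library]   # the "vector database"

question = "What schedules containers?"
q_vec = embed(question)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

best_doc = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]
prompt = f"Context: {best_doc}\nQuestion: {question}\nAnswer:"
print(prompt)   # this augmented prompt would be sent to Flan-UL2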
Wednesday, April 19, 2023
Kubernetes - decommissioning a node from the cluster
Kubernetes cluster is a group of nodes that are used to run containerized applications and services. The cluster consists of a control plane, which manages the overall state of the cluster, and worker nodes, which run the containerized applications.
The control plane is responsible for managing the configuration and deployment of applications on the cluster, as well as monitoring and scaling the cluster as needed. It includes components such as the Kubernetes API server, the etcd datastore, the kube-scheduler, and the kube-controller-manager.
The worker nodes are responsible for running the containerized applications and services. Each node typically runs a container runtime, such as Docker or containerd, as well as a kubelet process that communicates with the control plane to manage the containers running on the node.
In a Kubernetes cluster, applications are deployed as pods, which are the smallest deployable units in Kubernetes. Pods contain one or more containers, and each pod runs on a single node in the cluster. Kubernetes manages the deployment and scaling of the pods across the cluster, ensuring that the workload is evenly distributed and resources are utilized efficiently.
In Kubernetes, the native scheduler is a built-in component responsible for scheduling pods onto worker nodes in the cluster. When a new pod is created, the scheduler evaluates the resource requirements of the pod, along with any constraints or preferences specified in the pod's definition, and selects a node in the cluster where the pod can be scheduled. The native scheduler uses a combination of heuristics and policies to determine the best node for each pod. It considers factors such as the available resources on each node, the affinity and anti-affinity requirements of the pod, any node selectors or taints on the nodes, and the current state of the cluster. The native scheduler in Kubernetes is highly configurable and can be customized to meet the specific needs of different workloads. For example, you can configure the scheduler to prioritize certain nodes in the cluster over others, or to balance the workload evenly across all available nodes.
[sachinpb@remotehostn18 ~]$ kubectl get pods -n kube-system | grep kube-scheduler
kube-scheduler-remotehost18 1/1 Running 11 398d
kubectl cordon is a command in Kubernetes that is used to mark a node as unschedulable. This means that Kubernetes will no longer schedule any new pods on the node, but will continue to run any existing pods on the node.
The kubectl cordon command is useful when you need to take a node offline for maintenance or other reasons, but you want to ensure that the existing pods on the node continue to run until they can be safely moved to other nodes in the cluster. By marking the node as unschedulable, you can prevent Kubernetes from scheduling any new pods on the node, which helps to ensure that the overall health and stability of the cluster is maintained.
[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
remotenode01 Ready worker 270d v1.23.4
remotenode02 Ready worker 270d v1.23.4
remotenode03 Ready worker 270d v1.23.4
remotenode04 Ready worker 81d v1.23.4
remotenode07 Ready worker 389d v1.23.4
remotenode08 Ready worker 389d v1.23.4
remotenode09 Ready worker 389d v1.23.4
remotenode14 Ready worker 396d v1.23.4
remotenode15 Ready worker 81d v1.23.4
remotenode16 Ready worker 396d v1.23.4
remotenode17 Ready worker 396d v1.23.4
remotenode18 Ready control-plane,master 398d v1.23.4
[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16
node/remotenode16 cordoned
[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16
node/remotenode16 uncordoned
[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16
node/remotenode16 cordoned
[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
remotenode01 Ready worker 270d v1.23.4
remotenode02 Ready worker 270d v1.23.4
remotenode03 Ready worker 270d v1.23.4
remotenode04 Ready worker
remotenode07 Ready worker 389d v1.23.4
remotenode08 Ready worker 389d v1.23.4
remotenode09 Ready worker 389d v1.23.4
remotenode14 Ready worker 396d v1.23.4
remotenode15 Ready worker 81d v1.23.4
remotenode16 Ready,SchedulingDisabled worker 396d v1.23.4
remotenode18 Ready control-plane,master 398d v1.23.4
[sachinpb@remotenode18 ~]$
After the node has been cordoned off, you can use the kubectl drain command to safely and gracefully terminate any running pods on the node and reschedule them onto other available nodes in the cluster. Once all the pods have been moved, the node can then be safely removed from the cluster.
kubectl drain is a command in Kubernetes that is used to gracefully remove a node from a cluster. This is typically used when performing maintenance on a node, such as upgrading or replacing hardware, or when decommissioning a node from the cluster.
[sachinpb@remotenode18 ~]$ kubectl drain --ignore-daemonsets remote16
node/remote16 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
node/remote16 drained
[sachinpb@remotenode18 ~]$
By default, kubectl drain is non-destructive; you have to override these defaults to change that behaviour. It runs with the following defaults:
--delete-local-data=false
--force=false
--grace-period=-1 (Period of time in seconds given to each pod to terminate gracefully. If negative, the default value specified in the pod will be used.)
--ignore-daemonsets=false
--timeout=0s
Each of these safeguards deals with a different category of potential destruction (local data, bare pods, graceful termination, daemonsets). Drain also respects pod disruption budgets to adhere to workload availability. Any non-bare pod will be recreated on a new node by its respective controller (e.g. daemonset controller, replication controller). It's up to you whether you want to override that behaviour (for example, you might have a bare pod if running a Jenkins job; if you override by setting --force=true it will delete that pod and it won't be recreated). If you don't override it, the node will remain in drain mode indefinitely (--timeout=0s).
When a node is drained, Kubernetes will automatically reschedule any running pods onto other available nodes in the cluster, ensuring that the workload is not interrupted. The kubectl drain command ensures that the node is cordoned off, meaning no new pods will be scheduled on it, and then gracefully terminates any running pods on the node. This helps to ensure that the pods are shut down cleanly, allowing them to complete any in-progress tasks and save any data before they are terminated.
After the pods have been rescheduled, the node can then be safely removed from the cluster. This helps to ensure that the overall health and stability of the cluster is maintained, even when individual nodes need to be taken offline for maintenance or other reasons
When kubectl drain returns successfully, that indicates that all of the pods have been safely evicted. It is then safe to bring down the node. After maintenance work we can use kubectl uncordon to tell Kubernetes that it can resume scheduling new pods onto the node.
[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16
node/remotenode16 uncordoned
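Before walking through the kubectl steps end to end, here is a rough equivalent using the official Kubernetes Python client (an illustration only; it assumes the kubernetes package, a working kubeconfig, and a recent client version that provides V1Eviction, and it skips the DaemonSet/mirror-pod handling that kubectl drain performs):
# Sketch: cordon, evict pods, then uncordon a node with the Kubernetes Python client.
# The node name is illustrative; kubectl drain does more (DaemonSets, mirror pods, retries).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
node = "remotenode16"

# Cordon: mark the node unschedulable
v1.patch_node(node, {"spec": {"unschedulable": True}})

# "Drain": evict every pod currently running on the node (respects PodDisruptionBudgets)
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")
for pod in pods.items:
    eviction = client.V1Eviction(
        metadata=client.V1ObjectMeta(name=pod.metadata.name,
                                     namespace=pod.metadata.namespace))
    v1.create_namespaced_pod_eviction(pod.metadata.name, pod.metadata.namespace, eviction)

# ... perform maintenance, then uncordon: make the node schedulable again
v1.patch_node(node, {"spec": {"unschedulable": False}})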
Let's try all the above steps and see:
1) Retrieve information from a Kubernetes cluster
[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
remotenode01 Ready worker 270d v1.23.4
remotenode02 Ready worker 270d v1.23.4
remotenode03 Ready worker 270d v1.23.4
remotenode04 Ready worker 81d v1.23.4
remotenode07 Ready worker 389d v1.23.4
remotenode08 Ready worker 389d v1.23.4
remotenode09 Ready worker 389d v1.23.4
remotenode14 Ready worker 396d v1.23.4
remotenode15 Ready worker 81d v1.23.4
remotenode16 Ready worker 396d v1.23.4
remotenode17 Ready worker 396d v1.23.4
remotenode18 Ready control-plane,master 398d v1.23.4
--------------------------------
2) Kubernetes cordon is an operation that marks a node in your existing node pool as unschedulable, so that no new pods are scheduled on it.
[sachinpb@remotenode18 ~]$ kubectl cordon remotenode16
node/remotenode16 cordoned
[sachinpb@remotenode18 ~]$
[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
remotenode01 Ready worker 270d v1.23.4
remotenode02 Ready worker 270d v1.23.4
remotenode03 Ready worker 270d v1.23.4
remotenode04 Ready worker 81d v1.23.4
remotenode07 Ready worker 389d v1.23.4
remotenode08 Ready worker 389d v1.23.4
remotenode09 Ready worker 389d v1.23.4
remotenode14 Ready worker 396d v1.23.4
remotenode15 Ready worker 81d v1.23.4
remotenode16 Ready,SchedulingDisabled worker 396d v1.23.4
remotenode17 Ready worker 396d v1.23.4
remotenode18 Ready control-plane,master 398d v1.23.4
3) Drain the node in preparation for maintenance. The given node will be marked unschedulable to prevent new pods from arriving. Then drain evicts or deletes all pods on the node (except mirror pods and, unless overridden, DaemonSet-managed pods):
[sachinpb@remotenode18 ~]$ kubectl drain remotenode16 --grace-period=2400
node/remotenode16 already cordoned
error: unable to drain node "remotenode16" due to error:cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db, continuing command...
There are pending nodes to be drained:
remotenode16
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
[sachinpb@remotenode18 ~]$
NOTE:
The given node will be marked unschedulable to prevent new pods from arriving. Then drain deletes all pods except mirror pods (which cannot be deleted through the API server). If there are DaemonSet-managed pods, drain will not proceed without --ignore-daemonsets, and regardless it will not delete any DaemonSet-managed pods, because those pods would be immediately replaced by the DaemonSet controller, which ignores unschedulable markings. If there are any pods that are neither mirror pods nor managed by a ReplicationController, DaemonSet, or Job, then drain will not delete any pods unless you use --force.
----------------------------
4) Drain node with --ignore-daemonsets
[sachinpb@remotenode18 ~]$ kubectl drain --ignore-daemonsets remotenode16 --grace-period=2400
node/remotenode16 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
node/remotenode16 drained
----------------------
5) Uncordon will mark the node as schedulable.
[sachinpb@remotenode18 ~]$ kubectl uncordon remotenode16
node/remotenode16 uncordoned
[sachinpb@remotenode18 ~]$
-----------------
6) Retrieve information from a Kubernetes cluster
[sachinpb@remotenode18 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
remotenode01 Ready worker 270d v1.23.4
remotenode02 Ready worker 270d v1.23.4
remotenode03 Ready worker 270d v1.23.4
remotenode04 Ready worker 81d v1.23.4
remotenode07 Ready worker 389d v1.23.4
remotenode08 Ready worker 389d v1.23.4
remotenode09 Ready worker 389d v1.23.4
remotenode14 Ready worker 396d v1.23.4
remotenode15 Ready worker 81d v1.23.4
remotenode16 Ready worker 396d v1.23.4
remotenode17 Ready worker 396d v1.23.4
remotenode18 Ready control-plane,master 398d v1.23.4
How to automate the above process by creating a Jenkins pipeline job to cordon, drain, and uncordon the nodes with the help of a Groovy script:
-------------------------Sample groovy script--------------------------------
node("Kubernetes-master-node") {
stage("1") {
sh 'hostname'
sh 'cat $SACHIN_HOME/manual//hostfile'
k8s_cordon_drain()
k8s_uncordon()
}
}
/*
* CI -Kubernetes cluster : This function will cordon/drain the worker nodes in hostfile
*/
def k8s_cordon_drain() {
def maxTries = 3 // the maximum number of times to retry the kubectl commands
def sleepTime = 5 * 1000 // the amount of time to wait between retries (in milliseconds)
def filename = '$SACHIN_HOME/manual/hostfile'
def content = readFile(filename)
def hosts = content.readLines().collect { it.split()[0] }
println "List of Hostnames to be cordoned from K8s cluster: ${hosts}"
hosts.each { host ->
def command1 = "kubectl cordon $host"
def command2 = "kubectl drain --ignore-daemonsets --grace-period=2400 $host"
def tries = 0
def result1 = null
def result2 = null
while (tries < maxTries) {
result1 = sh(script: command1, returnStatus: true)
if (result1 == 0) {
println "Successfully cordoned $host"
break
} else {
tries++
println "Failed to cordoned $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."
sleep(sleepTime)
}
}
if (result1 == 0) {
tries = 0
while (tries < maxTries) {
result2 = sh(script: command2, returnStatus: true)
if (result2 == 0) {
println "Successfully drained $host"
break
} else {
tries++
println "Failed to drain $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."
sleep(sleepTime)
}
}
}
if (result2 != 0) {
println "Failed to drain $host after $maxTries attempts"
}
}
}
/*
* CI - Kubernetes cluster : This function will uncordon the worker nodes in hostfile
*/
def k8s_uncordon() {
def maxTries = 3 // the maximum number of times to retry the kubectl commands
def sleepTime = 5 * 1000 // the amount of time to wait between retries (in milliseconds)
def filename = '$SACHIN_HOME/manual/hostfile'
def content = readFile(filename)
def hosts = content.readLines().collect { it.split()[0] }
println "List of Hostnames to be uncordoned from K8s cluster: ${hosts}"
hosts.each { host ->
def command1 = "kubectl uncordon $host"
def tries = 0
def result1 = null
while (tries < maxTries) {
result1 = sh(script: command1, returnStatus: true)
if (result1 == 0) {
println "Successfully cordoned $host"
break
} else {
tries++
println "Failed to uncordon $host (attempt $tries/$maxTries), retrying in ${sleepTime/1000} seconds..."
sleep(sleepTime)
}
}
if (result1 != 0) {
println "Failed to uncordon $host after $maxTries attempts"
}
}
}
------------------Jenkins Console output for pipeline job -----------------
Started by user jenkins-admin
[Pipeline] Start of Pipeline
[Pipeline] node
Running on Kubernetes-master-node in $SACHIN_HOME/workspace/test_sample4_cordon_drain
[Pipeline] {
[Pipeline] stage
[Pipeline] { (1)
[Pipeline] sh
+ hostname
kubernetes-master-node
[Pipeline] sh
+ cat $SACHIN_HOME/manual//hostfile
Remotenode16 slots=4
Remotenode17 slots=4
[Pipeline] readFile
[Pipeline] echo
List of Hostnames to be cordoned from K8s cluster: [Remotenode16, Remotenode17]
[Pipeline] sh
+ kubectl cordon Remotenode16
node/Remotenode16 cordoned
[Pipeline] echo
Successfully cordoned Remotenode16
[Pipeline] sh
+ kubectl drain --ignore-daemonsets --grace-period=2400 Remotenode16
node/Remotenode16 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j749l, kube-system/fuse-device-plugin-daemonset-59lrp, kube-system/kube-proxy-v26k2, kube-system/nvidia-device-plugin-daemonset-w2k57, kube-system/rdma-shared-dp-ds-zdpfw, sys-monitor/prometheus-op-prometheus-node-exporter-rh4db
node/Remotenode16 drained
[Pipeline] echo
Successfully drained Remotenode16
[Pipeline] sh
+ kubectl cordon Remotenode17
node/Remotenode17 cordoned
[Pipeline] echo
Successfully cordoned Remotenode17
[Pipeline] sh
+ kubectl drain --ignore-daemonsets --grace-period=2400 Remotenode17
node/Remotenode17 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-hz5zh, kube-system/fuse-device-plugin-daemonset-dj72m, kube-system/kube-proxy-g87dc, kube-system/nvidia-device-plugin-daemonset-tk5x8, kube-system/rdma-shared-dp-ds-n4g5w, sys-monitor/prometheus-op-prometheus-node-exporter-gczmz
node/Remotenode17 drained
[Pipeline] echo
Successfully drained Remotenode17
[Pipeline] readFile
[Pipeline] echo
List of Hostnames to be uncordoned from K8s cluster: [Remotenode16, Remotenode17]
[Pipeline] sh
+ kubectl uncordon Remotenode16
node/Remotenode16 uncordoned
[Pipeline] echo
Successfully uncordoned Remotenode16
[Pipeline] sh
+ kubectl uncordon Remotenode17
node/Remotenode17 uncordoned
[Pipeline] echo
Successfully uncordoned Remotenode17
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: SUCCESS
-----------------------------------------------------------------
Reference: