Incident investigations are time-consuming and often happen during off-hours. When incidents happen at 2 AM, engineers lose sleep digging through Slack alerts, Prometheus metrics, and Splunk logs. What if an AI agent could do the heavy lifting: triage, investigate, and summarize the root cause? I built an AI agent that automates this process. LangGraph makes it simple to design agent workflows as nodes and flows, while GPT provides the reasoning power.

LangGraph is a framework built on top of LangChain that lets you create stateful, multi-step agents using a graph-based architecture. Unlike traditional chains, LangGraph lets you define nodes (functions or agents) and edges (transitions) to model complex workflows. Think of it like drawing a flowchart for your AI agent, where each box is a node (task) and the arrows represent the logic flow.

Node = step in your agent (e.g., "fetch metrics")
Edge = connection (e.g., "if anomaly detected → analyze logs")
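
An edge such as "if anomaly detected → analyze logs" boils down to a small routing function that looks at the state and names the next node. The sketch below is plain Python and makes no LangGraph calls; the cpu_percent key, the 90% threshold, and the node names are assumptions made up for illustration. In LangGraph, a function like this would typically be handed to add_conditional_edges.

```python
# Hypothetical routing function for a conditional edge.
# The "cpu_percent" key, the threshold, and the node names are illustrative assumptions.
CPU_ANOMALY_THRESHOLD = 90.0


def route_after_metrics(state: dict) -> str:
    """Return the name of the next node based on the fetched metrics."""
    cpu = state.get("cpu_percent")
    if cpu is not None and cpu >= CPU_ANOMALY_THRESHOLD:
        return "analyze_logs"  # anomaly detected: dig into the logs
    return "close_alert"       # nothing unusual: skip the log analysis
```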

Here’s the problem we’re solving:
Alerts come in via Slack
We need to query Prometheus (metrics) + Splunk (logs)
Generate an investigation report
Share root cause insights automatically

Step 1: Define the Flow (Nodes)

We design our agent as a graph of nodes:

Slack Listener Node → listens for alerts in real-time
Prometheus Query Node → fetches system metrics
Splunk Query Node → retrieves log entries
LLM Analysis Node (GPT-4o) → correlates signals
Summary & Report Node → generates incident summary
Slack Notifier Node → posts root cause back to Slack

Step 2: How the Flow Works

When a Slack alert is received → trigger the workflow
Prometheus Node and Splunk Node run in parallel (fetching metrics & logs)
LLM Node takes this raw data and performs correlation reasoning
Report Node structures it into a human-readable summary

Finally, Slack Node posts results back to the team
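
The parallel step can be pictured as two independent fetches whose partial results are merged into one shared state. The sketch below is a plain-Python stand-in using concurrent.futures, not a description of how LangGraph schedules nodes internally; both fetch functions are hypothetical placeholders that return canned data.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_metrics(alert: str) -> dict:
    # Placeholder for a real Prometheus query.
    return {"metrics": "CPU 95% for last 10 min"}


def fetch_logs(alert: str) -> dict:
    # Placeholder for a real Splunk search.
    return {"logs": "Error: Timeout connecting to DB"}


def gather_signals(alert: str) -> dict:
    """Run both fetches concurrently and merge their partial updates."""
    state = {"alert": alert}
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fetch_metrics, alert), pool.submit(fetch_logs, alert)]
        for future in futures:
            state.update(future.result())  # each fetch writes a different key
    return state
```

Because each fetch writes its own key, the merge order does not matter. The same idea keeps parallel graph nodes from overwriting each other.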

                +----------------+
                | Slack Alerts   |
                +----------------+
                        ↓
                +----------------+
                | Slack Listener |
                +----------------+
                  ↓            ↓
+------------------+          +------------------+
| Prometheus Query |          | Splunk Query     |
+------------------+          +------------------+
                  ↓            ↓
               +------------------+
               | GPT-4o Analysis  |
               +------------------+
                        ↓
               +------------------+
               | Report Generator |
               +------------------+
                        ↓
               +------------------+
               | Slack Notifier   |
               +------------------+

Step 3: LangGraph Code (POC 1)

Here’s a simplified version for beginners:

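from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

# 1. Define LLM (reads OPENAI_API_KEY from the environment)
llm = ChatOpenAI(model="gpt-4o")

# 2. State definition: every key a node may write
class AgentState(TypedDict, total=False):
    alert: str
    metrics: str
    logs: str
    analysis: str
    notified: bool

# 3. Nodes (functions). Each node returns only the keys it updates, so the
#    two parallel query nodes never write to the same key in one step.
def slack_listener(state: AgentState):
    # Stubbed alert; a real listener would receive this from Slack
    return {"alert": "High CPU usage on server-123"}

def query_prometheus(state: AgentState):
    # Stubbed metrics; replace with a real Prometheus query
    return {"metrics": "CPU 95% for last 10 min"}

def query_splunk(state: AgentState):
    # Stubbed logs; replace with a real Splunk search
    return {"logs": "Error: Timeout connecting to DB"}

def analyze_with_llm(state: AgentState):
    response = llm.invoke(f"""
        Alert: {state['alert']}
        Metrics: {state['metrics']}
        Logs: {state['logs']}
        Please find likely root cause and suggest action.
    """)
    # invoke() returns a message object; the text is in .content
    return {"analysis": response.content}

def slack_notifier(state: AgentState):
    # Prints instead of posting, to keep the POC self-contained
    print("Incident Report to Slack:")
    print(state["analysis"])
    return {"notified": True}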

# 4. Build Graph
workflow = StateGraph(AgentState)
workflow.add_node("slack_listener", slack_listener)
workflow.add_node("query_prometheus", query_prometheus)
workflow.add_node("query_splunk", query_splunk)
workflow.add_node("analyze_with_llm", analyze_with_llm)
workflow.add_node("slack_notifier", slack_notifier)

# 5. Define edges (flow)
workflow.set_entry_point("slack_listener")
workflow.add_edge("slack_listener", "query_prometheus")
workflow.add_edge("slack_listener", "query_splunk")
workflow.add_edge("query_prometheus", "analyze_with_llm")
workflow.add_edge("query_splunk", "analyze_with_llm")
workflow.add_edge("analyze_with_llm", "slack_notifier")
workflow.add_edge("slack_notifier", END)

# 6. Compile & run
agent = workflow.compile()
agent.invoke({})
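
In the POC the notifier only prints the analysis. As a next step, the state could be turned into a readable message before it is sent. The sketch below shows one possible formatter; build_slack_message is a hypothetical helper name and the layout is just an assumption, not a Slack requirement.

```python
def build_slack_message(state: dict) -> str:
    """Format the investigation state as a plain-text Slack message."""
    lines = [
        "*Incident report*",
        f"Alert: {state.get('alert', 'n/a')}",
        f"Metrics: {state.get('metrics', 'n/a')}",
        f"Logs: {state.get('logs', 'n/a')}",
        "",
        "Analysis:",
        state.get("analysis", "No analysis available."),
    ]
    return "\n".join(lines)
```

The returned string could then be passed as the text argument of chat_postMessage on a slack_sdk WebClient, which needs a bot token and a channel and is left out here.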

------------

This POC shows how simple it is to build your own AI agent using LangGraph:
Just define nodes as functions
Connect them with edges
Let GPT handle the reasoning
From here, you can expand:
Add ticket creation in Jira
Add automated remediation scripts
Scale to multi-agent workflows

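Install dependencies:

pip install -r requirements.txt

or install the packages directly (langchain-openai provides the ChatOpenAI class used above):

pip install langgraph langchain-openai openai slack_sdk prometheus-api-client splunk-sdk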

---------------------------------------------------

You can also design the flow a bit differently, as shown in this alternative way of building the LangGraph workflow.

Each node is a LangGraph function or GPT-4o-powered agent. The transitions are based on the success/failure of each step.

--------------------------

Define nodes: Each node is a Python function. Example: Query Prometheus

----------

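import os

from prometheus_api_client import PrometheusConnect

def query_prometheus(alert_data):
    # PROMETHEUS_URL must be set in the environment;
    # disable_ssl=True skips TLS certificate verification
    prom = PrometheusConnect(url=os.getenv("PROMETHEUS_URL"), disable_ssl=True)
    # "cpu_usage" is a placeholder metric name
    metric_data = prom.get_current_metric_value(metric_name="cpu_usage")
    return {"metrics": metric_data}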
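
The raw result of an instant query is usually too noisy to paste into a prompt. The sketch below condenses it, assuming the samples follow the standard Prometheus instant-vector shape (a "metric" label dict plus a [timestamp, "value"] pair). summarize_metric and the 90.0 threshold are hypothetical names and numbers chosen for illustration.

```python
def summarize_metric(samples: list[dict], threshold: float = 90.0) -> dict:
    """Reduce instant-vector samples to a compact summary for the LLM prompt."""
    values = {}
    for sample in samples:
        instance = sample["metric"].get("instance", "unknown")
        _timestamp, raw_value = sample["value"]  # the value arrives as a string
        values[instance] = float(raw_value)
    over = sorted(name for name, v in values.items() if v >= threshold)
    return {"values": values, "over_threshold": over}
```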

------------

Build the Graph

------------

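from langgraph.graph import StateGraph, END

# StateGraph requires a state schema. AgentState is a state class like the one
# in POC 1, extended with whatever keys these nodes write.
graph = StateGraph(AgentState)

# parse_alert, query_splunk, analyze_data, summarize_root_cause and post_to_slack
# are node functions written in the same style as query_prometheus above
graph.add_node("parse_alert", parse_alert)
graph.add_node("query_prometheus", query_prometheus)
graph.add_node("query_splunk", query_splunk)
graph.add_node("analyze", analyze_data)
graph.add_node("summarize", summarize_root_cause)
graph.add_node("post_slack", post_to_slack)

graph.set_entry_point("parse_alert")

# Fan out to both queries, then join in the analysis node
graph.add_edge("parse_alert", "query_prometheus")
graph.add_edge("parse_alert", "query_splunk")
graph.add_edge("query_prometheus", "analyze")
graph.add_edge("query_splunk", "analyze")
graph.add_edge("analyze", "summarize")
graph.add_edge("summarize", "post_slack")
graph.add_edge("post_slack", END)

agent = graph.compile()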
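
The graph above references parse_alert without showing it. One possible version is sketched below, assuming the alert arrives as a plain string such as "High CPU usage on server-123". The "<summary> on <host>" pattern and the alert_summary and host keys are assumptions for illustration, and they would need to be declared in the state schema.

```python
import re

# Assumed alert format: "<summary> on <host>"
ALERT_PATTERN = re.compile(r"^(?P<summary>.+?) on (?P<host>[\w.-]+)$")


def parse_alert(state: dict) -> dict:
    """Split the raw alert text into a summary and a host name."""
    text = state.get("alert", "").strip()
    match = ALERT_PATTERN.match(text)
    if match is None:
        # Unknown format: keep the raw text so later nodes can still use it.
        return {"alert_summary": text, "host": None}
    return {"alert_summary": match["summary"], "host": match["host"]}
```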

--------------------------------------------------

Running the Agent

agent.invoke({"alert": slack_alert_data})

-----------------------------------------------------

POC 2: Security Breach Detection Agent with LangGraph + GPT

Security teams spend countless hours scanning logs for suspicious login attempts: failed SSH connections, brute force attacks, or abnormal geolocation logins. What if an AI agent could detect login anomalies, analyze the logs, and automatically alert the team on Slack with root cause insights? Let’s build that with LangGraph + OpenAI GPT-4o.

Use Case Overview

Problem: Multiple suspicious logins are detected on a server, but admins often get overwhelmed by raw log alerts.
Solution: Create an AI Agent flow to:
1. Monitor server logs for abnormal login attempts
2. Correlate failed login data (IPs, frequency, geolocation)
3. Ask GPT to determine if it’s a brute force attack, unusual login, or benign
4. Generate a clear summary with root cause analysis
5. Send the summary to Slack Security Channel

Agent Flow (Nodes)

Log Monitor Node → listens to /var/log/auth.log or SIEM events
Anomaly Detector Node → extracts suspicious login attempts (e.g., >5 failed SSH logins from same IP)
GeoIP Lookup Node → enriches IP with geolocation info
LLM Analysis Node (GPT-4o) → determines likelihood of attack and explains root cause
Slack Notifier Node → sends human-readable incident report to security team

How the Flow Works

Input: System log entries (/var/log/auth.log)
Processing: Detect multiple failed login attempts, enrich data with GeoIP lookup
Reasoning: LLM correlates and explains possible root cause (e.g., brute-force attempt from overseas IP)
Output: Slack notification with analysis & recommended action
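
Before enriching an address with GeoIP data, it can help to check that the value is a public IP at all, since private and reserved ranges carry no meaningful location. The sketch below uses only the standard ipaddress module; should_lookup_geoip is a hypothetical helper name, not part of any GeoIP library.

```python
import ipaddress


def should_lookup_geoip(ip_text: str) -> bool:
    """Return True only for syntactically valid, globally routable addresses."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # not an IP address at all, e.g. "N/A"
    return address.is_global
```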

LangGraph Implementation

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

# Define LLM
llm = ChatOpenAI(model="gpt-4o")

# Agent State
class AgentState(dict):
    pass

# 1. Log Monitor Node
def log_monitor(state: AgentState):
    # Example logs (in real case, parse /var/log/auth.log)
    state["logs"] = [
        "Failed password for root from 203.0.113.25 port 54321 ssh2",
        "Failed password for root from 203.0.113.25 port 54322 ssh2",
        "Failed password for root from 203.0.113.25 port 54323 ssh2",
        "Failed password for root from 203.0.113.25 port 54324 ssh2",
        "Failed password for root from 203.0.113.25 port 54325 ssh2",
    ]
    return state

# 2. Anomaly Detector Node
def detect_anomaly(state: AgentState):
    failed_attempts = len(state["logs"])
    if failed_attempts > 3:
        state["suspicious_ip"] = "203.0.113.25"
        state["anomaly"] = f"Detected {failed_attempts} failed logins from {state['suspicious_ip']}"
    else:
        state["anomaly"] = "No anomaly detected"
    return state

# 3. GeoIP Lookup Node (simulated)
def geoip_lookup(state: AgentState):
    # Fake GeoIP lookup for example
    geo_info = {"ip": state.get("suspicious_ip", "N/A"), "country": "Russia", "asn": "AS12345"}
    state["geoip"] = geo_info
    return state

# 4. LLM Analysis Node
def analyze_with_llm(state: AgentState):
    prompt = f"""
    Security Alert:
    Logs: {state['logs']}
    Anomaly: {state['anomaly']}
    GeoIP Info: {state['geoip']}
    
    Please analyze the root cause.
    Is this a brute force attack, suspicious login, or benign activity?
    Suggest next action.
    """
    # invoke() returns a message object; the text is in .content
    response = llm.invoke(prompt)
    state["analysis"] = response.content
    return state

# 5. Slack Notifier Node
def slack_notifier(state: AgentState):
    print("Security Incident Report to Slack:")
    print(state["analysis"])
    return state

# Build Workflow
workflow = StateGraph(AgentState)
workflow.add_node("log_monitor", log_monitor)
workflow.add_node("detect_anomaly", detect_anomaly)
workflow.add_node("geoip_lookup", geoip_lookup)
workflow.add_node("analyze_with_llm", analyze_with_llm)
workflow.add_node("slack_notifier", slack_notifier)

# Define Flow
workflow.set_entry_point("log_monitor")
workflow.add_edge("log_monitor", "detect_anomaly")
workflow.add_edge("detect_anomaly", "geoip_lookup")
workflow.add_edge("geoip_lookup", "analyze_with_llm")
workflow.add_edge("analyze_with_llm", "slack_notifier")
workflow.add_edge("slack_notifier", END)

# Run Workflow
app = workflow.compile()
app.invoke({})
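
The detector above hardcodes the suspicious IP and counts every log line. A sketch of deriving both from the logs themselves is shown below, assuming the usual sshd "Failed password for ... from <ip> port <n>" line format. find_suspicious_ips and its default threshold are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Matches sshd lines such as:
#   Failed password for root from 203.0.113.25 port 54321 ssh2
#   Failed password for invalid user admin from 198.51.100.7 port 40000 ssh2
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def find_suspicious_ips(log_lines: list[str], threshold: int = 3) -> dict[str, int]:
    """Return source IPs with more than `threshold` failed logins."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n > threshold}
```

The detect_anomaly node could call this helper on state["logs"] and store the result instead of a fixed address.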

Block Diagram

+------------------+        +-------------------+
| Auth Logs (/var) | -----> | Log Monitor Node  |
+------------------+        +-------------------+
                                  ↓
                          +-------------------+
                          | Anomaly Detector  |
                          +-------------------+
                                  ↓
                          +-------------------+
                          | GeoIP Lookup Node |
                          +-------------------+
                                  ↓
                          +-------------------+
                          | GPT Analysis Node |
                          +-------------------+
                                  ↓
                          +-------------------+
                          | Slack Notifier    |
                          +-------------------+

Benefits

Detects brute force attacks in real time
Provides context (IP, country, ASN, frequency)
Generates human-readable summary for faster decision-making
Alerts team on Slack in seconds

With just a few nodes and flows in LangGraph, you’ve created a security incident investigation assistant—a perfect POC for security teams.

LangGraph makes it easy to build modular, scalable, and intelligent agents that mirror real-world workflows. Combined with GPT-4o’s reasoning power, you can automate even complex tasks like incident investigations.

LINUX & HPC : Advanced Large Scale Computing at a Glance !

Sunday, September 7, 2025

Automating Incident Investigations with LangGraph and OpenAI GPT
