Incident investigations are time-consuming and often happen during off-hours. When an incident hits at 2 AM, engineers lose sleep digging through Slack alerts, Prometheus metrics, and Splunk logs. What if an AI agent could do the heavy lifting: triage, investigate, and summarize the root cause? I built an AI agent that automates this process. LangGraph makes it simple to design agent workflows as nodes and flows, while GPT provides the reasoning power.
LangGraph is a framework built on top of LangChain that lets you create stateful, multi-step agents using a graph-based architecture. Unlike traditional chains, LangGraph lets you define nodes (functions or agents) and edges (transitions) to model and orchestrate complex workflows. Think of it like drawing a flowchart for your AI agent, where each box is a node (task) and the arrows represent the logic flow.
- Node = step in your agent (e.g., "fetch metrics")
- Edge = connection (e.g., "if anomaly detected → analyze logs")
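To make the node/edge idea concrete before bringing in LangGraph, here is a minimal, library-free sketch. It is not how LangGraph works internally; the names `NODES`, `EDGES`, and `run` are made up for illustration, and the anomaly threshold of 90 is an arbitrary assumption.

```python
# Nodes are plain functions that take the shared state and return updates to it.
def fetch_metrics(state):
    return {"cpu_percent": 95}  # hard-coded sample value

def analyze_logs(state):
    return {"finding": "DB timeouts during the CPU spike"}  # hard-coded sample value

NODES = {"fetch_metrics": fetch_metrics, "analyze_logs": analyze_logs}

# Edges decide which node runs next; None means the flow is finished.
EDGES = {
    "fetch_metrics": lambda s: "analyze_logs" if s["cpu_percent"] > 90 else None,
    "analyze_logs": lambda s: None,
}

def run(entry, state):
    """Walk the graph from `entry`, merging each node's updates into the state."""
    current = entry
    while current is not None:
        state = {**state, **NODES[current](state)}
        current = EDGES[current](state)
    return state

result = run("fetch_metrics", {})
print(result)
```

LangGraph adds the parts this toy version lacks, such as a typed state schema, parallel branches, and checkpointing, but the mental model of functions connected by transitions is the same.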
Here’s the problem we’re solving:
- Alerts come in via Slack
- We need to query Prometheus (metrics) + Splunk (logs)
- Generate an investigation report
- Share root cause insights automatically
Step 1: Define the Flow (Nodes)
We design our agent as a graph of nodes:
Slack Listener Node → listens for alerts in real-time
Prometheus Query Node → fetches system metrics
Splunk Query Node → retrieves log entries
LLM Analysis Node (GPT-4o) → correlates signals
Summary & Report Node → generates incident summary
Slack Notifier Node → posts root cause back to Slack
Step 2: How the Flow Works
When a Slack alert is received → trigger the workflow
Prometheus Node and Splunk Node run in parallel (fetching metrics & logs)
LLM Node takes this raw data and performs correlation reasoning
Report Node structures it into a human-readable summary
Finally, Slack Node posts results back to the team
+----------------+
|  Slack Alerts  |
+----------------+
        ↓
+----------------+       +------------------+
| Slack Listener | ----→ | Prometheus Query |
+----------------+       +------------------+
        ↓                         ↓
        └─────────────→  +------------------+
                         |   Splunk Query   |
                         +------------------+
                                  ↓
                         +------------------+
                         | GPT-4o Analysis  |
                         +------------------+
                                  ↓
                         +------------------+
                         | Report Generator |
                         +------------------+
                                  ↓
                         +------------------+
                         |  Slack Notifier  |
                         +------------------+
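The fan-out in the flow above (metrics and logs fetched at the same time, then merged for the analysis step) can be sketched with the standard library alone. This only illustrates the idea and is not LangGraph's scheduler; `fetch_metrics`, `fetch_logs`, and `gather_signals` are hypothetical stand-ins that return canned data.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_metrics(alert):
    # Stand-in for a Prometheus query
    return {"metrics": "CPU 95% for last 10 min"}

def fetch_logs(alert):
    # Stand-in for a Splunk search
    return {"logs": "Error: Timeout connecting to DB"}

def gather_signals(alert):
    """Run both fetchers concurrently and merge their results into one dict."""
    merged = {"alert": alert}
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fn, alert) for fn in (fetch_metrics, fetch_logs)]
        for future in futures:
            merged.update(future.result())
    return merged

signals = gather_signals("High CPU usage on server-123")
print(signals)
```

Each fetcher writes a different key, so the merge never has to resolve a conflict. The same property matters in LangGraph when two nodes run in the same step.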
Step 3: LangGraph Code (POC 1)
Here’s a simplified version for beginners:
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

# 1. Define LLM (reads OPENAI_API_KEY from the environment)
llm = ChatOpenAI(model="gpt-4o")

# 2. State definition: one key per signal, so the parallel nodes never write the same key
class AgentState(TypedDict, total=False):
    alert: str
    metrics: str
    logs: str
    analysis: str

# 3. Nodes (functions): each returns only the keys it updates
def slack_listener(state: AgentState):
    # Simulated alert; a real listener would read the Slack event payload
    return {"alert": "High CPU usage on server-123"}

def query_prometheus(state: AgentState):
    # Simulated metric; replace with a real Prometheus query
    return {"metrics": "CPU 95% for last 10 min"}

def query_splunk(state: AgentState):
    # Simulated log line; replace with a real Splunk search
    return {"logs": "Error: Timeout connecting to DB"}

def analyze_with_llm(state: AgentState):
    response = llm.invoke(f"""
Alert: {state['alert']}
Metrics: {state['metrics']}
Logs: {state['logs']}
Please find likely root cause and suggest action.
""")
    # invoke() returns a message object; the text is in .content
    return {"analysis": response.content}

def slack_notifier(state: AgentState):
    # Prints instead of posting, to keep the POC self-contained
    print("Incident Report to Slack:")
    print(state["analysis"])
    return state

# 4. Build Graph (node names must match the names used in the edges exactly)
workflow = StateGraph(AgentState)
workflow.add_node("slack_listener", slack_listener)
workflow.add_node("query_prometheus", query_prometheus)
workflow.add_node("query_splunk", query_splunk)
workflow.add_node("analyze_with_llm", analyze_with_llm)
workflow.add_node("slack_notifier", slack_notifier)

# 5. Define edges (flow): the two query nodes fan out from the listener and rejoin at the analysis
workflow.set_entry_point("slack_listener")
workflow.add_edge("slack_listener", "query_prometheus")
workflow.add_edge("slack_listener", "query_splunk")
workflow.add_edge("query_prometheus", "analyze_with_llm")
workflow.add_edge("query_splunk", "analyze_with_llm")
workflow.add_edge("analyze_with_llm", "slack_notifier")
workflow.add_edge("slack_notifier", END)

# 6. Compile & run
agent = workflow.compile()
# Seed the run with a placeholder alert (slack_listener overwrites it);
# some LangGraph versions reject an empty input dict
agent.invoke({"alert": ""})
------------
This POC shows how simple it is to build your own AI agent using LangGraph:
- Just define nodes as functions
- Connect them with edges
- Let GPT handle the reasoning
From here, you can expand:
- Add ticket creation in Jira
- Add automated remediation scripts
- Scale to multi-agent workflows
Install the dependencies, either from a requirements file or directly:
pip install -r requirements.txt
pip install langgraph langchain-openai openai slack_sdk prometheus-api-client splunk-sdk
---------------------------------------------------
You can also design the flow a bit differently, as shown in the following walkthrough of building the LangGraph.
Each node is a LangGraph function or a GPT-4o-powered agent. The transitions are based on the success or failure of each step.
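One way to express success/failure transitions is a small routing function that inspects the state and returns the name of the next node. LangGraph's `add_conditional_edges` accepts a function of this shape. The `error` key and the `report_failure` node below are assumptions made for this sketch, not part of the original flow.

```python
def route_after_query(state: dict) -> str:
    """Pick the next node based on whether the query step succeeded."""
    if state.get("error"):
        return "report_failure"  # hypothetical node that reports the failed step
    if not state.get("metrics"):
        return "report_failure"  # the query worked but returned nothing to analyze
    return "analyze"

# Wiring it into a graph would look roughly like this, in place of a fixed edge:
# graph.add_conditional_edges("query_prometheus", route_after_query)

print(route_after_query({"metrics": [{"value": 0.95}]}))
print(route_after_query({"error": "connection refused"}))
```

Keeping the routing decision in a plain function means it can be unit-tested without running the graph or calling the LLM.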
--------------------------
Define nodes: Each node is a Python function. Example: Query Prometheus
----------
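import os
from prometheus_api_client import PrometheusConnect

def query_prometheus(alert_data):
    # PROMETHEUS_URL must be set in the environment; disable_ssl=True skips TLS verification
    prom = PrometheusConnect(url=os.getenv("PROMETHEUS_URL"), disable_ssl=True)
    # "cpu_usage" is a placeholder; use a metric name your Prometheus actually exposes
    metric_data = prom.get_current_metric_value(metric_name="cpu_usage")
    return {"metrics": metric_data}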
------------
Build the Graph
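from typing import TypedDict
from langgraph.graph import StateGraph, END

# StateGraph needs a state schema. This one is an assumption;
# adjust the keys to whatever your node functions return.
class InvestigationState(TypedDict, total=False):
    alert: dict
    metrics: list
    logs: list
    analysis: str
    summary: str

# parse_alert, query_splunk, analyze_data, summarize_root_cause and post_to_slack
# are node functions written in the same style as query_prometheus.
graph = StateGraph(InvestigationState)
graph.add_node("parse_alert", parse_alert)
graph.add_node("query_prometheus", query_prometheus)
graph.add_node("query_splunk", query_splunk)
graph.add_node("analyze", analyze_data)
graph.add_node("summarize", summarize_root_cause)
graph.add_node("post_slack", post_to_slack)
graph.set_entry_point("parse_alert")
graph.add_edge("parse_alert", "query_prometheus")
graph.add_edge("parse_alert", "query_splunk")
graph.add_edge("query_prometheus", "analyze")
graph.add_edge("query_splunk", "analyze")
graph.add_edge("analyze", "summarize")
graph.add_edge("summarize", "post_slack")
graph.add_edge("post_slack", END)
agent = graph.compile()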
--------------------------------------------------
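Running the Agent
Call agent.invoke() with the initial alert payload as a dict; it returns the final state once every node has run.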
-----------------------------------------------------
POC 2: Security Breach Detection Agent with LangGraph + GPT
Security teams spend countless hours scanning logs for suspicious login attempts: failed SSH connections, brute force attacks, or logins from abnormal geolocations. What if an AI agent could detect login anomalies, analyze the logs, and automatically alert on Slack with root cause insights? Let’s build that with LangGraph + OpenAI GPT-4o.
Use Case Overview
Problem: Multiple suspicious logins are detected on a server, but admins often get overwhelmed by raw log alerts.
Solution: Create an AI Agent flow to:
Monitor server logs for abnormal login attempts
Correlate failed login data (IPs, frequency, geolocation)
Ask GPT to determine if it’s a brute force attack, unusual login, or benign
Generate a clear summary with root cause analysis
Send the summary to Slack Security Channel
Agent Flow (Nodes)
Log Monitor Node → listens to /var/log/auth.log or SIEM events
Anomaly Detector Node → extracts suspicious login attempts (e.g., >5 failed SSH logins from same IP)
GeoIP Lookup Node → enriches IP with geolocation info
LLM Analysis Node (GPT-4o) → determines likelihood of attack and explains root cause
Slack Notifier Node → sends human-readable incident report to security team
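The anomaly rule described above (more than 5 failed SSH logins from the same IP) can be sketched with a regular expression and a counter. This is a minimal illustration that assumes the usual OpenSSH "Failed password" line format; `suspicious_ips` and the sample lines are made up for the example.

```python
import re
from collections import Counter

# Matches OpenSSH lines such as:
# "Failed password for root from 203.0.113.25 port 54321 ssh2"
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)

def suspicious_ips(lines, threshold=5):
    """Return {ip: count} for IPs with more than `threshold` failed logins."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return {ip: n for ip, n in counts.items() if n > threshold}

# Six failures from one IP, one failure from another, and one successful login
sample = [
    f"Failed password for root from 203.0.113.25 port {54321 + i} ssh2"
    for i in range(6)
]
sample.append("Failed password for invalid user admin from 198.51.100.7 port 40000 ssh2")
sample.append("Accepted password for deploy from 192.0.2.10 port 50000 ssh2")

print(suspicious_ips(sample))
```

Counting per IP rather than counting all failed lines keeps a handful of unrelated typos from different users from being reported as one attack.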
How the Flow Works
Input: System log entries (/var/log/auth.log)
Processing: Detect multiple failed login attempts, enrich data with GeoIP lookup
Reasoning: LLM correlates and explains possible root cause (e.g., brute-force attempt from overseas IP)
Output: Slack notification with analysis & recommended action
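For the output step, Slack incoming webhooks accept a JSON body with a `text` field. The sketch below only builds that body and does not send it; `build_slack_payload` is a hypothetical helper, and the sample values are placeholders in the same spirit as the simulated data used in this post.

```python
import json

def build_slack_payload(anomaly, geoip, analysis):
    """Format the incident as the JSON body for a Slack incoming webhook."""
    text = (
        ":rotating_light: *Security incident*\n"
        f"*Anomaly:* {anomaly}\n"
        f"*Source:* {geoip['ip']} ({geoip['country']}, {geoip['asn']})\n"
        f"*Analysis:* {analysis}"
    )
    return json.dumps({"text": text})

payload = build_slack_payload(
    anomaly="Detected 6 failed logins from 203.0.113.25",
    geoip={"ip": "203.0.113.25", "country": "Russia", "asn": "AS12345"},
    analysis="Likely brute force attempt against root. Block the IP and disable root SSH login.",
)
print(payload)
```

Sending it is then a single HTTP POST of `payload` with `Content-Type: application/json` to the webhook URL, which should be kept in an environment variable or secret store rather than in the code.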
LangGraph Implementation
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

# Define LLM (reads OPENAI_API_KEY from the environment)
llm = ChatOpenAI(model="gpt-4o")

# Agent State: the keys the nodes read and write
class AgentState(TypedDict, total=False):
    logs: list
    suspicious_ip: str
    anomaly: str
    geoip: dict
    analysis: str

# 1. Log Monitor Node
def log_monitor(state: AgentState):
    # Example logs (in real case, parse /var/log/auth.log)
    return {"logs": [
        "Failed password for root from 203.0.113.25 port 54321 ssh2",
        "Failed password for root from 203.0.113.25 port 54322 ssh2",
        "Failed password for root from 203.0.113.25 port 54323 ssh2",
        "Failed password for root from 203.0.113.25 port 54324 ssh2",
        "Failed password for root from 203.0.113.25 port 54325 ssh2",
    ]}

# 2. Anomaly Detector Node
def detect_anomaly(state: AgentState):
    failed_attempts = len(state["logs"])
    # Demo threshold of 3 so the five sample lines trigger the alert
    if failed_attempts > 3:
        # Hard-coded for the demo; a real detector would extract the IP from the log lines
        suspicious_ip = "203.0.113.25"
        return {
            "suspicious_ip": suspicious_ip,
            "anomaly": f"Detected {failed_attempts} failed logins from {suspicious_ip}",
        }
    return {"anomaly": "No anomaly detected"}

# 3. GeoIP Lookup Node (simulated)
def geoip_lookup(state: AgentState):
    # Fake GeoIP lookup for example
    geo_info = {"ip": state.get("suspicious_ip", "N/A"), "country": "Russia", "asn": "AS12345"}
    return {"geoip": geo_info}

# 4. LLM Analysis Node
def analyze_with_llm(state: AgentState):
    prompt = f"""
Security Alert:
Logs: {state['logs']}
Anomaly: {state['anomaly']}
GeoIP Info: {state['geoip']}
Please analyze the root cause.
Is this a brute force attack, suspicious login, or benign activity?
Suggest next action.
"""
    # invoke() returns a message object; the text is in .content
    response = llm.invoke(prompt)
    return {"analysis": response.content}

# 5. Slack Notifier Node
def slack_notifier(state: AgentState):
    # Prints instead of posting, to keep the POC self-contained
    print("Security Incident Report to Slack:")
    print(state["analysis"])
    return state

# Build Workflow (node names must match the names used in the edges exactly)
workflow = StateGraph(AgentState)
workflow.add_node("log_monitor", log_monitor)
workflow.add_node("detect_anomaly", detect_anomaly)
workflow.add_node("geoip_lookup", geoip_lookup)
workflow.add_node("analyze_with_llm", analyze_with_llm)
workflow.add_node("slack_notifier", slack_notifier)

# Define Flow
workflow.set_entry_point("log_monitor")
workflow.add_edge("log_monitor", "detect_anomaly")
workflow.add_edge("detect_anomaly", "geoip_lookup")
workflow.add_edge("geoip_lookup", "analyze_with_llm")
workflow.add_edge("analyze_with_llm", "slack_notifier")
workflow.add_edge("slack_notifier", END)

# Run Workflow
app = workflow.compile()
# Seed the run with an empty log list (log_monitor overwrites it);
# some LangGraph versions reject an empty input dict
app.invoke({"logs": []})
Block Diagram
+------------------+ +-------------------+
| Auth Logs (/var) | -----> | Log Monitor Node |
+------------------+ +-------------------+
↓
+-------------------+
| Anomaly Detector |
+-------------------+
↓
+-------------------+
| GeoIP Lookup Node |
+-------------------+
↓
+-------------------+
| GPT Analysis Node |
+-------------------+
↓
+-------------------+
| Slack Notifier |
+-------------------+
Benefits
- Detects brute force attacks in real time
- Provides context (IP, country, ASN, frequency)
- Generates human-readable summary for faster decision-making
- Alerts team on Slack in seconds
With just a few nodes and flows in LangGraph, you’ve created a security incident investigation assistant—a perfect POC for security teams.
LangGraph makes it easy to build modular, scalable, and intelligent agents that mirror real-world workflows. Combined with GPT-4o’s reasoning power, you can automate even complex tasks like incident investigations.