Monday, March 16, 2026

Claude Design uses Large Context Windows for Deeper Reasoning over RAG

Modern LLM systems typically choose between two ways of giving models the information they need: Retrieval Augmented Generation (RAG) or large context windows. The two approaches solve different problems.

Claude (Anthropic) in VS Code primarily uses a large context window plus agentic code exploration, not classic RAG by default. RAG can be added, but it is optional and external.

Claude in VS Code analyzes code using large context windows and active file exploration, because code benefits more from precise, agent‑driven inspection than from passive RAG retrieval.

1. Large context window as the primary mechanism

Claude Code relies on very large context windows to analyze code. The VS Code extension automatically provides Claude with:

  • Your currently open file
  • Selected text ranges
  • Files you explicitly reference using @file or line ranges
  • Project memory files like CLAUDE.md

This behavior is documented in the official Claude Code VS Code docs, which describe direct file visibility and context passing rather than retrieval pipelines. 

Claude models are explicitly designed to support hundreds of thousands of tokens, which makes “read the code directly” feasible without a retrieval layer.

2. Agentic search instead of passive RAG

Rather than pre‑indexing your repo into a vector database (classic RAG), Claude Code acts as an agent that:

  • Searches files
  • Reads only relevant sections
  • Iteratively explores the codebase

This design choice is highlighted in community and practitioner analyses describing Claude Code as active investigation instead of “dump everything into context” RAG. 

Examples of agentic behavior include:

  • Grep‑like searches
  • Targeted file reads
  • Incremental context building

This is fundamentally different from traditional RAG, which retrieves chunks blindly based on similarity.
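To make the contrast concrete, here is a minimal sketch of the search-then-read loop described above. It is not Claude Code's implementation; the function names (`grep_files`, `read_window`, `explore`) and the tiny sample repository are hypothetical and only illustrate grep-like search, targeted reads, and incremental context building.

```python
import os
import re
import tempfile


def grep_files(root, pattern, suffix=".py"):
    """Return (path, line_number, line) for every line that matches pattern."""
    regex = re.compile(pattern)
    hits = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if regex.search(line):
                        hits.append((path, number, line.rstrip("\n")))
    return hits


def read_window(path, line_number, radius=2):
    """Read only the lines around a hit instead of the whole file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        lines = handle.read().splitlines()
    start = max(0, line_number - 1 - radius)
    end = min(len(lines), line_number + radius)
    return lines[start:end]


def explore(root, symbol):
    """Build context incrementally: search first, then read targeted windows."""
    context = []
    pattern = r"\b" + re.escape(symbol) + r"\b"
    for path, number, _line in grep_files(root, pattern):
        context.append((os.path.basename(path), number, read_window(path, number)))
    return context


if __name__ == "__main__":
    # Hypothetical two-file repository created only for this demonstration.
    with tempfile.TemporaryDirectory() as repo:
        with open(os.path.join(repo, "billing.py"), "w", encoding="utf-8") as f:
            f.write("def compute_tax(amount):\n    return amount * 0.2\n")
        with open(os.path.join(repo, "app.py"), "w", encoding="utf-8") as f:
            f.write("from billing import compute_tax\n\nprint(compute_tax(100))\n")
        for name, number, window in explore(repo, "compute_tax"):
            print(name, number, window)
```

The search is exact on the identifier, so there is no index to build and nothing to go stale when the code changes.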

3. Why Anthropic chose this approach

Reason 1: Code is structured, not fuzzy text

Source code has:

  • Strong syntax
  • Explicit dependencies
  • Precise identifiers

Anthropic’s approach assumes it is better to search deterministically (file names, symbols, call paths) than rely on embedding similarity alone. 

Reason 2: Large context windows reduce RAG overhead

With large context windows:

  • Claude can read entire files when needed
  • No chunking or embedding errors
  • No stale indexes after code changes

This is reinforced by the existence of tooling that tracks context window usage, which suggests that Claude is designed to operate close to context limits rather than avoid them.

Reason 3: RAG is optional, not built‑in

RAG for Claude Code exists as external or community tools, not as a default feature. For example:

  • DevRAG and MCP‑based tools add vector search to Claude Code
  • These are explicitly framed as token‑saving optimizations, not core architecture. 

This strongly implies that Anthropic does not consider RAG mandatory for code understanding.

Summary comparison

Aspect | Claude Code (VS Code)
Default approach | Large context window + agentic exploration
Classic RAG | ❌ Not default
Vector DB indexing | ❌ Optional / external
File access | Direct, on demand
Context control | Explicit and visible
Best at | Deep, precise code reasoning

1. What RAG is used for

RAG is presented as a technique that combines:

  • Large Language Models
  • External data retrieval mechanisms such as vector databases, semantic search, and embeddings

This allows models to answer questions using external, up‑to‑date, and trusted data, rather than relying only on their training data. 

2. RAG vs large context windows

  • Are massive context windows replacing the need for RAG?

Long context windows allow more data to be passed directly into prompts, but they do not automatically eliminate the need for structured retrieval approaches like RAG.

3. Choosing the right approach

Rather than treating RAG as obsolete, weigh the following:

  • The decision depends on the application use case
  • Factors like data freshness, trustworthiness, and AI workflow design matter
  • RAG remains relevant in many enterprise scenarios 

RAG is not “dead”; it is one of several viable approaches, and the right choice depends on your data, accuracy needs, and LLM workflow design. 



Retrieval Augmented Generation vs Large Context Windows

Use cases and advantages

Modern LLM systems typically choose between two ways of giving models the information they need: Retrieval Augmented Generation (RAG) or large context windows. Both solve different problems and are often misunderstood as competing approaches. In practice, they are complementary.

What RAG is good at

Core idea

RAG augments an LLM with external knowledge retrieval at inference time. Instead of relying only on what the model remembers from training, the system fetches relevant documents from databases, wikis, PDFs, or logs and injects them into the prompt before generation.
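The retrieve-then-inject flow can be sketched in a few lines. This is a toy under stated assumptions: word-count vectors and cosine similarity stand in for learned embeddings and a vector database, and the documents, `retrieve`, and `build_prompt` are hypothetical names invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; a real system would index wikis, PDFs, or logs.
DOCS = [
    "Refunds are processed within 14 days of the return request.",
    "The VPN requires multi-factor authentication for all employees.",
    "Support runbooks are stored in the internal wiki.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents, top_k=1):
    """Inject the retrieved documents into the prompt before generation."""
    context = "\n".join("- " + doc for doc in retrieve(question, documents, top_k))
    return "Answer using only this context:\n" + context + "\n\nQuestion: " + question


if __name__ == "__main__":
    print(build_prompt("How long do refunds take?", DOCS))
```

Only the retrieved text reaches the model, which is why the quality of the ranking step matters so much in a RAG pipeline.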

RAG use cases

1. Enterprise knowledge assistants

RAG is ideal when answers must come from proprietary or fast‑changing data, such as:

  • Internal wikis
  • Product documentation
  • Support runbooks
  • Compliance and policy documents

The model retrieves the most relevant documents and generates grounded answers, reducing hallucinations.

2. Regulated and audit‑heavy environments

RAG supports traceability by attaching responses to source documents. This is critical in:

  • Legal research
  • Healthcare decision support
  • Financial compliance systems

Many RAG systems explicitly return citations or document references.

3. Dynamic and real‑time information

LLMs are static after training. RAG solves this by pulling:

  • Latest regulations
  • Updated pricing
  • Live operational data

This is why RAG is widely used in customer support, finance, and industrial operations.

Advantages of RAG

  1. Up‑to‑date knowledge
    The model can access information created after training without retraining.

  2. Reduced hallucinations
    Responses are grounded in retrieved documents rather than model memory alone. 

  3. Enterprise data isolation
    Sensitive internal data stays in your retrieval layer and does not become part of model training.

  4. Scales beyond context limits
    You do not need to fit all documents into the context window at once.

What large context windows are good at

Core idea

A large context window allows the model to see and reason over massive inputs directly, sometimes hundreds of thousands or even millions of tokens at once. 

Instead of retrieving small chunks, you load large sections or entire artifacts into the prompt.

Large context window use cases

1. Large codebase understanding

Large context models excel at:

  • Reading entire modules or repositories
  • Understanding cross‑file dependencies
  • Refactoring with global awareness

This is especially valuable for code analysis where structure and relationships matter more than fuzzy retrieval. 

2. Deep document analysis

Large context windows enable:

  • End‑to‑end reading of specifications
  • Full contract analysis
  • Research paper or RFC comprehension in one pass

This avoids chunking errors introduced by RAG pipelines. 

3. Long‑running reasoning and agent workflows

With large context:

  • Multi‑step reasoning stays coherent
  • The model remembers earlier constraints
  • No repeated retrieval calls are required

This is why agentic coding tools often prefer large context over classic RAG. 

Advantages of large context windows

  1. Holistic reasoning
    The model sees the entire artifact, enabling better global understanding.

  2. No retrieval errors
    There is no risk of missing relevant chunks due to poor embeddings or ranking.

  3. Simpler architecture
    No vector database, no indexing, no retriever tuning required.

  4. Better for structured data
    Code, configs, and logs benefit more from direct inspection than semantic similarity search.

RAG vs large context window 

Aspect | RAG | Large context window
Best for | Enterprise knowledge, docs, policies | Code, specs, deep analysis
Handles fresh data | Yes | No
Needs external systems | Yes | No
Risk of missing info | Possible | Low
Cost model | Retrieval + inference | Token heavy inference
Architecture complexity | Higher | Lower


  • Use RAG when correctness, freshness, and traceability matter.
  • Use large context windows when deep reasoning over structured artifacts like code is required.
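The two rules of thumb above can be turned into a rough decision helper. This is only a sketch: the 4-characters-per-token estimate and the 25% headroom reserved for instructions and output are assumptions chosen for illustration, and `choose_strategy` is a hypothetical name.

```python
import math


def choose_strategy(
    corpus_chars,
    context_window_tokens,
    needs_fresh_data=False,
    needs_citations=False,
    chars_per_token=4,      # assumed rough average, varies by tokenizer and content
    reserve_fraction=0.25,  # assumed headroom for instructions and the answer
):
    """Return 'rag' or 'large-context' for a corpus of the given size."""
    estimated_tokens = math.ceil(corpus_chars / chars_per_token)
    budget = int(context_window_tokens * (1 - reserve_fraction))
    if needs_fresh_data or needs_citations:
        return "rag"
    if estimated_tokens <= budget:
        return "large-context"
    return "rag"


if __name__ == "__main__":
    # About 100,000 estimated tokens of source code against two window sizes.
    print(choose_strategy(400_000, 128_000))
    print(choose_strategy(400_000, 1_000_000))
```

A real decision would also weigh per-request cost and latency, which this helper ignores.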

-------------------------------------------

  • Claude (Anthropic) supports up to 1 million tokens of context window in its latest generally available models.
  • IBM watsonx Granite models support a 128K token context window across the Granite 3.1 and newer Granite 3.x families.

Anthropic has expanded Claude’s context window significantly:

  • Claude Opus 4.6 and Claude Sonnet 4.6
    • Maximum context window: 1,000,000 tokens
    • This is generally available with no long‑context pricing premium
    • Applies to Claude Code, API usage, and supported cloud platforms

This is confirmed in Anthropic’s official documentation and announcements.

What this means in practice

  • Entire large codebases or monorepos can fit in a single session
  • Long‑running agentic workflows without frequent context compaction
  • Strong fit for context‑first code analysis over RAG

IBM has standardized the context length across the Granite family:

  • Granite 3.1 and Granite 3.3 models
    • Context window: 128,000 tokens
    • Applies to:
      • Granite 3.1 8B Instruct
      • Granite 3.1 2B
      • Granite 3.3 8B Instruct
      • Granite Guardian models
      • Granite Code models
    • Available in IBM watsonx.ai and open‑source releases

IBM explicitly states that all Granite 3.1 language models feature a 128K token context length.

What this means in practice

  • Suitable for long documents, enterprise policies, and medium‑sized repositories
  • Optimized for enterprise RAG pipelines
  • Strong balance between cost, performance, and governance

Model family | Max context window
Claude Opus 4.6 | 1,000,000 tokens
Claude Sonnet 4.6 | 1,000,000 tokens
IBM Granite 3.1 (all variants) | 128,000 tokens
IBM Granite 3.3 8B Instruct | 128,000 tokens

Claude

  • Optimized for context‑first and agentic workflows
  • Designed to analyze entire artifacts directly
  • Reduces dependency on RAG for code and reasoning tasks

Granite

  • Optimized for enterprise AI with governance
  • Designed to work with RAG and retrieval layers
  • Prioritizes cost control, explainability, and compliance

IBM explicitly positions Granite alongside RAG‑first architectures, including embedding models and document preprocessing frameworks like Docling. 

  • Claude enables large‑context‑first code analysis, which explains its architectural preference over RAG.
  • Granite intentionally caps context at 128K, encouraging retrieval‑based grounding for enterprise workloads.

In Granite models, “B” means billions of learned parameters. It measures model capacity, not context length or training data size. In model names like:

  • Granite 3.1 8B
  • Granite 3.1 2B
  • Granite Guardian 3.1 8B
  • Granite Code 3B
  • Granite 3.1 3B‑A800M (MoE)

the “B” stands for Billion parameters.

1B = 1 billion parameters

A parameter is a learned numerical weight inside the neural network that stores knowledge acquired during training.

For example, IBM explicitly states that:

  • Granite‑3.0‑8B‑Instruct is an 8‑billion‑parameter model 

Think of parameters as:

  • The knobs inside the model
  • Each knob controls how strongly the model connects concepts
  • Training adjusts billions of these knobs to encode language, code, and reasoning patterns

More parameters generally mean:

  • Higher reasoning capacity
  • Better pattern recognition
  • Better generalization

but also:

  • Higher memory usage
  • Higher compute cost

This definition is consistent across Granite, Llama, Claude, GPT, and other LLM families. 
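The memory cost of more parameters can be estimated with simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below covers weights only and ignores activations and the KV cache; the precision table and `weight_memory_gb` are illustrative names, not part of any Granite tooling.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(parameters, precision="fp16"):
    """Approximate memory for model weights in gigabytes (10**9 bytes)."""
    return parameters * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for label, count in (("2B", 2_000_000_000), ("8B", 8_000_000_000)):
        for precision in ("fp16", "int4"):
            print(label, precision, weight_memory_gb(count, precision), "GB")
```

By this estimate an 8B model needs about 16 GB for fp16 weights, while a 2B model needs about 4 GB, which is why smaller models are easier to deploy.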

Dense models

Model name | Meaning
Granite 3.1 2B | ~2 billion parameters
Granite 3.1 8B | ~8 billion parameters
Granite Guardian 3.1 8B | ~8 billion parameters
Granite Guardian 3.1 2B | ~2 billion parameters

These are dense transformer models, meaning all parameters are active for every token processed.

IBM confirms Granite 8B models are 8‑billion‑parameter dense decoder‑only transformers. [ibm.com]

What about MoE models like 3B‑A800M

Example: Granite 3.1 3B‑A800M. This is a Mixture‑of‑Experts (MoE) model.

Meaning

  • 3B = total parameters in the model
  • A800M = approximately 800 million active parameters per token

IBM documents that Granite MoE models activate only a subset of experts per inference step, reducing compute cost while maintaining capacity. 

Why this matters

  • MoE models scale capacity without linear cost
  • You get large model intelligence with lower inference overhead
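A toy sketch of the idea: a router scores the experts for each token and only the top few run. The gate scores and the choice of four experts with k=2 are made up for illustration and do not describe Granite's actual routing; the active fraction uses the 3B total and 800M active figures from the model name.

```python
def route_top_k(gate_scores, k=2):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(
        range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True
    )
    return sorted(ranked[:k])


def active_fraction(total_params, active_params):
    """Share of the model's parameters used for a single token."""
    return active_params / total_params


if __name__ == "__main__":
    # Hypothetical router output for one token across four experts.
    print(route_top_k([0.10, 0.70, 0.05, 0.15], k=2))
    # 3B total parameters, about 800M active per token.
    print(round(active_fraction(3_000_000_000, 800_000_000), 3))
```

Roughly a quarter of the parameters do work on any given token, which is where the lower inference cost comes from.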

Why all Granite models still have 128K context

Parameter count (B) and context window size are independent dimensions.

IBM explicitly states:

  • All Granite 3.1 dense, MoE, and Guardian models support a 128K token context window 

That means 

  • 2B vs 8B affects model intelligence
  • 128K affects how much text the model can read at once

Factor | What it impacts
More parameters | Better reasoning, coding, abstraction
Fewer parameters | Faster, cheaper, easier to deploy
MoE architecture | Better scaling efficiency
Context window | How much data can be processed per request

Proprietary frontier models

These dominate enterprise copilots, coding assistants, and research tools.

  1. OpenAI GPT series

    • Examples: GPT‑5.x
    • Known for strong reasoning, math, and general intelligence
    • Widely used in ChatGPT and enterprise APIs
  2. Anthropic Claude

    • Examples: Claude Opus 4.6, Sonnet 4.6
    • Strong in code analysis, safety, and long‑context reasoning
    • Notable for very large context windows 
  3. Google Gemini

    • Examples: Gemini 3 Pro, Gemini Flash
    • Multimodal first design with text, image, audio, and video
    • Tight integration with Google Workspace and Vertex AI 
  4. xAI Grok

    • Examples: Grok 4
    • Optimized for real‑time and social data analysis
    • Integrated with the X platform 

Enterprise and governance‑focused models

These are optimized for regulated industries and private deployments.

  1. IBM watsonx Granite

    • Examples: Granite 3.1 8B, Granite Guardian
    • Enterprise‑grade governance and RAG‑first architecture
    • Open models with Apache 2.0 licensing 
  2. Amazon Nova

    • Examples: Nova Premier
    • Designed for scalable enterprise workloads on AWS
    • Integrated with Bedrock and AWS tooling 

Open and open‑weight frontier models

Popular for self‑hosting, cost control, and customization.

  1. Meta Llama

    • Examples: Llama 4 Scout, Llama 4 Maverick
    • Widely adopted open‑weight models
    • Strong ecosystem and tooling support 
  2. Mistral

    • Examples: Mistral Large, Mixtral
    • Efficient architectures and strong reasoning
    • Apache‑licensed options for enterprise use 
  3. DeepSeek

    • Examples: DeepSeek V3, DeepSeek R1
    • High‑performance open models competitive with proprietary LLMs
    • Popular for reasoning and coding tasks
  4. Qwen

    • Examples: Qwen 3, Qwen 3.5
    • Strong multilingual and long‑context capabilities
    • Increasing adoption in open‑source deployments

Lightweight and edge‑focused models

Used where latency, cost, or on‑device inference matters.

  1. Microsoft Phi

    • Examples: Phi‑3, Phi‑4
    • Small, efficient models for constrained environments
    • Often embedded in tools and workflows 
  2. Gemma

    • Examples: Gemma 2, Gemma 3
    • Google‑released open models
    • Designed for research and local inference 

Today’s LLM ecosystem spans proprietary frontier models like GPT, Claude, and Gemini, enterprise‑focused platforms such as IBM Granite, and open or open‑weight models like Llama, Mistral, and DeepSeek. Each family makes different tradeoffs across reasoning quality, context window size, governance, cost, and deployability, which is why modern AI systems increasingly adopt multi‑model strategies instead of relying on a single LLM.


Friday, January 2, 2026

Understanding BPF Trace Probes and BTF: Practical Insights from Real-World Debugging

Introduction 

BPFTrace is a powerful tool for dynamic tracing in Linux, enabling developers and system engineers to observe kernel and user-space events in real time. While working with BPFTrace, you often encounter different probe types and kernel features like BTF (BPF Type Format). This blog explains what these probes mean, why BTF matters, and how to troubleshoot common issues. 

  1. What bpftrace probes mean

    • Explain probe types like tracepoint, rawtracepoint, kprobe, and fentry:
      • tracepoint: Stable kernel instrumentation points for syscalls and subsystems.
      • rawtracepoint: Low-level hooks for tracepoints with minimal decoding.
      • kprobe: Dynamic function entry probes for kernel symbols.
      • fentry: Modern BPF function entry probes using BTF type info.
  2. What is BTF and why it matters

    • BPF Type Format (BTF) provides kernel type metadata for BPF programs.
    • Enables automatic argument decoding and advanced probes like fentry.
    • How to check if BTF is present (/sys/kernel/btf/vmlinux) and what to do if missing (use BPFTRACE_KERNEL_SOURCE or simpler probes).
  3. Common errors and fixes

    • Example error: error: field has incomplete type 'const enum landlock_rule_type'
      • Cause: Incomplete type info due to missing or partial BTF.
      • Fix: Use raw syscalls tracepoints or point bpftrace to kernel sources.
  4. Practical examples

    • bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s\n", comm); }' Meaning: Prints the process name whenever the openat() syscall is called.
    • Alternative for PPC/RHEL when BTF is incomplete: bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { print(@); clear(@); }'
  5. Tips for running tests and scripts

    • How to run bpftrace tests (ctest) and functional one-liners.
    • How to handle duration (interval probe or -c 'sleep N')
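The BTF check and fallback from the outline can be scripted. This is a minimal sketch, assuming the simple rule "no BTF file, use the raw syscalls tracepoint"; the helper names are hypothetical, and the script only builds the command line rather than running bpftrace, which needs root.

```python
import os

# Path the kernel exposes when BTF type information is available.
BTF_PATH = "/sys/kernel/btf/vmlinux"


def btf_available(path=BTF_PATH):
    """Return True when the kernel exposes BTF at the given path."""
    return os.path.exists(path)


def pick_one_liner(has_btf):
    """Choose a bpftrace program: typed tracepoint with BTF, raw one without."""
    if has_btf:
        return 'tracepoint:syscalls:sys_enter_openat { printf("%s\\n", comm); }'
    return "tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }"


def build_command(has_btf):
    """Return the argv list that would be passed to bpftrace."""
    return ["bpftrace", "-e", pick_one_liner(has_btf)]


if __name__ == "__main__":
    print(build_command(btf_available()))
```

Partial BTF (the landlock error above) would still pass this file check, so treat the result as a first hint rather than a guarantee.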


Background on BPF: 

BPF (Berkeley Packet Filter) started as a packet filtering mechanism in Unix systems but has evolved into eBPF (Extended BPF) in modern Linux kernels. eBPF is a technology that allows you to run sandboxed programs inside the kernel without changing kernel source code or loading kernel modules.

  • Key idea: eBPF programs are verified and JIT-compiled by the kernel, making them safe and efficient.
  • Capabilities: Observability, networking, security, and performance monitoring.

What is bpftrace?

bpftrace is a high-level front-end for eBPF. It provides a simple scripting language to attach probes to kernel/user events and collect data. It’s similar to DTrace but for Linux.

Why is it needed?

  • Traditional monitoring tools often lack deep kernel visibility.
  • eBPF allows low-overhead, dynamic tracing without rebooting or patching the kernel.
  • Useful for:
    • Performance analysis (CPU, I/O, latency)
    • Debugging production issues
    • Security auditing

When is it applied?

  • When you need real-time insights into kernel or application behavior.
  • Examples:
    • Trace system calls (openat, read, write)
    • Monitor network packets
    • Profile application performance without intrusive instrumentation

Who can use this feature?

  • System administrators: For troubleshooting and performance tuning.
  • Kernel developers: For debugging kernel internals.
  • SRE/DevOps engineers: For observability in production.
  • Security teams: For detecting anomalies and enforcing policies.
-------------------------Examples of BPF trace command ------------
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s\n", comm); }'


That means:
  • bpftrace -e '...': Run an inline bpftrace program given in quotes.
  • tracepoint:syscalls:sys_enter_openat: Attach a probe to the kernel tracepoint that fires whenever a process calls the openat() system call (used to open files).
  • { printf("%s\n", comm); }: The action block. For every event, print the process name (comm) that triggered the syscall.

Every time any process calls openat(), bpftrace prints the name of that process.
This is useful for observing which processes are opening files in real time. It leverages Linux tracepoints, which are stable kernel instrumentation points, and uses bpftrace’s built-in variable comm (the current process name).

# bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s\n", comm); }'
Attached 1 probe
irqbalance
gssproxy
rmcd
bash
sshd
systemd
====>it means those processes invoked openat() during tracing.
--------------------------------
NOTE:
  • Built-in variable comm: In bpftrace, comm is automatically populated with the command name of the current task (the process executing when the probe fires).
  • Execution flow: The action block { printf("%s\n", comm); } runs for every event. At that instant, the kernel context is the process making the syscall, so comm reflects that process name.
  • =======Examples========

    bpftrace -l '*sleep*'
        list probes containing "sleep"

    # bpftrace -l '*sleep*'
    fentry:cls_flower:fl_destroy_sleepable
    fentry:vmlinux:wq_worker_sleeping
    fentry:vmlinux:zpool_can_sleep_mapped
    kprobe:__bpf_prog_array_free_sleepable_cb
    kprobe:__probestub_mm_compaction_kcompactd_sleep
    kprobe:__probestub_mm_vmscan_kswapd_sleep
    rawtracepoint:sunrpc:rpc_task_sleep
    rawtracepoint:sunrpc:rpc_task_sync_sleep
    tracepoint:syscalls:sys_exit_clock_nanosleep
    tracepoint:syscalls:sys_exit_nanosleep
    tracepoint:vmscan:mm_vmscan_kswapd_sleep
    ======
    bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
        trace processes calling sleep

    # bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    Attached 1 probe
    PID 846 sleeping...


    ===============
    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
        count syscalls by process name
    # bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
    Attached 1 probe
    ^C

    @[gssproxy]: 4
    @[gmain]: 10
    @[IBM.MgmtDomainR]: 10
    @[auditd]: 17
    @[systemd-userwor]: 27
    @[rmcd]: 40
    @[in:imjournal]: 48
    @[irqbalance]: 56
    @[bash]: 67
    @[multipathd]: 134
    @[bpftrace]: 223
    @[vi]: 564
    @[sshd-session]: 2542
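Map output like the above is easy to post-process. Here is a small sketch that parses `@[name]: count` lines and ranks the busiest processes; the regular expression assumes the single-key format shown in this run, and `parse_counts` and `top_n` are hypothetical helper names.

```python
import re

# Matches lines such as "@[sshd-session]: 2542".
MAP_LINE = re.compile(r"^@\[(?P<key>.*)\]:\s+(?P<count>\d+)$")

# A few lines copied from the run above.
SAMPLE = """
@[bash]: 67
@[in:imjournal]: 48
@[vi]: 564
@[sshd-session]: 2542
"""


def parse_counts(output):
    """Turn bpftrace map output into a {process_name: count} dict."""
    counts = {}
    for line in output.splitlines():
        match = MAP_LINE.match(line.strip())
        if match:
            counts[match.group("key")] = int(match.group("count"))
    return counts


def top_n(counts, n=3):
    """Return the n entries with the highest counts, largest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    print(top_n(parse_counts(SAMPLE), 2))
```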

    =============================

    Conclusion 

    Understanding probe types and BTF is essential for effective bpftrace usage. When BTF is missing or incomplete, fallback strategies like raw tracepoints or kernel source paths ensure smooth tracing. These insights help troubleshoot errors and write efficient tracing scripts. 

    Thursday, December 25, 2025

    Quantum Computing: A Different Universe, Not Just a Faster Computer

    One of the biggest misconceptions about quantum computing is that it’s just a super-fast version of your laptop. In reality, it’s a completely different paradigm—built on the principles of quantum physics.

    The Spinning Coin Analogy

    • Classical Computing: Imagine a coin lying flat on a table. It’s either Heads (1) or Tails (0). That’s a bit.
    • Quantum Computing: Now imagine that coin spinning. While it spins, it’s both Heads and Tails at the same time—until you stop it. That’s a qubit in superposition.

    Why Does This Matter?

    A classical computer solves problems one path at a time. A quantum computer can explore many paths simultaneously, thanks to three key principles:

    • Superposition: Qubits can exist in multiple states at once, enabling massive parallelism.
    • Entanglement: A link between qubits that share a joint state no matter the distance. Measuring one is instantly correlated with the outcome of measuring the other, although this cannot be used to send information faster than light.
    • Interference: Quantum systems can amplify correct answers and cancel out wrong ones, similar to noise-canceling headphones.
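Superposition and interference can be seen in a few lines of arithmetic. The sketch below simulates a single qubit on a classical machine using the standard Hadamard gate; it is a teaching toy, not how quantum hardware is programmed, and the helper names are made up for illustration.

```python
import math

# Hadamard gate: turns a definite state into an equal superposition.
H = [
    [1 / math.sqrt(2), 1 / math.sqrt(2)],
    [1 / math.sqrt(2), -1 / math.sqrt(2)],
]


def apply(gate, state):
    """Multiply a 2x2 gate by a 2-element state vector of amplitudes."""
    return [
        gate[0][0] * state[0] + gate[0][1] * state[1],
        gate[1][0] * state[0] + gate[1][1] * state[1],
    ]


def probabilities(state):
    """Measurement probabilities are the squared magnitudes of the amplitudes."""
    return [abs(amplitude) ** 2 for amplitude in state]


zero = [1.0, 0.0]              # the coin lying flat: definitely 0
superposed = apply(H, zero)    # the spinning coin: equal chance of 0 and 1
back = apply(H, superposed)    # interference: the paths to 1 cancel out

if __name__ == "__main__":
    print([round(p, 3) for p in probabilities(superposed)])
    print([round(p, 3) for p in probabilities(back)])
```

Applying the gate twice returns the qubit to 0 because the two contributions to the 1 outcome have opposite signs and cancel, which is the "noise-canceling" effect described above.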

    The Quantum Advantage

    This isn’t just theory—it’s practical. Quantum computing promises breakthroughs in:

    • Drug discovery: Simulating molecules at atomic precision.
    • AI & Machine Learning: Optimizing billion-parameter models faster than any classical supercomputer.
    • Cryptography: Redefining security in a post-quantum world.

    We’re entering the era of Quantum Utility, where quantum systems solve problems that would take classical machines millions of years.

    Quantum Computing: Why IBM is Leading the Race

    Quantum computing is no longer just a futuristic concept—it’s becoming a reality. Unlike classical computers, which rely on bits (0s and 1s), quantum computers use qubits, enabling them to solve problems that even the most powerful supercomputers cannot handle.

    Today, tech giants like IBM, Google, Microsoft, Amazon, and Nvidia, along with several innovative startups, are investing heavily in quantum technology. The goal? Achieve Quantum Advantage—the point where quantum computers outperform classical systems for meaningful tasks.

    The Quantum Boom

    • Quantum stocks and valuations have skyrocketed in 2025. Companies like IonQ, Rigetti, D-Wave, and Quantum Computing Inc. have seen gains up to 20x in the past year.
    • Venture capital funding for quantum startups hit $1.9 billion in 2024, a 138% increase over 2023.
    • Industries from finance to pharmaceuticals are preparing quantum strategies, anticipating one of the biggest revolutions since AI.

    What is Quantum Advantage?

    Quantum Advantage means performing a task on a quantum computer that no classical computer can achieve, regardless of size or power. This is the holy grail of quantum computing—and IBM is closer than ever.

    Quantum Hardware Modalities

    Unlike traditional computing, quantum technology comes in different flavors:

    • Superconducting Qubits (IBM, Google)
    • Trapped Ions (Quantinuum)
    • Neutral Atoms (QuEra)
    • Photonic Systems
    • Silicon Spin Qubits
    • Topological Qubits (still experimental)

    Each approach has pros and cons, but IBM leads in superconducting qubits, the most mature and scalable technology today.

    IBM Quantum: The Road to Advantage

    IBM has been working on quantum computing since the 1970s and has never missed a milestone on its published roadmap. Here’s why IBM stands out:

    Key Achievements

    • Superconducting Qubits cooled near absolute zero for stability.
    • Quantum Nighthawk Chip: 120 qubits, 30% faster than previous generation, enabling real-world applications.
    • Quantum Loon: Experimental processor for fault-tolerant quantum memory.
    • Error Correction Breakthrough: Detects errors in under 480 nanoseconds—8x faster than GPUs.
    • Qiskit Software: Industry-leading quantum programming framework, now with C-API for HPC acceleration.
    • 300mm Wafer Fabrication: Scaling production at Albany NanoTech Complex for faster development.

    IBM aims to achieve Quantum Advantage by 2026 and full fault tolerance by 2029.

    Conclusion: 

    Quantum computing won’t replace classical systems—it will complement them, solving problems in chemistry, optimization, and AI that are impossible today.

    The future is near. Quantum will transform industries, accelerate scientific discovery, and redefine computing as we know it.

    Friday, December 19, 2025

    Tool Calling in AI: Turning Language Models into Action-Driven Agents

    Large language models (LLMs) have transformed how humans interact with machines. However, the real breakthrough comes when these models stop being passive responders and start becoming active problem solvers. This transition is powered by a capability known as tool calling.

    In this blog, we will explore what tool calling is, why it matters, how it works internally, and how it enables agentic AI systems capable of real-world action.

    What Is Tool Calling?

    At its core, tool calling refers to an AI model’s ability to interact with external tools, APIs, databases, or systems to extend its native capabilities.

    Traditional LLMs operate purely on pretrained knowledge. They generate answers based on patterns learned during training. But this approach has a hard limit:

    they cannot access real-time data, perform live computations, or take direct actions.

    Tool calling removes this limitation.

    With tool calling enabled, an AI system can:

    • Query live databases

    • Fetch real-time information (weather, stock prices, system status)

    • Execute functions or scripts

    • Trigger workflows and automation

    • Interact with enterprise systems

    This capability is sometimes called function calling, and it is one of the foundational pillars of agentic AI.

    Instead of merely answering questions, LLMs with tool calling can decide, act, and iterate—much like a digital agent.

    NOTE: An agent is a system with an LLM at its core that decides which actions to take as it works to answer the prompt it received. The most common actions an LLM agent can be built to take are: sending text or other media to the user, calling a tool to help answer the user, and calling another agent to help answer the user. Generally speaking, an LLM agent also has a system prompt that explains its role and gives it rules for when to call tools or reply to the user.

    [Figure: control flow of a typical LLM agent]

    Why Is Tool Calling Important?

    Limits of static knowledge: Even the most advanced LLMs are constrained by:

    • Training data cutoffs

    • Lack of real-time awareness

    • Inability to perform live computations

    • No direct access to user-specific systems

    Early models such as GPT-2 were entirely static. They produced impressive text but had no concept of now.
    Ask them about today’s weather or current stock prices, and they simply could not answer accurately.

    The Need for Real-World Interaction

    As AI moved into production systems—finance, healthcare, DevOps, customer support—the need for:

    • Live data

    • External computation

    • User-specific actions

    became unavoidable.

    This led to the introduction of tool calling, where models are trained to:

    1. Recognize when external help is needed

    2. Select the correct tool

    3. Generate structured requests

    4. Interpret structured responses

    Critically, tools often expect strict input schemas, not free-form text. Tool calling ensures model outputs conform to these schemas, making AI-system integration reliable and safe.

    How Does Tool Calling Work?

    Modern LLMs such as Claude, Llama 3, Mistral, and IBM Granite all support tool calling, though implementation details vary.

    At a high level, the process involves six steps.

    Step 1: Recognizing the Need for a Tool

    Imagine a user asks:

    “What’s the weather in San Francisco right now?”

    The model immediately understands:

    • This requires real-time data

    • The answer cannot come from its static training set

    At this point, the model decides to invoke a tool.
    A unique tool call ID is generated to track the request and its eventual response.

    Step 2: Selecting the Right Tool

    Next, the model chooses the most appropriate tool—perhaps a weather API.

    Each tool is described using metadata, including:

    • Tool (or function) name

    • Description

    • Input parameters

    • Input and output data types

    This metadata allows the model to reason about:

    • Which tool to use

    • What arguments it must provide

    Tool selection is not random—it is a learned decision based on context.
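As an illustration, tool metadata is often written as a small structured description. The sketch below uses a JSON-Schema-like layout, which is a common convention rather than any specific vendor's format; the `get_weather` tool and the helper functions are hypothetical.

```python
# Hypothetical tool description: name, purpose, and typed input parameters.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}


def required_arguments(tool):
    """List the arguments the model must always provide."""
    return list(tool["parameters"]["required"])


def describe(tool):
    """Render a one-line summary of the tool and its arguments."""
    args = ", ".join(sorted(tool["parameters"]["properties"]))
    return f'{tool["name"]}({args}): {tool["description"]}'


if __name__ == "__main__":
    print(describe(WEATHER_TOOL))
    print(required_arguments(WEATHER_TOOL))
```

The description field matters most for selection, since it is the text the model reads when deciding whether this tool fits the request.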

    Step 3: Preparing the Arguments (Args)

    Once the tool is selected, the model constructs structured arguments (often called args).

    For example:

    • City: San Francisco

    • Units: Celsius

    • Timestamp: current

    These arguments must strictly match the tool’s expected schema.

    To ensure consistency, developers often use templates or structured prompts that guide the model on:

    • Which tool to call

    • What arguments to pass

    This is where tool calling differs from free-form prompting—it is contract-driven.
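
    One way to enforce such a contract is to check the model's arguments against the schema before calling the tool. The sketch below is a deliberately small checker, not a full JSON Schema validator; the validate_args helper and the WEATHER_SCHEMA definition are assumptions made for this example.

```python
def validate_args(args, schema):
    """Return a list of problems; an empty list means args match the schema."""
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    problems = []
    # Every required parameter must be present.
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    # Every supplied argument must be declared, well-typed, and within its enum.
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = type_map.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"{name} should be {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems


# Hypothetical schema for the weather example.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

print(validate_args({"city": "San Francisco", "units": "celsius"}, WEATHER_SCHEMA))
print(validate_args({"units": "kelvin"}, WEATHER_SCHEMA))
```

    Rejecting bad arguments before the call keeps malformed requests away from the external system and gives the application a clear error to send back to the model.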

    Tool Calling + RAG: A Powerful Combination

    Tool calling becomes even more effective when combined with Retrieval Augmented Generation (RAG).

    With RAG:

    • The model retrieves relevant structured and unstructured data

    • Then uses that data to generate a grounded response

    Benefits include:

    • Higher contextual accuracy

    • Reduced hallucinations

    • Lower API overhead

    • Greater flexibility across domains

    Unlike rigid tool calls, RAG allows more fluid reasoning by blending retrieved knowledge with generation.

    Step 4: Making the API Call

    Each tool is backed by an API, documented via:

    • Endpoints

    • HTTP methods

    • Request/response formats

    Many APIs require authentication via an API key.

    Once arguments are prepared, the orchestration layer (the application code around the model, not the model itself) sends an HTTP request to the external system.
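
    A minimal sketch of this step, using only the Python standard library, might look like the following. The endpoint URL, the query parameter names, and the bearer-token header are placeholders assumed for illustration; a real weather API will define its own. The request is built but not sent, so the example runs without network access.

```python
import urllib.parse
import urllib.request


def build_weather_request(args, api_key):
    """Build (but do not send) a GET request for a hypothetical weather API."""
    query = urllib.parse.urlencode(
        {"city": args["city"], "units": args.get("units", "celsius")}
    )
    url = f"https://api.example.com/v1/weather?{query}"  # placeholder endpoint
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}", "Accept": "application/json"},
        method="GET",
    )


req = build_weather_request({"city": "San Francisco"}, api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=5)
```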

    Step 5: Receiving and Processing the Response

    The external tool returns structured data—commonly in JSON format.

    For a weather API, this might include:

    • Temperature

    • Humidity

    • Wind speed

    The AI then:

    • Parses the response

    • Filters relevant fields

    • Transforms raw data into a human-friendly explanation
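
    The parse, filter, and transform steps above can be sketched as follows. The JSON payload and its field names (temperature_c, humidity, wind_kph, station_id) are invented for this example and do not come from any particular weather service.

```python
import json

# Invented sample payload; real APIs will use their own field names.
raw = (
    '{"city": "San Francisco", "temperature_c": 17.5, "humidity": 72, '
    '"wind_kph": 14, "station_id": "KSFO"}'
)


def summarize_weather(raw_json):
    """Parse the response, keep the relevant fields, and phrase them for a user."""
    data = json.loads(raw_json)                      # parse
    keep = ("city", "temperature_c", "humidity", "wind_kph")
    fields = {k: data[k] for k in keep}              # filter (station_id is dropped)
    return (                                         # transform
        f"It is {fields['temperature_c']} degrees Celsius in {fields['city']}, "
        f"with {fields['humidity']}% humidity and wind at {fields['wind_kph']} km/h."
    )


print(summarize_weather(raw))
```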

    Step 6: Acting or Responding

    Finally, the AI either:

    • Presents the information to the user, or

    • Confirms an action (e.g., “Your reminder has been scheduled.”)

    If the user asks follow-up questions, the model can repeat the cycle with refined parameters—enabling iterative reasoning.


    How Do LLMs Call Tools?

    For an LLM (Large Language Model) to call a tool, it needs a structured way to specify which tool it wants to use and what arguments to pass. Since an LLM outputs plain text tokens, an external system must parse this output and execute the tool call. This means the LLM should produce structured or semi-structured data consistently.

    Different APIs implement this differently, but the concept is the same across platforms. Let’s look at how the OpenAI Chat API handles this.

    When using the OpenAI Chat API, you provide a list of tools the LLM can access. Each tool is defined with:

    • Name of the tool
    • Description of what it does
    • Parameters (including type, description, and whether they are required)

    Here’s an example tool definition:

    {
        "type": "function",
        "function": {
            "name": "calculate_distance",
            "description": "Calculate the distance between two cities",
            "parameters": {
                "type": "object",
                "properties": {
                    "city_a": {
                        "type": "string",
                        "description": "Name of the first city, e.g., New York"
                    },
                    "city_b": {
                        "type": "string",
                        "description": "Name of the second city, e.g., Los Angeles"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["kilometers", "miles"]
                    }
                },
                "required": ["city_a", "city_b"]
            }
        }
    }

    This JSON would be included in the API call so the LLM knows it can use calculate_distance. If you don’t include it, the LLM won’t know the tool exists.

    How Does the LLM Decide to Call a Tool?

    When the LLM responds, you check the tool_calls property in the response. For example, in Python:

    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls

    tool_calls contains the list of tools the LLM wants to invoke, along with their arguments (it is None when the model answers directly). Your system then executes the corresponding function or method with those arguments. This approach lets the LLM reason about when to use a tool and provide structured arguments, while your application handles the actual execution.
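
    The dispatch step can be sketched without calling the API by using stand-in objects that expose the same attributes the code above reads (id, function.name, function.arguments). The TOOL_IMPLS mapping, the run_tool_calls helper, and the placeholder calculate_distance body are assumptions for this example.

```python
import json
from types import SimpleNamespace


def calculate_distance(city_a, city_b, unit="kilometers"):
    # Placeholder implementation; a real one would compute the distance.
    return {"city_a": city_a, "city_b": city_b, "unit": unit}


# Map tool names (as declared in the schema) to local functions.
TOOL_IMPLS = {"calculate_distance": calculate_distance}


def run_tool_calls(tool_calls):
    """Execute each requested tool and build role="tool" reply messages."""
    messages = []
    for call in tool_calls or []:
        fn = TOOL_IMPLS.get(call.function.name)
        if fn is None:
            result = {"error": f"Unknown tool {call.function.name}"}
        else:
            # The arguments arrive as a JSON string, so parse them first.
            result = fn(**json.loads(call.function.arguments))
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    return messages


# Stand-in for response.choices[0].message.tool_calls
fake_calls = [
    SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(
            name="calculate_distance",
            arguments='{"city_a": "New York", "city_b": "Los Angeles", "unit": "miles"}',
        ),
    )
]
print(run_tool_calls(fake_calls))
```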


    The agent workflow proceeds as follows:

    1. Receive the Query
      The agent gets a natural-language request or task from the user or an external system.

    2. Discover Available Tools
      It looks up internal metadata or a tool registry to find relevant tools, schemas, and capabilities.

    3. Select and Invoke the Right Tool
      The LLM processes the query along with tool metadata (such as function names, input types, and descriptions).
      It chooses the most appropriate tool, prepares the input arguments, and generates a structured function call.

    4. Execute the Tool
      The agent shell or tool runner runs the selected function and retrieves the output (e.g., API response, database value, or computation result).

    5. Return the Final Response
      The LLM incorporates the tool’s result into its prompt and produces a natural-language answer for the user.
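
    The five steps can be sketched as a small loop. To keep the example runnable offline, the model is replaced by a scripted fake_llm function that requests one tool call and then answers; fake_llm, the add tool, and the message shapes are all hypothetical stand-ins rather than any vendor's API.

```python
import json


def fake_llm(messages, tool_names):
    """Stand-in for a model: requests a tool once, then answers from its result."""
    last = messages[-1]
    if last["role"] == "tool":
        total = json.loads(last["content"])["total"]
        return {"content": f"The total is {total}.", "tool_call": None}
    return {"content": None, "tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}


# Tool registry: name -> callable.
TOOLS = {"add": lambda a, b: {"total": a + b}}


def run_agent(query, llm=fake_llm, tools=TOOLS, max_steps=3):
    messages = [{"role": "user", "content": query}]       # 1. receive the query
    for _ in range(max_steps):
        reply = llm(messages, list(tools))                # 2-3. discover tools, select one
        if reply["tool_call"] is None:
            return reply["content"]                       # 5. return the final response
        call = reply["tool_call"]
        result = tools[call["name"]](**call["args"])      # 4. execute the tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not finish within max_steps")


print(run_agent("What is 2 + 3?"))
```

    The max_steps cap is a design choice for the sketch: it stops the loop if the model keeps requesting tools without ever producing an answer.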

      =====================================

      Key Capabilities:

    • Dynamic Tool Selection
      Automatically picks the right tool based on the context of the task.

    • Schema-Aware Prompting
      Supports structured interfaces like OpenAPI, JSON Schema, and AWS function definitions for precise interactions.

    • Intelligent Output Handling
      Interprets results and chains outputs into logical reasoning for complex workflows.

    • Flexible Execution Modes
      Works in both stateless and session-aware environments.


    Common Use Cases:

    • Virtual Assistants with External Data Access
      Enhance assistants by connecting them to APIs and real-time data sources.

    • Financial Calculators and Estimators
      Perform dynamic computations and provide accurate projections.

    • API-Driven Knowledge Workers
      Automate tasks that require pulling and processing data from multiple services.

    • LLM-Powered Integrations
      Invoke AWS Lambda, Amazon SageMaker endpoints, and SaaS tools for advanced functionality.

    ==================================================

    LangChain and Tool Calling

    LangChain is one of the most widely used frameworks for implementing tool calling.

    It provides:

    • Tool registration

    • Argument parsing

    • Context-aware routing

    • Memory across multiple interactions

    Unlike basic tool calling, LangChain can:

    • Chain multiple tools together

    • Store previous tool outputs

    • Enable complex, multi-step agent workflows

    For example:

    1. Call a weather API

    2. Use results to trigger a clothing recommendation tool

    3. Generate a final personalized response

    This is a practical implementation of agentic AI.
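
    The three-step example can be sketched in plain Python to show the data flow. This is not LangChain's API: the functions, the hard-coded weather values, and the history list are assumptions meant only to illustrate chaining tools and keeping their outputs.

```python
def get_weather(city):
    # Hard-coded stand-in for a real weather API call.
    return {"city": city, "temperature_c": 8, "condition": "rain"}


def recommend_clothing(weather):
    """Second tool: consumes the first tool's output."""
    items = ["a warm coat" if weather["temperature_c"] < 12 else "a light jacket"]
    if weather["condition"] == "rain":
        items.append("an umbrella")
    return items


def run_chain(city):
    history = []  # previous tool outputs, kept for later steps
    weather = get_weather(city)                       # 1. call the weather tool
    history.append(("get_weather", weather))
    clothing = recommend_clothing(weather)            # 2. feed the result to the next tool
    history.append(("recommend_clothing", clothing))
    answer = (                                        # 3. build the final response
        f"{city}: {weather['temperature_c']} C and {weather['condition']}. "
        f"Consider {' and '.join(clothing)}."
    )
    return answer, history


answer, history = run_chain("San Francisco")
print(answer)
```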

    Common Types of Tool Calling Use Cases

    While possibilities are endless, most applications fall into a few major categories.

    1. Information Retrieval and Search

    AI pulls real-time data from:

    • Web search engines

    • Financial markets

    • Academic databases

    • News sources

    Example: Fetching live stock prices or breaking news inside a chatbot.

    2. Code Execution and Computation

    AI executes:

    • Mathematical calculations

    • Simulations

    • Scripts via Python or engines like Wolfram Alpha

    Useful for analytics, engineering, and scientific domains.

    3. Process Automation

    AI automates workflows by integrating with:

    • Calendars

    • Email systems

    • CRM tools (Salesforce)

    • Finance platforms (QuickBooks)

    This enables AI-driven business operations.

    4. Smart Devices and IoT Control

    Agentic systems can monitor and control:

    • Smart homes

    • Industrial sensors

    • Robotics platforms

    This opens the door to fully autonomous, end-to-end workflows.


    Final Thoughts

    Tool calling is not just a feature; it is a paradigm shift.

    It allows LLMs to:

    • Know when they don’t know

    • Reach outside themselves

    • Act in the real world

    • Continuously refine outcomes

    As AI systems evolve, tool calling will be the foundation that turns language models into true digital agents, capable of reasoning, acting, and collaborating across complex environments. If language is intelligence, tool calling is agency.

    --------------------------------------BACKUP INFO-------------------------------

    Sample code:

    Below is a complete Python example that shows:

    1. Defining tools
    2. Making a chat completion request
    3. Reading tool_calls and parsing JSON arguments
    4. Executing your local functions
    5. Returning tool outputs back to the model for a final answer

    # pip install openai
    from openai import OpenAI
    import json

    # Initialize client
    client = OpenAI(api_key="YOUR_API_KEY")

    # Define one tool
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_city_coordinates",
                "description": "Return approximate latitude and longitude for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"}
                    },
                    "required": ["city"]
                }
            }
        }
    ]

    # Local function to execute
    def get_city_coordinates(city):
        # Hardcoded example
        coords = {
            "New York": (40.7128, -74.0060),
            "Los Angeles": (34.0522, -118.2437),
            "Bangalore": (12.9716, 77.5946)
        }
        lat, lon = coords.get(city, (None, None))
        return {"city": city, "latitude": lat, "longitude": lon}

    # User message
    messages = [{"role": "user", "content": "Give me the coordinates of Bangalore"}]

    # First call: model decides if it needs the tool
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # Use a tool-capable model
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )

    # Check if model requested a tool
    tool_calls = resp.choices[0].message.tool_calls or []
    tool_msgs = []
    for call in tool_calls:
        # Parse JSON arguments
        args = json.loads(call.function.arguments)
        result = get_city_coordinates(args["city"])
        # Send tool result back to model
        tool_msgs.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result)
        })

    # Second call: model uses tool result to finish answer
    final = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages + [resp.choices[0].message] + tool_msgs
    )
    print(final.choices[0].message.content)

    How It Works

    1. tools: Defines the schema so the model knows what arguments to provide.
    2. First API call: Model decides if it needs the tool and returns tool_calls.
    3. Parse arguments: call.function.arguments is a JSON string → json.loads().
    4. Execute local function: get_city_coordinates(city).
    5. Send result back: Add a message with role="tool" and tool_call_id.
    6. Second API call: Model uses the tool output to generate the final answer.
    ============================
    Example 2: How to add functions?


    # pip install openai
    from openai import OpenAI
    import json

    client = OpenAI(api_key="YOUR_API_KEY")

    # 1) Define your tool schema. The model uses this to learn how to call your functions.
    tools = [
        {
            "type": "function",
            "function": {
                "name": "calculate_distance",
                "description": "Calculate great-circle distance between two cities",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city_a": {"type": "string", "description": "First city, e.g., New York"},
                        "city_b": {"type": "string", "description": "Second city, e.g., Los Angeles"},
                        "unit": {
                            "type": "string",
                            "enum": ["kilometers", "miles"],
                            "description": "Distance unit"
                        }
                    },
                    "required": ["city_a", "city_b"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "get_city_coordinates",
                "description": "Return approximate latitude and longitude for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"}
                    },
                    "required": ["city"]
                }
            }
        }
    ]

    # 2) Implement the actual Python functions the model can call.
    # In production, replace these with real logic (DB lookup, API calls, etc.)
    CITY_DB = {
        "New York": (40.7128, -74.0060),
        "Los Angeles": (34.0522, -118.2437),
        "San Francisco": (37.7749, -122.4194),
        "Bangalore": (12.9716, 77.5946),
    }

    from math import radians, sin, cos, sqrt, atan2

    def get_city_coordinates(city: str):
        if city not in CITY_DB:
            raise ValueError(f"Unknown city: {city}")
        lat, lon = CITY_DB[city]
        return {"city": city, "latitude": lat, "longitude": lon}

    def calculate_distance(city_a: str, city_b: str, unit: str = "kilometers"):
        # Get coords
        lat1, lon1 = CITY_DB.get(city_a, (None, None))
        lat2, lon2 = CITY_DB.get(city_b, (None, None))
        if lat1 is None or lat2 is None:
            raise ValueError("One or both cities unknown")

        # Haversine
        R_km = 6371.0
        R_mi = 3958.8

        phi1, phi2 = radians(lat1), radians(lat2)
        dphi = radians(lat2 - lat1)
        dlambda = radians(lon2 - lon1)

        a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
        c = 2 * atan2(sqrt(a), sqrt(1 - a))
        d_km = R_km * c
        d_mi = R_mi * c

        if unit == "miles":
            return {"city_a": city_a, "city_b": city_b, "distance": round(d_mi, 2), "unit": "miles"}
        else:
            return {"city_a": city_a, "city_b": city_b, "distance": round(d_km, 2), "unit": "kilometers"}

    # 3) Start the conversation. The user asks something that likely requires tools.
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "What is the distance between New York and Los Angeles in miles? Also share coordinates for Los Angeles."
        },
    ]

    # 4) First model call with tools listed. The model may decide to call one or more tools.
    first_response = client.chat.completions.create(
        model="gpt-4o-mini",   # Choose a tool-capable model
        messages=messages,
        tools=tools,
        tool_choice="auto"      # Let the model decide whether and which tools to call
    )

    # 5) Check if the model wants to call tools.
    # This is where the JSON arguments live: tool_call.function.arguments is a JSON string.
    tool_calls = first_response.choices[0].message.tool_calls or []

    # If there are tool calls, execute them and gather results.
    tool_results_messages = []
    for call in tool_calls:
        tool_name = call.function.name
        # The arguments come as a JSON string. Parse it to a dict.
        args = json.loads(call.function.arguments)

        try:
            if tool_name == "calculate_distance":
                result = calculate_distance(
                    city_a=args.get("city_a"),
                    city_b=args.get("city_b"),
                    unit=args.get("unit", "kilometers"),
                )
            elif tool_name == "get_city_coordinates":
                result = get_city_coordinates(args.get("city"))
            else:
                result = {"error": f"Unknown tool {tool_name}"}
        except Exception as e:
            result = {"error": str(e)}

        # Add a tool result message. The role must be "tool" and the tool_call_id must match.
        tool_results_messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result)
        })

    # 6) Send the tool outputs back to the model to let it finish the answer.
    final_messages = messages + [first_response.choices[0].message] + tool_results_messages


    final_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=final_messages
    )
    print(final_response.choices[0].message.content)

    Key points to notice:

    • Where do the JSON input args go?
      The model returns them in tool_calls[i].function.arguments as a JSON string. You must json.loads(...) that string to get a Python dict to call your function.

    • Returning tool outputs back to the model
      You send a new message with role="tool", include the tool_call_id from the original call, and put your tool’s output in content (commonly as a JSON string).

    • Finalization step
      After you add the tool result messages, call the model again so it can synthesize a natural language answer using the tool outputs.

    ====================

    Example 3: Minimal example showing a remote API call as the “execute locally” step

    Workflow:

    define tools → let the model decide → parse arguments → execute locally → return results → finalize answer

    NOTE: “Execute locally” means “execute in your code”. Where your code reaches out from there is your choice: it can call an external system directly, make a request to a remote database or API, or enqueue a job for a worker or serverless function.

    from openai import OpenAI
    import json
    import requests  # pip install requests

    client = OpenAI(api_key="YOUR_API_KEY")

    # Tool schema that asks the model to provide a username
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_user_profile",
                "description": "Fetch a user's profile from a remote service",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "username": {"type": "string", "description": "Login name"}
                    },
                    "required": ["username"]
                }
            }
        }
    ]

    # Local function that calls an external API
    def get_user_profile(username):
        try:
            # External call. This is the execute locally step in your app.
            resp = requests.get(
                f"https://api.example.com/users/{username}",
                timeout=5
            )
            resp.raise_for_status()
            data = resp.json()
            # Always return a JSON-serializable object
            return {
                "username": data.get("username"),
                "full_name": data.get("full_name"),
                "email": data.get("email"),
                "status": "ok"
            }
        except requests.exceptions.Timeout:
            return {"error": "timeout", "status": "failed"}
        except requests.exceptions.HTTPError as e:
            return {"error": f"http {e.response.status_code}", "status": "failed"}
        except Exception as e:
            return {"error": str(e), "status": "failed"}

    messages = [{"role": "user", "content": "Show the profile for user alice"}]

    # First call: model decides to call the tool and provides JSON args
    first = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )

    tool_calls = first.choices[0].message.tool_calls or []
    tool_messages = []

    for call in tool_calls:
        args = json.loads(call.function.arguments)
        result = get_user_profile(args["username"])
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result)
        })

    # Second call: model reads tool outputs and finalizes the answer
    final_messages = messages + [first.choices[0].message] + tool_messages

    final = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=final_messages
    )
    print(final.choices[0].message.content)

    Example 4: Multi-agent setup with orchestration. This example demonstrates a two-agent workflow using tool calls. It shows:
    • Agent 1 (DataAgent) calculates distance between two cities using tools
    • Agent 2 (ReportAgent) formats the result using its own tool
    • An orchestrator glues the two agents together

    # pip install openai
    from openai import OpenAI
    import json
    from math import radians, sin, cos, sqrt, atan2

    # Initialize OpenAI client
    client = OpenAI(api_key="YOUR_API_KEY")

    # ---------------------------------------------
    # Shared data for tools
    # ---------------------------------------------
    CITY_DB = {
        "Bangalore": (12.9716, 77.5946),
        "Los Angeles": (34.0522, -118.2437),
        "New York": (40.7128, -74.0060),
        "San Francisco": (37.7749, -122.4194),
    }

    # ---------------------------------------------
    # Local tool implementations for Agent 1
    # ---------------------------------------------
    def get_city_coordinates(city: str):
        if city not in CITY_DB:
            raise ValueError(f"Unknown city: {city}")
        lat, lon = CITY_DB[city]
        return {"city": city, "latitude": lat, "longitude": lon}

    def calculate_distance(city_a: str, city_b: str, unit: str = "kilometers"):
        lat1, lon1 = CITY_DB.get(city_a, (None, None))
        lat2, lon2 = CITY_DB.get(city_b, (None, None))
        if lat1 is None or lat2 is None:
            raise ValueError("One or both cities unknown")

        # Haversine formula
        R_km = 6371.0
        R_mi = 3958.8

        phi1, phi2 = radians(lat1), radians(lat2)
        dphi = radians(lat2 - lat1)
        dlambda = radians(lon2 - lon1)

        a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
        c = 2 * atan2(sqrt(a), sqrt(1 - a))
        d_km = R_km * c
        d_mi = R_mi * c

        if unit == "miles":
            return {"city_a": city_a, "city_b": city_b, "distance": round(d_mi, 2), "unit": "miles"}
        else:
            return {"city_a": city_a, "city_b": city_b, "distance": round(d_km, 2), "unit": "kilometers"}

    # ---------------------------------------------
    # Local tool implementations for Agent 2
    # ---------------------------------------------
    def format_summary(city_a: str, city_b: str, distance: float, unit: str):
        bullets = [
            f"Route: {city_a} to {city_b}",
            f"Distance: {distance} {unit}",
            "Method: Great circle approximation",
            "Use case: Travel planning and logistics"
        ]
        conclusion = f"In summary, the distance between {city_a} and {city_b} is {distance} {unit}."
        return {"bullets": bullets, "conclusion": conclusion}

    # ---------------------------------------------
    # Tool schemas
    # ---------------------------------------------
    DATA_AGENT_TOOLS = [
        {
            "type": "function",
            "function": {
                "name": "get_city_coordinates",
                "description": "Return latitude and longitude for a known city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"}
                    },
                    "required": ["city"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "calculate_distance",
                "description": "Calculate great circle distance between two cities",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city_a": {"type": "string"},
                        "city_b": {"type": "string"},
                        "unit": {"type": "string", "enum": ["kilometers", "miles"]}
                    },
                    "required": ["city_a", "city_b"]
                }
            }
        }
    ]

    REPORT_AGENT_TOOLS = [
        {
            "type": "function",
            "function": {
                "name": "format_summary",
                "description": "Create a short bullet list and conclusion from computed distance",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city_a": {"type": "string"},
                        "city_b": {"type": "string"},
                        "distance": {"type": "number"},
                        "unit": {"type": "string"}
                    },
                    "required": ["city_a", "city_b", "distance", "unit"]
                }
            }
        }
    ]

    # ---------------------------------------------
    # Agent class: runs a single tool-capable turn
    # ---------------------------------------------
    class Agent:
        def __init__(self, name: str, system_prompt: str, tools: list, tool_impls: dict):
            self.name = name
            self.system_prompt = system_prompt
            self.tools = tools
            self.tool_impls = tool_impls

        def run_turn(self, user_or_context_messages: list, model: str = "gpt-4o-mini"):
            """
            Takes incoming messages, lets the model decide tool calls,
            executes tools locally, returns final assistant text and the tool results.
            """
            # First call: let model decide whether to call tools
            first = client.chat.completions.create(
                model=model,
                messages=[{"role": "system", "content": self.system_prompt}] + user_or_context_messages,
                tools=self.tools,
                tool_choice="auto"
            )

            assistant_msg = first.choices[0].message
            tool_calls = assistant_msg.tool_calls or []
            tool_result_messages = []
            collected_results = []  # Keep structured results for orchestration

            # Execute each requested tool
            for call in tool_calls:
                fn_name = call.function.name
                args = json.loads(call.function.arguments)

                try:
                    fn = self.tool_impls.get(fn_name)
                    if fn is None:
                        result = {"error": f"Unknown tool {fn_name}"}
                    else:
                        result = fn(**args)
                except Exception as e:
                    result = {"error": str(e)}

                # Store the structured result for the orchestrator
                collected_results.append({"name": fn_name, "args": args, "result": result})

                # Return result to the model using role="tool" with matching tool_call_id
                tool_result_messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": json.dumps(result)
                })

            # Second call: model reads tool outputs and produces a final assistant message
            final_messages = (
                [{"role": "system", "content": self.system_prompt}]
                + user_or_context_messages
                + [assistant_msg]
                + tool_result_messages
            )
            final = client.chat.completions.create(model=model, messages=final_messages)
            final_text = final.choices[0].message.content

            return final_text, collected_results

    # ---------------------------------------------
    # Instantiate two agents
    # ---------------------------------------------
    data_agent = Agent(
        name="DataAgent",
        system_prompt=(
            "You are a precise data agent. Use tools to compute distances and coordinates. "
            "Answer concisely and include structured context if helpful."
        ),
        tools=DATA_AGENT_TOOLS,
        tool_impls={
            "get_city_coordinates": get_city_coordinates,
            "calculate_distance": calculate_distance
        }
    )

    report_agent = Agent(
        name="ReportAgent",
        system_prompt=(
            "You are a clear technical writer. Format results with helpful bullets and a concise conclusion. "
            "Prefer calling the formatting tool when raw values are given."
        ),
        tools=REPORT_AGENT_TOOLS,
        tool_impls={
            "format_summary": format_summary
        }
    )

    # ---------------------------------------------
    # Orchestration: two agent workflow
    # ---------------------------------------------
    def run_two_agent_workflow(city_a: str, city_b: str, unit: str = "miles"):
        # Step 1: User asks for a report that requires computation
        user_message = {
            "role": "user",
            "content": f"Compute the distance between {city_a} and {city_b} in {unit}. Then present a short professional report."
        }

        # Step 2: DataAgent computes using its tools
        data_text, data_results = data_agent.run_turn([user_message])

        # Extract the distance result from DataAgent tool outputs
        # We search for calculate_distance result
        distance_payload = None
        for item in data_results:
            if item["name"] == "calculate_distance" and "result" in item:
                distance_payload = item["result"]
                break

        if distance_payload is None:
            # Fallback if model did not call the tool
            # In a production system you would retry or ask the model to call the tool explicitly
            raise RuntimeError("DataAgent did not produce a distance result")

        # Step 3: Prepare input for ReportAgent
        # We provide both the narrative from DataAgent and the structured JSON payload
        report_messages = [
            {"role": "user", "content": "Create a short professional report from the following computed data."},
            {"role": "user", "content": json.dumps(distance_payload)}
        ]

        # Step 4: ReportAgent formats the result using its tool
        report_text, _ = report_agent.run_turn(report_messages)

        # Step 5: Return final report to the caller
        return {
            "data_agent_answer": data_text,
            "report_agent_answer": report_text,
            "distance_payload": distance_payload
        }

    # ---------------------------------------------
    # Demo run
    # ---------------------------------------------
    if __name__ == "__main__":
        result = run_two_agent_workflow("Bangalore", "Los Angeles", unit="miles")
        print("\n=== DataAgent output ===\n")
        print(result["data_agent_answer"])
        print("\n=== ReportAgent final report ===\n")
        print(result["report_agent_answer"])
        print("\n=== Raw computed payload ===\n")
        print(json.dumps(result["distance_payload"], indent=2))

    How this works

    1. Each agent has its own system prompt, tool schema, and Python functions.
    2. For each agent, we make a first call with tools=... and tool_choice="auto".
    3. We parse assistant_msg.tool_calls[i].function.arguments which is a JSON string.
    4. We execute the requested local function and return a tool message with role="tool" and tool_call_id.
    5. We make a second call for the agent to finalize the answer.
    6. The orchestrator passes the computed JSON payload from Agent 1 to Agent 2, which formats the report via its own tool.
    NOTE: This is a minimal synchronous pattern. In production you might add retries, timeouts, logging, and guardrails.
    -------------------------------------------------

    AI Agents: Systems that act autonomously to achieve goals using reasoning, memory, and tools. They are built on top of LLMs with planning and execution loops.

    Types:
    1) Reactive agent: responds instantly (for example, chatbots)
    2) Planning agent: reasons, plans, and executes multi-step tasks
    3) Collaborative agent: coordinates with other agents or humans

    Agent Architecture:

    1) Planner: decides the next action
    2) Executor: performs actions via tools or APIs
    3) Memory: short-term memory (chat context) and long-term memory (vector store)
    4) Reasoner: evaluates progress and adapts the plan
    5) Interface: connects with the user or system
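
The components above can be wired into a small loop. This is a toy sketch under stated assumptions: the "goal" is just counting up to a number, and `planner`, `executor`, `reasoner`, and `Memory` are illustrative names, with a plain dict standing in for the vector store.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Memory:
    short_term: list = field(default_factory=list)  # running log of this session
    long_term: dict = field(default_factory=dict)   # stand-in for a vector store

# Executor's toolbox: one trivial tool that adds 1 to the state.
TOOLS: dict[str, Callable[[int], int]] = {"increment": lambda x: x + 1}

def planner(goal: int, state: int) -> Optional[str]:
    """Decide the next action, or None when nothing is left to do."""
    return "increment" if state < goal else None

def executor(action: str, state: int) -> int:
    """Perform the chosen action through a tool."""
    return TOOLS[action](state)

def reasoner(goal: int, state: int) -> bool:
    """Evaluate progress: has the goal been reached?"""
    return state >= goal

def run_agent(goal: int, max_steps: int = 10) -> tuple[int, Memory]:
    memory = Memory()
    state = 0
    for _ in range(max_steps):  # hard cap so the loop always terminates
        if reasoner(goal, state):
            break
        action = planner(goal, state)
        if action is None:
            break
        state = executor(action, state)
        memory.short_term.append(f"{action} -> {state}")
    return state, memory

final_state, memory = run_agent(goal=3)
```

In a real agent the planner and reasoner would be LLM calls; the `max_steps` cap is the part worth keeping, since it bounds cost when the plan never converges.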

    -----------------------------------------

    Understanding RAG: How Retrieval-Augmented Generation Works

    RAG is a powerful technique that combines search and generation to make AI responses accurate, grounded, and up-to-date. Instead of relying only on what a language model learned during training, RAG allows it to pull in fresh, private, or domain-specific data on the fly.

    Let’s break it down into four key stages:

    1. Indexing – Preparing Your Knowledge Base

    Before AI can answer questions using your documents, those documents need to be transformed into a searchable format.

    How it works:

    • Start with raw content: PDFs, Word files, notes, web pages, etc.
    • Extract text: Pull plain text from these sources.
    • Chunking: Split long text into smaller, manageable pieces. This matters because embedding models and LLMs have input-size limits, and smaller chunks can be matched to a query more precisely.
    • Vectorization: Convert each chunk into a numerical representation called a vector, which captures the meaning of the text.
    • Embedding model: A specialized model performs this text-to-vector conversion.
    • Store in a vector database: All vectors are saved in a database optimized for similarity search.
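
As a rough illustration of the indexing stage, the sketch below chunks text, turns each chunk into a vector, and stores the pairs in a list. It assumes a toy `embed` function (normalized word counts) in place of a real embedding model and a Python list in place of a vector database; all function names are hypothetical.

```python
import math
import re
from collections import Counter

def chunk_text(text: str, max_words: int = 8) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> dict[str, float]:
    """Toy embedding: L2-normalized word counts (a real system would call a model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def build_index(documents: list[str]) -> list[tuple[str, dict[str, float]]]:
    """Return (chunk, vector) pairs; the list stands in for a vector database."""
    index = []
    for doc in documents:
        for chunk in chunk_text(doc):
            index.append((chunk, embed(chunk)))
    return index

docs = [
    "Either party may terminate the contract with thirty days "
    "written notice to the other party."
]
index = build_index(docs)
```

Fixed-size word windows are the simplest chunking strategy; production pipelines often split on sentence or section boundaries and overlap adjacent chunks so that a fact is not cut in half.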

    2. Retrieval – Finding Relevant Information

    When a user asks a question, the system fetches the most relevant chunks from the indexed data.

    Steps:

    • User submits a query: Example: “What does the contract say about termination?”
    • Convert query to a vector: Using the same embedding model as before.
    • Similarity search: Compare the query vector with stored document vectors.
    • Return top matches: The system outputs the most relevant text chunks.
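
The retrieval steps can be sketched the same way. This example again assumes a toy word-count `embed` function, so it only matches exact words ("termination" does not match "terminate"); a real embedding model is what lets similarity search capture meaning. The chunk texts are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> dict[str, float]:
    """Toy embedding: L2-normalized word counts (stand-in for a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity; both vectors are already unit length, so this is a dot product."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Embed the query with the same function as the chunks and return the best matches."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]

chunks = [
    "Either party may terminate the contract with thirty days notice.",
    "Payment is due within fifteen days of the invoice date.",
    "Goods are delivered to the buyer's warehouse.",
]
top = retrieve("What does the contract say about termination?", chunks, top_k=1)
```

Here the chunk vectors are recomputed on every query for brevity; an actual system embeds chunks once at indexing time and lets the vector database do the nearest-neighbor search.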

    3. Augmentation – Building Context for the Model

    The retrieved chunks are combined with the user’s question to create a rich, context-aware prompt.

    Process:

    • Gather relevant chunks.
    • Merge them into a clean context block.
    • Construct a new prompt that includes:
      1. The original question
      2. The retrieved context
    • This augmented prompt gives the LLM the background knowledge it needs.
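
A minimal sketch of the augmentation step follows. The prompt wording and the `build_augmented_prompt` name are assumptions, not a fixed template; the context is placed before the question here, which is a common choice but not a requirement.

```python
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Merge retrieved chunks into a numbered context block and append the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_augmented_prompt(
    "What does the contract say about termination?",
    ["Either party may terminate the contract with thirty days notice."],
)
```

Numbering the chunks gives the model something to cite, which makes the final answer easier to trace back to a source passage.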

    4. Generation – Producing a Grounded Answer

    Finally, the enriched prompt is sent to the language model.

    What happens:

    • The LLM reads both the question and the retrieved context.
    • It generates a response grounded in the retrieved data rather than in training-time recall alone.
    • The output can be traced back to your documents, which makes it easier to verify and explain.

    ---------------------

    Why RAG Matters

    • Standard LLMs rely only on their training data.
    • RAG enables models to use your private or latest information.
    • Updating knowledge is as simple as updating your indexed documents.

     This approach is widely used in enterprise search, chatbots, legal document analysis, and customer support systems.

    -----------------------------
    How LLM Agents Combine Memory, Planning, and Tools for Intelligent Task Execution


    1. Receive query and fetch memory

      • The agent takes the user query.
      • It retrieves relevant session memory or long-term memory (for user preferences, past decisions, cached results).
    2. Discover tools via MCP

      • The agent searches the MCP Tools Registry for available tools, schemas, and capabilities.
      • Examples: OpenAPI specs, JSON Schema, AWS Lambda functions, SageMaker endpoints, SaaS connectors.
    3. Planning, reflection, and tool choice

      • The LLM plans steps, decomposes goals, validates arguments, and routes the query.
      • Uses self-critique to ensure correct tool selection and schema compliance.
      • Planning references memory to avoid redundant calls and to personalize results.
    4. Execute tools and collect observations

      • The MCP Tool Runner executes the chosen function.
      • The agent receives tool outputs, checks units, ranges, and business rules.
      • If errors occur, planning adapts, retries, or switches tools.
    5. Respond and update memory

      • The LLM synthesizes a natural-language answer.
      • Optionally writes key facts or decisions back to memory for future use.
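
The five steps can be sketched end to end with the planning step stubbed out. This is an illustration under assumptions: `REGISTRY` is a hypothetical in-process stand-in for tool discovery, the tool choice and arguments are passed in as if the LLM had already selected them, and memory is a plain dict.

```python
from typing import Any

# Hypothetical registry: each tool declares required argument types and a function.
REGISTRY: dict[str, dict[str, Any]] = {
    "convert_km_to_miles": {
        "required": {"km": float},
        "func": lambda km: round(km * 0.621371, 2),
    },
}

def validate_args(tool: str, args: dict) -> list[str]:
    """Check arguments against the tool's declared schema; return any problems found."""
    spec = REGISTRY[tool]["required"]
    errors = [f"missing: {name}" for name in spec if name not in args]
    errors += [
        f"wrong type: {name}"
        for name, expected in spec.items()
        if name in args and not isinstance(args[name], expected)
    ]
    return errors

def handle_query(query: str, memory: dict, tool: str, args: dict) -> str:
    # Step 1: consult memory first to avoid a redundant tool call.
    cache_key = f"{tool}:{sorted(args.items())}"
    if cache_key in memory:
        return f"(from memory) {memory[cache_key]}"
    # Steps 2-3: look up the tool and validate the arguments the planner chose.
    errors = validate_args(tool, args)
    if errors:
        return "Cannot run tool: " + "; ".join(errors)
    # Step 4: execute the tool and collect the observation.
    result = REGISTRY[tool]["func"](**args)
    # Step 5: respond and write the result back to memory.
    memory[cache_key] = result
    return f"{query} -> {result}"

memory: dict = {}
first = handle_query("How far is 100 km in miles?", memory, "convert_km_to_miles", {"km": 100.0})
second = handle_query("How far is 100 km in miles?", memory, "convert_km_to_miles", {"km": 100.0})
bad = handle_query("How far is 100 km in miles?", memory, "convert_km_to_miles", {"km": "100"})
```

Validating before executing is what lets the agent recover cheaply: a schema error can be fed back to the planner as an observation instead of surfacing as a failed API call.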

    =======================================================

     Role of Memory in Tool Calling

    Memory acts as the context backbone for the agent. It ensures that the LLM doesn’t operate in isolation but instead uses relevant historical and contextual data to make better decisions during planning and tool invocation.

    Types of Memory

    1. Short-Term Memory (Session Memory)

      • Tracks the current conversation flow and intermediate steps.
      • Example: If the user asks, “Add a new test case for DAWR,” and later says, “Make it similar to the last one,” short-term memory recalls what “last one” refers to.
      • Stored in the agent’s working context (like a conversation buffer).
    2. Long-Term Memory

      • Stores persistent knowledge across sessions.
      • Example: Past tool calls, user preferences, previous bug reports, or test harness details.
      • Typically implemented using vector databases (e.g., Pinecone, Weaviate, FAISS) for semantic search.
      • Enables retrieval of relevant references during planning or content generation.
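
The two memory types can be sketched as follows. Both classes are hypothetical: the short-term buffer is a bounded deque, and the long-term store uses keyword overlap as a stand-in for the semantic search a vector database would provide.

```python
from collections import deque
from typing import Optional

class ShortTermMemory:
    """Conversation buffer that keeps only the most recent turns."""

    def __init__(self, max_turns: int = 4):
        self.turns = deque(maxlen=max_turns)  # oldest turns drop off automatically

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def last_user_message(self) -> Optional[str]:
        for turn in reversed(self.turns):
            if turn["role"] == "user":
                return turn["content"]
        return None

class LongTermMemory:
    """Persistent facts; keyword overlap stands in for vector similarity."""

    def __init__(self):
        self.records: list[str] = []

    def remember(self, fact: str) -> None:
        self.records.append(fact)

    def recall(self, query: str, top_k: int = 1) -> list[str]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.records,
            key=lambda r: len(query_words & set(r.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]

session = ShortTermMemory(max_turns=2)
session.add("user", "Add a new test case for DAWR")
session.add("assistant", "Added the test case.")
session.add("user", "Make it similar to the last one")

store = LongTermMemory()
store.remember("user prefers JSON schema based prompts")
store.remember("user often works on Linux RAS components")
recalled = store.recall("which prompts does the user prefer")
```

Note the trade-off the bounded buffer makes visible: with `max_turns=2` the first user message has already been evicted, which is exactly the kind of fact that needs to be promoted to long-term memory if later turns may refer to it.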

    Where Memory Fits in the Flow

    Mapping memory onto the five-step flow described above:

    • Step 1 (Receive Query):
      Memory is accessed immediately to enrich the query with historical context.
      Example: “User often works on Linux RAS components → prioritize related tools.”

    • Step 3 (Planning & Tool Choice):
      Memory helps the LLM plan better by recalling previous tool usage patterns, schema details, and user-specific constraints.
      Example: “Last time, the user preferred JSON schema-based prompts → use that format.”

    • Step 4 (Tool Execution):
      Memory can store execution results for future reuse.
      Example: Cache API responses or computed estimates to avoid redundant calls.

    • Step 5 (Response):
      Memory updates with new facts, decisions, and tool outputs for long-term learning.

    Why Memory Matters

    • Personalization: Tailors responses based on user history.
    • Efficiency: Avoids repeated tool calls by caching results.
    • Accuracy: Provides richer context for reasoning and planning.
    • Scalability: Enables complex workflows by chaining past knowledge.

    Practical Implementation

    • Short-Term: Conversation buffer in the agent shell.
    • Long-Term:
      • Vector DB for semantic retrieval.
      • Store tool metadata, execution logs, and user preferences.
      • Use embeddings to link queries with relevant past interactions.

    ================================

    n8n is an open-source workflow automation platform that helps you connect different apps, services, and APIs without writing a lot of custom code. It’s similar to tools like Zapier or Make (formerly Integromat), but with more flexibility and self-hosting options.

    Here’s what makes n8n special:

    • Visual Workflow Builder: You can create workflows using a drag-and-drop interface.
    • Integrations: It supports hundreds of apps and APIs (Slack, GitHub, Google Sheets, etc.).
    • Custom Logic: You can add JavaScript code snippets for advanced logic.
    • Self-Hosting: Unlike many SaaS automation tools, you can run n8n on your own server for full control.
    • Event-Driven Automation: Trigger workflows based on events (e.g., new email, webhook, database update).

    ================================

    Five Single AI Agent Architectures

    1. AI Agent Using Tools

    • The agent receives a chat message and plans its actions.
    • It can access contacts, send emails, or send invitations using integrated tools.

    2. AI Agent Mixing Tools with MCP Servers

    • Triggered by another app through a webhook.
    • Uses an MCP (Model Context Protocol) server for specialized integrations (e.g., Atlassian).
    • Combines ready-to-use tools for other interactions.

    3. Agentic Workflow with a Router

    • A router acts as a conditional decision-maker.
    • The agent routes tasks based on conditions (e.g., if X happens, do Y).
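
Outside a visual builder, the router pattern reduces to an ordered list of condition and handler pairs. The sketch below is plain Python rather than n8n configuration, and the keywords and handler names are made up for illustration.

```python
from typing import Callable

def handle_billing(message: str) -> str:
    return f"billing: {message}"

def handle_support(message: str) -> str:
    return f"support: {message}"

def handle_default(message: str) -> str:
    return f"general: {message}"

# Ordered (condition, handler) pairs; the first matching condition wins.
ROUTES: list[tuple[Callable[[str], bool], Callable[[str], str]]] = [
    (lambda m: "invoice" in m.lower() or "refund" in m.lower(), handle_billing),
    (lambda m: "error" in m.lower() or "crash" in m.lower(), handle_support),
]

def route(message: str) -> str:
    """Send the message to the first handler whose condition matches."""
    for condition, handler in ROUTES:
        if condition(message):
            return handler(message)
    return handle_default(message)  # fallback when no condition matches

billing_reply = route("I need a refund")
```

In an agentic workflow the conditions are often an LLM classification step rather than keyword checks, but keeping an explicit fallback route is useful either way.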

    4. AI Agent with a Human in the Loop

    • The agent pauses for human approval before proceeding.
    • Example: Asking for Slack approval before executing an action.
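
The approval gate can be sketched as a function that refuses to run an action until a reviewer callback says yes. This is an illustration only: `approver` is a hypothetical callback that, in a real workflow, would post to Slack and wait for a response.

```python
from typing import Callable

def run_with_approval(
    action_name: str,
    action: Callable[[], str],
    approver: Callable[[str], bool],
) -> str:
    """Execute the action only if the approver returns True for it."""
    if not approver(action_name):
        return f"{action_name}: rejected by reviewer, nothing executed"
    return action()

executed: list[str] = []

def delete_records() -> str:
    executed.append("deleted")  # side effect we want gated behind approval
    return "records deleted"

# Stub approvers standing in for a human decision.
rejected = run_with_approval("delete_records", delete_records, lambda name: False)
approved = run_with_approval("delete_records", delete_records, lambda name: True)
```

Passing the action as a callable rather than calling it up front is the important detail: the side effect cannot happen before the approval check.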

    5. Dynamically Calling Other Agents

    • The agent autonomously decides whether to call another AI agent.
    • Option 1: Subagent via an AI Agent Tool node.
    • Option 2: Subagent or another agent via a Workflow Tool node.
    ====================================

    Understanding AI Agent Protocols: The Backbone of Agentic AI

    As AI evolves from single models to agentic architectures (systems where multiple AI agents work together), one question becomes critical:

    How do these agents talk to each other and coordinate tasks?

    That’s where AI Agent Protocols come in. Think of them as the “rules of communication” that allow agents to share information, collaborate, and connect with external tools. Without these protocols, agentic AI would be chaotic and unreliable.

    Here are the most important protocols you should know—explained simply:

    1. MCP — Model Context Protocol (by Anthropic)

    • What it does: Helps AI agents manage context and connect to external tools like Slack, GitHub, or APIs.
    • How it works: Uses a client-server setup with JSON-RPC for communication.
    • Example: Imagine an AI assistant in Slack that can also pull data from GitHub and update project status automatically.
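
To make the JSON-RPC part concrete, the sketch below builds the request a client sends to ask an MCP server to run a tool, and unpacks a response. It only constructs and parses messages (no transport), and the tool name `get_issue` and its arguments are hypothetical.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched

def make_tool_call_request(tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request for the MCP tools/call method."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

def parse_response(raw: str) -> dict:
    """Return the result of a JSON-RPC response, or raise if it carries an error."""
    message = json.loads(raw)
    if "error" in message:
        err = message["error"]
        raise RuntimeError(f"JSON-RPC error {err['code']}: {err['message']}")
    return message["result"]

# Hypothetical tool name and arguments, for illustration only.
request = make_tool_call_request("get_issue", {"key": "PROJ-1"})

# A hand-written response in the shape a server would send back.
sample_response = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "Issue PROJ-1 is open"}]},
})
result = parse_response(sample_response)
```

In practice an MCP client library handles this framing along with the initialization handshake and transport; seeing the raw messages mainly helps when debugging a server.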

    2. A2A — Agent-to-Agent Protocol (by Google)

    • What it does: Allows multiple AI agents to collaborate and share tasks.
    • How it works: Agents talk directly (peer-to-peer) or through a central coordinator.
    • Example: Two AI agents—one handling API calls and another managing database queries—working together to complete a workflow in Vertex AI.

    3. SLIM — Secure Low-Latency Interactive Messaging (from the AGNTCY project)

    • What it does: Makes sure agents exchange messages in a structured, predictable way.
    • Why it matters: Prevents confusion when agents use tools or execute tasks.
    • Example: An agent asking another agent for a tool response in a clear format, so nothing gets lost in translation.

    4. ACP — Agent Communication Protocol (by IBM)

    • What it does: Handles discovery of helper agents, status updates, and message routing.
    • Where it’s used: Large enterprise systems with multiple agents and services.
    • Example: Orchestrating a complex workflow where one agent monitors servers, another handles alerts, and a third updates dashboards.

    Why This Matters

    These protocols are the foundation of agentic AI. They enable:

    • Coordination between agents
    • Scalability for large systems
    • Real-world execution beyond simple prompts

    As we move toward multi-agent, autonomous systems, understanding these protocols is essential for building reliable AI solutions.

    Pro Tip for Beginners: Start by experimenting with one protocol (like MCP) in a simple project—such as connecting an AI chatbot to an external API. Once you see how communication works, scaling to multi-agent systems becomes much easier.

    =========================

    Conclusion

      LLMs can dynamically decide when to use a tool, pass the right arguments, and incorporate the results into their final response. This approach transforms LLMs from passive text generators into active problem-solvers that can query APIs, run computations, or fetch real-time data.

      Understanding this workflow—define tools → let the model decide → parse arguments → execute locally → return results → finalize answer—is key to building powerful AI-driven applications. Whether you’re integrating with APIs, automating workflows, or creating intelligent assistants, tool calling is the foundation for making LLMs truly useful in real-world scenarios.