Monday, March 16, 2026

Claude Design uses Large Context Windows for Deeper Reasoning over RAG

Modern LLM systems typically choose between two ways of giving models the information they need: Retrieval Augmented Generation (RAG) or large context windows. The two solve different problems, and in practice they are often complementary.

Claude (Anthropic) in VS Code primarily uses a large context window plus agentic code exploration, not classic RAG by default. RAG can be added, but it is optional and external.

Claude in VS Code analyzes code using large context windows and active file exploration, because code benefits more from precise, agent‑driven inspection than from passive RAG retrieval.

1. Large context window as the primary mechanism

Claude Code relies on very large context windows to analyze code. The VS Code extension automatically provides Claude with:

  • Your currently open file
  • Selected text ranges
  • Files you explicitly reference using @file or line ranges
  • Project memory files like CLAUDE.md

This behavior is documented in the official Claude Code VS Code docs, which describe direct file visibility and context passing rather than retrieval pipelines. 

Claude models are explicitly designed to support hundreds of thousands of tokens, which makes “read the code directly” feasible without a retrieval layer.

2. Agentic search instead of passive RAG

Rather than pre‑indexing your repo into a vector database (classic RAG), Claude Code acts as an agent that:

  • Searches files
  • Reads only relevant sections
  • Iteratively explores the codebase

This design choice is highlighted in community and practitioner analyses describing Claude Code as active investigation instead of “dump everything into context” RAG. 

Examples of agentic behavior include:

  • Grep‑like searches
  • Targeted file reads
  • Incremental context building

This is fundamentally different from traditional RAG, which retrieves chunks blindly based on similarity.
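The grep-then-read loop described above can be sketched in a few lines of Python. This is a toy illustration only; the helper names and the character budget are invented for this example and are not Claude Code's actual implementation:

```python
# Toy sketch of agent-style code exploration: search deterministically for an
# identifier first, then read only the matching files into the prompt,
# instead of retrieving chunks by embedding similarity.
import re
from pathlib import Path


def grep_repo(root: str, pattern: str) -> list[Path]:
    """Deterministic search: return .py files whose text matches the pattern."""
    rx = re.compile(pattern)
    return [p for p in sorted(Path(root).rglob("*.py"))
            if rx.search(p.read_text(errors="ignore"))]


def build_context(root: str, symbol: str, budget_chars: int = 20_000) -> str:
    """Incrementally add matching files to the prompt until a size budget is hit."""
    parts, used = [], 0
    for path in grep_repo(root, symbol):
        text = path.read_text(errors="ignore")
        if used + len(text) > budget_chars:
            break  # incremental context building: stop at the budget
        parts.append(f"# file: {path}\n{text}")
        used += len(text)
    return "\n\n".join(parts)
```

The key contrast with RAG: nothing here depends on embedding quality; a file is either matched by the search or it is not.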

3. Why Anthropic chose this approach

Reason 1: Code is structured, not fuzzy text

Source code has:

  • Strong syntax
  • Explicit dependencies
  • Precise identifiers

Anthropic’s approach assumes it is better to search deterministically (file names, symbols, call paths) than rely on embedding similarity alone. 

Reason 2: Large context windows reduce RAG overhead

With large context windows:

  • Claude can read entire files when needed
  • No chunking or embedding errors
  • No stale indexes after code changes

This is reinforced by the existence of tooling that tracks context window usage, showing that Claude is designed to operate close to context limits rather than avoid them. 

Reason 3: RAG is optional, not built‑in

RAG for Claude Code exists as external or community tools, not as a default feature. For example:

  • DevRAG and MCP‑based tools add vector search to Claude Code
  • These are explicitly framed as token‑saving optimizations, not core architecture. 

This strongly implies that Anthropic does not consider RAG mandatory for code understanding.

Summary comparison

Aspect             | Claude Code (VS Code)
Default approach   | Large context window + agentic exploration
Classic RAG        | ❌ Not default
Vector DB indexing | ❌ Optional / external
File access        | Direct, on demand
Context control    | Explicit and visible
Best at            | Deep, precise code reasoning

1. What RAG is used for

RAG is presented as a technique that combines:

  • Large Language Models
  • External data retrieval mechanisms such as vector databases, semantic search, and embeddings

This allows models to answer questions using external, up‑to‑date, and trusted data, rather than relying only on their training data. 

2. RAG vs large context windows

A common question is whether massive context windows are replacing the need for RAG. While long context windows allow more data to be passed directly into prompts, they do not automatically eliminate the need for structured retrieval approaches like RAG.

3. Choosing the right approach

Rather than declaring RAG obsolete:

  • The decision depends on the application use case
  • Factors like data freshness, trustworthiness, and AI workflow design matter
  • RAG remains relevant in many enterprise scenarios 

RAG is not “dead”; it is one of several viable approaches, and the right choice depends on your data, accuracy needs, and LLM workflow design. 



Retrieval Augmented Generation vs Large Context Windows

Use cases and advantages

Modern LLM systems typically choose between two ways of giving models the information they need: Retrieval Augmented Generation (RAG) or large context windows. Both solve different problems and are often misunderstood as competing approaches. In practice, they are complementary.

What RAG is good at

Core idea

RAG augments an LLM with external knowledge retrieval at inference time. Instead of relying only on what the model remembers from training, the system fetches relevant documents from databases, wikis, PDFs, or logs and injects them into the prompt before generation.
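The retrieve-then-inject flow can be sketched as follows. This is a minimal illustration: a bag-of-words cosine score stands in for real embeddings, the function names are hypothetical, and the LLM call itself is omitted:

```python
# Minimal RAG sketch: score documents against the query, take the top-k,
# and inject them into the prompt before generation.
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    return sorted(docs,
                  key=lambda d: cosine(q, Counter(d.lower().split())),
                  reverse=True)[:k]


def build_prompt(query: str, docs: list[str]) -> str:
    """Ground the answer in retrieved sources rather than model memory."""
    sources = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"
```

A production system would replace the cosine scorer with an embedding model and a vector database, but the shape of the pipeline is the same.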

RAG use cases

1. Enterprise knowledge assistants

RAG is ideal when answers must come from proprietary or fast‑changing data, such as:

  • Internal wikis
  • Product documentation
  • Support runbooks
  • Compliance and policy documents

The model retrieves the most relevant documents and generates grounded answers, reducing hallucinations.

2. Regulated and audit‑heavy environments

RAG supports traceability by attaching responses to source documents. This is critical in:

  • Legal research
  • Healthcare decision support
  • Financial compliance systems

Many RAG systems explicitly return citations or document references.

3. Dynamic and real‑time information

LLMs are static after training. RAG solves this by pulling:

  • Latest regulations
  • Updated pricing
  • Live operational data

This is why RAG is widely used in customer support, finance, and industrial operations.

Advantages of RAG

  1. Up‑to‑date knowledge
    The model can access information created after training without retraining.

  2. Reduced hallucinations
    Responses are grounded in retrieved documents rather than model memory alone. 

  3. Enterprise data isolation
    Sensitive internal data stays in your retrieval layer and does not become part of model training.

  4. Scales beyond context limits
    You do not need to fit all documents into the context window at once.

What large context windows are good at

Core idea

A large context window allows the model to see and reason over massive inputs directly, sometimes hundreds of thousands or even millions of tokens at once. 

Instead of retrieving small chunks, you load large sections or entire artifacts into the prompt.
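A context-window-first loader might look like this. It is a sketch: the 4-characters-per-token ratio is a rough heuristic, not any model's actual tokenizer, and the budget is illustrative:

```python
# Context-window-first sketch: instead of retrieving chunks, load whole
# artifacts into the prompt and track an approximate token budget.
def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def stuff_context(files: dict[str, str], budget_tokens: int = 200_000) -> str:
    """Concatenate entire files into one prompt until the budget is reached."""
    parts, used = [], 0
    for name, text in files.items():
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            break  # artifact no longer fits; stop (or summarize/compact)
        parts.append(f"=== {name} ===\n{text}")
        used += cost
    return "\n".join(parts)
```

With a million-token window the budget rarely binds for single repositories, which is what makes the "read the code directly" approach feasible.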

Large context window use cases

1. Large codebase understanding

Large context models excel at:

  • Reading entire modules or repositories
  • Understanding cross‑file dependencies
  • Refactoring with global awareness

This is especially valuable for code analysis where structure and relationships matter more than fuzzy retrieval. 

2. Deep document analysis

Large context windows enable:

  • End‑to‑end reading of specifications
  • Full contract analysis
  • Research paper or RFC comprehension in one pass

This avoids chunking errors introduced by RAG pipelines. 

3. Long‑running reasoning and agent workflows

With large context:

  • Multi‑step reasoning stays coherent
  • The model remembers earlier constraints
  • No repeated retrieval calls are required

This is why agentic coding tools often prefer large context over classic RAG. 

Advantages of large context windows

  1. Holistic reasoning
    The model sees the entire artifact, enabling better global understanding.

  2. No retrieval errors
    There is no risk of missing relevant chunks due to poor embeddings or ranking.

  3. Simpler architecture
    No vector database, no indexing, no retriever tuning required.

  4. Better for structured data
    Code, configs, and logs benefit more from direct inspection than semantic similarity search. 

RAG vs large context window 

Aspect                  | RAG                                  | Large context window
Best for                | Enterprise knowledge, docs, policies | Code, specs, deep analysis
Handles fresh data      | Yes                                  | No
Needs external systems  | Yes                                  | No
Risk of missing info    | Possible                             | Low
Cost model              | Retrieval + inference                | Token-heavy inference
Architecture complexity | Higher                               | Lower


  • Use RAG when correctness, freshness, and traceability matter.
  • Use large context windows when deep reasoning over structured artifacts like code is required.

-------------------------------------------

  • Claude (Anthropic) supports up to 1 million tokens of context window in its latest generally available models.
  • IBM watsonx Granite models support a 128K token context window across the Granite 3.1 and newer Granite 3.x families.

Anthropic has expanded Claude’s context window significantly:

  • Claude Opus 4.6 and Claude Sonnet 4.6
    • Maximum context window: 1,000,000 tokens
    • This is generally available with no long‑context pricing premium
    • Applies to Claude Code, API usage, and supported cloud platforms

This is confirmed in Anthropic’s official documentation and announcements.

  • Entire large codebases or monorepos can fit in a single session
  • Long‑running agentic workflows without frequent context compaction
  • Strong fit for context‑first code analysis over RAG

IBM has standardized the context length across the Granite family:

  • Granite 3.1 and Granite 3.3 models
    • Context window: 128,000 tokens
    • Applies to:
      • Granite 3.1 8B Instruct
      • Granite 3.1 2B
      • Granite 3.3 8B Instruct
      • Granite Guardian models
      • Granite Code models
    • Available in IBM watsonx.ai and open‑source releases

IBM explicitly states that all Granite 3.1 language models feature a 128K token context length.

What this means in practice

  • Suitable for long documents, enterprise policies, and medium‑sized repositories
  • Optimized for enterprise RAG pipelines
  • Strong balance between cost, performance, and governance

Model family                   | Max context window
Claude Opus 4.6                | 1,000,000 tokens
Claude Sonnet 4.6              | 1,000,000 tokens
IBM Granite 3.1 (all variants) | 128,000 tokens
IBM Granite 3.3 8B Instruct    | 128,000 tokens

Claude

  • Optimized for context‑first and agentic workflows
  • Designed to analyze entire artifacts directly
  • Reduces dependency on RAG for code and reasoning tasks

Granite

  • Optimized for enterprise AI with governance
  • Designed to work with RAG and retrieval layers
  • Prioritizes cost control, explainability, and compliance

IBM explicitly positions Granite alongside RAG‑first architectures, including embedding models and document preprocessing frameworks like Docling. 

  • Claude enables large‑context‑first code analysis, which explains its architectural preference over RAG.
  • Granite intentionally caps context at 128K, encouraging retrieval‑based grounding for enterprise workloads.

In Granite models, “B” means billions of learned parameters. It measures model capacity, not context length or training data size. In model names like:

  • Granite 3.1 8B
  • Granite 3.1 2B
  • Granite Guardian 3.1 8B
  • Granite Code 3B
  • Granite 3.1 3B‑A800M (MoE)

the “B” stands for Billion parameters.

1B = 1 billion parameters

A parameter is a learned numerical weight inside the neural network that stores knowledge acquired during training.

For example, IBM explicitly states that:

  • Granite‑3.0‑8B‑Instruct is an 8‑billion‑parameter model 

Think of parameters as:

  • The knobs inside the model
  • Each knob controls how strongly the model connects concepts
  • Training adjusts billions of these knobs to encode language, code, and reasoning patterns

More parameters generally mean:

  • Higher reasoning capacity
  • Better pattern recognition
  • Better generalization

but also:

  • Higher memory usage
  • Higher compute cost

This definition is consistent across Granite, Llama, Claude, GPT, and other LLM families. 
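One practical consequence of parameter count is memory footprint. A back-of-the-envelope sketch, assuming 2 bytes per weight (bf16/fp16) and ignoring activations and KV cache:

```python
# Rough memory arithmetic for "B = billions of parameters":
# an 8B dense model stores ~8e9 learned weights; at 2 bytes each (bf16),
# that is about 16 GB for the weights alone.
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in (decimal) GB; excludes activations and KV cache."""
    return params_billions * 1e9 * bytes_per_param / 1e9


print(weight_memory_gb(8))  # 8B model in bf16 -> 16.0
print(weight_memory_gb(2))  # 2B model in bf16 -> 4.0
```

This is why fewer parameters translate directly into cheaper, easier deployment, independent of context window size.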

Dense models

Model name              | Meaning
Granite 3.1 2B          | ~2 billion parameters
Granite 3.1 8B          | ~8 billion parameters
Granite Guardian 3.1 8B | ~8 billion parameters
Granite Guardian 3.1 2B | ~2 billion parameters

These are dense transformer models, meaning all parameters are active for every token processed.

IBM confirms Granite 8B models are 8‑billion‑parameter dense decoder‑only transformers. [ibm.com]

What about MoE models like 3B‑A800M?

Example: Granite 3.1 3B‑A800M. This is a Mixture‑of‑Experts (MoE) model.

Meaning

  • 3B = total parameters in the model
  • A800M = approximately 800 million active parameters per token

IBM documents that Granite MoE models activate only a subset of experts per inference step, reducing compute cost while maintaining capacity. 

Why this matters

  • MoE models scale capacity without linear cost
  • You get large model intelligence with lower inference overhead
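The naming convention makes the arithmetic easy to check. Assuming "A800M" means roughly 800 million active parameters per token, as described above:

```python
# MoE arithmetic for a name like "Granite 3.1 3B-A800M":
# total capacity vs parameters actually used per token.
total_params = 3_000_000_000   # "3B": total parameters in the model
active_params = 800_000_000    # "A800M": ~active parameters per token

fraction = active_params / total_params
print(f"active per token: {fraction:.0%}")  # -> active per token: 27%
```

Only about a quarter of the weights participate in each inference step, which is where the compute savings come from.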

Why all Granite models still have 128K context

Parameter count (B) and context window size are independent dimensions.

IBM explicitly states:

  • All Granite 3.1 dense, MoE, and Guardian models support a 128K token context window 

That means:

  • 2B vs 8B affects model intelligence
  • 128K affects how much text the model can read at once

Parameter size   | What it impacts
More parameters  | Better reasoning, coding, abstraction
Fewer parameters | Faster, cheaper, easier to deploy
MoE architecture | Better scaling efficiency
Context window   | How much data can be processed per request

Proprietary frontier models

These dominate enterprise copilots, coding assistants, and research tools.

  1. OpenAI GPT series

    • Examples: GPT‑5.x
    • Known for strong reasoning, math, and general intelligence
    • Widely used in ChatGPT and enterprise APIs
  2. Anthropic Claude

    • Examples: Claude Opus 4.6, Sonnet 4.6
    • Strong in code analysis, safety, and long‑context reasoning
    • Notable for very large context windows 
  3. Google Gemini

    • Examples: Gemini 3 Pro, Gemini Flash
    • Multimodal first design with text, image, audio, and video
    • Tight integration with Google Workspace and Vertex AI 
  4. xAI Grok

    • Examples: Grok 4
    • Optimized for real‑time and social data analysis
    • Integrated with the X platform 

Enterprise and governance‑focused models

These are optimized for regulated industries and private deployments.

  1. IBM watsonx Granite

    • Examples: Granite 3.1 8B, Granite Guardian
    • Enterprise‑grade governance and RAG‑first architecture
    • Open models with Apache 2.0 licensing 
  2. Amazon Nova

    • Examples: Nova Premier
    • Designed for scalable enterprise workloads on AWS
    • Integrated with Bedrock and AWS tooling 

Open and open‑weight frontier models

Popular for self‑hosting, cost control, and customization.

  1. Meta Llama

    • Examples: Llama 4 Scout, Llama 4 Maverick
    • Widely adopted open‑weight models
    • Strong ecosystem and tooling support 
  2. Mistral

    • Examples: Mistral Large, Mixtral
    • Efficient architectures and strong reasoning
    • Apache‑licensed options for enterprise use 
  3. DeepSeek

    • Examples: DeepSeek V3, DeepSeek R1
    • High‑performance open models competitive with proprietary LLMs
    • Popular for reasoning and coding tasks
  4. Qwen

    • Examples: Qwen 3, Qwen 3.5
    • Strong multilingual and long‑context capabilities
    • Increasing adoption in open‑source deployments

Lightweight and edge‑focused models

Used where latency, cost, or on‑device inference matters.

  1. Microsoft Phi

    • Examples: Phi‑3, Phi‑4
    • Small, efficient models for constrained environments
    • Often embedded in tools and workflows 
  2. Gemma

    • Examples: Gemma 2, Gemma 3
    • Google‑released open models
    • Designed for research and local inference 

Today’s LLM ecosystem spans proprietary frontier models like GPT, Claude, and Gemini, enterprise‑focused platforms such as IBM Granite, and open or open‑weight models like Llama, Mistral, and DeepSeek. Each family makes different tradeoffs across reasoning quality, context window size, governance, cost, and deployability, which is why modern AI systems increasingly adopt multi‑model strategies instead of relying on a single LLM.


Friday, January 2, 2026

Understanding BPF Trace Probes and BTF: Practical Insights from Real-World Debugging

Introduction 

BPFTrace is a powerful tool for dynamic tracing in Linux, enabling developers and system engineers to observe kernel and user-space events in real time. While working with BPFTrace, you often encounter different probe types and kernel features like BTF (BPF Type Format). This blog explains what these probes mean, why BTF matters, and how to troubleshoot common issues. 

  1. What bpftrace probes mean

    • Explain probe types like tracepoint, rawtracepoint, kprobe, and fentry:
      • tracepoint: Stable kernel instrumentation points for syscalls and subsystems.
      • rawtracepoint: Low-level hooks for tracepoints with minimal decoding.
      • kprobe: Dynamic function entry probes for kernel symbols.
      • fentry: Modern BPF function entry probes using BTF type info.
  2. What is BTF and why it matters

    • BPF Type Format (BTF) provides kernel type metadata for BPF programs.
    • Enables automatic argument decoding and advanced probes like fentry.
    • How to check if BTF is present (/sys/kernel/btf/vmlinux) and what to do if missing (use BPFTRACE_KERNEL_SOURCE or simpler probes).
  3. Common errors and fixes

    • Example error: error: field has incomplete type 'const enum landlock_rule_type'
      • Cause: Incomplete type info due to missing or partial BTF.
      • Fix: Use raw syscalls tracepoints or point bpftrace to kernel sources.
  4. Practical examples

    • bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s\n", comm); }'
      Meaning: prints process names whenever the openat() syscall is called.
    • Alternative for PPC/RHEL when BTF is incomplete: bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { print(@); clear(@); }'
  5. Tips for running tests and scripts

    • How to run bpftrace tests (ctest) and functional one-liners.
    • How to handle duration (interval probe or -c 'sleep N')
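The BTF presence check mentioned in point 2 can be scripted, for example in Python. This is a sketch; the messages are illustrative, and the path and BPFTRACE_KERNEL_SOURCE fallback come from the outline above:

```python
# Check whether the running kernel exposes BTF type metadata.
from pathlib import Path

BTF_PATH = Path("/sys/kernel/btf/vmlinux")  # path cited in the text above

if BTF_PATH.exists():
    print("BTF available: fentry probes and automatic argument decoding should work")
else:
    print("BTF missing: fall back to kprobes or raw syscall tracepoints,")
    print("or point bpftrace at kernel sources via BPFTRACE_KERNEL_SOURCE")
```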


Background on BPF: 

BPF (Berkeley Packet Filter) started as a packet filtering mechanism in Unix systems but has evolved into eBPF (Extended BPF) in modern Linux kernels. eBPF is a technology that allows you to run sandboxed programs inside the kernel without changing kernel source code or loading kernel modules.

  • Key idea: eBPF programs are verified and JIT-compiled by the kernel, making them safe and efficient.
  • Capabilities: Observability, networking, security, and performance monitoring.

What is bpftrace?

bpftrace is a high-level front-end for eBPF. It provides a simple scripting language to attach probes to kernel/user events and collect data. It’s similar to DTrace but for Linux.

Why is it needed?

  • Traditional monitoring tools often lack deep kernel visibility.
  • eBPF allows low-overhead, dynamic tracing without rebooting or patching the kernel.
  • Useful for:
    • Performance analysis (CPU, I/O, latency)
    • Debugging production issues
    • Security auditing

When is it applied?

  • When you need real-time insights into kernel or application behavior.
  • Examples:
    • Trace system calls (openat, read, write)
    • Monitor network packets
    • Profile application performance without intrusive instrumentation

Who can use this feature?

  • System administrators: For troubleshooting and performance tuning.
  • Kernel developers: For debugging kernel internals.
  • SRE/DevOps engineers: For observability in production.
  • Security teams: For detecting anomalies and enforcing policies.
------------------------- Examples of bpftrace commands ------------
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s\n", comm); }'


That means:
  • bpftrace -e '...': Run an inline bpftrace program given in quotes.
  • tracepoint:syscalls:sys_enter_openat: Attach a probe to the kernel tracepoint that fires whenever a process calls the openat() system call (used to open files).
  • { printf("%s\n", comm); }: The action block. For every event, print the process name (comm) that triggered the syscall.

Every time any process calls openat(), bpftrace prints the name of that process.
This is useful for observing which processes are opening files in real time. It leverages Linux tracepoints, which are stable kernel instrumentation points, and uses bpftrace’s built-in variable comm (the current process name).

# bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s\n", comm); }'
Attached 1 probe
irqbalance
gssproxy
rmcd
bash
sshd
systemd
====> It means those processes invoked openat() during tracing.
--------------------------------
NOTE:
  • Built-in variable comm: In bpftrace, comm is automatically populated with the command name of the current task (the process executing when the probe fires).
  • Execution flow: The action block { printf("%s\n", comm); } runs for every event. At that instant, the kernel context is the process making the syscall, so comm reflects that process name.
  • =======Examples========

    bpftrace -l '*sleep*'
        list probes containing "sleep"

    # bpftrace -l '*sleep*'
    fentry:cls_flower:fl_destroy_sleepable
    fentry:vmlinux:wq_worker_sleeping
    fentry:vmlinux:zpool_can_sleep_mapped
    kprobe:__bpf_prog_array_free_sleepable_cb
    kprobe:__probestub_mm_compaction_kcompactd_sleep
    kprobe:__probestub_mm_vmscan_kswapd_sleep
    rawtracepoint:sunrpc:rpc_task_sleep
    rawtracepoint:sunrpc:rpc_task_sync_sleep
    tracepoint:syscalls:sys_exit_clock_nanosleep
    tracepoint:syscalls:sys_exit_nanosleep
    tracepoint:vmscan:mm_vmscan_kswapd_sleep
    ======
    bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
        trace processes calling sleep

    # bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    Attached 1 probe
    PID 846 sleeping...


    ===============
    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
        count syscalls by process name
    # bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
    Attached 1 probe
    ^C

    @[gssproxy]: 4
    @[gmain]: 10
    @[IBM.MgmtDomainR]: 10
    @[auditd]: 17
    @[systemd-userwor]: 27
    @[rmcd]: 40
    @[in:imjournal]: 48
    @[irqbalance]: 56
    @[bash]: 67
    @[multipathd]: 134
    @[bpftrace]: 223
    @[vi]: 564
    @[sshd-session]: 2542

    =============================

    Conclusion 

    Understanding probe types and BTF is essential for effective bpftrace usage. When BTF is missing or incomplete, fallback strategies like raw tracepoints or kernel source paths ensure smooth tracing. These insights help troubleshoot errors and write efficient tracing scripts.