Claude (Anthropic) in VS Code primarily uses a large context window plus agentic code exploration, not classic RAG, by default. RAG can be added, but it is optional and external.
Claude in VS Code analyzes code using large context windows and active file exploration, because code benefits more from precise, agent‑driven inspection than from passive RAG retrieval.
1. Large context window as the primary mechanism
Claude Code relies on very large context windows to analyze code. The VS Code extension automatically provides Claude with:
- Your currently open file
- Selected text ranges
- Files you explicitly reference using @file or line ranges
- Project memory files like CLAUDE.md
This behavior is documented in the official Claude Code VS Code docs, which describe direct file visibility and context passing rather than retrieval pipelines.
Claude models are explicitly designed to support hundreds of thousands of tokens, which makes “read the code directly” feasible without a retrieval layer.
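As a rough feasibility check, you can estimate whether a repository fits in a given context window. The sketch below is illustrative (the function name and the ~4 characters‑per‑token heuristic are assumptions, not an Anthropic API): 

```python
import os

def estimate_repo_tokens(root, exts=(".py", ".md"), chars_per_token=4):
    """Rough token estimate for a source tree, using the common
    ~4 characters-per-token rule of thumb (approximate, not exact)."""
    total_chars = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    pass  # skip unreadable files
    return total_chars // chars_per_token

# A 200K-token window holds roughly 800K characters of source text.
```

By this estimate, many small and medium repositories fit directly in a modern context window, which is what makes "read the code directly" practical.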
2. Agentic search instead of passive RAG
Rather than pre‑indexing your repo into a vector database (classic RAG), Claude Code acts as an agent that:
- Searches files
- Reads only relevant sections
- Iteratively explores the codebase
This design choice is highlighted in community and practitioner analyses describing Claude Code as active investigation instead of “dump everything into context” RAG.
Examples of agentic behavior include:
- Grep‑like searches
- Targeted file reads
- Incremental context building
This is fundamentally different from traditional RAG, which retrieves chunks blindly based on similarity.
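The grep‑then‑read pattern can be sketched in a few lines. This is a minimal illustration of the idea, not Claude Code's actual tooling; all function names here are hypothetical:

```python
import os
import re

def grep(root, pattern, exts=(".py",)):
    """Grep-like search: return (path, line_no, line) matches across a tree."""
    hits = []
    rx = re.compile(pattern)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                for i, line in enumerate(f, 1):
                    if rx.search(line):
                        hits.append((path, i, line.rstrip()))
    return hits

def read_window(path, line_no, radius=5):
    """Targeted read: only the lines around a match, not the whole file."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        lines = f.readlines()
    lo, hi = max(0, line_no - 1 - radius), line_no + radius
    return "".join(lines[lo:hi])

def build_context(root, symbol):
    """Incremental context building: search first, then read only what matched."""
    context = []
    for path, line_no, _ in grep(root, rf"\b{re.escape(symbol)}\b"):
        context.append(f"# {path}:{line_no}\n{read_window(path, line_no)}")
    return "\n".join(context)
```

The key property is that context grows only where the search found evidence, instead of retrieving chunks by embedding similarity up front.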
3. Why Anthropic chose this approach
Reason 1: Code is structured, not fuzzy text
Source code has:
- Strong syntax
- Explicit dependencies
- Precise identifiers
Anthropic’s approach assumes it is better to search deterministically (file names, symbols, call paths) than rely on embedding similarity alone.
Reason 2: Large context windows reduce RAG overhead
With large context windows:
- Claude can read entire files when needed
- No chunking or embedding errors
- No stale indexes after code changes
This is reinforced by the existence of tooling that tracks context window usage, showing that Claude is designed to operate close to context limits rather than avoid them.
Reason 3: RAG is optional, not built‑in
RAG for Claude Code exists as external or community tools, not as a default feature. For example:
- DevRAG and MCP‑based tools add vector search to Claude Code
- These are explicitly framed as token‑saving optimizations, not core architecture.
This strongly implies that Anthropic does not consider RAG mandatory for code understanding.
Summary comparison
| Aspect | Claude Code (VS Code) |
|---|---|
| Default approach | Large context window + agentic exploration |
| Classic RAG | ❌ Not default |
| Vector DB indexing | ❌ Optional / external |
| File access | Direct, on demand |
| Context control | Explicit and visible |
| Best at | Deep, precise code reasoning |
1. What RAG is used for
RAG is a technique that combines:
- Large Language Models
- External data retrieval mechanisms such as vector databases, semantic search, and embeddings
This allows models to answer questions using external, up‑to‑date, and trusted data, rather than relying only on their training data.
2. RAG vs large context windows
A common question is whether massive context windows are replacing the need for RAG. While long context windows allow more data to be passed directly into prompts, they do not automatically eliminate the need for structured retrieval approaches like RAG.
3. Choosing the right approach
Rather than declaring RAG obsolete:
- The decision depends on the application use case
- Factors like data freshness, trustworthiness, and AI workflow design matter
- RAG remains relevant in many enterprise scenarios
RAG is not “dead”; it is one of several viable approaches, and the right choice depends on your data, accuracy needs, and LLM workflow design.
Retrieval Augmented Generation vs Large Context Windows
Use cases and advantages
Modern LLM systems typically choose between two ways of giving models the information they need: Retrieval Augmented Generation (RAG) or large context windows. Both solve different problems and are often misunderstood as competing approaches. In practice, they are complementary.
What RAG is good at
Core idea
RAG augments an LLM with external knowledge retrieval at inference time. Instead of relying only on what the model remembers from training, the system fetches relevant documents from databases, wikis, PDFs, or logs and injects them into the prompt before generation.
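The fetch‑and‑inject loop can be sketched as follows. Real systems use embedding models and vector databases; here, plain word overlap stands in for semantic similarity, and all names are illustrative:

```python
def score(query, doc):
    """Toy relevance score: word overlap stands in for embedding similarity."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query, corpus, k=2):
    """Fetch the k most relevant documents at inference time."""
    ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus):
    """Inject retrieved documents into the prompt before generation."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Refunds are processed within 14 days of a return.",
    "The API rate limit is 100 requests per minute.",
    "Support hours are 9am to 5pm on weekdays.",
]
prompt = build_prompt("What is the API rate limit?", corpus)
```

The model then generates an answer grounded in the retrieved snippets rather than in its training memory alone.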
RAG use cases
1. Enterprise knowledge assistants
RAG is ideal when answers must come from proprietary or fast‑changing data, such as:
- Internal wikis
- Product documentation
- Support runbooks
- Compliance and policy documents
The model retrieves the most relevant documents and generates grounded answers, reducing hallucinations.
2. Regulated and audit‑heavy environments
RAG supports traceability by attaching responses to source documents. This is critical in:
- Legal research
- Healthcare decision support
- Financial compliance systems
Many RAG systems explicitly return citations or document references.
3. Dynamic and real‑time information
LLMs are static after training. RAG solves this by pulling:
- Latest regulations
- Updated pricing
- Live operational data
This is why RAG is widely used in customer support, finance, and industrial operations.
Advantages of RAG
Up‑to‑date knowledge
The model can access information created after training without retraining.
Reduced hallucinations
Responses are grounded in retrieved documents rather than model memory alone.
Enterprise data isolation
Sensitive internal data stays in your retrieval layer and does not become part of model training.
Scales beyond context limits
You do not need to fit all documents into the context window at once.
What large context windows are good at
Core idea
A large context window allows the model to see and reason over massive inputs directly, sometimes hundreds of thousands or even millions of tokens at once.
Instead of retrieving small chunks, you load large sections or entire artifacts into the prompt.
Large context window use cases
1. Large codebase understanding
Large context models excel at:
- Reading entire modules or repositories
- Understanding cross‑file dependencies
- Refactoring with global awareness
This is especially valuable for code analysis where structure and relationships matter more than fuzzy retrieval.
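The context‑first alternative to retrieval is simply to pack whole files into one prompt. A minimal sketch, with an assumed character budget and illustrative names:

```python
import os

def pack_repo_into_prompt(root, exts=(".py",), max_chars=400_000):
    """Context-first approach: concatenate whole files into a single prompt,
    each preceded by a path header, up to a character budget."""
    parts, used = [], 0
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                body = f.read()
            block = f"### FILE: {path}\n{body}\n"
            if used + len(block) > max_chars:
                return "".join(parts)  # budget reached, stop packing
            parts.append(block)
            used += len(block)
    return "".join(parts)
```

Because every file arrives intact with its path, the model can follow cross‑file references directly instead of hoping the right chunks were retrieved.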
2. Deep document analysis
Large context windows enable:
- End‑to‑end reading of specifications
- Full contract analysis
- Research paper or RFC comprehension in one pass
This avoids chunking errors introduced by RAG pipelines.
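A small example makes the chunking failure mode concrete: with naive fixed‑size chunking, a key clause can be split so that no single chunk contains it, and similarity retrieval over chunks can miss it entirely. The text and chunk size here are invented for illustration:

```python
def chunk(text, size=40):
    """Naive fixed-size chunking, as used by simple RAG pipelines."""
    return [text[i:i + size] for i in range(0, len(text), size)]

contract = ("The contract terminates automatically if "
            "payment is late by more than 30 days.")
clause = "terminates automatically if payment is late"

chunks = chunk(contract)
# The clause exists in the full text but is split across chunk boundaries,
# so it appears whole in no individual chunk.
in_full = clause in contract
in_any_chunk = any(clause in c for c in chunks)
```

Reading the whole document in one context window sidesteps this class of error entirely.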
3. Long‑running reasoning and agent workflows
With large context:
- Multi‑step reasoning stays coherent
- The model remembers earlier constraints
- No repeated retrieval calls are required
This is why agentic coding tools often prefer large context over classic RAG.
Advantages of large context windows
Holistic reasoning
The model sees the entire artifact, enabling better global understanding.
No retrieval errors
There is no risk of missing relevant chunks due to poor embeddings or ranking.
Simpler architecture
No vector database, no indexing, no retriever tuning required.
Better for structured data
Code, configs, and logs benefit more from direct inspection than semantic similarity search.
RAG vs large context window
| Aspect | RAG | Large context window |
|---|---|---|
| Best for | Enterprise knowledge, docs, policies | Code, specs, deep analysis |
| Handles fresh data | Yes | No |
| Needs external systems | Yes | No |
| Risk of missing info | Possible | Low |
| Cost model | Retrieval + inference | Token-heavy inference |
| Architecture complexity | Higher | Lower |
- Use RAG when correctness, freshness, and traceability matter.
- Use large context windows when deep reasoning over structured artifacts like code is required.
-------------------------------------------
- Claude (Anthropic) supports up to 1 million tokens of context window in its latest generally available models.
- IBM watsonx Granite models support a 128K token context window across the Granite 3.1 and newer Granite 3.x families.
Anthropic has expanded Claude’s context window significantly:
- Claude Opus 4.6 and Claude Sonnet 4.6
- Maximum context window: 1,000,000 tokens
- This is generally available with no long‑context pricing premium
- Applies to Claude Code, API usage, and supported cloud platforms
This is confirmed in Anthropic’s official documentation and announcements.
- Entire large codebases or monorepos can fit in a single session
- Long‑running agentic workflows without frequent context compaction
- Strong fit for context‑first code analysis over RAG
IBM has standardized the context length across the Granite family:
- Granite 3.1 and Granite 3.3 models
- Context window: 128,000 tokens
- Applies to:
- Granite 3.1 8B Instruct
- Granite 3.1 2B
- Granite 3.3 8B Instruct
- Granite Guardian models
- Granite Code models
- Available in IBM watsonx.ai and open‑source releases
IBM explicitly states that all Granite 3.1 language models feature a 128K token context length.
What this means in practice
- Suitable for long documents, enterprise policies, and medium‑sized repositories
- Optimized for enterprise RAG pipelines
- Strong balance between cost, performance, and governance
| Model family | Max context window |
|---|---|
| Claude Opus 4.6 | 1,000,000 tokens |
| Claude Sonnet 4.6 | 1,000,000 tokens |
| IBM Granite 3.1 (all variants) | 128,000 tokens |
| IBM Granite 3.3 8B Instruct | 128,000 tokens |
Claude
- Optimized for context‑first and agentic workflows
- Designed to analyze entire artifacts directly
- Reduces dependency on RAG for code and reasoning tasks
Granite
- Optimized for enterprise AI with governance
- Designed to work with RAG and retrieval layers
- Prioritizes cost control, explainability, and compliance
IBM explicitly positions Granite alongside RAG‑first architectures, including embedding models and document preprocessing frameworks like Docling.
- Claude enables large‑context‑first code analysis, which explains its architectural preference over RAG.
- Granite intentionally caps context at 128K, encouraging retrieval‑based grounding for enterprise workloads.
In Granite models, “B” means billions of learned parameters. It measures model capacity, not context length or training data size. In model names like:
- Granite 3.1 8B
- Granite 3.1 2B
- Granite Guardian 3.1 8B
- Granite Code 3B
- Granite 3.1 3B‑A800M (MoE)
the “B” stands for Billion parameters.
1B = 1 billion parameters
A parameter is a learned numerical weight inside the neural network that stores knowledge acquired during training.
For example, IBM explicitly states that:
- Granite‑3.0‑8B‑Instruct is an 8‑billion‑parameter model
Think of parameters as:
- The knobs inside the model
- Each knob controls how strongly the model connects concepts
- Training adjusts billions of these knobs to encode language, code, and reasoning patterns
More parameters generally mean:
- Higher reasoning capacity
- Better pattern recognition
- Better generalization
but also:
- Higher memory usage
- Higher compute cost
This definition is consistent across Granite, Llama, Claude, GPT, and other LLM families.
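Parameter count translates directly into a memory footprint. A back‑of‑the‑envelope sketch (the function is illustrative; it covers weights only, not activations or KV cache):

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate memory to hold just the weights.
    bytes_per_param: 2 for fp16/bf16, 4 for fp32, 1 for int8."""
    # params_billions * 1e9 params * bytes each, divided by 1e9 bytes per GB
    return params_billions * bytes_per_param

# Granite 3.1 8B in fp16 needs roughly 16 GB for weights alone.
```

This is why smaller parameter counts mean cheaper, easier deployment, independent of how large the context window is.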
Dense models
These are dense transformer models, meaning all parameters are active for every token processed.
IBM confirms Granite 8B models are 8‑billion‑parameter dense decoder‑only transformers. [ibm.com]
| Model name | Meaning |
|---|---|
| Granite 3.1 2B | ~2 billion parameters |
| Granite 3.1 8B | ~8 billion parameters |
| Granite Guardian 3.1 8B | ~8 billion parameters |
| Granite Guardian 3.1 2B | ~2 billion parameters |
What about MoE models like 3B‑A800M?
Example: Granite 3.1 3B‑A800M. This is a Mixture‑of‑Experts (MoE) model.
Meaning
- 3B = total parameters in the model
- A800M = approximately 800 million active parameters per token
IBM documents that Granite MoE models activate only a subset of experts per inference step, reducing compute cost while maintaining capacity.
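The routing idea behind MoE can be sketched in a few lines. This is a generic top‑k router illustration, not Granite's actual architecture; the logits and names are invented:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, k=2):
    """Pick the top-k experts for one token; only those experts run."""
    probs = softmax(router_logits)
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# 8 experts total, but only k=2 are active for this token, so per-token
# compute scales with active parameters, not total parameters.
experts_used = route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

This is exactly the "3B total, ~800M active" tradeoff: capacity of the full expert pool, compute cost of only the selected experts.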
Why this matters
- MoE models scale capacity without linear cost
- You get large model intelligence with lower inference overhead
Why all Granite models still have 128K context
Parameter count (B) and context window size are independent dimensions.
IBM explicitly states:
- All Granite 3.1 dense, MoE, and Guardian models support a 128K token context window
That means:
- 2B vs 8B affects model intelligence
- 128K affects how much text the model can read at once
| Parameter size | What it impacts |
|---|---|
| More parameters | Better reasoning, coding, abstraction |
| Fewer parameters | Faster, cheaper, easier to deploy |
| MoE architecture | Better scaling efficiency |
| Context window | How much data can be processed per request |
Proprietary frontier models
These dominate enterprise copilots, coding assistants, and research tools.
OpenAI GPT series
- Examples: GPT‑5.x
- Known for strong reasoning, math, and general intelligence
- Widely used in ChatGPT and enterprise APIs
Anthropic Claude
- Examples: Claude Opus 4.6, Sonnet 4.6
- Strong in code analysis, safety, and long‑context reasoning
- Notable for very large context windows
Google Gemini
- Examples: Gemini 3 Pro, Gemini Flash
- Multimodal first design with text, image, audio, and video
- Tight integration with Google Workspace and Vertex AI
xAI Grok
- Examples: Grok 4
- Optimized for real‑time and social data analysis
- Integrated with the X platform
Enterprise and governance‑focused models
These are optimized for regulated industries and private deployments.
IBM watsonx Granite
- Examples: Granite 3.1 8B, Granite Guardian
- Enterprise‑grade governance and RAG‑first architecture
- Open models with Apache 2.0 licensing
Amazon Nova
- Examples: Nova Premier
- Designed for scalable enterprise workloads on AWS
- Integrated with Bedrock and AWS tooling
Open and open‑weight frontier models
Popular for self‑hosting, cost control, and customization.
Meta Llama
- Examples: Llama 4 Scout, Llama 4 Maverick
- Widely adopted open‑weight models
- Strong ecosystem and tooling support
Mistral
- Examples: Mistral Large, Mixtral
- Efficient architectures and strong reasoning
- Apache‑licensed options for enterprise use
DeepSeek
- Examples: DeepSeek V3, DeepSeek R1
- High‑performance open models competitive with proprietary LLMs
- Popular for reasoning and coding tasks
Qwen
- Examples: Qwen 3, Qwen 3.5
- Strong multilingual and long‑context capabilities
- Increasing adoption in open‑source deployments
Lightweight and edge‑focused models
Used where latency, cost, or on‑device inference matters.
Microsoft Phi
- Examples: Phi‑3, Phi‑4
- Small, efficient models for constrained environments
- Often embedded in tools and workflows
Gemma
- Examples: Gemma 2, Gemma 3
- Google‑released open models
- Designed for research and local inference
Today’s LLM ecosystem spans proprietary frontier models like GPT, Claude, and Gemini, enterprise‑focused platforms such as IBM Granite, and open or open‑weight models like Llama, Mistral, and DeepSeek. Each family makes different tradeoffs across reasoning quality, context window size, governance, cost, and deployability, which is why modern AI systems increasingly adopt multi‑model strategies instead of relying on a single LLM.