RAG Architecture on AWS – Bedrock Knowledge Bases Guide

Table of Contents

What is RAG (Retrieval-Augmented Generation)?

Amazon Bedrock Knowledge Bases — Managed RAG

Chunking Strategies Explained

Advanced RAG Techniques on AWS

RAG vs Fine-tuning vs Prompt Engineering

Building RAG — Step by Step

Cost Optimization

AWS Certification Exam Practice Questions

Frequently Asked Questions

What is RAG (Retrieval-Augmented Generation)?

RAG is a technique that enhances Large Language Model (LLM) responses by retrieving relevant information from external data sources before generating an answer. Instead of relying solely on the model’s training data, RAG grounds responses in your actual documents, significantly reducing hallucinations and providing up-to-date, verifiable answers.

RAG solves three critical LLM limitations:

Knowledge cutoff: LLMs only know what they were trained on. RAG gives the model access to data that is as current as your last sync.
Hallucinations: Without grounding, LLMs may generate plausible but incorrect information. RAG ties answers to retrieved sources that can be cited.
Domain specificity: General models lack your proprietary business knowledge. RAG connects them to your data.

RAG Architecture: End-to-End Flow

Ingestion Pipeline (Offline)
Documents (S3, Web, Confluence) → Chunking (Fixed/Semantic/Hierarchical) → Embedding (Titan/Cohere) → Vector Store (OpenSearch/Aurora/Pinecone)

Query Pipeline (Runtime)
User Query → Embed Query → Vector Search (Top-K similar) → Context + Query → Prompt → FM Response (with citations)
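
The two pipelines can be sketched end to end in a few lines of Python. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model such as Titan, the in-memory list stands in for a vector store, and all function names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks):
    """Ingestion pipeline: embed each chunk and store (vector, text) pairs."""
    return [(embed(c), c) for c in chunks]

def top_k(index, query, k=2):
    """Query pipeline: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query, context_chunks):
    """Combine retrieved context and the user query into a single prompt."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = [
    "refunds are processed within five business days",
    "the office is closed on public holidays",
    "refunds require the original receipt",
]
index = build_index(chunks)
hits = top_k(index, "how are refunds processed", k=2)
prompt = build_prompt("how are refunds processed", hits)
```

The resulting `prompt` is what would be sent to the foundation model; the numbered context lines are what make citations possible.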

Amazon Bedrock Knowledge Bases — Managed RAG

Amazon Bedrock Knowledge Bases provides fully managed RAG that handles the entire pipeline automatically — ingestion, chunking, embedding, storage, retrieval, and augmented generation. You provide data sources and choose a model; Bedrock handles everything else.

Key Components

Data Sources: S3, Confluence, SharePoint, Salesforce, Web Crawler, or Custom via Lambda connector
Chunking Strategies: Fixed-size, semantic, hierarchical, or no chunking (for pre-processed data)
Parsing: Standard text extraction or Foundation Model parsing (uses Claude to interpret complex layouts, tables, images)
Embedding Models: Amazon Titan Embeddings V2, Cohere Embed, or bring your own
Vector Stores: Amazon OpenSearch Serverless (default), Aurora PostgreSQL, Pinecone, Redis Enterprise, MongoDB Atlas
Foundation Model: Any Bedrock FM for generation (Claude, Nova, Llama, Mistral)

Chunking Strategies Explained

Strategy	How It Works	Best For
Fixed-size	Split at fixed token count (e.g., 512 tokens) with configurable overlap	Simple documents, uniform content
Semantic	Uses embedding similarity to detect natural topic boundaries	Documents with distinct sections/topics
Hierarchical	Creates parent (larger) and child (smaller) chunks; retrieves child, returns parent for context	Long documents where context around a match matters
No chunking	Treats each file as a single chunk	Pre-processed data, short documents, FAQs
FM Parsing	Uses a foundation model to interpret document layout before chunking	Complex documents with tables, charts, images
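
To make the fixed-size and hierarchical rows concrete, here is a minimal sketch of both strategies. It assumes whitespace-separated words as "tokens" (a managed service counts model tokens instead), and the function names are hypothetical, not part of any AWS SDK.

```python
def fixed_size_chunks(tokens, max_tokens, overlap):
    """Split a token list into windows of max_tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

def hierarchical_chunks(tokens, parent_tokens, child_tokens):
    """Return (parents, children); each child is (parent_index, child_token_list)."""
    parents = fixed_size_chunks(tokens, parent_tokens, 0)
    children = []
    for p_idx, parent in enumerate(parents):
        for child in fixed_size_chunks(parent, child_tokens, 0):
            children.append((p_idx, child))
    return parents, children

tokens = "a b c d e f g h i j".split()
flat = fixed_size_chunks(tokens, max_tokens=4, overlap=1)
parents, children = hierarchical_chunks(tokens, parent_tokens=6, child_tokens=3)
```

In the hierarchical case, search would run over the small child chunks, and the stored parent index lets the retriever hand the larger parent chunk to the model for context.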

Advanced RAG Techniques on AWS

Metadata Filtering

Attach metadata to documents (department, date, product, access level) and filter at query time to narrow the search space. This improves relevance and enables access control.
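
A query-time filter might be assembled as below. The nesting (`andAll`, `equals`, `lessThanOrEquals`, `vectorSearchConfiguration`) follows the retrieval filter structure of the Bedrock Retrieve API as I understand it; the metadata keys `department` and `clearance_level` and the helper names are hypothetical, so verify the shape against the current API reference before relying on it.

```python
def access_filter(department, max_clearance):
    """Match chunks from one department whose clearance level the user may see."""
    return {
        "andAll": [
            {"equals": {"key": "department", "value": department}},
            {"lessThanOrEquals": {"key": "clearance_level", "value": max_clearance}},
        ]
    }

def retrieval_config(metadata_filter, top_k=5):
    """Wrap a filter in the retrieval configuration passed with a query."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "filter": metadata_filter,
        }
    }

config = retrieval_config(access_filter("legal", 2), top_k=3)
```

Because the filter is built per request from the caller's attributes, the same knowledge base can serve users with different permissions without duplicating data.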

Hybrid Search

Combine vector similarity search (semantic) with keyword search (lexical) for better recall. Bedrock Knowledge Bases supports hybrid search on compatible vector stores, and the search type (hybrid or semantic-only) can be chosen per query.
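
One common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that idea, not a description of how Bedrock combines results internally; the document IDs and the constant `k=60` are assumptions chosen for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks near the top.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]  # order returned by vector search
keyword = ["doc_c", "doc_a", "doc_d"]   # order returned by keyword search
fused = reciprocal_rank_fusion([semantic, keyword])
```

Documents that appear in both lists rise to the top, while documents found by only one method are still kept, which is where the recall gain comes from.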

Query Decomposition

For complex multi-part questions, Bedrock can decompose the query into sub-queries, retrieve relevant chunks for each, and synthesize a comprehensive answer.
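
As a rough sketch, query decomposition is switched on through the orchestration settings of a RetrieveAndGenerate request. The field names below reflect my understanding of that API's request shape; the helper name and the placeholder IDs are hypothetical.

```python
def kb_generation_config(kb_id, model_arn, decompose=False):
    """Build a retrieveAndGenerateConfiguration dict, optionally with query decomposition."""
    kb_config = {"knowledgeBaseId": kb_id, "modelArn": model_arn}
    if decompose:
        kb_config["orchestrationConfiguration"] = {
            "queryTransformationConfiguration": {"type": "QUERY_DECOMPOSITION"}
        }
    return {"type": "KNOWLEDGE_BASE", "knowledgeBaseConfiguration": kb_config}

# Placeholder identifiers for illustration only.
config = kb_generation_config("KB_ID_PLACEHOLDER", "MODEL_ARN_PLACEHOLDER", decompose=True)
```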

Reranking

After initial retrieval, a reranker model (e.g., Cohere Rerank or Amazon Rerank) scores and reorders results by relevance. This improves precision by filtering out semantically similar but contextually irrelevant chunks.
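
The retrieve-broadly-then-rerank pattern looks roughly like this. The `term_overlap` scorer is a deliberately crude stand-in for a real reranker model such as Cohere Rerank, and both function names are hypothetical.

```python
def rerank(query, candidates, score_fn, top_n):
    """Score every candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def term_overlap(query, chunk):
    """Toy relevance score: fraction of query terms that appear in the chunk."""
    q_terms = set(query.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & set(chunk.lower().split())) / len(q_terms)

# Pretend these came back from a broad vector search (high Top-K).
candidates = [
    "python snake habitat and diet",
    "install python packages with pip",
    "pip install fails behind a proxy",
]
best = rerank("pip install python", candidates, term_overlap, top_n=2)
```

The first candidate is semantically close to "python" but irrelevant to the question, and it is the one the reranking step drops.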

Guardrails Integration

Apply Bedrock Guardrails to RAG responses for content filtering, PII masking, and contextual grounding checks — which verify that the response is actually supported by the retrieved source documents.
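
A contextual grounding policy is essentially two thresholds: one for how well the response is supported by the sources (grounding) and one for how well it addresses the query (relevance). The sketch below builds such a policy as a plain dict. The `filtersConfig` shape matches the guardrail creation API as I understand it, while the helper name, the default of 0.75, and the accepted range are assumptions to check against the current documentation.

```python
def grounding_policy(grounding_threshold=0.75, relevance_threshold=0.75):
    """Build a contextual grounding policy; responses scoring below a threshold are blocked."""
    for value in (grounding_threshold, relevance_threshold):
        if not 0.0 <= value < 1.0:
            raise ValueError("thresholds are assumed to lie in [0, 1)")
    return {
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": grounding_threshold},
            {"type": "RELEVANCE", "threshold": relevance_threshold},
        ]
    }

policy = grounding_policy(grounding_threshold=0.8)
```

With boto3, a dict like this would typically be passed as the `contextualGroundingPolicyConfig` argument when creating the guardrail. Higher thresholds block more responses, so they trade answer coverage for faithfulness.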

RAG vs Fine-tuning vs Prompt Engineering

Approach	When to Use	Pros	Cons
RAG	Ground answers in specific documents, real-time data	No training needed, data stays current, citable sources	Retrieval quality depends on chunking, adds latency
Fine-tuning	Teach model a specific style, domain vocabulary, or task format	Better task-specific performance, lower inference cost	Requires training data, expensive, can become stale
Prompt Engineering	Guide model behavior with instructions and examples	No training, instant iteration, works with any model	Limited by context window, no persistent knowledge

Best practice: Start with prompt engineering, add RAG when you need domain-specific grounding, and fine-tune only when you need a specific output format or style that prompting can’t achieve.

Building RAG — Step by Step

1. Prepare data: Upload documents to S3 (PDF, HTML, TXT, DOCX, CSV, MD, XLS)
2. Create Knowledge Base: Choose embedding model, vector store, and chunking strategy
3. Sync data source: Bedrock ingests, chunks, embeds, and stores vectors
4. Test with Retrieve API: Verify relevance of retrieved chunks before full RAG
5. Enable generation: Connect a foundation model for the RetrieveAndGenerate API
6. Add Guardrails: Apply contextual grounding checks to prevent hallucinations
7. Integrate with Agent: Optionally connect to a Bedrock Agent for multi-step workflows
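
The sync, retrieve, and generate steps map onto three API calls. The sketch below takes the clients as parameters so it can be exercised without AWS credentials; the helper names are hypothetical, and the request and response shapes reflect my understanding of the boto3 `bedrock-agent` and `bedrock-agent-runtime` clients, so confirm them against the SDK reference.

```python
def sync_data_source(agent_client, kb_id, data_source_id):
    """Start an ingestion job for a data source and return its job ID."""
    resp = agent_client.start_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=data_source_id
    )
    return resp["ingestionJob"]["ingestionJobId"]

def retrieve_chunks(runtime_client, kb_id, question, top_k=5):
    """Fetch chunks only, to inspect relevance before adding generation."""
    resp = runtime_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [(r.get("score"), r["content"]["text"]) for r in resp["retrievalResults"]]

def generate_answer(runtime_client, kb_id, model_arn, question):
    """Retrieve and generate in one call; returns the answer text."""
    resp = runtime_client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    )
    return resp["output"]["text"]
```

In a real environment the clients would come from `boto3.client("bedrock-agent")` for the sync call and `boto3.client("bedrock-agent-runtime")` for the other two. Printing the scored chunks from `retrieve_chunks` is a quick way to judge chunking and Top-K settings before paying for generation.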

Cost Optimization

Embedding: Titan Embeddings V2 is ~$0.00002/1K tokens, paid once when documents are ingested and again for each query that is embedded
Vector Store: OpenSearch Serverless starts at ~$0.24/hr per OCU pair (consider Aurora PostgreSQL pgvector for lower cost at scale)
Generation: Depends on FM choice (Claude Haiku/Nova Micro are cheapest for RAG)

Tip: Use metadata filtering to reduce the number of chunks retrieved, lowering both retrieval cost and FM input token cost.
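
A back-of-the-envelope estimator helps show which component dominates. The embedding and OCU rates below are the approximate figures quoted above; the per-token generation prices in the example call are made-up placeholders, so substitute the current price of your chosen model.

```python
TITAN_V2_PER_1K_TOKENS = 0.00002  # approximate rate quoted in this guide
OCU_PAIR_PER_HOUR = 0.24          # approximate rate quoted in this guide

def embedding_cost(tokens):
    """Cost of embedding a given number of tokens (ingestion or queries)."""
    return tokens / 1000 * TITAN_V2_PER_1K_TOKENS

def vector_store_cost(hours, ocu_pairs=1):
    """Cost of keeping the vector store running for a number of hours."""
    return hours * ocu_pairs * OCU_PAIR_PER_HOUR

def generation_cost(queries, input_tokens, output_tokens,
                    input_price_per_1k, output_price_per_1k):
    """FM cost for a number of queries with average token counts per query."""
    per_query = (input_tokens / 1000 * input_price_per_1k
                 + output_tokens / 1000 * output_price_per_1k)
    return queries * per_query

ingest = embedding_cost(50_000_000)   # embed a 50M-token corpus once
storage = vector_store_cost(730)      # roughly one month of hours
# Hypothetical model prices: $0.001 per 1K input, $0.005 per 1K output tokens.
generation = generation_cost(10_000, 2_000, 300, 0.001, 0.005)
```

Because retrieved chunks are billed as FM input tokens on every query, halving the average `input_tokens` cuts the input share of generation cost in half, which is why trimming retrieved context pays off.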

AWS Certification Exam Practice Questions

Question 1:

A company’s RAG system retrieves relevant document chunks but the FM sometimes generates answers that contradict the retrieved information. Which Bedrock feature specifically addresses this?

A. Content filters set to HIGH
B. Contextual grounding check in Guardrails
C. Automated Reasoning checks
D. Denied topics configuration

Answer: B – Contextual grounding checks verify that the FM’s response is faithful to and supported by the retrieved source documents. They detect when the model “hallucinates” information not present in the context. Automated Reasoning uses formal logic for policy-based validation, which is different from source grounding.

Question 2:

A healthcare company has documents containing complex medical tables, embedded diagrams, and multi-column layouts. Standard chunking produces poor-quality chunks that miss table context. Which parsing approach should they use?

A. Fixed-size chunking with 1024 token overlap
B. Foundation Model parsing with a customized extraction prompt
C. Semantic chunking with max tokens set to 2048
D. No chunking with each page as a single chunk

Answer: B – Foundation Model parsing uses an FM (e.g., Claude) to interpret complex document layouts including tables, charts, and multi-column text before chunking. You can customize the extraction prompt to specify how tables should be serialized. This preserves structural information that standard text extraction would lose.

Question 3:

A legal firm needs their RAG system to only return answers from documents the requesting user has permission to access. Different users have access to different case files. How should they implement this?

A. Create separate Knowledge Bases per user
B. Use metadata filtering with user-specific access tags at query time
C. Implement IAM policies on the vector store
D. Use Guardrails to filter responses based on user role

Answer: B – Metadata filtering allows you to tag documents with access control metadata (e.g., case_id, department, clearance_level) during ingestion, then pass user-specific filters at query time. This ensures the vector search only returns chunks from documents the user is authorized to access, without duplicating data.

Question 4:

A company’s RAG system returns accurate but overly long answers because it retrieves too many chunks. They want to improve precision without reducing recall. Which technique helps?

A. Reduce the Top-K parameter from 10 to 3
B. Apply a reranker model after initial retrieval
C. Switch from semantic to fixed-size chunking
D. Increase the embedding model dimensions

Answer: B – A reranker scores and reorders retrieved chunks by contextual relevance, so you can retrieve broadly (high recall) and then keep only the best matches (high precision). Reducing Top-K at retrieval time would cut recall, because relevant chunks ranked below the cutoff never reach the FM. The reranker keeps recall high while eliminating less-relevant results before they reach the FM.

Question 5:

An enterprise wants to implement RAG with their data in Confluence and SharePoint. They need the knowledge base to stay current as documents are updated. What is the MOST operationally efficient approach?

A. Export documents to S3 nightly and sync the Knowledge Base on a schedule
B. Use native Confluence and SharePoint connectors with incremental sync
C. Build a custom Lambda pipeline to poll for changes and update the vector store
D. Use Amazon Kendra with connectors and integrate with Bedrock via API

Answer: B – Bedrock Knowledge Bases has native connectors for Confluence, SharePoint, and other sources. These support incremental sync that only processes changed documents, keeping the knowledge base current without full re-ingestion. This is more operationally efficient than building custom pipelines or exporting to S3.

Frequently Asked Questions

What is RAG in AWS?

RAG (Retrieval-Augmented Generation) on AWS is implemented through Amazon Bedrock Knowledge Bases. It retrieves relevant information from your data sources (S3, Confluence, SharePoint, web) and provides it as context to a foundation model, grounding responses in your actual data and reducing hallucinations.

How much does RAG cost on AWS?

RAG costs have three components: embedding ($0.00002/1K tokens for Titan V2), vector storage (OpenSearch Serverless from $0.24/hr/OCU pair or Aurora PostgreSQL pgvector), and FM generation (varies by model — Claude Haiku and Nova Micro are cheapest). For most workloads, the FM generation cost dominates.

RAG vs Fine-tuning — which should I use?

Use RAG when you need answers grounded in specific documents that change over time. Use fine-tuning when you need to change the model’s behavior, output format, or domain vocabulary. They can be combined: fine-tune for style, RAG for knowledge.

How do I prevent hallucinations in RAG?

Enable Bedrock Guardrails contextual grounding checks, which verify that the FM’s response is supported by the retrieved source chunks. Also: use higher Top-K for broader retrieval, add reranking for precision, and use hierarchical chunking to provide more context around matches.