RAG Architecture on AWS – Bedrock Knowledge Bases Guide

What is RAG (Retrieval-Augmented Generation)?

RAG is a technique that enhances Large Language Model (LLM) responses by retrieving relevant information from external data sources before generating an answer. Instead of relying solely on the model’s training data, RAG grounds responses in your actual documents, significantly reducing hallucinations and providing up-to-date, verifiable answers.

RAG solves three critical LLM limitations:

  • Knowledge cutoff — LLMs only know what they were trained on. RAG gives them access to data that is as current as your last sync.
  • Hallucinations — Without grounding, LLMs may generate plausible but incorrect information. RAG cites actual sources.
  • Domain specificity — General models lack your proprietary business knowledge. RAG connects them to your data.
RAG Architecture — End-to-End Flow

Ingestion Pipeline (Offline)

  1. Documents (S3, Web, Confluence)
  2. Chunking (Fixed/Semantic/Hierarchical)
  3. Embedding (Titan/Cohere)
  4. Vector Store (OpenSearch/Aurora/Pinecone)

Query Pipeline (Runtime)

  1. User Query
  2. Embed Query
  3. Vector Search (Top-K similar)
  4. Context + Query → Prompt
  5. FM Response (with citations)
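
To make the two pipelines concrete, here is a minimal sketch in plain Python. It is an illustration only: the word-count "embedding" is a toy stand-in for a real embedding model such as Titan or Cohere, the in-memory list stands in for a vector store, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Ingestion (offline): embed each chunk once and store it next to its text."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def top_k(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Query time: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Combine retrieved chunks and the question into one prompt for the FM."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{numbered}\n\nQuestion: {query}"
    )

chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]
index = build_index(chunks)
question = "How many days do refunds take?"
hits = top_k(index, question, k=1)
prompt = build_prompt(question, hits)
```

The resulting prompt is what would be sent to the foundation model. In a managed setup, each of these functions corresponds to a stage that the service runs for you.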

Amazon Bedrock Knowledge Bases — Managed RAG

Amazon Bedrock Knowledge Bases provides fully managed RAG that handles the entire pipeline automatically — ingestion, chunking, embedding, storage, retrieval, and augmented generation. You provide data sources and choose a model; Bedrock handles everything else.

Key Components

  • Data Sources — S3, Confluence, SharePoint, Salesforce, Web Crawler, or a custom source that you populate through direct ingestion
  • Chunking Strategies — Fixed-size, semantic, hierarchical, or no chunking (for pre-processed data)
  • Parsing — Standard text extraction or Foundation Model parsing (uses Claude to interpret complex layouts, tables, images)
  • Embedding Models — Amazon Titan Embeddings V2, Cohere Embed, or bring your own
  • Vector Stores — Amazon OpenSearch Serverless (default), Aurora PostgreSQL, Pinecone, Redis Enterprise, MongoDB Atlas
  • Foundation Model — Any Bedrock FM for generation (Claude, Nova, Llama, Mistral)

Chunking Strategies Explained

Strategy | How It Works | Best For
Fixed-size | Split at a fixed token count (e.g., 512 tokens) with configurable overlap | Simple documents, uniform content
Semantic | Uses embedding similarity to detect natural topic boundaries | Documents with distinct sections/topics
Hierarchical | Creates parent (larger) and child (smaller) chunks; retrieves child, returns parent for context | Long documents where context around a match matters
No chunking | Treats each file as a single chunk | Pre-processed data, short documents, FAQs
FM Parsing | Uses a foundation model to interpret document layout before chunking | Complex documents with tables, charts, images
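
The fixed-size and hierarchical strategies can be sketched in a few lines of Python. This is a simplified illustration, not how Bedrock implements chunking: whitespace-separated words stand in for model tokens, and the function names are hypothetical.

```python
def fixed_size_chunks(tokens: list[str], max_tokens: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of max_tokens, each sharing `overlap` tokens with the previous one."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

def hierarchical_chunks(tokens: list[str], parent_tokens: int, child_tokens: int):
    """Return (parents, children); each child carries the index of its parent chunk."""
    parents = fixed_size_chunks(tokens, parent_tokens, overlap=0)
    children = [
        (child, parent_index)
        for parent_index, parent in enumerate(parents)
        for child in fixed_size_chunks(parent, child_tokens, overlap=0)
    ]
    return parents, children

tokens = [f"t{i}" for i in range(10)]
flat = fixed_size_chunks(tokens, max_tokens=4, overlap=1)
parents, children = hierarchical_chunks(tokens, parent_tokens=6, child_tokens=3)
```

With hierarchical chunking, search runs against the small child chunks, and the parent index is used to hand the larger surrounding chunk to the model.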

Advanced RAG Techniques on AWS

Metadata Filtering

Attach metadata to documents (department, date, product, access level) and filter at query time to narrow the search space. This improves relevance and enables access control.
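
As a rough sketch, a query-time filter can be passed to the Retrieve API through boto3's `bedrock-agent-runtime` client. The metadata keys (`department`, `access_level`), the helper names, and the knowledge base ID are assumptions for illustration; adapt them to the metadata you actually attach at ingestion.

```python
def build_filter(department: str, allowed_levels: list[str]) -> dict:
    """Match chunks from one department whose access level is in the allowed list."""
    return {
        "andAll": [
            {"equals": {"key": "department", "value": department}},
            {"in": {"key": "access_level", "value": allowed_levels}},
        ]
    }

def build_retrieval_config(metadata_filter: dict, top_k: int = 5) -> dict:
    """Wrap the filter in the vector search configuration expected by Retrieve."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "filter": metadata_filter,
        }
    }

def retrieve(kb_id: str, query: str, config: dict) -> list[dict]:
    """Call the Retrieve API (requires AWS credentials and an existing knowledge base)."""
    import boto3  # imported lazily so the builders above run without AWS access

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration=config,
    )
    return response["retrievalResults"]

config = build_retrieval_config(build_filter("legal", ["public", "internal"]), top_k=5)
# results = retrieve("YOUR_KB_ID", "What is the retention policy?", config)  # placeholder ID
```

For access control, the filter values should come from the authenticated user's entitlements on the server side, never from client input.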

Hybrid Search

Combine vector similarity search (semantic) with keyword search (lexical) for better recall. Bedrock Knowledge Bases supports hybrid search on vector stores that offer it, and the Retrieve API lets you choose between hybrid and semantic-only search per query.
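
To show how two rankings can be merged, here is a small sketch of reciprocal rank fusion (RRF), a common fusion technique. It is an illustration of the idea only; it is not a description of how Bedrock combines results internally, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked high in any list rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

semantic = ["doc_a", "doc_b", "doc_c"]  # ranking from vector similarity
keyword = ["doc_c", "doc_a", "doc_d"]   # ranking from lexical matching
fused = reciprocal_rank_fusion([semantic, keyword])
```

In a Retrieve call, the search type is selected with `overrideSearchType` (`"HYBRID"` or `"SEMANTIC"`) inside `vectorSearchConfiguration`; the fusion itself happens on the service side.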

Query Decomposition

For complex multi-part questions, Bedrock can decompose the query into sub-queries, retrieve relevant chunks for each, and synthesize a comprehensive answer.
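
A minimal sketch of turning this on for a RetrieveAndGenerate request is shown below. The helper name is hypothetical, and the knowledge base ID and model ARN are placeholders; the resulting dictionary is what would be passed as `retrieveAndGenerateConfiguration`.

```python
def build_rag_config(kb_id: str, model_arn: str, decompose: bool = True) -> dict:
    """Build a RetrieveAndGenerate configuration, optionally with query decomposition."""
    kb_config = {"knowledgeBaseId": kb_id, "modelArn": model_arn}
    if decompose:
        # Ask the service to split a complex question into sub-queries before retrieval.
        kb_config["orchestrationConfiguration"] = {
            "queryTransformationConfiguration": {"type": "QUERY_DECOMPOSITION"}
        }
    return {"type": "KNOWLEDGE_BASE", "knowledgeBaseConfiguration": kb_config}

rag_config = build_rag_config("YOUR_KB_ID", "YOUR_MODEL_ARN")  # placeholder values
```

Decomposition adds extra retrieval calls, so it is worth enabling only for questions that genuinely span several topics.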

Reranking

After initial retrieval, a reranker model (e.g., Cohere Rerank or Amazon Rerank) scores and reorders results by relevance. This improves precision by filtering out semantically similar but contextually irrelevant chunks.
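
The retrieve-then-rerank pattern can be sketched as follows. The word-overlap scorer is a hypothetical stand-in for a real reranker model, used here only so that the example runs on its own.

```python
from typing import Callable

def overlap_scorer(query: str, chunk: str) -> float:
    """Stand-in for a reranker model: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(chunk.lower().split())) / len(query_words)

def retrieve_then_rerank(
    candidates: list[str],
    query: str,
    scorer: Callable[[str, str], float],
    keep: int = 3,
) -> list[str]:
    """Re-score a broad candidate set and keep only the best `keep` chunks."""
    reranked = sorted(candidates, key=lambda chunk: scorer(query, chunk), reverse=True)
    return reranked[:keep]

# Pretend these came back from a broad vector search (high Top-K).
candidates = [
    "reset your password from the account settings page",
    "password policy requires twelve characters",
    "account billing is handled monthly",
    "settings for notifications live under profile",
]
best = retrieve_then_rerank(candidates, "reset account password", overlap_scorer, keep=2)
```

The first stage stays broad so relevant chunks are not missed, and only the shortened, reordered list is passed to the foundation model.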

Guardrails Integration

Apply Bedrock Guardrails to RAG responses for content filtering, PII masking, and contextual grounding checks — which verify that the response is actually supported by the retrieved source documents.
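
The sketch below shows one way these pieces might fit together with boto3: a contextual grounding policy in the shape accepted by `create_guardrail`, and a RetrieveAndGenerate configuration that attaches an existing guardrail. The threshold values, helper names, and IDs are assumptions for illustration.

```python
def grounding_policy(grounding: float = 0.75, relevance: float = 0.75) -> dict:
    """Contextual grounding policy: responses scoring below a threshold are blocked."""
    return {
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": grounding},   # supported by the sources?
            {"type": "RELEVANCE", "threshold": relevance},   # relevant to the query?
        ]
    }

def build_guarded_config(kb_id: str, model_arn: str, guardrail_id: str, guardrail_version: str) -> dict:
    """RetrieveAndGenerate configuration with a guardrail applied to generation."""
    return {
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            "modelArn": model_arn,
            "generationConfiguration": {
                "guardrailConfiguration": {
                    "guardrailId": guardrail_id,
                    "guardrailVersion": guardrail_version,
                }
            },
        },
    }

def was_blocked(response: dict) -> bool:
    """True if the guardrail intervened on a RetrieveAndGenerate response."""
    return response.get("guardrailAction") == "INTERVENED"

policy = grounding_policy()
guarded = build_guarded_config("YOUR_KB_ID", "YOUR_MODEL_ARN", "YOUR_GUARDRAIL_ID", "1")
```

The policy dictionary would be supplied as `contextualGroundingPolicyConfig` when the guardrail is created; higher thresholds block more aggressively, so they are usually tuned against a sample of real questions.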

RAG vs Fine-tuning vs Prompt Engineering

Approach When to Use Pros Cons
RAG Ground answers in specific documents, real-time data No training needed, data stays current, citable sources Retrieval quality depends on chunking, adds latency
Fine-tuning Teach model a specific style, domain vocabulary, or task format Better task-specific performance, lower inference cost Requires training data, expensive, can become stale
Prompt Engineering Guide model behavior with instructions and examples No training, instant iteration, works with any model Limited by context window, no persistent knowledge

Best practice: Start with prompt engineering, add RAG when you need domain-specific grounding, and fine-tune only when you need a specific output format or style that prompting can’t achieve.

Building RAG — Step by Step

  1. Prepare data — Upload documents to S3 (PDF, HTML, TXT, DOCX, CSV, MD, XLS)
  2. Create Knowledge Base — Choose embedding model, vector store, and chunking strategy
  3. Sync data source — Bedrock ingests, chunks, embeds, and stores vectors
  4. Test with Retrieve API — Verify relevance of retrieved chunks before full RAG
  5. Enable generation — Connect a foundation model for RetrieveAndGenerate API
  6. Add Guardrails — Apply contextual grounding checks to prevent hallucinations
  7. Integrate with Agent — Optionally connect to a Bedrock Agent for multi-step workflows
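
Steps 3 and 5 can be sketched with boto3 as follows. This assumes the knowledge base and data source already exist; the IDs and model ARN are placeholders and the helper names are hypothetical. The parsing helper runs without AWS access, so it can be tried against a saved response.

```python
def start_sync(kb_id: str, data_source_id: str) -> str:
    """Step 3: start an ingestion job for a data source and return the job ID."""
    import boto3  # lazy import so extract_answer below works without AWS access

    agent = boto3.client("bedrock-agent")
    job = agent.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=data_source_id)
    return job["ingestionJob"]["ingestionJobId"]

def extract_answer(response: dict) -> tuple[str, list[str]]:
    """Pull the generated text and the distinct S3 source URIs out of a response."""
    text = response["output"]["text"]
    uris: list[str] = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
    return text, uris

def ask(kb_id: str, model_arn: str, question: str) -> tuple[str, list[str]]:
    """Step 5: retrieve context and generate an answer in one call."""
    import boto3

    runtime = boto3.client("bedrock-agent-runtime")
    response = runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {"knowledgeBaseId": kb_id, "modelArn": model_arn},
        },
    )
    return extract_answer(response)

# Shape of a response, trimmed to the fields used above (values are made up).
sample = {
    "output": {"text": "Refunds take 14 days."},
    "citations": [
        {"retrievedReferences": [
            {"location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/policy.pdf"}}},
            {"location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/policy.pdf"}}},
        ]}
    ],
}
answer, sources = extract_answer(sample)
```

Only S3 locations are handled here; references from other source types carry a different location field and would be skipped by this helper.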

Cost Optimization

  • Embedding — Titan Embeddings V2 is ~$0.00002/1K tokens (one-time during ingestion + query time)
  • Vector Store — OpenSearch Serverless bills roughly $0.24 per OCU-hour, and a collection is charged for a minimum number of OCUs even when idle (consider Aurora PostgreSQL pgvector for lower cost at scale)
  • Generation — Depends on FM choice (Claude Haiku/Nova Micro are cheapest for RAG)
  • Tip: Use metadata filtering to reduce the number of chunks retrieved, lowering both retrieval cost and FM input token cost
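
A back-of-the-envelope estimate using the approximate rates quoted above might look like the sketch below. The rates change over time and by region, so treat the constants as placeholders to be checked against current pricing; the function names and the default of 50 question tokens are assumptions.

```python
EMBED_RATE_PER_1K_TOKENS = 0.00002   # Titan Embeddings V2, approximate
VECTOR_STORE_RATE_PER_HOUR = 0.24    # OpenSearch Serverless hourly rate, approximate

def embedding_cost(tokens: int) -> float:
    """Cost in USD of embedding the given number of tokens."""
    return tokens / 1000 * EMBED_RATE_PER_1K_TOKENS

def vector_store_monthly_cost(billed_units: float, hours: int = 730) -> float:
    """Monthly cost in USD of keeping `billed_units` of capacity running all month."""
    return billed_units * VECTOR_STORE_RATE_PER_HOUR * hours

def fm_input_tokens(chunks_retrieved: int, tokens_per_chunk: int, question_tokens: int = 50) -> int:
    """Approximate prompt size per query: retrieved context plus the question."""
    return chunks_retrieved * tokens_per_chunk + question_tokens

ingest_cost = embedding_cost(50_000_000)       # one-time embedding of a 50M-token corpus
store_cost = vector_store_monthly_cost(2)      # two billed units running all month
wide = fm_input_tokens(10, 512)                # 10 chunks per query
narrow = fm_input_tokens(3, 512)               # 3 chunks per query
```

Comparing `wide` and `narrow` shows why retrieving fewer chunks lowers generation cost: input tokens scale almost linearly with the number of chunks sent to the model.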

AWS Certification Exam Practice Questions

Question 1:

A company’s RAG system retrieves relevant document chunks but the FM sometimes generates answers that contradict the retrieved information. Which Bedrock feature specifically addresses this?

  A. Content filters set to HIGH
  B. Contextual grounding check in Guardrails
  C. Automated Reasoning checks
  D. Denied topics configuration

Answer: B – Contextual grounding checks verify that the FM’s response is faithful to and supported by the retrieved source documents. It detects when the model “hallucinates” information not present in the context. Automated Reasoning uses formal logic for policy-based validation, which is different from source grounding.

Question 2:

A healthcare company has documents containing complex medical tables, embedded diagrams, and multi-column layouts. Standard chunking produces poor-quality chunks that miss table context. Which parsing approach should they use?

  A. Fixed-size chunking with 1024 token overlap
  B. Foundation Model parsing with a customized extraction prompt
  C. Semantic chunking with max tokens set to 2048
  D. No chunking with each page as a single chunk

Answer: B – Foundation Model parsing uses an FM (e.g., Claude) to interpret complex document layouts including tables, charts, and multi-column text before chunking. You can customize the extraction prompt to specify how tables should be serialized. This preserves structural information that standard text extraction would lose.

Question 3:

A legal firm needs their RAG system to only return answers from documents the requesting user has permission to access. Different users have access to different case files. How should they implement this?

  A. Create separate Knowledge Bases per user
  B. Use metadata filtering with user-specific access tags at query time
  C. Implement IAM policies on the vector store
  D. Use Guardrails to filter responses based on user role

Answer: B – Metadata filtering allows you to tag documents with access control metadata (e.g., case_id, department, clearance_level) during ingestion, then pass user-specific filters at query time. This ensures the vector search only returns chunks from documents the user is authorized to access, without duplicating data.

Question 4:

A company’s RAG system returns accurate but overly long answers because it retrieves too many chunks. They want to improve precision without reducing recall. Which technique helps?

  A. Reduce the Top-K parameter from 10 to 3
  B. Apply a reranker model after initial retrieval
  C. Switch from semantic to fixed-size chunking
  D. Increase the embedding model dimensions

Answer: B – A reranker scores and reorders retrieved chunks by contextual relevance, so you can retrieve broadly (high recall) and then keep only the best matches (high precision). Reducing Top-K alone would cut recall, because relevant chunks ranked below the cutoff are never retrieved. The reranker keeps recall high while eliminating less-relevant results before they reach the FM.

Question 5:

An enterprise wants to implement RAG with their data in Confluence and SharePoint. They need the knowledge base to stay current as documents are updated. What is the MOST operationally efficient approach?

  A. Export documents to S3 nightly and sync the Knowledge Base on a schedule
  B. Use native Confluence and SharePoint connectors with incremental sync
  C. Build a custom Lambda pipeline to poll for changes and update the vector store
  D. Use Amazon Kendra with connectors and integrate with Bedrock via API

Answer: B – Bedrock Knowledge Bases has native connectors for Confluence, SharePoint, and other sources. These support incremental sync that only processes changed documents, keeping the knowledge base current without full re-ingestion. This is more operationally efficient than building custom pipelines or exporting to S3.

Frequently Asked Questions

What is RAG in AWS?

RAG (Retrieval-Augmented Generation) on AWS is implemented through Amazon Bedrock Knowledge Bases. It retrieves relevant information from your data sources (S3, Confluence, SharePoint, web) and provides it as context to a foundation model, grounding responses in your actual data and reducing hallucinations.

How much does RAG cost on AWS?

RAG costs have three components: embedding ($0.00002/1K tokens for Titan V2), vector storage (OpenSearch Serverless at about $0.24 per OCU-hour, or Aurora PostgreSQL pgvector), and FM generation (varies by model; Claude Haiku and Nova Micro are among the cheapest). For high-volume workloads, FM generation cost usually dominates; for small ones, the vector store's baseline charge can be the largest item.

RAG vs Fine-tuning — which should I use?

Use RAG when you need answers grounded in specific documents that change over time. Use fine-tuning when you need to change the model’s behavior, output format, or domain vocabulary. They can be combined: fine-tune for style, RAG for knowledge.

How do I prevent hallucinations in RAG?

Enable Bedrock Guardrails contextual grounding checks, which verify that the FM’s response is supported by the retrieved source chunks. Also: use higher Top-K for broader retrieval, add reranking for precision, and use hierarchical chunking to provide more context around matches.
