Query Processing – Making RAG Actually Understand You

If you’ve been following along, you now know why LLMs hallucinate — they’re probabilistic prediction engines, not databases. They guess what comes next based on statistical patterns learned from the internet.

The solution? Stop making them guess. Give them access to your data.

In Part 2, we explored two approaches: RAG and CAG. RAG (Retrieval-Augmented Generation) retrieves relevant documents at runtime, while CAG (Cache-Augmented Generation) pre-loads knowledge into the model’s context window.

If you’re building something serious — especially with large, dynamic knowledge bases — RAG is the right choice. And at the heart of every RAG system lies a vector database.

The Problem: Users Ask Terrible Questions

When someone types “how do i deploy k8s” into a vector database, they’re essentially speaking Greek to it. The retrieval system returns… nothing relevant.

Real Examples of Query Failures

User query: How do I troubleshoot my database?
System retrieval: Returns general database troubleshooting docs
User thinks: Wait, my question was about my specific tech stack (PostgreSQL on AWS with connection pooling issues)

User query: My API is slow
System retrieval: Returns documentation about API rate limits
Actual problem: Database N+1 query problem in backend — completely different issue

User query: Compare vector databases
System retrieval: Returns individual pages about FAISS, Pinecone, Weaviate
What user actually wanted: A comparison table showing features, pricing, pros/cons side-by-side

These aren’t edge cases. This is normal. Most users don’t know how to ask questions effectively. They don’t know your terminology. They don’t know what’s possible.

A great RAG system compensates for bad user queries.


The Solution: Query Processing

Query processing is the unsung hero of production RAG systems. While everyone obsesses over embeddings and vector databases, the real differentiator is how intelligently you handle user queries.

Query processing has six key stages, plus prompt engineering as a final step (covered at the end):

  1. Query Decomposition — Breaking complex questions into pieces
  2. Query Routing — Deciding where to send the query
  3. Query Expansion — Adding synonyms and related terms
  4. Retrieval — Getting candidate documents from the vector database
  5. Re-ranking — Optimizing for relevance and diversity
  6. Context Management — Using conversation history to improve future queries

1. Query Decomposition: Breaking It Down

Complex queries often contain multiple questions or require multi-step reasoning. If you search the vector database with the raw query, you’ll miss relevant information.

Why Decomposition Matters

Consider this query:

“How do I deploy Kubernetes on AWS and set up monitoring with Prometheus while also configuring auto-scaling for my Node.js microservices?”

If you search for this as-is, the vector database might return:

  • Documentation about Kubernetes deployment (Kubernetes, AWS)
  • Documentation about Prometheus (monitoring)
  • Documentation about auto-scaling (Kubernetes, microservices)

But each piece might be incomplete. The deployment docs won’t mention monitoring. The monitoring docs won’t mention auto-scaling. The user needs all three pieces combined.

The Decomposition Process

Step A: Identify Intent

First, determine what the user actually wants. Intent falls into several categories:

Factual Information:

“What is the difference between PostgreSQL and MySQL?”
“Who developed the Kubernetes project?”

Procedural Help:

“How do I deploy a React app to Vercel?”
“Steps to set up CI/CD with GitHub Actions”

Creative/Brainstorming:

“Help me come up with ideas for a side project using AI”
“Suggest ways to improve my database performance”

Troubleshooting/Debugging:

“My database connection keeps timing out”
“Getting 403 errors when calling the API”
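
Intent classification can be a single focused LLM call. Below is a minimal sketch; call_llm stands in for whatever LLM client you use here and throughout this post (a concrete version appears at the end):

VALID_INTENTS = {"factual", "procedural", "creative", "troubleshooting"}

def classify_intent(query: str) -> str:
    """Classify a query into one of four intent categories."""
    prompt = f"""
    Classify this query as exactly one of:
    factual, procedural, creative, troubleshooting

    Query: "{query}"

    Return only the category name.
    """
    intent = call_llm(prompt).strip().lower()

    # Fall back to a safe default if the model returns something unexpected
    return intent if intent in VALID_INTENTS else "factual"

# Example
classify_intent("Steps to set up CI/CD with GitHub Actions")  # "procedural"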

Step B: Extract Entities

Entities are the key nouns and concepts in the query. These are what you’ll use to filter or enhance your search:

Query: “Deploy Kubernetes on AWS with Prometheus monitoring”

  • Entities: Kubernetes, AWS, Prometheus, monitoring
  • Entity types: Platform (Kubernetes), Cloud Provider (AWS), Tool (Prometheus), Domain (monitoring)
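
A lightweight way to extract entities is dictionary matching against a known vocabulary. The extract_entities helper below is a hypothetical sketch under that assumption (it’s reused in the context-management section later); production systems typically use an NER model or an LLM instead:

KNOWN_ENTITIES = {
    "kubernetes": "Kubernetes",
    "k8s": "Kubernetes",
    "aws": "AWS",
    "prometheus": "Prometheus",
    "postgresql": "PostgreSQL",
    "node.js": "Node.js",
}

def extract_entities(text: str) -> set:
    """Match known entity names (and their aliases) in the text."""
    lowered = text.lower()
    return {canonical for alias, canonical in KNOWN_ENTITIES.items() if alias in lowered}

# Example
extract_entities("Deploy Kubernetes on AWS with Prometheus monitoring")
# {"Kubernetes", "AWS", "Prometheus"}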

Step C: Break Down Complex Queries

Multi-part queries need to be split into individual sub-queries:

Original: “Compare Docker and Kubernetes, then tell me which one I should use for a startup with 3 developers”

Decomposed:

  1. Sub-query 1: “What are the differences between Docker and Kubernetes?”
  2. Sub-query 2: “What is Docker best suited for?”
  3. Sub-query 3: “What is Kubernetes best suited for?”
  4. Sub-query 4: “Which is better for small teams?”
  5. Sub-query 5: “Resource requirements for Docker vs Kubernetes”

Step D: Resolve Ambiguity

Ambiguous terms can derail retrieval. You need to resolve them based on context:

Query: “Compare vector databases”

Context: User is evaluating options for their company (FAISS, Pinecone, Weaviate, Qdrant)

Resolution: Explicitly search for comparison tables, feature matrices, and pricing — not individual vector DB internals

Query: “How do I deploy my app?”

Ambiguity: What kind of app? Node.js? Python? Static site?

Resolution: Ask clarifying question OR retrieve documentation for common frameworks and let the user choose
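
When a query is ambiguous, generating a clarifying question is often cheaper than retrieving the wrong documents. A minimal sketch, again assuming the call_llm placeholder:

from typing import List

def generate_clarifying_question(query: str, ambiguous_terms: List[str]) -> str:
    """Ask the user to resolve ambiguity before retrieval."""
    prompt = f"""
    The user asked: "{query}"
    These terms are ambiguous: {', '.join(ambiguous_terms)}

    Write one short question asking the user to clarify.
    Return only the question.
    """
    return call_llm(prompt)

# Example
generate_clarifying_question("How do I deploy my app?", ["app"])
# e.g. "What kind of app are you deploying: Node.js, Python, or a static site?"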

Practical Implementation

import json
from typing import Dict, List

def decompose_query(query: str) -> Dict:
    """
    Decompose a user query into intent, entities, and sub-queries.
    """
    # Use an LLM to analyze the query
    prompt = f"""
    Analyze this user query and return a JSON object with:
    - intent: factual/procedural/creative/troubleshooting
    - entities: list of key nouns and concepts
    - sub_queries: list of individual questions if this is a multi-part query
    - clarifications: list of ambiguous terms that need resolution


    Query: "{query}"

    Return only valid JSON.
    """

    # Call your LLM here (OpenAI, Claude, etc.)
    result = call_llm(prompt)

    return json.loads(result)

# Example
query = "Deploy Kubernetes on AWS with Prometheus monitoring and auto-scaling for Node.js"
decomposed = decompose_query(query)

# Result:
# {
#     "intent": "procedural",
#     "entities": ["Kubernetes", "AWS", "Prometheus", "auto-scaling", "Node.js"],
#     "sub_queries": [
#         "Deploy Kubernetes on AWS",
#         "Set up Prometheus monitoring for Kubernetes",
#         "Configure auto-scaling in Kubernetes",
#         "Deploy Node.js applications on Kubernetes"
#     ],
#     "clarifications": ["Auto-scaling type: HPA, VPA, or cluster autoscaler?"]
# }

2. Query Routing: Intent Classification

Not all queries should go to the same place. Different intents require different retrieval strategies. Query routing decides where to send each query.

Why Routing Matters

A single vector database is rarely enough for complex use cases. Consider:

  • Factual queries: Need precise answers from documentation
  • Troubleshooting queries: Need forum posts, Stack Overflow answers, bug reports
  • Comparison queries: Need comparison tables, feature matrices
  • Code queries: Need code examples, tutorials, API references

If everything goes to one vector database, you’ll get irrelevant results.

Routing Strategies

Strategy 1: Intent-Based Routing

Classify the query by intent and route to specialized collections:

def route_by_intent(query: str, intent: str) -> str:
    """
    Route query to the appropriate vector collection based on intent.
    """
    routing_map = {
        "factual": "docs_collection",
        "procedural": "tutorials_collection",
        "troubleshooting": "forum_collection",
        "creative": "ideas_collection",
        "comparison": "comparisons_collection"
    }

    return routing_map.get(intent, "default_collection")

# Example
query = "How do I fix a 504 Gateway Timeout error?"
intent = "troubleshooting"
collection = route_by_intent(query, intent)  # "forum_collection"

Strategy 2: Entity-Based Routing

Route based on what the user is asking about:

def route_by_entity(query: str, entities: List[str]) -> List[str]:
    """
    Route query to collections relevant to the entities mentioned.
    """
    # Map entities to collections
    entity_collections = {
        "Kubernetes": ["k8s_docs", "k8s_tutorials", "k8s_forum"],
        "PostgreSQL": ["postgres_docs", "postgres_troubleshooting"],
        "React": ["react_docs", "react_tutorials", "react_stackoverflow"],
        "AWS": ["aws_docs", "aws_tutorials"]
    }

    # Find all relevant collections
    collections = set()
    for entity in entities:
        if entity in entity_collections:
            collections.update(entity_collections[entity])

    return list(collections)

Strategy 3: Hybrid Routing

Combine intent and entity-based routing:

def hybrid_routing(query: str, intent: str, entities: List[str]) -> List[str]:
    """
    Combine intent and entity-based routing for optimal results.
    """
    # Start with intent-based collection
    primary_collection = route_by_intent(query, intent)

    # Get entity-specific collections
    entity_collections = route_by_entity(query, entities)

    # Combine and deduplicate
    all_collections = list(set([primary_collection] + entity_collections))

    return all_collections

Practical Example: Customer Support Bot

Let’s say you’re building a customer support bot for a SaaS product:

# User queries and their routing

query1 = "How do I reset my password?"
intent1 = "procedural"
entities1 = ["password", "account"]
route1 = ["docs_collection", "account_docs"]
# Result: Search account documentation

query2 = "I'm getting error code ERR-5002 when syncing data"
intent2 = "troubleshooting"
entities2 = ["error", "sync", "ERR-5002"]
route2 = ["forum_collection", "error_codes", "troubleshooting"]
# Result: Search forum posts and error code database

query3 = "Compare your pricing plans"
intent3 = "comparison"
entities3 = ["pricing", "plans"]
route3 = ["comparisons_collection", "pricing_page"]
# Result: Search pricing comparison tables

3. Query Expansion: Making It Smarter

Users rarely search with perfect terminology. Query expansion adds synonyms and related terms to improve retrieval.

Why Expansion Matters

User searches: “cheap reliable cars”

Vector database might miss:

  • “affordable dependable vehicles”
  • “budget-friendly sedans”
  • “low-cost automobiles”

User searches: “k8s deployment”

Vector database might miss:

  • “Kubernetes deployment guide”
  • “deploying applications on Kubernetes”
  • “Kubernetes cluster setup”

Expansion Techniques

Technique 1: Synonym Expansion

Add synonyms for key terms:

import re
from typing import List

def expand_with_synonyms(query: str) -> List[str]:
    """
    Generate queries with synonyms for key terms.
    """
    # Define synonyms mapping
    synonyms = {
        "cheap": ["affordable", "budget-friendly", "low-cost", "inexpensive"],
        "reliable": ["dependable", "trustworthy", "durable", "consistent"],
        "cars": ["vehicles", "automobiles", "transportation"],
        "k8s": ["Kubernetes", "kubernetes"],
        "deploy": ["deployment", "deploying", "setup", "install"],
        "database": ["db", "data store", "storage", "persistence layer"]
    }

    # For each key term, generate variations
    expanded_queries = [query]

    for term, syns in synonyms.items():
        if term.lower() in query.lower():
            for syn in syns:
                # Case-insensitive substitution of the first occurrence
                expanded = re.sub(re.escape(term), syn, query, count=1, flags=re.IGNORECASE)
                expanded_queries.append(expanded)

    return list(set(expanded_queries))

# Example
query = "cheap reliable cars"
expanded = expand_with_synonyms(query)

# Result:
# [
#     "cheap reliable cars",
#     "affordable reliable cars",
#     "budget-friendly reliable cars",
#     "cheap dependable cars",
#     "cheap trustworthy cars"
# ]

Technique 2: Entity Expansion

Add related entities:

def expand_with_entities(query: str, entities: List[str]) -> List[str]:
    """
    Add related entities to the query.
    """
    # Define related entities
    related_entities = {
        "Kubernetes": ["k8s", "K8s", "kubernetes", "container orchestration"],
        "PostgreSQL": ["postgres", "pg", "Postgres", "relational database"],
        "React": ["React.js", "ReactJS", "reactjs", "frontend framework"],
        "AWS": ["Amazon Web Services", "Amazon Cloud", "EC2", "S3"]
    }

    expanded_queries = [query]

    for entity in entities:
        if entity in related_entities:
            for related in related_entities[entity]:
                if related.lower() not in query.lower():
                    expanded = query.replace(entity, related, 1)
                    expanded_queries.append(expanded)

    return list(set(expanded_queries))

Technique 3: LLM-Based Expansion

Use an LLM to generate expanded queries:

def llm_expand_query(query: str) -> List[str]:
    """
    Use an LLM to generate expanded queries.
    """
    prompt = f"""
    Generate 5-10 alternative ways to phrase this query.
    Keep the same intent but use different terminology and synonyms.

    Original query: "{query}"

    Return only a JSON array of strings.
    """

    result = call_llm(prompt)
    expanded_queries = json.loads(result)

    # Include original
    expanded_queries.insert(0, query)

    return expanded_queries

# Example
query = "How do I deploy Kubernetes on AWS?"
expanded = llm_expand_query(query)

# Result:
# [
#     "How do I deploy Kubernetes on AWS?",
#     "What are the steps to set up Kubernetes on Amazon Web Services?",
#     "Kubernetes deployment guide for AWS",
#     "Deploying containerized applications on AWS EKS",
#     "How to configure Kubernetes cluster on Amazon EC2",
#     "AWS Kubernetes setup tutorial"
# ]

4. Retrieval: Getting Documents from Vector Database

After processing the query, it’s time to retrieve relevant documents from the vector database.

Retrieval Configuration

from typing import Dict, List

from qdrant_client import QdrantClient

def retrieve_documents(
    query_embedding: List[float],
    collection_name: str,
    top_k: int = 10,
    score_threshold: float = 0.7
) -> List[Dict]:
    """
    Retrieve top-k documents from vector database.
    """
    client = QdrantClient(url="http://localhost:6333")

    results = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        query_filter=None,  # No metadata filter
        limit=top_k,
        score_threshold=score_threshold  # Only return high-quality matches
    )

    return results

Hybrid Retrieval (Vector + Keyword)

Combine vector search with keyword search for better results:

def hybrid_retrieval(
    query: str,
    query_embedding: List[float],
    collection_name: str,
    top_k: int = 10
) -> List[Dict]:
    """
    Combine vector and keyword search.
    """
    # Vector search
    vector_results = retrieve_documents(
        query_embedding,
        collection_name,
        top_k=top_k * 2  # Get more candidates
    )

    # Keyword search (BM25/TF-IDF); keyword_search is sketched below
    keyword_results = keyword_search(query, collection_name, top_k=top_k * 2)

    # Combine and re-rank; combine_and_rerank is sketched below
    combined = combine_and_rerank(vector_results, keyword_results)

    return combined[:top_k]
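
Both keyword_search and combine_and_rerank were left undefined above. Here is one plausible pair of sketches: BM25 (via the rank_bm25 package) for the keyword side, and reciprocal rank fusion (RRF) to merge the two ranked lists. load_documents is a hypothetical loader for the raw texts of a collection, and each result dict is assumed to carry 'id' and 'text' fields:

from typing import Dict, List

from rank_bm25 import BM25Okapi

def keyword_search(query: str, collection_name: str, top_k: int = 10) -> List[Dict]:
    """BM25 keyword search over the raw documents of a collection."""
    docs = load_documents(collection_name)  # hypothetical: returns [{'id': ..., 'text': ...}, ...]
    bm25 = BM25Okapi([doc['text'].lower().split() for doc in docs])
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def combine_and_rerank(vector_results: List[Dict], keyword_results: List[Dict]) -> List[Dict]:
    """Merge two ranked lists with reciprocal rank fusion (RRF)."""
    k = 60  # standard RRF damping constant
    fused = {}
    for results in (vector_results, keyword_results):
        for rank, doc in enumerate(results):
            entry = fused.setdefault(doc['id'], {'doc': doc, 'score': 0.0})
            entry['score'] += 1.0 / (k + rank + 1)
    merged = sorted(fused.values(), key=lambda entry: entry['score'], reverse=True)
    return [entry['doc'] for entry in merged]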

5. Re-ranking: Quality Control

Initial retrieval may return noisy or irrelevant results. Re-ranking optimizes for relevance and diversity.

Why Re-ranking Matters

Vector similarity isn’t perfect. The top result might be semantically similar but not actually relevant:

Query: “How to fix PostgreSQL connection timeout”

Top vector result: “PostgreSQL connection pool configuration” (93% similarity)

But the document the user actually needed: “PostgreSQL timeout settings and troubleshooting guide” (85% similarity)

Re-ranking Techniques

Technique 1: Cross-Encoder Re-ranking

Use a cross-encoder model for more accurate relevance scoring:

from typing import Dict, List

from sentence_transformers import CrossEncoder

# Load a cross-encoder model (more accurate than vector similarity)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, documents: List[Dict]) -> List[Dict]:
    """
    Re-rank documents using cross-encoder.
    """
    # Prepare query-document pairs
    pairs = [(query, doc['text']) for doc in documents]

    # Compute relevance scores
    scores = reranker.predict(pairs)

    # Sort by score
    reranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)

    return [doc for doc, score in reranked]


# Example
query = "How to fix PostgreSQL connection timeout"
documents = retrieve_documents(...)  # From vector database
reranked = rerank_results(query, documents)

Technique 2: Diversity Re-ranking

Ensure results cover different aspects of the query:

def diversity_rerank(documents: List[Dict], top_k: int = 10) -> List[Dict]:
    """
    Re-rank to ensure diversity in results.
    """
    selected = []
    remaining = documents.copy()

    while len(selected) < top_k and remaining:
        # Pick the highest-scored document
        best = max(remaining, key=lambda x: x['score'])
        selected.append(best)
        remaining.remove(best)

        # Remove documents sharing a category with anything already selected
        selected_categories = {d.get('category') for d in selected if d.get('category')}
        remaining = [d for d in remaining if d.get('category') not in selected_categories]

    return selected

Technique 3: Metadata-Aware Re-ranking

Prioritize documents with better metadata:

def metadata_aware_rerank(documents: List[Dict]) -> List[Dict]:
    """
    Re-rank considering document metadata.
    """
    for doc in documents:
        # Boost score based on metadata
        boost = 1.0

        # Recent documents get a boost (default '' avoids comparing None to str)
        if doc.get('created_at', '') > '2024-01-01':
            boost *= 1.1

        # Official documentation gets a boost
        if doc.get('source') == 'official_docs':
            boost *= 1.2

        # Upvoted forum posts get a boost
        if doc.get('votes', 0) > 10:
            boost *= 1.15

        doc['adjusted_score'] = doc['score'] * boost

    # Sort by adjusted score
    return sorted(documents, key=lambda x: x['adjusted_score'], reverse=True)

6. Context Management: Conversation History

A good query processor maintains conversation history and uses previous interactions to improve future retrievals.

Why Context Matters

User: "How do I deploy Kubernetes?"

System: Returns general Kubernetes deployment guides

User (follow-up): "What about on AWS?"

Without context: Returns general AWS guides (irrelevant)

With context: Understands "deploy Kubernetes on AWS"

Context Management Strategies

Strategy 1: Query Context Injection

class ConversationContext:
    def __init__(self):
        self.history = []
        self.entities = set()

    def add_message(self, role: str, content: str):
        """Add a message to conversation history."""
        self.history.append({"role": role, "content": content})

        # Extract entities from user messages
        if role == "user":
            entities = extract_entities(content)  # helper sketched in Step B above
            self.entities.update(entities)

    def get_enhanced_query(self, query: str) -> str:
        """Enhance query with conversation context."""
        if not self.history:
            return query

        # Get last few exchanges
        recent = self.history[-4:]  # Last 2 exchanges

        # Build context string
        context_parts = []
        for msg in recent:
            context_parts.append(f"{msg['role']}: {msg['content']}")

        context = "\n".join(context_parts)

        # Ask LLM to enhance the query
        prompt = f"""
        Given this conversation history, rewrite the user's latest query
        to include relevant context from previous messages.

        Conversation:
        {context}

        Latest query: "{query}"

        Return only the enhanced query, nothing else.
        """

        enhanced = call_llm(prompt)
        return enhanced

# Example
ctx = ConversationContext()
ctx.add_message("user", "How do I deploy Kubernetes?")
ctx.add_message("assistant", "You can deploy Kubernetes using kubeadm, minikube, or cloud services like EKS, GKE, AKS.")

query = "What about on AWS?"
enhanced = ctx.get_enhanced_query(query)
# Result: "How do I deploy Kubernetes on AWS EKS?"

Strategy 2: Entity Persistence

def persist_entities(query: str, context: ConversationContext) -> str:
    """
    Add previously mentioned entities to the query.
    """
    entities = list(context.entities)

    if not entities:
        return query

    # Ask LLM to incorporate entities
    prompt = f"""
    The user mentioned these entities in previous messages: {', '.join(entities)}

    Rewrite this query to include relevant entities if they fit:
    "{query}"

    Return only the rewritten query.
    """

    enhanced = call_llm(prompt)
    return enhanced

7. Prompt Engineering for RAG

The right prompt architecture can make or break your RAG system. Even with perfect retrieval, a bad prompt will produce poor answers.

Common RAG Prompt Mistakes

  1. No system instructions: "Here's a document about X, answer based on it"
  2. Ignoring retrieval scores: Not using distance/similarity metadata
  3. Letting it hallucinate: "Feel free to use your general knowledge if context doesn't have answer"
  4. Not saying I don't know: Better to say "Based on the provided context, I don't have that information"

The Perfect RAG Prompt

def build_rag_prompt(query: str, documents: List[Dict]) -> str:
    """
    Build an optimized RAG prompt.
    """
    # Sort documents by relevance score
    sorted_docs = sorted(documents, key=lambda x: x['score'], reverse=True)

    # Build context section
    context_parts = []
    for i, doc in enumerate(sorted_docs, 1):
        context_parts.append(f"""
[Document {i} - Relevance: {doc['score']:.2%}]
Source: {doc.get('source', 'Unknown')}

{doc['text']}
---
        """)

    context = "\n".join(context_parts)

    # Build the prompt
    prompt = f"""You are a helpful assistant. Answer the user's question using ONLY the provided context.


{context}



- Use ONLY the information from the documents above
- If the answer is not in the context, say: "Based on the provided context, I don't have that information."
- Cite which document you're using when answering (e.g., "According to Document 2...")
- Do not make up information or use outside knowledge
- Be concise and direct



{query}



"""

    return prompt

Example: Good vs Bad RAG Responses

Bad RAG Response:
"To fix your database timeout issue, try increasing the connection timeout setting in your configuration file. Also check your network latency and consider using a connection pool."

Problem: Uses outside knowledge not in retrieved docs, doesn't cite sources.

Good RAG Response:
"According to Document 3 (PostgreSQL Troubleshooting Guide), connection timeout issues can be resolved by adjusting the `statement_timeout` parameter in `postgresql.conf`. Document 1 suggests setting `statement_timeout = 30000` (30 seconds) as a starting point. Document 5 recommends checking your `pg_hba.conf` for authentication issues that might appear as timeouts."

Good: Cites specific documents, uses only retrieved information.


Putting It All Together: Complete Query Processing Pipeline

def query_processing_pipeline(query: str) -> Dict:
    """
    Complete query processing pipeline.
    """
    # Step 1: Decompose query
    decomposed = decompose_query(query)

    # Step 2: Route to appropriate collections
    collections = hybrid_routing(
        query,
        decomposed['intent'],
        decomposed['entities']
    )

    # Step 3: Expand query
    expanded_queries = llm_expand_query(query)

    # Step 4: Retrieve from each collection
    all_results = []
    for expanded in expanded_queries:
        embedding = get_embedding(expanded)


        for collection in collections:
            results = retrieve_documents(
                embedding,
                collection,
                top_k=5
            )
            all_results.extend(results)

    # Step 5: Re-rank results
    reranked = rerank_results(query, all_results)
    final = metadata_aware_rerank(reranked)
    final = diversity_rerank(final, top_k=10)

    # Step 6: Build RAG prompt
    prompt = build_rag_prompt(query, final)

    # Step 7: Generate answer
    answer = call_llm(prompt)

    return {
        'answer': answer,
        'sources': [doc.get('source') for doc in final],
        'documents': final
    }

# Example
query = "How do I deploy Kubernetes on AWS with Prometheus monitoring?"
result = query_processing_pipeline(query)
print(result['answer'])
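
One loose end: call_llm and get_embedding have been placeholders throughout. Here is a minimal sketch of both, assuming the OpenAI Python SDK; any provider works if you swap in your own client, and the model names are just assumptions:

from typing import List

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str) -> str:
    """Send a single-turn prompt to a chat model and return the text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def get_embedding(text: str) -> List[float]:
    """Embed a piece of text for vector similarity search."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumption: match your index's embedding model
        input=text,
    )
    return response.data[0].embedding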

Key Insight

Query Processing is the unsung hero of RAG systems.

Most people focus on embeddings and vector databases. But RAG systems win or lose on query processing. Great retrieval with the wrong routing is useless. Perfect embeddings with bad decomposition will still miss relevant information.

The companies with the best RAG systems aren't necessarily using the best vector databases — they're using the most intelligent query processing.


Connecting the Dots: Bridge to Part 5

In Part 4, we explored how to make RAG actually understand user queries through intelligent query processing. Part 5 will show you how to build a RAG system that doesn't just store vectors, but thinks.

We'll cover:

  • Designing a production RAG architecture
  • Choosing the right tools and frameworks
  • Evaluating RAG system performance
  • Common pitfalls and how to avoid them

This is Part 4 of a 7-part series on AI & RAG.

Previous: Vector Databases – The Memory Palace of AI

Next: Building a RAG System (coming soon)