Retrieval-Augmented Generation with Spring Boot and Vector Databases

Introduction

You’re building an enterprise Java application that needs to answer questions based on your company’s internal documents. A standard language model can’t do this — it only knows what it learned during training. Enter Retrieval-Augmented Generation (RAG), a pattern that combines search with generation. In this tutorial, you’ll learn how to integrate Spring Boot with vector databases to build a RAG system. We’ll demystify every concept: what a vector is, how search works in this context, why context matters, and how RAG ties it all together. By the end, you’ll have a working example that queries documents intelligently.

What Is RAG? (Hint: It’s Not Just a Tech Buzzword)

RAG stands for Retrieval-Augmented Generation. Plain-English definition: instead of asking a language model to answer from memory, you first find relevant information in a database, then feed that information to the model as context.

How it works under the hood: RAG has two phases. First, retrieval — you search a database for documents related to the user’s question. Second, generation — you pass those documents plus the original question to a language model, which crafts an answer using the retrieved content.

Real-world analogy: Imagine you’re a lawyer preparing for a case. You don’t recite the law from memory. You pull relevant statutes from your filing cabinet, then write your argument based on those documents. RAG is that filing cabinet for AI.

Here’s a basic Spring Boot service that orchestrates RAG:

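import java.util.List;
import org.springframework.stereotype.Service;

@Service
public class RagService {
    private final VectorSearchService searchService;
    private final LanguageModelClient lmClient; // this tutorial's placeholder for your LLM client

    // Constructor injection: Spring supplies both collaborators
    public RagService(VectorSearchService searchService, LanguageModelClient lmClient) {
        this.searchService = searchService;
        this.lmClient = lmClient;
    }

    public String answer(String question) {
        // Phase 1: Retrieve the three most relevant documents
        List<String> documents = searchService.search(question, 3);

        // Phase 2: Generate answer with context
        String prompt = buildPrompt(question, documents);
        return lmClient.generate(prompt);
    }

    // Retrieved documents go first, followed by the user's question
    private String buildPrompt(String question, List<String> docs) {
        return "Context:\n" + String.join("\n", docs)
             + "\n\nQuestion: " + question;
    }
}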

Non-obvious insight: RAG isn’t just about getting better answers — it’s about getting verifiable answers. Because you know which documents were used, you can audit the response. This is crucial in regulated industries.
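
One way to make that audit trail concrete is to return the retrieved documents together with the answer. The following is a minimal sketch, not part of the service above: GroundedAnswer and AuditableRag are hypothetical names, and the search and generation steps are passed in as plain functions so the example runs without Spring.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical result type: the generated text plus the documents it was grounded on.
record GroundedAnswer(String answer, List<String> sources) {}

class AuditableRag {
    // Stand-ins for the tutorial's VectorSearchService and LanguageModelClient,
    // expressed as functions (an assumption made to keep the sketch self-contained).
    private final BiFunction<String, Integer, List<String>> search;
    private final Function<String, String> generate;

    AuditableRag(BiFunction<String, Integer, List<String>> search,
                 Function<String, String> generate) {
        this.search = search;
        this.generate = generate;
    }

    GroundedAnswer answer(String question) {
        List<String> documents = search.apply(question, 3);
        String prompt = "Context:\n" + String.join("\n", documents)
                      + "\n\nQuestion: " + question;
        // Keep an immutable copy of the sources so callers can log or display them.
        return new GroundedAnswer(generate.apply(prompt), List.copyOf(documents));
    }
}
```

A caller can then store or display sources() next to the answer, which is what makes the response reviewable after the fact.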

Vectors: The Math That Makes Similarity Search Possible

A vector is a list of numbers that represents meaning. Plain-English definition: it’s a mathematical way to say “this document is about this topic” by assigning coordinates in a high-dimensional space.

How it works under the hood: An embedding model converts text (a sentence, paragraph, or document) into a vector — say, 384 numbers. Similar concepts produce vectors that are close together in this space. “Cat” and “kitten” will have vectors near each other; “cat” and “accounting” will be far apart.

Real-world analogy: Think of vectors as GPS coordinates for ideas. New York and Boston have coordinates near each other on a map. “Machine learning” and “neural networks” have vectors near each other in vector space.

Here’s how you generate a vector in Spring Boot using the Spring AI library:

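import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Component;

@Component
public class EmbeddingService {
    private final EmbeddingModel embeddingModel;

    // Constructor injection: Spring AI auto-configures an EmbeddingModel bean
    public EmbeddingService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public float[] embed(String text) {
        // The model converts text into a vector of floats.
        // In Spring AI 1.0, embed(String) returns float[] directly; early
        // milestone releases returned List<Double>, which needed manual conversion.
        return embeddingModel.embed(text);
    }
}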

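Gotcha: Different embedding models produce vectors of different dimensions. OpenAI’s text-embedding-ada-002 gives 1536 dimensions. all-MiniLM-L6-v2 gives 384. Your database column or index must be created with the same dimension as your model. Most vector stores reject a mismatched vector with an error; if yours does not validate dimensions, the failure can be silent, with searches returning meaningless results.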
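
A cheap safeguard is to check the dimension in your own code before a vector ever reaches the database, so a model swap surfaces immediately. This is a minimal sketch; DimensionGuard is a hypothetical helper, and the expected dimension would come from whatever your index was created with.

```java
// Hypothetical helper: validates embedding length against the index dimension.
class DimensionGuard {
    private final int expectedDimensions;

    DimensionGuard(int expectedDimensions) {
        if (expectedDimensions <= 0) {
            throw new IllegalArgumentException("expectedDimensions must be positive");
        }
        this.expectedDimensions = expectedDimensions;
    }

    // Returns the vector unchanged if it has the expected length, otherwise throws.
    float[] check(float[] vector) {
        if (vector.length != expectedDimensions) {
            throw new IllegalStateException("Embedding has " + vector.length
                + " dimensions, but the index expects " + expectedDimensions);
        }
        return vector;
    }
}
```

For example, new DimensionGuard(384).check(embedding) passes a 384-float vector through and throws on a 1536-float one.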

Search: Finding Needles in a Vector Haystack

Search in the context of RAG means vector similarity search. Plain-English definition: you find documents whose vectors are closest to the question’s vector.

How it works under the hood: Conceptually, the system calculates the distance between the query vector and every document vector in the database (production systems use an index to avoid the full scan). The most common metric is cosine similarity, which measures the cosine of the angle between two vectors. The smaller the angle, the closer the score is to 1 and the more similar the content.
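
To make the metric concrete, here is a small self-contained sketch of cosine similarity and an exact (brute-force) top-K search over an in-memory list. CosineSearch and its method names are illustrative; a real vector database does the same ranking behind an index rather than with a full scan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CosineSearch {
    // cos(angle) = (a . b) / (|a| * |b|); 1 means same direction, 0 means orthogonal.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // convention chosen here: a zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the indices of the k documents most similar to the query, best first.
    static List<Integer> topK(float[] query, List<float[]> docs, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            indices.add(i);
        }
        // Sort by descending similarity (negate the score for ascending sort).
        indices.sort(Comparator.comparingDouble((Integer i) -> -cosine(query, docs.get(i))));
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }
}
```

Note that cosine similarity ignores vector length: {1, 0} and {2, 0} score 1 because they point in the same direction.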

Real-world analogy: You’re at a party looking for someone who likes jazz. You find the jazz fans by seeing who stands closest to the jazz-appreciation group. Vector search does this with mathematical precision.

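Here’s a search implementation written in the style of a Spring Data repository. Treat VectorRepository, @VectorQuery, and @VectorQueryField as illustrative names rather than a published Spring Data API; the equivalent in your stack (for example, Spring AI’s VectorStore) will look different: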

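import java.util.List;
import org.springframework.stereotype.Repository;
import org.springframework.stereotype.Service;

@Repository
public interface DocumentRepository
    extends VectorRepository<Document, String> {

    // Returns at most topK documents, nearest first by cosine distance
    @VectorQuery(distance = Distance.COSINE, topK = 5)
    List<Document> findSimilarDocuments(
        @VectorQueryField float[] queryVector
    );
}

@Service
public class VectorSearchService {
    private final DocumentRepository repo;
    private final EmbeddingService embeddingService;

    public VectorSearchService(DocumentRepository repo, EmbeddingService embeddingService) {
        this.repo = repo;
        this.embeddingService = embeddingService;
    }

    public List<String> search(String query, int limit) {
        // Embed the query with the same model used for the stored documents
        float[] queryVector = embeddingService.embed(query);
        // The repository caps results at topK = 5, so a limit above 5 has no effect
        return repo.findSimilarDocuments(queryVector)
            .stream()
            .limit(limit)
            .map(Document::getContent)
            .toList();
    }
}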

Expert insight: Exact search compares the query against every stored vector, so latency grows linearly with the size of the database. For production systems with millions of vectors, you need Approximate Nearest Neighbor (ANN) search. Libraries like Faiss, or pgvector’s IVFFlat and HNSW indexes, make searches much faster at the cost of occasionally missing a true nearest neighbor.

Context: Why It’s the Secret Sauce

Context is the relevant information you feed to the language model alongside the user’s question. Plain-English definition: it’s the background material that helps the model give a grounded, accurate answer.

How it works under the hood: Language models have a limited “context window” — typically 4K to 128K tokens. Your RAG system must select the most relevant documents and fit them within this window. Too little context, and the model guesses. Too much, and it gets confused or runs out of space.

Real-world analogy: You’re explaining a complex project to a new team member. You don’t dump every email from the last year on their desk. You pick the three most relevant documents and say, “Start with these.”

Here’s how you manage context size in Spring Boot:

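import java.util.List;
import org.springframework.stereotype.Service;

@Service
public class ContextBuilder {
    private static final int MAX_TOKENS = 4000;
    private static final int RESPONSE_RESERVE = 500;
    private final TokenCounter tokenCounter;

    public ContextBuilder(TokenCounter tokenCounter) {
        this.tokenCounter = tokenCounter;
    }

    public String buildContext(String question, List<String> documents) {
        StringBuilder context = new StringBuilder();
        int tokensUsed = 0;

        // Reserve tokens for the question and the model's response
        int maxDocTokens = MAX_TOKENS - tokenCounter.count(question) - RESPONSE_RESERVE;

        // Documents are assumed to arrive most-relevant first
        for (String doc : documents) {
            int docTokens = tokenCounter.count(doc);
            if (tokensUsed + docTokens > maxDocTokens) {
                break; // Budget exhausted: drop this and all lower-ranked documents
            }
            context.append(doc).append("\n");
            tokensUsed += docTokens;
        }
        return context.toString();
    }
}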
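
The TokenCounter used above is left undefined in this tutorial. As a rough sketch, assuming it is a single-method interface, you could start with a character-based estimate. The four-characters-per-token ratio is only a common rule of thumb for English text; for exact budgets, use the tokenizer that matches your model.

```java
// Assumed shape of the TokenCounter collaborator (not defined in the tutorial).
interface TokenCounter {
    int count(String text);
}

// Rough estimate: about four characters per token, rounded up.
// This is a heuristic, not a real tokenizer, so leave headroom in your budget.
class ApproximateTokenCounter implements TokenCounter {
    private static final int CHARS_PER_TOKEN = 4;

    @Override
    public int count(String text) {
        if (text == null || text.isEmpty()) {
            return 0;
        }
        return (text.length() + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN;
    }
}
```

Because the estimate can undercount for code, non-English text, or unusual symbols, a conservative MAX_TOKENS is a sensible companion to it.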

Non-obvious insight: Context ordering matters. Models pay more attention to content at the beginning and end of their context window (primacy and recency effects). Place critical information at the start of your context for best results.
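
If you want to use both ends of the window, one possible approach is to interleave a best-first ranking so the strongest documents sit at the edges and the weakest in the middle. The sketch below is one such arrangement, with EdgeFirstOrdering as a hypothetical name; whether it helps depends on your model, so measure before adopting it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class EdgeFirstOrdering {
    // Input is ranked best-first. Even ranks fill the front in order,
    // odd ranks fill the back in reverse, so rank 1 ends up first and rank 2 last.
    static <T> List<T> reorder(List<T> rankedBestFirst) {
        List<T> front = new ArrayList<>();
        Deque<T> back = new ArrayDeque<>();
        for (int i = 0; i < rankedBestFirst.size(); i++) {
            if (i % 2 == 0) {
                front.add(rankedBestFirst.get(i));
            } else {
                back.addFirst(rankedBestFirst.get(i));
            }
        }
        front.addAll(back);
        return front;
    }
}
```

With five documents ranked 1 to 5, the result order is 1, 3, 5, 4, 2: the top document leads, the runner-up closes, and the least relevant ones land in the middle.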

Database: More Than Just Storage

A database in RAG isn’t just a place to store vectors. Plain-English definition: it’s a purpose-built system that stores both text content and its vector representations, optimized for similarity search.

How it works under the hood: Vector stores like Pinecone, Weaviate, or PostgreSQL with the pgvector extension store each document as a record with: (1) the original text, (2) its vector embedding, (3) metadata for filtering. They use specialized indexes (HNSW, IVF) to search through vectors at high speed.
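
In Java terms, the three-part record described above might look like the sketch below. StoredDocument and MetadataFilter are hypothetical names, and the in-memory filter only illustrates what metadata filtering means; a real vector database applies the filter inside the query.

```java
import java.util.List;
import java.util.Map;

// One stored record: original text, its embedding, and metadata for filtering.
record StoredDocument(String id, String content, float[] embedding,
                      Map<String, String> metadata) {}

class MetadataFilter {
    // Keeps only documents whose metadata has the given key set to the given value.
    static List<StoredDocument> filter(List<StoredDocument> docs, String key, String value) {
        return docs.stream()
            .filter(d -> value.equals(d.metadata().get(key)))
            .toList();
    }
}
```

A typical use is narrowing the candidate set before similarity ranking, for example to documents from one department or after a given date.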

Real-world analogy: A regular database is like a library sorted by author name. A vector database is like a library where books are arranged by how similar their topics are. You walk in, describe what you want, and the librarian points you to the nearest shelf.

Here’s a Spring Boot configuration for pgvector (PostgreSQL extension):

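import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.DependsOn;
import org.springframework.core.io.ClassPathResource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.init.DataSourceInitializer;
import org.springframework.jdbc.datasource.init.ResourceDatabasePopulator;
import org.springframework.transaction.annotation.EnableTransactionManagement;

@Configuration
@EnableTransactionManagement
public class VectorDatabaseConfig {

    @Bean
    public JdbcTemplate vectorJdbcTemplate(DataSource dataSource) {
        JdbcTemplate template = new JdbcTemplate(dataSource);
        // Enable the pgvector extension (the database role needs permission to create extensions)
        template.execute("CREATE EXTENSION IF NOT EXISTS vector");
        return template;
    }

    @Bean
    @DependsOn("vectorJdbcTemplate") // run schema.sql only after the extension exists
    public DataSourceInitializer dataSourceInitializer(DataSource dataSource) {
        // schema.sql is loaded from the classpath (src/main/resources)
        DataSourceInitializer initializer = new DataSourceInitializer();
        initializer.setDataSource(dataSource);
        initializer.setDatabasePopulator(
            new ResourceDatabasePopulator(new ClassPathResource("schema.sql")));
        return initializer;
    }
}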
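
Once the extension is enabled, a similarity query against pgvector is ordinary SQL. The sketch below shows the two pieces you would hand to a JdbcTemplate: the query text, using pgvector’s <=> cosine distance operator, and a helper that formats a float[] as a pgvector text literal. The table and column names (documents, content, embedding) are assumptions about what schema.sql defines, and PgVectorSql is a hypothetical helper.

```java
class PgVectorSql {
    // Nearest-neighbor query: <=> is pgvector's cosine distance operator,
    // so ascending order returns the most similar rows first.
    // Table and column names are assumed, not taken from the tutorial's schema.
    static final String NEAREST_SQL =
        "SELECT content FROM documents "
      + "ORDER BY embedding <=> CAST(? AS vector) LIMIT ?";

    // Formats a vector in pgvector's text form, for example [0.5,1.0,-2.0].
    static String toVectorLiteral(float[] vector) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < vector.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(vector[i]);
        }
        return sb.append(']').toString();
    }
}
```

With the JdbcTemplate bean above, a call along the lines of queryForList(PgVectorSql.NEAREST_SQL, String.class, PgVectorSql.toVectorLiteral(queryVector), 5) would return the five closest documents, assuming that schema.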

Gotcha: Not all databases handle vectors equally. Traditional relational databases don’t support vectors natively. pgvector adds vector support to PostgreSQL, but for pure speed, consider specialized databases like Pinecone or Qdrant.

Comparison Table: How Everything Fits Together

Concept	Job in RAG	Real-World Role	Key Implementation Detail
RAG	Orchestrator	The lawyer at their desk	Coordinates retrieval + generation
Vector	Representation	GPS coordinates of meaning	Embedding model output (384-1536 floats)
Search	Retrieval method	Finding the right shelf	Cosine similarity over vectors
Context	Input to LLM	The documents on the lawyer’s desk	Token-counted and ordered
Database	Storage engine	The filing cabinet	Specialized index for fast vector lookup

Key Takeaways

RAG is a pattern: retrieve first, then generate with that context.
Vectors encode meaning as numbers — similar ideas have similar vectors.
Search in RAG means vector similarity, not keyword matching.
Context must be carefully trimmed to fit the model’s context window.
Database choice matters: pgvector for simple needs, specialized DBs for high volume.
Always check vector dimensions match between your embedding model and database.
Order context strategically — early and late positions get more model attention.

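Build this RAG system, and your enterprise Java app will answer questions from your own documents rather than from the model’s memory alone. Retrieval does not eliminate hallucinations, but it sharply reduces them and lets you trace every answer to its sources. That’s the difference between guessing and knowing.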