AI/ML Architecture
Building a Production RAG Pipeline with Pinecone and Cohere Rerank
Ryan • Lead Architect
The Hallucination Problem
LLMs are brilliant at reasoning, but they lack specific context about your business. Ask one about your internal HR policy and it will confidently guess, often incorrectly. Retrieval-Augmented Generation (RAG) fixes this by retrieving your own data and injecting it into the prompt at query time.
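Concretely, "injecting your data" means the retrieved text travels with the question inside the prompt itself. A minimal sketch of the idea (the template wording here is illustrative, not a fixed format):

// Hypothetical prompt template: retrieved chunks are spliced in ahead of the question
function buildPrompt(context: string, question: string): string {
  return `Answer using ONLY the context below. If the answer is not there, say so.

Context:
${context}

Question: ${question}`;
}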
The Architecture
For enterprise RAG, a naive similarity search isn't enough. We use a two-stage retrieval pipeline:
- Dense Retrieval (Pinecone): Fast approximate nearest-neighbor search over embeddings to recall the top 50 candidate chunks of text.
- Reranking (Cohere): A cross-encoder model that reads the query and each chunk together, scores how relevant each chunk actually is, and keeps only the top 5.
Implementation
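The pipeline below assumes two inputs: the raw query string and its embedding. As a hedged sketch of producing that embedding with Cohere's Embed endpoint (the model name and inputType are illustrative assumptions; use whatever embedding model your index was built with):

import { CohereClient } from 'cohere-ai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });
const userQuery = 'What is our parental leave policy?'; // example query

// The query must be embedded with the same model family used to index the docs.
// 'embed-english-v3.0' and inputType 'search_query' are illustrative choices.
const embedRes = await cohere.embed({
  texts: [userQuery],
  model: 'embed-english-v3.0',
  inputType: 'search_query'
});

// Float embeddings come back as one vector per input text
const userQueryEmbedding = (embedRes.embeddings as number[][])[0];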
import { Pinecone } from '@pinecone-database/pinecone';
import { CohereClient } from 'cohere-ai';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

// 1. Initial retrieval: cast a wide net with fast dense vector search
const index = pinecone.Index('enterprise-docs');
const queryRes = await index.query({
  vector: userQueryEmbedding,
  topK: 50,
  includeMetadata: true
});
const documents = queryRes.matches.map(m => String(m.metadata?.text ?? ''));

// 2. Rerank: the cross-encoder rescores all 50 candidates against the query
const rerankRes = await cohere.rerank({
  query: userQuery,
  documents: documents,
  topN: 5,
  model: 'rerank-english-v3.0'
});

// Only the five highest-scoring chunks are injected into the LLM prompt
const finalContext = rerankRes.results
  .map(r => documents[r.index])
  .join('\n');
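The last step is handing that context to the generator. A minimal sketch assuming Cohere's Chat endpoint and the command-r model (an assumption, not a requirement; the pipeline is model-agnostic), reusing the buildPrompt template from earlier:

// Generate an answer grounded in the reranked context.
// 'command-r' is an illustrative model choice; swap in any LLM.
const chatRes = await cohere.chat({
  model: 'command-r',
  message: buildPrompt(finalContext, userQuery)
});

console.log(chatRes.text);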
This two-stage approach dramatically reduces hallucinations: the dense stage guarantees recall, and the reranker ensures the LLM is citing the correct internal documents rather than chunks that merely share vocabulary with the query.