How to Build a Production RAG Pipeline in Python: Step-by-Step with LangChain & Pinecone

April 27, 2026

What You Will Build

This tutorial walks through building a production RAG pipeline in Python from scratch. By the end, you will have a working system that ingests documents, splits them into chunks, stores embeddings in Pinecone, retrieves relevant context at query time, and generates grounded answers using an LLM. The stack: Python 3.11+, LangChain, Pinecone, and OpenAI.

Why LangChain and Pinecone

LangChain handles the orchestration layer: document loading, text splitting, prompt assembly, and LLM calls. Pinecone handles the vector storage and similarity search. You could swap Pinecone for Weaviate or Qdrant, and the LangChain code barely changes. That flexibility matters when requirements shift mid-project.

This combination works well for teams that want to move fast without building retrieval infrastructure from scratch. For hobby projects, Chroma (in-memory) is simpler. For production workloads with millions of vectors, Pinecone's managed infrastructure saves operational headaches.

Step 1: Set Up Your Environment

Install the required packages:

pip install langchain langchain-community langchain-openai langchain-pinecone pinecone python-dotenv tiktoken

Create a .env file with your API keys:

OPENAI_API_KEY=sk-...
PINECONE_API_KEY=pcsk_...
PINECONE_INDEX_NAME=rag-production

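Create the Pinecone index through their dashboard or via the SDK. The index dimension must match the embedding model: use 3072 for text-embedding-3-large (the model used in this tutorial), or 1536 if you are using OpenAI's text-embedding-ada-002. Pinecone rejects vectors whose dimension differs from the index's.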
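If you prefer the SDK route, a minimal sketch might look like the following. It assumes the current pinecone package and a serverless index. The cloud and region defaults, the EMBEDDING_DIMENSIONS table, and the ensure_index helper are illustrative names, not part of the original setup.

```python
import os

# Output dimensions of the OpenAI embedding models discussed in this step.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def index_dimension(model: str) -> int:
    """Return the vector dimension the Pinecone index must use for a model."""
    try:
        return EMBEDDING_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model}") from None


def ensure_index(index_name: str, model: str,
                 cloud: str = "aws", region: str = "us-east-1") -> None:
    """Create the index if it does not exist yet (hypothetical helper)."""
    # Imported lazily so the dimension lookup works without the SDK installed.
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=index_dimension(model),
            metric="cosine",  # assumed metric; choose what suits your data
            spec=ServerlessSpec(cloud=cloud, region=region),
        )


# Usage (requires a valid PINECONE_API_KEY in the environment):
# ensure_index(os.environ["PINECONE_INDEX_NAME"], "text-embedding-3-large")
```

Deriving the dimension from the model name keeps the index and the embedding model from drifting apart when you change one of them.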

Step 2: Load and Chunk Your Documents

Document loading depends on your source format. LangChain ships loaders for PDFs, CSVs, HTML, Notion exports, and about 80 other formats.

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load every .txt file under ./docs, recursing into subdirectories.
loader = DirectoryLoader('./docs', glob='**/*.txt', loader_cls=TextLoader)
documents = loader.load()

# Try paragraph breaks first, then line breaks, sentence ends, and finally spaces.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # measured in characters, not tokens
    chunk_overlap=100,
    separators=['\n\n', '\n', '. ', ' ']
)
chunks = splitter.split_documents(documents)

Chunk size matters more than most tutorials admit. If chunks are too large (2000+ tokens), the retriever returns noisy context; if they are too small (under 200 tokens), you lose paragraph-level coherence. Keep the units straight: RecursiveCharacterTextSplitter measures chunk_size in characters by default, not tokens. Start with 600 to 800 characters and adjust based on your retrieval precision metrics.
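Before tuning, it helps to see what the splitter actually produced. The sketch below summarizes chunk lengths and flags outliers. The chunk_length_report helper, its thresholds, and the 4-characters-per-token ratio are assumptions for illustration; use tiktoken if you need exact token counts.

```python
from statistics import mean

# Rough heuristic for English text; an assumption, not a measurement.
CHARS_PER_TOKEN = 4


def chunk_length_report(texts, min_chars=200, max_chars=2000):
    """Summarize chunk lengths and count chunks outside [min_chars, max_chars]."""
    lengths = [len(t) for t in texts]
    if not lengths:
        raise ValueError("no chunks to report on")
    return {
        "count": len(lengths),
        "min_chars": min(lengths),
        "mean_chars": round(mean(lengths), 1),
        "max_chars": max(lengths),
        "approx_mean_tokens": round(mean(lengths) / CHARS_PER_TOKEN),
        "too_short": sum(1 for n in lengths if n < min_chars),
        "too_long": sum(1 for n in lengths if n > max_chars),
    }


# Usage with the chunks from this step:
# print(chunk_length_report([c.page_content for c in chunks]))
```

A large too_short count usually means the separators are splitting on headings or list items, which is worth checking before you change chunk_size.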

Step 3: Generate Embeddings and Store in Pinecone

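import os

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

load_dotenv()  # reads OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_INDEX_NAME from .env

embeddings = OpenAIEmbeddings(model='text-embedding-3-large')

# Embed every chunk and upsert the vectors into the existing index.
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=os.environ['PINECONE_INDEX_NAME']
)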

For large document sets (50,000+ chunks), batch the upserts. Pinecone limits the size of each upsert request, and batches of around 100 vectors are a common choice. LangChain handles batching internally, but watch for rate limits on the OpenAI embeddings API. Add a small delay between batches, or better, retry failed calls with exponential backoff.
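If you decide to control batching and retries yourself, the pattern can be sketched in plain Python. The batched and with_backoff helpers below are hypothetical names, and the retry catches every exception for brevity; in practice you would narrow it to your client's rate-limit and timeout errors.

```python
import random
import time


def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            # The delay doubles on each attempt (1s, 2s, 4s, ...) plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


# Usage, assuming `vectorstore` and `chunks` from the steps above:
# for batch in batched(chunks, 100):
#     with_backoff(lambda b=batch: vectorstore.add_documents(b))
```

The jitter spreads retries out so that parallel workers do not all hit the API again at the same instant.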

Step 4: Build the Retrieval Chain

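from langchain_openai import ChatOpenAI
# RetrievalQA is a legacy chain: this import path works on LangChain 0.x.
# On 1.x releases the legacy chains are expected to live in the separate
# langchain-classic package (langchain_classic.chains), so pin your version or adjust.
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate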

llm = ChatOpenAI(model='gpt-4o', temperature=0)

prompt_template = PromptTemplate(
    input_variables=['context', 'question'],
    template="""Answer the question based only on the following context. If the context does not contain enough information, say so.

Context:
{context}

Question: {question}

Answer:"""
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    chain_type_kwargs={'prompt': prompt_template},
    return_source_documents=True
)

The retriever pulls the top 5 most similar chunks. You can increase k for broader recall or decrease it to reduce noise and token costs. In our experience, k=4 to k=6 works well for most document sets.
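To make "top 5 most similar" concrete, here is a conceptual sketch of top-k retrieval by cosine similarity in plain Python. It assumes a cosine-metric index and does an exact scan. Pinecone uses approximate search at scale, so treat this as an illustration of the ranking idea, not of its internals. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=5):
    """Return (chunk_id, score) pairs for the k most similar chunks, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Raising k only appends lower-scoring chunks to the end of this list, which is why recall improves while noise and token cost rise.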

Step 5: Query the Pipeline

result = qa_chain.invoke({'query': 'What is our refund policy?'})
print(result['result'])
# Each source document carries the file path set by the loader.
for doc in result['source_documents']:
    print(f"Source: {doc.metadata['source']}")

Production Hardening Checklist

A working demo is not a production system. Before deploying, handle these:

Metadata filtering: Tag chunks with source, date, and category metadata. Use Pinecone's metadata filters to scope retrieval by user, tenant, or document type.
Caching: Cache frequent queries to avoid redundant embedding lookups and LLM calls. Redis works well here.
Monitoring: Log every query, the retrieved chunks, and the generated answer. You will need these logs for debugging bad answers.
Evaluation: Set up an eval pipeline that tests retrieval precision and answer accuracy against a golden dataset of question-answer pairs.
Error handling: Pinecone and OpenAI both have outages. Wrap calls in retry logic with timeouts.
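The caching and monitoring items can be combined in one thin wrapper around the chain. The sketch below is one possible shape, not a definitive design. CachedQA is a hypothetical class, the in-memory dict stands in for Redis, and the log records source paths, not full chunk text.

```python
import hashlib
import json
import logging
import time

logger = logging.getLogger("rag")


class CachedQA:
    """Wrap an answer function with a TTL cache and structured query logging."""

    def __init__(self, answer_fn, ttl_seconds=300, clock=time.monotonic):
        self._answer_fn = answer_fn  # e.g. lambda q: qa_chain.invoke({"query": q})
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key -> (expires_at, result); swap for Redis in production

    @staticmethod
    def _key(question):
        # Normalize case and whitespace so trivially different queries share a key.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, question):
        key = self._key(question)
        now = self._clock()
        hit = self._cache.get(key)
        if hit and hit[0] > now:
            logger.info(json.dumps({"query": question, "cache": "hit"}))
            return hit[1]
        result = self._answer_fn(question)
        self._cache[key] = (now + self._ttl, result)
        # Log the query, the sources of the retrieved chunks, and the answer.
        logger.info(json.dumps({
            "query": question,
            "cache": "miss",
            "sources": [d.metadata.get("source")
                        for d in result.get("source_documents", [])],
            "answer": result.get("result"),
        }))
        return result
```

Keep the TTL short for content that changes often, since a cached answer will not reflect documents ingested after it was stored.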

Common Pitfalls

Chunking by fixed character count without respecting sentence boundaries. This splits sentences in half, which confuses the retriever.

Not including metadata in chunks. If the retriever pulls a chunk that says "as mentioned in Section 3," the LLM has no way to resolve that reference without source metadata.
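One way to attach that metadata, and later filter on it, is sketched below. The field names (tenant, category, section) and both helpers are hypothetical; the filter dictionary uses Pinecone's $eq and $in operators.

```python
def tag_chunk(metadata, *, tenant, category, section):
    """Return a copy of a chunk's metadata with filterable fields added."""
    tagged = dict(metadata)  # copy so the caller's dict is not mutated
    tagged.update({"tenant": tenant, "category": category, "section": section})
    return tagged


def tenant_filter(tenant, categories=None):
    """Build a Pinecone metadata filter scoped to one tenant."""
    flt = {"tenant": {"$eq": tenant}}
    if categories:
        flt["category"] = {"$in": list(categories)}
    return flt


# Usage, assuming `chunks` and `vectorstore` from the steps above:
# for chunk in chunks:
#     chunk.metadata = tag_chunk(chunk.metadata, tenant="acme",
#                                category="policies", section="3")
# retriever = vectorstore.as_retriever(
#     search_kwargs={"k": 5, "filter": tenant_filter("acme", ["policies"])}
# )
```

Tagging has to happen before the chunks are embedded and upserted, because the filter can only match metadata stored alongside each vector.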

Skipping the evaluation step. Teams often build the pipeline, try three queries manually, and call it done. Then it fails on edge cases in production.
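A first evaluation does not need a framework. The sketch below computes a retrieval hit rate over a golden dataset. The retrieval_hit_rate helper and the question/expected_source keys are hypothetical, and hit rate is only one of several metrics you might track.

```python
def retrieval_hit_rate(golden, retrieve_fn, k=5):
    """Fraction of golden questions whose expected source appears in the top k.

    golden: list of {"question": str, "expected_source": str}
    retrieve_fn: maps a question to a ranked list of source identifiers
    """
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = 0
    for example in golden:
        retrieved = retrieve_fn(example["question"])[:k]
        if example["expected_source"] in retrieved:
            hits += 1
    return hits / len(golden)


# Usage, assuming `vectorstore` from the steps above:
# def retrieve_sources(question):
#     docs = vectorstore.similarity_search(question, k=5)
#     return [d.metadata["source"] for d in docs]
# print(retrieval_hit_rate(golden_examples, retrieve_sources))
```

Run it after every change to chunk size, k, or the embedding model, so that regressions show up as a number and not as a user complaint.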

Frequently Asked Questions

How much does a Pinecone-based RAG pipeline cost to run?

The Pinecone starter tier is free for up to 100,000 vectors. Beyond that, the standard plan starts around $70/month. OpenAI embedding costs are roughly $0.13 per million tokens for text-embedding-3-large. The LLM inference cost depends on your query volume and model choice.
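The embedding part of that estimate is simple arithmetic. The sketch below uses the $0.13 per million tokens figure quoted above; the chunk count and average tokens per chunk are made-up inputs for illustration.

```python
def embedding_cost_usd(num_chunks, avg_tokens_per_chunk, usd_per_million_tokens=0.13):
    """Estimate the one-time cost of embedding a corpus."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


# 50,000 chunks at about 200 tokens each is 10 million tokens.
print(round(embedding_cost_usd(50_000, 200), 2))  # prints 1.3
```

Re-embedding after a chunking or model change incurs this cost again, so it is worth settling chunk size on a sample before ingesting the full corpus.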

Can I use this with open-source models instead of OpenAI?

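Yes. Swap ChatOpenAI for ChatOllama or a HuggingFace endpoint. The LangChain interface stays the same. You will need a local embedding model too, such as sentence-transformers/all-MiniLM-L6-v2. That model outputs 384-dimensional vectors, so create a new Pinecone index with dimension 384 and re-embed your documents; vectors from different embedding models cannot share an index.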