Dharmendra Singh

Building RAG Applications with LangChain (Part 3)

Part 3: Text Splitters — The Art of Chunking for LLMs

Why Do We Need Text Splitting?

Large documents can overwhelm an LLM when passed in whole. Text splitting is essential in Retrieval-Augmented Generation (RAG) systems for several reasons:

  • Breaks long documents into manageable, context-rich chunks
  • Improves vector search accuracy (better embeddings)
  • Enables retrieving only relevant content
  • Prevents exceeding token limits of LLM prompts

Without smart chunking, your RAG pipeline may hallucinate or return irrelevant results.

Key Concepts

Chunk Size

The maximum size of each split, typically in characters or tokens.

  • Bigger chunks = more context, but risk overflow
  • Smaller chunks = less context, but safer for prompt limits

Chunk Overlap

Extra content from the previous chunk to maintain continuity.

  • Helps the model retain context across chunks
  • Common values: 30–50 tokens or characters
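
To see both knobs in action, here is a minimal sketch with toy values (real chunks are usually hundreds of characters). Notice how the tail of one chunk reappears at the head of the next:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "Text splitting breaks long documents into pieces. "
    "Overlap repeats a little content between neighboring chunks "
    "so no idea is cut off mid-thought."
)

# Toy values for demonstration only
splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=15)

for chunk in splitter.split_text(text):
    print(repr(chunk))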

Common Text Splitters in LangChain

LangChain offers various built-in splitters, each optimized for different use cases:

1. CharacterTextSplitter

  • Simple, general-purpose splitter by character length
  • Works well for raw or unstructured text (see the sketch after this list)

2. RecursiveCharacterTextSplitter

  • Smart splitter that tries to preserve structure (e.g., paragraphs, sections)
  • Ideal for Markdown, source code, or articles

3. TokenTextSplitter

  • Token-aware: splits by token count rather than characters (uses tiktoken under the hood)
  • Prevents prompt overflow with models that enforce strict token limits

4. MarkdownHeaderTextSplitter

  • Splits based on heading levels in Markdown documents
  • Great for blogs, technical docs, wikis

5. Language-Specific Splitters

  • e.g., PythonCodeTextSplitter
  • Maintains function/class blocks in source code files
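
For reference on the first splitter above, a minimal CharacterTextSplitter sketch: it splits on a single separator, then merges the pieces back up to chunk_size. Here `raw_text` is an assumption, any plain string you have already loaded:

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n\n",  # split on blank lines, then merge up to chunk_size
    chunk_size=500,
    chunk_overlap=50,
)

# Assumption: `raw_text` is a plain string, e.g. the contents of a .txt file
chunks = splitter.split_text(raw_text)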

How Do Text Splitters Work?

Step-by-Step Breakdown

  1. Receive Raw Content
  • Usually Document objects loaded from PDFs, web pages, etc.
  2. Choose a Splitting Strategy
  • By characters: delimiters such as \n\n, \n, or ". "
  • By tokens: using a tokenizer
  • By structure: headers, code blocks
  3. Split into Segments
  • Uses a hierarchy: tries the largest delimiter first (\n\n → \n → space)
  • If a segment is still too long, falls back to character-level splits
  4. Build Overlapping Chunks
  • Ensures each chunk fits within chunk_size
  • Adds chunk_overlap characters or tokens for context preservation
  5. Return New Document Chunks
  • Each chunk retains metadata (source, page number, etc.)
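
Code Example 1 below uses the default hierarchy; you can also spell it out yourself via the separators parameter. A minimal sketch, assuming `documents` is a list of loaded Document objects:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Explicit fallback order: paragraphs -> lines -> words -> characters
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=300,
    chunk_overlap=30,
)

chunks = splitter.split_documents(documents)
print(chunks[0].metadata)  # source, page number, etc. are carried over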

Code Example 1: Recursive Character Splitter

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # max characters per chunk
    chunk_overlap=50   # characters shared with the previous chunk
)

# `documents` is a list of Document objects produced by a loader
chunks = splitter.split_documents(documents)

print(f"Chunks created: {len(chunks)}")
print(chunks[0].page_content)

Recommended for most use cases. Intelligently handles structure and fallback splitting.

Code Example 2: Token Splitter

from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=200,   # measured in tokens, not characters
    chunk_overlap=20
)

chunks = splitter.split_documents(documents)

Useful when working with LLMs that have strict token limits (e.g., OpenAI, Gemini, Claude).
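
TokenTextSplitter counts tokens with tiktoken under the hood. If you want to sanity-check the chunk sizes yourself, here is a quick sketch; the cl100k_base encoding is an assumption, so match it to your model (TokenTextSplitter also accepts an encoding_name argument so the splitter and the check agree):

import tiktoken

# Assumption: cl100k_base; pick the encoding that matches your model
enc = tiktoken.get_encoding("cl100k_base")

for chunk in chunks[:3]:
    print(len(enc.encode(chunk.page_content)), "tokens")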

Code Example 3: Markdown Header Splitter

from langchain.text_splitter import MarkdownHeaderTextSplitter  

md_text = """# RAG Tutorial  
LangChain is awesome.  

## Embeddings  
This is how it works."""  

splitter = MarkdownHeaderTextSplitter(  
    headers_to_split_on=[('#', 'H1'), ('##', 'H2')]  
)  

docs = splitter.split_text(md_text)

Best for docs, blogs, or tutorials with a clear header structure.
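
The matched headers land in each chunk's metadata, which is handy for filtering at retrieval time. For the snippet above, printing the results shows something like:

for doc in docs:
    print(doc.metadata, "->", doc.page_content)

# {'H1': 'RAG Tutorial'} -> LangChain is awesome.
# {'H1': 'RAG Tutorial', 'H2': 'Embeddings'} -> This is how it works.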

Mini Workflow Example

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF into Document objects (one per page)
loader = PyPDFLoader("data/report.pdf")
documents = loader.load()

# Split into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Preview the first two chunks
for chunk in chunks[:2]:
    print(chunk.metadata)
    print(chunk.page_content[:100])

Best Practices

  • Use RecursiveCharacterTextSplitter for general use
  • Always set chunk_overlap (30–50) to retain context
  • Keep chunk_size within your model’s max context window
  • Clean up input data before splitting, especially text extracted from scanned PDFs (see the sketch below)
  • Preserve original metadata (title, page number, etc.) in each chunk
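
As a starting point for that cleanup step, a minimal sketch; the regexes are illustrative and should be tuned to your documents:

import re

def clean_text(text: str) -> str:
    # Collapse runs of spaces/tabs and excessive blank lines, both common
    # artifacts in text extracted from scanned PDFs
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

for doc in documents:
    doc.page_content = clean_text(doc.page_content)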

TL;DR — What to Do / What to Avoid

Do This:

  • Use structure-aware splitters like RecursiveCharacterTextSplitter
  • Tune chunk_size and chunk_overlap to match your use case
  • Retain and attach document metadata to each chunk
  • Use token-aware splitting for LLM compatibility

Avoid This:

  • Splitting by fixed length without overlap
  • Using chunks that are too small or too large
  • Dropping metadata (leads to loss of context)
  • Ignoring token limitations of your LLM

Coming Next

In Part 4, we’ll explore Embeddings and Vector Stores — turning chunks into vectors and enabling semantic search through similarity.
