Part 3: Text Splitters — The Art of Chunking for LLMs
Series Progress
- Part 1: RAG Architecture
- Part 2: Document Loaders
- You're here: Part 3 — Text Splitting
Why Do We Need Text Splitting?
Large documents can overwhelm LLMs if passed in raw. Text splitting is essential in Retrieval-Augmented Generation (RAG) systems for these reasons:
- Breaks long documents into manageable, context-rich chunks
- Improves vector search accuracy (better embeddings)
- Enables retrieving only relevant content
- Prevents exceeding token limits of LLM prompts
Without smart chunking, your RAG pipeline may hallucinate or return irrelevant results.
Key Concepts
Chunk Size
The maximum size of each split, typically in characters or tokens.
- Bigger chunks = more context, but risk overflow
- Smaller chunks = less context, but safer for prompt limits
Chunk Overlap
Extra content from the previous chunk to maintain continuity.
- Helps the model retain context across chunks
- Common values: 30–50 tokens or characters
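To see both knobs in action, here is a minimal sketch using `CharacterTextSplitter` on a toy string (the text and the unusually small values are purely illustrative):

```python
from langchain.text_splitter import CharacterTextSplitter

text = (
    "RAG systems retrieve relevant chunks. "
    "Chunks are embedded as vectors. "
    "Vectors enable semantic search."
)

# separator="" forces pure character-length splitting for the demo
splitter = CharacterTextSplitter(separator="", chunk_size=40, chunk_overlap=10)

for i, chunk in enumerate(splitter.split_text(text)):
    print(f"Chunk {i}: {chunk!r}")
# Consecutive chunks share roughly 10 characters, so content near a
# split point still appears in full in at least one chunk.
```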
Common Text Splitters in LangChain
LangChain offers various built-in splitters, each optimized for different use cases:
1. CharacterTextSplitter
- Simple, general-purpose splitter by character length
- Works well for raw or unstructured text
2. RecursiveCharacterTextSplitter
- Smart splitter that tries to preserve structure (e.g., paragraphs, sections)
- Ideal for Markdown, source code, or articles
3. TokenTextSplitter
- Token-aware: counts tokens instead of characters (using OpenAI's tiktoken tokenizer by default)
- Prevents prompt overflow
4. MarkdownHeaderTextSplitter
- Splits based on heading levels in Markdown documents
- Great for blogs, technical docs, wikis
5. Language-Specific Splitters
- e.g., `PythonCodeTextSplitter`
- Maintains function/class blocks in source code files (see the sketch below)
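As a quick taste of language-aware splitting, here is a minimal sketch using `RecursiveCharacterTextSplitter.from_language` (the code sample and parameter values are placeholders):

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_source = '''
def load_docs(path):
    """Read a file from disk."""
    return open(path).read()

class Retriever:
    def query(self, question):
        return []
'''

# from_language selects Python-aware separators (class/def boundaries first)
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=120,
    chunk_overlap=0,
)

for doc in splitter.create_documents([python_source]):
    print("---")
    print(doc.page_content)
```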
How Do Text Splitters Work?
Step-by-Step Breakdown
1. Receive Raw Content
- Usually as `Document` objects loaded from PDFs, web pages, etc.
2. Choose a Splitting Strategy
- By characters: `\n`, `.`, etc.
- By tokens: using a tokenizer
- By structure: headers, code blocks
3. Split into Segments
- Uses a hierarchy: try the largest delimiter first (`\n\n` → `\n` → `.` → space)
- If a segment is still too long, falls back to character-level splits
4. Build Overlapping Chunks
- Ensures each chunk fits within `chunk_size`
- Adds `chunk_overlap` tokens (or characters) for context preservation
5. Return New Document Chunks
- Each chunk retains metadata (source, page number, etc.)
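To make the delimiter hierarchy in step 3 concrete, here is a toy sketch of the recursive idea (a simplified illustration, not LangChain's actual implementation; overlap handling and merging of small pieces are omitted):

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_split(text, chunk_size, separators=SEPARATORS):
    """Split text with the largest separator that works, recursing as needed."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No delimiters left: fall back to hard character-level cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```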
Code Example 1: Recursive Character Splitter
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# `documents` is the list of Document objects produced by a loader (see Part 2)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

chunks = splitter.split_documents(documents)
print(f"Chunks created: {len(chunks)}")
print(chunks[0].page_content)
```
Recommended for most use cases. Intelligently handles structure and fallback splitting.
Code Example 2: Token Splitter
```python
from langchain.text_splitter import TokenTextSplitter

# Same `documents` as above; sizes are now measured in tokens, not characters
splitter = TokenTextSplitter(
    chunk_size=200,
    chunk_overlap=20
)

chunks = splitter.split_documents(documents)
```
Useful when working with LLMs that have strict token limits (e.g., OpenAI, Gemini, Claude).
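To confirm that the resulting chunks actually respect a model's budget, you can count tokens directly; a small sketch, assuming the tiktoken package and its `cl100k_base` encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# With chunk_size=200 above, each chunk should hold roughly 200 tokens;
# counts can differ slightly if the splitter used a different encoding
for chunk in chunks:
    print(len(enc.encode(chunk.page_content)))
```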
Code Example 3: Markdown Header Splitter
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

md_text = """# RAG Tutorial
LangChain is awesome.
## Embeddings
This is how it works."""

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "H1"), ("##", "H2")]
)

docs = splitter.split_text(md_text)
```
Best for docs, blogs, or tutorials with clear header structure.
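Each returned chunk carries the headers it falls under as metadata, keyed by the names supplied above (`H1`, `H2`); for example:

```python
for doc in docs:
    print(doc.metadata, "->", doc.page_content)
# e.g. {'H1': 'RAG Tutorial'} -> LangChain is awesome.
#      {'H1': 'RAG Tutorial', 'H2': 'Embeddings'} -> This is how it works.
```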
Mini Workflow Example
```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load PDF
loader = PyPDFLoader("data/report.pdf")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Preview first chunks
for chunk in chunks[:2]:
    print(chunk.metadata)
    print(chunk.page_content[:100])
```
Best Practices
- Use `RecursiveCharacterTextSplitter` for general use
- Always set `chunk_overlap` (30–50) to retain context
- Keep `chunk_size` within your model's max context window
- Clean up input data before splitting (especially scanned PDFs)
- Preserve original metadata (title, page number, etc.) in each chunk, as shown in the sketch below
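One lightweight way to follow the metadata advice: the loader's metadata (source, page) is already copied into every chunk by `split_documents`, and you can append your own fields at split time. A small sketch with a hypothetical `chunk_id` field, reusing `chunks` from the workflow above:

```python
for i, chunk in enumerate(chunks):
    # Tag each chunk with its position so retrieved passages
    # can be traced back and re-ordered later
    chunk.metadata["chunk_id"] = i

print(chunks[0].metadata)
# e.g. {'source': 'data/report.pdf', 'page': 0, 'chunk_id': 0}
```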
TL;DR — What to Do / What to Avoid
Do This:
- Use structure-aware splitters like `RecursiveCharacterTextSplitter`
- Tune `chunk_size` and `chunk_overlap` to match your use case
- Retain and attach document metadata to each chunk
- Use token-aware splitting for LLM compatibility
Avoid This:
- Splitting by fixed length without overlap
- Using chunks that are too small or too large
- Dropping metadata (leads to loss of context)
- Ignoring token limitations of your LLM
Coming Next
In Part 4, we’ll explore Embeddings and Vector Stores — turning chunks into vectors and enabling semantic search through similarity.