Chunking: Built-in methods
Five chunkers ship under railtracks.retrieval.chunking. Pick one based on
your source format and whether you need offsets back.
Summary
| Chunker | Best for | Offsets on Chunk? |
|---|---|---|
| RecursiveCharacterChunker | Default choice. General text and markdown bodies. | Yes (character spans) |
| MarkdownHeaderChunker | Markdown with # / ## hierarchy; header context in metadata. | Yes (when body spans are known) |
| SentenceChunker | Sentence-window retrieval; overlap measured in sentences. | Yes |
| SemanticChunker | Topic boundaries via embeddings; variable chunk size. | Yes (unit spans in source text) |
| FixedTokenChunker | Hard token budget per chunk (e.g. matching embedder max). | No (see note) |
RecursiveCharacterChunker
Recursively splits on an ordered list of separators (paragraphs → lines →
sentence-like breaks → words → characters), then merges fragments into
chunks of at most chunk_size units (characters by default, or whatever
length_fn measures), with overlap between adjacent chunks.
from railtracks.retrieval import Document, DocumentType  # assumes DocumentType is exported alongside Document
from railtracks.retrieval.chunking import RecursiveCharacterChunker
doc = Document(content=long_text, type=DocumentType.TEXT, source="doc.txt")
chunks = RecursiveCharacterChunker(
    chunk_size=800,
    overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],  # optional; sensible defaults exist
).chunk(doc)
When to use: unstructured or lightly structured text where you don't
need heading-aware metadata. Pair with a token length_fn (a
Tokenizer.count) if you also want to budget by tokens while keeping
character offsets; that combination is what FixedTokenChunker gives up.
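A minimal sketch of the split-then-merge idea described above, in plain Python and measuring length in characters only. It is not the library's implementation: recursive_split, split_keep_sep, and merge_fragments are hypothetical helpers, and the overlap policy (carry whole trailing fragments into the next chunk) is an assumption.

```python
from typing import List


def split_keep_sep(text: str, sep: str) -> List[str]:
    """Split on sep, keeping the separator attached to the end of each piece."""
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Break text into fragments of at most chunk_size characters,
    trying the coarsest separator first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left: hard-cut by characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    fragments: List[str] = []
    for piece in split_keep_sep(text, sep):
        fragments.extend(recursive_split(piece, finer, chunk_size))
    return fragments


def merge_fragments(fragments: List[str], chunk_size: int, overlap: int) -> List[str]:
    """Greedily pack fragments into chunks of at most chunk_size characters.
    Trailing fragments totalling at most `overlap` characters are repeated
    at the start of the next chunk (assumed overlap policy)."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for frag in fragments:
        if current and current_len + len(frag) > chunk_size:
            chunks.append("".join(current))
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                if carried_len + len(prev) > overlap:
                    break
                carried.insert(0, prev)
                carried_len += len(prev)
            current, current_len = carried, carried_len
            # Drop carried fragments if they would push the next chunk over budget.
            while current and current_len + len(frag) > chunk_size:
                current_len -= len(current.pop(0))
        current.append(frag)
        current_len += len(frag)
    if current:
        chunks.append("".join(current))
    return chunks
```

Because separators stay attached to their fragments, joining the fragments reproduces the input exactly, which is what makes character offsets recoverable in a scheme like this.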
MarkdownHeaderChunker
Splits markdown on heading lines matching configured # prefixes. Each
emitted chunk carries heading context in metadata (headers, section).
If chunk_size is set and a section body is too long, the body is split
further using a fallback splitter (defaults to a zero-overlap RecursiveSplitter).
from railtracks.retrieval import Document, DocumentType  # assumes DocumentType is exported alongside Document
from railtracks.retrieval.chunking import MarkdownHeaderChunker
doc = Document(content=md_text, type=DocumentType.MARKDOWN, source="guide.md")
chunks = MarkdownHeaderChunker(
    headers_to_split_on=["#", "##", "###"],  # optional
    chunk_size=1000,  # optional; omit to never subdivide bodies
).chunk(doc)
When to use: markdown knowledge bases, READMEs, documentation sites - anywhere section boundaries are meaningful for retrieval (a hit in "Authentication > OAuth Setup" tells you more than a hit somewhere on page 4).
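To make the heading-context idea concrete, here is a small standalone sketch of header-based splitting. It is an illustration under assumptions, not the library's code: split_by_headers is a hypothetical function, the shapes of the headers and section values are guesses based on the metadata keys named above, and fenced code blocks containing # lines are not handled.

```python
import re
from typing import Dict, List, Sequence

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(md_text: str, prefixes: Sequence[str] = ("#", "##", "###")) -> List[dict]:
    """Split markdown into sections, recording the heading path of each section."""
    levels = {len(p) for p in prefixes}
    sections: List[dict] = []
    path: Dict[int, str] = {}   # heading level -> current title at that level
    body: List[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            headers = [path[level] for level in sorted(path)]
            sections.append({
                "content": text,
                "headers": headers,
                "section": " > ".join(headers),
            })
        body.clear()

    for line in md_text.splitlines():
        match = HEADING_RE.match(line)
        if match and len(match.group(1)) in levels:
            flush()
            level = len(match.group(1))
            # A new heading replaces any heading at the same or a deeper level.
            for stale in [lvl for lvl in path if lvl >= level]:
                del path[stale]
            path[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return sections
```

With this shape, a hit inside the OAuth section of an authentication guide carries the full path "Authentication > OAuth Setup" rather than just the body text.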
SentenceChunker
Detects sentence boundaries (default: regex on ./!/? + whitespace),
then groups chunk_size consecutive sentences with overlap
sentences shared between adjacent windows. Each chunk gets
metadata["sentence_count"].
from railtracks.retrieval import Document, DocumentType  # assumes DocumentType is exported alongside Document
from railtracks.retrieval.chunking import SentenceChunker
doc = Document(content=long_text, type=DocumentType.TEXT, source="article.txt")
chunks = SentenceChunker(chunk_size=5, overlap=1).chunk(doc)
When to use: sentence-window expansion (retrieve narrow windows, then
fetch neighbouring sentences for context). Inject a custom Splitter if
the regex default is too crude for your language.
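The windowing arithmetic can be sketched without the library. The snippet below is an assumption-laden illustration: sentence_spans and sentence_windows are hypothetical names, the regex is only one plausible reading of "./!/? + whitespace", and the dict keys mirror the metadata described above rather than the real Chunk type.

```python
import re
from typing import List, Tuple

# A boundary is whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of sentences in text."""
    spans: List[Tuple[int, int]] = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        if match.start() > start:
            spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def sentence_windows(text: str, chunk_size: int, overlap: int) -> List[dict]:
    """Group chunk_size consecutive sentences, sharing `overlap` sentences
    between adjacent windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    spans = sentence_spans(text)
    step = chunk_size - overlap
    windows: List[dict] = []
    for i in range(0, len(spans), step):
        group = spans[i:i + chunk_size]
        start, end = group[0][0], group[-1][1]
        windows.append({
            "content": text[start:end],
            "offsets": (start, end),
            "sentence_count": len(group),
        })
        if i + chunk_size >= len(spans):
            break  # the last sentence is already covered
    return windows
```

Each window advances by chunk_size - overlap sentences, so overlap must be strictly smaller than chunk_size for the loop to make progress.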
SemanticChunker
Splits a document into units (sentences by default via RegexSentenceSplitter), embeds each unit with an injected Embedding provider, and starts a new chunk wherever the cosine distance between neighboring unit embeddings exceeds a percentile-based threshold; the units between two breakpoints are merged into one chunk. Chunk count and size adapt to the document rather than a fixed window.
from railtracks.retrieval import Document, DocumentType  # assumes DocumentType is exported alongside Document
from railtracks.retrieval.chunking import SemanticChunker
from railtracks.retrieval.embedding import OpenAIEmbedding
doc = Document(content=long_text, type=DocumentType.TEXT, source="article.txt")
chunks = SemanticChunker(
    embedder=OpenAIEmbedding(),
    threshold_percentile=95.0,
).chunk(doc)
async def async_chunking():
    # Async pipelines: prefer achunk (calls embedder.aembed)
    chunks = await SemanticChunker(embedder=OpenAIEmbedding()).achunk(doc)
    return chunks
| Parameter | Description |
|---|---|
| embedder | Required Embedding implementation. chunk() uses embed(); achunk() uses aembed(). |
| sentence_splitter | Optional Splitter for units. Defaults to RegexSentenceSplitter. |
| threshold_percentile | Percentile (0–100) of the cosine distances between adjacent units in the document; distances above this value become breakpoints. Higher → fewer, larger chunks. Default 95.0. |
| combine_neighbors | When True, each string sent to the embedder includes neighboring units for richer context. Chunk text and offsets still come from original unit spans in document.content. |
| window | Neighbor radius on each side when combine_neighbors=True. Default 1. |
Pipeline (high level):
- Split document.content into positioned units (text, start, end).
- Embed unit texts (or contextualized strings if combine_neighbors=True).
- Compute the cosine distance between each adjacent embedding pair.
- Break after units where distance exceeds numpy.percentile(distances, threshold_percentile).
- Merge units between breakpoints; each chunk is document.content[first_start:last_end].
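The breakpoint and merge steps can be sketched in plain Python. This is an illustration under assumptions, not the chunker's source: semantic_spans, cosine_distance, and percentile are hypothetical helpers, the embeddings are supplied by the caller, and percentile re-implements the linear-interpolation rule that numpy.percentile uses by default so the sketch has no third-party dependency.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_spans(
    units: List[Tuple[int, int]],
    embeddings: List[Sequence[float]],
    threshold_percentile: float = 95.0,
) -> List[Tuple[int, int]]:
    """Merge consecutive unit spans, breaking after a unit whenever the
    distance to the next unit exceeds the percentile threshold."""
    if not units:
        return []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(units) - 1)
    ]
    if not distances:
        return [units[0]]
    threshold = percentile(distances, threshold_percentile)
    spans: List[Tuple[int, int]] = []
    first = 0
    for i, dist in enumerate(distances):
        if dist > threshold:
            spans.append((units[first][0], units[i][1]))
            first = i + 1
    spans.append((units[first][0], units[-1][1]))
    return spans
```

Each returned span runs from the first merged unit's start to the last merged unit's end, so slicing the source text with it yields the chunk text, inter-sentence whitespace included.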
When to use: long prose where topic shifts matter more than fixed character, token, or sentence counts, and you already run an embedder in the pipeline.
Offsets: Yes. For every chunk, document.content[s:e] == chunk.content where (s, e) = chunk.offsets, spanning from the first merged unit’s start through the last unit’s end (including whitespace between sentences, as in the source).
Optional dependency
SemanticChunker depends on scikit-learn (and numpy). Install with pip install 'railtracks[semantic]' or include it via pip install 'railtracks[retrieval]'. Without the extra, importing the chunker fails at module load time.
FixedTokenChunker
Encodes the document once, slices the token list into windows of
chunk_size tokens with overlap tokens between windows, then decodes
each window back to text. Default tokenizer is tiktoken cl100k_base.
from railtracks.retrieval import Document, DocumentType  # assumes DocumentType is exported alongside Document
from railtracks.retrieval.chunking import FixedTokenChunker
doc = Document(content=long_text, type=DocumentType.TEXT, source="blob.txt")
chunks = FixedTokenChunker(chunk_size=400, overlap=50).chunk(doc)
When to use: you need chunk sizes aligned to the embedder's hard token limit.
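The encode, slice, decode flow can be sketched with a toy tokenizer. Everything here is hypothetical: ToyTokenizer is a whitespace-based stand-in for a real tokenizer such as tiktoken, and token_windows is an illustrative helper, not the library's function.

```python
from typing import Dict, List


class ToyTokenizer:
    """Stand-in tokenizer: one integer id per whitespace-separated word."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._words: List[str] = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self._ids:
                self._ids[word] = len(self._words)
                self._words.append(word)
            ids.append(self._ids[word])
        return ids

    def decode(self, ids: List[int]) -> str:
        return " ".join(self._words[i] for i in ids)


def token_windows(tokens: List[int], chunk_size: int, overlap: int) -> List[List[int]]:
    """Slice a token list into windows of chunk_size tokens that share
    `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    windows: List[List[int]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final token is already covered
    return windows


tokenizer = ToyTokenizer()
token_ids = tokenizer.encode("a b c d e f g")
decoded = [tokenizer.decode(w) for w in token_windows(token_ids, chunk_size=3, overlap=1)]
```

Note that decoding produces new strings rather than slices of the source, which is why mapping a window back to exact character positions needs extra bookkeeping.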
Offsets and token chunking
FixedTokenChunker currently leaves Chunk.offsets as None -
character-accurate spans for tiktoken-style windows require extra
tokenizer plumbing. If you need both token budgeting and offsets,
prefer RecursiveCharacterChunker with a token length_fn until
token offsets are wired up.
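To illustrate how a token length function and character offsets can coexist, here is a deliberately simplified sketch. It is not RecursiveCharacterChunker: count_tokens is a hypothetical stand-in for a real Tokenizer.count, and chunk_with_offsets packs whole words greedily instead of splitting recursively.

```python
import re
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Hypothetical length_fn: one token per whitespace-separated word."""
    return len(text.split())


def chunk_with_offsets(
    text: str,
    max_tokens: int,
    length_fn: Callable[[str], int] = count_tokens,
) -> List[Tuple[int, int]]:
    """Return (start, end) character spans whose text measures at most
    max_tokens according to length_fn. A single word that is over budget
    on its own still becomes its own span."""
    spans: List[Tuple[int, int]] = []
    chunk_start = None
    chunk_end = 0
    for word in re.finditer(r"\S+", text):
        if chunk_start is None:
            chunk_start, chunk_end = word.start(), word.end()
            continue
        if length_fn(text[chunk_start:word.end()]) > max_tokens:
            # Adding this word would exceed the budget: close the current span.
            spans.append((chunk_start, chunk_end))
            chunk_start = word.start()
        chunk_end = word.end()
    if chunk_start is not None:
        spans.append((chunk_start, chunk_end))
    return spans
```

Because every chunk is a slice of the original string, the budget is enforced in tokens while the spans stay valid character offsets.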
Choosing a chunker
| Situation | Start with |
|---|---|
| Plain text, HTML-to-text, PDF extract, mixed prose | RecursiveCharacterChunker |
| Markdown with real heading structure | MarkdownHeaderChunker (optionally + chunk_size) |
| Sentence-aligned retrieval windows | SentenceChunker |
| Topic- or embedding-driven boundaries | SemanticChunker |
| Hard token budget per chunk | FixedTokenChunker |
Custom chunkers
Subclass Chunker, implement chunk(self, document), and always build
results via _make_chunks so invariants (document_id, dense index,
metadata, offsets) stay correct. Implement Splitter for reusable
str → list[str] logic if the same boundary detection shows up in more
than one chunker.
from railtracks.retrieval import Document
from railtracks.retrieval.chunking import Chunker
class ParagraphChunker(Chunker):
    def chunk(self, document: Document):
        pieces = document.content.split("\n\n")
        return self._make_chunks(document, pieces)
See also
- Chunking overview: objects, layers, pipeline placement.
- Ingestion components: upstream Document production.
- Embeddings methods: picking an embedder whose token limit matches your chunk size.