Chunking
Chunking turns each Document
produced by a loader into a list of smaller Chunk objects. Embedders,
stores, and retrieval all operate on chunks, not whole documents.
For the chunkers that ship with Railtracks, and guidance on when to pick which, see Built-in Methods.
The Chunk object
Every chunker returns Chunk instances (see the Chunk API Reference).
offsets
offsets lets you map any chunk back to an
exact substring of Document.content for citation-style grounding,
highlight rendering, or debugging which span actually matched a query. Not
every chunker can populate it; see Built-in Methods for
which ones do.
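As a minimal sketch of that mapping, the snippet below slices a document's content with a chunk's offsets. The Document and Chunk dataclasses here are simplified stand-ins written for this example (the real Railtracks classes may carry more fields or different signatures), and cited_span and highlight are hypothetical helpers, not part of the library.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple
from uuid import UUID, uuid4


# Stand-in types for illustration only; not the real Railtracks classes.
@dataclass
class Document:
    id: UUID
    content: str
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Chunk:
    document_id: UUID
    index: int
    content: str
    offsets: Optional[Tuple[int, int]] = None
    metadata: Dict[str, object] = field(default_factory=dict)


def cited_span(document: Document, chunk: Chunk) -> Optional[str]:
    """Return the source substring for a chunk, or None if offsets are unset."""
    if chunk.offsets is None:
        return None
    start, end = chunk.offsets
    return document.content[start:end]


def highlight(document: Document, chunk: Chunk, marker: str = "**") -> str:
    """Wrap the chunk's span in markers inside the full document text."""
    if chunk.offsets is None:
        return document.content  # nothing to highlight without offsets
    start, end = chunk.offsets
    text = document.content
    return text[:start] + marker + text[start:end] + marker + text[end:]


if __name__ == "__main__":
    doc = Document(id=uuid4(), content="alpha beta gamma")
    hit = Chunk(document_id=doc.id, index=0, content="beta", offsets=(6, 10))
    print(cited_span(doc, hit))
    print(highlight(doc, hit))
```

Because a chunker may leave offsets unset, both helpers treat None as "no span available" rather than raising.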
Layered API
Chunking is built from three reusable ideas:
| Layer | Role |
|---|---|
| Tokenizer | encode / decode / count; used by token-aware chunkers |
| Splitter | split(text) -> list[str]; reusable boundary detection |
| Chunker | chunk(document) -> list[Chunk]; applies a splitter, enforces invariants |
Concrete chunkers live in railtracks.retrieval.chunking.
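To make the first two layers concrete, here is a rough sketch of how a tokenizer and a splitter might compose. The Protocol definitions only mirror the method names in the table above; their exact parameter and return types are assumptions. CharTokenizer, ParagraphSplitter, and pack are hypothetical names invented for this example and do not exist in Railtracks.

```python
import re
from typing import List, Protocol


class Tokenizer(Protocol):
    # Signatures are assumed; the table only names the three methods.
    def encode(self, text: str) -> List[int]: ...
    def decode(self, tokens: List[int]) -> str: ...
    def count(self, text: str) -> int: ...


class Splitter(Protocol):
    def split(self, text: str) -> List[str]: ...


class CharTokenizer:
    """Toy tokenizer: one token per character (code point)."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def decode(self, tokens: List[int]) -> str:
        return "".join(chr(t) for t in tokens)

    def count(self, text: str) -> int:
        return len(text)


class ParagraphSplitter:
    """Toy splitter: boundaries at blank lines."""

    def split(self, text: str) -> List[str]:
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def pack(pieces: List[str], tokenizer: Tokenizer, max_tokens: int) -> List[str]:
    """Greedily merge adjacent pieces while the merged text fits the budget."""
    packed: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + "\n\n" + piece
        if current and tokenizer.count(candidate) > max_tokens:
            packed.append(current)  # budget exceeded: close the current group
            current = piece
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


if __name__ == "__main__":
    parts = ParagraphSplitter().split("aaaa\n\nbbbb\n\ncccc")
    print(pack(parts, CharTokenizer(), max_tokens=10))
```

Keeping boundary detection in the splitter and budgeting in the tokenizer means either can be swapped without touching the other, which is the point of the layering.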
Writing your own Chunker
Subclasses implement one abstract method, chunk(document) -> list[Chunk].
The returned chunks must satisfy these invariants:
- document_id matches document.id
- index is dense and 0-based across the returned list
- metadata inherits from document.metadata (per-chunk extras may be overlaid on top)
- offsets, when set, are valid (start, end) ranges into document.content
The base class exposes _make_chunks as a convenience helper that
enforces all of the above in one place. The shipped chunkers use it,
and you should too unless you have a specific reason not to.
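The invariants above can be illustrated with a small, self-contained sketch. It deliberately does not subclass the real base class or call _make_chunks, since their signatures are not shown here; Document and Chunk are simplified stand-ins, and SentenceChunker and check_invariants are hypothetical names used only for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


# Stand-in types for illustration only; not the real Railtracks classes.
@dataclass
class Document:
    id: str
    content: str
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Chunk:
    document_id: str
    index: int
    content: str
    metadata: Dict[str, object] = field(default_factory=dict)
    offsets: Optional[Tuple[int, int]] = None


class SentenceChunker:
    """Hypothetical chunker: one chunk per sentence, with offsets populated."""

    _sentence = re.compile(r"[^.!?]+[.!?]*")

    def chunk(self, document: Document) -> List[Chunk]:
        chunks: List[Chunk] = []
        for match in self._sentence.finditer(document.content):
            if not match.group().strip():
                continue  # skip whitespace-only spans
            chunks.append(
                Chunk(
                    document_id=document.id,
                    index=len(chunks),  # dense, 0-based
                    content=match.group(),
                    # document metadata first, per-chunk extras overlaid on top
                    metadata={**document.metadata, "splitter": "sentence"},
                    offsets=(match.start(), match.end()),
                )
            )
        return chunks


def check_invariants(document: Document, chunks: List[Chunk]) -> None:
    """Raise ValueError if any of the four invariants is violated."""
    for expected, chunk in enumerate(chunks):
        if chunk.document_id != document.id:
            raise ValueError(f"chunk {expected}: wrong document_id")
        if chunk.index != expected:
            raise ValueError(f"chunk {expected}: index is {chunk.index}")
        if not document.metadata.keys() <= chunk.metadata.keys():
            raise ValueError(f"chunk {expected}: document metadata not inherited")
        if chunk.offsets is not None:
            start, end = chunk.offsets
            if not 0 <= start <= end <= len(document.content):
                raise ValueError(f"chunk {expected}: offsets out of range")


if __name__ == "__main__":
    doc = Document(id="doc-1", content="One. Two! Three", metadata={"author": "a"})
    result = SentenceChunker().chunk(doc)
    check_invariants(doc, result)
    print([c.content for c in result])
```

A checker like this is mainly useful in tests for a custom chunker that bypasses the base-class helper.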
chunk() is expected to be CPU-bound
achunk() is derived from chunk() via
asyncio.to_thread.
That keeps the event loop responsive for pure text splitting, but it
ties up a worker thread per call. If your chunker genuinely needs
async I/O (e.g. a remote tokenization service), override achunk()
directly with a real async implementation rather than leaning on the
default to_thread wrapper.
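The difference between the two approaches might look roughly like the sketch below. The classes are hypothetical and operate on plain strings to stay short; the asyncio.sleep(0) call stands in for a real network round trip to a remote tokenization service, and nothing here is the actual Railtracks base class.

```python
import asyncio
from typing import List


class ThreadedChunkerSketch:
    """Hypothetical chunker whose achunk() wraps the sync chunk() in a thread."""

    def chunk(self, text: str) -> List[str]:
        # Pure, CPU-bound text splitting.
        return text.split()

    async def achunk(self, text: str) -> List[str]:
        # Mirrors the default described above: run chunk() in a worker thread.
        return await asyncio.to_thread(self.chunk, text)


class RemoteCountChunker(ThreadedChunkerSketch):
    """Hypothetical chunker that needs async I/O, so it overrides achunk()."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens

    async def _remote_count(self, piece: str) -> int:
        # Placeholder for a call to a remote tokenization service.
        await asyncio.sleep(0)
        return len(piece)

    async def achunk(self, text: str) -> List[str]:
        pieces = text.split()
        # Issue all counting requests concurrently; no worker thread is held.
        counts = await asyncio.gather(*(self._remote_count(p) for p in pieces))
        chunks: List[str] = []
        current: List[str] = []
        used = 0
        for piece, n in zip(pieces, counts):
            if current and used + n > self.max_tokens:
                chunks.append(" ".join(current))
                current, used = [], 0
            current.append(piece)
            used += n
        if current:
            chunks.append(" ".join(current))
        return chunks


if __name__ == "__main__":
    print(asyncio.run(ThreadedChunkerSketch().achunk("a b c")))
    print(asyncio.run(RemoteCountChunker(max_tokens=5).achunk("ab abcd c de")))
```

In the override, waiting on the remote service yields to the event loop instead of blocking a thread, which is why a native async implementation is preferable for I/O-bound chunkers.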
Quickstart
```python
from uuid import uuid4

from railtracks.retrieval import Document, DocumentType
from railtracks.retrieval.chunking import RecursiveCharacterChunker

# Build a small in-memory document to chunk.
doc = Document(
    content=(
        "This is a sample document that will be split into multiple overlapping chunks. "
        "Chunkers are useful for breaking up large texts for retrieval and question answering. "
        "Overlaps ensure context is preserved between chunks. "
        "Feel free to adjust chunk_size and overlap to see how chunking behaves."
    ),
    type=DocumentType.TEXT,
    id=uuid4(),
    source="example.txt",
    metadata={"author": "Test User"},
)

# Split the document using chunk_size=60 and overlap=15.
chunks = RecursiveCharacterChunker(chunk_size=60, overlap=15).chunk(doc)

# Inspect each chunk's position in the list, its offsets, and its text.
for c in chunks:
    print(f"Chunk #{c.index}: offsets={c.offsets}, length={len(c.content)}")
    print(f"Content: {c.content!r}")
    print("-----")
```

Example output (truncated):

```
Chunk #0: offsets=(0, 60), length=60
Content: 'This is a sample document that will be split into multiple overlapping chunks. '
-----
Chunk #1: offsets=(15, 75), length=60
Content: 'Chunkers are useful for breaking up large texts for retrieval and question answering. '
-----
...
```
Next steps
- Built-in Methods: parameters, defaults, and when to use each chunker.
- Ingestion components: producing Document instances upstream.
- Embeddings: vectorizing the chunks this stage produces.