Skip to content

Chunking

Chunking turns each Document produced by a loader into a list of smaller Chunk objects. Embedders, stores, and retrieval all operate on chunks, not whole documents.

For the chunkers that ship with Railtracks (when to pick which) see Built-in Methods.


The Chunk object

Every chunker returns Chunk instances (Chunk API Reference)

offsets

offsetslets you map any chunk back to an exact substring of Document.content for citation-style grounding, highlight rendering, or debugging which span actually matched a query. Not every chunker can populate it; see Built-in Methods for which ones do.


Layered API

Chunking is built from three reusable ideas:

Layer Role
Tokenizer encode / decode / count; used by token-aware chunkers
Splitter split(text) -> list[str]; reusable boundary detection
Chunker chunk(document) -> list[Chunk]; applies a splitter, enforces invariants

Concrete chunkers live in railtracks.retrieval.chunking.

Writing your own Chunker

Subclasses implement one abstract method:

def chunk(self, document: Document) -> list[Chunk]: ...

The returned chunks must satisfy these invariants:

  • document_id matches document.id
  • index is dense and 0-based across the returned list
  • metadata inherits from document.metadata (per-chunk extras may be overlaid on top)
  • offsets, when set, are valid (start, end) ranges into document.content

The base class exposes _make_chunks as a convenience helper that enforces all of the above in one place. The shipped chunkers use it, and you should too unless you have a specific reason not to.

chunk() is expected to be CPU-bound

achunk() is derived from chunk() via asyncio.to_thread. That keeps the event loop responsive for pure text splitting, but it ties up a worker thread per call. If your chunker genuinely needs async I/O (e.g. a remote tokenization service), override achunk() directly with a real async implementation rather than leaning on the default to_thread wrapper.


Quickstart

from uuid import uuid4

from railtracks.retrieval import Document, DocumentType
from railtracks.retrieval.chunking import RecursiveCharacterChunker

doc = Document(
    content=(
        "This is a sample document that will be split into multiple overlapping chunks. "
        "Chunkers are useful for breaking up large texts for retrieval and question answering. "
        "Overlaps ensure context is preserved between chunks. "
        "Feel free to adjust chunk_size and overlap to see how chunking behaves."
    ),
    type=DocumentType.TEXT,
    id=uuid4(),
    source="example.txt",
    metadata={"author": "Test User"},
)

chunks = RecursiveCharacterChunker(chunk_size=60, overlap=15).chunk(doc)

for c in chunks:
    print(f"Chunk #{c.index}: offsets={c.offsets}, length={len(c.content)}")
    print(f"Content: {c.content!r}")
    print("-----")
Chunk #0: offsets=(0, 60), length=60
Content: 'This is a sample document that will be split into multiple overlapping chunks. '
-----
Chunk #1: offsets=(15, 75), length=60
Content: 'Chunkers are useful for breaking up large texts for retrieval and question answering. '
-----
...

Next steps