Chunking: Built-in methods

Five chunkers ship under railtracks.retrieval.chunking. Pick one based on your source format and whether you need offsets back.

Summary

Chunker	Best for	Offsets on `Chunk`?
`RecursiveCharacterChunker`	Default choice. General text and markdown bodies.	Yes (character spans)
`MarkdownHeaderChunker`	Markdown with `#` / `##` hierarchy; header context in metadata.	Yes (when body spans are known)
`SentenceChunker`	Sentence-window retrieval; overlap measured in sentences.	Yes
`SemanticChunker`	Topic boundaries via embeddings; variable chunk size.	Yes (unit spans in source text)
`FixedTokenChunker`	Hard token budget per chunk (e.g. matching embedder max).	No (see note)

`RecursiveCharacterChunker`

Recursively splits on an ordered list of separators (paragraphs → lines → sentence-like breaks → words → characters), then merges fragments into chunks of at most chunk_size units (characters by default, or whatever length_fn measures), with overlap between adjacent chunks.

from railtracks.retrieval import Document, DocumentType  # DocumentType import path is assumed
from railtracks.retrieval.chunking import RecursiveCharacterChunker

doc = Document(content=long_text, type=DocumentType.TEXT, source="doc.txt")
chunks = RecursiveCharacterChunker(
    chunk_size=800,
    overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],  # optional; sensible defaults exist
).chunk(doc)

When to use: unstructured or lightly structured text where you don't need heading-aware metadata. Pair it with a token length_fn (a Tokenizer.count) if you want to budget by tokens while keeping character offsets; FixedTokenChunker cannot offer that combination because it does not return offsets.
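
The split-then-merge idea described above can be sketched in plain Python. This is a simplified illustration, not the railtracks implementation: recursive_split and merge_fragments are hypothetical names, lengths are measured in characters only, and overlap is left out to keep the sketch short.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into fragments of at most chunk_size characters,
    trying each separator in order and recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut by characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    fragments = []
    for i, part in enumerate(parts):
        # Keep the separator attached so the fragments concatenate back to text.
        piece = part + sep if i < len(parts) - 1 else part
        fragments.extend(recursive_split(piece, rest, chunk_size))
    return fragments


def merge_fragments(fragments, chunk_size):
    """Greedily pack consecutive fragments into chunks of at most chunk_size."""
    chunks, current = [], ""
    for frag in fragments:
        if current and len(current) + len(frag) > chunk_size:
            chunks.append(current)
            current = ""
        current += frag
    if current:
        chunks.append(current)
    return chunks


text = "Alpha beta gamma.\n\nDelta epsilon. Zeta eta theta iota.\n\nKappa."
separators = ["\n\n", "\n", ". ", " ", ""]
chunks = merge_fragments(recursive_split(text, separators, 20), 20)
```

Because separators stay attached to the preceding fragment, joining the chunks reproduces the input exactly, which is what makes character offsets easy to recover in this style of chunker.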

`MarkdownHeaderChunker`

Splits markdown on heading lines matching configured # prefixes. Each emitted chunk carries heading context in metadata (headers, section). If chunk_size is set and a section body is too long, the body is split further using a fallback splitter (defaults to a zero-overlap RecursiveSplitter).

from railtracks.retrieval import Document, DocumentType  # DocumentType import path is assumed
from railtracks.retrieval.chunking import MarkdownHeaderChunker

doc = Document(content=md_text, type=DocumentType.MARKDOWN, source="guide.md")
chunks = MarkdownHeaderChunker(
    headers_to_split_on=["#", "##", "###"],  # optional
    chunk_size=1000,                          # optional; omit to never subdivide bodies
).chunk(doc)

When to use: markdown knowledge bases, READMEs, documentation sites - anywhere section boundaries are meaningful for retrieval (a hit in "Authentication > OAuth Setup" tells you more than a hit somewhere on page 4).
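
To make the heading-context idea concrete, here is a minimal standalone sketch of header-based splitting. The function name split_by_headers and the returned dict shape are assumptions for illustration and do not mirror the Chunk objects railtracks emits. The sketch also ignores `#` characters inside fenced code, which a real implementation would need to handle.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(md_text, levels=(1, 2, 3)):
    """Return one {"headers": [...], "body": str} dict per non-empty section."""
    sections = []
    stack = []  # (level, title) pairs for the headings currently in scope
    body = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"headers": [title for _, title in stack], "body": text})
        body.clear()

    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) in levels:
            flush()  # close the previous section under its own headings
            level = len(match.group(1))
            # A new heading replaces any heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


md = (
    "# Authentication\nIntro.\n"
    "## OAuth Setup\nRegister an app.\n"
    "## API Keys\nCreate a key.\n"
    "# Billing\nInvoices.\n"
)
sections = split_by_headers(md)
```

Each section carries its full heading path, so a hit in the second section can be reported as "Authentication > OAuth Setup".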

`SentenceChunker`

Detects sentence boundaries (default: regex on ./!/? + whitespace), then groups chunk_size consecutive sentences with overlap sentences shared between adjacent windows. Each chunk gets metadata["sentence_count"].

from railtracks.retrieval import Document, DocumentType  # DocumentType import path is assumed
from railtracks.retrieval.chunking import SentenceChunker

doc = Document(content=long_text, type=DocumentType.TEXT, source="article.txt")
chunks = SentenceChunker(chunk_size=5, overlap=1).chunk(doc)

When to use: sentence-window expansion (retrieve narrow windows, then fetch neighbouring sentences for context). Inject a custom Splitter if the regex default is too crude for your language.
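
The windowing arithmetic is easy to get wrong by one, so a small standalone sketch may help. It assumes a regex boundary similar to the default described above; sentence_windows is a hypothetical helper, not part of railtracks.

```python
import re


def sentence_windows(text, chunk_size=5, overlap=1):
    """Group sentences into windows of chunk_size that share `overlap` sentences."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + chunk_size])
        if start + chunk_size >= len(sentences):
            break  # the last sentence is covered; avoid a redundant tail window
    return windows


windows = sentence_windows("One. Two! Three? Four. Five. Six. Seven.", chunk_size=3, overlap=1)
```

Each window starts chunk_size - overlap sentences after the previous one, so the final sentence of one window is the first sentence of the next when overlap=1.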

`SemanticChunker`

Splits a document into units (sentences by default, via RegexSentenceSplitter), embeds each unit with an injected Embedding provider, and starts a new chunk wherever the cosine distance between neighboring embeddings exceeds a percentile-based threshold; the units between breakpoints are merged. Chunk count and size adapt to the document rather than following a fixed window.

from railtracks.retrieval import Document, DocumentType  # DocumentType import path is assumed
from railtracks.retrieval.chunking import SemanticChunker
from railtracks.retrieval.embedding import OpenAIEmbedding

doc = Document(content=long_text, type=DocumentType.TEXT, source="article.txt")
chunks = SemanticChunker(
    embedder=OpenAIEmbedding(),
    threshold_percentile=95.0,
).chunk(doc)

async def async_chunking():
    # Async pipelines: prefer achunk (calls embedder.aembed)
    chunks = await SemanticChunker(embedder=OpenAIEmbedding()).achunk(doc)

Parameter	Description
`embedder`	Required `Embedding` implementation. `chunk()` uses `embed()`; `achunk()` uses `aembed()`.
`sentence_splitter`	Optional `Splitter` for units. Defaults to `RegexSentenceSplitter`.
`threshold_percentile`	Percentile (0–100) of the cosine distances between adjacent units in the document; distances above this value become breakpoints. Higher → fewer, larger chunks. Default `95.0`.
`combine_neighbors`	When `True`, each string sent to the embedder includes neighboring units for richer context. Chunk text and offsets still come from original unit spans in `document.content`.
`window`	Neighbor radius on each side when `combine_neighbors=True`. Default `1`.

Pipeline (high level):

1. Split document.content into positioned units (text, start, end).
2. Embed unit texts (or contextualized strings if combine_neighbors=True).
3. Compute the cosine distance between each pair of adjacent embeddings.
4. Break after units where the distance exceeds numpy.percentile(distances, threshold_percentile).
5. Merge units between breakpoints; each chunk is document.content[first_start:last_end].
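
The breakpoint logic in the pipeline above can be sketched without any dependencies. This is an illustration only: the embeddings are hand-written 2-D vectors rather than real embedder output, semantic_spans is a hypothetical name, and the percentile helper reimplements the linear interpolation that numpy.percentile uses by default so the example runs on the standard library.

```python
import math
import re


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, q):
    """Linear-interpolation percentile over a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_spans(units, embeddings, threshold_percentile=95.0):
    """units: list of (text, start, end). Returns one (start, end) span per chunk."""
    if len(units) < 2:
        return [(start, end) for _, start, end in units]
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(units) - 1)
    ]
    threshold = percentile(distances, threshold_percentile)
    spans, first = [], 0
    for i, dist in enumerate(distances):
        if dist > threshold:  # topic shift: break after unit i
            spans.append((units[first][1], units[i][2]))
            first = i + 1
    spans.append((units[first][1], units[-1][2]))
    return spans


content = "Cats purr. Cats nap. Stocks fell. Bonds rose."
units = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S[^.!?]*[.!?]", content)]
toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]  # made-up vectors
spans = semantic_spans(units, toy_embeddings)
chunk_texts = [content[s:e] for s, e in spans]
```

Only the distance between the second and third unit exceeds the threshold here, so the four sentences collapse into two chunks, and each chunk is a direct slice of the source text.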

When to use: long prose where topic shifts matter more than fixed character, token, or sentence counts, and you already run an embedder in the pipeline.

Offsets: Yes. For every chunk, document.content[s:e] == chunk.content where (s, e) = chunk.offsets, spanning from the first merged unit’s start through the last unit’s end (including whitespace between sentences, as in the source).

Optional dependency

SemanticChunker depends on scikit-learn (and numpy). Install with pip install 'railtracks[semantic]' or include it via pip install 'railtracks[retrieval]'. Without the extra, importing the chunker fails at module load time.

`FixedTokenChunker`

Encodes the document once, slices the token list into windows of chunk_size tokens with overlap tokens between windows, then decodes each window back to text. Default tokenizer is tiktoken cl100k_base.

from railtracks.retrieval import Document, DocumentType  # DocumentType import path is assumed
from railtracks.retrieval.chunking import FixedTokenChunker

doc = Document(content=long_text, type=DocumentType.TEXT, source="blob.txt")
chunks = FixedTokenChunker(chunk_size=400, overlap=50).chunk(doc)

When to use: you need chunk sizes aligned to the embedder's hard token limit.
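
As a rough sketch of the encode, slice, decode flow: the example below uses str.split and " ".join as stand-ins for a real tokenizer's encode and decode (the default described above is tiktoken, which is not used here), and token_windows is a hypothetical helper name.

```python
def token_windows(tokens, chunk_size, overlap):
    """Slice a token list into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final token is already covered
    return windows


# Toy tokenizer: whitespace split stands in for encode, join for decode.
tokens = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9".split()
token_chunks = [" ".join(w) for w in token_windows(tokens, chunk_size=4, overlap=1)]
```

Every window holds at most chunk_size tokens, which is the property that matters when the budget is an embedder's hard limit.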

Offsets and token chunking

FixedTokenChunker currently leaves Chunk.offsets as None - character-accurate spans for tiktoken-style windows require extra tokenizer plumbing. If you need both token budgeting and offsets, prefer RecursiveCharacterChunker with a token length_fn until token offsets are wired up.

Choosing a chunker

Situation	Start with
Plain text, HTML-to-text, PDF extract, mixed prose	`RecursiveCharacterChunker`
Markdown with real heading structure	`MarkdownHeaderChunker` (optionally + `chunk_size`)
Sentence-aligned retrieval windows	`SentenceChunker`
Topic- or embedding-driven boundaries	`SemanticChunker`
Hard token budget per chunk	`FixedTokenChunker`

Custom chunkers

Subclass Chunker, implement chunk(self, document), and always build results via _make_chunks so invariants (document_id, dense index, metadata, offsets) stay correct. Implement Splitter for reusable str → list[str] logic if the same boundary detection shows up in more than one chunker.

from railtracks.retrieval import Document
from railtracks.retrieval.chunking import Chunker


class ParagraphChunker(Chunker):
    def chunk(self, document: Document):
        pieces = document.content.split("\n\n")
        return self._make_chunks(document, pieces)
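
If a custom chunker should also satisfy the offset invariant (document.content[s:e] == chunk.content), it helps to compute spans rather than bare strings while splitting. The helper below is a standalone sketch with a hypothetical name, paragraph_spans; whether and how _make_chunks accepts spans is not covered here and would need checking against the base class.

```python
import re


def paragraph_spans(text):
    """Return (start, end) for each non-empty paragraph separated by blank lines."""
    spans = []
    pos = 0
    for gap in re.finditer(r"\n\s*\n", text):
        if text[pos:gap.start()].strip():
            spans.append((pos, gap.start()))
        pos = gap.end()
    if text[pos:].strip():
        spans.append((pos, len(text)))
    return spans


text = "First para.\n\nSecond para\nstill second.\n\n\nThird."
spans = paragraph_spans(text)
paragraphs = [text[s:e] for s, e in spans]
```

Slicing the source with each span yields the paragraph text, so the pieces and their offsets cannot drift apart.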
