Ingestion components

A document loader reads raw data from a source (a file, directory, URL, database, dataset) and produces Document objects. Loaders are stage one of the pipeline; everything downstream (chunkers, embedders, stores) consumes Documents.

For the streaming-and-safety side of ingestion (events, re-ingest, multi-tenant writes, sanitization), see the Ingestion page.


The Document object

Every loader produces Document instances (see the Document API Reference).

source is the natural identity of a document for re-ingest staleness checks. Loaders that read files set it to the file path; HTTP loaders set it to the URL; the Hugging Face loader sets it to {dataset}/{split}. If you write a custom loader and want skip-by-hash idempotency, set source to something stable.
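The role of source in skip-by-hash idempotency can be sketched as follows. This is a minimal, self-contained illustration and not the library's implementation: the Doc dataclass is a stand-in for Document (only the source and content fields come from this page), and content_hash, should_ingest, and the seen dictionary are hypothetical names introduced for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for Document; only `source` and `content` are modeled."""

    source: str
    content: str


def content_hash(doc: Doc) -> str:
    """Hash the document body so changes can be detected cheaply."""
    return hashlib.sha256(doc.content.encode("utf-8")).hexdigest()


def should_ingest(doc: Doc, seen: dict[str, str]) -> bool:
    """Return True if the doc is new or changed, recording its hash by source."""
    digest = content_hash(doc)
    if seen.get(doc.source) == digest:
        return False  # same source, same content: skip
    seen[doc.source] = digest
    return True


# Hypothetical in-memory index mapping source -> last ingested hash.
seen: dict[str, str] = {}
first = should_ingest(Doc("docs/intro.md", "hello"), seen)
repeat = should_ingest(Doc("docs/intro.md", "hello"), seen)
edited = should_ingest(Doc("docs/intro.md", "hello, world"), seen)
print(first, repeat, edited)
```

If source changed on every run (a temporary file path, for instance), the lookup would never match and every document would be treated as new, which is why a stable value matters.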


Built-in loaders

| Loader | Handles | Extra install |
| --- | --- | --- |
| TextLoader | .txt, .md files & directories | None |
| CSVLoader | .csv files & directories | None |
| JSONLoader | .json files & directories | None |
| PyPDFLoader | .pdf files & directories (embedded text only) | pip install "railtracks[pdf]" |
| PyPDFOCRLoader | .pdf files & directories with OCR fallback | pip install "railtracks[ocr]" + Tesseract |
| HuggingFaceDatasetLoader | Hugging Face Hub datasets (streaming) | pip install "railtracks[huggingface]" |
| LangChainLoaderAdapter | Any LangChain loader | Depends on wrapped loader |

TextLoader, CSVLoader, JSONLoader, SanitizingLoader, and the base classes are re-exported from railtracks.retrieval.loaders. The optional loaders (PDF, OCR, Hugging Face) live under their own submodules; import them directly to avoid pulling in optional dependencies you don't need.


The unified loader interface

All loaders share three methods. Prefer astream() for anything that might not fit in memory; it is the only path that interleaves with chunking, embedding, and storage:

    import asyncio

    from railtracks.retrieval.loaders import TextLoader

    loader = TextLoader("docs/")

    # Sync, returns list[Document]. Fine for tests, small corpora, scripts.
    docs = loader.load()


    async def main() -> None:
        # Async, collects all documents before returning. Same memory profile as load().
        docs = await loader.aload()

        # Async, yields one Document at a time. The only mode that streams.
        async for doc in loader.astream():
            print(doc.source, doc.type, len(doc.content))


    asyncio.run(main())

load() and aload() are convenience wrappers around astream(). Both collect every document into a list before returning, so use them only when the whole corpus fits comfortably in memory.
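The wrapper relationship can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the library's base class: SketchLoader is a hypothetical name, it yields plain strings rather than Document objects, and its sync wrapper assumes no event loop is already running.

```python
import asyncio
from typing import AsyncIterator


class SketchLoader:
    """Hypothetical loader; yields strings instead of Document objects."""

    def __init__(self, items: list[str]) -> None:
        self._items = items

    async def astream(self) -> AsyncIterator[str]:
        # The only method that does real work: yield one item at a time.
        for item in self._items:
            yield item

    async def aload(self) -> list[str]:
        # Drain the stream into a list; everything is held in memory at once.
        return [item async for item in self.astream()]

    def load(self) -> list[str]:
        # Sync wrapper around aload(); assumes no event loop is running.
        return asyncio.run(self.aload())


loader = SketchLoader(["a", "b", "c"])
loaded = loader.load()
print(loaded)
```

In a design like this, only astream() lets a consumer start chunking or embedding the first item before the last one has been read; both wrappers give that up in exchange for a simpler return type.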


Next steps

| You want to… | Read |
| --- | --- |
| See the built-in loaders (text, CSV, JSON, PDF, OCR, Hugging Face) | Built-in loaders |
| Load data from cloud services (e.g. AWS S3, Azure Blob) | Integrations |
| Write your own loader for an unsupported source | Custom loaders |
| Run the pipeline end-to-end | Ingestion (write path) |