βœ‚οΈ RAG Text Chunker

πŸ”’ Secure and client-side

Paste a long document and split it into overlapping chunks for embedding-based retrieval. Pick chunks-by-characters for predictable size or chunks-by-token-estimate for closer alignment with embedding model limits. All processing happens in your browser.

Token mode is approximate (~4 chars/token for English, ~1.5 chars/token for CJK). For exact tokenization, use the embedding model's tokenizer.
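The character-based estimate above can be sketched like this. This is an illustration only, assuming a simple split between CJK and non-CJK characters; the tool's exact heuristic (and `estimateTokens` itself) are hypothetical, not its actual code.

```javascript
// Rough token estimate: ~4 chars/token for English-like text,
// ~1.5 chars/token for CJK. Illustrative sketch only.
function estimateTokens(text) {
  // Count characters in common CJK ranges (Han, Kana, Hangul).
  const cjk = (text.match(/[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]/g) || []).length;
  const other = text.length - cjk;
  return Math.ceil(other / 4 + cjk / 1.5);
}
```

For exact counts against a specific model, run its real tokenizer instead.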

How to use

  1. Paste your source document.
  2. Pick chunk size and overlap. Common starting points: 512 chunk / 50 overlap (chars), or 256 / 32 (tokens).
  3. Choose characters for predictable chunk sizes, or tokens to align with embedding model context limits.
  4. Click Chunk. Each chunk is numbered with byte length; click any chunk to copy it, or use Copy all (JSON) to grab the whole array.

Why overlap? Overlap preserves context across chunk boundaries: a sentence split mid-thought won't lose its anchor. Common ratio: overlap = 10–20% of chunk size.
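The sliding-window-with-overlap strategy can be sketched in a few lines. This is a minimal illustration of the idea, not the tool's implementation; the function name and defaults are assumptions.

```javascript
// Sliding-window character chunker: each chunk starts `chunkSize - overlap`
// characters after the previous one, so consecutive chunks share `overlap`
// characters of context.
function chunkByChars(text, chunkSize = 512, overlap = 50) {
  const step = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

For example, `chunkByChars("abcdefghij", 4, 2)` yields `["abcd", "cdef", "efgh", "ghij"]`, with each pair of neighbors sharing two characters.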

Splitting strategy is plain sliding window; no semantic boundary detection. For prose, that's usually fine. For code or structured documents, consider a parser-aware splitter.
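For prose, one lightweight step toward boundary awareness is to split on blank lines first and pack whole paragraphs into chunks up to a size limit. A sketch under that assumption; it is not this tool's behavior, and real parser-aware splitters (for code, Markdown, etc.) do considerably more.

```javascript
// Pack whole paragraphs (separated by blank lines) into chunks of at
// most `maxChars` characters, never cutting mid-paragraph.
function chunkByParagraphs(text, maxChars = 512) {
  const paras = text.split(/\n\s*\n/);
  const chunks = [];
  let current = "";
  for (const p of paras) {
    // Start a new chunk if appending this paragraph would overflow.
    if (current && current.length + p.length + 2 > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? current + "\n\n" + p : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Note the trade-off: a single paragraph longer than `maxChars` still produces an oversized chunk, so a fallback to the sliding window is usually needed in practice.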