RAG Text Chunker
Secure & Client-Side
Paste a long document and split it into overlapping chunks for embedding-based retrieval. Pick chunks-by-characters for predictable size or chunks-by-token-estimate for closer alignment with embedding model limits. All processing happens in your browser.
Token mode is an approximation (~4 chars/token for English, ~1.5 chars/token for CJK). For exact token counts, use the embedding model's own tokenizer.
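The chars-per-token heuristic above can be sketched as a small function. This is an illustrative estimator, not the tool's actual code; the name `estimateTokens` and the CJK range check are assumptions.

```javascript
// Hypothetical token estimator using the heuristic above:
// ~4 chars/token for English-like text, ~1.5 chars/token for CJK.
function estimateTokens(text) {
  let cjk = 0;
  for (const ch of text) {
    // Rough CJK check: Hiragana, Katakana, Han, Hangul ranges.
    if (/[\u3040-\u30ff\u3400-\u9fff\uac00-\ud7af]/.test(ch)) cjk++;
  }
  const other = [...text].length - cjk; // count code points, not UTF-16 units
  return Math.ceil(other / 4 + cjk / 1.5);
}
```

Such an estimate is good enough for choosing chunk sizes, but can drift noticeably from a real BPE tokenizer on code, URLs, or mixed-language text.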
How to use
- Paste your source document.
- Pick chunk size and overlap. Common starting points: 512 chunk / 50 overlap (chars), or 256 / 32 (tokens).
- Choose characters for a predictable chunk size, or tokens to align with embedding model context limits.
- Click Chunk. Each chunk is numbered with byte length; click any chunk to copy it, or use Copy all (JSON) to grab the whole array.
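The steps above boil down to a sliding-window split. Here is a minimal sketch of that logic, assuming character mode and overlap smaller than chunk size; the function name and defaults are illustrative, not the tool's implementation.

```javascript
// Sliding-window chunker (character mode), stepping forward by
// (chunkSize - overlap) so consecutive chunks share `overlap` chars.
function chunkText(text, chunkSize = 512, overlap = 50) {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const step = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final chunk reached
  }
  return chunks;
}
```

For example, `chunkText("abcdefghij", 4, 2)` yields `["abcd", "cdef", "efgh", "ghij"]`; each chunk repeats the last two characters of the previous one.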
Why overlap? Overlap preserves context across chunk boundaries: a sentence split mid-thought won't lose its anchor. Common ratio: overlap = 10-20% of chunk size.
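The 10-20% rule of thumb is simple arithmetic; a tiny helper makes it concrete. `suggestedOverlap` is a hypothetical name, and the 15% default is just the midpoint of that range.

```javascript
// Pick an overlap from the 10-20% rule of thumb; 0.15 splits the difference.
function suggestedOverlap(chunkSize, ratio = 0.15) {
  return Math.round(chunkSize * ratio);
}
// e.g. a 512-char chunk suggests an overlap of roughly 51-102 chars.
```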
Splitting strategy is plain sliding window; no semantic boundary detection. For prose, that's usually fine. For code or structured documents, consider a parser-aware splitter.