βœ‚οΈ RAG Text Chunker

πŸ”’ Secure and client-side

Paste a long document and split it into overlapping chunks for embedding-based retrieval. Pick chunks-by-characters for predictable size or chunks-by-token-estimate for closer alignment with embedding model limits. All processing happens in your browser.

Token mode is approximate (~4 chars/token for English, ~1.5 chars/token for CJK). For exact tokenization, use the embedding model's tokenizer.
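The character-based estimate above can be sketched like this. This is an illustration only, assuming a simple split between CJK and non-CJK characters; the tool's exact heuristic (and `estimateTokens` itself) are hypothetical, not its actual code.

```javascript
// Rough token estimate: ~4 chars/token for English-like text,
// ~1.5 chars/token for CJK. Illustrative sketch only.
function estimateTokens(text) {
  // Count characters in common CJK ranges (Han, Kana, Hangul).
  const cjk = (text.match(/[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]/g) || []).length;
  const other = text.length - cjk;
  return Math.ceil(other / 4 + cjk / 1.5);
}
```

For exact counts against a specific model, run its real tokenizer instead.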

How to use

  1. Paste your source document.
  2. Pick chunk size and overlap. Common starting points: 512 chunk / 50 overlap (chars), or 256 / 32 (tokens).
  3. Choose characters for predictable chunk sizes, or tokens to align with embedding model context limits.
  4. Click Chunk. Each chunk is numbered with byte length; click any chunk to copy it, or use Copy all (JSON) to grab the whole array.

Why overlap? Overlap preserves context across chunk boundaries: a sentence split mid-thought won't lose its anchor. Common ratio: overlap = 10–20% of chunk size.
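The sliding-window-with-overlap strategy can be sketched in a few lines. This is a minimal illustration of the idea, not the tool's implementation; the function name and defaults are assumptions.

```javascript
// Sliding-window character chunker: each chunk starts `chunkSize - overlap`
// characters after the previous one, so consecutive chunks share `overlap`
// characters of context.
function chunkByChars(text, chunkSize = 512, overlap = 50) {
  const step = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

For example, `chunkByChars("abcdefghij", 4, 2)` yields `["abcd", "cdef", "efgh", "ghij"]`, with each pair of neighbors sharing two characters.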

Splitting strategy is plain sliding window; no semantic boundary detection. For prose, that's usually fine. For code or structured documents, consider a parser-aware splitter.
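For prose, one lightweight step toward boundary awareness is to split on blank lines first and pack whole paragraphs into chunks up to a size limit. A sketch under that assumption; it is not this tool's behavior, and real parser-aware splitters (for code, Markdown, etc.) do considerably more.

```javascript
// Pack whole paragraphs (separated by blank lines) into chunks of at
// most `maxChars` characters, never cutting mid-paragraph.
function chunkByParagraphs(text, maxChars = 512) {
  const paras = text.split(/\n\s*\n/);
  const chunks = [];
  let current = "";
  for (const p of paras) {
    // Start a new chunk if appending this paragraph would overflow.
    if (current && current.length + p.length + 2 > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? current + "\n\n" + p : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Note the trade-off: a single paragraph longer than `maxChars` still produces an oversized chunk, so a fallback to the sliding window is usually needed in practice.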