Hey HN!
I've written a lot of RAG pipelines over the last year, and one consistent pain-point is writing regex to chunk the documents correctly.
Right now, the most common chunking algorithms are:
- Split every 1000 characters
- Split on whitespace
- Recursively split on: (many newlines, then one newline, then periods, then spaces)
The best of these is the recursive character text splitter, but the regex is super brittle, and when it fails to match, it ends up creating huge chunks. Worse, this approach also carries the overhead of maintaining regexes for every single filetype.
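For context, here's a rough sketch of what a recursive character splitter does (simplified Python for illustration, not the actual LangChain/LlamaIndex implementation): try separators in order, and fall back to the next one when a piece is still too long. If none of the hard-coded separators match, the whole oversized piece comes back as one giant chunk, which is the failure mode above.

    # Simplified sketch of a recursive character splitter (illustration only).
    SEPARATORS = ["\n\n", "\n", ". ", " "]

    def recursive_split(text, max_len=1000, seps=SEPARATORS):
        if len(text) <= max_len or not seps:
            # No separator left to try: an oversized piece is returned
            # as-is, which is where the huge chunks come from.
            return [text]
        sep, rest = seps[0], seps[1:]
        chunks = []
        for piece in text.split(sep):
            if len(piece) <= max_len:
                chunks.append(piece)
            else:
                chunks.extend(recursive_split(piece, max_len, rest))
        return chunks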
Here we propose LlamaChunk, an inference-efficient method of LLM-powered chunking. It requires only a single LLM inference over your document to produce an optimal recursive character text split, without needing to hope that a bunch of hard-coded rules work on your unstructured data.
We're hoping to build a community for state-of-the-art RAG research @ https://discord.com/invite/yv2hQQytne , come join!
That's cool! How does it perform compared to more "naive" methods? How did you go about comparing that performance, and was it in a real-world RAG setup?
Yep, benchmarks are available at https://github.com/ZeroEntropy-AI/llama-chunk?tab=readme-ov-... . We used this dataset: https://github.com/ZeroEntropy-AI/legalbenchrag , which is a retrieval-focused version of LegalBench.
It scored better than LlamaIndex's recursive character text splitter, and that was even after some custom regex work to improve the latter. If you put enough effort into the regex you could probably get there, but the whole point of agentic chunking is for it to be automatic and contextual.
I like the approach! But does it work on long documents? If you're doing a single LLM pass, how does the LLM keep track of the chunks it's already made?
Good question!
For long documents we have a rolling-window strategy: we cut the document into 5,000-token groupings for inference, with a 400-token overlap, and we prefer the earlier group for overlap tokens.
For example, if Group #0 overlaps with Group #1 at index 5,200, then we use the logprob from Group #0, because it had more context. Group #1 still sees indices 5,000-5,400 as context for what follows, even though we toss out its logprobs for that range.
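To make that concrete, here's a hypothetical sketch of the rolling-window merge (the names and the score_group call are placeholders, not the repo's API): stride by 5,000 tokens, score 5,000 + 400 tokens at a time, and keep the earlier group's logprob wherever two groups cover the same token.

    # Hypothetical sketch of the rolling-window merge (placeholder names).
    GROUP_SIZE = 5_000
    OVERLAP = 400

    def window_heatmap(tokens, score_group):
        # score_group(token_slice) -> one split score (logprob) per token.
        heatmap = [None] * len(tokens)
        start = 0
        while start < len(tokens):
            end = min(start + GROUP_SIZE + OVERLAP, len(tokens))
            scores = score_group(tokens[start:end])
            for i, s in zip(range(start, end), scores):
                if heatmap[i] is None:  # earlier group wins in the overlap
                    heatmap[i] = s
            if end >= len(tokens):
                break
            start += GROUP_SIZE  # stride 5,000, so consecutive groups share 400 tokens
        return heatmap

With that scheme, Group #0 covers indices 0-5,400 and Group #1 starts at 5,000, which matches the example above.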
Details about the groupings are here: https://github.com/ZeroEntropy-AI/llama-chunk/blob/0cac10a34...
--
There's no need to keep track of the chunks it's already made; we just want heatmap values, and then we use that heatmap to split at the hottest character near our target chunk length (or we pick a threshold value and binary-search the threshold to hit a target number of chunks or average chunk size).
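If it helps, here's a hypothetical sketch of the "hottest character near the target length" step (assuming a per-character heatmap the same length as the text; not the repo's code):

    # Hypothetical: greedily cut at the hottest heatmap position within a
    # window around the target chunk length.
    def split_by_heatmap(text, heatmap, target_len=1000, slack=200):
        chunks, start = [], 0
        while len(text) - start > target_len + slack:
            lo = start + target_len - slack
            hi = start + target_len + slack
            cut = max(range(lo, hi), key=lambda i: heatmap[i])
            chunks.append(text[start:cut])
            start = cut
        chunks.append(text[start:])
        return chunks

The threshold variant is the same idea run the other way: pick a cutoff, split at every position whose heatmap value exceeds it, and binary-search the cutoff until the chunk count or average chunk size lands where you want.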