RANKWITHME.AI

You already have the answers. We help the internet find them.

Structure before ads — your business, clearly defined, permanently visible

RESEARCH — 2026.03.04
Günther · Mohr · Williams · Wang · Xiao — 2024 · Jina AI · Weaviate
Text Embeddings · Neural Retrieval · RAG Architecture · arXiv:2409.04701v3 · cs.CL · cs.IR
How a Simple Reordering of Operations Changes What AI Systems Can Remember — And What It Means for Any Corpus Built on Relationships
A reading of Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models by Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, and Han Xiao — Jina AI GmbH & Weaviate B.V. (2024)
WHY WE READ THIS PAPER 01 / 06

Every morning we sit down with a piece of research that connects to the problem we are working on. Today's paper is one of the most quietly consequential we have encountered — not because it proposes something radical, but because it solves a problem so fundamental that its solution changes the calculus for anyone building retrieval systems at scale.

Late Chunking is authored by researchers at Jina AI and Weaviate — two organizations doing serious, production-oriented work in the embedding and retrieval space. The paper is rigorous, its code is public, its benchmarks are reproducible, and its authors are honest about where the method works best and where it doesn't. That combination of qualities — practical ambition and scientific honesty — is exactly what makes research worth reading carefully.

We work on federated knowledge graph architecture for the American business economy. Our problem is not identical to theirs. But the resonance between their findings and the structural choices we have made in our own system is specific enough to be worth sharing openly. When two different research trajectories converge on the same architectural answer from different directions, that convergence is worth paying attention to.

We are grateful this paper exists and grateful its authors published it openly. Anyone building RAG systems, embedding pipelines, or structured document corpora should read it.

THE PROBLEM THEY SET OUT TO SOLVE 02 / 06

To understand why late chunking matters, you need to understand a tension that sits at the heart of every modern AI retrieval system. Dense vector search — the technology that powers most of what people call "AI search" — works best with short, focused text. The shorter the passage, the more precisely its meaning can be captured in a single vector. But real documents are long, and real meaning lives in context that spans many sentences, paragraphs, and pages.

The standard solution is chunking: split the document into smaller pieces and embed each piece separately. Store those embeddings. At query time, find the chunk whose embedding is most similar to the query. Simple, effective, and widely deployed.
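As a concrete picture of that standard pipeline, here is a minimal, runnable sketch of naive chunking. The `toy_embed` hashing function is a stand-in of our own invention for a real embedding model (not anything from the paper), kept only so the pipeline executes end to end:

```python
import zlib
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a real embedding model: a deterministic
    # bag-of-words hash projection, just enough to run the pipeline.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def naive_chunk_index(document):
    # Naive order of operations: split FIRST, then embed each
    # chunk in isolation, with no knowledge of its neighbors.
    chunks = [s.strip() for s in document.split(".") if s.strip()]
    return chunks, np.stack([toy_embed(c) for c in chunks])

def search(query, chunks, vectors):
    # At query time, return the chunk whose embedding is most
    # similar to the query embedding (dot product of unit vectors).
    scores = vectors @ toy_embed(query)
    best = int(np.argmax(scores))
    return chunks[best], float(scores[best])
```

Every production system differs in its splitter and its model, but the ordering — split, then embed in isolation — is the part that matters here.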

The problem is what chunking destroys in the process.

Plain Language — The Lost Context Problem

Imagine a Wikipedia article about Berlin. The first sentence says "Berlin is the capital and largest city of Germany." The second sentence says "Its more than 3.85 million inhabitants make it the European Union's most populous city." The third says "The city is also one of the states of Germany."

Now imagine chopping that article into three separate chunks — one per sentence — and embedding each chunk independently. The embedding model for chunk two sees only the word "Its." It doesn't know what "Its" refers to. The model for chunk three sees only "The city." It doesn't know which city. The connection to Berlin has been severed by the act of splitting.

When a user searches for "Berlin population," the chunk that actually answers the question — the second sentence — gets a much lower relevance score than it deserves, because its embedding was computed without any knowledge of what "Its" referred to. The answer is in the corpus. The retrieval system can't find it.

The researchers demonstrate this precisely. Using a standard embedding model, the sentence "Its more than 3.85 million inhabitants make it the European Union's most populous city" achieves a cosine similarity of only 0.7084 to the query "Berlin" when embedded naively — despite being entirely about Berlin. With late chunking, that same sentence achieves 0.8249. The difference is context.
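The relevance scores quoted here are cosine similarities between the query embedding and each chunk embedding. For reference:

```python
import numpy as np

def cosine_similarity(a, b):
    # The relevance score behind the 0.7084 / 0.8249 figures:
    # the cosine of the angle between query and chunk vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```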

This is not a minor edge case. It is the fundamental limitation of naive chunking at scale, and it affects every retrieval system that splits documents before embedding them — which is most of them.

HOW LATE CHUNKING WORKS 03 / 06

The insight behind late chunking is elegant precisely because it doesn't require building something new. It requires changing the order of two existing operations.

In naive chunking, the sequence is: split first, then embed. Each chunk enters the embedding model as an isolated string, unaware of what came before or after it.

In late chunking, the sequence is: embed first, then split. The entire document is passed through the transformer model first, generating a token-level embedding for every single token in the document — each one informed by the full document context. Only then are those token embeddings grouped into chunks and compressed into single chunk vectors via mean pooling.

01
Feed the full document to the transformer. Every token gets an embedding that reflects the entire surrounding document — not just its immediate neighbors. "Its" knows it refers to Berlin. "The city" knows which city it is.
02
Apply your chunking strategy — but only to identify boundaries. Sentence boundaries, fixed token counts, semantic similarity breaks — any chunking algorithm works. The chunks are used purely to define where the pooling windows start and stop. The actual splitting hasn't happened yet.
03
Mean pool the token embeddings within each chunk boundary. Instead of pooling the entire document into one vector, pool only the tokens belonging to each chunk. The result is a single vector per chunk — but each vector was computed from token embeddings that already carry full document context. The chunks remember where they came from.
No additional training required. No additional model calls required. No new architecture required. Late chunking works with any long-context embedding model that uses mean pooling — which describes most modern embedding models — applied exactly as it is today, with only the operation order changed.
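At the pooling stage, the three steps above reduce to a few lines. A minimal sketch in NumPy, assuming the token embeddings have already been produced by one full-document pass through a long-context model:

```python
import numpy as np

def late_chunk(token_embeddings, chunk_boundaries):
    # token_embeddings: (num_tokens, dim) array from ONE pass of the
    # full document through the transformer, so every row already
    # carries whole-document context.
    # chunk_boundaries: (start, end) token-index pairs, end exclusive,
    # produced by whatever chunking strategy you already use.
    return np.stack([
        token_embeddings[start:end].mean(axis=0)  # mean pool per chunk
        for start, end in chunk_boundaries
    ])
```

The only change from naive chunking is where the boundaries are applied: after the transformer pass rather than before it.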

Even the most capable embedding models have context window limits. A document longer than 8,192 tokens can't be fed to the transformer in one pass. The researchers solve this with a technique they call long late chunking — processing the document in overlapping macro-windows.

Think of it this way: instead of reading a long book in one sitting, you read chapters with deliberate overlap. You finish chapter three, then back up to the last few pages of chapter two before starting chapter four. The overlap ensures that nothing gets processed in isolation at the seams — every macro-window begins with context from the prior window already in view. Those overlap tokens serve as a contextual bridge between processing passes, preventing the boundary problem that would otherwise emerge at every macro-chunk seam.
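The overlapping-window idea can be sketched as a boundary generator. The window and overlap sizes below are illustrative defaults, not the paper's exact settings:

```python
def macro_windows(num_tokens, window=8192, overlap=512):
    # Yield (start, end) token ranges for overlapping macro-windows.
    # Each window after the first begins `overlap` tokens before the
    # previous one ended, so no token is ever processed without some
    # preceding context already in view.
    start = 0
    while start < num_tokens:
        end = min(start + window, num_tokens)
        yield start, end
        if end == num_tokens:
            break
        start = end - overlap
```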

The evaluation results confirm this matters: when the researchers stopped truncating documents at 8,192 tokens and let long late chunking handle the full length, retrieval scores went up — because truncation had been silently discarding information that turned out to be relevant.

WHAT THE NUMBERS SHOW 04 / 06

The researchers ran late chunking against naive chunking across three embedding models, three chunking strategies, and four retrieval benchmarks from the BEIR suite. The results are consistent in a way that matters more than any single number.

Chunking Strategy               Relative Improvement    Absolute Improvement    Consistent Across Models
Sentence Boundaries             +3.63%                  +1.9 pts                ✓
Fixed-Size Boundaries           +3.46%                  +1.8 pts                ✓
Semantic Sentence Boundaries    +2.70%                  +1.5 pts                ✓

The improvements are consistent rather than dramatic — and the researchers say so plainly. What makes these numbers meaningful is not their size but their universality. Late chunking improved results for every model tested, every chunking strategy tested, and every dataset tested. A method that works reliably across all conditions is more valuable in production than a method that shows large gains in ideal conditions and fails elsewhere.

Small Chunks
The smaller the chunk, the larger the advantage. A very small chunk — a single sentence, a short phrase — has almost no internal context to rely on. It is almost entirely dependent on surrounding text for its meaning. Late chunking delivers that surrounding context. This is where the gap between naive and late chunking is widest.
Coherent Docs
Documents where surrounding context is genuinely relevant to every passage — technical articles, legal documents, structured entity records — benefit maximally from late chunking. The method's core assumption is that context matters. In coherent documents, it always does.
Long Documents
Combined with the long late chunking extension, the method recovers signal that truncation would otherwise discard. For documents where the relevant passage might be anywhere across dozens of pages, long late chunking ensures the entire document is processed with contextual continuity intact.
Needle Tasks
The researchers report this directly and without hedging: on synthetic datasets where a short relevant passage is deliberately embedded in completely unrelated surrounding text, late chunking performs worse than naive chunking. The method faithfully encodes that surrounding irrelevance into every chunk embedding. When context is noise, encoding it is harmful. The researchers call this out. It earns trust.

SPAN POOLING TRAINING AND THE COST QUESTION 05 / 06

Late chunking works without any additional training. The researchers demonstrate this clearly across all their benchmarks. But they also ask: can we do even better if we fine-tune an embedding model specifically to perform span-level pooling rather than full-document pooling?

Their answer is yes — modestly. The fine-tuning method they call span pooling trains the model on query-document pairs where the training signal comes not from the full document embedding but from a specific annotated span within the document. The model learns to concentrate the relevant information from a full-document context pass into a smaller window's pooled embedding.
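The paper cites the InfoNCE contrastive objective (their reference to van den Oord et al., [22]). As a rough sketch of what such a training signal looks like when the positives are span-pooled embeddings — the temperature value and exact setup here are our assumptions, not the paper's recipe:

```python
import numpy as np

def info_nce(query_vecs, span_vecs, temperature=0.05):
    # Contrastive loss over a batch: row i of span_vecs is the
    # span-pooled embedding that matches query i (the positive pair);
    # every other row in the batch serves as an in-batch negative.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    s = span_vecs / np.linalg.norm(span_vecs, axis=1, keepdims=True)
    logits = (q @ s.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the positives on the diagonal.
    return float(-np.mean(np.diag(log_probs)))
```

The key difference from ordinary embedding training is where `span_vecs` comes from: each positive is pooled from an annotated span inside a full-document context pass, not from the span embedded in isolation.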

The improvements from span pooling training are consistent but small. The researchers are candid about why: the training data is limited to roughly 470,000 pairs drawn entirely from Wikipedia. A more diverse training set — particularly one drawn from the same domain as the target retrieval corpus — would likely produce meaningfully larger gains. This is an open invitation for anyone with a domain-specific corpus to explore further.

The researchers compare late chunking directly to a contextual embedding approach described in a blog post from Anthropic — a method that uses a large language model to prepend relevant context to each chunk before embedding. Both methods produce nearly identical retrieval quality. Both dramatically outperform naive chunking.

The difference is cost. The LLM-based approach requires a full language model inference call for every chunk in every document — at scale, this is a significant and compounding expense. Late chunking produces the same quality of contextual embedding using only the embedding model itself, with no additional LLM call, at no additional cost beyond what naive chunking already requires.

What This Means for Large Corpora

If you are processing thousands of documents with tens of thousands of chunks each, the difference between one embedding model call per chunk and one embedding model call plus one LLM call per chunk is not a rounding error. It is a pipeline that is economically viable versus one that is not — or one that requires significant infrastructure investment to maintain.
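A back-of-envelope comparison makes the compounding visible. All unit prices and corpus sizes below are purely hypothetical numbers of our own choosing, there only to illustrate the shape of the gap:

```python
# Back-of-envelope pipeline cost. Every figure here is a hypothetical
# assumption, illustrating how a per-chunk LLM call compounds at scale.
docs = 5_000
chunks_per_doc = 20_000
embed_cost = 0.00001   # assumed $ per embedding-model call
llm_cost = 0.001       # assumed $ per LLM context-generation call

# Late chunking (and naive chunking): one embedding call per chunk.
late_chunking_total = docs * chunks_per_doc * embed_cost

# LLM-based contextual embedding: embedding call PLUS LLM call per chunk.
llm_contextual_total = docs * chunks_per_doc * (embed_cost + llm_cost)
```

Under these assumed prices the LLM-based pipeline costs two orders of magnitude more for the same corpus; the exact ratio depends entirely on real per-call pricing, but the structure of the gap does not.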

Late chunking closes that gap entirely. The quality is equivalent. The cost is not.

WHAT THIS PAPER ILLUMINATED FOR US 06 / 06

We work on a different problem. But reading this paper carefully, several things came into sharp focus.

The central insight of late chunking — that the relationship between a piece of text and its surrounding document is as semantically important as the text itself — is not specific to embedding pipelines. It is a general principle about how meaning works in interconnected systems. A chunk without its document context is impoverished the same way an entity without its graph context is impoverished. The surrounding structure is not decorative. It is load-bearing.

The Root-LD federation we are building encodes this principle structurally. Every entity carries three layers: an identity anchor, a content body, and a recursive edge map that links it to related entities across all domains. That recursive layer is, in essence, the surrounding context that late chunking preserves at the embedding level. The researchers arrived at this through retrieval benchmarks. We arrived at it through knowledge graph architecture. The convergence is not coincidental — it reflects something true about how interconnected information needs to be represented to be useful to machines.

The long late chunking approach resonates especially for document-intensive applications where the relevant signal might be distributed across dozens of pages, where patterns emerge from the relationship between page 3 and page 27 rather than from either page alone. Standard chunking cannot see those patterns. A processing strategy built on contextual continuity across the full document — with overlapping windows and accumulated context at every seam — is the architectural answer. The researchers' work gives that intuition empirical grounding.

We also note the span pooling training finding as a direction worth watching. The researchers acknowledge that their training data — drawn entirely from Wikipedia — limits the gains. Domain-specific span-annotated training data, built from the same type of documents you intend to retrieve against, is likely where the meaningful performance frontier lies. This is an open research direction and one we find genuinely exciting.

We are grateful to Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, and Han Xiao for publishing this work openly, sharing their code, and describing their results with the kind of precision and honesty that makes research actually useful to build on.

Read the full paper — arXiv:2409.04701v3
REFERENCES · PRIMARY SOURCES
[1]
Anthropic — Introducing Contextual Retrieval, 2024. anthropic.com/news/contextual-retrieval
[2]
Callan, J.P. — Passage-level evidence in document retrieval. SIGIR '94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 302–310. Springer, 1994.
[3]
Chen, T., Wang, H., Chen, S., Yu, W., Ma, K., Zhao, X., Zhang, H., & Yu, D. — Dense X Retrieval: What retrieval granularity should we use? Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15159–15177. Association for Computational Linguistics, 2024.
[4]
Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G., & Ravi, S. — GoEmotions: A dataset of fine-grained emotions. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4040–4054, 2020.
[5]
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. — BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, pp. 4171–4186. Association for Computational Linguistics, 2019. aclanthology.org/N19-1423
[6]
Günther, M., Ong, J., Mohr, I., Abdessalem, A., Abel, T., Akram, M.K., Guzman, S., Mastrapas, G., Sturua, S., Wang, B., et al. — Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents. arXiv preprint, 2023. arXiv:2310.19923
[7]
Jha, R., Wang, B., Günther, M., Sturua, S., Akram, M.K., & Xiao, H. — Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever. arXiv preprint, 2024. arXiv:2408.16672
[8]
Joshi, M., Choi, E., Weld, D.S., & Zettlemoyer, L. — TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1601–1611, 2017.
[9]
Kamradt, G. — 5 Levels of Text Splitting. FullStackRetrieval-com/RetrievalTutorials, GitHub, 2024.
[10]
Khattab, O. & Zaharia, M. — ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48, 2020.
[11]
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[12]
Luo, K., Liu, Z., Xiao, S., Zhou, T., Chen, Y., Zhao, J., & Liu, K. — Landmark Embedding: A chunking-free embedding method for retrieval augmented long-context large language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp. 3268–3281. Association for Computational Linguistics, 2024.
[13]
Nussbaum, Z., Morris, J.X., Duderstadt, B., & Mulyar, A. — Nomic Embed: Training a Reproducible Long Context Text Embedder. arXiv preprint, 2024. arXiv:2402.01613
[14]
Press, O., Smith, N., & Lewis, M. — Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. International Conference on Learning Representations (ICLR), 2022.
[15]
Reimers, N. & Gurevych, I. — Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th IJCNLP, pp. 3982–3992, 2019.
[16]
Safjan, K. — From Fixed-Size to NLP Chunking — A Deep Dive into Text Chunking Techniques. Krystian's Safjan Blog, 2023.
[17]
Salton, G., Allan, J., & Buckley, C. — Approaches to passage retrieval in full text information systems. Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 49–58, 1993.
[18]
Sturua, S., Mohr, I., Akram, M.K., Günther, M., Wang, B., Krimmel, M., Wang, F., Mastrapas, G., Koukounas, A., Wang, N., et al. — jina-embeddings-v3: Multilingual Embeddings with Task LoRA. arXiv preprint, 2024. arXiv:2409.10173
[19]
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., & Liu, Y. — RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, vol. 568, p. 127063, 2024.
[20]
Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. — BEIR: A Heterogeneous Benchmark for Zero-Shot Evaluation of Information Retrieval Models. Thirty-fifth Conference on Neural Information Processing Systems — Datasets and Benchmarks Track, 2021. openreview.net
[21]
Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. — FEVER: A Large-Scale Dataset for Fact Extraction and VERification. NAACL-HLT, 2018.
[22]
van den Oord, A., Li, Y., & Vinyals, O. — Representation Learning with Contrastive Predictive Coding. CoRR, abs/1807.03748, 2018. arXiv:1807.03748
[23]
Zhou, Y., Dai, S., Cao, Z., Zhang, X., & Xu, J. — Length-Induced Embedding Collapse in Transformer-Based Models. arXiv preprint, 2024. arXiv:2410.24200
[24]
Zhu, D., Wang, L., Yang, N., Song, Y., Wu, W., Wei, F., & Li, S. — LongEmbed: Extending Embedding Models for Long Context Retrieval. arXiv preprint, 2024. arXiv:2404.12096
SYSTEM STATUS
Page: RESEARCH
Date: 2026.03.04
Paper: LATE CHUNKING
Authors: GÜNTHER · MOHR · WILLIAMS · WANG · XIAO
Orgs: JINA AI · WEAVIATE
Domain: cs.CL · cs.IR
arXiv: 2409.04701v3
Improvement: +3.63% rel.
vs. Naive: Consistent ✓
Add. Training: NOT REQUIRED
JSON-LD: PENDING
Root-LD: PENDING
Human Verified: TRUE