
Chunking & Search

RAG works by finding pieces of documents (aka “chunks”) and passing them alongside a query (usually a question) to an LLM. The two big problems to solve for any RAG system are:

  1. How do I convert documents into chunks?
  2. How do I find relevant chunks for a query?

Standard RAG Pipeline

The standard RAG pipeline splits these into two distinct steps:

A standard RAG pipeline showing chunking steps on the left and search steps on the right. The chunking pipeline extracts text from the PDF, converts the text into chunks, then indexes those chunks. The search pipeline takes a query, loads the chunks, then combines the query and chunks into an LLM prompt.
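As a rough sketch, that chunk-then-search flow might look like the following. The fixed-size splitting and word-overlap scoring here are illustrative stand-ins, not a specific library's API; a real pipeline would use a PDF text extractor, an embedding model, and a vector index.

```python
# Minimal sketch of the standard chunk-then-search pipeline. Fixed-size
# splitting and word-overlap scoring are stand-ins for a real text
# extractor, embedding model, and vector index.

def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    # Chunking happens up front, with no knowledge of future queries.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def score(query: str, chunk: str) -> int:
    # Stand-in for vector similarity: count shared words.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(query: str, chunks: list[str], k: int = 3) -> str:
    # Retrieve the top-k chunks and combine them with the query.
    top = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    return "Context:\n" + "\n\n".join(top) + f"\n\nQuestion: {query}"
```

Note that the chunk boundaries are fixed before any query arrives, which is exactly the first downside listed below.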

This can work well for smaller documents or more structured data, but it has a few downsides:

  1. Chunking happens before we index. We can’t take the query into account when generating chunks.
  2. Layout information (font sizes, colors, spacing, etc.) is thrown away.

Our Approach

Instead of turning text into chunks, we analyze the layout of a document the way a person would read it. Most unstructured documents have a tree-like structure, with headers and subheaders grouping related content together:

A tree structure with a title as the root and two headers as children. The leftmost header has two paragraphs as children. The rightmost header has a list as its child. The list has two list items as children

We convert each document into a tree. The document is segmented into Content Blocks, which are the building blocks of the document tree:

  • Headers: A header contains children, which may be any other content block, including other headers.
  • Lists: These are ordered or unordered sequences, usually indicated by bullets, letters, or numbers.
  • ListItems: These are discrete items within a list, usually no more than a sentence or a paragraph.
  • Paragraphs: Usually no more than a few sentences.
  • Tables: (Coming Soon) Sometimes these are used to display data, and sometimes used for formatting.
  • Figures: (Coming Soon) These are diagrams, graphs, and other primarily visual elements of a document.
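One way to picture that tree is as a small set of node types. The classes below are an illustrative sketch only: the names mirror the block types above, but the class and field names are assumptions, not our actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ContentBlock:
    # Base node of the document tree (names and fields are hypothetical).
    text: str = ""
    children: list["ContentBlock"] = field(default_factory=list)

@dataclass
class Header(ContentBlock):
    # May contain any other content block as a child, including other Headers.
    level: int = 1

@dataclass
class Paragraph(ContentBlock):
    # Usually no more than a few sentences; typically a leaf node.
    pass

@dataclass
class List(ContentBlock):
    # Ordered or unordered sequence whose children are ListItems.
    ordered: bool = False

@dataclass
class ListItem(ContentBlock):
    # A single bullet, letter, or numbered entry within a List.
    pass
```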

Layout Analysis

A document’s layout tells readers how to parse it. Take a look at a page from SAP’s Intelligent Sales Execution doc:

A page of text with headers, subheaders, lists, and sublists. Each is highlighted by a different color based on the type of content block available

It’s immediately obvious that 1 Intelligent Sales Execution is the main header, and everything else on the page is “underneath” it. 1.1 Prerequisites is a subheader, which mostly contains a list. Depending on what’s on the next page, that content may still fall under 1.1 Prerequisites or it may start an entirely new subheader. Similarly, even though the Note is closer to 1.1 Prerequisites, it’s actually associated with the main header.

The only way to know these things is to look at the layout of the document. Below is the same page run through our parser:

A page of text with headers, subheaders, lists, and sublists. Each is highlighted by a different color based on the type of content block available.

Our system analyzes the layout and extracts content blocks in the same hierarchy in which they appear in the PDF.
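Using the sketch classes from the previous section, the extracted hierarchy for this page would look roughly like the following. The text is abridged and the exact structure is an assumption; the shape of the tree is what matters, not the format.

```python
page = Header(
    text="1 Intelligent Sales Execution",
    children=[
        # The Note sits visually near 1.1 Prerequisites but attaches to the
        # main header in the extracted hierarchy.
        Paragraph(text="Note: ..."),
        Header(
            text="1.1 Prerequisites",
            level=2,
            children=[
                List(children=[
                    ListItem(text="..."),
                    ListItem(text="Additional Category - ..."),
                ]),
            ],
        ),
    ],
)
```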

Search & Dynamic Chunking

Let’s say someone searches for “What is a Sales Forecast Category?”. The last bullet point in the list would be a match. On its own, however, that bullet is missing context: it discusses Sales Forecast Category in relation to something, but assumes you already know what that something is. In fact, every bullet in this list is like that - they only make sense once you’ve also seen the title and its first paragraph.

Traditional chunking based on sentence, paragraph, and/or page boundaries often misses these contextual clues. Without the layout information, it’s very difficult to know that text that’s multiple paragraphs, or even multiple pages, away is necessary context.

Our system solves this by searching the entire document hierarchy, rather than just chunks of text. It would find the ListItem “Additional Category - …”, walk up the hierarchy to the header, and then walk back down to gather the relevant context.
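A sketch of that walk, again using the hypothetical classes from earlier and assuming search has already identified the matching ListItem; the real implementation may choose context differently.

```python
def path_to(root: ContentBlock, target: ContentBlock) -> list[ContentBlock] | None:
    # Depth-first search that returns the root-to-target path, or None.
    if root is target:
        return [root]
    for child in root.children:
        sub = path_to(child, target)
        if sub is not None:
            return [root] + sub
    return None

def dynamic_chunk(root: ContentBlock, match: ContentBlock) -> list[ContentBlock]:
    # Walk up: keep every ancestor Header of the matched block.
    # Walk down: keep the introductory Paragraphs directly under those
    # Headers, plus the matched block itself.
    path = path_to(root, match) or []
    chunk: list[ContentBlock] = []
    for ancestor in path[:-1]:
        if isinstance(ancestor, Header):
            chunk.append(ancestor)
            chunk.extend(c for c in ancestor.children if isinstance(c, Paragraph))
    chunk.append(match)
    return chunk
```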

This lets us dynamically create a chunk from both the document’s layout metadata and the query. We generate markdown that can be passed directly to an LLM, or analyzed to further refine your pipeline.
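As a final illustrative step, the selected blocks can be rendered to markdown before being placed in the prompt. This is a sketch of the idea; the actual markdown output may be richer.

```python
def to_markdown(blocks: list[ContentBlock]) -> str:
    # Render the dynamically assembled chunk as markdown for an LLM prompt.
    lines = []
    for block in blocks:
        if isinstance(block, Header):
            lines.append("#" * block.level + " " + block.text)
        elif isinstance(block, ListItem):
            lines.append("- " + block.text)
        else:
            lines.append(block.text)
    return "\n\n".join(lines)
```

For the example query, the rendered chunk reads as the headers, their introductory paragraphs, and the matched bullet, which is exactly the context that bullet was missing on its own.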