Text Splitting/chunking
Goal
Your goal is not to chunk for chunking's sake; the goal is to get data into a format where it can be retrieved for value later!
- Text splitting/chunking
- process of splitting your data into smaller pieces
- The GOAL → to prepare your data for the anticipated tasks you actually have
- “What’s the optimal way for me to pass the data my LLM needs for its task?”
- one of the most foundational things an AI practitioner will do
- Why??
- you can’t pass all your data at once; LLMs have context windows
- LLMs also need unnecessary info removed from the data (high signal-to-noise ratio), because otherwise it will degrade the performance of the overall app
- What is retrieval?
- The act of gathering the right information for your LLM
- there are many retrieval strategies
- Retrieval evaluations
- You need proper evaluations to see if your retrievals are working
- Ragas
- Recommended chunk size
- ~2000-6000 characters, depending on the app
- website visualizer
5 Main ways of Text Splitting
- Outline
- Character splitting - split by fixed static character limit
- Recursive Character splitting - separate with separators
- Document specific text splitting
- python/js/pdfs etc → include multimodal
- Semantic splitting
- Not naive like the previous 3 (those focus on the physical positioning/structure of text chunks); semantic splitting looks at the “what” and the “why” of the text (understanding its context)
- Agentic splitting
- You build an agent-like system that will review the text and split it for us
- Bonus: Indexing
- LangChain will be the main tool
- Google Colab Link → https://colab.research.google.com/drive/1OmUlatumVfBBpGqUCGblr457Vt-lg7v_?usp=sharing
Character splitting
- split by fixed static character limit
- easy and simple, but super rigid; not used in production
- terms
- chunk_size = the number of characters you would like in your chunks
- chunk_overlap = the amount you would like your sequential chunks to overlap, to avoid cutting a single piece of context into multiple pieces. This also leads to duplicated data across chunks
Manual:
text = "This is the text I would want to chunk."
chunks = []
chunk_size = 35
for i in range(0, len(text), chunk_size):
chunk = text[i: i + chunk_size]
chunks.append(chunk)
- you can also use other tools, like LlamaIndex
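A minimal sketch of the same manual loop with chunk_overlap added (the chunk_text helper name is my own, not from a library):

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    # step forward by chunk_size minus overlap, so each chunk
    # repeats the tail of the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "This is the text I would want to chunk."
chunks = chunk_text(text, chunk_size=15, chunk_overlap=5)
```

With chunk_overlap=5, the last 5 characters of each chunk reappear at the start of the next, which is where the duplicate data mentioned above comes from.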
from langchain_text_splitters import CharacterTextSplitter
text = "This is a sentence that I will test Langchain's text splitter."
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='', strip_whitespace=False)
text_splitter.create_documents([text])
[Document(metadata={}, page_content='This is a sentence that I will test'),
 Document(metadata={}, page_content=" Langchain's text splitter.")]
- CharacterTextSplitter
- create_documents expects a list
- chunk_overlap - the tail of the previous chunk will overlap with the head of the next
- separator - we can split by anything! → punctuation, a space, a character, etc.
- strip_whitespace - True by default
- Result
- each chunk is a Document object → in LangChain it’s an object that can hold strings and metadata. Our content is held in page_content
Recursive Character splitting
- The standard chunking method for text embedding/RAG
- Divides large documents into smaller pieces based on a maximum chunk size limit and a prioritized list of separators
- "\n\n" - double new line / paragraph breaks
- "\n" - new lines
- " " - spaces
- "" - individual characters
- so the prioritized ordering is: paragraph → line → word → character
- what it does
- Splits the text using the highest-priority separator first (e.g., "\n\n" to split by paragraphs)
- Checks if the resulting chunks are within the defined maximum chunk_size
- if a chunk is too large, the algorithm recursively applies the next separator in the list, only to that specific oversized chunk
- Typically uses a chunk_overlap parameter to duplicate a small number of characters at the boundaries of adjacent chunks, to prevent cutting off context abruptly
- Basically
- First, split the text by the highest-priority separator (e.g., "\n\n" for paragraphs)
- Check the size of the resulting pieces
- If a piece is bigger than chunk_size: recursively split that specific oversized piece using the next separator in the list (e.g., "\n", then " ", then "")
- Once pieces are small enough: merge these smaller pieces sequentially into a single chunk until the combined length reaches the chunk_size limit (this prevents having unnecessarily tiny chunks)
- Repeat until the entire document is processed
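The steps above can be sketched in plain Python. This is a simplified illustration (no merging step, no overlap), not LangChain's actual implementation:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    # base case: the text already fits in one chunk
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # last resort: hard cut by characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # only the oversized piece falls through to the next separator
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
    return chunks
```

For example, recursive_split("aaa bbb\n\nccc", 5) first splits on the paragraph break, then re-splits only the oversized "aaa bbb" piece on spaces.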
from langchain_text_splitters import RecursiveCharacterTextSplitter
text = """What I Worked On
February 2021
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
"""
text_splitter = RecursiveCharacterTextSplitter(
chunk_size = 100,
chunk_overlap = 0,
length_function = len,
is_separator_regex = False
)
text_splitter.create_documents([text])
- attributes
- chunk_size
- chunk_overlap
- length_function - use the len function to get the length of the string
- is_separator_regex - set to False to treat the separators as plain strings rather than regex
Document specific text splitting
Python docs, markdown, pdf, etc..
- Start with markdown - MarkdownTextSplitter - splits by headers
from langchain_text_splitters import MarkdownTextSplitter
splitter = MarkdownTextSplitter(
chunk_size = 40,
chunk_overlap=0
)
markdown_text = """
# Fun in California
## Driving
## Hiking
"""
splitter.create_documents([markdown_text])
- Python docs - PythonCodeTextSplitter - splits on \nclass, \ndef, etc.
- JavaScript docs - import RecursiveCharacterTextSplitter and Language, then RecursiveCharacterTextSplitter.from_language(language=Language.JS, ...)
PDFS
- PDFs are an extremely common data type for LLM applications, and they often contain tables with important information
- https://unstructured.io/ → very convenient!
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json
filename = "sonocraftar.pdf"
# extract
elements = partition_pdf(
filename=filename,
# unstructured helpers
strategy="hi_res",
infer_table_structure=True,
model_name="yolox"
)
- elements - will give us NarrativeText and Table elements
- Tables will be rendered as HTML → the LLM will understand them better
Multi-Modal (text+image)
- How will u deal with images?
from typing import Any
from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf
filepath = ".../...pdf"
raw_pdf_elems = partition_pdf(
filename = filepath,
# using pdf format to find embedded img blocks
extract_images_in_pdf=True,
# use layout model to get bounding boxes
# titles are any sub-section of the document
# post process to aggregate text once we have title
chunking_strategy="by_title",
# chunking params to aggregate text blocks
# attempt to create a new chunk at 3800 chars
# attempt to keep chunks > 2000 chars
# hard max on chunk size
max_characters=4000,
new_after_n_chars=3800,
combine_text_under_n_chars=2000,
image_output_dir_path="static/pdfImages/"
)
- extract_images_in_pdf - it will chunk but treat images differently
- the images will be extracted to
static/pdfImages/ - we can do embeddings for images, but usually models provide for just one (ex. embedding model only for text or only for images)
- there is the CLIP model, but we can do better
- what we’ll do
- generate a text summary of each image + embed that text summary
- https://www.youtube.com/watch?v=8OJC21T2SL4 → 28:00
Semantic splitting
- Levels 1-3
- all took physical positioning into account. We just assumed paragraphs hold similar info. What if they don't? What if we have really messy data?
- analogy: grouping books by size
- here we will try an embedding-based chunking method
- makes use of meaning + context → like grouping books by genre/author
- embeddings represent the semantic meaning of a string, so semantically similar chunks should be held together
- https://colab.research.google.com/drive/1OmUlatumVfBBpGqUCGblr457Vt-lg7v_?usp=sharing
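A toy sketch of the idea, using a bag-of-words vector as a stand-in for a real embedding model (the function names and the 0.2 threshold are my own choices for illustration):

```python
import math

def embed(sentence, vocab):
    # stand-in for a real embedding model: bag-of-words counts
    words = sentence.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        # keep consecutive sentences together while they stay similar
        if cosine(embed(prev, vocab), embed(sent, vocab)) >= threshold:
            current.append(sent)
        else:
            chunks.append(" ".join(current))
            current = [sent]
    chunks.append(" ".join(current))
    return chunks
```

For the real thing, LangChain ships a SemanticChunker (in langchain_experimental) that uses actual embeddings and breakpoint thresholds instead of this toy similarity.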
Agentic splitting
- Can we instruct an LLM to do this task like a human would?
- Me:
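One way the idea could be sketched: loop over the paragraphs and let an LLM judge whether each one belongs with the running chunk. The ask_llm interface below is hypothetical, and the same_topic stub merely stands in for a real model call:

```python
def agentic_split(paragraphs, ask_llm):
    # ask_llm(current, candidate) -> bool: "does candidate belong with current?"
    chunks = [[paragraphs[0]]]
    for para in paragraphs[1:]:
        if ask_llm("\n".join(chunks[-1]), para):
            chunks[-1].append(para)
        else:
            chunks.append([para])
    return ["\n".join(c) for c in chunks]

# stub "LLM" that groups paragraphs sharing a keyword, just to exercise the loop
def same_topic(current, candidate):
    return "chunking" in current and "chunking" in candidate
```

In a real system the judge call is an LLM prompt, which makes this the slowest and most expensive splitting level, but the closest to how a human would do it.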