I am trying to create chunks of at most 350 characters with a 100-character overlap.
I understand that chunk_size is an upper limit, so I may get chunks shorter than that. But why am I not getting any chunk_overlap?
Is it because the overlap also has to split on one of the separator characters? In other words, do I only get up to 100 characters of chunk_overlap if there is a separator within 100 characters of the split point?
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Trailing backslashes join the source lines; the only real line breaks
# in the text are the two embedded "\n" characters between the paragraphs.
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.\n\n\
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=350,
    chunk_overlap=100,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)

x = r_splitter.split_text(some_text)
print(x)

# Print the length of each chunk
for thing in x:
    print(len(thing))
Output:
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.", 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
248
243
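
The hypothesis in the question can be tried out in plain Python. The sketch below is a simplified model, not LangChain's actual implementation: it assumes the splitter first cuts the text on a separator and then packs whole pieces into chunks, carrying over only whole trailing pieces as overlap. The names `merge_pieces` and `joiner` are hypothetical, and the placeholder strings stand in for the two paragraphs (248 and 243 characters) from the output above.

```python
def merge_pieces(pieces, chunk_size, chunk_overlap, joiner=" "):
    """Greedily pack whole pieces into chunks (simplified, assumed model).

    Overlap is built only from whole trailing pieces of the previous chunk,
    so a piece longer than chunk_overlap can never be carried over.
    """
    chunks, current, total = [], [], 0
    for piece in pieces:
        extra = len(joiner) if current else 0
        if current and total + extra + len(piece) > chunk_size:
            chunks.append(joiner.join(current))
            # Drop pieces from the front until what remains fits in
            # chunk_overlap and still leaves room for the new piece.
            while current and (
                total > chunk_overlap
                or total + len(joiner) + len(piece) > chunk_size
            ):
                removed = current.pop(0)
                total -= len(removed) + (len(joiner) if current else 0)
        # A joiner is counted only when the piece follows an earlier one.
        total += len(piece) + (len(joiner) if current else 0)
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


# Two paragraph-sized pieces, as when the text is cut only on "\n\n".
paragraphs = ["x" * 248, "y" * 243]
coarse = merge_pieces(paragraphs, chunk_size=350, chunk_overlap=100)
print([len(c) for c in coarse])  # [248, 243]: no overlap, 248 > 100

# Eight sentence-sized pieces of 60 characters each.
sentences = [chr(ord("a") + i) * 60 for i in range(8)]
fine = merge_pieces(sentences, chunk_size=350, chunk_overlap=100)
print([len(c) for c in fine])  # [304, 243]
print(fine[1].startswith(sentences[4]))  # True: one sentence is repeated
```

Under this model, overlap appears only when the pieces produced by the separator are themselves no longer than chunk_overlap. If the real splitter works the same way, that would explain the output above: both paragraphs fit within chunk_size, the text is never cut on a finer separator, and neither 248-character nor 243-character piece can fit into a 100-character overlap.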