We need to implement a single processor to handle semantic chunking for documents. The processor will:
- Break documents into chunks based on semantic structure.
- Store these chunks using a nested schema to model the relationship between the resource and its chunks.
This approach replaces the earlier plan of having two separate processors (semantic chunking and markdown-based chunking).
Requirements
- Customizable Chunking Strategy
  - The processor should allow experimentation with different chunking strategies.
  - Ensure the strategy is modular and can be easily modified as needed (see the sketch after this list).
- Input Format
  - The input for chunking will be the `body` field of the document.
  - This field is consistently formatted in markdown across all documents.
- Schema for Storing Chunks
  - Use a nested schema to represent the relationship between a resource and its chunks. Refer to feat(scrapers): Add GitHub metadata scraper for issues and pull requests #93 for a similar implementation of nested schemas.
  - Explore assigning a title to each chunk where possible. This is particularly useful for transcripts, where the title can serve as a chapter title.
Example Document with Nested Schema
{ "id": "delving-bitcoin-1257-11-3754", "title": "Understanding Bitcoin", "body": "Full markdown content here...", "chunks": [ { "id": "delving-bitcoin-1257-11-3754-chunk1", "title": "Introduction to Bitcoin", "body": "Bitcoin is a decentralized digital currency...", }, { "id": "delving-bitcoin-1257-11-3754-chunk2", "title": "How Bitcoin Works", "body": "Bitcoin operates on a blockchain...", }, { "id": "delving-bitcoin-1257-11-3754-chunk3", "body": "Additional details on mining...", } ], "chunking_strategy": "v1.0" } -
- Version Control
  - Add a chunking strategy version field to the document schema to track the chunking method used (see chunking_strategy in the example above).
- Configurable Text Limit
  - Not all documents need to be chunked.
  - Introduce a configurable text length threshold to decide whether a document is chunked (see the sketch after this list).
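Tying the first and last requirements together, a minimal sketch of what a pluggable strategy plus a configurable length threshold could look like, building on the Resource and Chunk dataclasses sketched above; ChunkingStrategy, maybe_chunk, and min_chunking_length are illustrative names, and the default threshold is arbitrary:

```python
from typing import Protocol

class ChunkingStrategy(Protocol):
    version: str

    def chunk(self, body: str) -> list[Chunk]:
        """Split a markdown body into semantically coherent chunks."""
        ...

def maybe_chunk(resource: Resource, strategy: ChunkingStrategy,
                min_chunking_length: int = 2000) -> Resource:
    """Chunk the resource only if its body exceeds the configured threshold."""
    if len(resource.body) < min_chunking_length:
        return resource  # short documents are stored without chunks
    resource.chunks = strategy.chunk(resource.body)
    resource.chunking_strategy = strategy.version
    return resource
```

Keeping the strategy behind a small interface like this lets us swap in a new chunking approach by changing one object and bumping its version string, without touching the rest of the processor.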
Additional Notes
- The previous plan involving two processors has been abandoned as it added unnecessary complexity for the scraper.
Open Questions
- Should we assign titles to chunks? This could be especially valuable for deriving chapter titles in transcript documents.
- If semantic chunking leverages embeddings, should we store these embeddings as part of the chunk data for potential future applications?
- Can existing markdown headings enhance the semantic chunking process? For example, should they be used as a guide or a starting point for chunk definitions?
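On the last question, one rough illustration (not a settled approach) of how existing markdown headings could seed initial chunk boundaries and candidate chunk titles before any embedding-based refinement; split_on_headings and the regex are hypothetical:

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)

def split_on_headings(body: str) -> list[dict]:
    """Use markdown headings as initial chunk boundaries; the heading text
    becomes a candidate chunk title."""
    matches = list(HEADING_RE.finditer(body))
    if not matches:
        return [{"title": None, "body": body}]
    chunks = []
    # Any text before the first heading becomes an untitled chunk.
    preamble = body[: matches[0].start()].strip()
    if preamble:
        chunks.append({"title": None, "body": preamble})
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        section = body[m.end(): end].strip()
        chunks.append({"title": m.group(2).strip(), "body": section})
    return chunks
```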