
Semantic Chunking Processor #95

@kouloumos

Description

We need to implement a single processor to handle semantic chunking for documents. The processor will:

  • Break documents into chunks based on semantic structure.
  • Store these chunks using a nested schema to model the relationship between the resource and its chunks.

This approach replaces the earlier plan of having two separate processors (semantic chunking and markdown-based chunking).

Requirements

  1. Customizable Chunking Strategy

    • The processor should allow experimentation with different chunking strategies.
    • Ensure the strategy is modular so it can be easily swapped or modified as needed (see the sketch after this list).
  2. Input Format

    • The input for chunking will be the body field of the document.
    • This field is consistently formatted in markdown across all documents.
  3. Schema for Storing Chunks

    Example Document with Nested Schema

    {
      "id": "delving-bitcoin-1257-11-3754",
      "title": "Understanding Bitcoin",
      "body": "Full markdown content here...",
      "chunks": [
        {
          "id": "delving-bitcoin-1257-11-3754-chunk1",
          "title": "Introduction to Bitcoin",
          "body": "Bitcoin is a decentralized digital currency..."
        },
        {
          "id": "delving-bitcoin-1257-11-3754-chunk2",
          "title": "How Bitcoin Works",
          "body": "Bitcoin operates on a blockchain..."
        },
        {
          "id": "delving-bitcoin-1257-11-3754-chunk3",
          "body": "Additional details on mining..."
        }
      ],
      "chunking_strategy": "v1.0"
    }
  4. Version Control

    • Add a chunking strategy version field to the document schema to track the chunking method used.
  5. Configurable Text Limit

    • Not all documents need to be chunked.
    • Introduce a configurable text length threshold to decide whether a document is chunked.
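
As a rough illustration of how the modular strategy (requirement 1), the version field (requirement 4), and the length threshold (requirement 5) could fit together, here is a minimal Python sketch. The names (ChunkingStrategy, ParagraphChunker, SemanticChunkingProcessor, min_body_length) and the 2000-character default are placeholders for discussion, not a proposed final design:

    from dataclasses import dataclass
    from typing import Optional, Protocol


    @dataclass
    class Chunk:
        body: str
        title: Optional[str] = None  # chunk titles are still an open question (see below)


    class ChunkingStrategy(Protocol):
        """Pluggable interface so strategies can be swapped and versioned."""
        version: str

        def chunk(self, body: str) -> list[Chunk]: ...


    class ParagraphChunker:
        """Placeholder strategy: split the markdown body on blank lines."""
        version = "v1.0"

        def chunk(self, body: str) -> list[Chunk]:
            return [Chunk(body=p.strip()) for p in body.split("\n\n") if p.strip()]


    class SemanticChunkingProcessor:
        def __init__(self, strategy: ChunkingStrategy, min_body_length: int = 2000):
            self.strategy = strategy
            # Configurable threshold: documents shorter than this are left unchunked.
            self.min_body_length = min_body_length

        def process(self, document: dict) -> dict:
            body = document.get("body", "")
            if len(body) < self.min_body_length:
                return document  # below the threshold: no chunking
            document["chunks"] = [
                {
                    "id": f"{document['id']}-chunk{i}",
                    **({"title": c.title} if c.title else {}),
                    "body": c.body,
                }
                for i, c in enumerate(self.strategy.chunk(body), start=1)
            ]
            document["chunking_strategy"] = self.strategy.version
            return document

A real strategy would replace ParagraphChunker with an embedding- or heading-aware implementation; the point is only that the processor depends on the strategy interface, so strategies can be versioned and swapped independently.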

Additional Notes

  • The previous plan involving two processors has been abandoned as it added unnecessary complexity for the scraper.

Open Questions

  1. Should we assign titles to chunks? This could be especially valuable for deriving chapter titles in transcript documents.
  2. If semantic chunking leverages embeddings, should we store these embeddings as part of the chunk data for potential future applications?
  3. Can existing markdown headings enhance the semantic chunking process? For example, should they be used as a guide or a starting point for chunk definitions? (A rough sketch of this idea follows below.)
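
On question 3, one possible direction (illustrative only, not a decision) is to treat existing markdown headings as candidate chunk boundaries and let the semantic step merge or split those sections further. A minimal Python sketch of that idea, assuming ATX-style "#" headings in the body; the helper name split_on_headings is hypothetical:

    import re

    HEADING = re.compile(r"^(#{1,6})\s+(.*)", re.MULTILINE)


    def split_on_headings(body: str) -> list[dict]:
        """Treat existing markdown headings as candidate chunk boundaries."""
        matches = list(HEADING.finditer(body))
        if not matches:
            return [{"title": None, "body": body}]

        sections = []
        if matches[0].start() > 0:
            # Text before the first heading becomes an untitled leading section.
            sections.append({"title": None, "body": body[: matches[0].start()].strip()})

        for i, m in enumerate(matches):
            end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
            sections.append({"title": m.group(2).strip(), "body": body[m.end():end].strip()})
        return sections

Heading text captured this way would also be a natural source for chunk titles (question 1).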
