
Semantic Chunking Processor #95

@kouloumos

Description

We need to implement a single processor to handle semantic chunking for documents. The processor will:

  • Break documents into chunks based on semantic structure.
  • Store these chunks using a nested schema to model the relationship between the resource and its chunks.

This approach replaces the earlier plan of having two separate processors (semantic chunking and markdown-based chunking).

Requirements

  1. Customizable Chunking Strategy

    • The processor should allow experimentation with different chunking strategies.
    • Ensure the strategy is modular so it can be easily swapped or modified as needed (see the sketch after this list).
  2. Input Format

    • The input for chunking will be the body field of the document.
    • This field is consistently formatted in markdown across all documents.
  3. Schema for Storing Chunks

    Example Document with Nested Schema

    {
      "id": "delving-bitcoin-1257-11-3754",
      "title": "Understanding Bitcoin",
      "body": "Full markdown content here...",
      "chunks": [
        {
          "id": "delving-bitcoin-1257-11-3754-chunk1",
          "title": "Introduction to Bitcoin",
          "body": "Bitcoin is a decentralized digital currency..."
        },
        {
          "id": "delving-bitcoin-1257-11-3754-chunk2",
          "title": "How Bitcoin Works",
          "body": "Bitcoin operates on a blockchain..."
        },
        {
          "id": "delving-bitcoin-1257-11-3754-chunk3",
          "body": "Additional details on mining..."
        }
      ],
      "chunking_strategy": "v1.0"
    }
  4. Version Control

    • Add a chunking strategy version field to the document schema to track the chunking method used.
  5. Configurable Text Limit

    • Not all documents need to be chunked.
    • Introduce a configurable text length threshold to decide whether a document is chunked.
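
As a rough illustration of how the modular strategy (requirement 1), the version field (requirement 4), and the length threshold (requirement 5) could fit together, here is a minimal Python sketch. The names (ChunkingStrategy, ParagraphChunker, SemanticChunkingProcessor, min_body_length) and the 2000-character default are placeholders for discussion, not a proposed final design:

    from dataclasses import dataclass
    from typing import Optional, Protocol


    @dataclass
    class Chunk:
        body: str
        title: Optional[str] = None  # chunk titles are still an open question (see below)


    class ChunkingStrategy(Protocol):
        """Pluggable interface so strategies can be swapped and versioned."""
        version: str

        def chunk(self, body: str) -> list[Chunk]: ...


    class ParagraphChunker:
        """Placeholder strategy: split the markdown body on blank lines."""
        version = "v1.0"

        def chunk(self, body: str) -> list[Chunk]:
            return [Chunk(body=p.strip()) for p in body.split("\n\n") if p.strip()]


    class SemanticChunkingProcessor:
        def __init__(self, strategy: ChunkingStrategy, min_body_length: int = 2000):
            self.strategy = strategy
            # Configurable threshold: documents shorter than this are left unchunked.
            self.min_body_length = min_body_length

        def process(self, document: dict) -> dict:
            body = document.get("body", "")
            if len(body) < self.min_body_length:
                return document  # below the threshold: no chunking
            document["chunks"] = [
                {
                    "id": f"{document['id']}-chunk{i}",
                    **({"title": c.title} if c.title else {}),
                    "body": c.body,
                }
                for i, c in enumerate(self.strategy.chunk(body), start=1)
            ]
            document["chunking_strategy"] = self.strategy.version
            return document

A real strategy would replace ParagraphChunker with an embedding- or heading-aware implementation; the point is only that the processor depends on the strategy interface, so strategies can be versioned and swapped independently.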

Additional Notes

  • The previous plan involving two processors has been abandoned as it added unnecessary complexity for the scraper.

Open Questions

  1. Should we assign titles to chunks? This could be especially valuable for deriving chapter titles in transcript documents.
  2. If semantic chunking leverages embeddings, should we store these embeddings as part of the chunk data for potential future applications?
  3. Can existing markdown headings enhance the semantic chunking process? For example, should they be used as a guide or a starting point for chunk definitions? (A rough sketch of this idea follows below.)
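
On question 3, one possible direction (illustrative only, not a decision) is to treat existing markdown headings as candidate chunk boundaries and let the semantic step merge or split those sections further. A minimal Python sketch of that idea, assuming ATX-style "#" headings in the body; the helper name split_on_headings is hypothetical:

    import re

    HEADING = re.compile(r"^(#{1,6})\s+(.*)", re.MULTILINE)


    def split_on_headings(body: str) -> list[dict]:
        """Treat existing markdown headings as candidate chunk boundaries."""
        matches = list(HEADING.finditer(body))
        if not matches:
            return [{"title": None, "body": body}]

        sections = []
        if matches[0].start() > 0:
            # Text before the first heading becomes an untitled leading section.
            sections.append({"title": None, "body": body[: matches[0].start()].strip()})

        for i, m in enumerate(matches):
            end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
            sections.append({"title": m.group(2).strip(), "body": body[m.end():end].strip()})
        return sections

Heading text captured this way would also be a natural source for chunk titles (question 1).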
