Chunky

From DOCX to RAG

A command-line Node.js app that semantically chunks DOCX files, converts the chunks into OpenAI embeddings, stores them in Pinecone, and provides semantic (vector) search and Retrieval-Augmented Generation (RAG). Ask questions and get precise, context-aware GPT answers generated directly from your own content. This proof of concept features robust error handling with automatic retries, detailed state tracking, and batch processing for reliability and speed.

This will quickly evolve into a full, robust RAG service built on AWS serverless technologies, capable of processing numerous file types sourced on disk or in the cloud. Give me a minute.

Prerequisites

  • Node.js (v18+ recommended)
  • OpenAI API Key (get one at OpenAI)
  • Pinecone API Key (get one at Pinecone)

Configuration

Create a .env file in the root of your project before running any commands, and add the following keys (use /.env-sample as a starting point):

OPENAI_API_KEY=your-openai-api-key
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_CLOUD=cloud-platform-like-aws-or-gcp
PINECONE_REGION=us-east-1-or-other
PINECONE_INDEX=arbitrary-name-of-your-index # the index should not already exist
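
The app reads these values from the environment at startup. As a minimal sketch of how that might look (illustrative only — it assumes the dotenv package is used; the actual app may load configuration differently):

// config.js -- illustrative sketch, not the app's actual source
import 'dotenv/config' // populates process.env from .env

// Fail fast if any required key is missing.
const required = [
  'OPENAI_API_KEY',
  'PINECONE_API_KEY',
  'PINECONE_CLOUD',
  'PINECONE_REGION',
  'PINECONE_INDEX',
]
for (const key of required) {
  if (!process.env[key]) throw new Error(`Missing required env var: ${key}`)
}

export const config = {
  openaiApiKey: process.env.OPENAI_API_KEY,
  pineconeApiKey: process.env.PINECONE_API_KEY,
  pineconeCloud: process.env.PINECONE_CLOUD,
  pineconeRegion: process.env.PINECONE_REGION,
  pineconeIndex: process.env.PINECONE_INDEX,
}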

Directory Structure

project-root/
├── files/
│   ├── in/          # place DOCX files here
│   ├── out/         # embeddings generated here
│   └── processed/   # DOCX files moved here after processing
├── src/
│   ├── index.js            # ingestion entry point
│   ├── pinecone-create.js  # pinecone index creation entry point
│   ├── pinecone-delete.js  # pinecone embeddings deletion entry point
│   ├── pinecone-upload.js  # pinecone embeddings upload entry point
│   ├── pinecone-search.js  # pinecone vector search entry point
│   ├── local-search.js     # local embeddings search entry point
│   ├── rag.js
│   └── lib/
│       ├── parser.js
│       ├── chunker.js
│       ├── embedder.js
│       ├── storage.js
│       ├── ingester.js
│       ├── local-sdk.js
│       ├── pinecone-sdk.js
│       └── state-manager.js
├── .env
├── .env-sample      # a starter .env file
├── README.md
├── package.json
└── package-lock.json

Quickstart

  1. Install dependencies:
npm install
  2. Create your Pinecone index (once):
npm run pc-create
  3. Ingest all DOCX files and generate embeddings:
npm start
  4. Upload embeddings to Pinecone:
npm run pc-upload
  5. Ask GPT to answer in the context of your content:
npm run rag "your question here"

App Command Reference

Environment Setup

# install dependencies
npm install

Pinecone Index Management

# create Pinecone index (run once)
npm run pc-create
# or
node src/pinecone-create.js

# delete all vectors from index
npm run pc-delete
# or
node src/pinecone-delete.js
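
Under the hood, index creation with the @pinecone-database/pinecone SDK looks roughly like the sketch below. This is not the app's actual source; the 1536 dimension is an assumption based on OpenAI's text-embedding-3-small model, and cosine is the usual metric for OpenAI embeddings.

import { Pinecone } from '@pinecone-database/pinecone'

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY })

// Create a serverless index sized for OpenAI embeddings (1536 dims for
// text-embedding-3-small; adjust if a different embedding model is used).
await pc.createIndex({
  name: process.env.PINECONE_INDEX,
  dimension: 1536,
  metric: 'cosine',
  spec: {
    serverless: {
      cloud: process.env.PINECONE_CLOUD,   // e.g. 'aws'
      region: process.env.PINECONE_REGION, // e.g. 'us-east-1'
    },
  },
})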

Create Embeddings Locally

# process all DOCX files
npm start
# or
npm run ingest
# or
node src/index.js

# process a single DOCX file
npm start book-slug
# or
npm run ingest book-slug
# or
node src/index.js book-slug 

Generated embeddings are stored in the files/out directory, organized by book slug (the source filename minus its extension).
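
Conceptually, the ingest pipeline is parse → chunk → embed → save. The sketch below illustrates that flow; it is not the real implementation (which lives in src/lib/parser.js, chunker.js, and embedder.js). The mammoth dependency is an assumption (a common DOCX-to-text library), the chunker here is a naive paragraph splitter standing in for the app's semantic chunker, and the code assumes ESM with top-level await.

import fs from 'node:fs/promises'
import mammoth from 'mammoth' // assumed DOCX parser
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// Naive paragraph-based chunking, a stand-in for semantic chunking.
function chunk (text, maxChars = 2000) {
  const chunks = []
  let current = ''
  for (const para of text.split(/\n\s*\n/)) {
    if (current.length + para.length > maxChars && current) {
      chunks.push(current.trim())
      current = ''
    }
    current += para + '\n\n'
  }
  if (current.trim()) chunks.push(current.trim())
  return chunks
}

const slug = 'book-slug' // filename minus extension
const { value: text } = await mammoth.extractRawText({ path: `files/in/${slug}.docx` })
const chunks = chunk(text)

// Embed all chunks in one batched request.
const { data } = await openai.embeddings.create({
  model: 'text-embedding-3-small', // assumed model
  input: chunks,
})

// Shape records the way Pinecone expects: id, values, metadata.
const records = data.map((d, i) => ({
  id: `${slug}-${i}`,
  values: d.embedding,
  metadata: { text: chunks[i], book: slug },
}))

await fs.mkdir(`files/out/${slug}`, { recursive: true })
await fs.writeFile(`files/out/${slug}/embeddings.json`, JSON.stringify(records, null, 2))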

Upload Embeddings to Pinecone

Uploads include automatic retries with exponential backoff to handle network interruptions.

# upload all embeddings to Pinecone
npm run pc-upload
# or
node src/pinecone-upload.js

# upload embeddings for a single book
npm run pc-upload book-slug
# or
node src/pinecone-upload.js book-slug
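
A sketch of the retry behavior described above (illustrative; the batch size, attempt count, and backoff base are assumptions, not the app's actual values):

import fs from 'node:fs/promises'
import { Pinecone } from '@pinecone-database/pinecone'

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY })
const index = pc.index(process.env.PINECONE_INDEX)

// Retry a batch upsert with exponential backoff: 1s, 2s, 4s, ...
async function upsertWithRetry (vectors, namespace, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await index.namespace(namespace).upsert(vectors)
    } catch (err) {
      if (attempt === maxAttempts) throw err
      const delayMs = 1000 * 2 ** (attempt - 1)
      console.warn(`Upsert failed (attempt ${attempt}), retrying in ${delayMs}ms`)
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
}

// Load the embeddings produced by the ingest step, then upload in
// batches to stay under Pinecone's request size limits.
const slug = 'book-slug'
const records = JSON.parse(await fs.readFile(`files/out/${slug}/embeddings.json`, 'utf8'))

const BATCH_SIZE = 100
for (let i = 0; i < records.length; i += BATCH_SIZE) {
  await upsertWithRetry(records.slice(i, i + BATCH_SIZE), slug)
}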

Searching

You can perform semantic searches locally or via Pinecone:

# search embeddings stored locally
npm run local-search "your query"
# or
node src/local-search.js "your query"

# search embeddings in Pinecone index (vector search)
npm run pc-search "your query"
# or
node src/pinecone-search.js "your query"

# search a specific namespace in Pinecone
npm run pc-search "your query" namespace
# or
node src/pinecone-search.js "your query" namespace
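
Vector search boils down to embedding the query with the same model used at ingest time, then asking Pinecone for its nearest neighbors. A sketch (the topK value and metadata shape are assumptions):

import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'

const openai = new OpenAI()
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY })
const index = pc.index(process.env.PINECONE_INDEX)

const query = process.argv[2] // "your query"

// Embed the query with the same model used for the stored chunks.
const { data } = await openai.embeddings.create({
  model: 'text-embedding-3-small', // assumed model
  input: query,
})

// Ask Pinecone for the closest chunks, including their stored text.
const results = await index.query({
  vector: data[0].embedding,
  topK: 5,
  includeMetadata: true,
})

for (const match of results.matches) {
  console.log(match.score.toFixed(3), match.metadata?.text?.slice(0, 80))
}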

Retrieval-Augmented Generation (RAG)

Get context-aware GPT responses by leveraging Pinecone:

# GPT-augmented answer from Pinecone context
npm run rag "your question here"
# or
node src/rag.js "your question here"

This will query Pinecone, fetch relevant context, and return a GPT-generated response.
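
Conceptually, the RAG step retrieves the top-matching chunks and stuffs their text into the prompt as context. The sketch below shows that flow end to end; the prompt wording and the gpt-4o-mini model are assumptions, and src/rag.js may differ:

import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'

const openai = new OpenAI()
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY })
const index = pc.index(process.env.PINECONE_INDEX)

const question = process.argv[2] // "your question here"

// 1. Embed the question and retrieve the most relevant chunks.
const { data } = await openai.embeddings.create({
  model: 'text-embedding-3-small', // assumed model
  input: question,
})
const { matches } = await index.query({
  vector: data[0].embedding,
  topK: 5,
  includeMetadata: true,
})

// 2. Join the retrieved chunk text into a single context block.
const context = matches.map(m => m.metadata?.text ?? '').join('\n---\n')

// 3. Ask GPT to answer strictly from that context.
const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini', // assumed model; src/rag.js may use another
  messages: [
    { role: 'system', content: 'Answer using only the provided context.' },
    { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
  ],
})

console.log(completion.choices[0].message.content)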
