New ingestion pipeline #279
Conversation
@manju956 Addressed your comments, PTAL.
yussufsh left a comment
Logic looks good to me. This solves two problems:
- Handles bigger PDFs by controlling memory usage.
- Starts the processing steps as soon as conversion is done for a PDF.
manalilatkar left a comment
LGTM.
manju956 left a comment
LGTM
Instead of running ingestion stage by stage via separate methods, the new ingestion pipeline works as follows:
- Generate the current conversion status against the files available in the cache, so that only the missing parts of the pipeline are run. For example, if conversion and processing are already done, only chunking is attempted.
- Split the files to process into light and heavy (PDFs with more than 500 pages) batches. Heavy files are processed with lower concurrency to avoid OOM kills: light files get 4 workers, heavy files get 2.
- Replace the process pool with a thread pool for processing and chunking, since these stages involve only network calls and token counting; a process pool is only needed for CPU-heavy tasks like conversion.
- Start the ingestion pipeline:
  - Start conversion for the PDFs that need it.
  - As soon as a PDF's conversion completes, trigger its text and table processing.
  - Wait for all conversions to complete.
  - As soon as a PDF's processing completes, trigger its chunking.
  - Wait for all processing to complete.
  - Wait for all chunking to complete.
- Improved stats: print the pipeline's stage-wise timings for all PDFs.
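The cache-driven resume step described above could be sketched roughly as follows. The artifact-suffix naming scheme (`<pdf>.converted`, etc.) and the `stages_needed` helper are illustrative assumptions, not the project's actual API.

```python
def stages_needed(pdf, cache):
    """Return the pipeline stages still missing for a PDF.

    `cache` is assumed to be a set of artifact names already present;
    the suffix scheme below is hypothetical.
    """
    stages = []
    if f"{pdf}.converted" not in cache:
        stages.append("convert")
    if f"{pdf}.processed" not in cache:
        stages.append("process")
    if f"{pdf}.chunks" not in cache:
        stages.append("chunk")
    return stages
```

With this shape, a PDF whose conversion and processing artifacts are already cached yields only `["chunk"]`, matching the example in the description.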
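The light/heavy split could look like the sketch below. The 500-page threshold and the 4/2 worker counts come from the description; the dict shape of a file record is an assumption for illustration.

```python
# Values from the PR description; the file-record shape is hypothetical.
HEAVY_PAGE_THRESHOLD = 500
LIGHT_WORKERS = 4
HEAVY_WORKERS = 2

def split_batches(files):
    """Split files into light and heavy batches by PDF page count."""
    light = [f for f in files if f["pages"] <= HEAVY_PAGE_THRESHOLD]
    heavy = [f for f in files if f["pages"] > HEAVY_PAGE_THRESHOLD]
    return light, heavy

# Each batch would then run with its own concurrency, e.g. a pool with
# max_workers=LIGHT_WORKERS for the light batch and
# max_workers=HEAVY_WORKERS for the heavy batch, to bound peak memory.
```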
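The "trigger the next stage as soon as a PDF finishes the previous one" flow can be sketched with `concurrent.futures.as_completed`. All stage bodies here are stand-ins, and for simplicity both pools are thread pools; per the description, conversion would really use a process pool since it is CPU-heavy, while processing and chunking stay on threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in stage implementations; the real stages do PDF conversion,
# text/table processing (network calls), and chunking (token counting).
def convert(pdf):
    return f"{pdf}.converted"

def process(doc):
    return f"{doc}.processed"

def chunk(doc):
    return f"{doc}.chunked"

def run_pipeline(pdfs, conv_workers=4, io_workers=8):
    """Chain conversion -> processing -> chunking, starting each stage
    for a PDF as soon as its previous stage completes."""
    results = []
    with ThreadPoolExecutor(max_workers=conv_workers) as conv_pool, \
         ThreadPoolExecutor(max_workers=io_workers) as io_pool:
        conv_futures = [conv_pool.submit(convert, p) for p in pdfs]
        proc_futures = []
        # As soon as a PDF's conversion completes, trigger its processing.
        for fut in as_completed(conv_futures):
            proc_futures.append(io_pool.submit(process, fut.result()))
        chunk_futures = []
        # As soon as a PDF's processing completes, trigger its chunking.
        for fut in as_completed(proc_futures):
            chunk_futures.append(io_pool.submit(chunk, fut.result()))
        # Wait for all chunking to complete.
        for fut in as_completed(chunk_futures):
            results.append(fut.result())
    return results
```

Using `as_completed` rather than waiting on whole batches is what lets downstream stages overlap with the remaining conversions.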
Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>
To address: AISERVICES-326 and other improvements.