
Conversation


@dharaneeshvrd dharaneeshvrd commented Feb 3, 2026

Instead of running ingestion stage by stage via separate methods:

  • The new ingestion pipeline works as follows:
    • Generate the current conversion status against the files available in the cache, so that only the missing parts of the pipeline are run. For example, if conversion and processing are already done, only chunking is attempted (see the first sketch below).
    • Split the files into light and heavy (PDF pages > 500) batches, so that heavy files are processed with less concurrency and OOM kills are avoided: light files get 4 workers, heavy files get 2 (second sketch below).
    • Replace the process pool with a thread pool for processing and chunking: those stages only make network calls and count tokens, and a separate process is only needed for CPU-heavy tasks like conversion (third sketch below).
    • Start the ingestion pipeline (also in the third sketch):
      • Start conversion for the PDFs that need it
      • As soon as a PDF’s conversion completes, trigger its text & table processing
      • Wait for all conversions to complete
      • As soon as a PDF’s processing completes, trigger its chunking
      • Wait for all processing to complete
      • Wait for all chunking to complete
  • Improved stats: the pipeline now prints stage-wise timings for every PDF (fourth sketch below)

To address: AISERVICES-326 & other improvements
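
A minimal sketch of the resume check, assuming each completed stage leaves a per-PDF artifact in the cache; the cache layout and file names here are illustrative, not the actual implementation:

```python
from pathlib import Path

STAGES = ("conversion", "processing", "chunking")

def pending_stages(pdf: Path, cache_dir: Path) -> list[str]:
    """Return the stages that still need to run for this PDF,
    based on which per-stage artifacts already exist in the cache."""
    done = {
        stage
        for stage in STAGES
        # assumed convention: <cache>/<stage>/<pdf-stem>.json
        if (cache_dir / stage / f"{pdf.stem}.json").exists()
    }
    return [s for s in STAGES if s not in done]
```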

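A sketch of the light/heavy split, assuming pypdf is available for the page count; the 500-page threshold and the 4/2 worker counts come from the description above, everything else is illustrative:

```python
from pathlib import Path
from pypdf import PdfReader

HEAVY_PAGE_THRESHOLD = 500  # pages; above this a PDF counts as heavy
LIGHT_WORKERS, HEAVY_WORKERS = 4, 2

def split_batches(pdfs: list[Path]) -> tuple[list[Path], list[Path]]:
    """Split PDFs into light and heavy batches by page count."""
    light, heavy = [], []
    for pdf in pdfs:
        pages = len(PdfReader(pdf).pages)
        (heavy if pages > HEAVY_PAGE_THRESHOLD else light).append(pdf)
    return light, heavy
```

The heavy batch is then run with HEAVY_WORKERS and the light batch with LIGHT_WORKERS, which keeps peak memory in check.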
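A sketch of the staged pipeline itself: a process pool for CPU-heavy conversion, a thread pool for the I/O-bound processing and chunking stages. The convert/process/chunk functions are placeholders for the real stage implementations:

```python
from concurrent.futures import (
    ProcessPoolExecutor, ThreadPoolExecutor, as_completed, wait,
)

def convert(pdf): ...       # CPU-heavy: PDF -> converted document (placeholder)
def process(pdf, doc): ...  # network-bound text & table processing (placeholder)
def chunk(pdf, doc): ...    # token counting + chunk building (placeholder)

def run_pipeline(pdfs, workers):
    with ProcessPoolExecutor(max_workers=workers) as procs, \
         ThreadPoolExecutor(max_workers=workers) as threads:
        convert_futs = {procs.submit(convert, p): p for p in pdfs}
        process_futs, chunk_futs = {}, {}

        # Trigger a PDF's processing as soon as its conversion completes;
        # the loop itself only ends once every conversion has finished.
        for fut in as_completed(convert_futs):
            pdf = convert_futs[fut]
            process_futs[threads.submit(process, pdf, fut.result())] = pdf

        # Likewise, trigger each PDF's chunking as its processing completes.
        for fut in as_completed(process_futs):
            pdf = process_futs[fut]
            chunk_futs[threads.submit(chunk, pdf, fut.result())] = pdf

        wait(chunk_futs)  # wait for all chunking to complete
```

This would be called once per batch, e.g. run_pipeline(light, LIGHT_WORKERS) and run_pipeline(heavy, HEAVY_WORKERS).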

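Finally, a sketch of the stage-wise timing stats; the wrapper and report format are illustrative, and durations for stages run in worker processes would have to be returned from the worker rather than written to shared state:

```python
import time
from collections import defaultdict

# per-PDF stage durations, e.g. timings["a.pdf"]["conversion"] = 12.3
timings: dict[str, dict[str, float]] = defaultdict(dict)

def timed(stage: str, pdf_name: str, fn, *args):
    """Run one stage for one PDF and record its wall-clock duration."""
    start = time.perf_counter()
    result = fn(*args)
    timings[pdf_name][stage] = time.perf_counter() - start
    return result

def print_stats() -> None:
    for pdf_name, stages in sorted(timings.items()):
        parts = ", ".join(f"{s}: {d:.1f}s" for s, d in stages.items())
        print(f"{pdf_name}: {parts}")
```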
@dharaneeshvrd dharaneeshvrd force-pushed the separate-summarization branch 3 times, most recently from 9b6771b to fa1171e on February 3, 2026 06:25
@dharaneeshvrd dharaneeshvrd force-pushed the separate-summarization branch 2 times, most recently from f82f6c3 to 06100e0 on February 4, 2026 06:48
@dharaneeshvrd (Member, Author) commented:

@manju956 Addressed your comments, ptal.

@yussufsh (Member) previously approved these changes on Feb 5, 2026, leaving this comment:

Logic looks good to me.
This solves 2 problems:

  1. Handling bigger PDFs to control memory usage.
  2. Starting processing steps as soon as conversion is done for a PDF.

@manalilatkar (Member) previously approved these changes on Feb 5, 2026, leaving this comment:

LGTM.

@manju956 (Contributor) previously approved these changes on Feb 6, 2026, leaving this comment:

LGTM

@dharaneeshvrd dharaneeshvrd dismissed stale reviews from manju956 and manalilatkar via e8455e9 on February 6, 2026 08:28
@dharaneeshvrd dharaneeshvrd force-pushed the separate-summarization branch from 06100e0 to e8455e9 on February 6, 2026 08:28
@dharaneeshvrd dharaneeshvrd force-pushed the separate-summarization branch 5 times, most recently from d426e50 to b130a0f on February 6, 2026 14:43
@dharaneeshvrd dharaneeshvrd force-pushed the separate-summarization branch from b130a0f to f052251 on February 6, 2026 14:44
@dharaneeshvrd dharaneeshvrd merged commit 6508c3d into IBM:main on Feb 9, 2026 (4 checks passed)