New ingestion pipeline #279
Conversation
@manju956 Addressed your comments, PTAL.
yussufsh left a comment
Logic looks good to me. This solves two problems:
- Handles bigger PDFs by controlling memory usage.
- Starts the processing steps as soon as conversion is done for a PDF.
manalilatkar left a comment
LGTM.
manju956 left a comment
LGTM
Instead of running ingestion stage by stage via separate methods, the new ingestion pipeline works as follows:
- Generate the current conversion status against the files available in the cache, so that only the missing parts of the pipeline are run. For example, if conversion and processing are already done, only chunking is attempted.
- Split the files to process into light and heavy (PDFs with more than 500 pages) batches. Heavy files are processed with lower concurrency to avoid OOM kills: light files get 4 workers, heavy files get 2.
- Replace the process pool with a thread pool for processing and chunking, since these stages involve only network calls and token counting; a process pool is only needed for CPU-heavy tasks like conversion.
- Start the ingestion pipeline:
  - Start conversion for the PDFs that need it.
  - As soon as a PDF's conversion completes, trigger its text and table processing.
  - Wait for all conversions to complete.
  - As soon as a PDF's processing completes, trigger its chunking.
  - Wait for all processing to complete.
  - Wait for all chunking to complete.
- Improved stats: print the pipeline's stage-wise timings for all PDFs.
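The cache-driven resume step described above could be sketched roughly as follows. The artifact-suffix naming scheme (`<pdf>.converted`, etc.) and the `stages_needed` helper are illustrative assumptions, not the project's actual API.

```python
def stages_needed(pdf, cache):
    """Return the pipeline stages still missing for a PDF.

    `cache` is assumed to be a set of artifact names already present;
    the suffix scheme below is hypothetical.
    """
    stages = []
    if f"{pdf}.converted" not in cache:
        stages.append("convert")
    if f"{pdf}.processed" not in cache:
        stages.append("process")
    if f"{pdf}.chunks" not in cache:
        stages.append("chunk")
    return stages
```

With this shape, a PDF whose conversion and processing artifacts are already cached yields only `["chunk"]`, matching the example in the description.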
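The light/heavy split could look like the sketch below. The 500-page threshold and the 4/2 worker counts come from the description; the dict shape of a file record is an assumption for illustration.

```python
# Values from the PR description; the file-record shape is hypothetical.
HEAVY_PAGE_THRESHOLD = 500
LIGHT_WORKERS = 4
HEAVY_WORKERS = 2

def split_batches(files):
    """Split files into light and heavy batches by PDF page count."""
    light = [f for f in files if f["pages"] <= HEAVY_PAGE_THRESHOLD]
    heavy = [f for f in files if f["pages"] > HEAVY_PAGE_THRESHOLD]
    return light, heavy

# Each batch would then run with its own concurrency, e.g. a pool with
# max_workers=LIGHT_WORKERS for the light batch and
# max_workers=HEAVY_WORKERS for the heavy batch, to bound peak memory.
```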
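The "trigger the next stage as soon as a PDF finishes the previous one" flow can be sketched with `concurrent.futures.as_completed`. All stage bodies here are stand-ins, and for simplicity both pools are thread pools; per the description, conversion would really use a process pool since it is CPU-heavy, while processing and chunking stay on threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in stage implementations; the real stages do PDF conversion,
# text/table processing (network calls), and chunking (token counting).
def convert(pdf):
    return f"{pdf}.converted"

def process(doc):
    return f"{doc}.processed"

def chunk(doc):
    return f"{doc}.chunked"

def run_pipeline(pdfs, conv_workers=4, io_workers=8):
    """Chain conversion -> processing -> chunking, starting each stage
    for a PDF as soon as its previous stage completes."""
    results = []
    with ThreadPoolExecutor(max_workers=conv_workers) as conv_pool, \
         ThreadPoolExecutor(max_workers=io_workers) as io_pool:
        conv_futures = [conv_pool.submit(convert, p) for p in pdfs]
        proc_futures = []
        # As soon as a PDF's conversion completes, trigger its processing.
        for fut in as_completed(conv_futures):
            proc_futures.append(io_pool.submit(process, fut.result()))
        chunk_futures = []
        # As soon as a PDF's processing completes, trigger its chunking.
        for fut in as_completed(proc_futures):
            chunk_futures.append(io_pool.submit(chunk, fut.result()))
        # Wait for all chunking to complete.
        for fut in as_completed(chunk_futures):
            results.append(fut.result())
    return results
```

Using `as_completed` rather than waiting on whole batches is what lets downstream stages overlap with the remaining conversions.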
Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>
To address: AISERVICES-326 and other improvements.