Fix memory leak when processing large result files #181
base: 0.x
Conversation
|
Hi @GlucNAc, I know the filesystem job execution storage has some issues; it's more a default when nothing else is available, and I will always push users to alternatives, like the DoctrineDBALJobExecutionStorage. Can you be more specific when you say there is an OOM issue? I know we have a problem when …
|
The problem was found when many imports were done. These imports failed and thus generated a lot of failures/warnings. However, having too many failures/warnings per result file was not the real problem; it was more about the quantity of large JSON files to process. Implementing a rotating file policy would help mitigate this accumulation problem, but the main problem would still remain. Indeed, the main problem comes from the current implementation of `query()`:

```php
public function query(Query $query): iterable
{
    $candidates = [];
    $glob = new \GlobIterator(
        \implode(DIRECTORY_SEPARATOR, [$this->directory, '**', '*']) . '.' . $this->serializer->extension(),
    );

    /** @var \SplFileInfo $file */
    foreach ($glob as $file) {
        // do filtering stuff
        $candidates[] = $execution;
    }

    $order = null; // select order callback
    if ($order) {
        \uasort($candidates, $order);
    }

    return \array_slice($candidates, $query->offset(), $query->limit());
}
```

This implementation loses the benefits of returning an iterable: filtering, sorting and pagination all operate on a fully materialized array, so every matching execution is deserialized and kept in memory at once. That's why I set out to fix the current implementation to limit memory consumption. As the problem comes from the accumulation of large JSON files, to be able to detect it via a unit test I generated 10 JSON files of 20 MB each (with many warnings) and wrote a test class that performs 3 tests, all run against the current state of the code.
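Before going through the results, here is a rough illustration of how such 20 MB fixtures can be produced (this is not the PR's test code; file names and payloads are invented). Streaming the JSON to disk keeps the generator itself memory-safe:

```php
<?php
// Hypothetical fixture generator: streams ten ~20 MB JSON result files
// to disk without ever holding a whole file in memory.
$dir = __DIR__ . '/fixtures';
if (!\is_dir($dir)) {
    \mkdir($dir, 0777, true);
}

for ($i = 0; $i < 10; $i++) {
    $fh = \fopen(\sprintf('%s/execution_%d.json', $dir, $i), 'wb');
    \fwrite($fh, '{"jobName":"import","id":"' . $i . '","warnings":[');

    $chunk = '"' . \str_repeat('warning payload ', 64) . '"'; // ~1 KB per warning
    $written = 0;
    while ($written < 20 * 1024 * 1024) {
        \fwrite($fh, ($written > 0 ? ',' : '') . $chunk);
        $written += \strlen($chunk) + 1;
    }

    \fwrite($fh, ']}');
    \fclose($fh);
}
```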
Test 2 relies on … and passes. On the other hand, tests 1 and 3 fail because they both rely on the previously described `query()` implementation. To fix this, I implemented a sort of lazy loading in order to load only the data necessary for the sorting and pagination logic, and then … This lazy loading is done by `JsonJobExecutionPartialSerializer`, which reads the first 2048 bytes of the JSON and denormalizes the data into a new instance of … The filtering and sorting logic is then performed by a new private method … In our original use case, processing the result files was slow (> 10 s) and consumed 1.6 GB of RAM. With this fix, the time was acceptable and peak memory usage stayed under 100 MB in a dev environment (with Symfony debug ON).
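To make the idea concrete, here is a minimal sketch of the partial-read approach, assuming a handful of top-level scalar fields; the class and field names are illustrative, not the actual `JsonJobExecutionPartialSerializer`:

```php
<?php
// Illustrative sketch only -- not the PR's JsonJobExecutionPartialSerializer.
final class PartialJsonReader
{
    public function __construct(
        private readonly int $bytes = 2048, // matches the 2048-byte read described above
    ) {
    }

    /**
     * Reads only the head of a (potentially huge) JSON result file and
     * extracts the few top-level fields needed for filtering/sorting.
     */
    public function read(string $path): array
    {
        $head = \file_get_contents($path, false, null, 0, $this->bytes);
        if ($head === false) {
            throw new \RuntimeException(\sprintf('Unable to read "%s".', $path));
        }

        // The head is almost certainly truncated mid-document, so plain
        // json_decode() would fail; pull out scalar fields with a regex.
        $fields = [];
        foreach (['jobName', 'id', 'status', 'startTime', 'endTime'] as $key) {
            if (\preg_match('/"' . $key . '"\s*:\s*("((?:[^"\\\\]|\\\\.)*)"|\d+)/', $head, $m)) {
                $fields[$key] = isset($m[2]) ? \stripcslashes($m[2]) : (int) $m[1];
            }
        }

        return $fields;
    }
}
```

Only the executions that survive filtering, sorting and slicing on these lightweight arrays would then need to be fully deserialized.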
|
Thank you for all the details you shared. I understand the problem you have, and it is all around the … Unfortunately, as you said, ordering + paginating an iterator is not possible, so for the moment we have to deal with this. IMO, anyone using the … I'm also not a fan of "partial hydration".
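As a side note, a tiny illustration of the constraint mentioned here (my example, not code from the thread): pagination alone can stay lazy, but ordering forces full materialization.

```php
<?php
// Pagination over an iterator can stay lazy: only the requested window
// is ever advanced to.
$executions = new \ArrayIterator(['exec-3', 'exec-1', 'exec-2']);
$page = new \LimitIterator($executions, 1, 2); // offset 1, limit 2

// Ordering, however, requires the whole dataset up front -- with large
// deserialized executions this is exactly where memory blows up.
$all = \iterator_to_array($executions);
\usort($all, fn (string $a, string $b): int => $a <=> $b);
$page = \array_slice($all, 1, 2);
```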
|
I totally agree with you about having a smaller object. It was my first intention with `JobExecutionResultFile`, but I was facing code duplication while deserializing the JSON, and as this logic already seems to be duplicated in `JobExecutionRowNormalizer`, I didn't want to go further with a refactoring before being sure of your interest in it.
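For what it's worth, a minimal sketch of what such a smaller object could look like; the class name comes from the comment above, while the field list is an assumption:

```php
<?php
// Sketch only: the fields are assumptions, not the library's API.
final class JobExecutionResultFile
{
    public function __construct(
        public readonly string $jobName,
        public readonly string $id,
        public readonly string $status,
        public readonly ?\DateTimeImmutable $startTime,
        public readonly ?\DateTimeImmutable $endTime,
    ) {
    }
}
```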
|
I'm in the process of releasing a new version with BC breaks, so it's OK if we introduce such changes, if it's for the best of the library.
|
OK, I will propose such a refactoring then. Maybe in another PR, and updating the target of the current PR to this new one? Btw, will your new release include the update of the return type of `QueryableJobExecutionStorageInterface::query`?
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##                0.x     #181      +/-   ##
============================================
- Coverage     99.68%   99.32%    -0.37%
- Complexity      940      962       +22
============================================
  Files           157      160        +3
  Lines          2852     2951       +99
============================================
+ Hits           2843     2931       +88
- Misses            9       20       +11
```
|
Hello :)
We have been using this great library to import large files (160 columns with tens of thousands of rows). Thus, it has generated large JSON result files, which cause out-of-memory errors when attempting to process them.
This PR:

- Refactors `Yokai\Batch\Storage\FilesystemJobExecutionStorage::query` and `Yokai\Batch\Storage\FilesystemJobExecutionStorage::count` to reduce memory consumption.

While this is still not fully scalable, it might be worth considering a migration to something like SQLite when the filesystem is chosen to store job results.
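To sketch the SQLite suggestion (purely illustrative, not part of this PR): queryable metadata could live in SQLite while the heavy JSON payload stays on disk, delegating sorting and pagination to the database.

```php
<?php
// Loose illustration of the SQLite idea: keep only queryable metadata
// in the database; the full JSON result file stays on disk.
$pdo = new \PDO('sqlite:' . __DIR__ . '/job_executions.sqlite');
$pdo->exec(
    'CREATE TABLE IF NOT EXISTS job_execution (
        id TEXT PRIMARY KEY,
        job_name TEXT NOT NULL,
        status INTEGER NOT NULL,
        start_time TEXT,
        end_time TEXT,
        payload_path TEXT NOT NULL
    )'
);

// Sorting and pagination are then done by the database instead of PHP:
$stmt = $pdo->prepare(
    'SELECT * FROM job_execution ORDER BY start_time DESC LIMIT :limit OFFSET :offset'
);
$stmt->bindValue(':limit', 20, \PDO::PARAM_INT);
$stmt->bindValue(':offset', 0, \PDO::PARAM_INT);
$stmt->execute();
```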
Finally, I think it would be interesting to change the signature of `Yokai\Batch\Storage\QueryableJobExecutionStorageInterface::query` from `iterable` to `\Iterator`.

Hope you think this is interesting :)
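For clarity, the proposed change would look roughly like this (a trimmed sketch showing only the method in question):

```php
<?php

namespace Yokai\Batch\Storage;

interface QueryableJobExecutionStorageInterface
{
    // Current signature:
    // public function query(Query $query): iterable;

    // Proposed: callers receive a real \Iterator that they can wrap
    // lazily (e.g. with \LimitIterator) instead of an arbitrary iterable.
    public function query(Query $query): \Iterator;
}
```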