
Conversation

@GlucNAc

@GlucNAc GlucNAc commented Jan 6, 2026

Hello :)

We have been using this great library to import large files (160 columns with tens of thousands of rows). These imports generate large JSON result files, which cause out-of-memory errors when we try to process them.

This PR:

  • adds tests that demonstrate an OOM issue occurs when processing 10 dummy files of ~20 MB each;
  • fixes Yokai\Batch\Storage\FilesystemJobExecutionStorage::query and Yokai\Batch\Storage\FilesystemJobExecutionStorage::count to reduce memory consumption.

While this is still not fully scalable, it might be worth considering a migration to something like SQLite when the filesystem is chosen to store job results.

Finally, I think it would be interesting to change the signature of Yokai\Batch\Storage\QueryableJobExecutionStorageInterface::query from iterable to Iterator.

Hope you think this is interesting :)

@yann-eugone
Contributor

Hi @GlucNAc
I really appreciate your efforts here

I know the filesystem job execution storage has some issues; it's more of a default when nothing else is available, and I will always push users towards alternatives, like the DoctrineDBALJobExecutionStorage

Can you be more specific about the OOM issue?
Were you able to spot the actual cause of the exhaustion?

I know we have a problem when a JobExecution becomes too heavy, due to logs, warnings, failures, etc.
But that problem exists no matter the storage used, so the fix is to be found elsewhere

@GlucNAc
Author

GlucNAc commented Jan 7, 2026

The problem appeared after many imports had been run. These imports failed and thus generated a lot of failures/warnings. However, having too many failures/warnings per result file was not the real problem; it was more about the quantity of large JSON files to process. Implementing a rotating file policy would help mitigate this accumulation, but the main problem would still remain.

Indeed, the main problem comes from the fact that Yokai\Batch\Storage\FilesystemJobExecutionStorage::query does not work in a fully iterative manner. Roughly, the method does this:

public function query(Query $query): iterable
{
    $candidates = [];
    $glob = new \GlobIterator(
        \implode(DIRECTORY_SEPARATOR, [$this->directory, '**', '*']) . '.' . $this->serializer->extension(),
    );
    /** @var \SplFileInfo $file */
    foreach ($glob as $file) {

        // do filtering stuff

        $candidates[] = $execution;
    }

    $order = // select order callback

    if ($order) {
        \uasort($candidates, $order);
    }

    return \array_slice($candidates, $query->offset(), $query->limit());
}

This implementation loses the benefit of the GlobIterator by accumulating JobExecution objects in the $candidates array.

Filtering an iterator is possible, but applying ordering and offset logic is not. That's why I proposed using SQLite: it would delegate this logic to the SQLite engine for users who need to stay on the filesystem. But such a paradigm shift cannot be introduced in a fix PR.
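To illustrate the constraint with a minimal standalone sketch (not the library's code; $directory, $offset and $limit are assumed variables): filtering a file iterator stays lazy, but usort()/uasort() only work on arrays, so ordering forces everything into memory at once.

$lazy = new \CallbackFilterIterator(
    new \GlobIterator($directory . DIRECTORY_SEPARATOR . '*.json'),
    fn (\SplFileInfo $file): bool => $file->getSize() > 0 // any per-file predicate stays lazy
);

// Constant memory: files are visited one by one.
foreach ($lazy as $file) {
    // build a JobExecution from $file, use it, forget it
}

// Ordering and pagination need the whole set at once.
$all = \iterator_to_array($lazy, false);
\usort($all, fn (\SplFileInfo $a, \SplFileInfo $b): int => $b->getMTime() <=> $a->getMTime());
$page = \array_slice($all, $offset, $limit);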

That's why I chose to fix the current implementation instead, to limit its memory consumption.

As the problem comes from the accumulation of large JSON files, I generated 10 JSON files of ~20 MB each (with many warnings) in order to reproduce it in a unit test, and wrote tests that detect the OOM issue.

This class performs 3 tests:

  1. testNoOOMOnCount;
  2. testNoOOMOnList;
  3. testNoOOMOnQuery;

which were run on the current state of the code.

Test 2 relies on Yokai\Batch\Storage\FilesystemJobExecutionStorage::list, which is a "true" iterator and therefore does not lead to an OOM issue. This proves that the JSON file size itself is not the main issue (even if a gigabyte-sized file would obviously make this test fail).
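One way such a test can bound memory (an assumed approach for illustration, not necessarily the exact test code; $storage is an already-configured FilesystemJobExecutionStorage pointing at the large JSON fixtures, inside a PHPUnit test case):

foreach ($storage->list() as $execution) {
    // each JobExecution can be released before the next one is loaded
}

// Peak usage must stay far below the combined size of all result files.
self::assertLessThan(100 * 1024 * 1024, \memory_get_peak_usage(true));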

On the other hand, tests 1 and 3 fail because they both rely on the previously described Yokai\Batch\Storage\FilesystemJobExecutionStorage::query method.

To fix it, I implemented a form of lazy loading that only loads the data needed for the sorting and pagination logic; the JobExecution objects are then processed one by one by a "true" iterator, as in Yokai\Batch\Storage\FilesystemJobExecutionStorage::list, correctly filtered, sorted and paginated (according to the existing unit tests, which still pass, plus new pagination tests).
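In pseudocode, the idea looks roughly like this (a simplified standalone sketch, not the PR's actual code; the function name and the sort key are assumptions):

function queryExecutions(string $directory, int $offset, int $limit): \Generator
{
    // 1) Lightweight index: only the path and the value needed for sorting
    //    are kept in memory, never a full JobExecution.
    $index = [];
    foreach (new \GlobIterator($directory . DIRECTORY_SEPARATOR . '*.json') as $file) {
        /** @var \SplFileInfo $file */
        $index[$file->getPathname()] = $file->getMTime();
    }

    // 2) Sorting and pagination happen on the small index only.
    \arsort($index);
    $page = \array_slice(\array_keys($index), $offset, $limit);

    // 3) Full deserialization happens one file at a time.
    foreach ($page as $path) {
        yield \json_decode(\file_get_contents($path), true);
    }
}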

This lazy loading is done by JsonJobExecutionPartialSerializer, which reads the first 2048 bytes of the JSON file and denormalizes the data into a new JobExecution instance using the existing Yokai\Batch\Serializer\JsonJobExecutionSerializer::fromArray method (moved into Yokai\Batch\Serializer\JobExecutionSerializerTrait for this purpose).
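The gist of that partial read, as a rough standalone sketch (the field names and the regex-based extraction here are assumptions for illustration, not the PR's actual implementation):

function readExecutionHeader(string $path, int $bytes = 2048): array
{
    $handle = \fopen($path, 'rb');
    $prefix = \fread($handle, $bytes);
    \fclose($handle);

    // The prefix is usually truncated JSON, so only a few top-level scalar
    // fields are extracted instead of decoding the whole document.
    $header = [];
    foreach (['jobName', 'status', 'startTime'] as $key) { // hypothetical keys
        if (\preg_match('/"' . $key . '"\s*:\s*("(?:[^"\\\\]|\\\\.)*"|\d+)/', $prefix, $match)) {
            $header[$key] = \json_decode($match[1], true);
        }
    }

    return $header;
}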

The filtering and sorting logic is then performed by the new private method Yokai\Batch\Storage\FilesystemJobExecutionStorage::getResultFiles (a better name could probably be found), which is consumed by Yokai\Batch\Storage\FilesystemJobExecutionStorage::query and Yokai\Batch\Storage\FilesystemJobExecutionStorage::count; each implements its own pagination logic and keeps memory usage as low as possible.

In our original use case, processing the result files was slow (> 10 s) and resulted in 1.6 GB of RAM consumption. With this fix, the time was acceptable and peak memory usage didn't exceed 100 MB in a dev environment (with Symfony debug ON).

@yann-eugone
Contributor

Thank you for all the details you shared

I understand the problem you have, and it is all around the query method
This method is part of an independent interface that was not implemented on the filesystem storage in the beginning
But because we found a working (though suboptimal) implementation, we decided to keep it like that

Unfortunately, as you said, ordering + paginating an iterator is not possible, so for the moment we have to deal with this
I see your point with SQLite, but once again, if you have database capabilities, just use the DBAL storage, because it also works with SQLite (the tests run with SQLite)

IMO, anyone using the filesystem storage should not be using these methods, unless they know exactly what they are doing

I'm also not a fan of "partial hydration"
Maybe the problem is that in a listing method, we should hydrate smaller objects, not the whole JobExecution
It is essentially the same as what you are doing here, but with a different result object
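For illustration, such a lighter listing object might look like this (a sketch with assumed fields, not an actual class of the library):

final class JobExecutionSummary
{
    public function __construct(
        public readonly string $jobName,
        public readonly string $status,
        public readonly ?\DateTimeImmutable $startTime,
        public readonly string $file, // where the full JobExecution can be loaded from on demand
    ) {
    }
}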
WDYT ?

@GlucNAc
Author

GlucNAc commented Jan 7, 2026

I totally agree with you about having a smaller object. It was my first intention with JobExecutionResultFile, but I was facing code duplication while deserializing the JSON, and since this logic already seems to be duplicated in JobExecutionRowNormalizer, I didn't want to go further into a refactoring before being sure of your interest in it.

@yann-eugone
Contributor

I'm in the process of releasing a new version with BC breaks, so it's OK to introduce such changes if it's for the best of the library

@GlucNAc
Author

GlucNAc commented Jan 7, 2026

OK, I will propose such a refactoring then. Maybe in another PR, and then update the target of the current PR to that new one?

Btw, will your new release include updating the return type of Yokai\Batch\Storage\QueryableJobExecutionStorageInterface::query from iterable to Iterator? I think that would be a good improvement: it would encourage implementations to use a "real" iterator, prevent errors like "already closed iterator" (since developers would be aware an iterator is returned), and make it safe to use iterator functions on the result of the query method.
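For reference, the change under discussion would look something like this (a sketch of the narrowed signature, not the final API):

interface QueryableJobExecutionStorageInterface
{
    /**
     * @return \Iterator<JobExecution>
     */
    public function query(Query $query): \Iterator;
}

// Generator-based implementations already satisfy the narrowed type, since
// every \Generator is an \Iterator; implementations returning plain arrays
// would need to wrap them, e.g. return new \ArrayIterator($executions);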

@codecov

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 95.47511% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.32%. Comparing base (96533f2) to head (e917297).

Files with missing lines                               | Patch % | Lines
...c/Serializer/JsonJobExecutionPartialSerializer.php  | 85.36%  | 6 Missing ⚠️
...atch/src/Storage/FilesystemJobExecutionStorage.php  | 95.95%  | 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                0.x     #181      +/-   ##
============================================
- Coverage     99.68%   99.32%   -0.37%     
- Complexity      940      962      +22     
============================================
  Files           157      160       +3     
  Lines          2852     2951      +99     
============================================
+ Hits           2843     2931      +88     
- Misses            9       20      +11     


