Remove the language filter for wikidata labels. #26
Conversation
Bump Spark Driver memory to account for the larger result set. The memory upper bound was found to allow the job to complete on enwiki. This change is experimental and meant to enable analysis/experimentation.
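To make the change concrete, here is a hedged sketch of the kind of query this implies; the table and column names (`wikidata_labels`, `language`, `label`) are assumptions for illustration, not taken from this repository:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical shape of the labels table: one row per (entity id, language).
labels = spark.sql("""
    SELECT id, language, label
    FROM wikidata_labels
    -- previously: WHERE language = 'en'
    -- with the filter removed, labels in every language are returned,
    -- which is what enlarges the result set on enwiki
""")
```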
clarakosi left a comment
I tested it on stat1005 for enwiki and still ran into memory issues.
For reference, the run ID was 39607674-87fa-4ee5-9158-c008c150c505.
This commit adds some tweaks to spark init, memory limits and garbage collection policies needed to meet enwiki memory requirements.
@clarakosi I have been able to reproduce the memory errors you reported. There are a few things to unpack. I was able to get the process to complete and generate valid data by significantly increasing the Spark Driver memory (64G) and tweaking related memory settings. tl;dr: given the very large memory footprint, I would not introduce the query change for now; I'd stick to filtering results by language.

How does the error manifest?

The query change triggered the following chain of failures: OOM on the driver and, after the memory increase, GC failures.

Tuning

I tweaked the following (in order): the driver memory (up to 64G), related memory limits, and the garbage collection policy.
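As a rough illustration (not the exact settings used): the 64G driver memory figure comes from the run described above, while the result-size cap and the G1GC flags below are assumptions about what such tweaks can look like in pyspark:

```python
from pyspark.sql import SparkSession

# Sketch only: 64g is the figure reported above; the other values and the
# G1GC flags are assumed examples of memory-limit and GC-policy tweaks.
spark = (
    SparkSession.builder
    .appName("wikidata-labels-experiment")  # hypothetical app name
    # NOTE: in client mode the driver JVM is already running by the time this
    # code executes, so spark.driver.memory must instead be set up front
    # (e.g. via spark-defaults.conf or spark-submit --driver-memory 64g).
    .config("spark.driver.memory", "64g")
    .config("spark.driver.maxResultSize", "32g")  # assumption: raise collected-result cap
    .config(
        "spark.driver.extraJavaOptions",
        "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35",
    )
    .getOrCreate()
)
```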
WARNING: with pyspark in …

Mitigation

IMHO we should look at query optimizations before committing to system changes. 64G is around 10% of the total memory available on stat hosts, and not sustainable in the long run. Enabling …
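For contrast, a minimal sketch of the per-language filter being argued for here, reusing the hypothetical `wikidata_labels` schema from the first sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keeping the language filter bounds the result set that reaches the driver.
labels_en = spark.sql("""
    SELECT id, language, label
    FROM wikidata_labels
    WHERE language = 'en'  -- hypothetical column/value; one language per target wiki
""")
```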