Bug Report: OutOfMemoryError (“GC overhead limit exceeded”) when running topic modeling with Mallet
Describe the bug
When invoking pycistopic topic_modeling … with the Mallet backend on a large corpus, the Java process crashes during train-topics with:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
RuntimeError: mallet train-topics returned non-zero exit status 1
This error means the JVM is spending almost all of its time in garbage collection while reclaiming very little heap, so it cannot make forward progress.
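For triage, it can help to tell this failure apart from plain heap exhaustion. The sketch below is a hypothetical helper (the name classify_jvm_oom is an assumption, not part of pycisTopic or Mallet) that scans captured stderr text for the two standard HotSpot messages, "GC overhead limit exceeded" and "Java heap space":

```python
from typing import Optional

# Standard HotSpot OutOfMemoryError detail messages mapped to short labels.
_OOM_KINDS = {
    "GC overhead limit exceeded": "gc_overhead",
    "Java heap space": "heap_space",
}


def classify_jvm_oom(stderr_text: str) -> Optional[str]:
    """Return a label for the first JVM OutOfMemoryError found, else None.

    Hypothetical helper for reading a captured STDERR log; it only does
    substring matching and knows nothing about Mallet itself.
    """
    for line in stderr_text.splitlines():
        if "java.lang.OutOfMemoryError" not in line:
            continue
        for message, label in _OOM_KINDS.items():
            if message in line:
                return label
        # An OutOfMemoryError with a detail message not listed above.
        return "other_oom"
    return None
```

Applied to the log in this report, the first matching line would be labelled gc_overhead, which separates it from the heap-exhaustion crash seen with a smaller memory setting.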
To Reproduce
- Ensure Mallet is installed (e.g. version 2.0.8) and on your PATH.
- Activate your scenicplus conda env: conda activate scenicplus
- Run a topic modeling command on a large dataset, for example:
pycistopic topic_modeling mallet \
  -i $input_file \
  -o $output_file \
  -t 10 \
  -p $ncores \
  -m $mem \
  -b $mallet_path
- Observe the error in the STDERR log (pycisTopic_all_*.err).
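As a sanity check while reproducing, one can ask the JVM on PATH what maximum heap it would use by default. This is a minimal sketch: the function names are hypothetical, and it only reads the JVM's default via the standard -XX:+PrintFlagsFinal flag. How the -m value reaches the JVM is not shown in this report, so the number returned here is not necessarily the heap Mallet runs with.

```python
import re
import subprocess
from typing import Optional


def parse_max_heap_bytes(flags_output: str) -> Optional[int]:
    """Extract MaxHeapSize (in bytes) from `java -XX:+PrintFlagsFinal` output."""
    # Matches both "MaxHeapSize = N" and "MaxHeapSize := N" flag lines.
    match = re.search(r"\bMaxHeapSize\s*:?=\s*(\d+)", flags_output)
    return int(match.group(1)) if match else None


def default_max_heap_bytes(java: str = "java") -> Optional[int]:
    """Ask the given java executable for its default maximum heap size."""
    result = subprocess.run(
        [java, "-XX:+PrintFlagsFinal", "-version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # The flag table is printed on stdout; the version banner goes to stderr.
    return parse_max_heap_bytes(result.stdout)
```

Calling default_max_heap_bytes() inside the activated conda env and comparing the result with the intended memory budget gives a rough idea of whether the default heap is far smaller than expected.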
Error output
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at cc.mallet.topics.ParallelTopicModel.estimate(ParallelTopicModel.java:727)
at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:245)
RuntimeError: command '['/lab-share/.../mallet', 'topic_modeling', ...]' returned non-zero exit status 1.
Expected behavior
Topic modeling should complete, or at least fail cleanly, without the JVM thrashing in garbage collection, given that a large amount of memory is specified. When too little memory is given, it instead crashes with an "out of heap memory" error.
Screenshots
N/A
Version (please complete the following information):
- Python: 3.11.0 (Miniforge3)
- pycisTopic: 2.0a0 (from pip show pycisTopic)
- Mallet: 2.0.8
- OpenJDK: 11.x