Bug Report: OutOfMemoryError (“GC overhead limit exceeded”) when running topic modeling with Mallet #213

@teng-gao

Describe the bug

When running pycistopic topic_modeling … with the Mallet backend on a large corpus, the Java process crashes during train-topics with:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
RuntimeError: mallet train-topics returned non-zero exit status 1

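This indicates that the JVM is spending almost all of its time in garbage collection while reclaiming very little heap, so training makes no forward progress. HotSpot's parallel collector, for example, raises this error when more than 98% of total time goes to garbage collection and less than 2% of the heap is recovered.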

To Reproduce

  1. Ensure Mallet is installed (e.g. version 2.0.8) and on your PATH.

  2. Activate your scenicplus conda env:

    conda activate scenicplus

  3. Run a topic modeling command on a large dataset, for example:

    pycistopic topic_modeling mallet \
      -i $input_file \
      -o $output_file \
      -t 10 \
      -p $ncores \
      -m $mem \
      -b $mallet_path

  4. Observe the error in the STDERR log (pycisTopic_all_*.err).
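For anyone checking whether they hit the same failure, the last step can be sketched as a small log scan. This is only an illustration: the function name `find_jvm_oom` and its parameters are hypothetical, and the default glob assumes the job's STDERR files follow the `pycisTopic_all_*.err` naming used above.

```python
import re
from pathlib import Path

# Matches JVM out-of-memory lines, e.g.
# 'Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded'
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError: (?P<reason>.+)")


def find_jvm_oom(log_dir, pattern="pycisTopic_all_*.err"):
    """Return (file name, line number, reason) for each OutOfMemoryError line.

    Hypothetical helper: `pattern` is an assumption about how the
    STDERR logs are named and should be adapted to the actual job setup.
    """
    hits = []
    for path in sorted(Path(log_dir).glob(pattern)):
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                match = OOM_PATTERN.search(line)
                if match:
                    hits.append((path.name, lineno, match.group("reason").strip()))
    return hits


if __name__ == "__main__":
    # Scan the current directory and report every hit.
    for name, lineno, reason in find_jvm_oom("."):
        print(f"{name}:{lineno}: {reason}")
```

The reported reason tells the two failure modes apart: "GC overhead limit exceeded" is the case in this report, while "Java heap space" is the message the JVM prints when the heap is simply exhausted.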

Error output

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at cc.mallet.topics.ParallelTopicModel.estimate(ParallelTopicModel.java:727)
    at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:245)

RuntimeError: command '['/lab-share/.../mallet', 'topic_modeling', ...]' returned non-zero exit status 1.

Expected behavior

Topic modeling should complete, or at least fail cleanly, without thrashing the JVM, given that a large amount of memory is specified. When too little memory is given, it crashes with "out of heap memory" instead.
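To make "fail cleanly" concrete, here is a minimal sketch of a wrapper that turns a JVM out-of-memory crash into a dedicated exception instead of a generic non-zero exit status. It is not how pycisTopic invokes Mallet; `JvmOutOfMemoryError` and `run_jvm_command` are hypothetical names, and the only assumption about the child process is that it writes the standard `java.lang.OutOfMemoryError:` line to STDERR.

```python
import subprocess
import sys


class JvmOutOfMemoryError(RuntimeError):
    """Hypothetical exception for a JVM process that died from memory exhaustion."""


def run_jvm_command(cmd):
    """Run cmd and return its stdout; raise a specific error on a JVM OOM.

    Hypothetical helper, not part of pycisTopic or Mallet.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return proc.stdout

    marker = "java.lang.OutOfMemoryError:"
    for line in proc.stderr.splitlines():
        if marker in line:
            # Keep the JVM's own reason, e.g. "GC overhead limit exceeded".
            reason = line.split(marker, 1)[1].strip()
            raise JvmOutOfMemoryError(
                f"JVM ran out of memory ({reason}); "
                "consider a larger heap or a smaller input"
            )
    # Any other failure keeps the generic error.
    raise RuntimeError(
        f"command {cmd!r} returned non-zero exit status {proc.returncode}"
    )


if __name__ == "__main__":
    # Stand-in child process that mimics the JVM's STDERR line and exit code.
    fake = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('java.lang.OutOfMemoryError: "
        "GC overhead limit exceeded\\n'); sys.exit(1)",
    ]
    try:
        run_jvm_command(fake)
    except JvmOutOfMemoryError as err:
        print(f"clean failure: {err}")
```

Note that `capture_output=True` buffers all output in memory, so a long training run would more likely stream STDERR to a file and inspect it afterwards.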

Screenshots

N/A

Version (please complete the following information):

  • Python: 3.11.0 (Miniforge3)
  • pycisTopic: 2.0a0 (from pip show pycisTopic)
  • Mallet: 2.0.8
  • OpenJDK: 11.x
