Description
Hi, since migrating over to Kubernetes as the batch system instead of Mesos, workflow runs seem to stall more often. Even after checking and adjusting the memory and CPU requirements of all the subworkflows, scheduling still struggles, and I see events like:
```
Type     Reason            Age                     From               Message
Warning  FailedScheduling  4m24s (x1231 over 30h)  default-scheduler  0/7 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 5 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
```
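For reference, this is roughly how I have been inspecting the cluster when a run stalls (a sketch only; namespaces and label selectors are omitted):

```sh
# Which nodes are Ready and reachable
kubectl get nodes -o wide

# Per-node allocatable CPU/memory and any taints
# (e.g. node.kubernetes.io/unreachable on the 5 unreachable nodes above)
kubectl describe nodes | grep -A 5 -E "Name:|Taints:|Allocatable:"

# Recent scheduling failures, newest last
kubectl get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
```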
Is there anything to look for to make this more robust? Running workflows is much less reliable now than it was with Mesos. Sometimes the job needs to be killed and, with no changes to the workflow parameters, restarted where it left off; after the restart it is able to schedule and finish. This at least suggests the memory/disk/CPU requirement specs are fine, since the restart allowed it to schedule, but repeatedly killing and restarting to avoid stalls is not really a solution.
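For context, a sketch of how the runs are launched; the script name and `$JOBSTORE` are placeholders and the resource values are illustrative, but the options are standard Toil options:

```sh
# Initial run: explicit per-job defaults and retries
python workflow.py "$JOBSTORE" \
    --batchSystem kubernetes \
    --defaultCores 2 \
    --defaultMemory 4Gi \
    --defaultDisk 10Gi \
    --retryCount 3 \
    --logLevel INFO

# After killing a stalled run: same command plus --restart,
# which resumes from the job store where it left off
python workflow.py "$JOBSTORE" \
    --batchSystem kubernetes \
    --restart
```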
┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1803