Description
Hi, since migrating over to Kubernetes as the batch system instead of Mesos, workflow runs seem to stall more often. Even after checking and adjusting the memory and CPU requirements of all the subworkflows, scheduling still struggles, and I see events like:
```
Type     Reason            Age                     From               Message
Warning  FailedScheduling  4m24s (x1231 over 30h)  default-scheduler  0/7 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 5 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
```
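For reference, this is roughly how I have been inspecting the cluster when a run stalls (a sketch only; namespaces and label selectors are omitted):

```sh
# Which nodes are Ready and reachable
kubectl get nodes -o wide

# Per-node allocatable CPU/memory and any taints
# (e.g. node.kubernetes.io/unreachable on the 5 unreachable nodes above)
kubectl describe nodes | grep -A 5 -E "Name:|Taints:|Allocatable:"

# Recent scheduling failures, newest last
kubectl get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
```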
Is there anything to look for to make this more robust? Running workflows is much less reliable now than it was with Mesos. Sometimes the job needs to be killed and, with no changes to the workflow parameters, restarted where it left off; after the restart it is able to schedule and finish. This at least suggests the memory/disk/CPU requirement specs are fine, since the restart allowed it to schedule, but repeatedly killing and restarting to avoid stalls is not really a solution.
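For context, a sketch of how the runs are launched; the script name and `$JOBSTORE` are placeholders and the resource values are illustrative, but the options are standard Toil options:

```sh
# Initial run: explicit per-job defaults and retries
python workflow.py "$JOBSTORE" \
    --batchSystem kubernetes \
    --defaultCores 2 \
    --defaultMemory 4Gi \
    --defaultDisk 10Gi \
    --retryCount 3 \
    --logLevel INFO

# After killing a stalled run: same command plus --restart,
# which resumes from the job store where it left off
python workflow.py "$JOBSTORE" \
    --batchSystem kubernetes \
    --restart
```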
┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1803