
Kubernetes batch system seems less robust than Mesos #5446

@boyangzhao

Description


Hi, since migrating from Mesos to Kubernetes as the batch system, workflow runs seem to stall more often. Even after checking and adjusting the memory and CPU requirements of all the subworkflows, the scheduler still struggles, and I see events like:

Type:     Warning
Reason:   FailedScheduling
Age:      4m24s (x1231 over 30h)
From:     default-scheduler
Message:  0/7 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 5 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
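For what it's worth, the node.kubernetes.io/unreachable taint in that message points at nodes that have stopped reporting to the control plane, not at the pod's resource requests. A minimal diagnostic sketch, assuming the official kubernetes Python client and a working kubeconfig (the script is illustrative only, not part of Toil), that lists each node's taints and Ready condition:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    # The Ready condition shows whether the kubelet is still reporting in.
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"), "Unknown"
    )
    print(node.metadata.name, "Ready:", ready)
    for taint in (node.spec.taints or []):
        # node.kubernetes.io/unreachable is added automatically when a node
        # loses contact with the control plane; pods without a matching
        # toleration will never be scheduled there.
        print("  taint:", taint.key, taint.effect)
```

Nodes that keep flipping to NotReady would also explain the x1231 repeat count: the pod stays Pending while the pool of schedulable nodes keeps shrinking.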

Is there anything to look for to make this more robust? Running workflows is very unreliable now compared to Mesos. Sometimes a job has to be killed and, with no changes to the workflow parameters, restarted from where it left off; it is then able to schedule and finish. That at least suggests the memory/disk/CPU requirement specs are fine, since the restart allowed scheduling to succeed. But repeatedly killing and restarting to avoid stalls is not really a solution.
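In case a concrete example helps, here is a minimal sketch of how we pin per-job resources and resume a stalled run. The Toil calls themselves are real API, but the job function, job store name, and resource values are hypothetical placeholders:

```python
from toil.common import Toil
from toil.job import Job

def step(job):
    # hypothetical stand-in for a real subworkflow step
    job.log("running step")

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    # launched e.g. as:
    #   python workflow.py file:my-jobstore --batchSystem kubernetes --retryCount 3
    # and resumed after a stall with the same command plus --restart
    with Toil(options) as toil:
        root = Job.wrapJobFn(step, memory="4G", cores=2, disk="4G")
        if options.restart:
            toil.restart()  # picks up from the job store where the run stalled
        else:
            toil.start(root)
```

(Note that --retryCount only helps with jobs that actually fail; it doesn't seem to unstick a pod that sits Pending indefinitely, which is what appears to happen here.)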

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1803
