
[Bug] Worker stuck in "closing-gracefully" state #3018

@AnesBenmerzoug

Description

I have a cluster composed of 8 workers distributed across Kubernetes pods on AWS.
I noticed that sometimes one of the workers gets stuck in the "closing-gracefully" state because of an assertion error in the worker's close_gracefully() method.

For some reason the scheduler still tries to send tasks to this worker; they fail after some time because the worker does not actually execute them.

Manually closing the worker using the client's retire_workers() method works, and I'm currently using it as a workaround (sketch below).
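
For reference, the workaround looks roughly like this (the scheduler address is the one from my deployment):

from distributed import Client

client = Client("tcp://dask-scheduler:8786")

# Find workers stuck in "closing-gracefully" and retire them by address.
statuses = client.run(lambda dask_worker: dask_worker.status)
stuck = [address for address, status in statuses.items()
         if status == "closing-gracefully"]
if stuck:
    # In my setup the Kubernetes pod comes back with a fresh worker afterwards.
    client.retire_workers(stuck)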

After digging around the code base a bit, I found that the code responsible for this behaviour is the scheduler's replicate() method. The failing assertion, which I did not completely understand, is never handled, so the worker never finishes closing properly.
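
To illustrate the failure mode, here is a minimal sketch with made-up names (FakeWorker, retire_on_scheduler); it is not the actual distributed code, just the shape of the problem: the status flag is set before an awaited call that raises, and nothing ever resets it.

import asyncio

class FakeWorker:
    def __init__(self):
        self.status = "running"

    async def retire_on_scheduler(self):
        # Stands in for scheduler.retire_workers(), failing the way
        # replicate() does in the stack trace below.
        raise AssertionError("assert count > 0")

    async def close_gracefully(self):
        self.status = "closing-gracefully"
        await self.retire_on_scheduler()  # raises, so we never get past here
        self.status = "closed"

async def main():
    worker = FakeWorker()
    try:
        await worker.close_gracefully()
    except AssertionError:
        pass  # the ioloop only logs this; the worker process keeps running
    print(worker.status)  # "closing-gracefully" -- stuck, as observed

asyncio.get_event_loop().run_until_complete(main())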

From analyzing the expression I could conclude the following:

  • n_missing is greater than 0, otherwise the method would have returned
  • branching_factor's default value of 2 is used

From those two points it seems that len(ts.who_has) is 0 (see the sketch below).
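
In other words, assuming the expression is roughly count = ceil(min(n_missing, branching_factor * len(ts.who_has))) (a paraphrase from memory, so the exact code may differ):

from math import ceil

branching_factor = 2   # the default
n_missing = 1          # > 0, otherwise replicate() would have moved on
who_has = []           # suspected: the task has no replicas anywhere

count = ceil(min(n_missing, branching_factor * len(who_has)))
assert count > 0       # AssertionError: min(1, 2 * 0) == 0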

Unfortunately I have not yet found a minimal example that reproduces this issue.

Stack Trace
2019-09-02 at 01:40:09 | LOCALDEV | DASK     | INFO     | distributed.worker:close_gracefully:1116 - Closing worker gracefully: tcp://10.165.119.248:37069
2019-09-02 at 01:40:10 | LOCALDEV | DASK     | INFO     | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:10 | LOCALDEV | DASK     | INFO     | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:11 | LOCALDEV | DASK     | WARNING  | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 01:40:13 | LOCALDEV | DASK     | WARNING  | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 01:40:13 | LOCALDEV | DASK     | INFO     | agena_data_acquisition.tools.flow.utilities:set_container_components_on_hash_cache:152 - Setting categories and timestamps on hash cache
2019-09-02 at 01:40:13 | LOCALDEV | DASK     | INFO     | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:13 | LOCALDEV | DASK     | INFO     | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:15 | LOCALDEV | DASK     | ERROR    | tornado.ioloop:_run_callback:763 - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>>, <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1104> exception=AssertionError()>)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/forkserver.py", line 196, in main
    _serve_one(s, listener, alive_r, old_handlers)
    │          │  │         │        └ {<Signals.SIGCHLD: 17>: <Handlers.SIG_DFL: 0>, <Signals.SIGINT: 2>: <built-in function default_int_handler>}
    │          │  │         └ 17
    │          │  └ <socket.socket [closed] fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
    │          └ <socket.socket [closed] fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
    └ <function _serve_one at 0x7f625c10c730>
  File "/usr/lib/python3.6/multiprocessing/forkserver.py", line 231, in _serve_one
    code = spawn._main(child_r)
           │     │     └ 11
           │     └ <function _main at 0x7f625c111d08>
           └ <module 'multiprocessing.spawn' from '/usr/lib/python3.6/multiprocessing/spawn.py'>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 118, in _main
    return self._bootstrap()
           │    └ <function BaseProcess._bootstrap at 0x7f625d3ddd08>
           └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7f625d3dd510>
    └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
    │    │        │    └ (<bound method WorkerProcess._run of <class 'distributed.nanny.WorkerProcess'>>, (), {'worker_kwargs': {'scheduler_ip': 'tcp:...
    │    │        └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
    │    └ <bound method AsyncProcess._run of <class 'distributed.process.AsyncProcess'>>
    └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
  File "/opt/venv/lib/python3.6/site-packages/distributed/process.py", line 179, in _run
    target(*args, **kwargs)
    │       │       └ {'worker_kwargs': {'scheduler_ip': 'tcp://dask-scheduler:8786', 'nthreads': 4, 'local_directory': '/tmp', 'services': None, '...
    │       └ ()
    └ <bound method WorkerProcess._run of <class 'distributed.nanny.WorkerProcess'>>
  File "/opt/venv/lib/python3.6/site-packages/distributed/nanny.py", line 697, in _run
    loop.run_sync(run)
    │    │        └ <function WorkerProcess._run.<locals>.run at 0x7f623d58c8c8>
    │    └ <function IOLoop.run_sync at 0x7f625a0e0378>
    └ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
  File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 526, in run_sync
    self.start()
    │    └ <function BaseAsyncIOLoop.start at 0x7f625993b6a8>
    └ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
  File "/opt/venv/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 148, in start
    self.asyncio_loop.run_forever()
    │    │            └ <function BaseEventLoop.run_forever at 0x7f625acee158>
    │    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
    └ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
  File "/usr/lib/python3.6/asyncio/base_events.py", line 438, in run_forever
    self._run_once()
    │    └ <function BaseEventLoop._run_once at 0x7f625acf9620>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/usr/lib/python3.6/asyncio/base_events.py", line 1451, in _run_once
    handle._run()
    │      └ <function Handle._run at 0x7f625acdb2f0>
    └ <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...ertionError()>) at /opt/venv/lib/python3.6/site-packages/tornado/...
  File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
    │    │          │    └ <member '_args' of 'Handle' objects>
    │    │          └ <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...ertionError()>) at /opt/venv/lib/python3.6/site-packages/tornado/...
    │    └ <member '_callback' of 'Handle' objects>
    └ <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...ertionError()>) at /opt/venv/lib/python3.6/site-packages/tornado/...
  File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 690, in <lambda>
    lambda f: self._run_callback(functools.partial(callback, future))
           │  │    │             │         │       │         └ <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1...
           │  │    │             │         │       └ <bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>>
           │  │    │             │         └ <class 'functools.partial'>
           │  │    │             └ <module 'functools' from '/usr/lib/python3.6/functools.py'>
           │  │    └ <function IOLoop._run_callback at 0x7f625a0e0b70>
           │  └ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
           └ <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1...
> File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
          └ functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db4...
  File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
    │      └ <method 'result' of '_asyncio.Task' objects>
    └ <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1...
  File "/opt/venv/lib/python3.6/site-packages/distributed/worker.py", line 1118, in close_gracefully
    await self.scheduler.retire_workers(workers=[self.address], remove=False)
          │    │                                 │    └ <property object at 0x7f62590b5098>
          │    │                                 └ <Worker: tcp://10.165.119.248:37069, closing-gracefully, stored: 37, running: 2/4, ready: 0, comm: 0, waiting: 0>
          │    └ <pooled rpc to 'tcp://dask-scheduler:8786'>
          └ <Worker: tcp://10.165.119.248:37069, closing-gracefully, stored: 37, running: 2/4, ready: 0, comm: 0, waiting: 0>
  File "/opt/venv/lib/python3.6/site-packages/distributed/core.py", line 750, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
                   │         │    │        │      └ {'workers': ['tcp://10.165.119.248:37069'], 'remove': False}
                   │         │    │        └ 'retire_workers'
                   │         │    └ <TCP ConnectionPool local=tcp://10.165.119.248:58730 remote=tcp://dask-scheduler:8786>
                   │         └ <TCP ConnectionPool local=tcp://10.165.119.248:58730 remote=tcp://dask-scheduler:8786>
                   └ <function send_recv at 0x7f625912dc80>
  File "/opt/venv/lib/python3.6/site-packages/distributed/core.py", line 559, in send_recv
    six.reraise(*clean_exception(**response))
    │   │        │                 └ {'status': 'uncaught-error', 'text': '', 'exception': AssertionError(), 'traceback': <traceback object at 0x7f60f8ba7e48>}
    │   │        └ <function clean_exception at 0x7f62590b6e18>
    │   └ <function reraise at 0x7f625a0b17b8>
    └ <module 'six' from '/opt/venv/lib/python3.6/site-packages/six.py'>
  File "/opt/venv/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
          │                    └ None
          └ None
  File "/opt/venv/lib/python3.6/site-packages/distributed/core.py", line 416, in handle_comm
    result = await result
  File "/opt/venv/lib/python3.6/site-packages/distributed/scheduler.py", line 3117, in retire_workers
    delete=False,
  File "/opt/venv/lib/python3.6/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/opt/venv/lib/python3.6/site-packages/distributed/scheduler.py", line 2881, in replicate
    assert count > 0

AssertionError: assert count > 0
2019-09-02 at 01:55:24 | LOCALDEV | DASK     | INFO     | distributed.worker:run:3377 - Run out-of-band function 'collect'
2019-09-02 at 01:55:25 | LOCALDEV | DASK     | WARNING  | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 02:11:15 | LOCALDEV | DASK     | INFO     | distributed.worker:run:3377 - Run out-of-band function 'collect'
2019-09-02 at 02:11:15 | LOCALDEV | DASK     | WARNING  | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 02:27:08 | LOCALDEV | DASK     | INFO     | distributed.worker:run:3377 - Run out-of-band function 'collect'
Workers Status
>>> pprint(client.run(lambda dask_worker: dask_worker.status))
{'tcp://10.165.118.150:37697': 'running',
 'tcp://10.165.118.190:40097': 'running',
 'tcp://10.165.119.194:43069': 'closing-gracefully',
 'tcp://10.165.134.5:37665': 'running',
 'tcp://10.165.134.98:37897': 'running',
 'tcp://10.165.135.151:41777': 'running',
 'tcp://10.165.135.24:39843': 'running',
 'tcp://10.165.135.75:41797': 'running'}
Manually Closing the Worker
>>> client.retire_workers(['tcp://10.165.119.194:43069'])
{'tcp://10.165.119.194:43069': {'type': 'Worker', 'id': 'tcp://10.165.119.194:43069', 'host': '10.165.119.194', 'resources': {}, 'local_directory': '/tmp/worker-insrgwz0', 'name': 'tcp://10.165.119.194:43069', 'nthreads': 4, 'memory_limit': 6000000000, 'last_seen': 1567423741.9343932, 'services': {'dashboard': 40601}, 'metrics': {'cpu': 4.0, 'memory': 1920286720, 'time': 1567423741.4320846, 'read_bytes': 285.9653648363741, 'write_bytes': 773.9062670746627, 'num_fds': 1284, 'executing': 0, 'in_memory': 0, 'ready': 0, 'in_flight': 0, 'bandwidth': 100000000}, 'nanny': 'tcp://10.165.119.194:32957'}}

>>> pprint(client.run(lambda dask_worker: dask_worker.status))
{'tcp://10.165.118.150:41061': 'running',
 'tcp://10.165.118.190:40097': 'running',
 'tcp://10.165.119.194:39373': 'running',
 'tcp://10.165.134.5:37665': 'running',
 'tcp://10.165.134.98:37897': 'running',
 'tcp://10.165.135.151:41777': 'running',
 'tcp://10.165.135.24:39843': 'running',
 'tcp://10.165.135.75:41797': 'running'}
Scheduler Info
>>> pprint(client.scheduler_info())
{'address': 'tcp://10.165.119.247:8786',
 'id': 'Scheduler-ae134b55-02b0-4644-9053-1a4d27cd6253',
 'services': {'dashboard': 8787},
 'type': 'Scheduler',
 'workers': {'tcp://10.165.118.150:40363': {'host': '10.165.118.150',
                                            'id': 'tcp://10.165.118.150:40363',
                                            'last_seen': 1567418324.9379842,
                                            'local_directory': '/tmp/worker-yn1xir95',
                                            'memory_limit': 6000000000,
                                            'metrics': {'bandwidth': 100000000,
                                                        'cpu': 0.0,
                                                        'executing': 0,
                                                        'in_flight': 0,
                                                        'in_memory': 0,
                                                        'memory': 106196992,
                                                        'num_fds': 27,
                                                        'read_bytes': 285.2919233032259,
                                                        'ready': 0,
                                                        'time': 1567418324.9362829,
                                                        'write_bytes': 768.0936396625311},
                                            'name': 'tcp://10.165.118.150:40363',
                                            'nanny': 'tcp://10.165.118.150:41743',
                                            'nthreads': 4,
                                            'resources': {},
                                            'services': {'dashboard': 44259},
                                            'type': 'Worker'},
             'tcp://10.165.118.190:39499': {'host': '10.165.118.190',
                                            'id': 'tcp://10.165.118.190:39499',
                                            'last_seen': 1567418325.1259258,
                                            'local_directory': '/tmp/worker-n67kzmw4',
                                            'memory_limit': 6000000000,
                                            'metrics': {'bandwidth': 100000000,
                                                        'cpu': 0.0,
                                                        'executing': 0,
                                                        'in_flight': 0,
                                                        'in_memory': 0,
                                                        'memory': 107712512,
                                                        'num_fds': 27,
                                                        'read_bytes': 285.8943480874215,
                                                        'ready': 0,
                                                        'time': 1567418325.1242785,
                                                        'write_bytes': 769.7155525430579},
                                            'name': 'tcp://10.165.118.190:39499',
                                            'nanny': 'tcp://10.165.118.190:37897',
                                            'nthreads': 4,
                                            'resources': {},
                                            'services': {'dashboard': 36281},
                                            'type': 'Worker'},
             'tcp://10.165.119.194:43069': {'host': '10.165.119.194',
                                            'id': 'tcp://10.165.119.194:43069',
                                            'last_seen': 1567418324.9348905,
                                            'local_directory': '/tmp/worker-insrgwz0',
                                            'memory_limit': 6000000000,
                                            'metrics': {'bandwidth': 100000000,
                                                        'cpu': 2.0,
                                                        'executing': 0,
                                                        'in_flight': 0,
                                                        'in_memory': 0,
                                                        'memory': 1865625600,
                                                        'num_fds': 1284,
                                                        'read_bytes': 285.88140257823795,
                                                        'ready': 0,
                                                        'time': 1567418324.9321342,
                                                        'write_bytes': 773.6790405439028},
                                            'name': 'tcp://10.165.119.194:43069',
                                            'nanny': 'tcp://10.165.119.194:32957',
                                            'nthreads': 4,
                                            'resources': {},
                                            'services': {'dashboard': 40601},
                                            'type': 'Worker'},
             'tcp://10.165.134.5:45259': {'host': '10.165.134.5',
                                          'id': 'tcp://10.165.134.5:45259',
                                          'last_seen': 1567418325.0068722,
                                          'local_directory': '/tmp/worker-c9_tdwxs',
                                          'memory_limit': 6000000000,
                                          'metrics': {'bandwidth': 100000000,
                                                      'cpu': 0.0,
                                                      'executing': 0,
                                                      'in_flight': 0,
                                                      'in_memory': 0,
                                                      'memory': 114638848,
                                                      'num_fds': 27,
                                                      'read_bytes': 285.86423455352053,
                                                      'ready': 0,
                                                      'time': 1567418325.004854,
                                                      'write_bytes': 765.636376461527},
                                          'name': 'tcp://10.165.134.5:45259',
                                          'nanny': 'tcp://10.165.134.5:42299',
                                          'nthreads': 4,
                                          'resources': {},
                                          'services': {'dashboard': 42483},
                                          'type': 'Worker'},
             'tcp://10.165.134.98:42651': {'host': '10.165.134.98',
                                           'id': 'tcp://10.165.134.98:42651',
                                           'last_seen': 1567418325.138247,
                                           'local_directory': '/tmp/worker-gmprr26a',
                                           'memory_limit': 6000000000,
                                           'metrics': {'bandwidth': 100000000,
                                                       'cpu': 2.0,
                                                       'executing': 0,
                                                       'in_flight': 0,
                                                       'in_memory': 0,
                                                       'memory': 114290688,
                                                       'num_fds': 27,
                                                       'read_bytes': 286.39261759547566,
                                                       'ready': 0,
                                                       'time': 1567418324.6365747,
                                                       'write_bytes': 769.0543017948438},
                                           'name': 'tcp://10.165.134.98:42651',
                                           'nanny': 'tcp://10.165.134.98:45353',
                                           'nthreads': 4,
                                           'resources': {},
                                           'services': {'dashboard': 37307},
                                           'type': 'Worker'},
             'tcp://10.165.135.151:46743': {'host': '10.165.135.151',
                                            'id': 'tcp://10.165.135.151:46743',
                                            'last_seen': 1567418325.299708,
                                            'local_directory': '/tmp/worker-ytdfgks_',
                                            'memory_limit': 6000000000,
                                            'metrics': {'bandwidth': 100000000,
                                                        'cpu': 4.0,
                                                        'executing': 0,
                                                        'in_flight': 0,
                                                        'in_memory': 0,
                                                        'memory': 114499584,
                                                        'num_fds': 27,
                                                        'read_bytes': 286.0085919100238,
                                                        'ready': 0,
                                                        'time': 1567418325.29762,
                                                        'write_bytes': 770.0231320654489},
                                            'name': 'tcp://10.165.135.151:46743',
                                            'nanny': 'tcp://10.165.135.151:35201',
                                            'nthreads': 4,
                                            'resources': {},
                                            'services': {'dashboard': 34289},
                                            'type': 'Worker'},
             'tcp://10.165.135.24:39503': {'host': '10.165.135.24',
                                           'id': 'tcp://10.165.135.24:39503',
                                           'last_seen': 1567418325.1116345,
                                           'local_directory': '/tmp/worker-96w5bp0w',
                                           'memory_limit': 6000000000,
                                           'metrics': {'bandwidth': 100000000,
                                                       'cpu': 2.0,
                                                       'executing': 0,
                                                       'in_flight': 0,
                                                       'in_memory': 0,
                                                       'memory': 113909760,
                                                       'num_fds': 27,
                                                       'read_bytes': 285.5692941296901,
                                                       'ready': 0,
                                                       'time': 1567418325.1096437,
                                                       'write_bytes': 766.8434192014055},
                                           'name': 'tcp://10.165.135.24:39503',
                                           'nanny': 'tcp://10.165.135.24:37487',
                                           'nthreads': 4,
                                           'resources': {},
                                           'services': {'dashboard': 44285},
                                           'type': 'Worker'},
             'tcp://10.165.135.75:42995': {'host': '10.165.135.75',
                                           'id': 'tcp://10.165.135.75:42995',
                                           'last_seen': 1567418325.064751,
                                           'local_directory': '/tmp/worker-_zkylo2z',
                                           'memory_limit': 6000000000,
                                           'metrics': {'bandwidth': 100000000,
                                                       'cpu': 0.0,
                                                       'executing': 0,
                                                       'in_flight': 0,
                                                       'in_memory': 0,
                                                       'memory': 112508928,
                                                       'num_fds': 27,
                                                       'read_bytes': 285.7694387958324,
                                                       'ready': 0,
                                                       'time': 1567418325.062967,
                                                       'write_bytes': 767.3808706125849},
                                           'name': 'tcp://10.165.135.75:42995',
                                           'nanny': 'tcp://10.165.135.75:34145',
                                           'nthreads': 4,
                                           'resources': {},
                                           'services': {'dashboard': 35467},
                                           'type': 'Worker'}}}
