Currently, if even a single worker runs out of memory (OOM), LSF kills the entire Ray cluster. With the help of bluanch, find a mechanism to manage remote tasks so that one worker's failure does not take down the whole cluster.
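
As a rough illustration of the kind of isolation being asked for, here is a minimal sketch that runs each task in its own process, so a task that dies from a memory error is recorded as failed while the others still complete. It uses only the Python standard library and is not the Ray, LSF, or bluanch API. The names `TASKS` and `run_isolated` are hypothetical, and the OOM is simulated by raising `MemoryError` rather than by exhausting memory.

```python
import subprocess
import sys

# Hypothetical task bodies; each one runs in its own Python process.
# The "oom" task only simulates an out-of-memory failure.
TASKS = {
    "ok-1": "print('done')",
    "oom": "raise MemoryError('simulated OOM')",
    "ok-2": "print('done')",
}


def run_isolated(tasks, max_retries=1):
    """Run each task in a separate process and return a status per task."""
    results = {}
    for name, source in tasks.items():
        status = "failed"
        # One initial attempt plus up to max_retries retries.
        for _attempt in range(max_retries + 1):
            proc = subprocess.run(
                [sys.executable, "-c", source],
                capture_output=True,
                text=True,
            )
            if proc.returncode == 0:
                status = "succeeded"
                break
        results[name] = status
    return results


if __name__ == "__main__":
    for task_name, task_status in run_isolated(TASKS).items():
        print(task_name, task_status)
```

The supervisor only looks at each child's exit code, so a crashed task cannot bring down its siblings. A real solution would also need per-task memory limits and scheduling across nodes, which this sketch does not cover.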