XM Docker distributed experiments don't finish cleanly 

Hi, 

When running one of the baseline examples in a distributed fashion, e.g.

`python run_gail.py --num_steps=1 --run_distributed=True`

The experiment finishes with the following output:

> I1113 10:19:41.710619 139662026843712 lp_utils.py:98] StepsLimiter: Max steps of 1 was reached, terminating
> I1113 10:19:41.711068 139671506786112 savers.py:205] Caught SIGTERM: forcing a checkpoint save.
> Worker groups that did not terminate in time: ['actor']
> Killing entire runtime.
> Killed

While the experiment itself runs successfully, this messy teardown means the experiment runner cannot exit properly. With external loggers like Weights and Biases, the run is then reported as 'crashed' even though it completed successfully.
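
As a stopgap on the logging side, something like the following might help. This is only a sketch under my own assumptions, not anything Launchpad or Acme provides: `install_clean_shutdown` and `close_logger` are hypothetical names, and it assumes each worker receives SIGTERM (as the `savers.py` log line suggests) with enough time to flush the logger before the runtime is killed. It does not address why the worker group fails to terminate in the first place.

```python
import signal
import threading


def install_clean_shutdown(close_logger, signals=(signal.SIGTERM,)):
    """Call `close_logger` once when one of `signals` arrives.

    `close_logger` is a hypothetical zero-argument callback that flushes and
    closes the external logger. Must be called from the main thread, because
    Python only allows signal handlers to be installed there.

    Returns a threading.Event that is set once the callback has been triggered.
    """
    triggered = threading.Event()
    previous = {}

    def handler(signum, frame):
        # Guard against repeated signals: close the logger at most once.
        if not triggered.is_set():
            triggered.set()
            close_logger()
        # Chain to any Python-level handler that was installed before ours
        # (for example a checkpoint-on-SIGTERM handler).
        prev = previous.get(signum)
        if callable(prev):
            prev(signum, frame)

    for sig in signals:
        previous[sig] = signal.signal(sig, handler)
    return triggered
```

With Weights and Biases, the callback could presumably be `wandb.finish`, i.e. `install_clean_shutdown(wandb.finish)` early in each worker's entry point. I have not verified that the handler gets to run before the "Killing entire runtime" step.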

In my own (imitation learning) code, I see the same problem for full experiments, but the message names a different worker group:
> Worker groups that did not terminate in time: ['learner']

Is there a way to get a cleaner teardown? The XM Docker launch function in Launchpad doesn't appear to return a handle that could be used to `wait` for the workers to finish.

XM Docker distributed experiments don't finish cleanly #312