Description
What happened + What you expected to happen
How can I stop such jobs (running)? They were created by the vLLM V1 engine.

Something appears hung or inconsistent. I could not manually find the running process; it would be great to have the IP/PID for every actor.
What is the reason?
Thanks!
The pending job shows "Job has not started yet. It may be waiting for the runtime environment to be set up." and its logs show "Failed to load". What does this mean, and how can I debug it? How can I find the IP/PID of this setup process so that I can ssh into the machine and fetch a stack trace?
The symptoms seem the same as those below, but I don't use any containers:



I cannot even stop this pending job:

This continues endlessly (although typically it is the reverse: stopping is limited by some sort of grace period?) :(
Ray also often tries to connect to the Ray cluster running inside the Docker instance. For some reason, the inner Ray instance is also visible on the host. Because of this, ray stop tries and fails to kill the actors that belong to that other Ray instance inside Docker. This is extremely confusing.
Versions / Dependencies
Ray 2.48.0
Reproduction script
N/A
Issue Severity
None