[Core] Stopping jobs without explicit Submission ID #56102

@vadimkantorov

Description

What happened + What you expected to happen

How can I stop running jobs like these (created by the vLLM V1 engine)?

Image

Something appears hung or inconsistent. I could not find the running process manually; it would be great to have the IP/PID shown for every actor.
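For reference, a minimal sketch of how the IP/PID of each alive actor might be recovered by joining actor records with node records. It assumes the Ray State API (`ray.util.state.list_actors` and `list_nodes`) is reachable from where the script runs and that the records expose `pid`, `node_id`, and `node_ip`; the helper names `locate_actors` and `fetch_alive_actor_locations` are hypothetical and not part of Ray.

```python
from typing import Any, Dict, Iterable, List, Optional


def _get(record: Any, key: str) -> Any:
    """Read a field from either a dict or an attribute-style record."""
    if isinstance(record, dict):
        return record.get(key)
    return getattr(record, key, None)


def locate_actors(actors: Iterable[Any], nodes: Iterable[Any]) -> List[Dict[str, Any]]:
    """Join actor records with node records to get the node IP and PID per actor."""
    ip_by_node = {_get(n, "node_id"): _get(n, "node_ip") for n in nodes}
    rows = []
    for actor in actors:
        rows.append({
            "actor_id": _get(actor, "actor_id"),
            "class_name": _get(actor, "class_name"),
            "job_id": _get(actor, "job_id"),
            "pid": _get(actor, "pid"),
            # None when the actor's node is not in the node list.
            "node_ip": ip_by_node.get(_get(actor, "node_id")),
        })
    return rows


def fetch_alive_actor_locations(address: Optional[str] = None) -> List[Dict[str, Any]]:
    """Query a running cluster (hypothetical helper; needs Ray installed)."""
    # Imported lazily so locate_actors() can be used without Ray installed.
    from ray.util.state import list_actors, list_nodes

    actors = list_actors(address=address, filters=[("state", "=", "ALIVE")], limit=1000)
    nodes = list_nodes(address=address, limit=1000)
    return locate_actors(actors, nodes)
```

Each returned row carries the `node_ip` and `pid` one would need to ssh into the machine and inspect the process, assuming the State API reports them for the actors in question.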

What is the reason?

Thanks!


The pending job shows the message "Job has not started yet. It may be waiting for the runtime environment to be set up." and its logs show "Failed to load". What does this mean, and how can I debug it? How can I find the IP/PID of this setup process so that I can ssh into the machine and fetch a stack trace?

The symptoms seem the same as in the following, but I don't use any containers:

Image Image Image

I cannot even stop this pending job:

Image

This continues endlessly (although typically it is the reverse: stopping is limited by some sort of grace period?) :(
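To make the question concrete, here is a minimal sketch of what I would like to be able to do: list all jobs, stop the active ones that have a submission ID, and report the driver IP/PID for active jobs that have none. It assumes the Ray Jobs SDK (`JobSubmissionClient.list_jobs` and `stop_job`), a dashboard at `http://127.0.0.1:8265`, and job records exposing `submission_id`, `status`, and `driver_info` with `node_ip_address` and `pid`; the helper names `split_jobs` and `stop_submitted_and_report_drivers` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

ACTIVE_STATUSES = {"PENDING", "RUNNING"}


def _get(record: Any, key: str) -> Any:
    """Read a field from either a dict or an attribute-style record."""
    if isinstance(record, dict):
        return record.get(key)
    return getattr(record, key, None)


def split_jobs(jobs: List[Any]) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Split active jobs into stoppable submission IDs and driver-only jobs."""
    stoppable: List[str] = []
    driver_only: List[Dict[str, Any]] = []
    for job in jobs:
        status = _get(job, "status")
        # Accept both plain strings and enum members with a .value attribute.
        if str(getattr(status, "value", status)) not in ACTIVE_STATUSES:
            continue
        submission_id = _get(job, "submission_id")
        if submission_id:
            stoppable.append(submission_id)
        else:
            driver = _get(job, "driver_info")
            driver_only.append({
                "job_id": _get(job, "job_id"),
                "node_ip": _get(driver, "node_ip_address"),
                "pid": _get(driver, "pid"),
            })
    return stoppable, driver_only


def stop_submitted_and_report_drivers(
    address: str = "http://127.0.0.1:8265",  # assumed dashboard address
) -> List[Dict[str, Any]]:
    """Stop active submitted jobs; return driver-only jobs (needs Ray installed)."""
    # Imported lazily so split_jobs() can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    stoppable, driver_only = split_jobs(client.list_jobs())
    for submission_id in stoppable:
        client.stop_job(submission_id)
    return driver_only
```

The jobs returned in `driver_only` are the ones this issue is about: they have no submission ID to pass to `stop_job`, so the sketch can only report where their driver process appears to live.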


Ray also often tries to connect to the Ray cluster running inside the Docker container. For some reason, the inner Ray instance is also visible on the host. Because of this, ray stop tries and fails to kill the actors that belong to that other Ray instance inside Docker. This is extremely confusing.

Versions / Dependencies

Ray 2.48.0

Reproduction script

N/A

Issue Severity

None

Labels: bug, community-backlog, core, question, stability, triage, usability
