Replies: 2 comments
Hey @sommerso, welcome to the SkyPilot community! Setting the proxy env vars in the
Thank you!!! I can now see the models being downloaded on the private LAN nodes. Getting there, step by step.
It starts the model in the Kubernetes pod. My YAML file is set to 10 CPUs, 10 replicas and 10 GB of memory, but I noticed that only 2 of the 10 replicas are started, even though the other replicas did download the model. Looking at the replica logs, it is stuck on the following:
http_proxy:http://management-server:3128 https_proxy:http://management-server:3128 no_proxy:localhost,127.0.0.1,0.0.0.0,172.30.10.0/24,192.168.0.0/16,10.96.0.0/12
(codellama, pid=1789) time=2025-07-11T16:56:27.614Z level=INFO source=images.go:476 msg="total blobs: 6"
(codellama, pid=1789) time=2025-07-11T16:56:27.614Z level=INFO source=images.go:483 msg="total unused blobs removed: 0"
(codellama, pid=1789) time=2025-07-11T16:56:27.614Z level=INFO source=routes.go:1288 msg="Listening on 0.0.0.0:9100 (version 0.9.6)"
(codellama, pid=1789) time=2025-07-11T16:56:27.614Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
(codellama, pid=1789) time=2025-07-11T16:56:27.626Z level=INFO source=gpu.go:377 msg="no compatible GPUs were discovered"
(codellama, pid=1789) time=2025-07-11T16:56:27.627Z level=INFO source=types.go:130 msg="inference compute" id=0 library=cpu variant="" compute="" driver=0.0 name="" total="2015.7 GiB" available="1999.8 GiB"
The pod says running. The sky serve status shows only 2 replicas started, on the management server, whereas the replicas on the rest of the nodes are just showing starting. After 15 minutes it times out with an unexpected termination.
Most of my digging and research still points back to "prlimit: failed to set the NOFILE resource limit: Operation not permitted", and to the sky controller node being set to run in privileged mode while the worker nodes are not. There is some suggestion of manually "patching" the deployment to use this privileged mode, which alludes to starting the sky serve YAML, waiting for the prompt, injecting the privileged mode and then confirming the prompt.
This is probably more of a Kubernetes question, but any prod in the right direction will be appreciated.
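One thing that may be worth checking alongside the privileged-mode angle: the no_proxy value in the replica log mixes hostnames with CIDR ranges, and not every HTTP client treats CIDR entries the same way. The sketch below is only a rough model of CIDR-aware matching (the function name is made up for illustration, and real clients may behave differently); it shows which addresses that no_proxy string would exempt from the proxy if CIDR entries are honoured.

```python
import ipaddress

# no_proxy value copied from the replica log in this thread.
NO_PROXY = "localhost,127.0.0.1,0.0.0.0,172.30.10.0/24,192.168.0.0/16,10.96.0.0/12"


def bypasses_proxy(host, no_proxy=NO_PROXY):
    """Return True if `host` matches an entry in a no_proxy-style list.

    Hypothetical helper: models CIDR-aware matching only. Some tools match
    no_proxy entries purely as strings and ignore CIDR ranges.
    """
    entries = [e.strip() for e in no_proxy.split(",") if e.strip()]
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        addr = None  # host is a name, not an IP literal

    for entry in entries:
        if addr is not None:
            try:
                if addr in ipaddress.ip_network(entry, strict=False):
                    return True
            except ValueError:
                pass  # entry is a hostname, not an IP or CIDR range
        # Exact hostname match, or host is a subdomain of the entry.
        if host == entry or host.endswith("." + entry.lstrip(".")):
            return True
    return False


if __name__ == "__main__":
    # 10.244.0.5 is a made-up pod IP used only as an example.
    for candidate in ("localhost", "10.96.0.1", "10.244.0.5", "management-server"):
        print(candidate, "->", "direct" if bypasses_proxy(candidate) else "via proxy")
```

If the pod network uses a range that is not in the list (the 10.244.0.5 address above is purely an example, not taken from this cluster), traffic between replicas and the controller would be sent to the proxy, which could be one reason replicas stay in starting.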
I'm running a POC for SkyPilot on compute blades that were scheduled to be disposed of.
The current setup is 5 compute blades in a blade chassis with about 200 CPUs (CPU only, no GPUs) at my disposal. It isn't ideal, but it is a start. One of the prerequisites from the CTO was that the compute nodes would be completely segregated from the production network. We created a private network internal to the chassis, and the compute nodes can all connect to each other. One of the compute nodes is the management / head / controller server; it was allowed on the production network and has a dual-NIC setup.
I am running Ubuntu 24.04 LTS.
So the main server has production LAN and therefore internet and DNS access.
Compute nodes have access to the management server, and to the internet via a proxy server configured on the management server, using the http_proxy and https_proxy environment variables. They also sync time from the management server and use it as their DNS server.
I have set up a Kubernetes cluster making use of the full kubelet install rather than minikube as per the quick start guides. SkyPilot 0.9.3 is installed and running.
Sky check shows that Kubernetes is available.
Kubectl get pods shows my compute and controller nodes.
I currently have a Llama model running: 1 replica using 24 CPUs. This has started up on the controller node.
The current issue I am facing is that any additional models starting up on the private LAN compute nodes fail to provision, as provisioning doesn't seem to pick up the proxy env variables from the OS. Is there a way that this can be configured in the SkyPilot .sky/config.yaml file?
I do have proxy exports in the model.yaml file, which I'm starting up with sky serve up model.yaml -n model, but it fails even before that YAML file is parsed.
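For anyone landing here with the same question, one place these variables might go is a kubernetes.pod_config section in ~/.sky/config.yaml, which, as far as I understand, is merged into the pod spec that SkyPilot creates. The key layout below is an assumption and should be checked against the docs for the installed version. The helper name is made up, and the proxy URL and no_proxy list are the ones that appear in the replica logs in this thread. This is just a sketch that builds the fragment and prints it as JSON, which is also valid YAML.

```python
import json


def proxy_pod_config(proxy_url, no_proxy):
    """Build a config fragment that injects proxy env vars into pods.

    Hypothetical helper. The kubernetes.pod_config layout is an assumption
    about SkyPilot's config schema; verify it for your version.
    """
    env = []
    for name, value in (
        ("http_proxy", proxy_url),
        ("https_proxy", proxy_url),
        ("no_proxy", no_proxy),
    ):
        # Set both spellings, since tools differ on which one they read.
        env.append({"name": name, "value": value})
        env.append({"name": name.upper(), "value": value})
    return {"kubernetes": {"pod_config": {"spec": {"containers": [{"env": env}]}}}}


if __name__ == "__main__":
    fragment = proxy_pod_config(
        "http://management-server:3128",
        "localhost,127.0.0.1,0.0.0.0,172.30.10.0/24,192.168.0.0/16,10.96.0.0/12",
    )
    # JSON is a subset of YAML, so this output can be merged by hand
    # into an existing config file.
    print(json.dumps(fragment, indent=2))
```

Because the variables end up in the container spec, they would be present before any setup or run commands from model.yaml execute, which is the stage that is failing here.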
The logs seem to point to NOFILE and DNS resolution, both of which I can confirm are working on the OS.
I 07-11 14:39:32 replica_managers.py:126] /usr/bin/prlimit
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] prlimit: failed to set the NOFILE resource limit: Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] prlimit: failed to set the NOFILE resource limit: Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] prlimit: failed to set the NOFILE resource limit: Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] prlimit: failed to set the NOFILE resource limit: Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] sudo: setrlimit(rlimit_nofile): Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] prlimit: failed to set the NOFILE resource limit: Operation not permitted
I 07-11 14:39:32 replica_managers.py:126] Waiting ray cluster to be initialized
I 07-11 14:39:32 replica_managers.py:126] === Ray and skypilot dependencies installation completed in 0 secs ===
I 07-11 14:39:32 replica_managers.py:126] Using Python 3.10.10 environment at: skypilot-runtime
I 07-11 14:39:32 replica_managers.py:126] Using Python 3.10.10 environment at: skypilot-runtime
I 07-11 14:39:32 replica_managers.py:126] warning: Skipping skypilot as it is not installed
I 07-11 14:39:32 replica_managers.py:126] warning: No packages to uninstall
I 07-11 14:39:32 replica_managers.py:126] Using Python 3.10.10 environment at: skypilot-runtime
I 07-11 14:39:32 replica_managers.py:126] error: Failed to fetch: https://pypi.org/simple/wheel/
I 07-11 14:39:32 replica_managers.py:126] Caused by: Could not connect, are you offline?
I 07-11 14:39:32 replica_managers.py:126] Caused by: Request failed after 3 retries
I 07-11 14:39:32 replica_managers.py:126] Caused by: error sending request for url (https://pypi.org/simple/wheel/)
I 07-11 14:39:32 replica_managers.py:126] Caused by: client error (Connect)
I 07-11 14:39:32 replica_managers.py:126] Caused by: dns error: failed to lookup address information: Temporary failure in name resolution
I 07-11 14:39:32 replica_managers.py:126] Caused by: failed to lookup address information: Temporary failure in name resolution
I 07-11 14:39:32 replica_managers.py:126]
I 07-11 14:39:32 replica_managers.py:126] ===== stderr =====command terminated with exit code 1
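The log above fails on a direct DNS lookup of pypi.org, which suggests the installer inside the pod is not going through the proxy at all. To narrow that down, something like the following Python sketch could be run inside a replica pod (for example via kubectl exec). The function names are made up for illustration. It only reports what the process can see: which proxy variables are set, whether pypi.org resolves directly, and whether the proxy's host and port accept a TCP connection.

```python
import os
import socket
from urllib.parse import urlparse

PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def proxy_env(environ=None):
    """Return the proxy settings visible to this process.

    Accepts either lower-case or upper-case variable names.
    """
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) or environ.get(name.upper()) for name in PROXY_VARS}


def can_resolve(host):
    """Return True if `host` resolves via the pod's own DNS settings."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False


def can_connect(proxy_url, timeout=3.0):
    """Return True if a TCP connection to the proxy's host:port succeeds."""
    parsed = urlparse(proxy_url)
    if not parsed.hostname or not parsed.port:
        return False
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    settings = proxy_env()
    for name, value in settings.items():
        print(f"{name} = {value!r}")
    print("direct DNS for pypi.org:", can_resolve("pypi.org"))
    if settings["https_proxy"]:
        print("proxy reachable:", can_connect(settings["https_proxy"]))
```

If the variables print as None inside the pod while they are set on the host OS, that would be consistent with the pod not inheriting the node's environment.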
Anything that I'm missing?
I know this isn't ideal, as SkyPilot was designed to be cloud-online-all-the-time. I was looking forward to having an on-prem LLM.