Budgie^Smore | so now I am officially a full-time job seeker, hopefully I will find a place that either uses juju or is open to it! | 00:05 |
---|---|---|
dgonzo | SaMnCo & tvansteenburgh thank you | 13:16 |
=== lathiat_ is now known as lathiat | ||
SaMnCo | dgonzo: np, happy to help. What are you working on lately? Would a bare metal cluster of 3x nVidia Pascal P5000 help? | 14:10 |
dgonzo | SaMnCo: I'm working on automating the setup of our cloud cluster. Next step is to figure out what level of autoscaling we can employ. | 14:23 |
dgonzo | NVIDIA has reached out to us about a couple bare-metal projects so that would be helpful but right now the focus is our aws cloud. | 14:24 |
dgonzo | just started the deploy again and it appears to be working. I read through the `enable-gpu.sh` | 14:25 |
dgonzo | I see that's to run once the cluster is up. Have you looked at any of the stuff on autoscaling? I see clarifai (e.g. folks who worked on the "--experimental-nvidia-gpus" support) hint at it and I ran into the autoscaling bundle by elastisys https://jujucharms.com/u/elastisys/autoscaled-kubernetes/bundle/0 | 14:30 |
dgonzo | anything else you can point to on autoscaling kubernetes would be appreciated ... but it looks pretty sparse when this all hits GPUs | 14:31 |
SaMnCo | Exactly dgonzo I was about to point you to that. SimonKLB can you help and send a link to your video about Elastisys & CDK? | 15:14 |
SaMnCo | With CDK you could set constraints on services and then be done. | 15:14 |
SaMnCo | How many nodes are you looking at? | 15:15 |
dgonzo | SaMnCo: This is still a dev cluster. I would like to start with one CPU node and one GPU node and have the cluster scale based on resource needs. | 16:58 |
dgonzo | SaMnCo: I'm getting an unable to connect from kubectl (both installed and the pre-configured client I scp'd from k8s master) | 16:58 |
dgonzo | "Unable to connect to the server: dial tcp 35.164.49.125:6443: i/o timeout" | 16:59 |
SaMnCo | juju expose kubernetes-master | 16:59 |
dgonzo | ahh, great | 16:59 |
SaMnCo | did you get the scripts I shared yesterday? These speed up the process quite a bit (but the next official release will have auto GPU discovery | 17:00 |
SaMnCo | ) | 17:00 |
dgonzo | I was just about to mess with the SG settings in aws proper | 17:00 |
SaMnCo | so you won't need to worry about the GPU anymore | 17:00 |
dgonzo | I did. Awesome. I'll keep following your progress. This is such a sore spot in AI workflows and our current processes are very hacky | 17:00 |
Zic | hi here, I just saw that I can use the latest version of Docker in CDK when configuring the kubernetes-worker charm... can I do this afterward on a production cluster, and if so, how? | 17:02 |
Zic | we have a serious performance bug in the latest Docker package from Ubuntu archive that we don't have on latest Docker version | 17:02 |
dgonzo | SaMnCo: ran kubectl create -f ./nvidia-smi.yml and the container is erroring out: | 17:12 |
dgonzo | 2017-04-08T17:06:55.508474879Z container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH" | 17:12 |
SaMnCo | interesting | 17:36 |
SaMnCo | let me check | 17:37 |
SaMnCo | dgonzo can you send the nvidia-smi file you are using (is it the one from my repo?) | 17:39 |
dgonzo | it is the one from your repo | 17:40 |
SaMnCo | ok | 17:40 |
SaMnCo | wondering if nvidia changed their images | 17:41 |
SaMnCo | I should have locked that to a specific version and built my own | 17:41 |
dgonzo | currently kubernetes dashboard is reporting issues : Service Unavailable (503) | 17:42 |
SaMnCo | when you go on the long link from kubectl cluster-info | 17:42 |
SaMnCo | ? | 17:42 |
SaMnCo | in the version I used for the blog, which is old, we did not include that endpoint, and the only way to look at the kubeUI was to do a kubectl proxy | 17:43 |
SaMnCo | more recent versions allow it with an admin:admin login | 17:43 |
dgonzo | ok. Yeah I'm using the kubectl proxy | 17:44 |
SaMnCo | so if you use the bundle from the blog, it is a bit outdated | 17:44 |
dgonzo | it was fine but broken jobs were piling up on the nvidia-smi | 17:44 |
dgonzo | ok | 17:44 |
SaMnCo | right, delete the job | 17:44 |
SaMnCo | will stop the bleeding | 17:45 |
SaMnCo | I'm downloading the nvidia/cuda image to have a look, but my connection is really bad here | 17:45 |
SaMnCo | it is taking ages | 17:45 |
dgonzo | yeah, it's a big file | 17:46 |
SaMnCo | if you can download then run locally on your laptop, see if the binary has been moved | 17:46 |
SaMnCo | then change the command from the nvidia-smi.yaml | 17:46 |
SaMnCo | then I will probably have to update the post | 17:46 |
SaMnCo | last 300MB... | 17:48 |
dgonzo | delete seems to hang. Is this right "kubectl delete -f bundles/nvidia-smi.yml" | 17:50 |
dgonzo | also I'm familiar with running CUDA on my laptop (GTK equipped) but I'm not sure what you're asking me to check. Are you wanting me to try and run the cuda charm locally? | 17:51 |
SaMnCo | no | 17:54 |
SaMnCo | docker pull nvidia/cuda | 17:54 |
SaMnCo | docker run --rm -it nvidia/cuda bash | 17:54 |
SaMnCo | then try running nvidia-smi from the command line | 17:55 |
SaMnCo | it should fail, as it failed on k8s, saying the binary is not in $PATH | 17:55 |
SaMnCo | dgonzo: ^ | 17:55 |
SaMnCo | then do | 17:55 |
SaMnCo | find / -name "nvidia-smi" | 17:55 |
SaMnCo | and see if it returns something | 17:56 |
SaMnCo | if yes, update the nvidia-smi.yaml file at the line command: | 17:56 |
SaMnCo | to include the full path of the binary, then reinstall it in the cluster | 17:56 |
SaMnCo | via kubectl | 17:56 |
dgonzo | I'm able to run the nvidia/cuda bash locally | 17:58 |
dgonzo | even though it didn't fail I'm getting this for the PATH | 17:59 |
dgonzo | /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin | 17:59 |
dgonzo | e.g. # echo $PATH | 17:59 |
SaMnCo | ok | 18:02 |
SaMnCo | give me a sec | 18:02 |
dgonzo | ok I don't think I'm understanding... | 18:02 |
dgonzo | ok | 18:02 |
SaMnCo | I am 9MB away from having the image | 18:03 |
dgonzo | nvidia-docker is not in the cuda image | 18:03 |
dgonzo | :) | 18:03 |
dgonzo | sorry nvidia-smi is not a command in the nvidia/cuda image as far as I can tell | 18:03 |
SaMnCo | yes, they seem to have removed it | 18:04 |
SaMnCo | which is why the test fails | 18:04 |
dgonzo | ok, nice. At least my problem is sane | 18:05 |
SaMnCo | yep | 18:05 |
dgonzo | locally i'm using nvidia-docker and nvidia-docker-compose | 18:05 |
dgonzo | so k8s and juju are all new to me... I'm feeling dumb :) | 18:06 |
dgonzo | I've had quite a bit of success rolling out GPU capable docker containers | 18:06 |
SaMnCo | no problem, we are here to help | 18:07 |
SaMnCo | have a look into https://hub.docker.com/r/opless/t-nvidia-smi/ | 18:07 |
SaMnCo | I am sorry my connection is really bad, hard for me to DL 1GB in less than 30min | 18:07 |
SaMnCo | but if you can | 18:07 |
SaMnCo | run the same test | 18:07 |
SaMnCo | docker pull ... | 18:07 |
SaMnCo | docker run --rm opless/t-nvidia-smi nvidia-smi | 18:08 |
dgonzo | ok, i'm lucky to have GB internet | 18:08 |
SaMnCo | if it returns something that is not a path problem, then you may replace the image in the nvidia-smi.yaml file by this one | 18:08 |
dgonzo | pulling now | 18:08 |
SaMnCo | cool | 18:08 |
dgonzo | nope, getting a path problem with that image too | 18:10 |
SaMnCo | grrrmmmmh | 18:10 |
SaMnCo | ok, give me a sec | 18:11 |
dgonzo | np | 18:11 |
dgonzo | not sure if it will be helpful but here's what I've been doing to run nvidia containers with compose directly on aws | 18:16 |
dgonzo | https://github.com/NVIDIA/nvidia-docker/wiki/Installation | 18:17 |
SaMnCo | oh but you are running nVidia CUDA containers | 18:17 |
SaMnCo | it is just that I used this as a test | 18:17 |
dgonzo | *nods* | 18:17 |
SaMnCo | and the test fails because of the Docker image, not because of anything else | 18:17 |
SaMnCo | there is no doubt you can run these, but I like easy proofs | 18:18 |
SaMnCo | and this just fails now and it annoys me | 18:18 |
SaMnCo | so here is my idea | 18:18 |
dgonzo | hah! | 18:18 |
SaMnCo | can you do the following for me | 18:18 |
SaMnCo | juju ssh kubernetes-worker-gpu "sudo find / -name nvidia-smi" | 18:19 |
SaMnCo | juju ssh kubernetes-worker-gpu/0 "sudo find / -name nvidia-smi" | 18:19 |
SaMnCo | sorry first line I forgot the 0 | 18:19 |
SaMnCo | this will return a path | 18:19 |
SaMnCo | then | 18:19 |
SaMnCo | juju scp kubernetes-worker-gpu/0:<path you got> ./ | 18:20 |
dgonzo | ok | 18:20 |
SaMnCo | from that, you should be able to build a docker image FROM nvidia/cuda | 18:20 |
dgonzo | hmm not getting any results from the find command | 18:22 |
SaMnCo | https://www.irccloud.com/pastebin/JZNCQVte/ | 18:22 |
SaMnCo | wtf maybe they just killed it | 18:22 |
dgonzo | :( | 18:23 |
SaMnCo | what happens if you do | 18:25 |
SaMnCo | juju ssh kubernetes-worker/0 "nvidia-smi" | 18:25 |
dgonzo | so I just ran NV_GPU=0 nvidia-docker run -it --rm nvidia/cuda | 18:25 |
dgonzo | and then which nvidia-smi | 18:25 |
dgonzo | and get | 18:25 |
dgonzo | /usr/local/nvidia/bin/nvidia-smi | 18:25 |
dgonzo | ... trying | 18:25 |
dgonzo | bash: nvidia-smi: command not found | 18:26 |
SaMnCo | oh I am so stupid I think the nvidia-smi is shared from the host | 18:26 |
SaMnCo | it is not in the container itself | 18:26 |
SaMnCo | do you have the CUDA installed and all that? | 18:27 |
dgonzo | you're not stupid that's just wacky | 18:27 |
dgonzo | I do cudnn and all that | 18:27 |
SaMnCo | the cluster you deployed, do you have CUDA installed on the workers? | 18:28 |
SaMnCo | meaning you have the CUDA charm built locally and so on? | 18:28 |
dgonzo | I thought I did how can I triple check | 18:28 |
dgonzo | yes | 18:28 |
SaMnCo | juju status | 18:28 |
SaMnCo | should return a line with CUDA | 18:28 |
dgonzo | yes | 18:29 |
SaMnCo | and like all green | 18:29 |
SaMnCo | ok | 18:29 |
dgonzo | well, I have a Status = unknown | 18:29 |
SaMnCo | can you run | 18:29 |
SaMnCo | juju ssh kubernetes-worker-gpu/0 "ls -l /dev/nvidia*" | 18:30 |
SaMnCo | do we have the devices created there? | 18:30 |
dgonzo | ls: cannot access '/dev/nvidia': No such file or directory | 18:30 |
SaMnCo | juju ssh kubernetes-worker-gpu/0 "sudo ls -l /dev/nvidia*" | 18:31 |
SaMnCo | sorry | 18:31 |
dgonzo | yeah, still no | 18:33 |
dgonzo | (cluster) gonzo@gonzo-msi:~/Projects/ziff/cluster$ juju ssh kubernetes-worker-gpu/0 "sudo ls -l /dev/nvidia" | 18:33 |
dgonzo | ls: cannot access '/dev/nvidia': No such file or directory | 18:33 |
dgonzo | so looks like my deploy failed on cuda | 18:34 |
dgonzo | ?? | 18:34 |
dgonzo | I'm going to try redeploying cuda and see where it goes wrong | 18:38 |
SaMnCo | I am publishing the charm in my namespace | 18:39 |
SaMnCo | so you can use that | 18:39 |
SaMnCo | ok so, replace the charm line by cs:~samuel-cozannet/xenial/cuda-0 | 18:42 |
SaMnCo | this is the same code base as what I have been using | 18:42 |
dgonzo | cool | 18:42 |
SaMnCo | so at least we will know if it is again outdated, but I used this like 2 days ago and it worked fine | 18:42 |
dgonzo | (cluster) gonzo@gonzo-msi:~/Projects/ziff/cluster$ juju deploy cs:~samuel-cozannet/xenial/cuda-0 | 18:45 |
dgonzo | ERROR cannot resolve charm URL "cs:~samuel-cozannet/xenial/cuda-0": cannot get "/~samuel-cozannet/xenial/cuda-0/meta/any?include=id&include=supported-series&include=published": unauthorized: access denied for user "davidbgonzalez" | 18:45 |
SaMnCo | ah maybe I need to grant you some rights | 18:53 |
dgonzo | ... I just tried rebuilding and deploying from the cloned repo and this time got an error "hook failed: install" | 18:55 |
dgonzo | I'm going to remove that relation and app and try the charm you've shared and circle back on the error later | 18:55 |
SaMnCo | ah you can try again | 18:55 |
SaMnCo | ok I got to go for tonight, but the charm is now in the store | 18:56 |
SaMnCo | and public | 18:56 |
dgonzo | thanks! I'll keep hacking really appreciate the help | 18:56 |
dgonzo | buenas noches | 18:56 |
SaMnCo | np, happy to help | 18:56 |
SaMnCo | keep me posted | 18:56 |
Catalys | Hey folks, I just deployed a local environment of openstack (based on LXD) on a test machine. I was trying to have the nova compute node change from virt-type lxc to kvm, however it doesn't seem to get processed. There is a warning that it shouldn't be changed after deployment, assuming this is because it might break running VMs, however we don't have any yet. Is there a way to force the change though? | 19:01 |
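
A note on the `nvidia-smi` confusion earlier in the log: the PATH dgonzo printed inside the `nvidia/cuda` container lists `/usr/local/nvidia/bin`, yet `nvidia-smi` was only found there when the container was started through `nvidia-docker`, which fits SaMnCo's guess that the binary is shared from the host. A directory being on PATH says nothing about whether the binary is actually in it. Below is a minimal shell sketch of that check. The helper name `resolve_on_path` is hypothetical (it is not part of juju, kubectl, or Docker), and the PATH string is the one quoted in the chat.

```shell
# Hypothetical helper (not from the chat): print the first executable named
# $1 found in the colon-separated directory list $2, or return 1 if none.
resolve_on_path() {
  cmd="$1"
  search_path="$2"
  old_ifs="$IFS"
  IFS=:
  for dir in $search_path; do
    # Skip empty entries; a real shell treats them as the current directory.
    [ -n "$dir" ] || continue
    if [ -f "$dir/$cmd" ] && [ -x "$dir/$cmd" ]; then
      IFS="$old_ifs"
      printf '%s\n' "$dir/$cmd"
      return 0
    fi
  done
  IFS="$old_ifs"
  return 1
}

# The PATH dgonzo reported from inside the nvidia/cuda container.
container_path="/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

if found=$(resolve_on_path nvidia-smi "$container_path"); then
  echo "nvidia-smi resolves to $found"
else
  echo "nvidia-smi is not in any directory on PATH"
fi
```

Run inside a container or on a worker, this distinguishes "the directory is listed" from "the binary is present", which is the distinction the `find / -name "nvidia-smi"` step in the chat was after.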