/srv/irclogs.ubuntu.com/2017/04/08/#juju.txt

Budgie^Smoreso now I am officially a full-time job seeker, hopefully I will find a place that either uses juju or is open to it!00:05
dgonzoSaMnCo & tvansteenburgh thank you13:16
=== lathiat_ is now known as lathiat
SaMnCodgonzo: np, happy to help. What are you working on lately? Would a bare metal cluster of 3x nVidia Pascal P5000 help?14:10
dgonzoSaMnCo: I'm working on automating the setup of our cloud cluster. Next step is to figure out what level of autoscaling we can employ.14:23
dgonzoNVIDIA has reached out to us about a couple bare-metal projects so that would be helpful but right now the focus is our aws cloud.14:24
dgonzojust started the deploy again and it appears to be working. I read through the `enable-gpu.sh`14:25
dgonzoI see that's to run once the cluster is up. Have you looked at any of the stuff on autoscaling? I see clarifai (i.e. the folks who worked on the "--experimental-nvidia-gpus" support) hint at it and I ran into the autoscaling bundle by elastisys https://jujucharms.com/u/elastisys/autoscaled-kubernetes/bundle/014:30
dgonzoanything else you can point to on autoscaling kubernetes would be appreciated ... but it looks pretty sparse when this all hits GPUs14:31
SaMnCoExactly dgonzo I was about to point you to that. SimonKLB can you help and send a link to your video about Elastisys & CDK?15:14
SaMnCoWith CDK you could set constraints on services and then be done.15:14
SaMnCoHow many nodes are you looking at?15:15
dgonzoSaMnCo: This is still a dev cluster. I would like to start with one CPU node and one GPU node and have the cluster scale based on resource needs.16:58
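A rough sketch of the constraints approach SaMnCo mentions above, assuming AWS and stock CDK application names; the kubernetes-worker-gpu alias and the p2.xlarge flavour are only illustrative:
    # GPU pool: deploy the worker charm under its own name, pinned to a GPU instance type
    juju deploy kubernetes-worker kubernetes-worker-gpu --constraints "instance-type=p2.xlarge"
    # CPU pool: scale the plain workers independently
    juju add-unit kubernetes-worker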
dgonzoSaMnCo: I'm getting an "unable to connect" error from kubectl (both the locally installed one and the pre-configured client I scp'd from the k8s master)16:58
dgonzo"Unable to connect to the server: dial tcp 35.164.49.125:6443: i/o timeout"16:59
SaMnCojuju expose kubernetes-master16:59
dgonzoahh, great16:59
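For reference, the fix above is just opening the charm-declared ports on the cloud firewall; a minimal sketch, assuming the master application is named kubernetes-master as in a stock CDK bundle:
    # open the API port (6443) declared by the charm in the instance's security group
    juju expose kubernetes-master
    # verify the application now shows as exposed, then retry the client
    juju status kubernetes-master
    kubectl cluster-info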
SaMnCodid you get the scripts I shared yesterday? These speed up the process quite a bit (but the next official release will have auto GPU discovery)17:00
dgonzoI was just about to mess with the SG settings in aws proper17:00
SaMnCoso you won't need to worry about the GPU anymore17:00
dgonzoI did. Awesome. I'll keep following your progress. This is such a sore spot in AI workflows and our current processes are very hacky17:00
Zichi here, I just saw that I can use the latest version of Docker in CDK when configuring the kubernetes-worker charm... can I do this afterward on a production cluster, and if so, how?17:02
Zicwe have a serious performance bug in the latest Docker package from the Ubuntu archive that we don't have with the latest upstream Docker version17:02
dgonzoSaMnCo: ran kubectl create -f ./nvidia-smi.yml and the container is erroring out:17:12
dgonzo2017-04-08T17:06:55.508474879Z container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"17:12
SaMnCointeresting17:36
SaMnColet me check17:37
SaMnCodgonzo can you send the nvidia-smi file you are using (is it the one from my repo?)17:39
dgonzoit is the one from your repo17:40
SaMnCook17:40
SaMnCowondering if nvidia changed their images17:41
SaMnCoI should have locked that to a specific version and built my own17:41
dgonzocurrently kubernetes dashboard is reporting issues : Service Unavailable (503)17:42
SaMnCowhen you go to the long link from kubectl cluster-info?17:42
SaMnCoin the version I used for the blog, which is quite old, we did not include that endpoint, and the only way to look at the kube UI was to do a kubectl proxy17:43
SaMnComore recent versions allow it with an admin:admin login17:43
dgonzook. Yeah I'm using the kubectl proxy17:44
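A minimal sketch of the proxy route dgonzo is using; the /ui shortcut is an assumption that the dashboard add-on is deployed and that the cluster still serves that short redirect:
    # serve the API on localhost (default port 8001) using the local kubeconfig credentials
    kubectl proxy &
    # then browse to the dashboard through the proxy (path assumed; newer releases use the longer
    # /api/v1/.../proxy URL shown by `kubectl cluster-info`)
    xdg-open http://127.0.0.1:8001/ui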
SaMnCoso if you use the bundle from the blog, it is a bit outdated17:44
dgonzoit was fine but broken jobs were piling up on the nvidia-smi17:44
dgonzook17:44
SaMnCoright, delete the job17:44
SaMnCowill stop the bleeding17:45
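A sketch of the clean-up SaMnCo suggests, assuming the job came from the nvidia-smi manifest in his repo; the manifest and job names are whatever that repo actually uses:
    # delete whatever the manifest created; this stops the crash/retry loop
    kubectl delete -f nvidia-smi.yml
    # or, if that hangs, find the job and delete it by name
    kubectl get jobs
    kubectl delete job <job-name>   # <job-name> as reported by the previous command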
SaMnCoI'm downloading the nvidia/cuda image to have a look, but my connection is really bad here17:45
SaMnCoit is taking ages17:45
dgonzoyeah, it's a big file17:46
SaMnCoif you can download it then run it locally on your laptop, see if the binary has been moved17:46
SaMnCothen change the command in the nvidia-smi.yaml17:46
SaMnCothen I will probably have to update the post17:46
SaMnColast 300MB...17:48
dgonzodelete seems to hang. Is this right: "kubectl delete -f bundles/nvidia-smi.yml"?17:50
dgonzoalso I'm familiar with running CUDA on my laptop (GTX equipped) but I'm not sure what you're asking me to check. Are you wanting me to try and run the cuda charm locally?17:51
SaMnCono17:54
SaMnCodocker pull nvidia/cuda17:54
SaMnCodocker run --rm -it nvidia/cuda bash17:54
SaMnCothen try running nvidia-smi from the command line17:55
SaMnCoit should fail, as it failed on k8s, saying the binary is not in $PATH17:55
SaMnCodgonzo: ^17:55
SaMnCothen do17:55
SaMnCofind / -name "nvidia-smi"17:55
SaMnCoand see if it returns something17:56
SaMnCoif yes, update the nvidia-smi.yaml file at the line command:17:56
SaMnCoto include the full path of the binary, then reinstall it in the cluster17:56
SaMnCovia kubectl17:56
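Pulled together, the check described over the last few lines looks roughly like this; it is only a sketch, and the path you end up with (if any) comes from the find, not from anything assumed here:
    # reproduce the k8s failure locally
    docker pull nvidia/cuda
    docker run --rm -it nvidia/cuda bash
    # inside the container:
    nvidia-smi                          # expected to fail the same way it did on k8s
    find / -name nvidia-smi 2>/dev/null
    # if find prints a path, put that full path on the command: line of nvidia-smi.yaml,
    # then re-create the workload:
    kubectl delete -f nvidia-smi.yaml
    kubectl create -f nvidia-smi.yaml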
dgonzoI'm able to run the nvidia/cuda bash locally17:58
dgonzoeven though it didn't fail I'm getting this for the PATH17:59
dgonzo/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin17:59
dgonzoe.g. # echo $PATH17:59
SaMnCook18:02
SaMnCogive me a sec18:02
dgonzook I don't think I'm understanding...18:02
dgonzook18:02
SaMnCoI am 9MB away from having the image18:03
dgonzonvidia-docker is not in the cuda image18:03
dgonzo:)18:03
dgonzosorry nvidia-smi is not a command in the nvidia/cuda image as far as I can tell18:03
SaMnCoyes, they seem to have removed it18:04
SaMnCowhich is why the test fails18:04
dgonzook, nice. At least my problem is sane18:05
SaMnCoyep18:05
dgonzolocally i'm using nvidia-docker and nvidia-docker-compose18:05
dgonzoso k8s and juju are all new to me... I'm feeling dumb :)18:06
dgonzoI've had quite a bit of success rolling out GPU capable docker containers18:06
SaMnCono problem, we are here to help18:07
SaMnCohave a look into https://hub.docker.com/r/opless/t-nvidia-smi/18:07
SaMnCoI am sorry my connection is really bad, hard for me to DL 1GB in less than 30min18:07
SaMnCobut if you can18:07
SaMnCorun the same test18:07
SaMnCodocker pull ...18:07
SaMnCodocker run --rm opless/t-nvidia-smi nvidia-smi18:08
dgonzook, i'm lucky to have GB internet18:08
SaMnCoif it returns something that is not a path problem, then you may replace the image in the nvidia-smi.yaml file by this one18:08
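The same test with the alternate image, plus the manifest change it would imply; sketched on the assumption that the image actually ships the binary (the result a few lines down suggests it does not):
    docker pull opless/t-nvidia-smi
    docker run --rm opless/t-nvidia-smi nvidia-smi
    # if that works, swap the image in nvidia-smi.yaml (image: nvidia/cuda -> image: opless/t-nvidia-smi,
    # assuming that is the image the manifest currently names) and re-create it:
    kubectl delete -f nvidia-smi.yaml && kubectl create -f nvidia-smi.yaml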
dgonzopulling now18:08
SaMnCocool18:08
dgonzonope, getting a path problem with that image too18:10
SaMnCogrrrmmmmh18:10
SaMnCook, give me a sec18:11
dgonzonp18:11
dgonzonot sure if it will be helpful but here's what I've been doing to run nvidia containers with compose directly on aws18:16
dgonzohttps://github.com/NVIDIA/nvidia-docker/wiki/Installation18:17
SaMnCooh but you are running nVidia CUDA containers18:17
SaMnCoit is just that I used this as a test18:17
dgonzo*nods*18:17
SaMnCoand the test fails because of the Docker image, not of anything else18:17
SaMnCothere is no doubt you can run these, but I like easy proofs18:18
SaMnCoand this just fails now and it annoys me18:18
SaMnCoso here is my idea18:18
dgonzohah!18:18
SaMnCocan you do the following for me18:18
SaMnCojuju ssh kubernetes-worker-gpu "sudo find / -name nvidia-smi"18:19
SaMnCojuju ssh kubernetes-worker-gpu/0 "sudo find / -name nvidia-smi"18:19
SaMnCosorry first line I forgot the 018:19
SaMnCothis will return a path18:19
SaMnCothen18:19
SaMnCojuju scp kubernetes-worker-gpu/0:<path you got> ./18:20
dgonzook18:20
SaMnCofrom that, you should be able to build a docker image FROM nvidia/cuda18:20
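A sketch of the host-binary route SaMnCo outlines; the scp path is a placeholder for whatever the find returns, and bundling nvidia-smi without its matching driver libraries may well not be enough on its own:
    # locate nvidia-smi on the GPU worker and copy it down
    juju ssh kubernetes-worker-gpu/0 "sudo find / -name nvidia-smi"
    juju scp kubernetes-worker-gpu/0:<path-you-got> ./
    # wrap it in an image based on nvidia/cuda (image tag is arbitrary)
    cat > Dockerfile <<'EOF'
    FROM nvidia/cuda
    COPY nvidia-smi /usr/local/bin/nvidia-smi
    CMD ["nvidia-smi"]
    EOF
    docker build -t nvidia-smi-test .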
dgonzohmm not getting any results from the find command18:22
SaMnCohttps://www.irccloud.com/pastebin/JZNCQVte/18:22
SaMnCowtf maybe they just killed it18:22
dgonzo:(18:23
SaMnCowhat happens if you do18:25
SaMnCojuju ssh kubernetes-worker/0 "nvidia-smi"18:25
dgonzoso I just ran NV_GPU=0 nvidia-docker run -it --rm nvidia/cuda18:25
dgonzoand then which nvidia-smi18:25
dgonzoand get18:25
dgonzo/usr/local/nvidia/bin/nvidia-smi18:25
dgonzo... trying18:25
dgonzobash: nvidia-smi: command not found18:26
SaMnCooh I am so stupid I think the nvidia-smi is shared from the host18:26
SaMnCoit is not in the container itself18:26
SaMnCodo you have the CUDA installed and all that?18:27
dgonzoyou're not stupid that's just wacky18:27
dgonzoI do, cuDNN and all that18:27
SaMnCothe cluster you deployed, so you have CUDA installed on the workers?18:28
SaMnComeaning you have the CUDA charm built locally and so on?18:28
dgonzoI thought I did. How can I triple check?18:28
dgonzoyes18:28
SaMnCojuju status18:28
SaMnCoshould return a line with CUDA18:28
dgonzoyes18:29
SaMnCoand like all green18:29
SaMnCook18:29
dgonzowell, I have a Status = unknown18:29
SaMnCocan you run18:29
SaMnCojuju ssh kubernetes-worker-gpu/0 "ls -l /dev/nvidia*"18:30
SaMnCodo we have the devices created there?18:30
dgonzols: cannot access '/dev/nvidia': No such file or directory18:30
SaMnCojuju ssh kubernetes-worker-gpu/0 "sudo ls -l /dev/nvidia*"18:31
SaMnCosorry18:31
dgonzoyeah, still no18:33
dgonzo(cluster) gonzo@gonzo-msi:~/Projects/ziff/cluster$ juju ssh kubernetes-worker-gpu/0 "sudo ls -l /dev/nvidia"18:33
dgonzols: cannot access '/dev/nvidia': No such file or directory18:33
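For comparison, on a worker where the CUDA charm and driver install completed, that check would list device nodes such as /dev/nvidia0, /dev/nvidiactl and /dev/nvidia-uvm; getting "No such file or directory" instead, as here, points at the driver never having been set up. A short sketch, with an extra lsmod cross-check that is not from the conversation:
    # a loaded NVIDIA driver creates these device nodes; their absence means no driver on the host
    juju ssh kubernetes-worker-gpu/0 "sudo ls -l /dev/nvidia*"
    # quick cross-check on the same unit: is the kernel module present at all?
    juju ssh kubernetes-worker-gpu/0 "lsmod | grep -i nvidia"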
dgonzoso looks like my deploy failed on cuda18:34
dgonzo??18:34
dgonzoI'm going to try redeploying cuda and see where it goes wrong18:38
SaMnCoI am publishing the charm in my namespace18:39
SaMnCoso you can use that18:39
SaMnCook so, replace the charm line with cs:~samuel-cozannet/xenial/cuda-018:42
SaMnCothis is the same code base as what I have been using18:42
dgonzocool18:42
SaMnCoso at least we will know if it is again outdated, but I used this like 2 days ago and it worked fine18:42
dgonzo(cluster) gonzo@gonzo-msi:~/Projects/ziff/cluster$ juju deploy cs:~samuel-cozannet/xenial/cuda-018:45
dgonzoERROR cannot resolve charm URL "cs:~samuel-cozannet/xenial/cuda-0": cannot get "/~samuel-cozannet/xenial/cuda-0/meta/any?include=id&include=supported-series&include=published": unauthorized: access denied for user "davidbgonzalez"18:45
SaMnCoah maybe I need to grant you some rights18:53
dgonzo... I just tried rebuilding and deploying from the cloned repo and this time got an error "hook failed: install"18:55
dgonzoI'm going to remove that relation and app and try the charm you've shared and circle back on the error later18:55
SaMnCoah you can try again18:55
SaMnCook I got to go for tonight, but the charm is now in the store18:56
SaMnCoand public18:56
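With the charm now public, the retry presumably boils down to something like the following; the charm grant syntax (publisher side) is from memory and the relation endpoint is an assumption, so double-check both:
    # publisher side (SaMnCo): make the charm world-readable in the store (flags may differ)
    charm grant cs:~samuel-cozannet/xenial/cuda-0 --acl read everyone
    # consumer side (dgonzo): deploy it and relate it to the GPU workers
    juju deploy cs:~samuel-cozannet/xenial/cuda-0 cuda
    juju add-relation cuda kubernetes-worker-gpu    # endpoint names assumed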
dgonzothanks! I'll keep hacking, really appreciate the help18:56
dgonzobuenas noches18:56
SaMnConp, happy to help18:56
SaMnCokeep me posted18:56
CatalysHey folks, I just deployed a local environment of openstack (based on LXD) on a test machine. I was trying to have the nova compute node change from virt-type lxc to kvm, however it doesn't seem to get processed. There is a warning that it shouldn't be changed after deployment, assuming this is because it might break running VMs, however we don't have any yet. Is there a way to force the change though?19:01
