=== axw_ is now known as axw | ||
=== frankban|afk is now known as frankban | ||
parlos | Good Morning | 08:49 |
---|---|---|
EstEd | yo :) | 08:51 |
parlos | I'm trying to figure out how MAAS and Landscape are 'connected'. Landscape/Autopilot complains that there isn't a node that matches its needs, and I've corrected this by adding such a node to MAAS, but Landscape still complains/does not detect it... any hints? | 08:54 |
EstEd | sorry, I've never used landscape. I'm just at the same point of trying to get started with juju. Only at the stage of reading the manual :p | 09:03 |
parlos | :) I went with conjure-up.. did not complain as much.. but now maas deployment failed.. sigh.. Good luck! | 09:12 |
EstEd | ok, I may also look at that. I am trying hard to take on all the information ;) | 09:12 |
EstEd | thanks! | 09:12 |
mup | Bug #1714273 opened: [2.3.0] Power Error when checking power status <hwcert-server> <MAAS:Incomplete> <https://launchpad.net/bugs/1714273> | 14:57 |
xygnal | hi guys. I am having timeouts trying to access the API. | 15:29 |
xygnal | memory is good, CPU gets a bit hot but it does not fall behind on run queue. | 15:29 |
xygnal | however.. I am seeing disk times of 500ms a lot on the region controller and on the rack controllers i see peaks of 1.5 SECONDS sometimes | 15:30 |
xygnal | is 500ms response time going to cause such stalls? | 15:30 |
xygnal | we are running the pgsql database AND region controller on the same box | 15:30 |
roaksoax | xygnal: do the logs seem to show something is off? Or are there any rogue processes ? | 15:35 |
roaksoax | xygnal: how many machines / rack controllers, etc | 15:35 |
xygnal | roaksoax didnt see anything in logs so far, have a tcpdump setup to make sure it actually gets to MAAS | 15:56 |
xygnal | 1 region controller | 15:56 |
xygnal | 6 rack controllers | 15:56 |
xygnal | I think we are under 500 servers right now but it might be a couple hundred higher | 15:56 |
xygnal | all of MAAS region and racks are on ESXi VMs | 15:57 |
xygnal | is twistd multi threaded? it kinda looked as if they were bound to a single cpu | 15:58 |
roaksoax | xygnal: it is | 16:01 |
roaksoax | xygnal: what version you running ? | 16:01 |
xygnal | 2.2 | 16:02 |
roaksoax | xygnal: 2.2.2 ? | 16:02 |
roaksoax | xygnal: there's been some fixes in 2.2.x (soon to be 2.2.3) that may help with multiple region/racks | 16:02 |
xygnal_ | not sure | 16:03 |
xygnal_ | why? | 16:03 |
roaksoax | there's some fixes to dns and rack/region communication/registration | 16:04 |
=== frankban is now known as frankban|afk | ||
xygnal | roaksoax: yes we are 2.2.2 confirmed. We are worried about a bug you only have a test fix for in 2.3, about rack controller restarts freaking out MAAS. | 17:23 |
xygnal | also | 17:23 |
xygnal | i discovered we only have 2 cpus on this region controller. I expect we need more cores? | 17:23 |
xygnal | 4? 6? | 17:23 |
roaksoax | xygnal: that bug is backported to 2.2 , but newer 2.2.2 is not released yet | 17:24 |
roaksoax | xygnal: we typically recommend at least 4 CPUs | 17:24 |
xygnal | roaksoax according to the bug report it says 2.3 not 2.2... | 17:24 |
roaksoax | xygnal: since the region runs 4 processes | 17:24 |
roaksoax | xygnal: have a link ? | 17:25 |
xygnal | 1707071 | 17:26 |
xygnal | sorry talking on my phone so i dont have to go through my work VPN to IRC | 17:26 |
xygnal | thats the bug # | 17:26 |
roaksoax | https://bugs.launchpad.net/nova/+bug/1707071 -> seems openstack bug | 17:26 |
xygnal | oh | 17:27 |
xygnal | i hit 0 instead of 9 | 17:28 |
xygnal | *embarrassed* | 17:28 |
xygnal | 1707*9*71 | 17:28 |
roaksoax | xygnal: it is fix committed and targeted for 22.3 | 17:30 |
roaksoax | 2.2.3 | 17:30 |
roaksoax | which has not been released yet | 17:30 |
xygnal | ah | 17:32 |
xygnal | when is 2.2.3 looking to come out? | 17:32 |
xygnal | also, we are bumping up to 6 cores now | 17:32 |
roaksoax | xygnal: i was hoping to put a candidate this week, but I think I'll have to punt that to next week | 17:32 |
xygnal | is it not compatible such that i could backport it into 2.2 temporarily? | 17:32 |
xygnal | 2.2.2 i mean | 17:34 |
roaksoax | xygnal: it should be: https://git.launchpad.net/maas/commit/?id=e34ededffc9cb96124ee2232793e0c064fdd735a | 17:45 |
xygnal | roaksoax thanks that should be easy enough. just one region? not sure where provisionserver is used. | 18:03 |
xygnal | *provisioningserver | 18:03 |
xygnal | roaksoax not having the best time here. multiple users using our region server is causing the API and even the CLI to simply *hang* | 20:15 |
xygnal | does not appear to be an OS bottleneck and I am not seeing warnings to correlate. I do not have debugging turned on, as far as I know. | 20:15 |
xygnal | but I do have several twistd3 processes that use 100% cpu apiece | 20:15 |
xygnal | even after restarting regiond, even after rebooting | 20:15 |
xygnal | i have idle CPU, i have free memory, but I cannot interface with MAAS! | 20:16 |
roaksoax | xygnal: ok, so you said you have 1 controller with 6 rack controllers ? | 20:18 |
roaksoax | xygnal: and CPU is idle | 20:19 |
roaksoax | xygnal: but you cannot reach the UI/API ? | 20:19 |
xygnal | 6 cpus now after our talk | 20:19 |
xygnal | and yes | 20:19 |
xygnal | If i strace the running pid | 20:19 |
xygnal | i see a lot of resource temporarily unavailable and connection timeouts | 20:19 |
xygnal | same for most of those high CPU twisted3 pids | 20:19 |
xygnal | I have NOT rebooted the rack controllers since earlier today | 20:19 |
xygnal | correct. I mean I can reach it in that i can connect TCP to the box, or i can login via SSH and use maas CLI to connect | 20:20 |
xygnal | but both hang the same | 20:20 |
xygnal | no response | 20:20 |
xygnal | MAAS is too busy to answer | 20:20 |
roaksoax | xygnal: so, do this, if possible, stop all your rack controllers for a second | 20:20 |
roaksoax | xygnal: and see if your region controller can be interfaced with | 20:20 |
roaksoax | xygnal: then start adding rack by rack controller , but wait for them to fully connect | 20:21 |
roaksoax | before adding a new rack controller | 20:21 |
xygnal | think it's possible they are returning load to the region controller? | 20:21 |
roaksoax | xygnal: i'm thinking that all trying to connect to the region at the same time may be causing issues | 20:22 |
xygnal | ok | 20:22 |
roaksoax | xygnal: i'm running 2 region controllers and 4 rack controllers without issues, but definitely dont have 500 machines | 20:22 |
xygnal | thanks | 20:23 |
xygnal | will test this and if we see a difference we may break off some rack controllers to a new region | 20:24 |
roaksoax | xygnal: so i've uploaded 2.2.3 candidate to ppa:maas/proposed | 20:42 |
roaksoax | xygnal: that has not gone through final qa though | 20:43 |
xygnal | roaksoax we're seeing the problem return after the first rack controller is returned | 20:57 |
xygnal | roaksoax this may be off the wall but i have been thinking on this all day. a lot of our current builders are clients that are using ESXi. all we do for them is write their image down to disk at the end of a generic install and they boot up fine and take over. | 20:58 |
xygnal | could that DD (which would take some time) cause any unwanted waiting from MAAS? | 20:59 |
xygnal | just making sure. | 20:59 |
xygnal | actually it's an SSD so i suppose, not that long | 20:59 |
xygnal | if i check the MAAS twistd3's in lsof i can see the various client machines associated with that connection. I am certain we have multiple users trying to do things. | 21:01 |
xygnal | that is the why. the question is, why is MAAS so busy, when the system... is not? | 21:02 |
xygnal | i still dont see any convincing log messages | 21:02 |
roaksoax | xygnal: i know what the issue is | 21:14 |
roaksoax | err | 21:14 |
roaksoax | nevermind | 21:14 |
roaksoax | wrong window | 21:14 |
xygnal | roaksoax still seeing a ton of 'resource temporarily unavailable' and 'connection timed out' messages from the only active twistd3 right now | 21:22 |
xygnal | not from MAAS itself, but from strace -ffp | 21:22 |
xygnal | mostly futex wait | 21:23 |
xygnal | roaksoax proposed PPA applied, all systems booted, still can't run CLI commands | 21:44 |
xygnal | hangs | 21:44 |
mup | Bug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362> | 21:57 |
mup | Bug #1714362 changed: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362> | 22:00 |
mup | Bug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362> | 22:06 |
roaksoax | xygnal: even with 1 rack controller connected ? | 22:23 |
roaksoax | xygnal: if that is the case, only 1 rack controller | 22:23 |
xygnal | yes | 22:24 |
xygnal | even with only 1 | 22:24 |
roaksoax | xygnal: right, so that could mean some other problem, do you have firewalling in place or similar ? | 22:25 |
xygnal | not entirely open no it's behind company proxy. I noticed some kind of 'snap' connections failing constantly | 22:26 |
xygnal | as well as NTP connections failing sometimes | 22:26 |
xygnal | it has no problems grabbing its images that I can recall | 22:27 |
xygnal | and not aware of any connectivity changes recently | 22:29 |
roaksoax | xygnal: any logs ? | 22:29 |
roaksoax | 'snap' connections, that's weird | 22:30 |
roaksoax | xygnal: could be related to networking firewalling between region/racks ? | 22:30 |
roaksoax | xygnal: i mean, if you are simply running the region and no racks and you cant contact the api | 22:30 |
roaksoax | there could be another problem there | 22:30 |
xygnal | no no. one rack. | 22:30 |
xygnal | we put back first rack after shutting all down | 22:31 |
xygnal | did not see any node to node comm errors outsud | 22:32 |
xygnal | outside of our own restarts | 22:32 |
xygnal | can get you logs. all the rpc errors between region and rack in them | 22:33 |
xygnal | correlate to our node restarts | 22:33 |
roaksoax | xygnal: what if you try to reach the cli within the same machine of the region controller ? | 22:33 |
xygnal | we are... | 22:33 |
xygnal | already doing so.. | 22:34 |
roaksoax | xygnal: and it works ? | 22:37 |
xygnal | no | 22:37 |
xygnal | hangs | 22:37 |
xygnal | maas is hung up on something and i cannot see it in the | 22:38 |
xygnal | logs | 22:38 |
xygnal | only recent change i am aware of is update to 2.2.2 | 22:39 |
xygnal | any regressions? | 22:39 |
roaksoax | xygnal: so doing it from the same region controller works | 22:40 |
roaksoax | xygnal: but doesn't work remotely | 22:40 |
xygnal | no. its not working at all with any re | 22:41 |
xygnal | rack* controller connected | 22:42 |
xygnal | even local to the region controller | 22:42 |
roaksoax | xygnal: try this | 22:42 |
roaksoax | stop maas-rackd | 22:42 |
roaksoax | stop maas-regiond | 22:42 |
roaksoax | ps faux | grep twist | 22:42 |
roaksoax | xygnal: and see if there are any rogue processes | 22:42 |
roaksoax | xygnal: and then, sudo service postgresql stop | 22:42 |
roaksoax | xygnal: and then sudo service postgresql start | 22:43 |
roaksoax | sudo service maas-regiond start | 22:43 |
roaksoax | sudo service maas-rackd start | 22:43 |
xygnal | alright so. both services stopped. two twisted3's still going. | 22:50 |
xygnal | even after pgsql stop and start | 22:50 |
xygnal | oops nope | 22:50 |
xygnal | there they go | 22:50 |
xygnal | ok services back up | 22:52 |
xygnal | maas create command... hung | 22:52 |
xygnal | CPU utilization back up to 100% for twisted3 again | 22:52 |
roaksoax | xygnal: hold on, so when you stopped maas-regiond and maas-rackd there were still twistd3 services running ? | 22:53 |
roaksoax | xygnal: like this | 22:53 |
roaksoax | maas 44484 0.0 0.0 4508 712 ? Ss 16:42 0:00 /bin/sh -c exec twistd3 --nodaemon --pidfile= --logger=provisioningserver.logger.EventLogger maas-regiond 2>&1 | tee -a $LOGFILE | 22:53 |
roaksoax | maas 44492 3.6 10.1 1093456 203008 ? Sl 16:42 4:49 \_ /usr/bin/python3 /usr/bin/twistd3 --nodaemon --pidfile= --logger=provisioningserver.logger.EventLogger maas-regiond | 22:54 |
roaksoax | ?>? | 22:54 |
xygnal | yes | 22:54 |
xygnal | like that | 22:54 |
roaksoax | xygnal: so that seems there are rogue processes that are running | 22:54 |
roaksoax | xygnal: so sudo service maas-regiond stop && sudo service maas-rackd stop | 22:55 |
roaksoax | xygnal: ps faux | grep twistd3 | 22:55 |
roaksoax | xygnal: and kill those rogue processes | 22:55 |
xygnal | if i stop pgsql | 22:57 |
xygnal | they do die | 22:57 |
xygnal | after a minute or two | 22:57 |
xygnal | but they come right back when i bring services back | 22:57 |
roaksoax | xygnal: right, but I mean, make sure maas-regiond is stopped | 22:57 |
roaksoax | same as maas-rackd | 22:57 |
roaksoax | and check if there are twisted processes | 22:58 |
roaksoax | if there are, then that's the problem | 22:58 |
roaksoax | as there shouldn't be | 22:58 |
xygnal | they show stopped in systemctl status | 22:58 |
xygnal | let me recap | 22:58 |
xygnal | stop rackd, stop regiond, i see two twisted3's still going. | 22:58 |
xygnal | stop pgsql | 22:58 |
xygnal | wait a minute | 22:58 |
xygnal | all twisted3's gone | 22:58 |
xygnal | no more 'rogue process' running | 22:58 |
xygnal | but if i start all 3 back up? | 22:59 |
xygnal | problem is back | 22:59 |
roaksoax | xygnal: ok, so stop rackd, stop regiond | 22:59 |
roaksoax | kill -9 rogue processes | 22:59 |
roaksoax | and wait and see if they come up | 22:59 |
roaksoax | if they do, it is worth investigating where those are coming from | 23:00 |
roaksoax | ubuntu@maas:~$ sudo service maas-regiond stop | 23:00 |
roaksoax | ubuntu@maas:~$ sudo service maas-rackd stop | 23:00 |
roaksoax | ubuntu@maas:~$ ps faux | grep twistd3 | 23:00 |
roaksoax | ubuntu 10292 0.0 0.0 12944 984 pts/0 S+ 23:00 0:00 \_ grep --color=auto twistd3 | 23:00 |
xygnal | alright | 23:01 |
xygnal | instead of stopping pgsql | 23:01 |
xygnal | i just stopped the two sercices | 23:01 |
xygnal | services* | 23:01 |
xygnal | and killed the one twisted3 that didnt die | 23:01 |
xygnal | manually with kill -9 | 23:01 |
xygnal | now there are none running | 23:01 |
roaksoax | xygnal: ok so ps faux | grep twistd3 shows no rogue processes running | 23:02 |
roaksoax | xygnal: just double check that | 23:02 |
xygnal | exactly | 23:02 |
xygnal | zero results | 23:02 |
roaksoax | xygnal: ok, so restart maas-regiond, wait a few seconds | 23:02 |
roaksoax | xygnal: and restart maas-rackd | 23:03 |
xygnal | alright | 23:04 |
xygnal | looking good so far | 23:04 |
xygnal | time to run maas create | 23:04 |
xygnal | aaand | 23:05 |
xygnal | cpu pegged 100% as soon as I did the command | 23:05 |
xygnal | still no response | 23:05 |
roaksoax | xygnal: can you share your logs /var/log/maas/*.log | 23:06 |
roaksoax | but that's really really strange | 23:06 |
mup | Bug #1714362 changed: [2.3] Power management balances between controllers on every power check <performance> <MAAS:Invalid> <https://launchpad.net/bugs/1714362> | 23:06 |
xygnal | just the last 30 minutes of them or so? | 23:06 |
roaksoax | xygnal: yeah that should be ok | 23:07 |
xygnal | hm you know. does everything have a timestamp? if i try to grab just timestamp i am going to potentially flush out logs | 23:08 |
xygnal | I mean, filter out, preventing you from seeing | 23:08 |
roaksoax | xygnal: yeah | 23:09 |
xygnal | well i can do it easy for maas.log and regiond.log but rackd has some stack traces, checking right now. | 23:14 |
xygnal | looks like it threw a permission denied on lost+found... what the... | 23:14 |
xygnal | critical image downloading images failed | 23:15 |
xygnal | where would it be downloading them? | 23:15 |
xygnal | this was working before... looks like something has broken while I was away | 23:16 |
roaksoax | strange | 23:17 |
roaksoax | :/ | 23:17 |
xygnal | where do images usually get downloaded to? | 23:17 |
xygnal | I didnt think we had a dedicated MOUNT for it | 23:18 |
roaksoax | xygnal: images.maas.io | 23:18 |
roaksoax | xygnal: oh | 23:18 |
roaksoax | xygnal: to the database | 23:18 |
roaksoax | xygnal: in the region controller | 23:18 |
roaksoax | and then the rack controllers sync those images onto the filesystems | 23:18 |
xygnal | er... but i'm getting errors about a local file, permission denied on lost+found at the end of the stack trace | 23:18 |
xygnal | for that critical error mentioned above | 23:18 |
xygnal | if it goes straight into database... why am I seeing this? | 23:19 |
roaksoax | xygnal: maybe you ran out of space ? | 23:19 |
roaksoax | xygnal: maybe that's the rackd.log ? | 23:19 |
roaksoax | or the rack putting those messages ? | 23:19 |
xygnal | rackd.log on region | 23:19 |
xygnal | rack controllers themselves are services off still from before | 23:20 |
xygnal | its the region controller logs I am looking at | 23:20 |
roaksoax | xygnal: right, that's strange, I've never seen any similar issue | 23:20 |
xygnal | and the rackd.log is whats saying that | 23:20 |
roaksoax | maybe you just ran out of space ? | 23:20 |
roaksoax | xygnal: ah rackd.log | 23:20 |
xygnal | nope, hardly any space used | 23:21 |
xygnal | well 42% | 23:21 |
xygnal | lots of room yet | 23:21 |
xygnal | same with inodes :) | 23:21 |
roaksoax | strange | 23:23 |
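
Editor's note: the clean-restart sequence roaksoax walks through above (stop both services, check `ps faux` for leftover twistd3 processes, `kill -9` any that remain, then start maas-regiond and, a little later, maas-rackd) could be sketched roughly as below. This is only an illustration under stated assumptions: the function names `stray_twistd3_pids` and `clean_restart` are hypothetical, the 10-second pause is an arbitrary stand-in for "wait a few seconds", and nothing here comes from MAAS itself.

```shell
#!/bin/sh
# Hypothetical helper: read `ps faux`-style output on stdin and print the
# PID (second column) of every line mentioning twistd3, skipping the line
# for the grep command itself.
stray_twistd3_pids() {
    awk '/twistd3/ && !/grep/ { print $2 }'
}

# Hypothetical wrapper around the sequence from the log. It is only
# defined here, not called; run it by hand on the region controller.
clean_restart() {
    sudo service maas-rackd stop
    sudo service maas-regiond stop
    # Anything still listed after both services are stopped is a leftover.
    for pid in $(ps faux | stray_twistd3_pids); do
        echo "killing leftover twistd3 process $pid"
        sudo kill -9 "$pid"
    done
    sudo service maas-regiond start
    sleep 10   # assumption: arbitrary pause before bringing the rack back
    sudo service maas-rackd start
}
```

To only list leftovers without killing anything, `ps faux | stray_twistd3_pids` can be run on its own after stopping both services; empty output matches the "zero results" state reached in the log.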