[08:49] <parlos> Good Morning
[08:51] <EstEd> yo :)
[08:54] <parlos> I'm trying to figure out how MAAS and Landscape are 'connected'. Landscape/Autopilot complains that there isn't a node that matches its needs, and I've corrected this by adding such a node to MAAS, but Landscape still complains/does not detect it... any hints?
[09:03] <EstEd> sorry, I've never used landscape. I'm just at the same point of trying to get started with juju. Only at the stage of reading the manual :p
[09:12] <parlos> :) I went with conjure-up.. did not complain as much.. but now maas deployment failed..  sigh..  Good luck!
[09:12] <EstEd> ok, I may also look at that. I am trying hard to take on all the information ;)
[09:12] <EstEd> thanks!
[14:57] <mup> Bug #1714273 opened: [2.3.0] Power Error when checking power status <hwcert-server> <MAAS:Incomplete> <https://launchpad.net/bugs/1714273>
[15:29] <xygnal> hi guys.  I am having timeouts trying to access the API.
[15:29] <xygnal> memory is good, CPU gets a bit hot but it does not fall behind on run queue.
[15:30] <xygnal> however.. I am seeing disk times of 500ms a lot on the region controller and on the rack controllers i see peaks of 1.5 SECONDS sometimes
[15:30] <xygnal> is 500ms response time going to cause such stalls?
[15:30] <xygnal> we are running the pgsql database AND region controller on the same box
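The disk-latency figures quoted above can be confirmed with `iostat`. This is a sketch, not something posted in the channel: the 100 ms threshold is a rough rule of thumb, and the header is parsed because the `await` column's position varies between sysstat versions.

```shell
# Sample extended device stats twice, 5 s apart, and flag any device
# whose "await" (average I/O wait, in ms) exceeds 100 ms. Sustained
# values of 500 ms+ on the database disk will stall the region API.
iostat -dx 5 2 | awk '
    /Device/ { for (i = 1; i <= NF; i++) if ($i == "await") col = i; next }
    col && $col + 0 > 100 { print $1, $col " ms avg wait" }'
```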
[15:35] <roaksoax> xygnal: do the logs seem to show something is off? Or are there any rogue processes ?
[15:35] <roaksoax> xygnal: how many machines / rack controllers, etc
[15:56] <xygnal> roaksoax: didn't see anything in logs so far, have a tcpdump set up to make sure it actually gets to MAAS
[15:56] <xygnal> 1 region controller
[15:56] <xygnal> 6 rack controllers
[15:56] <xygnal> I think we are under 500 servers right now but it might be a couple hundred higher
[15:57] <xygnal> all of MAAS region and racks are on ESXi VMs
[15:58] <xygnal> is twistd multi threaded? it kinda looked as if they were bound to a single cpu
[16:01] <roaksoax> xygnal: it is
[16:01] <roaksoax> xygnal: what version you running ?
[16:02] <xygnal> 2.2
[16:02] <roaksoax> xygnal: 2.2.2 ?
[16:02] <roaksoax> xygnal: there's been some fixes in 2.2.x (soon to be 2.2.3) that may help with multiple region/racks
[16:03] <xygnal_> not sure
[16:03] <xygnal_> why?
[16:04] <roaksoax> there's some fixes to dns and rack/region communication/registration
[17:23] <xygnal> roaksoax: yes we are 2.2.2 confirmed. We are worried about a bug you only have a test fix for in 2.3, about rack controller restarts freaking out MAAS.
[17:23] <xygnal> also
[17:23] <xygnal> i discovered we only have 2 cpus on this region controller.  I expect we need more cores?
[17:23] <xygnal> 4? 6?
[17:24] <roaksoax> xygnal: that bug is backported to 2.2, but the newer 2.2.x point release is not out yet
[17:24] <roaksoax> xygnal: we typically recommend at least 4+ CPUs
[17:24] <xygnal> roaksoax according to the bug report it says 2.3 not 2.2...
[17:24] <roaksoax> xygnal: since the region runs 4 processes
[17:25] <roaksoax> xygnal: have a link ?
[17:26] <xygnal> 1707071
[17:26] <xygnal> sorry, talking on my phone so i don't have to go through my work VPN to IRC
[17:26] <xygnal> thats the bug #
[17:26] <roaksoax> https://bugs.launchpad.net/nova/+bug/1707071 -> seems openstack bug
[17:27] <xygnal> oh
[17:28] <xygnal> i hit 0 instead of 9
[17:28] <xygnal> *embarrassed*
[17:28] <xygnal> 1707*9*71
[17:30] <roaksoax> xygnal: it is fix committed and targeted for 2.2.3
[17:30] <roaksoax> which has not been released yet
[17:32] <xygnal> ah
[17:32] <xygnal> when is 2.2.3 looking to come out?
[17:32] <xygnal> also, we are bumping up to 6 cores now
[17:32] <roaksoax> xygnal: i was hoping to put out a candidate this week, but I think I'll have to punt that to next week
[17:32] <xygnal> is it compatible, such that i could backport it into 2.2 temporarily?
[17:34] <xygnal> 2.2.2 i mean
[17:45] <roaksoax> xygnal: it should be: https://git.launchpad.net/maas/commit/?id=e34ededffc9cb96124ee2232793e0c064fdd735a
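A hypothetical sketch of how that single commit could be applied to an installed 2.2.2 tree. The cgit `/patch/` URL form and the `dist-packages` path are editorial assumptions, not from the channel; verify both against your install, and keep `--dry-run` until the hunks apply cleanly.

```shell
# Fetch the fix as a plain patch from the cgit instance (assumed URL form)
wget -O maas-backport.patch \
    'https://git.launchpad.net/maas/patch/?id=e34ededffc9cb96124ee2232793e0c064fdd735a'

# Dry-run it against the installed Python tree (path is an assumption
# for a deb-based install); drop --dry-run once it applies cleanly.
sudo patch -p1 -d /usr/lib/python3/dist-packages --dry-run < maas-backport.patch
```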
[18:03] <xygnal> roaksoax: thanks, that should be easy enough. just one region? not sure where provisioningserver is used.
[20:15] <xygnal> roaksoax not having the best time here.  multiple users using our region server is causing the API and even the CLI to simply *hang*
[20:15] <xygnal> does not appear to be an OS bottleneck and I am not seeing warnings to correlate. I do not have debugging turned on, as far as I know.
[20:15] <xygnal> but I do have several twistd3 processes that use 100% cpu apiece
[20:15] <xygnal> even after restarting regiond, even after rebooting
[20:16] <xygnal> i have idle CPU, i have free memory, but I cannot interface with MAAS!
[20:18] <roaksoax> xygnal: ok, so you said you have 1 controller with 6 rack controllers ?
[20:19] <roaksoax> xygnal: and CPU is idle
[20:19] <roaksoax> xygnal: but you cannot reach the UI/API ?
[20:19] <xygnal> 6 cpus now after our talk
[20:19] <xygnal> and yes
[20:19] <xygnal> If i strace the running pid
[20:19] <xygnal> i see a lot of resource temporarily unavailable and connection timeouts
[20:19] <xygnal> same for most of those high-CPU twistd3 pids
[20:19] <xygnal> I have NOT rebooted the rack controllers since earlier today
[20:20] <xygnal> correct. I mean I can reach it in that i can connect TCP to the box, or i can login via SSH and use maas CLI to connect
[20:20] <xygnal> but both hang the same
[20:20] <xygnal> no response
[20:20] <xygnal> MAAS is too busy to answer
[20:20] <roaksoax> xygnal: so, do this, if possible, stop all your rack controllers for a second
[20:20] <roaksoax> xygnal: and see if your region controller can be interfaced with
[20:21] <roaksoax> xygnal: then start adding rack controllers one by one, but wait for each to fully connect
[20:21] <roaksoax> before adding a new rack controller
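The stop-all-then-reattach procedure described above could be scripted roughly like this, assuming the racks are reachable over SSH; the hostnames are placeholders for the six rack controllers.

```shell
# Placeholder hostnames for the six rack controllers
RACKS="rack1 rack2 rack3 rack4 rack5 rack6"

# Stop rackd everywhere, then confirm the region API answers on its own
for r in $RACKS; do
    ssh "$r" sudo service maas-rackd stop
done

# Reattach racks one at a time, waiting for each to show as fully
# connected (in the UI, or via `maas $PROFILE rack-controllers read`)
# before starting the next
for r in $RACKS; do
    ssh "$r" sudo service maas-rackd start
    read -r -p "Press Enter once $r shows connected... " _
done
```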
[20:21] <xygnal> think it's possible they are returning load to the region controller?
[20:22] <roaksoax> xygnal: i'm thinking that all trying to connect to the region at the same time may be causing issues
[20:22] <xygnal> ok
[20:22] <roaksoax> xygnal: i'm running 2 region controllers and 4 rack controllers without issues, but definitely dont have 500 machines
[20:23] <xygnal> thanks
[20:24] <xygnal> will test this and if we see a difference we may break off some rack controllers to a new region
[20:42] <roaksoax> xygnal: so i've uploaded 2.2.3 candidate to ppa:maas/proposed
[20:43] <roaksoax> xygnal: that has not gone through final qa though
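Pulling that candidate in would look roughly like the following on a standard deb install. The package names are the usual 2.2-era ones and may differ on your system; since the build has not passed final QA, trying it on a test region first is prudent.

```shell
# Add the proposed PPA holding the 2.2.3 release candidate
sudo add-apt-repository ppa:maas/proposed
sudo apt-get update

# Upgrade only the already-installed MAAS packages (names are assumptions)
sudo apt-get install --only-upgrade maas maas-region-controller maas-rack-controller
```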
[20:57] <xygnal> roaksoax we're seeing the problem return after the first rack controller is returned
[20:58] <xygnal> roaksoax this may be off the wall but i have been thinking on this all day.  a lot of our current builders are clients that are using ESXi.  all we do for them is write their image down to disk at the end of a generic install and they boot up fine and take over.
[20:59] <xygnal> could that dd (which would take some time) cause any unwanted waiting from MAAS?
[20:59] <xygnal> just making sure.
[20:59] <xygnal> actually it's an SSD so i suppose, not that long
[21:01] <xygnal> if i check the MAAS  twistd3's in lsof i can see the various client machines associated with that connection. I am certain we have multiple users trying to do things.
[21:02] <xygnal> that is the why. the question is, why is MAAS so busy, when the system... is not?
[21:02] <xygnal> i still dont see any convincing log messages
[21:14] <roaksoax> xygnal: i know what the issue is
[21:14] <roaksoax> err
[21:14] <roaksoax> nevermind
[21:14] <roaksoax> wrong window
[21:22] <xygnal> roaksoax still seeing a ton of 'resource temporarily unavailable' and 'connection timed out' messages from the only active twistd3 right now
[21:22] <xygnal> not from MAAS itself, but from strace -ffp
[21:23] <xygnal> mostly futex wait
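A lighter-weight way to read the same strace data: `-c` aggregates a syscall summary instead of streaming every call, so futex-dominated spinning shows up as one obvious line. This is a sketch; the `pgrep` line just grabs the newest `twistd3` and may need adjusting to pick the right (busiest) pid.

```shell
# Pick the newest twistd3 process (adjust to target the busy one)
PID=$(pgrep -n -f twistd3)

# Attach for 30 s, follow children (-f), and print a per-syscall
# time/count summary on detach instead of a raw call stream
sudo timeout 30 strace -c -f -p "$PID"
```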
[21:44] <xygnal> roaksoax proposed PPA applied, all systems booted, still can't run CLI commands
[21:44] <xygnal> hangs
[21:57] <mup> Bug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>
[22:00] <mup> Bug #1714362 changed: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>
[22:06] <mup> Bug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>
[22:23] <roaksoax> xygnal: even with 1 rack controller connected?
[22:23] <roaksoax> xygnal: if that is the case, only 1 rack controller
[22:24] <xygnal> yes
[22:24] <xygnal> even with only 1
[22:25] <roaksoax> xygnal: right, so that could mean some other problem, do you have firewalling in place or similar?
[22:26] <xygnal> not entirely open, no; it's behind a company proxy.   I noticed some kind of 'snap' connections failing constantly
[22:26] <xygnal> as well as NTP connections failing sometimes
[22:27] <xygnal> it has no problems grabbing its images that I can recall
[22:29] <xygnal> and not aware of any connectivity changes recently
[22:29] <roaksoax> xygnal: any logs ?
[22:30] <roaksoax> 'snap' connections, that's weird
[22:30] <roaksoax> xygnal: could be related to networking firewalling between region/racks ?
[22:30] <roaksoax> xygnal: i mean, if you are simply running the region and no racks and you cant contact the api
[22:30] <roaksoax> there could be another problem there
[22:30] <xygnal> no no. one rack.
[22:31] <xygnal> we put back first rack after shutting all down
[22:32] <xygnal> did not see any node to node comm errors outside of our own restarts
[22:33] <xygnal> can get you logs. all the rpc errors between region and rack in turn
[22:33] <xygnal> correlate to our node restarts
[22:33] <roaksoax> xygnal: what if you try to reach the cli within the same machine as the region controller?
[22:33] <xygnal> we are...
[22:34] <xygnal> already doing so..
[22:37] <roaksoax> xygnal: and it works ?
[22:37] <xygnal> no
[22:37] <xygnal> hangs
[22:38] <xygnal> maas is hung up on something and i cannot see it in the logs
[22:39] <xygnal> only recent change i am aware of is update to 2.2.2
[22:39] <xygnal> any regressions?
[22:40] <roaksoax> xygnal: so doing it from the same region controller works
[22:40] <roaksoax> xygnal: but doesn't work remotely
[22:41] <xygnal> no. it's not working at all with any rack controller connected
[22:42] <xygnal> even local to the region controller
[22:42] <roaksoax> xygnal: try this
[22:42] <roaksoax> stop maas-rackd
[22:42] <roaksoax> stop maas-regiond
[22:42] <roaksoax> ps faux | grep twist
[22:42] <roaksoax> xygnal: and see if there are any rogue processes
[22:42] <roaksoax> xygnal: and then, sudo service postgresql stop
[22:43] <roaksoax> xygnal: and then sudo service postgresql start
[22:43] <roaksoax> sudo service maas-regiond start
[22:43] <roaksoax> sudo service maas-rackd start
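The sequence above, collected into one copy-paste block (assuming `service` works for all three units on this host). The bracketed grep pattern keeps the grep process itself out of the results, so any line printed really is a leftover `twistd3`.

```shell
sudo service maas-rackd stop
sudo service maas-regiond stop

# Should print nothing if the stops worked; the bracket trick stops
# grep from matching its own command line
ps faux | grep '[t]wistd3'

sudo service postgresql stop
sudo service postgresql start
sudo service maas-regiond start
sudo service maas-rackd start
```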
[22:50] <xygnal> alright so.  both services stopped. two twistd3's still going.
[22:50] <xygnal> even after pgsql stop and start
[22:50] <xygnal> oops nope
[22:50] <xygnal> there they go
[22:52] <xygnal> ok services back up
[22:52] <xygnal> maas create command... hung
[22:52] <xygnal> CPU utilization back up to 100% for twistd3 again
[22:53] <roaksoax> xygnal: hold on, so when you stopped maas-regiond and maas-rackd there were still twistd3 services running ?
[22:53] <roaksoax> xygnal: like this
[22:53] <roaksoax> maas     44484  0.0  0.0   4508   712 ?        Ss   16:42   0:00 /bin/sh -c exec twistd3 --nodaemon --pidfile=         --logger=provisioningserver.logger.EventLogger maas-regiond 2>&1 |       tee -a $LOGFILE
[22:54] <roaksoax> maas     44492  3.6 10.1 1093456 203008 ?      Sl   16:42   4:49  \_ /usr/bin/python3 /usr/bin/twistd3 --nodaemon --pidfile= --logger=provisioningserver.logger.EventLogger maas-regiond
[22:54] <roaksoax> ?
[22:54] <xygnal> yes
[22:54] <xygnal> like that
[22:54] <roaksoax> xygnal: so that seems there are rogue processes that are running
[22:55] <roaksoax> xygnal: so sudo service maas-regiond stop && sudo service maas-rackd stop
[22:55] <roaksoax> xygnal: ps faux | grep twistd3
[22:55] <roaksoax> xygnal: and kill those rogue processes
[22:57] <xygnal> if i stop pgsql
[22:57] <xygnal> they do die
[22:57] <xygnal> after a minute or two
[22:57] <xygnal> but they come right back when i bring services back
[22:57] <roaksoax> xygnal: right, but I mean, make sure maas-regiond is stopped
[22:57] <roaksoax> same as maas-rackd
[22:58] <roaksoax> and check if there are twisted processes
[22:58] <roaksoax> if there are, then that's the problem
[22:58] <roaksoax> as there shouldn't be
[22:58] <xygnal> they show stopped in systemctl status
[22:58] <xygnal> let me recap
[22:58] <xygnal> stop rackd, stop regiond, i see two twistd3's still going.
[22:58] <xygnal> stop pgsql
[22:58] <xygnal> wait a minute
[22:58] <xygnal> all twistd3's gone
[22:58] <xygnal> no more 'rogue process' running
[22:59] <xygnal> but if i start all 3 back up?
[22:59] <xygnal> problem is back
[22:59] <roaksoax> xygnal: ok, so stop rackd, stop regiond
[22:59] <roaksoax> kill -9 rogue processes
[22:59] <roaksoax> and wait and see if they come up
[23:00] <roaksoax> if they do, it is worth investigating where those are coming from
[23:00] <roaksoax> ubuntu@maas:~$ sudo service maas-regiond stop
[23:00] <roaksoax> ubuntu@maas:~$ sudo service maas-rackd stop
[23:00] <roaksoax> ubuntu@maas:~$ ps faux | grep twistd3
[23:00] <roaksoax> ubuntu   10292  0.0  0.0  12944   984 pts/0    S+   23:00   0:00              \_ grep --color=auto twistd3
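The check-and-kill step demonstrated above, scripted as one pass. `pgrep`/`pkill` exclude their own process, so no bracket trick is needed here; `kill -9` is the sledgehammer suggested in the channel.

```shell
sudo service maas-regiond stop
sudo service maas-rackd stop

# Anything twistd3 still alive at this point is a rogue leftover
if pgrep -f twistd3 > /dev/null; then
    sudo pkill -9 -f twistd3
    sleep 2
fi

# Final check: prints pids if rogues remain, "clean" otherwise
pgrep -f twistd3 || echo "clean"
```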
[23:01] <xygnal> alright
[23:01] <xygnal> instead of stopping pgsql
[23:01] <xygnal> i just stopped the two services
[23:01] <xygnal> and killed the one twistd3 that didn't die
[23:01] <xygnal> manually with kill -9
[23:01] <xygnal> now there are none running
[23:02] <roaksoax> xygnal: ok so ps faux | grep twistd3 shows no rogue processes running
[23:02] <roaksoax> xygnal: just double check that
[23:02] <xygnal> exactly
[23:02] <xygnal> zero results
[23:02] <roaksoax> xygnal: ok, so restart maas-regiond, wait a few seconds
[23:03] <roaksoax> xygnal: and restart maas-rackd
[23:04] <xygnal> alright
[23:04] <xygnal> looking good so far
[23:04] <xygnal> time to run maas create
[23:05] <xygnal> aaand
[23:05] <xygnal> cpu pegged at 100% as soon as I ran the command
[23:05] <xygnal> still no response
[23:06] <roaksoax> xygnal: can you share your logs /var/log/maas/*.log
[23:06] <roaksoax> but that's really really strange
[23:06] <mup> Bug #1714362 changed: [2.3] Power management balances between controllers on every power check <performance> <MAAS:Invalid> <https://launchpad.net/bugs/1714362>
[23:06] <xygnal> just the last 30 minutes of them or so?
[23:07] <roaksoax> xygnal: yeah that should be ok
[23:08] <xygnal> hm, you know, does everything have a timestamp? if i try to grab just a timestamp range i am potentially going to filter out logs, preventing you from seeing something
[23:09] <roaksoax> xygnal: yeah
[23:14] <xygnal> well i can do it easily for maas.log and regiond.log but rackd has some stack traces, checking right now.
[23:14] <xygnal> looks like it threw a permission denied on lost+found... what the...
[23:15] <xygnal> critical image downloading images failed
[23:15] <xygnal> where would it be downloading them?
[23:16] <xygnal> this was working before... looks like something has broken while I was away
[23:17] <roaksoax> strange
[23:17] <roaksoax> :/
[23:17] <xygnal> where do images usually get downloaded to?
[23:18] <xygnal> I didnt think we had a dedicated MOUNT for it
[23:18] <roaksoax> xygnal: images.maas.io
[23:18] <roaksoax> xygnal: oh
[23:18] <roaksoax> xygnal: to the database
[23:18] <roaksoax> xygnal: in the region controller
[23:18] <roaksoax> and then the rack controllers sync those images onto the filesystmes
[23:18] <xygnal> er... but i'm getting errors about a local file, permission denied on lost+found at the end of the stack trace
[23:18] <xygnal> for that critical error mentioned above
[23:19] <xygnal> if it goes straight into the database... why am I seeing this?
[23:19] <roaksoax> xygnal: maybe you ran out of space?
[23:19] <roaksoax> xygnal: maybe that's the rackd.log ?
[23:19] <roaksoax> or the rack putting those messages ?
[23:19] <xygnal> rackd.log on region
[23:20] <xygnal> rack controllers themselves still have their services off from before
[23:20] <xygnal> its the region controller logs I am looking at
[23:20] <roaksoax> xygnal: right, that's strange, I've never seen any similar issue
[23:20] <xygnal> and the rackd.log is whats saying that
[23:20] <roaksoax> maybe you just ran out of space ?
[23:20] <roaksoax> xygnal: ah rackd.log
[23:21] <xygnal> nope, hardly any space used
[23:21] <xygnal> well 42%
[23:21] <xygnal> lots of room yet
[23:21] <xygnal> same with inodes :)
[23:23] <roaksoax> strange