[08:49] <parlos> Good Morning
[08:51] <EstEd> yo :)
[08:54] <parlos> I'm trying to figure out how MAAS and Landscape are 'connected'. Landscape/Autopilot complains that there isn't a node that matches its needs, and I've corrected this by adding such a node to MAAS, but Landscape still complains/does not detect it... any hints?
[09:03] <EstEd> sorry, I've never used landscape. I'm just at the same point of trying to get started with juju. Only at the stage of reading the manual :p
[09:12] <parlos> :) I went with conjure-up.. did not complain as much.. but now maas deployment failed..  sigh..  Good luck!
[09:12] <EstEd> ok, I may also look at that. I am trying hard to take on all the information ;)
[09:12] <EstEd> thanks!
[14:57] <mup> Bug #1714273 opened: [2.3.0] Power Error when checking power status <hwcert-server> <MAAS:Incomplete> <https://launchpad.net/bugs/1714273>
[15:29] <xygnal> hi guys.  I am having timeouts trying to access the API.
[15:29] <xygnal> memory is good, CPU gets a bit hot but it does not fall behind on run queue.
[15:30] <xygnal> however.. I am seeing disk times of 500ms a lot on the region controller and on the rack controllers i see peaks of 1.5 SECONDS sometimes
[15:30] <xygnal> is 500ms response time going to cause such stalls?
[15:30] <xygnal> we are running the pgsql database AND region controller on the same box
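The disk-latency figures quoted above can be confirmed with `iostat`. This is a sketch, not something posted in the channel: the 100 ms threshold is a rough rule of thumb, and the header is parsed because the `await` column's position varies between sysstat versions.

```shell
# Sample extended device stats twice, 5 s apart, and flag any device
# whose "await" (average I/O wait, in ms) exceeds 100 ms. Sustained
# values of 500 ms+ on the database disk will stall the region API.
iostat -dx 5 2 | awk '
    /Device/ { for (i = 1; i <= NF; i++) if ($i == "await") col = i; next }
    col && $col + 0 > 100 { print $1, $col " ms avg wait" }'
```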
[15:35] <roaksoax> xygnal: do the logs seem to show something is off? Or are there any rogue processes ?
[15:35] <roaksoax> xygnal: how many machines / rack controllers, etc
[15:56] <xygnal> roaksoax: didn't see anything in logs so far, have a tcpdump set up to make sure it actually gets to MAAS
[15:56] <xygnal> 1 region controller
[15:56] <xygnal> 6 rack controllers
[15:56] <xygnal> I think we are under 500 servers right now but it might be a couple hundred higher
[15:57] <xygnal> all of MAAS region and racks are on ESXi VMs
[15:58] <xygnal> is twistd multi threaded? it kinda looked as if they were bound to a single cpu
[16:01] <roaksoax> xygnal: it is
[16:01] <roaksoax> xygnal: what version you running ?
[16:02] <xygnal> 2.2
[16:02] <roaksoax> xygnal: 2.2.2 ?
[16:02] <roaksoax> xygnal: there's been some fixes in 2.2.x (soon to be 2.2.3) that may help with multiple region/racks
[16:03] <xygnal_> not sure
[16:03] <xygnal_> why?
[16:04] <roaksoax> there's some fixes to dns and rack/region communication/registration
[17:23] <xygnal> roaksoax: yes we are 2.2.2 confirmed. We are worried about a bug you only have a test fix for in 2.3, about rack controller restarts freaking out MAAS.
[17:23] <xygnal> also
[17:23] <xygnal> i discovered we only have 2 cpus on this region controller.  I expect we need more cores?
[17:23] <xygnal> 4? 6?
[17:24] <roaksoax> xygnal: that bug is backported to 2.2, but the newer 2.2.x point release is not out yet
[17:24] <roaksoax> xygnal: we typically recommend at least 4+ CPUs
[17:24] <xygnal> roaksoax according to the bug report it says 2.3 not 2.2...
[17:24] <roaksoax> xygnal: since the region runs 4 processes
[17:25] <roaksoax> xygnal: have a link ?
[17:26] <xygnal> 1707071
[17:26] <xygnal> sorry, talking on my phone so i don't have to go through my work VPN to IRC
[17:26] <xygnal> thats the bug #
[17:26] <roaksoax> https://bugs.launchpad.net/nova/+bug/1707071 -> seems openstack bug
[17:27] <xygnal> oh
[17:28] <xygnal> i hit 0 instead of 9
[17:28] <xygnal> *embarrassed*
[17:28] <xygnal> 1707*9*71
[17:30] <roaksoax> xygnal: it is fix committed and targeted for 2.2.3
[17:30] <roaksoax> which has not been released yet
[17:32] <xygnal> ah
[17:32] <xygnal> when is 2.2.3 looking to come out?
[17:32] <xygnal> also, we are bumping up to 6 cores now
[17:32] <roaksoax> xygnal: i was hoping to put out a candidate this week, but I think I'll have to punt that to next week
[17:32] <xygnal> is it compatible, such that i could backport it into 2.2 temporarily?
[17:34] <xygnal> 2.2.2 i mean
[17:45] <roaksoax> xygnal: it should be: https://git.launchpad.net/maas/commit/?id=e34ededffc9cb96124ee2232793e0c064fdd735a
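A hypothetical sketch of how that single commit could be applied to an installed 2.2.2 tree. The cgit `/patch/` URL form and the `dist-packages` path are editorial assumptions, not from the channel; verify both against your install, and keep `--dry-run` until the hunks apply cleanly.

```shell
# Fetch the fix as a plain patch from the cgit instance (assumed URL form)
wget -O maas-backport.patch \
    'https://git.launchpad.net/maas/patch/?id=e34ededffc9cb96124ee2232793e0c064fdd735a'

# Dry-run it against the installed Python tree (path is an assumption
# for a deb-based install); drop --dry-run once it applies cleanly.
sudo patch -p1 -d /usr/lib/python3/dist-packages --dry-run < maas-backport.patch
```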
[18:03] <xygnal> roaksoax: thanks, that should be easy enough. just one region? not sure where provisioningserver is used.
[20:15] <xygnal> roaksoax not having the best time here.  multiple users using our region server is causing the API and even the CLI to simply *hang*
[20:15] <xygnal> does not appear to be an OS bottleneck and I am not seeing warnings to correlate. I do not have debugging turned on, as far as I know.
[20:15] <xygnal> but I do have several twistd3 processes that use 100% cpu apiece
[20:15] <xygnal> even after restarting regiond, even after rebooting
[20:16] <xygnal> i have idle CPU, i have free memory, but I cannot interface with MAAS!
[20:18] <roaksoax> xygnal: ok, so you said you have 1 controller with 6 rack controllers ?
[20:19] <roaksoax> xygnal: and CPU is idle
[20:19] <roaksoax> xygnal: but you cannot reach the UI/API ?
[20:19] <xygnal> 6 cpus now after our talk
[20:19] <xygnal> and yes
[20:19] <xygnal> If i strace the running pid
[20:19] <xygnal> i see a lot of resource temporarily unavailable and connection timeouts
[20:19] <xygnal> same for most of those high-CPU twistd3 pids
[20:19] <xygnal> I have NOT rebooted the rack controllers since earlier today
[20:20] <xygnal> correct. I mean I can reach it in that i can connect TCP to the box, or i can login via SSH and use maas CLI to connect
[20:20] <xygnal> but both hang the same
[20:20] <xygnal> no response
[20:20] <xygnal> MAAS is too busy to answer
[20:20] <roaksoax> xygnal: so, do this, if possible, stop all your rack controllers for a second
[20:20] <roaksoax> xygnal: and see if your region controller can be interfaced with
[20:21] <roaksoax> xygnal: then start adding rack controllers one by one, but wait for each to fully connect
[20:21] <roaksoax> before adding a new rack controller
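The stop-all-then-reattach procedure described above could be scripted roughly like this, assuming the racks are reachable over SSH; the hostnames are placeholders for the six rack controllers.

```shell
# Placeholder hostnames for the six rack controllers
RACKS="rack1 rack2 rack3 rack4 rack5 rack6"

# Stop rackd everywhere, then confirm the region API answers on its own
for r in $RACKS; do
    ssh "$r" sudo service maas-rackd stop
done

# Reattach racks one at a time, waiting for each to show as fully
# connected (in the UI, or via `maas $PROFILE rack-controllers read`)
# before starting the next
for r in $RACKS; do
    ssh "$r" sudo service maas-rackd start
    read -r -p "Press Enter once $r shows connected... " _
done
```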
[20:21] <xygnal> think it's possible they are returning load to the region controller?
[20:22] <roaksoax> xygnal: i'm thinking that all trying to connect to the region at the same time may be causing issues
[20:22] <xygnal> ok
[20:22] <roaksoax> xygnal: i'm running 2 region controllers and 4 rack controllers without issues, but definitely dont have 500 machines
[20:23] <xygnal> thanks
[20:24] <xygnal> will test this and if we see a difference we may break off some rack controllers to a new region
[20:42] <roaksoax> xygnal: so i've uploaded 2.2.3 candidate to ppa:maas/proposed
[20:43] <roaksoax> xygnal: that has not gone through final qa though
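Pulling that candidate in would look roughly like the following on a standard deb install. The package names are the usual 2.2-era ones and may differ on your system; since the build has not passed final QA, trying it on a test region first is prudent.

```shell
# Add the proposed PPA holding the 2.2.3 release candidate
sudo add-apt-repository ppa:maas/proposed
sudo apt-get update

# Upgrade only the already-installed MAAS packages (names are assumptions)
sudo apt-get install --only-upgrade maas maas-region-controller maas-rack-controller
```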
[20:57] <xygnal> roaksoax we're seeing the problem return after the first rack controller is returned
[20:58] <xygnal> roaksoax this may be off the wall but i have been thinking on this all day.  a lot of our current builders are clients that are using ESXi.  all we do for them is write their image down to disk at the end of a generic install and they boot up fine and take over.
[20:59] <xygnal> could that dd (which would take some time) cause any unwanted waiting from MAAS?
[20:59] <xygnal> just making sure.
[20:59] <xygnal> actually it's an SSD so i suppose, not that long
[21:01] <xygnal> if i check the MAAS  twistd3's in lsof i can see the various client machines associated with that connection. I am certain we have multiple users trying to do things.
[21:02] <xygnal> that is the why. the question is, why is MAAS so busy, when the system... is not?
[21:02] <xygnal> i still dont see any convincing log messages
[21:14] <roaksoax> xygnal: i know what the issue is
[21:14] <roaksoax> err
[21:14] <roaksoax> nevermind
[21:14] <roaksoax> wrong window
[21:22] <xygnal> roaksoax still seeing a ton of 'resource temporarily unavailable' and 'connection timed out' messages from the only active twistd3 right now
[21:22] <xygnal> not from MAAS itself, but from strace -ffp
[21:23] <xygnal> mostly futex wait
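A lighter-weight way to read the same strace data: `-c` aggregates a syscall summary instead of streaming every call, so futex-dominated spinning shows up as one obvious line. This is a sketch; the `pgrep` line just grabs the newest `twistd3` and may need adjusting to pick the right (busiest) pid.

```shell
# Pick the newest twistd3 process (adjust to target the busy one)
PID=$(pgrep -n -f twistd3)

# Attach for 30 s, follow children (-f), and print a per-syscall
# time/count summary on detach instead of a raw call stream
sudo timeout 30 strace -c -f -p "$PID"
```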
[21:44] <xygnal> roaksoax proposed PPA applied, all systems booted, still can't run CLI commands
[21:44] <xygnal> hangs
[21:57] <mup> Bug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>
[22:00] <mup> Bug #1714362 changed: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>
[22:06] <mup> Bug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>
[22:23] <roaksoax> xygnal: even with 1 rack controller connected?
[22:23] <roaksoax> xygnal: if that is the case, only 1 rack controller
[22:24] <xygnal> yes
[22:24] <xygnal> even with only 1
[22:25] <roaksoax> xygnal: right, so that could mean some other problem, do you have firewalling in place or similar?
[22:26] <xygnal> not entirely open, no; it's behind a company proxy.   I noticed some kind of 'snap' connections failing constantly
[22:26] <xygnal> as well as NTP connections failing sometimes
[22:27] <xygnal> it has no problems grabbing its images that I can recall
[22:29] <xygnal> and not aware of any connectivity changes recently
[22:29] <roaksoax> xygnal: any logs ?
[22:30] <roaksoax> 'snap' connections, that's weird
[22:30] <roaksoax> xygnal: could be related to networking firewalling between region/racks ?
[22:30] <roaksoax> xygnal: i mean, if you are simply running the region and no racks and you cant contact the api
[22:30] <roaksoax> there could be another problem there
[22:30] <xygnal> no no. one rack.
[22:31] <xygnal> we put back first rack after shutting all down
[22:32] <xygnal> did not see any node to node comm errors outside of our own restarts
[22:33] <xygnal> can get you logs. all the rpc errors between region and rack in turn
[22:33] <xygnal> correlate to our node restarts
[22:33] <roaksoax> xygnal: what if you try to reach the cli within the same machine as the region controller?
[22:33] <xygnal> we are...
[22:34] <xygnal> already doing so..
[22:37] <roaksoax> xygnal: and it works ?
[22:37] <xygnal> no
[22:37] <xygnal> hangs
[22:38] <xygnal> maas is hung up on something and i cannot see it in the logs
[22:39] <xygnal> only recent change i am aware of is update to 2.2.2
[22:39] <xygnal> any regressions?
[22:40] <roaksoax> xygnal: so doing it from the same region controller works
[22:40] <roaksoax> xygnal: but doesn't work remotely
[22:41] <xygnal> no. it's not working at all with any rack controller connected
[22:42] <xygnal> even local to the region controller
[22:42] <roaksoax> xygnal: try this
[22:42] <roaksoax> stop maas-rackd
[22:42] <roaksoax> stop maas-regiond
[22:42] <roaksoax> ps faux | grep twist
[22:42] <roaksoax> xygnal: and see if there are any rogue processes
[22:42] <roaksoax> xygnal: and then, sudo service postgresql stop
[22:43] <roaksoax> xygnal: and then sudo service postgresql start
[22:43] <roaksoax> sudo service maas-regiond start
[22:43] <roaksoax> sudo service maas-rackd start
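The sequence above, collected into one copy-paste block (assuming `service` works for all three units on this host). The bracketed grep pattern keeps the grep process itself out of the results, so any line printed really is a leftover `twistd3`.

```shell
sudo service maas-rackd stop
sudo service maas-regiond stop

# Should print nothing if the stops worked; the bracket trick stops
# grep from matching its own command line
ps faux | grep '[t]wistd3'

sudo service postgresql stop
sudo service postgresql start
sudo service maas-regiond start
sudo service maas-rackd start
```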
[22:50] <xygnal> alright so.  both services stopped. two twistd3's still going.
[22:50] <xygnal> even after pgsql stop and start
[22:50] <xygnal> oops nope
[22:50] <xygnal> there they go
[22:52] <xygnal> ok services back up
[22:52] <xygnal> maas create command... hung
[22:52] <xygnal> CPU utilization back up to 100% for twistd3 again
[22:53] <roaksoax> xygnal: hold on, so when you stopped maas-regiond and maas-rackd there were still twistd3 services running ?
[22:53] <roaksoax> xygnal: like this
[22:53] <roaksoax> maas     44484  0.0  0.0   4508   712 ?        Ss   16:42   0:00 /bin/sh -c exec twistd3 --nodaemon --pidfile=         --logger=provisioningserver.logger.EventLogger maas-regiond 2>&1 |       tee -a $LOGFILE
[22:54] <roaksoax> maas     44492  3.6 10.1 1093456 203008 ?      Sl   16:42   4:49  \_ /usr/bin/python3 /usr/bin/twistd3 --nodaemon --pidfile= --logger=provisioningserver.logger.EventLogger maas-regiond
[22:54] <roaksoax> ?
[22:54] <xygnal> yes
[22:54] <xygnal> like that
[22:54] <roaksoax> xygnal: so that seems there are rogue processes that are running
[22:55] <roaksoax> xygnal: so sudo service maas-regiond stop && sudo service maas-rackd stop
[22:55] <roaksoax> xygnal: ps faux | grep twistd3
[22:55] <roaksoax> xygnal: and kill those rogue processes
[22:57] <xygnal> if i stop pgsql
[22:57] <xygnal> they do die
[22:57] <xygnal> after a minute or two
[22:57] <xygnal> but they come right back when i bring services back
[22:57] <roaksoax> xygnal: right, but I mean, make sure maas-regiond is stopped
[22:57] <roaksoax> same as maas-rackd
[22:58] <roaksoax> and check if there are twisted processes
[22:58] <roaksoax> if there are, then that's the problem
[22:58] <roaksoax> as there shouldn't be
[22:58] <xygnal> they show stopped in systemctl status
[22:58] <xygnal> let me recap
[22:58] <xygnal> stop rackd, stop regiond, i see two twistd3's still going.
[22:58] <xygnal> stop pgsql
[22:58] <xygnal> wait a minute
[22:58] <xygnal> all twistd3's gone
[22:58] <xygnal> no more 'rogue process' running
[22:59] <xygnal> but if i start all 3 back up?
[22:59] <xygnal> problem is back
[22:59] <roaksoax> xygnal: ok, so stop rackd, stop regiond
[22:59] <roaksoax> kill -9 rogue processes
[22:59] <roaksoax> and wait and see if they come up
[23:00] <roaksoax> if they do, it is worth investigating where those are coming from
[23:00] <roaksoax> ubuntu@maas:~$ sudo service maas-regiond stop
[23:00] <roaksoax> ubuntu@maas:~$ sudo service maas-rackd stop
[23:00] <roaksoax> ubuntu@maas:~$ ps faux | grep twistd3
[23:00] <roaksoax> ubuntu   10292  0.0  0.0  12944   984 pts/0    S+   23:00   0:00              \_ grep --color=auto twistd3
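The check-and-kill step demonstrated above, scripted as one pass. `pgrep`/`pkill` exclude their own process, so no bracket trick is needed here; `kill -9` is the sledgehammer suggested in the channel.

```shell
sudo service maas-regiond stop
sudo service maas-rackd stop

# Anything twistd3 still alive at this point is a rogue leftover
if pgrep -f twistd3 > /dev/null; then
    sudo pkill -9 -f twistd3
    sleep 2
fi

# Final check: prints pids if rogues remain, "clean" otherwise
pgrep -f twistd3 || echo "clean"
```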
[23:01] <xygnal> alright
[23:01] <xygnal> instead of stopping pgsql
[23:01] <xygnal> i just stopped the two services
[23:01] <xygnal> and killed the one twistd3 that didn't die
[23:01] <xygnal> manually with kill -9
[23:01] <xygnal> now there are none running
[23:02] <roaksoax> xygnal: ok so ps faux | grep twistd3 shows no rogue processes running
[23:02] <roaksoax> xygnal: just double check that
[23:02] <xygnal> exactly
[23:02] <xygnal> zero results
[23:02] <roaksoax> xygnal: ok, so restart maas-regiond, wait a few seconds
[23:03] <roaksoax> xygnal: and restart maas-rackd
[23:04] <xygnal> alright
[23:04] <xygnal> looking good so far
[23:04] <xygnal> time to run maas create
[23:05] <xygnal> aaand
[23:05] <xygnal> cpu pegged at 100% as soon as I ran the command
[23:05] <xygnal> still no response
[23:06] <roaksoax> xygnal: can you share your logs /var/log/maas/*.log
[23:06] <roaksoax> but that's really really strange
[23:06] <mup> Bug #1714362 changed: [2.3] Power management balances between controllers on every power check <performance> <MAAS:Invalid> <https://launchpad.net/bugs/1714362>
[23:06] <xygnal> just the last 30 minutes of them or so?
[23:07] <roaksoax> xygnal: yeah that should be ok
[23:08] <xygnal> hm, you know, does everything have a timestamp? if i try to grab just a timestamp range i am potentially going to filter out logs, preventing you from seeing something
[23:09] <roaksoax> xygnal: yeah
[23:14] <xygnal> well i can do it easily for maas.log and regiond.log but rackd has some stack traces, checking right now.
[23:14] <xygnal> looks like it threw a permission denied on lost+found... what the...
[23:15] <xygnal> critical image downloading images failed
[23:15] <xygnal> where would it be downloading them?
[23:16] <xygnal> this was working before... looks like something has broken while I was away
[23:17] <roaksoax> strange
[23:17] <roaksoax> :/
[23:17] <xygnal> where do images usually get downloaded to?
[23:18] <xygnal> I didnt think we had a dedicated MOUNT for it
[23:18] <roaksoax> xygnal: images.maas.io
[23:18] <roaksoax> xygnal: oh
[23:18] <roaksoax> xygnal: to the database
[23:18] <roaksoax> xygnal: in the region controller
[23:18] <roaksoax> and then the rack controllers sync those images onto the filesystmes
[23:18] <xygnal> er... but i'm getting errors about a local file, permission denied on lost+found at the end of the stack trace
[23:18] <xygnal> for that critical error mentioned above
[23:19] <xygnal> if it goes straight into the database... why am I seeing this?
[23:19] <roaksoax> xygnal: maybe you ran out of space?
[23:19] <roaksoax> xygnal: maybe that's the rackd.log ?
[23:19] <roaksoax> or the rack putting those messages ?
[23:19] <xygnal> rackd.log on region
[23:20] <xygnal> rack controllers themselves still have their services off from before
[23:20] <xygnal> its the region controller logs I am looking at
[23:20] <roaksoax> xygnal: right, that's strange, I've never seen any similar issue
[23:20] <xygnal> and the rackd.log is whats saying that
[23:20] <roaksoax> maybe you just ran out of space ?
[23:20] <roaksoax> xygnal: ah rackd.log
[23:21] <xygnal> nope, hardly any space used
[23:21] <xygnal> well 42%
[23:21] <xygnal> lots of room yet
[23:21] <xygnal> same with inodes :)
[23:23] <roaksoax> strange