/srv/irclogs.ubuntu.com/2017/08/31/#maas.txt

=== axw_ is now known as axw
=== frankban|afk is now known as frankban
parlosGood Morning08:49
EstEdyo :)08:51
parlosI'm trying to figure out how MAAS and Landscape are 'connected'. Landscape/Autopilot complains that there isnt a node that matches its need, and I've corrected this by adding such a node to MAAS, but still Landscape complains/does not detect it... any hints?08:54
EstEdsorry, I've never used landscape. I'm just at the same point of trying to get started with juju. Only at the stage of reading the manual :p09:03
parlos:) I went with conjure-up.. did not complain as much.. but now maas deployment failed..  sigh..  Good luck!09:12
EstEdok, I may also look at that. I am trying hard to take on all the information ;)09:12
EstEdthanks!09:12
mupBug #1714273 opened: [2.3.0] Power Error when checking power status <hwcert-server> <MAAS:Incomplete> <https://launchpad.net/bugs/1714273>14:57
xygnalhi guys.  I am having timeouts trying to access the API.15:29
xygnalmemory is good, CPU gets a bit hot but it does not fall behind on run queue.15:29
xygnalhowever.. I am seeing disk times of 500ms a lot on the region controller and on the rack controllers i see peaks of 1.5 SECONDS sometimes15:30
xygnalis 500ms response time going to cause such stalls?15:30
xygnalwe are running the pgsql database AND region controller on the same box15:30
roaksoaxxygnal: do the logs seem to show something is off? Or are there any rogue processes ?15:35
roaksoaxxygnal: how many machines / rack controllers, etc15:35
xygnal roaksoax didnt see anything in logs so far, have a tcpdump setup to make sure it actually gets to MAAS15:56
xygnal1 region controller15:56
xygnal6 rack controllers15:56
xygnalI think we are under 500 servers right now but it might be a couple hundred higher15:56
xygnalall of MAAS region and racks are on ESXi VMs15:57
xygnalis twistd multi threaded? it kinda looked as if they were bound to a single cpu15:58
roaksoaxxygnal: it is16:01
roaksoaxxygnal: what version you running ?16:01
xygnal2.216:02
roaksoaxxygnal: 2.2.2 ?16:02
roaksoaxxygnal: there's been some fixes in 2.2.x (soon to be 2.2.3) that may help with multiple region/racks16:02
xygnal_not sure16:03
xygnal_why?16:03
roaksoaxthere's some fixes to dns and rack/region communication/registration16:04
=== frankban is now known as frankban|afk
xygnalroaksoax:  yes we are 2.2.2 confirmed.  We are worried about a bug you only have a test fix for in 2.3, about rack controller restarts freaking out MAAS.17:23
xygnalalso17:23
xygnali discovered we only have 2 cpus on this region controller.  I expect we need more cores?17:23
xygnal4? 6?17:23
roaksoaxxygnal: that bug is backported to 2.2 , but newer 2.2.2 is not released yet17:24
roaksoaxxygnal: we typically recommend at least 4+ CPU's17:24
xygnalroaksoax according to the bug report it says 2.3 not 2.2...17:24
roaksoaxxygnal: since the region runs 4 processes17:24
roaksoaxxygnal: have a link ?17:25
xygnal170707117:26
xygnalsorry talking on my phone so i dont have to go through my work VPN to IRC17:26
xygnalthats the bug #17:26
roaksoaxhttps://bugs.launchpad.net/nova/+bug/1707071 -> seems openstack bug17:26
xygnaloh17:27
xygnali hit 0 instead of 917:28
xygnal*embarrassed*17:28
xygnal1707*9*7117:28
roaksoaxxygnal: it is fix committed and targetted for 22.317:30
roaksoax2.2.317:30
roaksoaxwhich has not been released yet17:30
xygnalah17:32
xygnalwhen is 2.2.3 looking to come out?17:32
xygnalalso, we are bumping up to 6 cores now17:32
roaksoaxxygnal: i was hoping to put a candidate this week, but I think I'll have to punt that to next week17:32
xygnalis it not compatible such that i could backport it into 2.2 temporarily?17:32
xygnal2.2.2 i mean17:34
roaksoaxxygnal: it should be: https://git.launchpad.net/maas/commit/?id=e34ededffc9cb96124ee2232793e0c064fdd735a17:45
xygnalroaksoax thanks that should be easy enough.  just one region? not sure where provisionserver is used.18:03
xygnal*provisioningserver18:03
xygnalroaksoax not having the best time here.  multiple users using our region server is causing the API and even the CLI to simply *hang*20:15
xygnaldoes not appear to be an OS bottleneck and I am not seeing warnings to correlate.  I do not have debugging turned on, as far as I know.20:15
xygnalbut I do have several twistd3 processes that use 100% cpu apiece20:15
xygnaleven after restarting regiond, even after rebooting20:15
xygnali have idle CPU, i have free memory, but I cannot interface with MAAS!20:16
roaksoaxxygnal: ok, so you said you have 1 controller with 6 rack controllers ?20:18
roaksoaxxygnal: and CPU is idle20:19
roaksoaxxygnal: but you cannot reach the UI/API ?20:19
xygnal6 cpus now after our talk20:19
xygnaland yes20:19
xygnalIf i strace the running pid20:19
xygnali see a lot of resource temporarily unavailable and connection timeouts20:19
xygnalsame for most of those high CPU twisted3 pids20:19
xygnalI have NOT rebooted the rack controllers since earlier today20:19
xygnalcorrect. I mean I can reach it in that i can connect TCP to the box, or i can login via SSH and use maas CLI to connect20:20
xygnalbut both hang the same20:20
xygnalno response20:20
xygnalMAAS is too busy to answer20:20
roaksoaxxygnal: so, do this, if possible, stop all your rack controllers for a second20:20
roaksoaxxygnal: and see if your region controller can be interacted with20:20
roaksoaxxygnal: then start adding rack by rack controller , but wait for them to fully connect20:21
roaksoaxbefore adding a new rack controller20:21
xygnalthink it's possible they are returning load to the region controller?20:21
roaksoaxxygnal: i'm thinking that all trying to connect to the region at the same time may be causing issues20:22
xygnalok20:22
roaksoaxxygnal: i'm running 2 region controllers and 4 rack controllers without issues, but definitely dont have 500 machines20:22
xygnalthanks20:23
xygnalwill test this and if we see a difference we may break off some rack controllers to a new region20:24
roaksoaxxygnal: so i've uploaded 2.2.3 candidate to ppa:maas/proposed20:42
roaksoaxxygnal: that has not gone through final qa though20:43
xygnalroaksoax we're seeing the problem return after the first rack controller is returned20:57
xygnalroaksoax this may be off the wall but i have been thinking on this all day.  a lot of our current builders are clients that are using ESXi.  all we do for them is write their image down to disk at the end of a generic install and they boot up fine and take over.20:58
xygnalcould that DD (which would take some time) cause any unwanted waiting from MAAS?20:59
xygnaljust making sure.20:59
xygnalactually it's an SSD so i suppose, not that long20:59
xygnalif i check the MAAS  twistd3's in lsof i can see the various client machines associated with that connection. I am certain we have multiple users trying to do things.21:01
xygnalthat is the why. the question is, why is MAAS so busy, when the system... is not?21:02
xygnali still dont see any convincing log messages21:02
roaksoaxxygnal: i know what the issue is21:14
roaksoaxerr21:14
roaksoaxnevermind21:14
roaksoaxwrong window21:14
xygnalroaksoax still seeing a ton of 'resource temporarily unavailable' and 'connection timed out' messages from the only active twistd3 right now21:22
xygnalnot from MAAS itself, but from strace -ffp21:22
xygnalmostly futex wait21:23
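[Editor's sketch] The strace messages described above can be tallied per errno to see which failures dominate. This is a minimal sketch, not part of MAAS: the function name `summarize_strace_errors` is hypothetical, and it assumes the usual strace line shape `syscall(args) = -1 ERRNO (description)`. A futex wait that returns early is reported as EAGAIN ("Resource temporarily unavailable"), and one that times out as ETIMEDOUT ("Connection timed out"), so those two messages need not mean a network failure.

```shell
#!/bin/sh
# Summarize failing syscalls in strace output by errno name.
# Reads strace text on stdin; prints "count ERRNO" pairs, most frequent first.
# Assumed line shape: syscall(args) = -1 ERRNO (description)
summarize_strace_errors() {
    awk '
        / = -1 E[A-Z0-9]+ / {
            # Find the "= -1" pair and count the errno name that follows it.
            for (i = 2; i < NF; i++) {
                if ($i == "-1" && $(i - 1) == "=") { count[$(i + 1)]++; break }
            }
        }
        END { for (e in count) print count[e], e }
    ' | sort -k1,1nr -k2,2
}

# Possible usage (strace writes to stderr; timeout bounds the capture):
#   sudo timeout 30 strace -f -p "$PID" 2>&1 | summarize_strace_errors
```

If nearly all failures are futex EAGAIN/ETIMEDOUT, the threads are mostly waiting on locks or timers rather than failing on sockets, which would point away from the network as the cause.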
xygnalroaksoax proposed PPA applied, all systems booted, still can't run CLI commands21:44
xygnalhangs21:44
mupBug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>21:57
mupBug #1714362 changed: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>22:00
mupBug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>22:06
roaksoaxxygnal: even with 1 rack controller connected ?22:23
roaksoaxxygnal: if that is the case, only 1 rack controller22:23
xygnalyes22:24
xygnaleven with only 122:24
roaksoaxxygnal: right, so that could mean some other problem, do you have firewalling in places or similar ?22:25
xygnalnot entirely open no its behind company proxy.   I noticed some kind of 'snap' connections failing constantly22:26
xygnalas well as NTP connections failing sometimes22:26
xygnalit has no problems grabbing its images that I can recall22:27
xygnaland not aware of any connectivity changes recently22:29
roaksoaxxygnal: any logs ?22:29
roaksoax'snap' connections, that's weird22:30
roaksoaxxygnal: could be related to networking firewalling between region/racks ?22:30
roaksoaxxygnal: i mean, if you are simply running the region and no racks and you cant contact the api22:30
roaksoaxthere could be another problem there22:30
xygnalno no. one rack.22:30
xygnalwe put back first rack after shutting all down22:31
xygnaldid not see any node to node comm errors outsud22:32
xygnaloutside of our own restarts22:32
xygnalcan get you logs. all the rpc errors between region and rack in them22:33
xygnalcorrelate to our node restarts22:33
roaksoaxxygnal: what if you try to reach the cli within the same machine of the region controller ?22:33
xygnalwe are...22:33
xygnalalready doing so..22:34
roaksoaxxygnal: and it works ?22:37
xygnalno22:37
xygnalhangs22:37
xygnalmaas is hung up on something and i cannot see it in the22:38
xygnallogs22:38
xygnalonly recent change i am aware of is update to 2.2.222:39
xygnalany regressions?22:39
roaksoaxxygnal: so doing it from the same region controller works22:40
roaksoaxxygnal: but doesn't work remotely22:40
xygnalno. its not working at all with any re22:41
xygnalrack* controller connected22:42
xygnaleven local to the region controller22:42
roaksoaxxygnal: try this22:42
roaksoaxstop maas-rackd22:42
roaksoaxstop maas-regiond22:42
roaksoaxps faux | grep twist22:42
roaksoaxxygnal: and see if there are any rogue processes22:42
roaksoaxxygnal: and then, sudo service postgresql stop22:42
roaksoaxxygnal: and then sudo service postgresql start22:43
roaksoaxsudo service maas-regiond start22:43
roaksoaxsudo service maas-rackd start22:43
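[Editor's sketch] The stop/check/restart order suggested above can be written as one script. This is a hedged sketch only: the helper names `run` and `restart_maas_stack` are hypothetical, `service` is used throughout in place of the bare `stop` commands, and `pgrep -a -f` stands in for `ps faux | grep twist`. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the stop / check / restart order from the conversation.
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 executes it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_maas_stack() {
    run sudo service maas-rackd stop
    run sudo service maas-regiond stop
    # Anything still listed here after both services are stopped is suspect.
    run pgrep -a -f twistd3
    run sudo service postgresql stop
    run sudo service postgresql start
    run sudo service maas-regiond start
    run sudo service maas-rackd start
}

# Possible usage:
#   restart_maas_stack             # print the plan
#   DRY_RUN=0 restart_maas_stack   # actually run it
```

Keeping the dry run as the default makes it harder to restart the database on a production region controller by accident.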
xygnalalright so.   both services stopped. two twisted3's still going.22:50
xygnaleven after pgsql stop and start22:50
xygnaloops nope22:50
xygnalthere they go22:50
xygnalok services back up22:52
xygnalmaas create command... hung22:52
xygnalCPU utilization back up to 100% for twisted3 again22:52
roaksoaxxygnal: hold on, so when you stopped maas-regiond and maas-rackd there were still twistd3 services running ?22:53
roaksoaxxygnal: like this22:53
roaksoaxmaas     44484  0.0  0.0   4508   712 ?        Ss   16:42   0:00 /bin/sh -c exec twistd3 --nodaemon --pidfile=         --logger=provisioningserver.logger.EventLogger maas-regiond 2>&1 |       tee -a $LOGFILE22:53
roaksoaxmaas     44492  3.6 10.1 1093456 203008 ?      Sl   16:42   4:49  \_ /usr/bin/python3 /usr/bin/twistd3 --nodaemon --pidfile= --logger=provisioningserver.logger.EventLogger maas-regiond22:54
roaksoax?>?22:54
xygnalyes22:54
xygnallike that22:54
roaksoaxxygnal: so that seems there are rogue processes that are running22:54
roaksoaxxygnal: so sudo service maas-regiond stop && sudo service maas-rackd stop22:55
roaksoaxxygnal: ps faux | grep twistd322:55
roaksoaxxygnal: and kill those rogue processes22:55
xygnalif i stop pgsql22:57
xygnalthey do die22:57
xygnalafter a minute or two22:57
xygnalbut they come right back when i bring services back22:57
roaksoaxxygnal: right, but I mean, make sure maas-regiond is stopped22:57
roaksoaxsame as maas-rackd22:57
roaksoaxand check if there are twisted processes22:58
roaksoaxif there are, then that's the problem22:58
roaksoaxas there shouldn't be22:58
xygnalthey show stopped in systemctl status22:58
xygnallet me recap22:58
xygnalstop rackd, stop regiond, i see two twisted3's still going.22:58
xygnalstop pgsql22:58
xygnalwait a minute22:58
xygnalall twisted3's gone22:58
xygnalno more 'rogue process' running22:58
xygnalbut if i start all 3 back up?22:59
xygnalproblem is back22:59
roaksoaxxygnal: ok, so stop rackd, stop regiond22:59
roaksoaxkill -9 rogue processes22:59
roaksoaxand wait and see if they come up22:59
roaksoaxif they do, it is worth investigating where those are coming from23:00
roaksoaxubuntu@maas:~$ sudo service maas-regiond stop23:00
roaksoaxubuntu@maas:~$ sudo service maas-rackd stop23:00
roaksoaxubuntu@maas:~$ ps faux | grep twistd323:00
roaksoaxubuntu   10292  0.0  0.0  12944   984 pts/0    S+   23:00   0:00              \_ grep --color=auto twistd323:00
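[Editor's sketch] In the output above the only match is the `grep` itself, which is what a clean state looks like. A small filter can drop that line and print just the PIDs of real twistd3 processes. The function name `twistd3_pids` is hypothetical, and the sketch assumes `ps faux` column order, with the PID in the second field.

```shell
#!/bin/sh
# Print the PIDs of twistd3 processes from `ps faux`-style text on stdin,
# skipping the line for the grep command that is doing the searching.
twistd3_pids() {
    awk '/twistd3/ && !/ grep / { print $2 }'
}

# Possible usage, only after maas-regiond and maas-rackd are confirmed stopped:
#   ps faux | twistd3_pids
#   ps faux | twistd3_pids | xargs -r sudo kill -9   # -r is a GNU xargs option
```

An empty result after both services are stopped matches the expected state shown in the transcript.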
xygnalalright23:01
xygnalinstead of stopping pgsql23:01
xygnali just stopped the two sercices23:01
xygnalservices*23:01
xygnaland killed the one twisted3 that didnt die23:01
xygnalmanually with kill -923:01
xygnalnow there are none running23:01
roaksoaxxygnal: ok so ps faux | grep twistd3 shows no rogue processes running23:02
roaksoaxxygnal: just double check that23:02
xygnalexactly23:02
xygnalzero results23:02
roaksoaxxygnal: ok, so restart maas-regiond, wait a few seconds23:02
roaksoaxxygnal: and restart maas-rackd23:03
xygnalalright23:04
xygnallooking good so far23:04
xygnaltime to run maas create23:04
xygnalaaand23:05
xygnalcpu pegged 100% as soon as I did the command23:05
xygnalstill no response23:05
roaksoaxxygnal: can you share your logs /var/log/maas/*.log23:06
roaksoaxbut that's really really strange23:06
mupBug #1714362 changed: [2.3] Power management balances between controllers on every power check <performance> <MAAS:Invalid> <https://launchpad.net/bugs/1714362>23:06
xygnaljust the last 30 minutes of them or so?23:06
roaksoaxxygnal: yeah that should be ok23:07
xygnalhm you know.  does everything have a timestamp? if i try to grab just timestamp i am going to potentially flush out logs23:08
xygnalI mean, filter out, preventing you from seeing23:08
roaksoaxxygnal: yeah23:09
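[Editor's sketch] The concern above is that filtering by timestamp drops untimestamped lines such as stack traces. One way around it is to print everything from the first entry at or after a cutoff, instead of matching each line. The function name `log_since` is hypothetical, and the sketch assumes entries begin with `YYYY-MM-DD HH:MM:SS`. That format is an assumption, and logs written in syslog format would need a different pattern.

```shell
#!/bin/sh
# Print every line from the first entry stamped at or after the cutoff onward,
# including untimestamped continuation lines such as Python tracebacks.
# Assumes entries start with "YYYY-MM-DD HH:MM:SS" (adjust if your logs differ).
log_since() {
    awk -v cutoff="$1" '
        !started && /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
            # ISO-style stamps sort correctly as plain strings.
            if (substr($0, 1, 19) >= cutoff) started = 1
        }
        started { print }
    '
}

# Possible usage (date -d is a GNU extension):
#   log_since "$(date -d '30 minutes ago' '+%Y-%m-%d %H:%M:%S')" < /var/log/maas/rackd.log
```

Because nothing after the cutoff is filtered, tracebacks stay attached to the entry that produced them.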
xygnalwell i can do it easy for maas.log and regiond.log but rackd has some stack traces, checking right now.23:14
xygnallooks like it threw a permission denied on lost+ found... what the...23:14
xygnalcritical image downloading images failed23:15
xygnalwhere would it be downloading them?23:15
xygnalthis was working before... looks like something has broken while I was away23:16
roaksoaxstrange23:17
roaksoax:/23:17
xygnalwhere do images usually get downloaded to?23:17
xygnalI didnt think we had a dedicated MOUNT for it23:18
roaksoaxxygnal: images.maas.io23:18
roaksoaxxygnal: oh23:18
roaksoaxxygnal: to the database23:18
roaksoaxxygnal: in the region controller23:18
roaksoaxand then the rack controllers sync those images onto the filesystems23:18
xygnaler... but i'm getting errors about a local file, permission denied on lost+found at the end of the stack trace23:18
xygnalfor that critical error mentioned above23:18
xygnalif it goes strait into database... why am I seeing this?23:19
roaksoaxxygnal: maybe you ran out of space ?23:19
roaksoaxxygnal: maybe that's the rackd.log ?23:19
roaksoaxor the rack putting those messages ?23:19
xygnalrackd.log on region23:19
xygnalrack controllers themselves are services off still from before23:20
xygnalits the region controller logs I am looking at23:20
roaksoaxxygnal: right, that's strange, I've never seen any similar issue23:20
xygnaland the rackd.log is whats saying that23:20
roaksoaxmaybe you just ran out of space ?23:20
roaksoaxxygnal: ah rackd.log23:20
xygnalnope, hardly any space used23:21
xygnalwell 42%23:21
xygnallots of room yet23:21
xygnalsame with inodes :)23:21
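[Editor's sketch] The space and inode checks mentioned above can be scripted by parsing `df -P` output. The function name `over_threshold` is hypothetical, and the path in the usage comment is an assumption about where the rack controller keeps its data, so substitute the directory named in the traceback. Separately, `lost+found` normally sits at the root of an ext2/3/4 filesystem and is readable only by root, so a "permission denied" there may mean the service is walking the top of a mounted filesystem, not that the disk is full.

```shell
#!/bin/sh
# Flag filesystems at or above a usage threshold.
# Reads `df -P` (or `df -P -i`) output on stdin; prints "mountpoint use%".
# Mount points containing spaces are not handled by this simple field split.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= limit + 0) print $6, $5
        }
    '
}

# Possible usage (the path is an assumption):
#   df -P /var/lib/maas | over_threshold 90      # block usage
#   df -P -i /var/lib/maas | over_threshold 90   # inode usage
#   mount | grep maas                            # is a filesystem mounted there?
```

No output from both `df` checks matches what was reported in the conversation (42% used, inodes fine), which leaves permissions on the mount as the more likely lead.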
roaksoaxstrange23:23
