/srv/irclogs.ubuntu.com/2017/08/31/#maas.txt

=== axw_ is now known as axw
=== frankban|afk is now known as frankban
parlos	Good Morning	08:49
EstEd	yo :)	08:51
parlos	I'm trying to figure out how MAAS and Landscape are 'connected'. Landscape/Autopilot complains that there isn't a node that matches its needs, and I've corrected this by adding such a node to MAAS, but Landscape still complains/does not detect it... any hints?	08:54
EstEd	sorry, I've never used landscape. I'm just at the same point of trying to get started with juju. Only at the stage of reading the manual :p	09:03
parlos	:) I went with conjure-up.. did not complain as much.. but now maas deployment failed.. sigh.. Good luck!	09:12
EstEd	ok, I may also look at that. I am trying hard to take on all the information ;)	09:12
EstEd	thanks!	09:12
mup	Bug #1714273 opened: [2.3.0] Power Error when checking power status <hwcert-server> <MAAS:Incomplete> <https://launchpad.net/bugs/1714273>	14:57
xygnal	hi guys. I am having timeouts trying to access the API.	15:29
xygnal	memory is good, CPU gets a bit hot but it does not fall behind on run queue.	15:29
xygnal	however.. I am seeing disk times of 500ms a lot on the region controller and on the rack controllers i see peaks of 1.5 SECONDS sometimes	15:30
xygnal	is 500ms response time going to cause such stalls?	15:30
xygnal	we are running the pgsql database AND region controller on the same box	15:30
roaksoax	xygnal: do the logs seem to show something is off? Or are there any rogue processes ?	15:35
roaksoax	xygnal: how many machines / rack controllers, etc	15:35
xygnal	roaksoax didnt see anything in logs so far, have a tcpdump setup to make sure it actually gets to MAAS	15:56
xygnal	1 region controller	15:56
xygnal	6 rack controllers	15:56
xygnal	I think we are under 500 servers right now but it might be a couple hundred higher	15:56
xygnal	all of MAAS region and racks are on ESXi VMs	15:57
xygnal	is twistd multi threaded? it kinda looked as if they were bound to a single cpu	15:58
roaksoax	xygnal: it is	16:01
roaksoax	xygnal: what version you running ?	16:01
xygnal	2.2	16:02
roaksoax	xygnal: 2.2.2 ?	16:02
roaksoax	xygnal: there's been some fixes in 2.2.x (soon to be 2.2.3) that may help with multiple region/racks	16:02
xygnal_	not sure	16:03
xygnal_	why?	16:03
roaksoax	there's some fixes to dns and rack/region communication/registration	16:04
=== frankban is now known as frankban|afk
xygnal	roaksoax: yes we are 2.2.2 confirmed. We are worried about a bug you only have a test fix for in 2.3, about rack controller restarts freaking out MAAS.	17:23
xygnal	also	17:23
xygnal	i discovered we only have 2 cpus on this region controller. I expect we need more cores?	17:23
xygnal	4? 6?	17:23
roaksoax	xygnal: that bug is backported to 2.2 , but the newer 2.2.3 is not released yet	17:24
roaksoax	xygnal: we typically recommend at least 4+ CPU's	17:24
xygnal	roaksoax according to the bug report it says 2.3 not 2.2...	17:24
roaksoax	xygnal: since the region runs 4 processes	17:24
roaksoax	xygnal: have a link ?	17:25
xygnal	1707071	17:26
xygnal	sorry talking on my phone so i dont have to go through my work VPN to IRC	17:26
xygnal	thats the bug #	17:26
roaksoax	https://bugs.launchpad.net/nova/+bug/1707071 -> seems openstack bug	17:26
xygnal	oh	17:27
xygnal	i hit 0 instead of 9	17:28
xygnal	embarrassed	17:28
xygnal	1707971	17:28
roaksoax	xygnal: it is fix committed and targeted for 22.3	17:30
roaksoax	2.2.3	17:30
roaksoax	which has not been released yet	17:30
xygnal	ah	17:32
xygnal	when is 2.2.3 looking to come out?	17:32
xygnal	also, we are bumping up to 6 cores now	17:32
roaksoax	xygnal: i was hoping to put a candidate this week, but I think I'll have to punt that to next week	17:32
xygnal	is it not compatible such that i could backport it into 2.2 temporarily?	17:32
xygnal	2.2.2 i mean	17:34
roaksoax	xygnal: it should be: https://git.launchpad.net/maas/commit/?id=e34ededffc9cb96124ee2232793e0c064fdd735a	17:45
xygnal	roaksoax thanks that should be easy enough. just one region? not sure where provisionserver is used.	18:03
xygnal	*provisioningserver	18:03
xygnal	roaksoax not having the best time here. multiple users using our region server is causing the API and even the CLI to simply hang	20:15
xygnal	does not appear to be an OS bottleneck and I am not seeing warnings to correlate. I do not have debugging turned on, as far as I know.	20:15
xygnal	but I do have several twistd3 processes that use 100% cpu apiece	20:15
xygnal	even after restarting regiond, even after rebooting	20:15
xygnal	i have idle CPU, i have free memory, but I cannot interface with MAAS!	20:16
roaksoax	xygnal: ok, so you said you have 1 controller with 6 rack controllers ?	20:18
roaksoax	xygnal: and CPU is idle	20:19
roaksoax	xygnal: but you cannot reach the UI/API ?	20:19
xygnal	6 cpus now after our talk	20:19
xygnal	and yes	20:19
xygnal	If i strace the running pid	20:19
xygnal	i see a lot of resource temporarily unavailable and connection timeouts	20:19
xygnal	same for most of those high CPU twistd3 pids	20:19
xygnal	I have NOT rebooted the rack controllers since earlier today	20:19
xygnal	correct. I mean I can reach it in that i can connect TCP to the box, or i can login via SSH and use maas CLI to connect	20:20
xygnal	but both hang the same	20:20
xygnal	no response	20:20
xygnal	MAAS is too busy to answer	20:20
roaksoax	xygnal: so, do this, if possible, stop all your rack controllers for a second	20:20
roaksoax	xygnal: and see if your region controller can be interacted with	20:20
roaksoax	xygnal: then start adding rack by rack controller , but wait for them to fully connect	20:21
roaksoax	before adding a new rack controller	20:21
xygnal	think it's possible they are returning load to the region controller?	20:21
roaksoax	xygnal: i'm thinking that all trying to connect to the region at the same time may be causing issues	20:22
xygnal	ok	20:22
roaksoax	xygnal: i'm running 2 region controllers and 4 rack controllers without issues, but definitely dont have 500 machines	20:22
xygnal	thanks	20:23
xygnal	will test this and if we see a difference we may break off some rack controllers to a new region	20:24
roaksoax	xygnal: so i've uploaded 2.2.3 candidate to ppa:maas/proposed	20:42
roaksoax	xygnal: that has not gone through final qa though	20:43
xygnal	roaksoax we're seeing the problem return after the first rack controller is returned	20:57
xygnal	roaksoax this may be off the wall but i have been thinking on this all day. a lot of our current builders are clients that are using ESXi. all we do for them is write their image down to disk at the end of a generic install and they boot up fine and take over.	20:58
xygnal	could that DD (which would take some time) cause any unwanted waiting from MAAS?	20:59
xygnal	just making sure.	20:59
xygnal	actually it's an SSD so i suppose, not that long	20:59
xygnal	if i check the MAAS twistd3's in lsof i can see the various client machines associated with that connection. I am certain we have multiple users trying to do things.	21:01
xygnal	that is the why. the question is, why is MAAS so busy, when the system... is not?	21:02
xygnal	i still dont see any convincing log messages	21:02
roaksoax	xygnal: i know what the issue is	21:14
roaksoax	err	21:14
roaksoax	nevermind	21:14
roaksoax	wrong window	21:14
xygnal	roaksoax still seeing a ton of 'resource temporarily unavailable' and 'connection timed out' messages from the only active twistd3 right now	21:22
xygnal	not from MAAS itself, but from strace -ffp	21:22
xygnal	mostly futex wait	21:23
xygnal	roaksoax proposed PPA applied, all systems booted, still can't run CLI commands	21:44
xygnal	hangs	21:44
mup	Bug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>	21:57
mup	Bug #1714362 changed: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>	22:00
mup	Bug #1714362 opened: [2.3] Power management balances between controllers <MAAS:New> <https://launchpad.net/bugs/1714362>	22:06
roaksoax	xygnal: even with 1 rack controller connected ?	22:23
roaksoax	xygnal: if that is the case, only 1 rack controller	22:23
xygnal	yes	22:24
xygnal	even with only 1	22:24
roaksoax	xygnal: right, so that could mean some other problem, do you have firewalling in place or similar ?	22:25
xygnal	not entirely open no it's behind company proxy. I noticed some kind of 'snap' connections failing constantly	22:26
xygnal	as well as NTP connections failing sometimes	22:26
xygnal	it has no problems grabbing its images that I can recall	22:27
xygnal	and not aware of any connectivity changes recently	22:29
roaksoax	xygnal: any logs ?	22:29
roaksoax	'snap' connections, that's weird	22:30
roaksoax	xygnal: could be related to networking firewalling between region/racks ?	22:30
roaksoax	xygnal: i mean, if you are simply running the region and no racks and you cant contact the api	22:30
roaksoax	there could be another problem there	22:30
xygnal	no no. one rack.	22:30
xygnal	we put back first rack after shutting all down	22:31
xygnal	did not see any node to node comm errors	22:32
xygnal	outside of our own restarts	22:32
xygnal	can get you logs. all the rpc errors between region and rack in them	22:33
xygnal	correlate to our node restarts	22:33
roaksoax	xygnal: what if you try to reach the cli within the same machine of the region controller ?	22:33
xygnal	we are...	22:33
xygnal	already doing so..	22:34
roaksoax	xygnal: and it works ?	22:37
xygnal	no	22:37
xygnal	hangs	22:37
xygnal	maas is hung up on something and i cannot see it in the	22:38
xygnal	logs	22:38
xygnal	only recent change i am aware of is update to 2.2.2	22:39
xygnal	any regressions?	22:39
roaksoax	xygnal: so doing it from the same region controller works	22:40
roaksoax	xygnal: but doesn't work remotely	22:40
xygnal	no. its not working at all with any re	22:41
xygnal	rack* controller connected	22:42
xygnal	even local to the region controller	22:42
roaksoax	xygnal: try this	22:42
roaksoax	stop maas-rackd	22:42
roaksoax	stop maas-regiond	22:42
roaksoax	ps faux | grep twist	22:42
roaksoax	xygnal: and see if there are any rogue processes	22:42
roaksoax	xygnal: and then, sudo service postgresql stop	22:42
roaksoax	xygnal: and then sudo service postgresql start	22:43
roaksoax	sudo service maas-regiond start	22:43
roaksoax	sudo service maas-rackd start	22:43
xygnal	alright so. both services stopped. two twistd3's still going.	22:50
xygnal	even after pgsql stop and start	22:50
xygnal	oops nope	22:50
xygnal	there they go	22:50
xygnal	ok services back up	22:52
xygnal	maas create command... hung	22:52
xygnal	CPU utilization back up to 100% for twistd3 again	22:52
roaksoax	xygnal: hold on, so when you stopped maas-regiond and maas-rackd there were still twistd3 services running ?	22:53
roaksoax	xygnal: like this	22:53
roaksoax	maas 44484 0.0 0.0 4508 712 ? Ss 16:42 0:00 /bin/sh -c exec twistd3 --nodaemon --pidfile= --logger=provisioningserver.logger.EventLogger maas-regiond 2>&1 | tee -a $LOGFILE	22:53
roaksoax	maas 44492 3.6 10.1 1093456 203008 ? Sl 16:42 4:49 \_ /usr/bin/python3 /usr/bin/twistd3 --nodaemon --pidfile= --logger=provisioningserver.logger.EventLogger maas-regiond	22:54
roaksoax	?>?	22:54
xygnal	yes	22:54
xygnal	like that	22:54
roaksoax	xygnal: so that seems there are rogue processes that are running	22:54
roaksoax	xygnal: so sudo service maas-regiond stop && sudo service maas-rackd stop	22:55
roaksoax	xygnal: ps faux | grep twistd3	22:55
roaksoax	xygnal: and kill those rogue processes	22:55
xygnal	if i stop pgsql	22:57
xygnal	they do die	22:57
xygnal	after a minute or two	22:57
xygnal	but they come right back when i bring services back	22:57
roaksoax	xygnal: right, but I mean, make sure maas-regiond is stopped	22:57
roaksoax	same as maas-rackd	22:57
roaksoax	and check if there are twistd3 processes	22:58
roaksoax	if there are, then that's the problem	22:58
roaksoax	as there shouldn't be	22:58
xygnal	they show stopped in systemctl status	22:58
xygnal	let me recap	22:58
xygnal	stop rackd, stop regiond, i see two twistd3's still going.	22:58
xygnal	stop pgsql	22:58
xygnal	wait a minute	22:58
xygnal	all twistd3's gone	22:58
xygnal	no more 'rogue process' running	22:58
xygnal	but if i start all 3 back up?	22:59
xygnal	problem is back	22:59
roaksoax	xygnal: ok, so stop rackd, stop regiond	22:59
roaksoax	kill -9 rogue processes	22:59
roaksoax	and wait and see if they come up	22:59
roaksoax	if they do, it is worth investigating where those are coming from	23:00
roaksoax	ubuntu@maas:~$ sudo service maas-regiond stop	23:00
roaksoax	ubuntu@maas:~$ sudo service maas-rackd stop	23:00
roaksoax	ubuntu@maas:~$ ps faux | grep twistd3	23:00
roaksoax	ubuntu 10292 0.0 0.0 12944 984 pts/0 S+ 23:00 0:00 \_ grep --color=auto twistd3	23:00
xygnal	alright	23:01
xygnal	instead of stopping pgsql	23:01
xygnal	i just stopped the two sercices	23:01
xygnal	services*	23:01
xygnal	and killed the one twistd3 that didnt die	23:01
xygnal	manually with kill -9	23:01
xygnal	now there are none running	23:01
roaksoax	xygnal: ok so ps faux | grep twistd3 shows no rogue processes running	23:02
roaksoax	xygnal: just double check that	23:02
xygnal	exactly	23:02
xygnal	zero results	23:02
roaksoax	xygnal: ok, so restart maas-regiond, wait a few seconds	23:02
roaksoax	xygnal: and restart maas-rackd	23:03
xygnal	alright	23:04
xygnal	looking good so far	23:04
xygnal	time to run maas create	23:04
xygnal	aaand	23:05
xygnal	cpu pegged 100% as soon as I did the command	23:05
xygnal	still no response	23:05
roaksoax	xygnal: can you share your logs /var/log/maas/*.log	23:06
roaksoax	but that's really really strange	23:06
mup	Bug #1714362 changed: [2.3] Power management balances between controllers on every power check <performance> <MAAS:Invalid> <https://launchpad.net/bugs/1714362>	23:06
xygnal	just the last 30 minutes of them or so?	23:06
roaksoax	xygnal: yeah that should be ok	23:07
xygnal	hm you know. does everything have a timestamp? if i try to grab just by timestamp i am going to potentially flush out logs	23:08
xygnal	I mean, filter out, preventing you from seeing	23:08
roaksoax	xygnal: yeah	23:09
xygnal	well i can do it easy for maas.log and regiond.log but rackd has some stack traces, checking right now.	23:14
xygnal	looks like it threw a permission denied on lost+found... what the...	23:14
xygnal	critical image downloading images failed	23:15
xygnal	where would it be downloading them?	23:15
xygnal	this was working before... looks like something has broken while I was away	23:16
roaksoax	strange	23:17
roaksoax	:/	23:17
xygnal	where do images usually get downloaded to?	23:17
xygnal	I didnt think we had a dedicated MOUNT for it	23:18
roaksoax	xygnal: images.maas.io	23:18
roaksoax	xygnal: oh	23:18
roaksoax	xygnal: to the database	23:18
roaksoax	xygnal: in the region controller	23:18
roaksoax	and then the rack controllers sync those images onto the filesystems	23:18
xygnal	er... but i'm getting errors about a local file, permission denied on lost+found at the end of the stack trace	23:18
xygnal	for that critical error mentioned above	23:18
xygnal	if it goes straight into database... why am I seeing this?	23:19
roaksoax	xygnal: maybe you ran out of space ?	23:19
roaksoax	xygnal: maybe that's the rackd.log ?	23:19
roaksoax	or the rack putting those messages ?	23:19
xygnal	rackd.log on region	23:19
xygnal	rack controllers themselves are services off still from before	23:20
xygnal	its the region controller logs I am looking at	23:20
roaksoax	xygnal: right, that's strange, I've never seen any similar issue	23:20
xygnal	and the rackd.log is whats saying that	23:20
roaksoax	maybe you just ran out of space ?	23:20
roaksoax	xygnal: ah rackd.log	23:20
xygnal	nope, hardly any space used	23:21
xygnal	well 42%	23:21
xygnal	lots of room yet	23:21
xygnal	same with inodes :)	23:21
roaksoax	strange	23:23
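
The "stop the services, then check for leftover twistd3 processes" step discussed above can be sketched as a small shell helper. This is only an illustration under stated assumptions, not a MAAS tool: the function name `find_leftover_twistd3` is hypothetical, and it assumes `ps faux`-style input in which the second column is the PID.

```shell
#!/bin/sh
# Hypothetical helper (not part of MAAS): read `ps faux`-style lines on
# stdin and print the PID (second column) of every line that mentions
# twistd3, skipping the grep/awk commands used to do the search.
find_leftover_twistd3() {
    awk '/twistd3/ && !/grep/ && !/awk/ { print $2 }'
}

# Intended use, following the sequence from the conversation:
#   sudo service maas-regiond stop
#   sudo service maas-rackd stop
#   ps faux | find_leftover_twistd3
# Empty output means no twistd3 process survived the stop; any PID that
# is printed is a candidate for closer inspection before restarting.
```

Printing only the PIDs keeps the check separate from any decision to kill a process, so the list can be reviewed by hand first.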