/srv/irclogs.ubuntu.com/2018/01/23/#maas.txt

mup	Bug #1733900 changed: [2.3final, UI] Machines that have failed testing don't have an error icon <2.3qa> <ui> <MAAS:Expired> <https://launchpad.net/bugs/1733900>	04:19
ejat	anyone here?	16:05
xygnal	roaksoax: around?	17:44
mpontillo	xygnal: I think he's out sick today; anything I can help with?	20:06
xygnal	mpontillo: we have a bug report open for a very impactful issue	20:07
* mpontillo just found the "exploding twisted" bug; not good		20:07
xygnal	on top of the exploding part, it's really slow. it seems to always be loading the nodes list, and it's always sooooo sloooow to load	20:10
mpontillo	xygnal: are you saying this bug did NOT occur prior to upgrading to MAAS 2.3? (that's interesting; I'm not aware of any changes that should have significantly impacted UI scalability between those two releases.)	20:10
xygnal	we added quite a few nodes to the system since the last restart of MAAS, I believe. We also had some crazy behavior in the past after a region restart, so this could be a bug we 'got around' before that is more exposed now.	20:11
xygnal	even after eliminating any swap devices, i am seeing I/O wait times of 32 seconds sometimes	20:12
xygnal	no idea what MAAS is doing in those moments to queue so hard	20:12
xygnal	but it could have something to do with the fact it burns through all the memory in minutes	20:15
mpontillo	xygnal: might be good to get some data on what postgresql is doing. I can imagine it might be worse if you've patched for meltdown/spectre as well... meanwhile, I wonder if you could help us get some test data from your environment?	20:17
xygnal	we've got monitors on the pgsql host and its performance so far is largely idle	20:17
xygnal	yes please, what can I collect?	20:17
mpontillo	xygnal: if you can find out what queries MAAS executes just prior to the crash (like, when you first load the page), that would be helpful. I'm trying to figure out if we can easily have you dump the database minus the OS images, but it's non-trivial, it seems	20:19
xygnal	how can i get those queries logged?	20:20
mpontillo	xygnal: it is possible that we're loading up the websocket connection with a huge amount of results, which causes the OOM situation. it is difficult to turn on logging in pgsql without logging way too much though. I'll look into it...	20:21
xygnal	we don't have direct access to the pgsql box, it's on a box provided as a service. hm. we might have a non-root login.	20:21
xygnal	i was hoping we could dump that kind of info out of the regiond itself	20:21
mpontillo	xygnal: so it might be nice to confirm that it's the act of loading the machines listing itself that causes the OOM situation. here's something you might be able to do https://paste.ubuntu.com/26446310/	20:34
mpontillo	xygnal: you'd replace "mpontillo" in the example code with an admin username in MAAS, and run that after typing "sudo maas-region shell".	20:35
mpontillo	xygnal: that should run the database fetching outside the context of the region server - rather, in the Python shell itself. so if that process dies, that could confirm where the bug is	20:36
mpontillo	xygnal: here's a version you can copy/paste without thinking about it. https://paste.ubuntu.com/26446337/	20:38
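[Editor's sketch] The linked pastes are not reproduced in this log. As a rough, generic illustration of the idea described above (run the fetch in a standalone Python shell and see whether that process alone balloons), something like the following could wrap any callable and report its peak Python-level allocation. `measure_peak` and `fake_fetch` are hypothetical names for illustration and are not MAAS APIs; in a "sudo maas-region shell" session one would presumably pass in whatever fetch the paste performs.

```python
import tracemalloc


def measure_peak(fetch):
    """Run fetch() and return (result, peak_bytes) as seen by tracemalloc."""
    tracemalloc.start()
    try:
        result = fetch()
        # get_traced_memory() returns (current, peak) in bytes.
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def fake_fetch():
    # Hypothetical stand-in for the real database fetch.
    return [{"hostname": "node-%d" % i} for i in range(10000)]


rows, peak = measure_peak(fake_fetch)
print("fetched %d rows, peak %.1f MiB" % (len(rows), peak / 2**20))
```

Note that tracemalloc only counts allocations made through Python's allocator, so it will under-report compared with the resident memory the kernel sees; it is a quick sanity check, not a replacement for watching the process from outside.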
xygnal	its running. still waiting.	20:42
xygnal	cd	20:43
xygnal	oops	20:43
mpontillo	xygnal: the other thing I was wondering: about how many concurrent UI sessions would you say are open? is it just the one?	20:45
xygnal	if i reset the region controller and login as the first user	20:46
xygnal	and go to nodes	20:46
xygnal	it does not freak out	20:46
xygnal	it just runs very slowly	20:46
xygnal	if i try to 'reload' the page	20:46
xygnal	that happens	20:46
xygnal	as if it could not finish its first scan and the second scan called it to freaaaak-out	20:46
xygnal	çç	20:47
gimmic	Can maas set the dns search suffix even if it is not running/authoritative DNS?	20:47
xygnal	curse my lazy bluetooth keyboard switching.	20:48
xygnal	box tanked so hard i can't even SSH in. waiting for it to settle down.	20:50
mpontillo	xygnal: wow, thanks for confirming	20:50
xygnal	it should be handling it more gracefully than that if it's memory, i removed all swap, so it should oom_kill as soon as it hits 24GB	20:50
mpontillo	gimmic: currently no, we use the list of all authoritative domains as the search list, and place the domain the machine is actually in first in the list	20:51
mpontillo	gimmic: as of MAAS 2.3 anyway - I think there was some inconsistent behavior prior to that	20:51
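[Editor's sketch] The ordering mpontillo describes (all authoritative domains, with the machine's own domain moved to the front) can be illustrated as below. This is a minimal sketch, not MAAS code: `build_search_list` is a hypothetical helper, the example domains are made up, and keeping the remaining domains in their input order is an assumption, since the log does not say how the rest are ordered.

```python
def build_search_list(authoritative_domains, machine_domain):
    """Return a DNS search list with the machine's own domain first.

    Assumption: the remaining authoritative domains keep their input order.
    If the machine's domain is not authoritative, the list is returned as-is.
    """
    if machine_domain not in authoritative_domains:
        return list(authoritative_domains)
    rest = [d for d in authoritative_domains if d != machine_domain]
    return [machine_domain] + rest


# Made-up example domains.
domains = ["a.example", "b.example", "c.example"]
search = build_search_list(domains, "b.example")
print("search " + " ".join(search))
```

A per-domain flag like the one floated later in the conversation would amount to filtering `authoritative_domains` before this step.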
gimmic	Okay, we currently come back around with ansible to 'fix' this, but it would be nice to deploy it out of the gate or have an option to deploy it that way	20:52
mpontillo	gimmic: how would you want that to look? a per-domain flag to indicate if the domain should be in the search list?	20:54
gimmic	hmm. Maybe by zone	20:55
gimmic	domain would likely work too	20:55
mpontillo	gimmic: I think they're effectively the same thing to MAAS right now anyway; we derive things like reverse zones, not sure about sub-zones of a domain, but I would think you could model it however it suits you	20:57
xygnal	mpontillo: stopped/started regiond. your command works before I login to UI, your command works after I login to UI, your command continues to work until free memory hits 0	21:00
mpontillo	xygnal: wait, are you saying that more memory is consumed each time you run it in the same python shell?	21:02
xygnal	no no. I mean i don't see the memory issue at all if i restart maas-regiond and don't login to the UI.	21:03
xygnal	even when i am in the UI and it's loading the list of nodes so slowly, that command returns just fine	21:04
xygnal	it's not the UI itself that is slow but the 'loading' of the nodes in the nodes page.	21:04
xygnal	and if you refresh the nodes page	21:04
xygnal	that memory issue kicks up and it is killed soon after	21:04
xygnal	the slowness and the oom condition may not be directly the same issue, just appearing at the same time	21:07
xygnal	trying to get bumped up to 32gb today	21:08
xygnal	just to be sure it actually uses all of that	21:08
mpontillo	xygnal: yeah, it's odd that the refresh itself seems to push it over the edge; I'm guessing maybe it can handle one session with all that data loaded, but when you refresh, the old session doesn't immediately go away.	21:10
xygnal	it's more than a doubling of memory	21:13
xygnal	quit	21:14
mpontillo	xygnal: I'll keep poking at it. one thing I did was open the network inspector in Chrome and look at what the websocket was doing on a large MAAS. I noticed that it seems to make a lot of requests with regard to the device discovery listing. I wonder if it might help for you to look at the same data on your system to see what it's up to	21:21
catbus	wililupy: hi	21:23
mpontillo	xygnal: that is, if you open the javascript console and select the "network" tab, then find the entry for "ws?csrftoken=...", then click the "Frames" tab, you'll be able to see what data the UI is requesting and receiving. that might tell us more about what the UI is so obsessed about that it needs to consume so much memory =)	21:23
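[Editor's sketch] Once the frames have been copied out of the browser's "Frames" tab, a small script can tally which requests dominate. The sketch below assumes each frame is a JSON object and that requests carry a `method` field; that message shape, the method names, and the `summarize_frames` helper are assumptions for illustration, not a documented MAAS format.

```python
import json
from collections import Counter


def summarize_frames(frames):
    """Count frames and total characters per 'method' value."""
    counts = Counter()
    sizes = Counter()
    for raw in frames:
        try:
            msg = json.loads(raw)
        except ValueError:
            key = "<unparsed>"
        else:
            if isinstance(msg, dict):
                key = msg.get("method", "<no method>")
            else:
                key = "<non-object>"
        counts[key] += 1
        sizes[key] += len(raw)  # characters, not bytes on the wire
    return counts, sizes


# Made-up sample frames; real ones would be pasted from the browser.
sample = [
    '{"method": "discovery.list", "request_id": 1}',
    '{"method": "machine.list", "request_id": 2}',
    '{"method": "discovery.list", "request_id": 3}',
    'not json',
]
counts, sizes = summarize_frames(sample)
for method, n in counts.most_common():
    print("%-16s %3d frames %6d chars" % (method, n, sizes[method]))
```

Sorting by `sizes` instead of `counts` would show which responses are largest, which may matter more for the memory question than the raw request count.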
* mpontillo needs to step out for lunch, back later		21:23
wililupy	Hi catbus! How are you doing?	21:23
catbus	wililupy: Hey, I am good. How are you?	21:24
catbus	wililupy: I have some questions about the demo here: https://insights.ubuntu.com/2017/03/01/devops-for-netops/ wonder if you can help clarify.	21:24
wililupy	catbus: Good. I'm glad you are doing well as well.	21:24
wililupy	catbus: I remember that article. I remember doing the demo as well. What are your questions?	21:25
catbus	wililupy: the wedge 100 running MAAS, is it an ONIE-based wedge 100? Is it classic ubuntu running on the switch?	21:26
catbus	wililupy: then what image does MAAS deploy to the wedge 40? assuming wedge 40 is also onie-based?	21:28
wililupy	catbus: Yes, it is the Accton Wedge 100. It was running Ubuntu 16.04 with MAAS installed	21:28
wililupy	catbus: MAAS deployment depended on 2 things for the Wedge. It can either use PXE and install Ubuntu Classic on the switch, or we could use ONIE to install an ONIE image that is hosted on the MAAS ToR Switch.	21:29
wililupy	catbus: If we wanted to use ONIE, we would disable PXE boot on the Wedge; if we wanted to deploy and manage the switch from MAAS, we would enable PXE on the Wedge and deploy it just like we would a server.	21:30
catbus	wililupy: ok, in the demo, wedge 40 uses pxe, and it was classic ubuntu running on top of it deployed by maas, do I read it right?	21:33
wililupy	catbus: The demo we did at OCP last year was slightly different in that when MAAS enlisted a node, it would detect it was a switch, and then when we commissioned it would deploy Ubuntu 16.04 and then deploy the SONIC Snap automatically and build the required Kernel Modules needed for the ASIC to function.	21:33
wililupy	catbus yes ma'am.	21:33
catbus	wililupy: how does it deploy the SONIC snap automatically?	21:34
catbus	node-specific preseed?	21:34
catbus	how does maas know it's a switch..?	21:34
wililupy	catbus: it was for our demo so we had a custom preseed and custom image and some other customizations with MAAS to get this to work.	21:36
catbus	ok	21:36
wililupy	catbus: basically when enlisting the node, it would detect the ASIC during the lspci and the dmidecode, and then MAAS would tag the device as a switch. That is actually stock now in MAAS 2.3	21:37
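[Editor's sketch] The detection wililupy describes (scan the lspci and dmidecode output gathered at enlistment, and tag the node if a switch ASIC shows up) might look roughly like this. The actual rules MAAS uses are not given in this log, so the pattern list, the sample output, and the helper names below are all hypothetical.

```python
import re

# Hypothetical patterns; the real matching rules are not stated in the log.
SWITCH_ASIC_PATTERNS = [re.compile(r"switch asic", re.IGNORECASE)]


def looks_like_switch(lspci_output, dmidecode_output,
                      patterns=SWITCH_ASIC_PATTERNS):
    """Return True if any pattern matches either command's output."""
    text = lspci_output + "\n" + dmidecode_output
    return any(p.search(text) for p in patterns)


def tags_for(lspci_output, dmidecode_output):
    """Return the list of tags to apply; only 'switch' is modeled here."""
    if looks_like_switch(lspci_output, dmidecode_output):
        return ["switch"]
    return []


# Made-up command output for illustration.
lspci = "01:00.0 Ethernet controller: Example Corp Switch ASIC (rev 01)"
dmi = "Product Name: Example Wedge"
print(tags_for(lspci, dmi))
```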
