mupBug #1733900 changed: [2.3final, UI] Machines that have failed testing don't have an error icon <2.3qa> <ui> <MAAS:Expired> <https://launchpad.net/bugs/1733900>04:19
ejatanyone here?16:05
xygnalroaksoax: around?17:44
mpontilloxygnal: I think he's out sick today; anything I can help with?20:06
xygnalmpontillo: we have a bug report open for a very impactful issue20:07
* mpontillo just found the "exploding twisted" bug; not good20:07
xygnalon top of the exploding part, it's really slow.  it seems to always be loading the nodes list, and it's always sooooo sloooow to load20:10
mpontilloxygnal: are you saying this bug did NOT occur prior to upgrading to MAAS 2.3? (that's interesting; I'm not aware of any changes that should have significantly impacted UI scalability between those two releases.)20:10
xygnalwe added quite a few nodes to the system since the last restart of MAAS, I believe.  We also had some crazy behavior in the past after a region restart, so this could be a bug we 'got around' before that is more exposed now.20:11
xygnaleven after eliminating any swap devices, I am seeing I/O wait times of 32 seconds sometimes20:12
xygnalno idea what MAAS is doing in those moments to queue so hard20:12
xygnalbut it could have something to do with the fact it burns through all the memory in minutes20:15
mpontilloxygnal: might be good to get some data on what postgresql is doing. I can imagine it might be worse if you've patched for meltdown/spectre as well... meanwhile, I wonder if you could help us get some test data from your environment?20:17
xygnalwe've got monitors on the pgsql host and its performance so far is largely idle20:17
xygnalyes please, what can I collect?20:17
mpontilloxygnal: if you can find out what queries MAAS executes just prior to the crash (like, when you first load the page), that would be helpful. I'm trying to figure out if we can easily have you dump the database minus the OS images, but it's non-trivial, it seems20:19
xygnalhow can i get those queries logged?20:20
mpontilloxygnal: it is possible that we're loading up the websocket connection with a huge amount of results, which causes the OOM situation. it is difficult to turn on logging in pgsql without logging way too much though. I'll look into it...20:21
xygnalwe don't have direct access to the pgsql box, it's on a box provided as a service.  hm. we might have a non-root login.20:21
xygnali was hoping we could dump that kind of info out of the regiond itself20:21
mpontilloxygnal: so it might be nice to confirm that it's the act of loading the machines listing itself that causes the OOM situation. here's something you might be able to do https://paste.ubuntu.com/26446310/20:34
mpontilloxygnal: you'd replace "mpontillo" in the example code with an admin username in MAAS, and run that after typing "sudo maas-region shell".20:35
mpontilloxygnal: that should run the database fetching outside the context of the region server - rather, in the Python shell itself. so if that process dies, that could confirm where the bug is20:36
mpontilloxygnal: here's a version you can copy/paste without thinking about it. https://paste.ubuntu.com/26446337/20:38
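The idea behind running the fetch in a standalone Python shell can be sketched as follows. This is not the content of the pastes above, which is not reproduced here. It is a minimal sketch, assuming a hypothetical `fake_fetch` stand-in for whatever query the shell session actually runs, and it uses the standard `tracemalloc` module, which only counts allocations made by Python itself.

```python
import tracemalloc


def measure_peak(fetch):
    """Run fetch() and return (result, peak_bytes) as traced by tracemalloc."""
    tracemalloc.start()
    try:
        result = fetch()
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def fake_fetch():
    # Hypothetical stand-in for the real machine listing query.
    return [{"hostname": "node-%d" % i} for i in range(10000)]


rows, peak = measure_peak(fake_fetch)
print(len(rows), "rows fetched, peak traced memory:", peak, "bytes")
```

If the fetch alone drives memory up in the shell, the problem is likely in the query path; if it stays modest, the region server or the UI session handling is the more likely suspect.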
xygnalits running. still waiting.20:42
mpontilloxygnal: the other thing I was wondering: about how many concurrent UI sessions would you say are open? is it just the one?20:45
xygnalif i reset the region controller and login as the first user20:46
xygnaland go to nodes20:46
xygnalit does not freak out20:46
xygnalit just runs very slowly20:46
xygnalif i try to 'reload' the page20:46
xygnalthat happens20:46
xygnalas if it could not finish its first scan and the second scan caused it to freaaaak-out20:46
gimmicCan maas set the dns search suffix even if it is not running/authoritative DNS?20:47
xygnalcurse my lazy bluetooth keyboard switching.20:48
xygnalbox tanked so hard i can't even SSH in.  waiting for it to settle down.20:50
mpontilloxygnal: wow, thanks for confirming20:50
xygnal it should be handling it more gracefully than that if it's memory, i removed all swap, so it should oom_kill as soon as it hits 24GB20:50
mpontillogimmic: currently no, we use the list of all authoritative domains as the search list, and place the domain the machine is actually in first in the list20:51
mpontillogimmic: as of MAAS 2.3 anyway - I think there was some inconsistent behavior prior to that20:51
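The search list ordering described above (all authoritative domains, with the machine's own domain first) can be illustrated with a short sketch. The function name `build_search_list` and the example domain names are hypothetical and are not MAAS code; the sketch also assumes the machine's domain is one of the authoritative domains.

```python
def build_search_list(machine_domain, authoritative_domains):
    """Return the search list with the machine's own domain placed first."""
    ordered = [machine_domain]
    for domain in authoritative_domains:
        # Skip the machine's domain (already first) and any duplicates.
        if domain not in ordered:
            ordered.append(domain)
    return ordered


search = build_search_list(
    "lab.example",
    ["example.com", "lab.example", "corp.example"],
)
print(" ".join(search))
```

A per-domain flag like the one discussed below would amount to filtering `authoritative_domains` before this ordering step.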
gimmicOkay, we currently come back around with ansible to 'fix' this, but it would be nice to deploy it out of the gate or have an option to deploy it that way20:52
mpontillogimmic: how would you want that to look? a per-domain flag to indicate if the domain should be in the search list?20:54
gimmichmm. Maybe by zone20:55
gimmicdomain would likely work too20:55
mpontillogimmic: I think they're effectively the same thing to MAAS right now anyway; we derive things like reverse zones, not sure about sub-zones of a domain, but I would think you could model it however it suits you20:57
xygnalmpontillo:  stopped/started regiond.  your command works before I login to UI, your command works after I login to UI, your command continues to work until free memory hits 021:00
mpontilloxygnal: wait, are you saying that more memory is consumed each time you run it in the same python shell?21:02
xygnal no no.  I mean i don't see the memory issue at all if i restart maas-regiond and don't login to the UI.21:03
xygnaleven when i am in the UI and it's loading the list of nodes so slowly, that command returns just fine21:04
xygnalit's not the UI itself that is slow but the 'loading' of the nodes in the nodes page.21:04
xygnaland if you refresh the nodes page21:04
xygnalthat memory issue kicks up and it is killed soon after21:04
xygnalthe slowness and the oom condition may not be directly the same issue, just appearing at the same time21:07
xygnaltrying to get bumped up to 32gb today21:08
xygnaljust to be sure it actually uses all of that21:08
mpontilloxygnal: yeah, it's odd that the refresh itself seems to push it over the edge; I'm guessing maybe it can handle one session with all that data loaded, but when you refresh, the old session doesn't immediately go away.21:10
xygnalit's more than a doubling in memory21:13
mpontilloxygnal: I'll keep poking at it. one thing I did was open the network inspector in Chrome and look at what the websocket was doing on a large MAAS. I noticed that it seems to make a lot of requests with regard to the device discovery listing. I wonder if it might help for you to look at the same data on your system to see what it's up to21:21
catbuswililupy: hi21:23
mpontilloxygnal: that is, if you open the javascript console and select the "network" tab, then find the entry for "ws?csrftoken=...", then click the "Frames" tab, you'll be able to see what data the UI is requesting and receiving. that might tell us more about what the UI is so obsessed about that it needs to consume so much memory =)21:23
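Once the frames are copied out of the browser, counting them by request type makes it easier to see what the UI asks for most often. The following is a minimal sketch, assuming each frame is a JSON object with a `method` key; that frame shape and the method names in the sample data are assumptions for illustration, not a documented MAAS websocket format.

```python
import json
from collections import Counter


def tally_methods(frames):
    """Count frames by their 'method' field; unparseable frames are bucketed."""
    counts = Counter()
    for raw in frames:
        try:
            msg = json.loads(raw)
        except ValueError:
            counts["<unparseable>"] += 1
            continue
        if isinstance(msg, dict):
            counts[msg.get("method", "<no method>")] += 1
        else:
            counts["<no method>"] += 1
    return counts


# Hypothetical sample frames, pasted from the "Frames" tab.
sample = [
    '{"method": "machine.list", "request_id": 1}',
    '{"method": "discovery.list", "request_id": 2}',
    '{"method": "discovery.list", "request_id": 3}',
    'not json',
]
for method, count in tally_methods(sample).most_common():
    print(count, method)
```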
* mpontillo needs to step out for lunch, back later21:23
wililupyHi catbus! How are you doing?21:23
catbuswililupy: Hey, I am good. How are you?21:24
catbuswililupy: I have some questions about the demo here: https://insights.ubuntu.com/2017/03/01/devops-for-netops/ wonder if you can help clarify.21:24
wililupycatbus: Good. I'm glad you are doing well as well.21:24
wililupycatbus: I remember that article. I remember doing the demo as well. What are your questions?21:25
catbuswililupy: the wedge 100 running MAAS, is it an ONIE-based wedge 100? Is it classic ubuntu running on the switch?21:26
catbuswililupy: then what image does MAAS deploy to the wedge 40? assuming wedge 40 is also onie-based?21:28
wililupycatbus: Yes, it is the Accton Wedge 100. It was running Ubuntu 16.04 with MAAS installed21:28
wililupycatbus: MAAS deployment depended on 2 things for the Wedge. It can either use PXE and install Ubuntu Classic on the switch, or we could use ONIE to install an ONIE image that is hosted on the MAAS ToR Switch.21:29
wililupycatbus: If we wanted to use ONIE, we disabled PXE boot on the Wedge; if we wanted to deploy and manage the switch from MAAS, we enabled PXE on the Wedge and deployed it just like we would a server.21:30
catbuswililupy: ok, in the demo, wedge 40 uses pxe, and it was classic ubuntu running on top of it deployed by maas, do I read it right?21:33
wililupycatbus: The demo we did at OCP last year was slightly different in that when MAAS enlisted a node, it would detect it was a switch, and then when we commissioned it would deploy Ubuntu 16.04 and then deploy the SONIC Snap automatically and build the required Kernel Modules needed for the ASIC to function.21:33
wililupycatbus yes ma'am.21:33
catbuswililupy: how does it deploy the SONIC snap automatically?21:34
catbusnode-specific preseed?21:34
catbushow does maas know it's a switch..?21:34
wililupycatbus: it was for our demo so we had a custom preseed and custom image and some other customizations with MAAS to get this to work.21:36
wililupycatbus: basically, when enlisting the node, it would detect the ASIC during the lspci and the dmidecode, and then MAAS would tag the device as a switch. That is actually stock now in MAAS 2.321:37
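The detection step described above (look for the switch ASIC in hardware listings, then tag the node) could look roughly like this. It is only a sketch and not how MAAS implements it: the marker string, the sample `lspci` text, and the function name `detect_tags` are all hypothetical.

```python
def detect_tags(lspci_output, markers=("Example Switch ASIC",)):
    """Return ['switch'] if any marker appears in the lspci text, else []."""
    tags = set()
    for line in lspci_output.splitlines():
        lowered = line.lower()
        # Case-insensitive substring match against each known marker.
        if any(marker.lower() in lowered for marker in markers):
            tags.add("switch")
    return sorted(tags)


# Hypothetical lspci output captured during enlistment.
sample_lspci = (
    "00:00.0 Host bridge: Example Corp Host Bridge\n"
    "01:00.0 Ethernet controller: Example Switch ASIC rev 01\n"
)
print(detect_tags(sample_lspci))
```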
