[04:19] Bug #1733900 changed: [2.3final, UI] Machines that have failed testing don't have an error icon <2.3qa>
[16:05] anyone here?
[17:44] roaksoax: around?
[20:06] xygnal: I think he's out sick today; anything I can help with?
[20:07] mpontillo: we have a bug report open for a very impactful issue
[20:07] * mpontillo just found the "exploding twisted" bug; not good
[20:10] on top of the exploding part, it's really slow. it seems to always be loading the nodes list, and it's always sooooo sloooow to load
[20:10] xygnal: are you saying this bug did NOT occur prior to upgrading to MAAS 2.3? (that's interesting; I'm not aware of any changes that should have significantly impacted UI scalability between those two releases.)
[20:11] we added quite a few nodes to the system since the last restart of MAAS, I believe. We also had some crazy behavior in the past after a region restart, so this could be a bug we 'got around' before that is more exposed now.
[20:12] even after eliminating any swap devices, I am seeing I/O wait times of 32 seconds sometimes
[20:12] no idea what MAAS is doing in those moments to queue so hard
[20:15] but it could have something to do with the fact it burns through all the memory in minutes
[20:17] xygnal: might be good to get some data on what postgresql is doing. I can imagine it might be worse if you've patched for meltdown/spectre as well... meanwhile, I wonder if you could help us get some test data from your environment?
[20:17] we've got monitors on the pgsql host and its performance so far is largely idle
[20:17] yes please, what can I collect?
[20:19] xygnal: if you can find out what queries MAAS executes just prior to the crash (like, when you first load the page), that would be helpful. I'm trying to figure out if we can easily have you dump the database minus the OS images, but it seems non-trivial
[20:20] how can I get those queries logged?
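[For reference: the query logging asked about above can be enabled with PostgreSQL's standard statement-logging settings, assuming config access on the database host (which the users note they may not have). A minimal postgresql.conf fragment — tune or revert after capturing, since full statement logging is very verbose:]

```
# postgresql.conf — log every statement (very verbose; revert after capturing)
log_statement = 'all'

# ...or log only slow queries, e.g. anything taking over 500 ms:
# log_min_duration_statement = 500

# timestamp, pid, user@database in each log line
log_line_prefix = '%m [%p] %u@%d '
```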
[20:21] xygnal: it is possible that we're loading up the websocket connection with a huge amount of results, which causes the OOM situation. it is difficult to turn on logging in pgsql without logging way too much, though. I'll look into it...
[20:21] we don't have direct access to the pgsql box; it's on a box provided as a service. hm. we might have a non-root login.
[20:21] I was hoping we could dump that kind of info out of regiond itself
[20:34] xygnal: so it might be nice to confirm that it's the act of loading the machines listing itself that causes the OOM situation. here's something you might be able to do https://paste.ubuntu.com/26446310/
[20:35] xygnal: you'd replace "mpontillo" in the example code with an admin username in MAAS, and run that after typing "sudo maas-region shell".
[20:36] xygnal: that should run the database fetching outside the context of the region server - rather, in the Python shell itself. so if that process dies, that could confirm where the bug is
[20:38] xygnal: here's a version you can copy/paste without thinking about it. https://paste.ubuntu.com/26446337/
[20:42] it's running. still waiting.
[20:43] cd
[20:43] oops
[20:45] xygnal: the other thing I was wondering: about how many concurrent UI sessions would you say are open? is it just the one?
[20:46] if I reset the region controller and log in as the first user
[20:46] and go to nodes
[20:46] it does not freak out
[20:46] it just runs very slowly
[20:46] if I try to 'reload' the page
[20:46] that happens
[20:46] as if it could not finish its first scan and the second scan caused it to freaaaak-out
[20:47] รงรง
[20:47] Can maas set the dns search suffix even if it is not running/authoritative DNS?
[20:48] curse my lazy bluetooth keyboard switching.
[20:50] box tanked so hard I can't even SSH in. waiting for it to settle down.
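[The paste.ubuntu.com links above have since expired. The real snippet ran inside `sudo maas-region shell` against MAAS's Django models; the self-contained sketch below only illustrates the measurement pattern, with a stand-in function where the real MAAS query would go:]

```python
import resource
import time

def fetch_all_machines():
    """Stand-in for the real fetch. Inside `sudo maas-region shell` this
    would be something like list(Machine.objects.all()) plus whatever
    per-machine data the websocket handler serializes."""
    return [{"hostname": "node%d" % i, "status": "Ready"} for i in range(100000)]

start = time.monotonic()
machines = fetch_all_machines()
elapsed = time.monotonic() - start

# ru_maxrss is reported in kilobytes on Linux
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("fetched %d machines in %.2fs, peak RSS ~%d kB"
      % (len(machines), elapsed, peak_kb))
```

[If this dies with the same OOM outside the region server, the problem is in the query itself; if it returns quickly, the blow-up is in regiond's websocket serialization layer — which is what the later messages end up suggesting.]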
[20:50] xygnal: wow, thanks for confirming
[20:50] it should be handling it more gracefully than that if it's memory. I removed all swap, so it should oom_kill as soon as it hits 24GB
[20:51] gimmic: currently no, we use the list of all authoritative domains as the search list, and place the domain the machine is actually in first in the list
[20:51] gimmic: as of MAAS 2.3 anyway - I think there was some inconsistent behavior prior to that
[20:52] Okay, we currently come back around with ansible to 'fix' this, but it would be nice to deploy it out of the gate or have an option to deploy it that way
[20:54] gimmic: how would you want that to look? a per-domain flag to indicate if the domain should be in the search list?
[20:55] hmm. Maybe by zone
[20:55] domain would likely work too
[20:57] gimmic: I think they're effectively the same thing to MAAS right now anyway; we derive things like reverse zones. not sure about sub-zones of a domain, but I would think you could model it however it suits you
[21:00] mpontillo: stopped/started regiond. your command works before I log in to the UI, your command works after I log in to the UI, your command continues to work until free memory hits 0
[21:02] xygnal: wait, are you saying that more memory is consumed each time you run it in the same python shell?
[21:03] no no. I mean I don't see the memory issue at all if I restart maas-regiond and don't log in to the UI.
[21:04] even when I am in the UI and it's loading the list of nodes so slowly, that command returns just fine
[21:04] it's not the UI itself that is slow but the 'loading' of the nodes in the nodes page.
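[The search-list behavior described above — all authoritative domains, with the machine's own domain placed first — is simple to state in code. A hypothetical sketch of that rule, not MAAS's actual implementation:]

```python
def build_dns_search_list(machine_domain, authoritative_domains):
    """Put the machine's own domain first, then the remaining
    authoritative domains in their original order."""
    rest = [d for d in authoritative_domains if d != machine_domain]
    return [machine_domain] + rest

print(build_dns_search_list(
    "rack1.example.com",
    ["example.com", "rack1.example.com", "lab.example.com"]))
# → ['rack1.example.com', 'example.com', 'lab.example.com']
```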
[21:04] and if you refresh the nodes page
[21:04] that memory issue kicks up and it is soon after killed
[21:07] the slowness and the oom condition may not be directly the same issue, just appearing at the same time
[21:08] trying to get bumped up to 32gb today
[21:08] just to be sure it actually uses all of that
[21:10] xygnal: yeah, it's odd that the refresh itself seems to push it over the edge; I'm guessing maybe it can handle one session with all that data loaded, but when you refresh, the old session doesn't immediately go away.
[21:13] it's more than a double in memory jump
[21:14] quit
[21:21] xygnal: I'll keep poking at it. one thing I did was open the network inspector in Chrome and look at what the websocket was doing on a large MAAS. I noticed that it seems to make a lot of requests with regard to the device discovery listing. I wonder if it might help for you to look at the same data on your system to see what it's up to
[21:23] wililupy: hi
[21:23] xygnal: that is, if you open the javascript console and select the "network" tab, then find the entry for "ws?csrftoken=...", then click the "Frames" tab, you'll be able to see what data the UI is requesting and receiving. that might tell us more about what the UI is so obsessed about that it needs to consume so much memory =)
[21:23] * mpontillo needs to step out for lunch, back later
[21:23] Hi catbus! How are you doing?
[21:24] wililupy: Hey, I am good. How are you?
[21:24] wililupy: I have some questions about the demo here: https://insights.ubuntu.com/2017/03/01/devops-for-netops/ wonder if you can help clarify.
[21:24] catbus: Good. I'm glad you are doing well as well.
[21:25] catbus: I remember that article. I remember doing the demo as well. What are your questions?
[21:26] wililupy: the wedge 100 running MAAS, is it an ONIE-based wedge 100? Is it classic ubuntu running on the switch?
[21:28] wililupy: then what image does MAAS deploy to the wedge 40? assuming wedge 40 is also onie-based?
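[To put numbers on the "more than double" memory jump during a refresh, regiond's resident set size can be sampled from /proc while reproducing. This is a generic Linux sketch, not a MAAS tool; the demo samples its own process, and for the real investigation you would substitute regiond's PID (e.g. from `pgrep -f regiond`) and a longer interval:]

```python
import os
import time

def rss_kb(pid):
    """Read a process's resident set size (VmRSS, in kB) from /proc (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

# Demo: sample this process a few times while it runs.
for i in range(3):
    print("sample %d: VmRSS=%d kB" % (i, rss_kb(os.getpid())))
    time.sleep(0.1)
```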
[21:28] catbus: Yes, it is the Accton Wedge 100. It was running Ubuntu 16.04 with MAAS installed
[21:29] catbus: MAAS deployment depended on 2 things for the Wedge. It can either use PXE and install Ubuntu Classic on the switch, or we could use ONIE to install an ONIE image that is hosted on the MAAS ToR switch.
[21:30] catbus: If we wanted to use ONIE, we disabled PXE boot on the Wedge; if we wanted to deploy and manage the switch from MAAS, we enabled PXE on the Wedge and deployed it just like we would a server.
[21:33] wililupy: ok, in the demo, wedge 40 uses pxe, and it was classic ubuntu running on top of it deployed by maas, do I read it right?
[21:33] catbus: The demo we did at OCP last year was slightly different in that when MAAS enlisted a node, it would detect it was a switch, and then when we commissioned it, it would deploy Ubuntu 16.04 and then deploy the SONiC snap automatically and build the required kernel modules needed for the ASIC to function.
[21:33] catbus: yes ma'am.
[21:34] wililupy: how does it deploy the SONiC snap automatically?
[21:34] node-specific preseed?
[21:34] how does maas know it's a switch..?
[21:36] catbus: it was for our demo, so we had a custom preseed and custom image and some other customizations with MAAS to get this to work.
[21:36] ok
[21:37] catbus: basically when enlisting the node, it would detect the ASIC during lspci and dmidecode, and then MAAS would tag the device as a switch. That is actually stock now in MAAS 2.3
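[The switch detection described above — spotting the ASIC in lspci/dmidecode output during enlistment — can be sketched as a scan for known switch-ASIC identifiers. The vendor strings below are illustrative assumptions; MAAS's real match list is almost certainly different:]

```python
# Hypothetical markers; MAAS's actual detection logic and match list may differ.
SWITCH_ASIC_MARKERS = ("Broadcom Trident", "Broadcom Tomahawk", "Barefoot Tofino")

def looks_like_switch(lspci_output):
    """Return True if any line of `lspci` output mentions a known switch ASIC."""
    return any(marker in line
               for line in lspci_output.splitlines()
               for marker in SWITCH_ASIC_MARKERS)

sample = "01:00.0 Ethernet controller: Broadcom Trident II BCM56850\n"
print(looks_like_switch(sample))   # → True
```

[A node that matches would then get a "switch" tag, which downstream deployment steps (like the demo's custom preseed installing the SONiC snap) can key off.]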