/srv/irclogs.ubuntu.com/2017/03/30/#juju-gui.txt

03:04 <bdx> rick_h: yeah .. the best way I can describe what I'm experiencing is "controller lag"
03:04 <bdx> I'm experiencing it now
03:04 <bdx> us-east-1
03:05 <bdx> my model has been sitting like this for 5 minutes http://paste.ubuntu.com/24278678/
03:07 <bdx> I'm wondering if this controller/model lag has anything to do with the fact that I've deployed so many machines to this model, and so many revs of the same charms?
03:10 <bdx> the prm-worker-prod instances are c3.xlarges .... they usually just blast right through getting all the deps installed
03:11 <bdx> so, this is why I think the lag I'm experiencing has to do with the long/large history of my model
03:11 <bdx> ^^ because if/when I create a new model and deploy everything the exact same way, it works without a hitch
03:12 <bdx> but the more deploys I send off on this model, the slower it gets
03:12 <bdx> I know it sounds odd
03:13 <bdx> but now that I'm thinking about it, I've experienced this behavior before, when I've had loaded models with a long history
03:17 <bdx> beta-controller-peeps: ^^
03:18 <bdx> here's the model uuid -> 38efd3e5-70da-42a3-8d0c-84d4cc0c5835
03:19 <bdx> not sure if that will give you any insight, but it seems like that model is crashing or something
03:19 <bdx> that's the best way I can describe it
06:16 <urulama> bdx: that region was fixed now, let me know if it's still sluggish. the number of machines and units shouldn't be an issue at all (as we're not talking thousands :D)
06:17 <urulama> bdx: are agents in a ready state?
=== frankban|afk is now known as frankban
17:00 <bdx> urulama: it was complete choas
17:00 <bdx> urulama: creating a new model was the fix
17:01 <bdx> urulama: every deploy was getting slower and slower, and each one more unpredictable as far as what state things would end up in
17:04 <bdx> urulama: @anastasiamac posted a list of bugs that are fixed in 2.2; to me it looks like what I was experiencing might be fixed in 2.2 then
17:04 <bdx> urulama: bug #1671258, bug #1651291, bug #1634328, bug #1587644, bug #1581069, bug #1649719
17:04 <bdx> *chaos
17:08 <bdx> urulama: to be clear, I can take the same bundle that was relentlessly failing and sluggish, and deploy it in a new model and see things work correctly, and in a fraction of the time
=== frankban is now known as frankban|afk
17:16 <urulama> bdx: hey, sorry, out for the day ... you still see slow models even now?
17:17 <urulama> bdx: the update to 2.2 will get to production as soon as it is final, sometime end of April
17:17 <urulama> bdx: so, we need to resolve what's going on with your model
17:17 <urulama> bdx: and agents are all "alive"?
17:18 <bdx> urulama: I destroyed it
17:18 <bdx> I needed the new model to have the same name
17:19 <bdx> urulama: I've actually hit this a few times .... I had previously destroyed the affected models, started on a clean slate, and had the same results
17:20 <bdx> urulama: let me see if I can recreate this for you
17:20 <urulama> bdx: hm, controllers are fully instrumented and they're not showing any high CPU load or mem issues
17:20 <urulama> bdx: so, could be something "interesting"
17:21 <bdx> urulama: that's what rick_h was saying last night when the issue was at its worst
17:21 <bdx> urulama: that the controllers looked fine ... he couldn't identify any resource contention or high load
17:23 <bdx> urulama: I'll be afk till Monday after today ... possibly I should try to recreate this and ping you on Monday
17:23 <urulama> bdx: perfect, i'm out tomorrow as well :)
17:23 <urulama> bdx: but yes, let's get to the root of this
17:25 <urulama> bdx: maybe a good test would be trying on GCE ... just to eliminate the provider as an issue
17:25 <bdx> urulama: yeah ... I don't have GCE creds though
17:28 <urulama> bdx: ok, worst case, i'll create a model on GCE and share it with you so that you can deploy to it
17:29 <bdx> urulama: if you want to go ahead and do that, I'll start sending deploys while I'm at my desk today
17:31 <bdx> oh ... you're out today ... well, just ping me whenever you get back to it then, or I'll get in touch on Monday
17:36 <urulama> bdx: you should see the model if you go to your profile page
17:37 <bdx> urulama: nice, I see it
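
Sharing a hosted model like this is normally done with juju grant. A minimal sketch of the sequence, assuming the GCE cloud is registered under the name "google" and the model is called "test-gce" (the model name and access level are illustrative, not taken from the log):

    # owner side: create a model on GCE and give another user write access
    juju add-model test-gce google
    juju grant bdx write test-gce

The grantee then sees the shared model listed under `juju models` (or on the JAAS profile page, as mentioned above) and can switch to it and deploy.
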
17:58 <urulama> bdx: i hope i have enough vCPUs free to be able to deploy all that :D
18:12 <bdx> urulama: well ...
18:13 <bdx> urulama: I think I broke it already ... not sure it's the same issue I was experiencing last night, but things seem "stuck" to say the least
18:13 <urulama> bdx: it doesn't look good ... don't think it'll even finish
18:13 <bdx> yeah ...
18:14 <urulama> bdx: lemme poke at the controllers
18:14 <bdx> k
18:17 <urulama> bdx: is this a bundle?
18:18 <urulama> bdx: something that i can poke at outside of jaas, or even on 2.2-tip?
18:20 <bdx> urulama: I just deployed canonical-kubernetes + ubuntu x10 + my personal bundle
18:20 <bdx> urulama: if I add you to my launchpad team, you could deploy it
18:21 <urulama> bdx: ok, np, we'll build something equivalent to test
18:21 <bdx> urulama: its prm-web and worker are just rails apps
18:21 <bdx> urulama: each with relations to redis, postgres, and haproxy
18:21 <urulama> kk
18:21 <bdx> nothing special
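
For readers without access to the bundle, the topology bdx describes could be approximated with plain juju commands. This is only a sketch: the prm-web/prm-worker charm paths, the cs:redis charm name, and the relation endpoints are assumptions for illustration, not his actual bundle:

    juju deploy cs:haproxy
    juju deploy cs:postgresql
    juju deploy cs:redis          # hypothetical store name for a redis charm
    juju deploy ./prm-web         # hypothetical local rails charm
    juju deploy ./prm-worker      # hypothetical local rails charm
    # each rails app related to redis, postgres, and haproxy as described
    juju add-relation prm-web postgresql:db
    juju add-relation prm-web redis
    juju add-relation prm-web haproxy
    juju add-relation prm-worker postgresql:db
    juju add-relation prm-worker redis
    juju add-relation prm-worker haproxy
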
18:23 <bdx> urulama: yeah ... so we could do it more incrementally .... that doesn't explain why the machines are stuck though
18:24 <bdx> I'll remove all the apps from the model for now
18:25 <urulama> bdx: that's what i was thinking to test ... is a bulk of api calls too much to handle, and should it be an incremental deploy
18:26 <rick_h> bdx: yea, the native juju deploy bundle work has been known to choke on some large bundles like big openstack deploys. I wonder if chunking it will help, or using the deployer vs juju deploy.
18:26 <rick_h> bdx: the deployer has some built-in timing/retry stuff that juju doesn't do (juju should be updated to just work better as a bugfix tbh)
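
The "chunking" rick_h mentions would amount to something like the sketch below: deploy a core bundle first, then grow the bulk units in small batches rather than in one giant request. The bundle path, batch size, and sleep length here are arbitrary assumptions, not anything specified in the channel:

    juju deploy ./core-bundle.yaml      # hypothetical trimmed-down bundle
    juju deploy ubuntu
    # add the filler units 10 at a time, letting the controller settle in between
    for i in 1 2 3 4; do
        juju add-unit ubuntu -n 10
        sleep 300
        juju status
    done
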
18:27 <bdx> rick_h: ahhh, that doesn't explain this http://paste.ubuntu.com/24282721/
18:27 <rick_h> bdx: no, but just a heads up. It's a known issue with the current bundle deploy w/o JAAS.
18:27 <bdx> but yeah ... good things to keep in mind ... I mean, my bundle only has 6 applications
18:27 <bdx> rick_h: I see
18:27 <rick_h> 84 machines? wow
18:28 <bdx> yeah .. but none of them leave my model
18:28 <bdx> *the model
18:28 <bdx> the first 40 deployed fine via the bundle
18:28 <bdx> then I killed them off, tried to redeploy, and "cCRasHHH"
18:29 <bdx> but they didn't really die completely, from juju's perspective at least
18:29 <bdx> they are gone!
18:30 <bdx> it just took 10 mins
18:30 <bdx> unless one of you did something to clear it out
18:31 <urulama> nope, didn't touch it, but the model is empty now
18:31 <rick_h> bdx: no, but since juju does serialize things, I'm not surprised.
18:34 <bdx> seems to be working fine now
18:35 <urulama> bdx: one machine?
18:35 <bdx> yeah
18:35 <urulama> bdx: yeah, i think it's the amount of concurrent requests ... we'll do some scale testing next week and will keep you posted
18:37 <bdx> ok
18:39 <urulama> bdx: so, i'm gonna deploy canonical-kubernetes there, then add 10x ubuntu, then add more, but with delays. just to check
=== dames is now known as thedac
18:40 <urulama> bdx: ah, you're already doing it :)
18:41 <bdx> urulama: ok, yeah ... I figured I should just iterate on bringing CDK up and down a few times in the model to ramp up the history
18:41 <bdx> do what you will
18:44 <rick_h> urulama: might be something for someone while I'm away: update the stress tool to be able to deploy to the same model over and over vs new models.
18:46 <urulama> rick_h: i don't think it's deploy/destroy, it's just "deploy N services" concurrently
18:47 <rick_h> urulama: gotcha, ok. I wasn't sure if you were trying to increase the size of the history of events/etc.
18:49 <bdx> rick_h, urulama: I had upgraded the charms on my model many times ... I'm wondering how much revision history comes into play here too
18:50 <rick_h> bdx: that used to just cause issues with disk space over time, but that's been corrected before/during juju2, so I don't know of any current issues there
18:50 <bdx> ok
18:52 <bdx> so, I only had < 5 charms deployed over 10 instances, and the lag was apparent even when only one instance was deploying, or a single unit was added
18:53 <rick_h> bdx: on gce or the aws from last night?
18:54 <urulama> rick_h: gce was even worse than aws
18:54 <rick_h> urulama: :/ ok
18:54 <bdx> yesterday/last night - AWS
18:55 <rick_h> bdx: ah ok, gotcha. I thought you were saying just now.
19:04 <urulama> balloons: i think we'll have to do a new set of scaling tests. this time, ha controllers, and then "juju deploy" a large bundle (with say 80 machines, lots of relations). the bundle doesn't have to make sense, but the point is to find the amount of concurrent requests sent to the controller that "breaks" it (ie, mongo transactions can't handle them anymore)
19:04 <urulama> uiteam: ^
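
As a rough illustration of what such a concurrency test could look like (the application names, unit counts, and degree of parallelism here are invented, not the actual test plan):

    # hammer the controller with several deploy requests at once
    for i in $(seq 1 8); do
        juju deploy ubuntu "ubuntu-$i" -n 10 &
    done
    wait
    # then issue the same deploys sequentially and compare how the controller copes
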
19:14 <bdx> urulama: I'm wondering, if we remove all the ubuntu boxes and just deploy another CDK bundle, whether the controller would still be lagging as it is now
19:14 <bdx> or the model, per se
19:15 <bdx> or if we just removed everything, do you think we would still experience the lag that we are seeing right now?
19:17 <urulama> it should work as normal again once everything is removed
19:18 <urulama> there
19:18 <urulama> last units can't be processed
19:19 <urulama> last units can't be processed
19:19 <urulama> oops
19:21 <bdx> urulama: do you mind if I kill "ubuntu" and deploy a second k8s?
19:22 <urulama> bdx: go ahead, but with a delay just to make sure, as cleaning up 100 machines will take some time
19:22 <urulama> bdx: ok, i can reproduce this for the test now. after your deploy of k8s, i'll destroy the model and we'll look into this. will keep you posted. for now, the advice would be to deploy in chunks
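
In CLI terms, that step would look roughly like the following; the length of the pause is a guess, since the log only says "with a delay":

    juju remove-application ubuntu
    sleep 600    # let the removed machines actually get reclaimed first
    juju deploy canonical-kubernetes
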
19:25 <bdx> urulama: I'm glad we have exposed this issue, but it would only be the same as the one I've experienced if the sluggish behavior stuck around after the removal of all the instances
19:25 <urulama> ok, then go ahead, let's verify what happens
19:25 <bdx> what I was experiencing was a model ~500x more sluggish, with only 10 instances
19:25 <bdx> ok
19:26 <bdx> oh .. even that is already clearly more responsive than what I was experiencing
19:27 <bdx> I would remove an application, and it would take minutes before `juju status` even updated
19:35 <urulama> i've removed all the machines for ubuntu, this should put some stress on the controller
19:48 <urulama> bdx: ok, seems that model is stuck, nothing gets in or out anymore ... funny thing is, the controller is not affected and i am able to deploy from another model. never seen this before :)
19:49 * urulama has destroyed the model
19:56 <bdx> urulama: you destroyed it?
19:59 <urulama> bdx: yeah, had to, the model was not responding at all anymore and i didn't want to leave 100 machines around
19:59 <urulama> (had to remove them from the console in the end)
20:02 <bdx> urulama: i see ... wow
20:03 <urulama> bdx: yeah :-/ looks like a fun week ahead :)
=== urulama is now known as urulama|afk
