[03:04] <bdx> rick_h: yeah .. the best way I can describe what I'm experiencing is "controller lag"
[03:04] <bdx> I'm experiencing it now
[03:04] <bdx> us-east-1
[03:05] <bdx> my model has been sitting like this for 5 minutes http://paste.ubuntu.com/24278678/
[03:07] <bdx> I'm wondering if this controller/model lag has anything to do with the fact I've deployed so many machines to this model, and so many revs of the same charms?
[03:10] <bdx> the prm-worker-prod instances are c3.xlarges .... they usually just blast right through getting all the deps installed
[03:11] <bdx> so, this is why I think the lag I'm experiencing has to do with the long/large history of my model
[03:11] <bdx> ^^ because if/when I create a new model and deploy everything the exact same way it works without a hitch
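The workaround bdx describes (same bundle, fresh model) maps onto the juju CLI roughly like this — a sketch only: the model and bundle names are placeholders, and `juju` is stubbed with `echo` so the sketch runs without a live controller:

```shell
# Sketch of the workaround: redeploy the same bundle into a fresh model.
# Model/bundle names are placeholders. juju() is stubbed with echo so the
# sketch is runnable without a controller; remove the stub to run for real.
juju() { echo "juju $*"; }

juju add-model prm-prod-fresh aws/us-east-1   # new model, same region
juju switch prm-prod-fresh
juju deploy ./bundle.yaml                     # same bundle, clean model history
juju status                                   # deploys proceed without the lag
```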
[03:12] <bdx> but the more deploys I send off on this model, the slower it gets
[03:12] <bdx> I know it sounds odd
[03:13] <bdx> but now that I'm thinking about it, I've experienced this behavior before, when I have had loaded models with long history
[03:17] <bdx> beta-controller-peeps:^^
[03:18] <bdx> here's the model uuid -> 38efd3e5-70da-42a3-8d0c-84d4cc0c5835
[03:19] <bdx> not sure if that will give you any insight, but it seems like that model is crashing or something
[03:19] <bdx> thats the best way I can describe it
[06:16] <urulama> bdx: that region was fixed now, let me know if it's still sluggish. the number of machines and units shouldn't be an issue at all (as we're not talking thousands :D) 
[06:17] <urulama> bdx: are agents in a ready state?
[17:00] <bdx> urulama: it was complete chaos
[17:00] <bdx> urulama: creating a new model was the fix
[17:01] <bdx> urulama: every deploy was getting slower and slower, and each more unpredictable as far as what state things would end up in 
[17:04] <bdx> urulama: @anastasiamac posted a list of bugs that are fixed in 2.2, to me it looks like what I was experiencing might be fixed in 2.2 then
[17:04] <bdx> urulama: bug # 1671258, bug # 1651291, bug # 1634328, bug # 1587644, bug # 1581069, bug # 1649719 
[17:08] <bdx> urulama: to be clear, I can take the same bundle that was relentlessly failing and sluggish, and deploy it in a new model and see things work correctly, and in a fraction of the time 
[17:16] <urulama> bdx: hey, sorry, out for the day ... you still see slow models even now?
[17:17] <urulama> bdx: the update to 2.2 will get to production as soon as it is final, sometime end of April 
[17:17] <urulama> bdx: so, we need to resolve what's going on with your model
[17:17] <urulama> bdx: and agents are all "alive"?
[17:18] <bdx> urulama: I destroyed it
[17:18] <bdx> I needed the new model to have the same name
[17:19] <bdx> urulama: I've actually hit this a few times .... I had previously destroyed the affected models, and started on a clean slate and had the same results 
[17:20] <bdx> urulama: let me see if I can recreate this for you
[17:20] <urulama> bdx: hm, controllers are fully instrumented and they're not showing any high CPU load or Mem issues
[17:20] <urulama> bdx: so, could be something "interesting"
[17:21] <bdx> urulama: that's what rick_h was saying last night when the issue was at its worst
[17:21] <bdx> urulama: that the controllers looked fine ... he couldn't identify any resource contention or high load
[17:23] <bdx> urulama: I'll be afk till monday after today ... possibly I should try to recreate this and ping you on monday 
[17:23] <urulama> bdx: perfect, i'm out tomorrow as well :)
[17:23] <urulama> bdx: but yes, let's get to the root of this
[17:25] <urulama> bdx: maybe a good test would be trying on GCE ... just to eliminate the provider issue
[17:25] <bdx> urulama: yeah ... I don't have gce creds though 
[17:28] <urulama> bdx: ok, worst case, i'll create a model on GCE and share it with you so that you can deploy to it
[17:29] <bdx> urulama: if you want to go ahead and do that, I'll start sending deploys while I'm at my desk today
[17:31] <bdx> oh ... you're out today ... well just ping me whenever you get back to it then, or I'll get in touch on monday 
[17:36] <urulama> bdx: you should see the model if you go to your profile page 
[17:37] <bdx> urulama: nice, I see it
[17:58] <urulama> bdx: i hope i have enough vcpu's free to be able to deploy all that :D
[18:12] <bdx> urulama: well ... 
[18:13] <bdx> urulama: I think I broke it already ... not sure it's the same issue I was experiencing last night, but things seem "stuck" to say the least
[18:13] <urulama> bdx: it doesn't look good ... don't think it'll even finish
[18:13] <bdx> yeah ... 
[18:14] <urulama> bdx: lemme poke at controllers
[18:14] <bdx> k
[18:17] <urulama> bdx: is this a bundle?
[18:18] <urulama> bdx: something that i can poke at outside of jaas, or even on 2.2-tip?
[18:20] <bdx> urulama: I just deployed canonical-kubernetes + ubuntu x10 + my personal bundle
[18:20] <bdx> urulama: if I add you to my launchpad team, you could deploy it
[18:21] <urulama> bdx: ok, np, we'll build something equivalent to test
[18:21] <bdx> urulama: its prm-web and worker are just rails apps
[18:21] <bdx> urulama: each with relations to redis, postgres, and haproxy
[18:21] <urulama> kk
[18:21] <bdx> nothing special
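A bundle of that shape — illustrative only, with made-up charm URLs, series, and unit counts, not bdx's actual bundle — would look something like:

```yaml
# Illustrative sketch only -- charm URLs, series, and unit counts are guesses.
series: xenial
applications:
  prm-web:
    charm: cs:~bdx/prm-web        # placeholder charm URL
    num_units: 2
  prm-worker:
    charm: cs:~bdx/prm-worker     # placeholder charm URL
    num_units: 4
    constraints: instance-type=c3.xlarge
  redis:
    charm: cs:redis
  postgresql:
    charm: cs:postgresql
  haproxy:
    charm: cs:haproxy
relations:
  - [prm-web, redis]
  - [prm-web, postgresql]
  - [prm-web, haproxy]
  - [prm-worker, redis]
  - [prm-worker, postgresql]
  - [prm-worker, haproxy]
```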
[18:23] <bdx> urulama: yeah ... so we could do it more incrementally .... that doesn't explain why the machines are stuck though
[18:24] <bdx> I'll remove all the apps from the model for now
[18:25] <urulama> bdx: that's what i was thinking to test ... is it a bulk of api calls too much to handle and should it be incremental deploy
[18:26] <rick_h> bdx: yea, the native juju deploy bundle work has been known to choke on some large bundles like big openstack deploys. I wonder if chunking it will help or using the deployer vs juju deploy. 
[18:26] <rick_h> bdx: the deployer has some built-in timing/retry stuff that juju doesn't do (it should be updated to just work better as a bugfix tbh)
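Chunking, as rick_h suggests, just means feeding the controller one application at a time instead of the whole bundle in a single `juju deploy` — a sketch, with a placeholder application list and `juju` stubbed with `echo` so it runs without a controller:

```shell
# Chunked deploy sketch: one application at a time, with a pause between
# chunks, instead of one big bundle deploy. App names are placeholders.
# juju() is stubbed with echo so the sketch runs without a controller.
juju() { echo "juju $*"; }

apps="prm-web prm-worker redis postgresql haproxy"
for app in $apps; do
    juju deploy "$app"
    # sleep 30    # uncomment to give the controller time between chunks
done
```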
[18:27] <bdx> rick_h: ahhh, that doesn't explain this http://paste.ubuntu.com/24282721/
[18:27] <rick_h> bdx: no, but just a heads up. It's a known issue with the current bundle deploy w/o JAAS. 
[18:27] <bdx> but yeah ... good things to keep in mind ... I mean, my bundle only has 6 applications
[18:27] <bdx> rick_h: I see
[18:27] <rick_h> 84 machines? wow
[18:28] <bdx> yeah .. but none of them leave my model
[18:28] <bdx> *the model
[18:28] <bdx> the first 40 deployed fine via bundle
[18:28] <bdx> then I killed them off, and tried to redeploy and "cCRasHHH"
[18:29] <bdx> but they didn't really die completely from juju's perspective at least
[18:29] <bdx> they are gone!
[18:30] <bdx> it just took 10 mins
[18:30] <bdx> unless one of you did something to clear it out 
[18:31] <urulama> nope, didn't touch it, but the model is empty now
[18:31] <rick_h> bdx: no, but since juju does serialize things I'm not surprised. 
[18:34] <bdx> seems to be working fine now
[18:35] <urulama> bdx: one machine?
[18:35] <bdx> yeah
[18:35] <urulama> bdx: yeah, i think it's the amount of concurrent requests ... we'll do some scale testing next week and will keep you posted
[18:37] <bdx> ok
[18:39] <urulama> bdx: so, i'm gonna deploy canonical-kubernetes there, then add 10x ubuntu, then add more, but with delays. just to check
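urulama's test plan maps onto the CLI roughly like this — a sketch: `juju` is stubbed with `echo` so it runs without a controller, and the wave sizes and delays are just the figures mentioned above:

```shell
# Sketch of the test plan: CDK first, then ubuntu machines in delayed waves.
# juju() is stubbed with echo so the sketch runs without a controller.
juju() { echo "juju $*"; }

juju deploy canonical-kubernetes
juju deploy ubuntu -n 10          # first batch of 10 ubuntu units
for wave in 1 2 3; do
    # sleep 60                    # uncomment: delay between waves
    juju add-unit ubuntu -n 10    # keep adding, watching for the lag
done
```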
[18:40] <urulama> bdx: ah, you're already doing it :)
[18:41] <bdx> urulama: ok, yeah  ... I figured I should just iterate on bringing CDK up and down a few times in the model to ramp up the history
[18:41] <bdx> do what you will
[18:44] <rick_h> urulama: might be something for someone while I'm away to update the stress tool to be able to deploy to the same model over and over vs new models. 
[18:46] <urulama> rick_h: i don't think it's deploy/destroy, it's just "deploy N services" concurrently
[18:47] <rick_h> urulama: gotcha ok. I wasn't sure if you were trying to increase the size of the history of events/etc. 
[18:49] <bdx> rick_h, urulama: I had upgraded the charms on my model many times ... I'm wondering how much revision history comes into play here too
[18:50] <rick_h> bdx: that used to just cause issues with disk space over time but that's been corrected before/during juju2 and so I don't know of any current issues there
[18:50] <bdx> ok
[18:52] <bdx> so, I only had < 5 charms deployed over 10 instances, and the lag was apparent even when only one instance was deploying, or a single unit added 
[18:53] <rick_h> bdx: on gce or the aws from last night?
[18:54] <urulama> rick_h: gce was even worse than aws
[18:54] <rick_h> urulama: :/ ok
[18:54] <bdx> yesterday/last night - AWS
[18:55] <rick_h> bdx: ah ok gotcha. I thought you were saying just now. 
[19:04] <urulama> balloons: i think we'll have to do new set of scaling tests. this time, ha controllers, and then "juju deploy" a large bundle (with say 80 machines, lots of relations). the bundle doesn't have to make sense, but the point is to find the amount of concurrent requests set to the controller that "breaks" it (ie, mongo transactions can't handle them anymore)
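One way to build the "bundle that doesn't have to make sense" is to generate it — a sketch; the 80-machine figure comes from the plan above, while the file name and layout are made up:

```shell
# Generate a throwaway bundle of N single-unit ubuntu applications, purely
# to drive many concurrent provisioning requests at the controller.
N=${N:-80}
{
    echo "applications:"
    for i in $(seq 1 "$N"); do
        printf '  ubuntu-%d:\n    charm: cs:ubuntu\n    num_units: 1\n' "$i"
    done
} > stress-bundle.yaml
# then: juju deploy ./stress-bundle.yaml
```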
[19:04] <urulama> uiteam: ^
[19:14] <bdx> urulama: I'm wondering if we remove all the ubuntu boxes, and just deploy another CDK bundle, if the controller would still be lagging as it is now
[19:14] <bdx> or the model per se
[19:15] <bdx> or if we just removed everything, do you think we would still experience the lag that we are seeing right now?
[19:17] <urulama> it should work as normal again once everything is removed
[19:18] <urulama> there
[19:18] <urulama> last units can't be processed
[19:19] <urulama> last units can't be processed
[19:19] <urulama> oops
[19:21] <bdx> urulama: do you mind if I kill "ubuntu", and deploy a second k8s?
[19:22] <urulama> bdx: go ahead, but with a delay just to make sure, as cleaning 100 machines will take some time
[19:22] <urulama> bdx: ok, i can reproduce this for the test now. after your deploy of k8s. i'll destroy the model and we'll look into this. will keep you posted. for now, advice would be to deploy in chunks
[19:25] <bdx> urulama: I'm glad we have exposed this issue, but it would only be the same as the one I've experienced if the sluggish behavior stuck around following the removal of all the instances
[19:25] <urulama> ok, then go ahead, let's verify what happens
[19:25] <bdx> what I was experiencing was a 500x sluggish model with only 10 instances
[19:25] <bdx> ok
[19:26] <bdx> oh .. even that already is clearly more responsive than what I was experiencing
[19:27] <bdx> I would remove an application, and it would take minutes before `juju status` even updated
[19:35] <urulama> i've removed all machines for ubuntu, this should put some stress on the controller
[19:48] <urulama> bdx: ok, seems that model is stuck, nothing gets in or out anymore ... funny thing is, that controller is not affected and i am able to deploy from another model. never seen this before :)
[19:49]  * urulama has destroyed the model
[19:56] <bdx> urulama: you destroyed it?
[19:59] <urulama> bdx: yeah, had to, the model was not responding at all anymore and didn't want to leave 100 machines around
[19:59] <urulama> (had to remove them from the console in the end)
[20:02] <bdx> urulama: i see ... wow
[20:03] <urulama> bdx: yeah :-/ looks like fun week ahead :)