bdx | rick_h: yeah .. the best way I can describe what I'm experiencing is "controller lag" | 03:04 |
---|---|---|
bdx | I'm experiencing it now | 03:04 |
bdx | us-east-1 | 03:04 |
bdx | my model has been sitting like this for 5 minutes http://paste.ubuntu.com/24278678/ | 03:05 |
bdx | I'm wondering if this controller/model lag has anything to do with the fact I've deployed so many machines to this model, and so many revs of the same charms? | 03:07 |
bdx | the prm-worker-prod instances are c3.xlarges .... they usually just blast right through getting all the deps installed | 03:10 |
bdx | so, this is why I think the lag I'm experiencing has to do with the long/large history of my model | 03:11 |
bdx | ^^ because if/when I create a new model and deploy everything the exact same way it works without a hitch | 03:11 |
bdx | but the more deploys I send off on this model, the slower it gets | 03:12 |
bdx | I know it sounds odd | 03:12 |
bdx | but now that I'm thinking about it, I've experienced this behavior before, when I have had loaded models with long history | 03:13 |
bdx | beta-controller-peeps:^^ | 03:17 |
bdx | here's the model uuid -> 38efd3e5-70da-42a3-8d0c-84d4cc0c5835 | 03:18 |
bdx | not sure if that will give you any insight, but it seems like that model is crashing or something | 03:19 |
bdx | that's the best way I can describe it | 03:19 |
urulama | bdx: that region has been fixed now, let me know if it's still sluggish. the number of machines and units shouldn't be an issue at all (as we're not talking thousands :D) | 06:16 |
urulama | bdx: are agents in a ready state? | 06:17 |
=== frankban|afk is now known as frankban | ||
bdx | urulama: it was complete chaos | 17:00 |
bdx | urulama: creating a new model was the fix | 17:00 |
bdx | urulama: every deploy was getting slower and slower, and each one more unpredictable as to what state things would end up in | 17:01 |
bdx | urulama: @anastasiamac posted a list of bugs that are fixed in 2.2; it looks like what I was experiencing might be fixed in 2.2 then | 17:04 |
bdx | urulama: bug # 1671258, bug # 1651291, bug # 1634328, bug # 1587644, bug # 1581069, bug # 1649719 | 17:04 |
bdx | urulama: to be clear, I can take the same bundle that was relentlessly failing and sluggish, and deploy it in a new model and see things work correctly, and in a fraction of the time | 17:08 |
=== frankban is now known as frankban|afk | ||
urulama | bdx: hey, sorry, out for the day ... you still see slow models even now? | 17:16 |
urulama | bdx: the update to 2.2 will get to production as soon as it is final, sometime at the end of April | 17:17 |
urulama | bdx: so, we need to resolve what's going on with your model | 17:17 |
urulama | bdx: and agents are all "alive"? | 17:17 |
bdx | urulama: I destroyed it | 17:18 |
bdx | I needed the new model to have the same name | 17:18 |
bdx | urulama: I've actually hit this a few times .... I had previously destroyed the affected models, and started on a clean slate and had the same results | 17:19 |
bdx | urulama: let me see if I can recreate this for you | 17:20 |
urulama | bdx: hm, controllers are fully instrumented and they're not showing any high CPU load or Mem issues | 17:20 |
urulama | bdx: so, could be something "interesting" | 17:20 |
bdx | urulama: that's what rick_h was saying last night when the issue was at its worst | 17:21 |
bdx | urulama: that the controllers looked fine ... he couldn't identify any resource contention or high load | 17:21 |
bdx | urulama: I'll be afk till monday after today ... possibly I should try to recreate this and ping you on monday | 17:23 |
urulama | bdx: perfect, i'm out tomorrow as well :) | 17:23 |
urulama | bdx: but yes, let's get to the root of this | 17:23 |
urulama | bdx: maybe a good test would be trying on GCE ... just to eliminate the provider issue | 17:25 |
bdx | urulama: yeah ... I don't have gce creds though | 17:25 |
urulama | bdx: ok, worst case, i'll create a model on GCE and share it with you so that you can deploy to it | 17:28 |
bdx | urulama: if you want to go ahead and do that, I'll start sending deploys while I'm at my desk today | 17:29 |
bdx | oh ... you're out today ... well, just ping me whenever you get back to it then, or I'll get in touch on monday | 17:31 |
urulama | bdx: you should see the model if you go to your profile page | 17:36 |
bdx | urulama: nice, I see it | 17:37 |
urulama | bdx: i hope i have enough vcpu's free to be able to deploy all that :D | 17:58 |
bdx | urulama: well ... | 18:12 |
bdx | urulama: I think I broke it already ... not sure it's the same issue I was experiencing last night, but things seem "stuck" to say the least | 18:13 |
urulama | bdx: it doesn't look good ... don't think it'll even finish | 18:13 |
bdx | yeah ... | 18:13 |
urulama | bdx: lemme poke at controllers | 18:14 |
bdx | k | 18:14 |
urulama | bdx: is this a bundle? | 18:17 |
urulama | bdx: something that i can poke at outside of jaas, or even on 2.2-tip? | 18:18 |
bdx | urulama: I just deployed canonical-kubernetes + ubuntu x10 + my personal bundle | 18:20 |
bdx | urulama: if I add you to my launchpad team, you could deploy it | 18:20 |
urulama | bdx: ok, np, we'll build something equivalent to test | 18:21 |
bdx | urulama: its prm-web and worker are just rails apps | 18:21 |
bdx | urulama: each with relations to redis, postgres, and haproxy | 18:21 |
urulama | kk | 18:21 |
bdx | nothing special | 18:21 |
bdx | urulama: yeah ... so we could do it more incrementally .... that doesn't explain why the machines are stuck though | 18:23 |
bdx | I'll remove all the apps from the model for now | 18:24 |
urulama | bdx: that's what i was thinking to test ... is a bulk of API calls too much to handle, and should it be an incremental deploy | 18:25 |
rick_h | bdx: yea, the native juju bundle deploy has been known to choke on some large bundles, like big openstack deploys. I wonder if chunking it will help, or using the deployer vs juju deploy. | 18:26 |
rick_h | bdx: the deployer has some built-in timing/retry stuff that juju doesn't do (it should be updated to just work better as a bugfix, tbh) | 18:26 |
bdx | rick_h: ahhh, that doesn't explain this http://paste.ubuntu.com/24282721/ | 18:27 |
rick_h | bdx: no, but just a heads up. It's a known issue with the current bundle deploy w/o JAAS. | 18:27 |
bdx | but yeah ... good things to keep in mind ... I mean, my bundle only has 6 applications | 18:27 |
bdx | rick_h: I see | 18:27 |
rick_h | 84 machines? wow | 18:27 |
bdx | yeah .. but none of them leave the model | 18:28 |
bdx | the first 40 deployed fine via bundle | 18:28 |
bdx | then I killed them off, and tried to redeploy and "cCRasHHH" | 18:28 |
bdx | but they didn't really die completely, from juju's perspective at least | 18:29 |
bdx | they are gone! | 18:29 |
bdx | it just took 10 mins | 18:30 |
bdx | unless one of you did something to clear it out | 18:30 |
urulama | nope, didn't touch it, but the model is empty now | 18:31 |
rick_h | bdx: no, but since juju does serialize things I'm not surprised. | 18:31 |
bdx | seems to be working fine now | 18:34 |
urulama | bdx: one machine? | 18:35 |
bdx | yeah | 18:35 |
urulama | bdx: yeah, i think it's the amount of concurrent requests ... we'll do some scale testing next week and will keep you posted | 18:35 |
bdx | ok | 18:37 |
urulama | bdx: so, i'm gonna deploy canonical-kubernetes there, then add 10x ubuntu, then add more, but with delays. just to check | 18:39 |
=== dames is now known as thedac | ||
urulama | bdx: ah, you're already doing it :) | 18:40 |
bdx | urulama: ok, yeah ... I figured I should just iterate on bringing CDK up and down a few times in the model to ramp up the history | 18:41 |
bdx | do what you will | 18:41 |
rick_h | urulama: might be something for someone to do while I'm away: update the stress tool to be able to deploy to the same model over and over vs new models. | 18:44 |
urulama | rick_h: i don't think it's deploy/destroy, it's just "deploy N services" concurrently | 18:46 |
rick_h | urulama: gotcha ok. I wasn't sure if you were trying to increase the size of the history of events/etc. | 18:47 |
bdx | rick_h, urulama: I had upgraded the charms on my model many times ... I'm wondering how much revision history comes into play here too | 18:49 |
rick_h | bdx: that used to just cause issues with disk space over time, but that's been corrected before/during juju2, so I don't know of any current issues there | 18:50 |
bdx | ok | 18:50 |
bdx | so, I only had < 5 charms deployed over 10 instances, and the lag was apparent even when only one instance was deploying, or a single unit was added | 18:52 |
rick_h | bdx: on gce or the aws from last night? | 18:53 |
urulama | rick_h: gce was even worse than aws | 18:54 |
rick_h | urulama: :/ ok | 18:54 |
bdx | yesterday/last night - AWS | 18:54 |
rick_h | bdx: ah ok gotcha. I thought you were saying just now. | 18:55 |
urulama | balloons: i think we'll have to do a new set of scaling tests. this time, HA controllers, and then "juju deploy" a large bundle (with say 80 machines, lots of relations). the bundle doesn't have to make sense, but the point is to find the number of concurrent requests sent to the controller that "breaks" it (i.e., mongo transactions can't handle them anymore) | 19:04 |
urulama | uiteam: ^ | 19:04 |
bdx | urulama: I'm wondering, if we removed all the ubuntu boxes and just deployed another CDK bundle, whether the controller would still be lagging as it is now | 19:14 |
bdx | or the model, per se | 19:14 |
bdx | or if we just removed everything, do you think we would still experience the lag that we are seeing right now? | 19:15 |
urulama | it should work as normal again once everything is removed | 19:17 |
urulama | there | 19:18 |
urulama | last units can't be processed | 19:18 |
bdx | urulama: do you mind if I kill "ubuntu", and deploy a second k8s? | 19:21 |
urulama | bdx: go ahead, but with a delay just to make sure, as cleaning 100 machines will take some time | 19:22 |
urulama | bdx: ok, i can reproduce this for the test now. after your deploy of k8s, i'll destroy the model and we'll look into this. will keep you posted. for now, the advice would be to deploy in chunks | 19:22 |
bdx | urulama: I'm glad we have exposed this issue, but it would only be the same as the one I've experienced if the sluggish behavior stuck around following the removal of all the instances | 19:25 |
urulama | ok, then go ahead, let's verify what happens | 19:25 |
bdx | what I was experiencing was a model that was 500x more sluggish, with only 10 instances | 19:25 |
bdx | ok | 19:25 |
bdx | oh .. even that already is clearly more responsive than what I was experiencing | 19:26 |
bdx | I would remove an application, and it would take minutes before `juju status` even updated | 19:27 |
urulama | i've removed all machines for ubuntu, this should put some stress on the controller | 19:35 |
urulama | bdx: ok, seems that model is stuck, nothing gets in or out anymore ... funny thing is, that controller is not affected and i am able to deploy from another model. never seen this before :) | 19:48 |
* urulama has destroyed the model | 19:49 | |
bdx | urulama: you destroyed it? | 19:56 |
urulama | bdx: yeah, had to, the model was not responding at all anymore and didn't want to leave 100 machines around | 19:59 |
urulama | (had to remove them from the console in the end) | 19:59 |
bdx | urulama: i see ... wow | 20:02 |
urulama | bdx: yeah :-/ looks like fun week ahead :) | 20:03 |
=== urulama is now known as urulama|afk | ||
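
To make the "deploy in chunks" advice above concrete, here is a minimal sketch of how a large bundle could be fed to a Juju 2.x controller in smaller batches instead of one `juju deploy` of everything at once. The model name, the region, the `chunk-*.yaml` file names (smaller pieces split out of the full bundle), the `pending|allocating` grep and the 30-second poll interval are all illustrative assumptions, not anything taken from the log.

```sh
#!/bin/sh
# Sketch: deploy a large bundle in chunks, letting the controller settle
# between batches rather than sending every provisioning request at once.
set -e

# Hypothetical model; the log used JAAS models on aws/us-east-1 and GCE.
juju add-model chunked-test aws/us-east-1

# chunk-1.yaml .. chunk-3.yaml are hypothetical slices of the full bundle.
for chunk in chunk-1.yaml chunk-2.yaml chunk-3.yaml; do
    juju deploy "./$chunk"
    # Crude readiness check: keep polling `juju status` until no machine
    # or agent still reports pending/allocating, then send the next batch.
    while juju status | grep -Eq 'pending|allocating'; do
        sleep 30
    done
done
```

The loop only caps how many concurrent requests hit the controller at a time, which is the variable the scale testing proposed at 19:04 is meant to probe; removals between runs would likewise be spaced out rather than fired for 100 machines at once.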