/srv/irclogs.ubuntu.com/2017/06/08/#juju-dev.txt

wallyworldso doesn't happen immediately if no cpu is available00:00
thumperwallyworld: it seems worse than that00:08
axwmhilton: your updates LGTM, thank you. did you run the change to use interactive auth-type past uros?00:10
axwwallyworld: thanks I'll fix that doc up in a minute00:25
thumperI've got a bad feeling about this00:36
thumperaxw: can I talk something through for 5 minutes?00:37
thumperah... stand up00:37
thumperguess now00:37
thumpernot00:37
thumperhmm... maybe this is fine...00:41
axwthumper: sorry, missed your message. still need to talk? I can be with you in a couple of minutes01:18
wallyworldaxw: from what i can see, the token bucket implementation isn't really what i need - that provides a way to control the max rate at which things are processed. i need to use a random delay based on the observed rate of incoming connections.01:19
thumperaxw: no, I think I'm ok01:19
thumperjust testing manually now01:19
thumperwas concerned about client  <-> server facade version handling01:20
thumperbut I think we are ok01:20
axwthumper: okey dokey01:20
axwwallyworld: why random?01:20
axwwallyworld: token bucket is a standard method for rate limiting, just wondering why it's not enough01:21
wallyworldwe want to increase the pause depending on load01:21
wallyworldthe less load, the shorter the pause01:21
wallyworldand jitter is also good01:21
wallyworldif the controller is not loaded, not processing lots of connections, no real need to rate limit the same way as when it is loaded01:22
wallyworldso the algorithm is mindelay + (rate of connections metric) * 5ms01:22
wallyworldup to a max delay01:22
* thumper is heading out, daughter to dentist01:24
thumperbbl01:24
=== thumper is now known as thumper-afk
axwwallyworld: isn't that a recipe for DoS? throw a bunch of connections at the server, and then every one else is penalised01:25
axwwallyworld: because it's exponential backoff for everyone01:26
babbageclunkwallyworld: what else has status history? Do we clean up status history for machines?01:56
babbageclunkor axw ^01:57
babbageclunkShould we clean it up for machines? Or do we just rely on the pruner to get rid of that?01:57
axwbabbageclunk: search for probablyUpdateStatusHistory in state, all those things have status history01:58
axwI guess we should clean it up for them01:58
axwthough I don't think they're going to be anywhere near as high volume as units01:58
babbageclunkaxw: Ok, thanks. Just trying to work out whether anything else has the same problem.02:04
=== thumper-afk is now known as thumper
thumperwell... fark02:33
thumpernow I need to work out why this isn't working02:33
thumperaxw: I could use that chat now if you have 10 minutes02:47
axwthumper: sure02:48
axwthumper: https://hangouts.google.com/hangouts/_/canonical.com/axw-thumper?authuser=1 ?02:48
thumperack02:48
babbageclunkjam: take a look at https://github.com/juju/juju/pull/7468? I'm popping out to pick up my daughter.02:53
axwwallyworld jam: https://github.com/axw/juju/commit/e77cbf1b49d0a9e158f54c629a44dca253c32426 <- WIP to add rate limiting to log sink. would appreciate your thoughts on if this approach is OK before I proceed03:52
axwwallyworld jam: that ratelimit.Bucket is shared by all logsink connections, in case it's not obvious03:54
wallyworldaxw: looking04:01
wallyworldjam: here's a connection rate limit PR. i don't have a full feel for the type of connection rates we are seeing, so am not sure of the pause numbers used (eg are they too low) https://github.com/juju/juju/pull/747004:02
wallyworldaxw: looks ok to me. i wonder if the 1ms refill rate is set to the most appropriate value. hard to say without a feel for a) what mgo can handle, b) the rate at which incoming log messages arrive04:11
axwwallyworld: yeah, I don't know. I see you're making things configurable via env vars in your branch, so I could do that04:12
wallyworldyeah, we can then do some perf testing04:12
wallyworldset env var, bounce agent etc04:13
thumpercan anyone point to api client tests that hit multiple fake remote versions?04:13
thumperso we just test best version handling?04:13
wallyworldthere will be tests for FindTools() somewhere04:14
wallyworldwhere people get confused is running proposed tools and then it not seeing released tools04:14
wallyworldso upgrades from rc to ga need to set agent-stream04:14
wallyworldor something along those lines04:14
thumperah fark04:14
thumperapi/base/testing/apicaller.go hard codes best version to zero04:15
thumper:(04:15
wallyworldyay04:15
thumperStubFacadeCaller looks like the ticket04:16
thumpernope...04:32
* thumper writes one04:33
thumperwow... it works04:33
thumperhazaah04:33
axwwallyworld: FYI, here's how I'm planning to parameterise the config: https://github.com/axw/juju/commit/77a061739b01f37f4eb85448664018c1ee0cec19. I'd rather it get pulled out of env at the command level, and poked in via ServerConfig04:39
wallyworldaxw: yeah could do, but i wasn't looking to make the config anything formally accessible; is purely intended for our testing purposes04:42
wallyworlddon't really want to expose those knobs04:42
thumperjust putting my code up for review04:42
axwwallyworld: this is just in agent config, so the user can't see it anyway. *shrug*04:43
wallyworldsure, i can add the extra code04:43
thumperhttps://github.com/juju/juju/pull/7471 a bit more chunky than I would have liked but it fixes a bug in dump-model as well by changing the output format04:47
axwneed exercise... bbs04:48
wallyworldthumper: looking04:49
wallyworldthumper: here's a one line pr https://github.com/juju/testing/pull/12704:49
* thumper looks04:49
thumperwallyworld: should we be checking versions around it?04:49
wallyworldthumper: we only support 3.2, and 1.25 should use pinned dep04:50
thumperack04:50
thumperlgtm04:50
wallyworldta04:50
thumperwait04:51
thumperwallyworld: you targeted master04:51
wallyworldthumper: yeah, this is juju/testing repo04:51
thumperoh04:51
thumperduh04:51
thumpersorry04:51
wallyworldnp04:51
wallyworldaxw: a small one https://github.com/juju/juju/pull/747305:50
axwwallyworld: how is that acceptance test working? expecting --noprealloc when it shouldn't?06:09
wallyworldaxw: it's not being run - it was for the 2.4->2.6->3.2 upgrade06:09
axwwallyworld: ah, ok06:09
wallyworldpretty sure06:09
wallyworldthought i'd update anyway06:10
axwwallyworld: LGTM06:11
wallyworldtyvm06:12
wallyworldjam: are you free to look at the pr for server connection rate limiting? we're looking to cut rc2 tomorrow07:04
wallyworldhttps://github.com/juju/juju/pull/747007:04
wpkwallyworld: DefaultLoginRetyPause07:09
wpkyou ate 'r'07:10
wpkrest after I get to the office07:10
wallyworldso i did07:10
jamwallyworld: just got back home, I'll look at rate limiting07:19
wallyworldthank you07:19
axwwallyworld: I'm struggling to come up with a way of testing the logsink rate limiting without changing a heap of code, which I don't think is wise for 2.2. do you think it would be OK to land https://github.com/axw/juju/commit/77a061739b01f37f4eb85448664018c1ee0cec19 as is, and add tests on develop (with significant refactoring)?07:54
wallyworldaxw: i guess you ran up a system with lots of log traffic?07:56
axwwallyworld: I ran up enough to observe that rate limiting takes effect07:56
wallyworldif jam gives a +1, ok with me07:56
axwwallyworld: and twiddled the knobs in agent.conf to see that that works07:56
jamwallyworld: hey ian, sorry about the delay, I have to get the salary recommendations before Tim goes and then I'll look at it again.07:57
jamwallyworld: for good or bad, I can no longer connect to Comcast, so I have more review time :)07:57
wallyworldno worries, i'm off to soccer soonish07:57
wallyworldjam: i guess that means we don't know the status of site after deleting unit status history rtc?07:58
jamwallyworld: I was able to connect this morning08:07
jamwallyworld: for about 30min or so08:07
jamthings were looking a lot better with the history gone08:07
jamwallyworld: but it wasn't 'quiet' yet, either. Status returned, but took 10min08:07
jamwallyworld: I did get a couple of cpu profiles dumped, but those are sitting on the disk over there08:07
jamit isn't very easy to get data out.08:08
wallyworldjam: tim is landing a pr to get all the model data much more efficiently using export. if that performs well, we can rewrite status to use that instead of very inefficiently walking the model08:11
jamwallyworld: well, status when things were happy on monday was 15-30s08:13
jamwallyworld: so while 'we can have better status code', we can also get the system much happier than it is right now08:13
jamIt may be that ultimately reworking status just gives better scaling under load, not sure08:14
jam10min vs 30s is 20x factor08:14
jamI could see 1-by-1 querying being more impacted by load, though.08:15
wallyworldjam: ah i see, didn't realise it was as low as 30s. well yeah, then we have work to do to figure out where things are going amiss08:15
wallyworldhopefully the guys on site can get good data/measurements08:16
jamwallyworld: I imagine I'll be able to get back in after another 2-4hrs08:17
axwjam: https://github.com/juju/juju/pull/7474 adds logsink rate-limiting. as I mentioned to wallyworld above, I've been unable to come up with a test that doesn't involve heavy refactoring of apiserver08:30
axwjam: so I'd like to land that as is if you're comfortable (already have +1 from wallyworld), and do refactoring + tests in develop08:31
* wallyworld off to soccer, back later08:36
jamaxw: still looking at ian's patch, but almost done08:42
jamaxw: arguably we should use similar algorithms for how we throttle08:42
jamToken Bucket looks quite promising, and is used for most network throttling08:42
axwjam: I agree, I did suggest it to wallyworld. his approach could not be captured in token bucket AFAICT, but I'm not sure the approach of exponential backoff across the board is necessarily good anyway. it means latecomers are disadvantaged, which means a DoS could starve out users08:44
jamaxw: so for logging, I would do it more per logger08:45
jamvs over all of them08:45
axwalthough there is an upper limit, so I guess it's not that bad08:45
jamI guess you'd want both weights involved?08:45
jambut throttling the slow logging because someone is spammy doesn't sound right08:45
axwjam: the thought did cross my mind that we might want it at both levels08:46
jamaxw: it's mostly a 'play fair with your neighbors' algorithm we're looking for08:46
jamwallyworld: reviewed09:10
axwjam: I've updated my PR to rate limit per connection, rather than all together. I think we may want both, but doing both requires more thought and I don't want to stall this09:23
axwjam: it's at least not *worse* this way, which it could be if we did a shared token bucket09:24
wallyworldjam: thanks for review. with the on-the-fly config, that was never intended to be in scope; the fact that it has *any* configurability is a concession to initial tuning under lab conditions. with the token bucket thing, i didn't go with that because we discussed a random, on average increasing delay the faster the rate of incoming connections. disconnects don't matter because typically i would think the load would be incurred in accepting the11:18
wallyworldconnection, ie this was designed for the thundering herd problem on startup, not steady state load11:18
wallyworldif we do stick with the approach, i probably should make the 5ms configurable11:19
jamwallyworld: sure. I think either way it helps the thundering herd problem. I'm wondering if changing it slightly would make it more understandable what we are tweaking, and it's going to be hard to test live by restarting the controllers11:57
jambut we can live with it11:57
wallyworldjam: i'm updating the pr to expose more knobs as suggested. because of the purpose, and the desire not to give people more knobs, i think it's ok to just use agent config and require a restart. it's for us to tune, not field folks initially. that can change of course. how did you want to make it more understandable?11:59
jamwallyworld: right, the issue is more that while I'm trying to tune it, I have to kill half the world (3rd the world?)12:00
jamanyway, having it be a start and a target12:00
wallyworldyeah, but that introduces the herd problem which this is designed to fix :-)12:00
jamso "at X connections delay should be Y"12:00
jamwallyworld: so giving 2 (conns, delay) coordinates12:01
jamand then just linear interpolation12:01
jamconnections_low, delay_low = (10, 10ms)12:01
jamconnections_high, delay_high = (1000, 5s)12:01
wallyworldconns (absolute) or conns (rate of arrival)12:01
wallyworldatm it's rate12:01
jamwallyworld: probably rate is better12:01
wallyworldhow many per 10ms12:01
jamwallyworld: I'd use human units, something like /s or /min12:02
jamprobably /s12:02
wallyworldso now, it's simple - 10ms min plus 5ms per rate up to  a max12:02
jamwallyworld: right but 5ms is fixed and not particularly tuned to anything12:02
wallyworldi'm making that tunable12:02
wallyworldplus the lookback window12:02
wallyworldie how old the earlier conn time is before we stop looking back12:03
wallyworldjam: so i think what is there gives you the low/high thing you want, but it also has a linear backup in between12:05
wallyworld*backoff12:05
jamwallyworld: so my suggestion was to linearly interpolate between those two points12:05
jamwhich is essentially the same. it fits more in *my* head how to think about it, but I'm sure it's just a transformation between the two12:06
wallyworldok, i can do that12:06
wallyworldone less thing to have to tweak12:06
wallyworldjam: i've pushed some changes, see what you think. the algorithm now has no randomness, and should be as per what we discussed13:15
jamwallyworld: looking13:18
wallyworldjam: anything major that's an issue?13:59
jamwallyworld: sorry, OTP with the site14:00
jambeen catching them up to speed14:00
wallyworldok, np. midnight here so i might need to end up soon14:01
jamwallyworld: sorry to hold you up. lgtm, only small thing would be "conns/10ms" is harder to think about as a human than "conns/s" and it's just a scale factor of 1:10014:12
wallyworldjam: ok, i'll see if i can tweak it. how are things at site?14:13
jamwallyworld: Juju is up and running, all agents are green.14:16
jamstatus takes 10min14:16
jamwhich is not great, but it succeeds14:16
wallyworldand agents all green, which is good14:16
jamthe Controllers are all using 2-300% cpu as is mongo14:17
wallyworldhmmm14:17
jamwallyworld: so I think this is our "juju goes into consuming cpus baseline" that we saw with the JAAS tests14:17
wallyworldat least we can now start profiling14:17
jamwallyworld: so *right* now, I'm working with heather and nick to get us a place we can run "go tool pprof --svg"14:17
wallyworldgreat14:17
jamwallyworld: yeah, we can't get files out of the system, so we have to install go and juju source, etc.14:17
wallyworldjoy14:18
wpkjam: technically or politically?14:34
jamwpk: mostly politically14:35
wpkjam: because technically there's always https://www.aldeid.com/wiki/File-transfer-via-DNS ;)14:41
jamwpk: :)14:44
=== hml_ is now known as hml
=== salmankhan1 is now known as salmankhan
=== akhavr1 is now known as akhavr
wallyworldbabbageclunk: veebers: anastasiamac: standup?23:46
babbageclunksorry, omw23:47
veeberswallyworld: d'oh snuck up on me omw23:47
