[00:00] so doesn't happen immediately if no cpu is available
[00:08] wallyworld: it seems worse than that
[00:10] mhilton: your updates LGTM, thank you. did you run the change to use interactive auth-type past uros?
[00:25] wallyworld: thanks I'll fix that doc up in a minute
[00:36] I've got a bad feeling about this
[00:37] axw: can I talk something through for 5 minutes?
[00:37] ah... stand up
[00:37] guess now
[00:37] not
[00:41] hmm... maybe this is fine...
[01:18] thumper: sorry, missed your message. still need to talk? I can be with you in a couple of minutes
[01:19] axw: from what i can see, the token bucket implementation isn't really what i need - that provides a way to control the max rate at which things are processed. i need to use a random delay based on the observed rate of incoming connections.
[01:19] axw: no, I think I'm ok
[01:19] just testing manually now
[01:20] was concerned about client <-> server facade version handling
[01:20] but I think we are ok
[01:20] thumper: okey dokey
[01:20] wallyworld: why random?
[01:21] wallyworld: token bucket is a standard method for rate limiting, just wondering why it's not enough
[01:21] we want to increase the pause depending on load
[01:21] the less load, the shorter the pause
[01:21] and jitter is also good
[01:22] if the controller is not loaded, not processing lots of connections, no real need to rate limit the same way as when it is loaded
[01:22] so the algorithm is mindelay + (rate of connections metric) * 5ms
[01:22] up to a max delay
[01:24] * thumper is heading out, daughter to dentist
[01:24] bbl
=== thumper is now known as thumper-afk
[01:25] wallyworld: isn't that a recipe for DoS? throw a bunch of connections at the server, and then everyone else is penalised
[01:26] wallyworld: because it's exponential backoff for everyone
[01:56] wallyworld: what else has status history? Do we clean up status history for machines?
[01:57] or axw ^
[01:57] Should we clean it up for machines? Or do we just rely on the pruner to get rid of that?
[01:58] babbageclunk: search for probablyUpdateStatusHistory in state, all those things have status history
[01:58] I guess we should clean it up for them
[01:58] though I don't think they're going to be anywhere near as high volume as units
[02:04] axw: Ok, thanks. Just trying to work out whether anything else has the same problem.
=== thumper-afk is now known as thumper
[02:33] well... fark
[02:33] now I need to work out why this isn't working
[02:47] axw: I could use that chat now if you have 10 minutes
[02:48] thumper: sure
[02:48] thumper: https://hangouts.google.com/hangouts/_/canonical.com/axw-thumper?authuser=1 ?
[02:48] ack
[02:53] jam: take a look at https://github.com/juju/juju/pull/7468? I'm popping out to pick up my daughter.
[03:52] wallyworld jam: https://github.com/axw/juju/commit/e77cbf1b49d0a9e158f54c629a44dca253c32426 <- WIP to add rate limiting to log sink. would appreciate your thoughts on whether this approach is OK before I proceed
[03:54] wallyworld jam: that ratelimit.Bucket is shared by all logsink connections, in case it's not obvious
[04:01] axw: looking
[04:02] jam: here's a connection rate limit PR. i don't have a full feel for the type of connection rates we are seeing, so am not sure of the pause numbers used (eg are they too low) https://github.com/juju/juju/pull/7470
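
A rough sketch of the pause scheme described at [01:22] and proposed in the PR above: a minimum delay, plus an increment for each connection observed in a lookback window, clamped to a maximum. All names and constants below are illustrative, not the actual code in the PR.

    package main

    import (
        "fmt"
        "time"
    )

    // Illustrative knobs only; the real values are still being tuned above.
    const (
        minLoginPause  = 10 * time.Millisecond // floor applied to every connection
        maxLoginPause  = 5 * time.Second       // hard ceiling on the pause
        pausePerConn   = 5 * time.Millisecond  // extra pause per recent connection
        lookbackWindow = time.Second           // how far back arrivals are counted
    )

    // loginPause returns how long to delay a new connection, given the times of
    // recent connection attempts. Arrivals older than lookbackWindow are ignored.
    func loginPause(now time.Time, recent []time.Time) time.Duration {
        n := 0
        for _, t := range recent {
            if now.Sub(t) <= lookbackWindow {
                n++
            }
        }
        pause := minLoginPause + time.Duration(n)*pausePerConn
        if pause > maxLoginPause {
            pause = maxLoginPause
        }
        return pause
    }

    func main() {
        now := time.Now()
        recent := []time.Time{now.Add(-100 * time.Millisecond), now.Add(-300 * time.Millisecond)}
        fmt.Println(loginPause(now, recent)) // 20ms with two arrivals in the window
    }
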
[04:11] axw: looks ok to me. i wonder if the 1ms refill rate is set to the most appropriate value. hard to say without a feel for a) what mgo can handle, b) the rate at which incoming log messages arrive
[04:12] wallyworld: yeah, I don't know. I see you're making things configurable via env vars in your branch, so I could do that
[04:12] yeah, we can then do some perf testing
[04:13] set env var, bounce agent etc
[04:13] can anyone point to api client tests that hit multiple fake remote versions?
[04:13] so we just test best version handling?
[04:14] there will be tests for FindTools() somewhere
[04:14] where people get confused is running proposed tools and then it not seeing released tools
[04:14] so upgrades from rc to ga need to set agent-stream
[04:14] or something along those lines
[04:14] ah fark
[04:15] api/base/testing/apicaller.go hard codes best version to zero
[04:15] :(
[04:15] yay
[04:16] StubFacadeCaller looks like the ticket
[04:32] nope...
[04:33] * thumper writes one
[04:33] wow... it works
[04:33] hazaah
[04:39] wallyworld: FYI, here's how I'm planning to parameterise the config: https://github.com/axw/juju/commit/77a061739b01f37f4eb85448664018c1ee0cec19. I'd rather it get pulled out of env at the command level, and poked in via ServerConfig
[04:42] axw: yeah could do, but i wasn't looking to make the config anything formally accessible; it's purely intended for our testing purposes
[04:42] don't really want to expose those knobs
[04:42] just putting my code up for review
[04:43] wallyworld: this is just in agent config, so the user can't see it anyway. *shrug*
[04:43] sure, i can add the extra code
[04:47] https://github.com/juju/juju/pull/7471 a bit more chunky than I would have liked but it fixes a bug in dump-model as well by changing the output format
[04:48] need exercise... bbs
[04:49] thumper: looking
[04:49] thumper: here's a one line pr https://github.com/juju/testing/pull/127
[04:49] * thumper looks
[04:49] wallyworld: should we be checking versions around it?
[04:50] thumper: we only support 3.2, and 1.25 should use a pinned dep
[04:50] ack
[04:50] lgtm
[04:50] ta
[04:51] wait
[04:51] wallyworld: you targeted master
[04:51] thumper: yeah, this is the juju/testing repo
[04:51] oh
[04:51] duh
[04:51] sorry
[04:51] np
[05:50] axw: a small one https://github.com/juju/juju/pull/7473
[06:09] wallyworld: how is that acceptance test working? expecting --noprealloc when it shouldn't?
[06:09] axw: it's not being run - it was for the 2.4->2.6->3.2 upgrade
[06:09] wallyworld: ah, ok
[06:09] pretty sure
[06:10] thought i'd update anyway
[06:11] wallyworld: LGTM
[06:12] tyvm
[07:04] jam: are you free to look at the pr for server connection rate limiting? we're looking to cut rc2 tomorrow
[07:04] https://github.com/juju/juju/pull/7470
[07:09] wallyworld: DefaultLoginRetyPause
[07:10] you ate 'r'
[07:10] rest after I get to the office
[07:10] so i did
[07:19] wallyworld: just got back home, I'll look at rate limiting
[07:19] thank you
[07:54] wallyworld: I'm struggling to come up with a way of testing the logsink rate limiting without changing a heap of code, which I don't think is wise for 2.2. do you think it would be OK to land https://github.com/axw/juju/commit/77a061739b01f37f4eb85448664018c1ee0cec19 as is, and add tests on develop (with significant refactoring)?
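
For context, the shared-bucket approach under review here (from [03:52]/[03:54] above) amounts to something like the sketch below, assuming the github.com/juju/ratelimit package; the refill interval, burst capacity, and handler shape are illustrative rather than the real apiserver code.

    package main

    import (
        "time"

        "github.com/juju/ratelimit"
    )

    type logRecord struct {
        Message string
    }

    type logSink struct {
        bucket *ratelimit.Bucket // shared across every logsink connection
    }

    func newLogSink() *logSink {
        // Refill one token per millisecond, allow bursts of up to 1000 records.
        return &logSink{bucket: ratelimit.NewBucket(time.Millisecond, 1000)}
    }

    // handle blocks until a token is available, then writes the record.
    func (s *logSink) handle(rec logRecord) {
        s.bucket.Wait(1)
        writeRecord(rec) // placeholder for the real database write
    }

    func writeRecord(rec logRecord) { _ = rec }

    func main() {
        sink := newLogSink()
        sink.handle(logRecord{Message: "agent started"})
    }
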
[07:56] axw: i guess you ran up a system with lots of log traffic?
[07:56] wallyworld: I ran up enough to observe that rate limiting takes effect
[07:56] if jam gives a +1, ok with me
[07:56] wallyworld: and twiddled the knobs in agent.conf to see that that works
[07:57] wallyworld: hey ian, sorry about the delay, I have to get the salary recommendations before Tim goes and then I'll look at it again.
[07:57] wallyworld: for good or bad, I can no longer connect to Comcast, so I have more review time :)
[07:57] no worries, i'm off to soccer soonish
[07:58] jam: i guess that means we don't know the status of the site after deleting unit status history etc?
[08:07] wallyworld: I was able to connect this morning
[08:07] wallyworld: for about 30min or so
[08:07] things were looking a lot better with the history gone
[08:07] wallyworld: but it wasn't 'quiet' yet, either. Status returned, but took 10min
[08:07] wallyworld: I did get a couple of cpu profiles dumped, but those are sitting on the disk over there
[08:08] it isn't very easy to get data out.
[08:11] jam: tim is landing a pr to get all the model data much more efficiently using export. if that performs well, we can rewrite status to use that instead of very inefficiently walking the model
[08:13] wallyworld: well, status when things were happy on monday was 15-30s
[08:13] wallyworld: so while 'we can have better status code', we can also get the system much happier than it is right now
[08:14] It may be that ultimately reworking status just gives better scaling under load, not sure
[08:14] 10min vs 30s is a 20x factor
[08:15] I could see 1-by-1 querying being more impacted by load, though.
[08:15] jam: ah i see, didn't realise it was as low as 30s. well yeah, then we have work to do to figure out where things are going amiss
[08:16] hopefully the guys on site can get good data/measurements
[08:17] wallyworld: I imagine I'll be able to get back in after another 2-4hrs
[08:30] jam: https://github.com/juju/juju/pull/7474 adds logsink rate-limiting. as I mentioned to wallyworld above, I've been unable to come up with a test that doesn't involve heavy refactoring of apiserver
[08:31] jam: so I'd like to land that as is if you're comfortable (already have +1 from wallyworld), and do refactoring + tests in develop
[08:36] * wallyworld off to soccer, back later
[08:42] axw: still looking at ian's patch, but almost done
[08:42] axw: arguably we should use similar algorithms for how we throttle
[08:42] Token Bucket looks quite promising, and is used for most network throttling
[08:44] jam: I agree, I did suggest it to wallyworld. his approach could not be captured in a token bucket AFAICT, but I'm not sure the approach of exponential backoff across the board is necessarily good anyway. it means latecomers are disadvantaged, which means a DoS could starve out users
[08:45] axw: so for logging, I would do it more per logger
[08:45] vs over all of them
[08:45] although there is an upper limit, so I guess it's not that bad
[08:45] I guess you'd want both weights involved?
[08:45] but throttling the slow logging because someone is spammy doesn't sound right
[08:46] jam: the thought did cross my mind that we might want it at both levels
[08:46] axw: it's mostly a 'play fair with your neighbors' algorithm we're looking for
[09:10] wallyworld: reviewed
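
The "both levels" idea mentioned at [08:46] could look roughly like this, again assuming github.com/juju/ratelimit: a global bucket protects the database overall, while a smaller per-connection bucket keeps one spammy agent from consuming the whole allowance. The rates here are made up for illustration.

    package main

    import (
        "time"

        "github.com/juju/ratelimit"
    )

    // global caps the overall write rate across every connection.
    var global = ratelimit.NewBucket(time.Millisecond, 1000)

    type logConn struct {
        local *ratelimit.Bucket // per-connection allowance, for fairness
    }

    func newLogConn() *logConn {
        return &logConn{local: ratelimit.NewBucket(5*time.Millisecond, 200)}
    }

    func (c *logConn) write(msg string) {
        c.local.Wait(1) // a noisy neighbour only exhausts its own bucket
        global.Wait(1)  // while the shared bucket still protects the database
        _ = msg         // placeholder for the real write
    }

    func main() {
        c := newLogConn()
        c.write("unit started")
    }
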
[09:23] jam: I've updated my PR to rate limit per connection, rather than all together. I think we may want both, but doing both requires more thought and I don't want to stall this
[09:24] jam: it's at least not *worse* this way, which it could be if we did a shared token bucket
[11:18] jam: thanks for review. with the on-the-fly config, that was never intended to be in scope; the fact that it has *any* configurability is a concession to initial tuning under lab conditions. with the token bucket thing, i didn't go with that because we discussed a random, on average increasing delay the faster the rate of incoming connections. disconnects don't matter because typically i would think the load would be incurred in accepting the
[11:18] connection, ie this was designed for the thundering herd problem on startup, not steady state load
[11:19] if we do stick with the approach, i probably should make the 5ms configurable
[11:57] wallyworld: sure. I think either way it helps the thundering herd problem. I'm wondering if changing it slightly would make it more understandable what we are tweaking, and it's going to be hard to test live by restarting the controllers
[11:57] but we can live with it
[11:59] jam: i'm updating the pr to expose more knobs as suggested. because of the purpose, and the desire not to give people more knobs, i think it's ok to just use agent config and require a restart. it's for us to tune, not field folks initially. that can change of course. how did you want to make it more understandable?
[12:00] wallyworld: right, the issue is more that while I'm trying to tune it, I have to kill half the world (a third of the world?)
[12:00] anyway, having it be a start and a target
[12:00] yeah, but that introduces the herd problem which this is designed to fix :-)
[12:00] so "at X connections delay should be Y"
[12:01] wallyworld: so giving 2 (conns, delay) coordinates
[12:01] and then just linear interpolation
[12:01] connections_low, delay_low = (10, 10ms)
[12:01] connections_high, delay_high = (1000, 5s)
[12:01] conns (absolute) or conns (rate of arrival)
[12:01] atm it's rate
[12:01] wallyworld: probably rate is better
[12:01] how many per 10ms
[12:02] wallyworld: I'd use human units, something like /s or /min
[12:02] probably /s
[12:02] so now, it's simple - 10ms min plus 5ms per rate up to a max
[12:02] wallyworld: right but 5ms is fixed and not particularly tuned to anything
[12:02] i'm making that tunable
[12:02] plus the lookback window
[12:03] ie how old the earliest conn time is before we stop looking back
[12:05] jam: so i think what is there gives you the low/high thing you want, but it also has a linear backup in between
[12:05] *backoff
[12:05] wallyworld: so my suggestion was to linearly interpolate between those two points
[12:06] which is essentially the same. it fits more in *my* head how to think about it, but I'm sure it's just a transformation between the two
[12:06] ok, i can do that
[12:06] one less thing to have to tweak
[13:15] jam: i've pushed some changes, see what you think. the algorithm now has no randomness, and should be as per what we discussed
[13:18] wallyworld: looking
[13:59] jam: anything major that's an issue?
[14:00] wallyworld: sorry, OTP with the site
[14:00] been catching them up to speed
[14:01] ok, np. midnight here so i might need to end up soon
[14:12] wallyworld: sorry to hold you up. lgtm, only small thing would be "conns/10ms" is harder to think about as a human than "conns/s" and it's just a scale factor of 1:100
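
jam's two-coordinate suggestion at [12:01] is a clamped linear interpolation; a sketch using the example numbers from the discussion (and hypothetical names) follows.

    package main

    import (
        "fmt"
        "time"
    )

    // Example coordinates from the discussion above: at 10 conns/s pause 10ms,
    // at 1000 conns/s pause 5s; rates outside that range are clamped.
    const (
        rateLow   = 10.0
        delayLow  = 10 * time.Millisecond
        rateHigh  = 1000.0
        delayHigh = 5 * time.Second
    )

    // connPause linearly interpolates the pause for the observed connection
    // rate, expressed in connections per second.
    func connPause(connsPerSec float64) time.Duration {
        if connsPerSec <= rateLow {
            return delayLow
        }
        if connsPerSec >= rateHigh {
            return delayHigh
        }
        frac := (connsPerSec - rateLow) / (rateHigh - rateLow)
        return delayLow + time.Duration(frac*float64(delayHigh-delayLow))
    }

    func main() {
        fmt.Println(connPause(10))   // 10ms floor
        fmt.Println(connPause(505))  // halfway between the two delays
        fmt.Println(connPause(2000)) // clamped at 5s
    }
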
[14:13] jam: ok, i'll see if i can tweak it. how are things at site?
[14:16] wallyworld: Juju is up and running, all agents are green.
[14:16] status takes 10min
[14:16] which is not great, but it succeeds
[14:16] and agents all green, which is good
[14:17] the Controllers are all using 200-300% cpu, as is mongo
[14:17] hmmm
[14:17] wallyworld: so I think this is our "juju goes into consuming cpus baseline" that we saw with the JAAS tests
[14:17] at least we can now start profiling
[14:17] wallyworld: so *right* now, I'm working with heather and nick to get us a place we can run "go tool pprof --svg"
[14:17] great
[14:17] wallyworld: yeah, we can't get files out of the system, so we have to install go and juju source, etc.
[14:18] joy
[14:34] jam: technically or politically?
[14:35] wpk: mostly politically
[14:41] jam: because technically there's always https://www.aldeid.com/wiki/File-transfer-via-DNS ;)
[14:44] wpk: :)
=== hml_ is now known as hml
=== salmankhan1 is now known as salmankhan
=== akhavr1 is now known as akhavr
[23:46] babbageclunk: veebers: anastasiamac: standup?
[23:47] sorry, omw
[23:47] wallyworld: d'oh snuck up on me omw
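
As a footnote to the profiling discussion at [08:07] and [14:17]: a CPU profile dumped roughly as in the sketch below can later be rendered with "go tool pprof --svg <binary> <profile>". The file name and the 30s duration are arbitrary, and this is not how jujud itself wires up profiling.

    package main

    import (
        "log"
        "os"
        "runtime/pprof"
        "time"
    )

    func main() {
        f, err := os.Create("cpu.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Record CPU samples for 30 seconds while the process does its work.
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        time.Sleep(30 * time.Second) // stand-in for the workload being profiled
        pprof.StopCPUProfile()
    }
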