[00:09] <babbageclunk> veebers: yes please!
[00:13] <veebers> babbageclunk: cool, one snuck through, but failed. Let me check why
[03:25] <babbageclunk> axw: eg looks really neat, thanks for the tip!
[03:25] <axw> babbageclunk: cool :)
[03:25] <axw> babbageclunk: FYI the PR I mentioned is https://github.com/juju/juju/pull/7446, the template I used is in the description
[03:28] <wallyworld> axw: what do you think about adding the mongotop metrics to a prometheus collector? and other things like txn.logs size
[03:30] <axw> wallyworld: there is an existing prometheus exporter (https://github.com/dcu/mongodb_exporter) which I think we should use if possible. last time I tried to use it, it was a bit panicky
[03:30] <axw> wallyworld: not sure if that captures per-collection sizes. if it does not, adding txn.logs size sounds like a good idea to me
[03:31] <wallyworld> axw: agree to use something existing if possible. top gets useful stats which IMO we'd want to graph over time and correlate with other measurements
[03:31] <axw> yup
[03:36] <axw> wallyworld: I think there might be one already, but if there's not we should look at snapping the mongodb prometheus exporter, to make it super easy to set up on the controller
[03:37] <wallyworld> axw: that would be nice. as an aside, i had a brief look at the prometheus snap itself and didn't see an easy way to tell it to use a given config yaml, but i didn't look too hard
[03:38] <axw> wallyworld: there should be an existing config file, I forget where... search for prometheus.yml under /snap/prometheus
[03:38] <axw> wallyworld: also see https://awilkins.id.au/post/juju-2.1-prometheus/ if you haven't already, might be helpful
[03:39] <wallyworld> axw: yeah there is one, but it sorta sucks to have to search for it and replace it and restart the process
[03:40] <wallyworld> axw: i already have prometheus running against a local controller; not much to see as it's not busy
[03:40] <axw> wallyworld: maybe we should provide a tool to reconfigure a prometheus to add scrape targets for juju controllers?
[03:41] <wallyworld> now that would be good
[03:45] <wallyworld> axw: are you able to look at fixing the introspection worker to support cpu profiling as a quick win?
[03:45] <axw> wallyworld: it does support it, it's just the script that's broken
[03:46] <axw> wallyworld: I can look at fixing the script if it's really important
[03:46] <wallyworld> right, i meant the script. i'm not 100% sure what needs to change. replace GET with curl?
[03:46] <axw> wallyworld: I'm not sure either. I can look at it
[03:46] <wallyworld> would be good to have it work out of the box for 2.2
[03:46] <wallyworld> since we are upgrading the customer to 2.2 controllers
[03:46] <axw> ok
[06:10] <wallyworld> jam: on the surface of it, i can't see a way to intercept incoming http connections prior to the tls negotiation stage to reject logins at that point. there's some methods on the tls.Config that appear to be called for each request that we can override, but doing so results in an internal error in the std lib code. did you have any thoughts on how to implement?
[06:25] <jam> wallyworld: I didn't have any thoughts yet. my first instinct would be to have a custom Listener
[06:26] <wallyworld> jam: yeah, getting the right points to intercept before tls happens is the fun bit
[06:26] <wallyworld> bboab, school pick up
[06:26] <jam> wallyworld: tls.Config takes a net.Listener
[06:26] <jam> so if we wrap the passed in net.Listener with our own
[06:26] <jam> I think it could work
[06:27] <jam> line 226 of apiserver/apiserver.go
[06:27] <mup> Bug #1696311 opened: layer-basic does not support centos7 <juju-core:New> <https://launchpad.net/bugs/1696311>
[06:30] <mup> Bug #1696311 changed: layer-basic does not support centos7 <juju-core:New> <https://launchpad.net/bugs/1696311>
[06:42] <wallyworld> jam: yeah, just poking around. part of the issue is that it's only agent logins we want to throttle. and we only get to read the data off the rpc request to determine that once we've established the secure connection.
[06:52] <jam> wallyworld: right, if we just added a 1s sleep, or a load-based sleep, I think we could still get away with it, we could do a *bigger* sleep later
[06:52] <jam> or we could do it by IP address
[06:53] <jam> 'local' addresses get a bigger delay as they are more likely to be agents vs client
[06:53] <jam> we could just slow down all Connects when we're under load/based on number of active connections, etc.
[06:53] <wallyworld> that might work initially
[06:53] <jam> and then slow down even further once we get to Login layer
[06:54] <jam> wallyworld: to slow down retries, I had initially investigated a sleep before returning the error
[06:54] <jam> which should still reduce total load
[06:54] <jam> it's just nice to also reduce it before you get TLS handshake stuff
[06:54] <wallyworld> jam: right, i am adding an optional pause to the limiter
[06:55] <wallyworld> so Acquire() might not return immediately even if it can get a slot
[06:56] <wallyworld> actually, i am looking at pausing before polling the channel
[06:58] <jam> wallyworld: you mean pause-before-Accept?
[06:58] <wallyworld> in the Acquire() method of limiter
[06:59] <wallyworld> pause before attempting to acquire a login slot
[06:59] <wallyworld> juju/utils/limiter.go
[07:00] <wallyworld> that will throttle the agents. maybe not the best place to do it?
[07:00] <wallyworld> seems like it was nice and transparent to the server
[07:01] <wallyworld> i guess login limit is only 10
[07:01] <wallyworld> so it may not help that much
[07:02] <wallyworld> but it will delay any err retry
[07:02] <wallyworld> so that it limits the cost of the agents trying again and again
[07:02] <jam> wallyworld: so, I wouldn't do it universally in the generic code, but you could pass in an optional 'time.Duration' if we wanted
[07:02] <jam> but just doing it at line 92 of admin.go
[07:02] <wallyworld> right, that's what i'm doing
[07:02] <jam> knows that we're explicitly rate limiting *logins* right there
[07:03] <wallyworld> passing in an optional duration to NewLimiter()
[07:03] <jam> wallyworld: sure, and that's also potentially testable, etc.
[07:03] <wallyworld> yep
[07:04] <wallyworld> and pausing before Acquire() means the agents are truly blocked
[07:04] <wallyworld> as no ErrRetry is issued
[07:04] <wallyworld> and so they can't just ping again
[07:04] <wallyworld> immediately
[07:04] <wallyworld> or that's my theory anyway
[07:05] <jam> wallyworld: sure, before or after Acquire is fine
[07:05] <jam> just before returning an error
[07:06] <wallyworld> yep
[07:45] <wallyworld> jam: here's a utils PR https://github.com/juju/utils/pull/281
[07:46] <wallyworld> bah, i broke API, I will need to fix
[07:46] <jam> wallyworld: I feel like we need (min, max) instead of (0, max) thoughts?
[07:47] <wallyworld> yeah ok, can easily add
[07:47] <jam> or something like (avg, stddev) where we just pick some value for stddev based on avg
[07:47] <wallyworld> and i'll fix the api too
[07:47] <wallyworld> hmmm, do we really need that aside from min,max?
[07:48] <jam> wallyworld: so it's the same effect, just thinking about what is useful to express
[07:48] <jam> well, stddev means you would have a normal distribution instead of a flat one,
[07:48] <jam> not sure that is useful
[07:48] <mup> Bug #1696311 changed: layer-basic does not support centos7 <Charm Helpers:New> <https://launchpad.net/bugs/1696311>
[07:49] <jam> wallyworld: so even just 'max' is better than nothing
[07:49] <jam> it just means the 'average' time is going to be 'max/2'
[07:52] <wallyworld> jam: i'll add the min, easy enough
[07:53] <wallyworld> after dinner though
[07:54] <jam> wallyworld: reviewed
[08:07] <jam> wallyworld: I do wonder if we could have a way to know "I've got a lot of load right now, lets slow down active connections a bit more", and provide backpressure
[08:18] <wallyworld> jam: i also think that we need to do more - this current change is just a small step
[08:20] <Mmike> Hi, lads. Is there a way to configure juju to store less than 4GB of logs in mongodb?
[09:56] <thumper> hmm...
[09:56] <thumper> trying to use the peer-xplod charm from the acceptance tests
[09:56] <thumper> getting errors with lxd where it says '/usr/bin/env python' doesn't exist
[09:57] <thumper> root@juju-61a95f-0:~# /usr/bin/env python
[09:57] <thumper> /usr/bin/env: ‘python’: No such file or directory
[09:57] <thumper> from the machine itself
[09:59] <thumper> seems like the current lxd xenial images only have python3
[10:05] <jam> thumper: indeed, xenial doesn't come with python 2
[10:05] <thumper> :-|
[10:05] <jam> thumper: I thought I had dealt with that once in the past, but maybe that was on my version of the charm and not the one they are using ?
[10:05] <jam> thumper: 'apt install python2' in 'install'
[10:06] <thumper> yep
[10:06] <thumper> did that
[10:06] <thumper> although i used apt-get so it works on trusty too
[10:06] <jam> thumper: sure
[10:06] <thumper> :)
[10:06] <jam> I have 'apt install -y python' in mine
[10:08] <jam> thumper: is it a ~juju-qa charm ?
[10:08] <thumper> no, the one in acceptancetests dir in tree now
[10:10] <jam> there are a couple small changes between the one in tree and lp:~jameinel/charms/trusty/peer-xplod
[10:11] <jam> nothing particularly major, just the 'apt-get install' and some small things about 'maximum=0' intending to be unlimited
[10:11] <jam> thumper: want me to put a PR that brings them in sync?
[10:11] <thumper> jam: sure, if you have the time
[10:16] <jam> thumper: https://github.com/juju/juju/pull/7463
[10:29] <wallyworld> jam: here's a WIP which uses the login rate limiting plus a general connection throttle https://github.com/juju/juju/compare/2.2...wallyworld:throttle-controller-connections?expand=1
[10:29] <jam> wallyworld: WIP, WIP it good :)
[10:29] <wallyworld> does it look reasonable? i plucked the numbers out of the air
[10:30] <wallyworld> funny man
[10:30] <jam> wallyworld: so I'm wondering why we are sleeping longer for Conn than Login
[10:30] <jam> wallyworld: I would have thought 1s for conn, and 5s for login
[10:30] <wallyworld> i can do that
[10:31] <wallyworld> i thought login was limited to 10 at a time anyway
[10:31] <wallyworld> but conns once logged in could grow more
[10:32] <wallyworld> probably flawed thinking
[10:32] <jam> wallyworld: so conn affects users as well as agents, but you're right that the login rate limit only triggers once we're at 10 active
[10:32] <jam> ah sorry, we always acquire so we would always hit that
[10:32] <jam> but only for agents
[10:33] <wallyworld> yeah, this latest wip does affect clients as well
[10:33] <wallyworld> but if the system is really, really loaded, then even they should wait a bit?
[10:33] <wallyworld> they will see a slow down anyway
[10:33] <jam> wallyworld: 1s is fine IMO
[10:33] <wallyworld> 1s max
[10:33] <jam> the question is whether that is *enough* generally, but adding an extra 5 for agents probably will be
[10:33] <wallyworld> and 5ms per conn?
[10:34] <jam> wallyworld: so a max 1s delay for Conn to return and a 5s extra delay for Agent Login to return 'go away'.
[10:35] <jam> neither is what I'd like in 'ideal world' which would be focused on scaling the numbers based on number of active connections
[10:35] <jam> but it's probably a start
[10:35] <wallyworld> jam: so the 5s max for Accept() was really to attempt to throttle the thundering herd, and the pause time only grows by 5ms per conn
[10:35] <wallyworld> yeah, this is a quick win for 2.2rc2
[10:35] <jam> ah, I missed that throttling went up and down
[10:35] <wallyworld> on a normally loaded system there should be no noticeable difference
[10:36] <wallyworld> yeah, it grows as we get more connections accepted
[10:36] <jam> wallyworld: so 5s on Conn isn't great. it affects 'juju status' when running on lxd
[10:36] <jam> 'why is it taking 5s to get a result back with 2 machines'
[10:36] <wallyworld> that's 5s max
[10:37] <jam> wallyworld: still avg 2.5s
[10:37] <wallyworld> only if there are 1000 connections
[10:37] <wallyworld> the max time grows
[10:37] <wallyworld> well, that was the intent
[10:37] <wallyworld> start at min 10ms or so, and then the max pause time grows with conn count
[10:37] <jam> wallyworld: ah sorry, I've twisted it in my head,
[10:37] <jam> just got coffee
[10:38] <wallyworld> np, i'm tired so i could have messed up
[10:38] <wallyworld> so for accept, on a normally loaded system -> no discernible difference
[10:38] <wallyworld> but all connections are forced to wait a bit as conn count grows
[10:38] <jam> wallyworld: so, all Accept() attempts have a 10ms floor that increases by 5ms for every active connection
[10:39] <wallyworld> yeah
[10:39] <jam> up to a max of 5ms from Accept until we do the SSL handshake
[10:39] <wallyworld> max of 5s
[10:39] <jam> on Comcast world, that will, on average have 2500/3 = 800, say 1000 active agents
[10:39] <jam> every 'juju status' will be slower by 5s
[10:40] <wallyworld> ah right because the connections are long lived
[10:40] <wallyworld> i could do it based on rate of connection
[10:40] <jam> wallyworld: right, not for the *clients* which have to pay that on every connect
[10:40] <jam> wallyworld: but all the agents which have long-lived only pay it 1x
[10:40] <jam> wallyworld: something like 'number of connections in the last X seconds' would be good
[10:41] <wallyworld> yep, that would solve the thundering herd issue
[10:41] <wallyworld> i can tweak it
[10:41] <jam> wallyworld: (arguably we could do per-IP tracking or something, but again, that would be penalizing users that are actively engaging with the system)
[10:41] <jam> we really just want the pushback on agents
[10:41] <jam> and we only know that at the Login time
[10:41] <wallyworld> agreed, but we don't concretely know what those ip addresses are at that point
[10:41] <wallyworld> we can guess, but....
[10:41] <jam> wallyworld: yeah, I don't think we want to do IP based, cause then you have to track all of that
[10:42] <jam> I think just doing 'how many have connected in the last X' and slow it down up to 5s is ok
[10:42] <wallyworld> so i reckon 5ms per X rate of new connections
[10:42] <wallyworld> yep, up to 5s max
[10:42] <jam> wallyworld: I'd then also have Login that is going to *reject* an agent to come back later, wait another 5s
[10:43] <jam> wallyworld: which means all the people over the current 10 that we are going to reject, get delayed a little bit extra
[10:43] <jam> and I'm not opposed to something that delays before Acquire as well
[10:44] <wallyworld> jam: so add a pause when limiter.Acquire() returns false?
[10:44] <wallyworld> i think delay before is ok too
[10:44] <jam> wallyworld: those are the ones that will be reconnecting 3s later
[10:44] <wallyworld> ok, i can add another param to NewLimiter()
[10:45] <wallyworld> fixed time to pause if a reject happens
[10:45] <jam> wallyworld: it's not hard to put it just before the "return ErrRetry"
[10:45] <wallyworld> yeah, ok
[10:46] <wallyworld> jam: so hopefully the net effect of this (pun half intended) is to allow things to come up more controlled without resorting to IP tables
[10:46] <jam> wallyworld: yeah, we need to set up some testing of 'restart times' so we can tune some of these numbers
[10:46] <wallyworld> next thing would be to throttle log connections
[10:47] <wallyworld> yeah, testing needed for sure
[10:47] <jam> wallyworld: I can probably set wpk on it today
[10:47] <jam> he seemed interested
[10:47] <wallyworld> ok, i'll finish this work
[10:47] <jam> wallyworld: I'm also curious what the net effect would be if you are running in HA
[10:47] <jam> a given controller is going to push back, but will the others, etc
[10:47] <wallyworld> yeah
[10:47] <wallyworld> jam: i almost convinced myself those delay params should be configurable, not consts
[10:48] <wallyworld> so we can play with the numbers
[10:48] <wallyworld> maybe via env vars
[10:48] <jam> wallyworld: well, I would hack them with ENV vars, etc to test it
[10:48] <jam> wallyworld: but it also is something that as soon as we know *we* want a knob
[10:48] <jam> somebody else will ask for it
[10:48] <wallyworld> right, but we hide that knob
[10:48] <wallyworld> those env vars are not publicised
[10:49] <wallyworld> but we can ask CI to set up a system with lots of xplod charms, get it to steady state, see how it goes, and then kill the controller and see what happens then as well
[10:49] <wallyworld> and tweak the numbers
[11:09] <axw> wallyworld jam: https://github.com/juju/juju/pull/7465 has updates to support CPU profiling in the introspection CLI, as well as adding support for easily exposing as HTTP
[11:10] <axw> wallyworld jam: I started down the road of just modifying the bash code a little bit, but it was very fragile. so ended up with something a bit more comprehensive...
[11:13] <jam> axw: is this a bit too much for a 2.2 at this point? I suppose we aren't changing the actual socket, nor are we changing the scripts that we used to support
[11:13] <jam> just how they connect
[11:13] <jam> and possibly exposing a new thing people will use
[11:13] <jam> it's nice to not need to 'apt install socat' all the time
[11:14] <jam> small note 'juju-introspect' or 'jujud-introspect'... not sure
[11:14] <jam> myself
[11:14] <jam> I guess it is 'juju-run'
[11:14] <jam> though honestly *that* one is mostly a source of confusion
[11:15] <axw> jam: the alternatives I can see are: (a) do nothing, (b) use curl, which makes the command more fragile (because of timing issues starting socat, curl not necessarily having --retry, and other weirdness around socat)
[11:16] <axw> jam: IMO, this could wait for 2.2.1. it's possible to do all these things already with 2.2, just not in a neat command
[11:17] <jam> axw: so the singlehostreverseproxy is to handle redirecting HTTP to a unix socket?
[11:17] <jam> well, abstract domain socket
[11:17] <axw> jam: yep
[11:20] <jam> axw: to check are we changing the raw content output then?
[11:20] <jam> you made a comment about not having the headers
[11:20] <jam> which sounds good
[11:20] <jam> but does mean the actual output of "juju-goroutines > saved.txt" is going to be slightly different?
[11:20] <jam> (AFAICT, it actually means you don't have to munge the file before it is actually useful)
[11:20] <axw> jam: yes. it's the same except without the HTTP response header
[11:20] <axw> jam: right
[11:21] <jam> axw: my concern is anyone who's scripted it may be removing it themselves and we're breaking that
[11:21] <jam> that's the sort of thing we "shouldn't do in a .patch release", I think
[11:22] <jam> axw: I do believe it was a gotcha trying to use things like the heap profile
[11:22] <jam> so ultimately better
[11:22] <jam> but probably a risk for putting it into rc2, but also a big win for not breaking it in a .patch
[11:23] <axw> jam: I'm not aware of anyone interpreting them anyway - are you? not that that's proof or anything, but I am curious. they've always just been handed back to dev IME
[11:25] <jam> axw: well, *I've* used them to run against go tool, and it's always been a pain that you have to munge. It's certainly the sort of thing where I'd want us to be careful with compat
[11:25] <jam> axw: and saying "<2.2.0 you need to trim the front, but we do that automatically in 2.2" sounds much better than
[11:25] <jam> in '2.2.1'
[11:25] <axw> jam: yep, fair point
[11:25] <jam> axw: I'd *like* others to chime in on the "should it be 2.2.0rc2 or 2.2.1"
[11:26] <jam> but you have my vote
[11:26] <axw> jam: thanks. I will wait for wallyworld and thumper to chime in at least
[11:27] <jam> a couple small things
[11:27] <jam> you list the symlinks in one list over here, but individually multiple times over there
[11:27] <jam> and 'juju-introspection' vs 'jujud-introspection'.. I'm not sure there, either
[11:27] <jam> juju- matches other things, but really we are introspecting a jujud
[11:31] <axw> jam: yep, thanks I'm fixing that list. I'm -0 on jujud-introspect because it has a different prefix to the introspection helpers (juju-goroutines, juju-heap-profile, etc.). they're all about jujud too, but I don't think it'd be helpful to users to have two different prefixes for the same class of commands
[11:31] <jam> fairy nuff
[11:32] <axw> jam: family's home, gtg. thanks for the review
[11:32] <thumper> axw: shipit for 2.2-rc2
[11:32] <thumper> axw: I was just considering something like this myself
[11:32] <thumper> so yay
[11:32] <axw> thumper: okey dokey. I believe the bot is disabled, so how does one do that?
[11:32] <thumper> axw: one asks one of the QA folk to poke the bot manually
[11:32] <axw> ah I have to run, I'll check back later
[11:33] <thumper> axw: probably need to get balloons to do it when he starts
[11:33] <jam> balloons: ^^ https://github.com/juju/juju/pull/7465
[11:33]  * thumper should go to bed
[11:33] <jam> we would like to land that for 2.2rc2
[11:33] <thumper> well, go do dishes first
[11:33] <thumper> night all
[11:33] <jam> thumper: go sleep :)
[20:03] <marcoceppi> how can I upgrade to 2.2-rc1 from a previous stable version?
[20:04] <marcoceppi> --agent-version=2.2-rc1 says "ERROR no matching binaries available"
[20:31] <marcoceppi> I got it upgrading, but how long should an in-place upgrade take?
[20:43] <wallyworld> marcoceppi: see the release notes for rc1 - we split the logs into per-model collections so for this upgrade, it can take a while
[20:43] <wallyworld> the upgrade may need to split apart up to 4GB of logs
[20:47] <marcoceppi> wallyworld: thanks
[20:48] <wallyworld> marcoceppi: i'm guessing it took maybe 5 or 10 minutes?
[20:48] <wallyworld> we should surface a more complete message than just "upgrading" perhaps
[20:49] <wallyworld> this was done to improve the model destroy performance for large numbers of models
[21:07] <marcoceppi> wallyworld: I think my upgrade might be stuck, but I have no way of telling
[21:07] <marcoceppi> it was started at 48 after the hour
[21:08] <wallyworld> was it a big deploy?
[21:08] <marcoceppi> disk space consumption has not changed, and the logs are mostly filled with "login denied, upgrade in progress"
[21:08] <marcoceppi> 6 machines
[21:08] <marcoceppi> 1 model
[21:08] <marcoceppi> but it was a 2.0.4 -> 2.2-rc1
[21:09] <wallyworld> should work though
[21:09] <wallyworld> are you able to get a mongo shell and do a db.logs.size() and also a size on the new model logs collection to see if the records are still being copied?
[21:10] <wallyworld> the new logs collection is something like logs.<modeluuid>
[21:11] <marcoceppi> wallyworld: how do I get a mongo shell?
[21:12] <wallyworld> ssh to controller, and then mongo --ssl -u admin -p <oldpassword> localhost:37017/admin --sslAllowInvalidCertificates
[21:12] <wallyworld> where oldpassword is sudo grep oldpassword /var/lib/juju/agents/machine-0/agent.conf
[21:12] <wallyworld> then once in shell, do a "use juju"
[21:13] <wallyworld> that selects the juju database
[21:16] <marcoceppi> let me take a look
[21:40] <babbageclunk> wallyworld: should I pick up a bug from the release blockers section?
[21:40] <wallyworld> babbageclunk: in release call now, just discussing what needs to be done
[21:40] <babbageclunk> ok
[22:03] <marcoceppi> wallyworld: I get login failed with that command
[22:03] <marcoceppi> but the upgrade completed
[22:03] <marcoceppi> so I don't care anymore
[22:05] <wallyworld> marcoceppi: sweet, ok. but we should report better
[22:07] <wallyworld> babbageclunk: HO in standup?
[22:09] <babbageclunk> wallyworld: sure
[22:16] <marcoceppi> wallyworld: I do have another problem
[22:17] <marcoceppi> since the upgrade `juju models` hangs
[22:21] <wallyworld> marcoceppi: ah bum, ok
[22:21] <wallyworld> we haven't seen that
[22:21] <babbageclunk> :(
[22:21] <wallyworld> can you turn on debug logging and see what it says?
[22:21] <wallyworld> raise a bug for sure with as much detail as possible
[22:29] <marcoceppi> wallyworld: it just says connected to ws
[22:31] <wallyworld> marcoceppi: does show-model work?
[22:31] <marcoceppi> wallyworld: add and destroy model work
[22:32] <wallyworld> show-model?
[22:32] <marcoceppi> wallyworld: nope
[22:32] <marcoceppi> wallyworld: http://paste.ubuntu.com/24803880/
[22:32] <marcoceppi> wallyworld: it says "connection established" then that's it
[22:33] <thumper> well bollocks
[22:33] <wallyworld> marcoceppi: can you turn on debug logging and provide a snippet from juju debug-log
[22:33] <marcoceppi> I think debug logging is on?
[22:33] <wallyworld> juju model-config logging-config="<root>=DEBUG;"
[22:34] <thumper> marcoceppi: juju debug-log -m controller
[22:34] <marcoceppi> model config hangs
[22:34] <thumper> this is a pretty serious regression
[22:34] <wallyworld> look at current logging-config first so you can set it back later. juju model-config
[22:34] <marcoceppi> model-config hangs all together
[22:34] <wallyworld> wtf
[22:35] <marcoceppi> to be fair, two hours ago this was a 2.0-beta18 controller
[22:35] <thumper> marcoceppi: wat?
[22:35] <wallyworld> can you log onto the controller and look at the apiserver.log file
[22:35] <marcoceppi> 2.0-beta18 -> 2.0.4 -> 2.2-rc1
[22:35] <thumper> marcoceppi: I'm not sure beta 18 was upgradable
[22:35] <marcoceppi> thumper: well, 2.0.4 worked
[22:35] <thumper> marcoceppi: we didn't say upgradable until 2.0-rc1
[22:35] <thumper> hmm...
[22:36] <thumper> in theory, it should work
[22:36] <thumper> marcoceppi: 'juju debug-log -m controller --replay | pastebinit'
[22:36] <wallyworld> once we see server logs, we can deduce what's wrong hopefully
[22:36] <marcoceppi> well now everything is hanging
[22:36] <marcoceppi> let me see what is happening on the server
[22:38] <marcoceppi> load of 13, helllooo
[22:39] <marcoceppi> okay, model-config works, models doesn't
[22:44] <marcoceppi> thumper: http://paste.ubuntu.com/24803956/
[22:45] <marcoceppi> wallyworld: ^
[22:46] <thumper> machine-0: 18:38:52 DEBUG juju.utils setting GOMAXPROCS to 1
[22:46] <thumper> huh?
[22:46] <marcoceppi> my hope is I can just "model migrate" this to 2.2.0 and resolve a lot of whatever the hell I did
[22:47] <thumper> I wonder why we are seeing so much of this: machine-0: 18:38:54 DEBUG juju.mongo dialled mongodb server at "10.142.0.2:37017"
[22:49] <marcoceppi> you all want ssh?
[22:49] <wallyworld> thumper: it appears the api worker can't start
[22:49] <wallyworld> maybe
[22:50] <marcoceppi> jujud is pegging this controller at 100%
[22:50] <marcoceppi> but it's been doing that since 2.0-beta18
[22:50] <marcoceppi> happy to give this vm more resources if that's what it takes
[22:50] <thumper> marcoceppi: probably a broken setup...
[22:50] <thumper> it shouldn't be doing that
[22:50] <marcoceppi> that's what I wanted to go to 2.2, get them perf fixes
[22:50] <thumper> heh
[22:51] <thumper> marcoceppi: need to do this "juju model-config -m controller logging-config=juju=debug"
[22:51] <marcoceppi> and CMR ,and like all the other good things
[22:52] <thumper> then some debug log over the models call
[22:57] <marcoceppi> I've apparently exhausted memory
[22:57] <marcoceppi> http://paste.ubuntu.com/24804029/
[22:58] <marcoceppi> I'm going to bump up the VM
[23:05] <marcoceppi> rebooted, more cpu/ mem
[23:05] <marcoceppi> now I get this
[23:05] <marcoceppi> marco@T430:~$ juju models
[23:05] <marcoceppi> ERROR cannot list models: upgrade in progress (upgrade in progress)
[23:05] <marcoceppi> marco@T430:~$ juju switch controller
[23:05] <marcoceppi> silph.io-prod1:admin/test -> silph.io-prod1:admin/controller
[23:05] <marcoceppi> marco@T430:~$ juju status
[23:05] <marcoceppi> Model       Controller      Cloud/Region     Version  Notes                               SLA
[23:05] <marcoceppi> controller  silph.io-prod1  google/us-east1  2.2-rc1  upgraded on "2017-06-07T21:13:29Z"  unsupported
[23:05] <marcoceppi> App  Version  Status  Scale  Charm  Store  Rev  OS  Notes
[23:05] <marcoceppi> Unit  Workload  Agent  Machine  Public address  Ports  Message
[23:05] <marcoceppi> Machine  State  DNS            Inst id        Series  AZ          Message
[23:05] <marcoceppi> 0        down   35.185.85.250  juju-c9c599-0  xenial  us-east1-b  RUNNING
[23:05] <marcoceppi> marco@T430:~$ juju models
[23:05] <marcoceppi> ERROR cannot list models: upgrade in progress (upgrade in progress)
[23:06] <marcoceppi> crap
[23:06] <marcoceppi> http://paste.ubuntu.com/24804067/
[23:23] <thumper> marcoceppi: it may well be migrating the logs
[23:23] <thumper> marcoceppi: that will take some time
[23:23] <thumper> marcoceppi: to move 4G of logs on my laptop with an SSD was over 7 minutes
[23:52] <axw> veebers: hey, would you please land https://github.com/juju/juju/pull/7465 for 2.2? it has thumper's seal of approval
[23:52] <thumper> axw: we asked veebers to stop making 2.2 special for now
[23:52] <axw> thumper: ah ok
[23:53] <veebers> thumper: ah yeah, I'll fix that up now, sorry
[23:53] <thumper> but we'll keep an eye on who submits what
[23:53] <thumper> veebers: thanks
[23:53] <axw> okey dokey
[23:54] <veebers> thumper, axw: done it should just go through as per normal (once picked up)
[23:54] <axw> veebers: cheers
[23:55] <veebers> thumper, axw: any idea what else needs to land for rc2?
[23:55] <axw> veebers: azure auth stuff
[23:55] <axw> veebers: which has changed since I reviewed it, re-reviewing now
[23:55] <thumper> veebers: I'm adding some stuff around state export
[23:56] <thumper> veebers: wallyworld is working on a statushistory deletion bug
[23:56] <thumper> veebers: possibly wallyworld's connection backoff code
[23:56] <thumper> axw: can I get you to look over that too?
[23:56] <wallyworld> babbageclunk is working on the delete bug
[23:56] <thumper> wallyworld: ok, ta
[23:56] <axw> thumper: sure
[23:57] <veebers> thumper, axw: ack. If you can keep burton and myself in the loop so we know which CI runs to track (and baby) so we're ready to rock and/or roll when needed for release
[23:57] <thumper> hmm... dealing with a facade bump where we change the args and return values...
[23:58] <thumper> veebers: yep, sure
[23:58] <thumper> veebers, wallyworld: we also need to work out why the capped collection overflow didn't stop the agents
[23:58] <thumper> it *should* have caused all agents to stop immediately
[23:59] <wallyworld> depends if CPU was overloaded etc
[23:59] <wallyworld> agents stop once channel selects are processed etc