[00:00] we've got a charm that seems to be stuck on installing, but the juju logs say the start hook ran, any ideas on how we can dig into it? [00:00] thumper: wallyworld ^ this is related to current IS issues [00:00] ok, thanks for update [00:01] perrito666: that does indeed seem to be a serious problem [00:02] perrito666: file a bug and point thumper at it. one of us will fix it soon. [00:02] * thumper keeps his head down [00:02] thumper: candy? [00:02] * thumper mutters under his breath something about deadlines and too much work [00:03] * thumper ignores the candy [00:03] perrito666: we'll have to add an upgrade step to fix existing records [00:07] bradm: if you haven't already can you look at the logs on the unit's machine itself? (/var/log/juju/unit-FOO.log) [00:09] menn0: all good now, it was just taking a long time to realise it was up [00:10] bradm: ok, good to hear [00:10] well, all good might be a stretch, but its all onto ceph now [00:10] its making good progress [00:14] perrito666: I can certainly see the statuses env-uuid problem in a local env here [00:14] menn0: adding the bug with detail [00:14] NOTICE: jujucharms.com and the charmstore are back up. The storage in IS is working to rebalance/sync and might time out or be slow for a bit longer. [00:14] menn0: ^ [00:15] rick_h_: sweet [00:15] perrito666: I wonder how status lookups are even workings at all [00:17] menn0: me too [00:17] menn0: thumper https://bugs.launchpad.net/juju-core/+bug/1474606 [00:17] Bug #1474606: entities status is loosing env-uuid upon setting status. [00:19] menn0: enjoy [00:19] hah, he says as he takes his candy back [00:20] rick_h_: well I left a very nice report in exchange [00:20] menn0: this happens for services, units, agents, machines and every other thing that has a status [00:22] perrito666: I see you've already got a fix for it too [00:22] perrito666: although I might try and fix this in the multi-env txn layer [00:22] too [00:23] menn0: I do, but I was not sure if I cover all aspects of this issue [00:23] I just fixed my patch of land [00:24] perrito666: understood [00:26] what the heck is this test doing ? [00:26] FAIL: kvm-broker_test.go:241: kvmBrokerSuite.TestStartInstancePopulatesNetworkInfo [00:26] [LOG] 0:00.001 DEBUG juju.testing setting feature flags: address-allocation [00:26] kvm-broker_test.go:251: instanceConfig := s.instanceConfig(c, "42") [00:26] /home/ubuntu/src/github.com/juju/juju/container/testing/common.go:90: c.Assert(err, jc.ErrorIsNil) [00:26] ... value *os.PathError = &os.PathError{Op:"mkdir", Path:"/var/lib/lxc", Err:0xd} ("mkdir /var/lib/lxc: permission denied") [00:26] of course this is going to fail [00:26] mortals don't have permissino to write to that dir [00:33] wallyworld: just confirmed... rsyslog is screwed both before and after the upgrade [00:34] :( [00:34] writing up a ticket now [00:34] oh dear [00:35] in different ways though so you know, at least that's interesting [00:35] even better [00:35] I think the issue in 1.24.0 has already been fixed in a later 1.24 [00:36] the rsyslog issue? so do we need to ammend an upgrade step? [00:37] not sure yet [00:37] before the upgrade rsyslogd is continually being restarted by juju [00:37] every 30s or so [00:37] sounds like a bug we may have fixed yeah [00:37] the only thing in juju's logs is the rsyslog worker saying "reloading rsyslog config" [00:38] after the upgrade that stops [00:38] but most/all of the units can't connect [00:38] due to cert verification [00:39] thumper: "I did say that the way to fix this properly was to not use the state method to load the charms." -- what else would you use to enumerate charms, if not state? [00:39] which seems like the thing that's been attempted to be fixed several times now [00:39] anyway, i'll write up the ticket [00:39] wallyworld: ^^ [00:39] Bug #1474606 opened: entities status is losing env-uuid upon setting status.

[00:39] Bug #1474607 opened: worker/uniter/relation: HookQueueSuite.TestAliveHookQueue failure [00:39] axw: making a request of the collection and have it return the raw data as dicts [00:40] axw: that way you aren't expecting a particular structure [00:40] axw: we have had to write many upgrade steps in this way [00:40] and I believe it is better [00:40] because you just ask for what you need, and change what you must [00:40] and don't worry about the structure of the doc as much [00:41] menn0: sorry was distracted by a review. i recall vaguely the cer issue was fixed. rsyslog i think uses a different cert to state server connections [00:42] wallyworld: do you know when it was fixed given that I'm seeing this with an upgrade to 1.24.2? [00:42] menn0: not offhand - ericsnow might know [00:45] thumper: i am going to fix it like you wanted in 1.22 - do we really need to do the work in 1.24 if the step runs after env uuid has been added? [00:45] if yes, i can fix [00:47] wallyworld: is there a test we can add that would have highlighted the issue? [00:47] wallyworld: and that would highlight future issues [00:48] axw: for this case yes - a CI test that adds charms to the 1.20 env prior to upgrade and then checks that they are imported after [00:49] that's on my todo list to follow up on [00:50] wallyworld: presumably we could write a unit test that adds a charm entry to state, and wipes out its env-uuid field to exercise the bug? but the CI test would be better, since it'll catch future bugs too [00:50] yeah, that was my thinking [00:52] menn0: and for further joy, bug 1469077 is back again, i'll need to point william to it [00:52] Bug #1469077: Leadership claims, document larger than capped size

[00:52] wallyworld: yeah I saw that [00:52] wallyworld: so unawesome [00:53] i know right :-( [00:59] here's the rsyslog ticket: bug 1474614 [00:59] Bug #1474614: rsyslog connections fail with certificate verification errors after upgrade to 1.24.2

[01:00] wallyworld: ^^ [01:00] looking [01:01] thanks for filing, a nice bug report [01:03] menn0: axw will pick up that rsyslog bug [01:05] axw: goose pr lgtm [01:06] wallyworld: thanks [01:09] Bug #1474291 opened: juju called unexpected config-change hooks after read tcp 127.0.0.1:37017: i/o timeout

[01:09] Bug #1474614 opened: rsyslog connections fail with certificate verification errors after upgrade to 1.24.2

[01:19] axw: tagging pr reviewed, but there is a question [01:20] wallyworld: ok, thanks, will look in a sec [01:50] wallyworld: if we don't, it is just a time bomb for the next time things change [01:50] it is makeing a problem for future us [01:51] thumper: looking at the code - there's *lots* of current upgrade steps that use the docs directly, not maps. the only ones that use maps are the ones to inser env uuid [01:53] well... we are just making problems for ourselves IMO [01:54] i thumper your team wrote a lot of them [01:55] you are proably not wrong [01:55] i guess the difference is that the doc are used with the raw collection [01:55] I'm telling you the result of accumulated wisdom [01:55] so rawCollection.Find(&someoc) [01:56] one would hope that CI tests would evolve to better catch upgrade issues [03:46] thumper: sorry was deep in a support scenario. [03:47] thumper: sounds good mate, submit it earlier is way better than later, as the queue takes a couple days to sift down to newly submitted stuff [03:47] lazyPower: no worries [03:47] I'm not going to block my deployment on it getting reviewed :) [03:47] we're averaging ~ 5 days on initial touch, still trying to get that number down, but its way better than the 13 days in history. [03:47] as you shouldn't be :) [03:47] namespaces!!!! [03:47] * lazyPower toots the namespace horn [03:48] cheers [03:50] namespaces? [04:05] thumper, waigani, wallyworld: see email for findings related to bug 1474195 [04:05] Bug #1474195: juju 1.24 memory leakage

[04:05] thumper, waigani, wallyworld: looks like will's theory was right [04:16] menn0: red box of death? [04:17] waigani: yep... see the note on the field [04:17] ah yep, just saw note [04:17] :) [04:18] nice work [04:18] waigani: when you added the auto env life assertion to the txn layer did you remove the ones that already existed elsewhere, or did they not exist anywhere before JES? [04:18] I guess it wasn't really necessary when there was just one env [04:19] menn0: yeah, it's going back a bit now, but I don't remember there being any - which as you point out makes sense. [04:19] cool [04:20] waigani: i'm trying to figure out the right places to check [04:20] adding a machine certainly [04:20] menn0: you mean where we really need to assert for a live environ? [04:20] waigani: yep [04:21] menn0: as a starting point, didn't will say whenever we add a service, unit, relation or machine? [04:22] Bug #1454468 changed: nodes deployed successfully by maas but juju status remains pending with juju 1.23.2 and services stuck in allocating

[04:25] waigani: I wonder if we can reduce that set to just service and machine [04:25] so what's the worst case? we add a unit/relation to a dying environment... [04:26] waigani: I'm wondering if we can just add an environment cleanup to kill them [04:26] waigani: in fact, the current cleanupServicesForDyingEnvironment might already do it [04:29] * thumper is going to lie down [04:29] * thumper is not 100% [04:29] menn0: so that sets the existing services to dying, but expects that no new services can be added to a dying environment [04:30] waigani: standup hangout? (as per PM) [04:30] menn0: yep [05:07] menn0: should be any time we allocate something that costs [05:07] eg machine, storage etc [05:09] wallyworld: yep, that's what i'm looking at now... anything that results in a physical change certainly needs the env life assert [05:09] (as physical as a virtual machine is anyway) [05:09] well, physical change that costs $$$ [05:10] don're really care about containers [05:10] wallyworld: i've got a pretty clear picture now of what I want to do. i'm going to catch will later on tonight to confirm [05:10] but machines yes [05:10] ok [05:10] wallyworld: that's a good point, maybe we don't check for containers [05:11] menn0: yeah, so for stuff that doesn't cost, we just have a cleanup job after env is killed [05:11] wallyworld: yep, and we already have most of that it turns out [05:11] i can't see why we'd check more than is necessary [05:11] and before JES, we didn't check [05:12] so TBH i'm not sure why we started checking with JES [05:13] i guess JES has greater chance of concurrent access [05:13] yep and b/c before when you issued destroy-environment everything died at that point including the API server [05:14] so you had very little opportunity to add a new machine or whatever once the env was dying [05:14] waigani: so is that +2 a ship it? btw - that func needs to be exported because it is in state package [05:14] and called by upgrades package [05:14] now for hosted envs the API server stays up so there's a much great chance of env changing operations as the env is dying [05:14] fair point [05:15] anyway, stopping now since i'm going to be back on later === menn0 is now known as menn0-afk === _stowa_ is now known as _stowa [07:46] dooferlad, morning [07:46] dimitern: hi [07:46] dooferlad, I thought we dealt with the kvm-inaccessable-after-reboot issue in 1.24 as well ? see bug 1474508 [07:46] Bug #1474508: Rebooting the virtual machines breaks Juju networking [07:47] dimitern: I thought so too. [07:48] dooferlad, maybe the fix is in master only? [07:48] dimitern: will need to take a look and see if I missed landing it [07:48] dooferlad, cheers [07:53] dimitern: darn, wasn't backported. [07:53] dimitern: will be trivial to do. [07:56] wallyworld: sorry, just saw your message - this for moving charm tests to state? yes, +2 shipit. [07:56] ta [07:59] wallyworld: ah, I thought I clicked shipit - done now. The pattern of needing exported state funcs for upgrade steps is something fwereade is keen to change - possibly just exporting one upgrade step from state which then calls the other unexported steps. But for now we just need to try to make it clear that while the func is exported, no-one except the upgrades package should be using it. [08:00] waigani: np. and this was for a 1.22 release, so old code [08:02] wallyworld: yeah true [08:06] dooferlad, awesome! will you do the dance then please? - card, bug, etc. [08:06] :) [08:06] sure [08:06] ta! === menn0-afk is now known as menn0 [08:15] fwereade: ping? [08:45] dimitern: http://reviews.vapour.ws/r/2163/ for a quick +2 [08:50] dooferlad, ship it! :) [08:59] fwereade: dimitern: food just arrived so I'm going to miss the standup. But I'm working on breaking down the Uncommitted state stuff into development items (I'd like to chat directly with fwereade later if you have time before our cycle review) === jam1 is now known as jam [09:00] jam, fwereade, TheMue, dimitern: stand up! [09:01] dooferlad, omw [09:01] omw [09:01] jam, thanks for the heads up [10:24] can anyone tell me something about plans relevant to the EnvironmentsCacheFile feature? [10:26] jam, fwereade: ^ [10:28] rogpeppe: I don't particularly know it by that name, but it looks like something thumper would have been doing to support multiple environments [10:28] Bug #1474788 opened: ec2: provisioning machines sometimes fails with "tagging instance: The instance ID does not exist"

[10:29] just by reading its description from https://github.com/juju/juju/blob/master/feature/flags.go#L29 [10:29] jam: i'm just wondering what our future plans are. is the plan to do away with .jenv files entirely? [10:30] rogpeppe: thats how I read the description in there. I haven't heard of that before, nor had read that particular detail in the JES stuff. But it does read that way. [10:30] jam: surely we have some roadmap plans somewhere? [10:30] cherylj: do you know about this, by any chance? [10:31] rogpeppe: so I've got docs for JES CLI, JES Logging, MESS Work Items and one more. The last two are "Historical" and it might be described in there, but they are roughly before I started tracking all the proposals directly. [10:37] dimitern: this is what I have for the spaces API stuff: https://github.com/juju/juju/compare/net-cli...dooferlad:net-cli-apiserver-spaces?expand=1 [10:39] dimitern: would be good to have a chat about if that is shaping up in the way you imagined. I am not sure I like having the stub network stuff in apiserver/testing. I think having its own package is nicer. [10:39] dimitern: what do you think? [10:40] dooferlad, looking [10:43] dooferlad, I like the refactoring around moving the shared stubs in apiserver/testing [10:43] dooferlad, haven't looked at every line, but so far it looks solid [10:44] dooferlad, please, s/ast/apiservertesting/ (or whichever alias for that path is more common) [10:44] do you have an opinion about if fake_spaces_subnets.go shoud be in its own package so we can just import and use it rather than having to call InitStubNetwork? [10:45] dooferlad, also InitStubNetwork() could be defined as a method on a fixture struct, which can be embedded into the suites that need it and call it in SetUpSuite, rather than init() [10:46] dooferlad, have a look at LiveTests (or was it Tests ?) for example [10:47] dimitern: sure, github.com/juju/juju/environs/jujutest/livetests.go right? [10:47] dooferlad, I have a lingering feeling the shared stubs are not goroutine safe (when used outside apiserver/testing) - make sure you run with -race [10:47] dooferlad, that's the one yeah [10:49] dimitern: great, thanks for the pointers. [11:20] fwereade: bug 1469077 has come up again on 1.24.2, so i removed the incomplete status [11:20] Bug #1469077: Leadership claims, document larger than capped size

[11:23] wallyworld, grar. axw, do you have context on this? ^^ [11:23] fwereade: nope [11:24] fwereade: i have no context on the cause or fix sadly, but i see some info has been attached to the bug [11:25] wallyworld: fwereade: I could see a case where contending on the txn-queue field and having it grow large enough that we can't handle all the txns listed before a new one comes in [11:26] and then it grows every 30s until there are so many entries that it is larger than we're allowed to make a document (or in this case larger than the size of a capped collection?) [11:26] sounds plausible [11:26] jam, yeah -- I just thought that *someone* had addressed the writes that caused that [11:26] jam, I just forget who [11:26] jam, perhaps I hallucinated it [11:26] fwereade: we handled that for addresses by fixing the addresser [11:26] ah mr *someone* :-) [11:26] I don't know of a fix for the leadership stuff [11:27] jam, ok, bugger, I thought that was part of the stuff axw had done but evidently not [11:27] m-enn-o had done some work to clean out transactions that are thought of as already applied (to handle our other assertion-only TXNs don't get cleaned out) [11:27] but I would think that's a different issue. [11:27] jam, yeah [11:27] jam, ok, let's chalk it up to a hallucination then ;p [11:28] jam, oh wait [11:28] jam, I thought it was the remove/insert behaviour that led to growing txn queues, and mr.someone had made a fix to the lease persistor that stopped it doing that? [11:29] wallyworld: [11:29] // TODO(wallyworld) - this logic is a stop-gap until a proper refactoring is done [11:29] // We'll be especially paranoid here - to avoid potentially overwriting lease info [11:29] // from another client, if the txn fails to apply, we'll abort instead of retrying. [11:30] wallyworld, originally it was remove/insert every time, which was causing unbounded queue growth [11:30] hmmm, le tm elook up that code [11:30] wallyworld, state/lease.go [11:31] wallyworld, maybe it wasn't backported..? [11:31] i don't recall that todo all at, yet it has my nname on it :-) [11:32] haha [11:32] I know the feeling [11:32] fwereade: i just checked the code, i.24 is the same [11:35] wallyworld: I do remember fwereade reviewing a patch you submitted so that we changed how leases are requested so that it wouldn't be a "delete current one, create a new one" sort of operation.d [11:35] fwereade: as far as that goes, *if* we ever get to a point where we have an invalid TXN in the queue (one that we cannot clear) [11:36] then we'll overflow the txn-queue eventually [11:36] because creatiion of a *new* txn adds a value to the field [11:36] and then when we go to evaluate the txn, we see the bad txn and die, and now we have yet-another txn in the queue [11:37] so the "document too big" could just be a symptom of "invalid TXN in queue" [11:44] jam: i vaguely recall that too, i'll have to go digging [12:01] fwereade: iteration planning meeting? [12:04] jam, there [12:04] hm. I don't see you in the one I'm in [12:08] katco: ping [12:11] Bug #1474508 changed: Rebooting the virtual machines breaks Juju networking

[12:17] morning all [12:40] perrito666: o/ [13:07] ericsnow: ping [13:24] so cold and alone [13:24] wwitzel3: lol [13:24] natefinch: these things tend to happen when working on rsyslog stuff ;) [13:28] wwitzel3: ahh yeah, totally [13:33] dimitern, TheMue: please be opinionated at http://reviews.vapour.ws/r/2169/ [13:33] dooferlad: ok [13:34] dooferlad: too many files, cannot be good *lol* [13:57] dimitern, kiijubg\ [13:57] wtf?! [13:57] dooferlad, looking :) [14:14] Bug #1468815 opened: Upgrade fails moving syslog config files "invalid argument"

[14:32] rogpeppe: I didn't do the work to enable the cache file, but I might be able to answer specific questions you may have about it. [14:33] dooferlad, reviewed [14:44] dimitern: thanks - exactly what I needed. [14:44] dooferlad, cool :) [15:02] Bug #1474885 opened: juju deploy fails with ERROR EOF [15:02] Bug #1474892 opened: User friendly error message for system destroy could be improved [15:52] * fwereade is stopping, has a review up: http://reviews.vapour.ws/r/2172/ [16:02] ericsnow: meeting [16:24] katco, would you mind filling gsamfira in our the details of how we use feature branches? [16:24] I pointed him to the wiki but he has some questioned I am not qualified to answer [16:25] alexisb: sure thing [16:26] gsamfira: lmk what questions you have [16:49] fwereade, I'll be proposing small uniter changes later on, don't need reviewing yet but I'll ping you about them tomorrow, wanted to let you know before then just to let you know that part of it might be controversial but I think I have a good justification [16:51] kvm-broker_test.go:201: kvm0 := s.startInstance(c, "1/kvm/0") [16:51] /home/ubuntu/src/github.com/juju/juju/container/testing/common.go:90: c.Assert(err, jc.ErrorIsNil) [16:51] ... value *os.PathError = &os.PathError{Op:"mkdir", Path:"/var/lib/lxc", Err:0xd} ("mkdir /var/lib/lxc: permission denied") [16:52] I am seeing this error constantly on a fresh ubuntu machine [16:52] it seems pretty fatal [16:52] has anyone else ever seen this [16:52] i'm sure it's because lxc pacakges are not installed, so /var/lib/lxc is not present [16:52] but this seems like a pretty serious isolation failure [16:59] davecheney: I haven't seen it, but I have lxc installed [17:00] this is on a fresh install [17:00] tests fail because this directory is [17:00] 1. not present [17:00] 2. will not be present, bucause /var/lib is owned by root [17:01] davecheney: certainly, it's an isolation problem. I wonder if there aren't a lot more similar problems in those tests, if lxc is not installed. [17:02] i'm too scared to look [17:02] also, how is that test supposed to pass on windows ? [17:03] * davecheney logs a bug and moves on [17:03] davecheney: I presume all the tests are marked as skipped on windows [17:11] do we have voting windows CI tests ? [17:11] ericsnow, wwitzel3: review me? http://reviews.vapour.ws/r/2174/ [17:12] sinzui: what davecheney said ^ [17:12] https://bugs.launchpad.net/juju-core/+bug/1474946 [17:12] Bug #1474946: worker/provisioner: tests are poorly isolated [17:12] davecheney: I know they run and passed at one time, but I don't know if they're voting or not. I believe so, since I have gotten windows bugs [17:12] from CI failures [17:14] natefinch windows tests do vote. they have passed in but not in a week. I am told many test are skipped [17:14] wwitzel3: btw, I already foward ported the first bug in your bug task: https://bugs.launchpad.net/juju-core/+bug/1370896 [17:14] Bug #1370896: juju has conf files in /var/log/juju on instances

[17:15] juju-ci-tools as a similar problem when we run its own suite on OS X. I created /var/lib/lxc on the machine to get a pass [17:15] that's terrible [17:15] natefinch: yeah, saw that, I discovered that the problem still exists in juju-1.24 master so I'm working on a fix now, before porting the other PRs [17:15] why doesn't it fail for the landing bot ? [17:17] davecheney: windows test suite is run by ci, not the merge bot, and since the test take about 2 hours to get a pass, do you really want to slow down merges? mgz suggested that the test suite be made reliable so that we could get the run down to 40 minutes per merge [17:17] sinzui: i like forcing the issue [17:18] ;) [17:20] Bug #1474946 opened: worker/provisioner: tests are poorly isolated === natefinch is now known as natefinch-afk [18:11] perrito666: juju-ci-tools has the first part of my testing arg change. I am going to do another round to, but It wont be merged until tomorrow. [18:21] sinzui: tx for the heads up [18:52] ericsnow: sorry, got caught up in meetings. reviewing your prs now [19:29] afk picking my car up from the shop === natefinch-afk is now known as natefinch [21:31] perrito666: ping? [21:32] menn0: pong? [21:32] good morning [21:32] perrito666: good evening [21:32]