[00:09] axw: Does jujud-machine-0 need to contact itself on 127.0.0.1:17070? I'm seeing those connections pop up even after seeing remote ones drop away.
[00:10] axw: i.e. should I allow the loopback ones?
[00:13] blahdeblah: it does yeah, maybe just stop the agents *except* jujud-machine-0, because jujud-machine-0 talks to itself over loopback also
[00:14] axw: Righto; so allow loopback, and stop the other agents on machine 0
[00:14] blahdeblah: yup
[00:16] axw: OK - that's done; how long shall I leave it before checking for recovery?
[00:23] blahdeblah: I would expect it to be fairly quick if it's going to fix the issue. 10-15 mins
[00:23] OK - time to make a coffee and see how it goes then. :-)
[00:41] axw: load dropped off pretty dramatically: https://libertysys.com.au/imagebin/2iugcW6L.png
[00:41] and network traffic, but that's pretty much expected given I'm stopping the packets: https://libertysys.com.au/imagebin/B8B9g8F7.png
[00:42] axw: and here's your smoking gun: network https://libertysys.com.au/imagebin/Dqygzb0N.png
[00:43] axw: what next - start the local agents back up?
[00:45] blahdeblah: yes, unblock them and see what happens please
[00:48] OK - that's done
[00:48] * blahdeblah starts writing an update for the bug report
[00:49] blahdeblah: thanks for that
[00:58] axw: So, where to next with this? The agents have gone straight back to spamming the controller's logs...
[00:59] lots of these: unit-ntpmaster-5[2554]: 2016-11-23 00:58:53 INFO juju.worker.dependency engine.go:351 "uniter" manifold worker stopped: dependency not available
[00:59] and these:
[00:59] unit-nrpe-6[2568]: 2016-11-23 00:59:20 DEBUG juju.worker.dependency engine.go:301 starting "uniter" manifold worker
[00:59] unit-nrpe-6[2568]: 2016-11-23 00:59:20 DEBUG juju.worker.dependency engine.go:268 "uniter" manifold requested "agent" resource
[01:19] axw: And we're straight back to previous load & network traffic levels
[01:19] RAM still creeping up
[01:20] blahdeblah: are there errors in the log? about the lease manager not being available or whatever it is
[01:57] FAIL: pinger_test.go:104: pingerSuite.TestAgentConnectionDelaysShutdownWithPing
[01:57] seriously?
[01:58] this test is still failing?
[01:58] FFS
[01:58] thumper: yep... I thought I had it licked, but not
[02:00] thumper: menn0: funny how it's not on my hit list of intermittents that block promotion from develop to staging..
[02:00] thumper: menn0: do u have time to pick it up?
[02:01] dealing with migration stuff right now
[02:01] anastasiamac: ditto... I'm not supposed to pick up anything else
[02:01] awesome \o/
[02:04] axw: Sorry - missed that message earlier. Should the messages be of type ERROR?
[02:04] blahdeblah: yes
[02:04] blahdeblah: in the agent logs
[02:04] axw: not machine 0?
[02:05] blahdeblah: there might be some in there too, but the errors I'm thinking of are the cause of the (non-controller) agent restarts
[02:05] wallyworld: review done
[02:05] tyvm
[02:24] axw: no sign of any errors to do with the lease manager
[02:25] blahdeblah: hmm, ok. no errors at all in the agents, suggesting they're restarting?
[02:25] axw: only the ones relating to the connection going down when I blocked traffic
[02:25] you want me to collect full logs from machine 0 for you?
[02:26] blahdeblah: yeah, wouldn't hurt, thanks
[02:27] thumper: i wonder what the magic incantation is - still can't reproduce it after 520 runs under stress
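[editor's note: on the loopback question at the top of this log - jujud-machine-0 does dial its own API server on 127.0.0.1:17070, so any firewall rules need to spare loopback. A quick way to confirm the controller still answers locally while remote traffic is blocked is a dial check; a minimal Go sketch (the address comes from the log above, everything else is illustrative):]

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // Try a TCP connection to the controller's API port over loopback.
        conn, err := net.DialTimeout("tcp", "127.0.0.1:17070", 5*time.Second)
        if err != nil {
            fmt.Println("loopback API port unreachable:", err)
            return
        }
        conn.Close()
        fmt.Println("jujud-machine-0 is still answering on loopback")
    }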
[02:30] menn0: https://github.com/juju/juju/pull/6597 easy one
[02:30] thumper: looking
[02:31] thumper: QA steps?
[02:31] run the tests
[02:31] I made sure it failed first
[02:31] thumper: ok fine
[02:32] thumper: how serious is the problem that's being fixed?
[02:32] seems problematic
[02:32] menn0: migrations will fail
[02:32] if anyone has ever removed a unit that had an action
[02:32] as the cleanups need to be empty for the model
[02:32] also, data just hangs around
[02:33] plan on backporting to the 2.0 branch...
[02:33] thumper: glad you found it now then
[02:33] well
[02:33] are we going to do another 2.0 release?
[02:33] nfi
[02:33] thumper: ship it
[02:33] menn0: found it as part of the b7 upgrade
[02:34] as horrible as that work has been, found many issues
[02:38] thumper: another 2.0.x release maybe a possibility - it's on a per-need basis..
[02:38] thumper: original plan is to support 2.0.x until Feb
[03:03] menn0: tech board?
[03:03] wallyworld: just when I was actually starting to make some progress today...
[03:03] yeah, always the way
=== frankban|afk is now known as frankban
[08:41] voidspace frobware: I've a couple of IPv6-related PRs in, if you have a minute. Both required for the next juju PR.
[08:41] https://github.com/juju/testing/pull/117
[08:42] https://github.com/juju/replicaset/pull/3
[08:42] Both simple too.
[11:21] frobware: ping
[11:21] just wondering what you discovered WRT the bridge script in your testing yesterday
[11:36] macgreagoir: just to mention, you can ping me as well now. :)
[11:37] jam: Of course :-) cheers.
[11:42] jam: well, one thing is that if there's nothing to do we still run ifupdown - which was fine before, but not now.
[11:42] jam: was offline - power cut
[11:42] frobware: good catch
[11:42] good thing you're using 4g everywhere, right? :)
[11:42] macgreagoir: reviewing
[11:43] jam: I got sidetracked by landing https://github.com/juju/juju/pull/6599
[11:43] jam: we will need the script to be present from now on, so it seemed prudent to make that so...
[11:44] sure, though maybe we should just trigger writing it from jujud itself? given we have a copy inside the jujud binary
[11:45] jam: can do. baby steps.
[11:45] :)
[11:45] I also wonder if we ultimately want it as a python script if we are moving to doing it ourselves. It made sense when it was driven by cloud-init. Stuff to consider, at least.
[11:45] jam: I got a little annoyed it wasn't on there during my testing.
[11:45] jam: yes, something I had considered too.
[11:46] jam: as it stands now, it will work for dynamic and our current static bridging.
[11:59] frobware: replicaset comments amended.
[12:00] macgreagoir: commented.
[12:00] macgreagoir: was still looking at this
[12:15] macgreagoir: reviewed both
[12:15] jam, macgreagoir, voidspace: PTAL @ https://github.com/juju/juju/pull/6599
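[editor's note: jam's suggestion above (11:44) - have jujud write out the copy of the bridge script embedded in its own binary instead of relying on cloud-init to install it - would amount to something like the following Go sketch; the function name, constant, and path are hypothetical, only the idea is from the log:]

    import (
        "io/ioutil"
        "os"
        "path/filepath"
    )

    // Hypothetical: bridgeScriptPath is where the script should live on disk;
    // the content argument would be the copy compiled into the jujud binary.
    const bridgeScriptPath = "/var/lib/juju/bridge-script.py"

    // ensureBridgeScript writes the embedded bridge script to disk so it is
    // guaranteed to be present before anything tries to invoke it.
    func ensureBridgeScript(content string) error {
        if err := os.MkdirAll(filepath.Dir(bridgeScriptPath), 0755); err != nil {
            return err
        }
        // 0755 so the wrapper can execute it directly.
        return ioutil.WriteFile(bridgeScriptPath, []byte(content), 0755)
    }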
[12:27] it seems the pre-flight PR checks are failing
[12:27] take a look at: http://juju-ci.vapour.ws/job/github-check-merge-juju/268/artifact/artifacts/trusty-err.log
[12:27] scripts/setup-lxd.sh: line 10: lxd: command not found
[12:29] rick_h: grabbing coffee, will be 2 mins late to first standup
[12:29] voidspace: rgr
[12:54] mgz: http://juju-ci.vapour.ws/job/github-check-merge-juju/268/ is that just an out-of-disk-space error?
[12:54] lxd-out.log at least has: "Failed to copy file. Source: /var/lib/jenkins/cloud-city/jes-homes/merge-juju-lxd/models/cache.yaml Destination: /var/lib/jenkins/workspace/github-check-merge-juju@2/artifacts/lxd/controller"
[12:55] and why does http://juju-ci.vapour.ws/job/github-check-merge-juju/268/artifact/artifacts/trusty-out.log have "17.04,Angsty Antelope,angsty,2016-10-23,2017-04-30,2018-01-29"
[12:58] ++ lxd --version
[12:58] scripts/setup-lxd.sh: line 10: lxd: command not found
[12:58] seems fishy
[12:58] hey jam, are you on our tz this week? or just working incredibly late?
[12:59] perrito666: it is just hitting 5pm here.
[12:59] I'm UTC+4, so not *that* far away from your TZ, just not very close.
[12:59] it's 7hrs, that's far enough
[13:00] also I forgot it was morning here :p
[13:00] jam: the Angsty thing is a hack to make sure distro-info behaves
[13:00] I might be needing a vacation
[13:01] perrito666: when you start mixing up day and night...
[13:01] mgz: k, just thinking that since it is labeled 17.04 it will become wrong pretty soon.
[13:01] jam: just morning and afternoon, to be fair 10AM and 8PM are not really different around here
[13:01] perrito666: more drinking at 10am?
[13:02] I won't admit nor deny that
[13:03] mgz: so I think the final thing is that trusty somehow is missing LXD (not installed from trusty-backports?)
[13:06] jam: the thing is, it works a portion of the time
[13:06] and the script isn't changing
[13:06] so I don't at a glance know what's up
[13:07] someone just needs to spend some time to debug it; it happens enough that it really needs resolving
[13:07] I'll see if there's a bug already
[13:07] mgz: I don't see anything in http://juju-ci.vapour.ws/job/github-check-merge-juju/268/artifact/artifacts/trusty-out.log that indicates it is installing lxd
[13:08] is it doing something like accidentally using the Xenial script, where lxd is preinstalled?
[13:08] * mgz finds a passing one to compare
[13:08] mgz: also, looking at the apt-update HTTP list, I don't see trusty-backports in there
[13:08] trusty and trusty-updates, but not trusty-backports
[13:09] mgz: I remember there were some issues with particular Trusty MAAS images that didn't have trusty-backports enabled by default, which differed from CPC images
[13:09] ha, a passing one also fails there
[13:09] but exits 0
[13:09] mgz: so passing is because it failed-to-fail?
[13:10] see 267
[13:10] no, it ran the tests, by the trusty-out.log, and they pass
[13:10] mgz: yeah, I see the "lxd not found" line.
[13:11] so, the fatal error is actually the "Connection to ...amazonaws.com closed by remote host."
[13:12] I don't see a bug, so will file one.
[13:12] atm check job history is mostly red, and I suspect that's largely spurious
[13:21] frobware: commented on https://github.com/juju/juju/pull/6599
[13:33] every time the SSO two-factor login annoys me, I remember I implemented it...
[13:33] dammit
[13:34] voidspace: aha! does this mean we can all come to you with complaints? :)
[13:34] mgz: haha, I'm not on that team any more
[13:34] mgz: but feel free to complain, sure
[13:35] do juju 1.X and 2.X both use the same version of goose?
[13:36] mgz: ^^^^
[13:36] gnuoy: I doubt it
[13:38] gnuoy: different revisions of goose.v1
[13:38] voidspace, hmm, ok, ta
[13:38] gnuoy: http://pastebin.ubuntu.com/23522141/
[13:38] gnuoy: 2.0 is on a 2016-11 revision
[13:39] gnuoy: 1.25 a 2015-11 revision
[13:39] voidspace, ok thanks, that's v. helpful
[13:42] gnuoy: nope
[13:42] * mgz was late
[13:42] gnuoy: we can bug-fix goose for 1.25 if needed though
[13:42] but it notably isn't going to have the new neutron stuff
[13:42] mgz, I think the bug impacts both tbh, but I've only confirmed on 1.X
[13:43] mgz, the new neutron stuff has landed?
[13:43] gnuoy: yes, I've delayed sending out the announcement as there was some fallout to handle
[13:43] but will do so.
[13:45] rick_h: I'm out of bourbon, can you bring me a nice bottle to Barcelona and I'll recompense you there... (pretty please :-)
[13:45] rick_h: if you can remember and be bothered, of course...
[13:47] voidspace: hah, what is your preferred one?
[13:48] rick_h: any reasonable Kentucky bourbon will be fine, I'm no expert
[13:48] rick_h: the one I've just finished that I liked
[13:48] rick_h: was Bulleit (or something like that)
[13:48] rick_h: scotch is too rough for me, but you Americans can make nice whisky
[13:49] jam: there is a wrapper around the bridge script which invokes it at its new location
[13:50] jam: that wrapper always existed. It looks to see what version of python is available, and whether all interfaces should be bridged.
[13:51] jam: and then runs: /path/to/python /path/to/the/bridge/script.py
[13:54] mgz, fwiw https://bugs.launchpad.net/juju/+bug/1625624/comments/7
[13:54] Bug #1625624: juju 2 doesn't remove openstack security groups
[13:55] rick_h: and if there's anything I can bring from the UK just let me know, scotch, marmite, Scottish shortbread....
[13:56] gnuoy: thanks, having a look
[13:56] note bug 1643057 was fixed in MAAS 2.1.2
[13:56] Bug #1643057: juju2 with maas 2.1.1 LXD containers get wrong ip addresses
[13:57] gnuoy: do you know if this is tied to a particular openstack version?
[13:57] because goose hasn't changed the underlying http handling in a long time, I believe
[13:57] mgz, that is an excellent question. I believe this broke for us when we went to Mitaka
[13:58] gnuoy: that sounds very likely
[13:58] most of our CI testing is against older (much older) versions
[14:08] ahh frobware, you need to look at http://juju-ci.vapour.ws/job/github-check-merge-juju/269/artifact/artifacts/trusty-out.log/*view*/. You have unit test failures. Everything else is noise
[14:22] balloons: heh, none of these seem to be related to what I changed.
[14:22] balloons: 268 is the one to look at. it either dced after build or something else similar.
[14:22] 269 is perrito666's
[14:22] (which does have real unit test failures)
[15:09] Bug #1625624 opened: juju 2 doesn't remove openstack security groups
[15:20] natefinch: on that branch; not shown: i have to land the CallMocker type into juju/testing
[15:21] katco: ahh, thanks, that would be a key piece :)
[15:21] natefinch: yeah, sorry, i just ran out of time last night
[15:21] natefinch: you can see what it looks like because it's removed from deploy_test.go
[15:21] natefinch: but doesn't build atm. feel free to wait on review if you like
[15:22] natefinch: i will endeavor to get that landed around lunch
[15:25] katco: I can start the review now, that's no problem
[15:27] wow, that call mocker adds a lot of complexity
[15:30] natefinch: going into a meeting, but would appreciate more thoughts on that
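[editor's note: the wrapper frobware describes above (13:49-13:51) - find a usable python, then run the bridge script with it - translates to roughly this Go sketch. The real wrapper is a shell script; the function name is illustrative and the /path/to placeholders are kept from the log:]

    import (
        "fmt"
        "os"
        "os/exec"
    )

    // runBridgeScript mirrors what the wrapper does: pick whichever python
    // interpreter is available, then invoke the bridge script with it.
    func runBridgeScript(args ...string) error {
        python, err := exec.LookPath("python")
        if err != nil {
            // Fall back to python3 if python 2 isn't installed.
            if python, err = exec.LookPath("python3"); err != nil {
                return fmt.Errorf("no python interpreter found: %v", err)
            }
        }
        cmd := exec.Command(python, append([]string{"/path/to/the/bridge/script.py"}, args...)...)
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        return cmd.Run()
    }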
[16:31] mgz: ??
[16:32] mgz: I don't have a merge in progress
[16:36] perrito666: the check on pr #6564
[16:38] mgz: ah yes, I know
=== frankban is now known as frankban|afk
[18:45] Bug #1644333 opened: !amd64 controller on MAAS cloud requires constraints
[20:24] hey menn0
[20:24] babbageclunk: howdy
[20:25] so my logtransfer stuff is failing because the migrationmaster worker is connecting as a machine agent but the debuglog endpoint expects a user.
[20:26] menn0: hey, curious what you think of this https://github.com/juju/testing/pull/118/files in relation to this https://github.com/juju/juju/pull/6598/files
[20:27] menn0: I guess I should add another httpContext authentication method that has an authfunc for any one of machine/unit/user?
[20:27] babbageclunk: well, just machine+user would be enough, right?
[20:28] menn0: oh, right
[20:28] katco: i'll take a look shortly
[20:28] menn0: no rush, ta
[20:28] menn0: I was going to use stateForRequestAuthenticated, but that doesn't check permissions
[20:29] babbageclunk: but yes, that seems ok. there's not much danger in opening up that endpoint to machines
[20:29] menn0: Ok, thanks, I'll do that. Tempted to do it as a function that accepts the tag kinds and rework the others in the same way.
[20:30] babbageclunk: that could be nice, as long as it's not too big a change
[20:30] babbageclunk: i.e. don't get too distracted
[20:32] menn0: No, it's simple.
[20:32] babbageclunk: cool, then go for it
[20:32] menn0: I'm so close to getting this working I can taste it!
[20:33] babbageclunk: what does it taste like? :)
[21:34] menn0: it tastes like... victory
[21:40] babbageclunk: for a moment I thought we were still talking about what came out of your keyboard :p
[21:41] perrito666: ugh, who put all this cat hair on my keyboard!?
[21:41] perrito666: never clean anything, it only leads to heartache
[22:23] anyone else noticed a plethora of ERRORs now being written to the logs like this: 11:17:30 ERROR juju.rpc error writing response: write tcp 127.0.0.1:17070->127.0.0.1:52408: write: connection reset by peer
[22:23] normally about 10 at a time
[22:42] thumper: yeah, I think I've seen those - I didn't realise they were new though
[22:47] axw, thumper, menn0: can someone take a look at this? https://github.com/juju/juju/pull/6606
[23:00] katco: I like the approach. i've added a bunch of suggestions; let's get that in.
[23:00] babbageclunk: looking
[23:01] menn0: oh fuck...
[23:02] * menn0 cringes
[23:02] menn0: I've just hit something quite interesting
[23:02] quick HO to discuss?
[23:02] * menn0 doesn't like interesting
[23:02] thumper: no :)
[23:02] :(
[23:02] * menn0 buries head in sand
[23:02] thumper: see you in 1:1
[23:02] ack
[23:04] machine-0: 12:01:20 ERROR juju.worker.migrationmaster:9097b3 source prechecks failed: unit ubuntu/0 not idle (failed)
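[editor's note: the httpContext change babbageclunk and menn0 settle on above (20:27-20:29) - authentication that accepts any of a set of tag kinds rather than exactly one - could look roughly like this. A hypothetical sketch using the tag-kind constants from gopkg.in/juju/names.v2, not the actual patch:]

    import "gopkg.in/juju/names.v2"

    // tagKindAllowed returns a check that passes if the authenticated
    // entity's tag is one of the allowed kinds, e.g. machine or user.
    // (Hypothetical helper; the real change lives in apiserver's httpContext.)
    func tagKindAllowed(kinds ...string) func(names.Tag) bool {
        return func(tag names.Tag) bool {
            for _, kind := range kinds {
                if tag.Kind() == kind {
                    return true
                }
            }
            return false
        }
    }

    // For the logtransfer endpoint, machine agents and users would both pass:
    var allowLogTransfer = tagKindAllowed(names.MachineTagKind, names.UserTagKind)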
[23:20] thumper: ping?
[23:20] babbageclunk: hey
[23:22] thumper: my migration master's getting an error when connecting to the logtransfer endpoint - the target model isn't importing (since it's already at success).
[23:22] ah...
[23:22] thumper: how do you think I should handle that - another phase?
[23:23] what comes after success?
[23:23] thumper: ...profit?
[23:23] heh
[23:23] thumper, babbageclunk: SUCCESS -> LOGTRANSFER
[23:23] menn0: but is the target model in that state?
[23:24] I think perhaps the method should take the expected import state
[23:24] stateForMigration should take an expected state
[23:24] * menn0 agrees with thumper
[23:24] and check against that
[23:24] rather than just assuming importing
[23:25] the logtransfer endpoint should only work if the migration state is "none"
[23:25] thumper, menn0 - yeah, makes sense. In this case that's MigrationModeNone
[23:25] (that's not the exact name)
[23:25] why is the state none?
[23:25] ok, thanks chaps - doing that now
[23:26] if it is still migrating?
[23:26] menn0: ^
[23:26] menn0: FYI, the precheck is failing because the machine instance status is "started" not "running"
[23:27] thumper: ok cool .. easy to fix then
[23:27] I'm looking at what full status is looking at
[23:27] * thumper is still digging
[23:28] thumper: fullstatus should be using the same thing to generate the status, but perhaps it's the interpretation of the status
[23:28] yeah, that's what I'm checking
[23:29] babbageclunk: review done. looks great.
[23:29] menn0: thanks!
[23:33] mgz: thanks for the review, PTAL at my response when you have a moment
[23:34] axw: wallyworld I need to skip standup sorry, I'll update you guys upon return
[23:34] axw: good catch with the region check
[23:35] perrito666: ok, ttyl
[23:37] perrito666: BTW see validateCloudRegion in state/model.go for one place we're doing validation of region. I think we just want to copy that logic into environs.MakeCloudSpec
[23:37] we're validating when we *do* pass a region, but not when we don't
[23:37] menn0: status always shows "started", never "running"
[23:43] * thumper sighs
[23:44] apiserver converts running -> started somewhere
[23:44] * thumper still digging
[23:46] thumper: i'll show you where after the standup
[23:46] ok
[23:47] perrito666: standup?
[23:55] menn0: fyi https://bugs.launchpad.net/juju/+bug/1620438
[23:55] Bug #1620438: model-migration pre-checks failed to catch issue: not idle
[23:59] Bug #1640521 changed: Unable to deploy Windows 2012 R2 on AWS
[23:59] Bug #1644333 changed: !amd64 controller on MAAS cloud requires constraints
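[editor's note: the fix thumper and menn0 converge on above - stateForMigration taking the expected migration state instead of assuming the model is importing - might be sketched as below. Names are approximations of those mentioned in the log (babbageclunk notes MigrationModeNone "is not the exact name"), and the state-pool lookup is illustrative, not the merged code:]

    import (
        "github.com/juju/errors"

        "github.com/juju/juju/state"
    )

    // stateForMigration looks up the model's state and checks that its
    // migration mode matches what the caller expects, rather than assuming
    // the model is still importing. (Approximate sketch.)
    func (ctxt *httpContext) stateForMigration(modelUUID string, expectedMode state.MigrationMode) (*state.State, error) {
        st, err := ctxt.statePool.Get(modelUUID) // illustrative lookup
        if err != nil {
            return nil, err
        }
        model, err := st.Model()
        if err != nil {
            return nil, err
        }
        if mode := model.MigrationMode(); mode != expectedMode {
            return nil, errors.Errorf("migration mode is %q, expected %q", mode, expectedMode)
        }
        return st, nil
    }

    // The logtransfer endpoint would then pass MigrationModeNone, since the
    // target model is already activated ("at success") by the time logs move.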