[00:05] bradm: which version of juju?
[00:06] bradm: sounds to me like some runaway leadership code
[00:07] thumper: the agents are 1.23.3, we have juju 1.24.2 client installed
[00:07] bradm: ok... I'm guessing leadership stuff then (known to be problematic in 1.23)
[00:08] but it is mostly a guess
[00:08] bradm: does restarting the agents fix it?
[00:08] is that possible ?
[00:09] thumper: we're being super cautious with this environment, it's prodstack 4.5
[00:09] :-)
[00:09] I understand the caution
[00:09] sorry, personally I don't have an answer for you
[00:10] that's fine, just trying to see if there's something we should be doing to see what's going on
[00:10] it's not a problem per se, the load is only sitting at about 1, just curious
[00:11] so it looks like it was restarted, cleared out a bunch of memory, but it's still chewing cpu time
[00:12] the memory usage was more of a problem, but the restart cleared most of that
[00:50] ah fark
[00:56] thumper: ?
[00:57] just resolving a mega-merge (master -> jes-cli) and I deleted a file I shouldn't have
[00:58] :( sounds painful ...
[01:02] * thumper runs make check again
[01:34] ericsnow: ping?
[01:39] axw: can you remind me - there was a bug about upgrades not ingesting charms from cloud storage to env storage I think?
[01:39] wallyworld: tools, not charms
[01:40] wallyworld: on ec2 only
[01:40] wallyworld: you want the number?
[01:40] ah. there's a bug report that implies charms are not being loaded
[01:40] sure
[01:40] wallyworld: the one I fixed was about the s3 signing, due to ":" being in the URL
[01:41] i'll have to try and replicate the charms one
[01:41] ah yes, i recall now
[01:41] wallyworld: https://bugs.launchpad.net/juju-core/+bug/1469130
[01:41] Bug #1469130: tools migration fails when upgrading 1.20.14 to 1.24.1 on ec2
[01:41] ty
[03:00] wallyworld: http://reviews.vapour.ws/r/2152/
[03:00] looking
[03:02] thumper: looks ok. are the other issues raised about rsyslog etc valid?
[03:03] I'm not sure which issues you are referring to
[03:04] thumper: in menno's email
[03:05] yes, the other issues are valid
[03:05] :-(
[03:07] also, looking through the code...
[03:08] I saw something.
[03:08] * thumper wonders what happens
[03:08] if you try to send on a closed channel in a select
[03:13] * thumper has to head out for a bit
[03:18] Bug #1474195 opened: juju 1.24 memory leakage
[04:32] davecheney: are you really around?
[04:32] if so, I have a golang question
[04:32] if I have a select, and one of the cases is a send to a channel
[04:32] and something closes that channel
[04:32] will it blow up, or just choose the other case?
[04:32] hmm...
[04:32] * thumper goes to test with the playground
[04:36] thumper: i'm sad
[04:37] upgrades from 1.20 to 1.22 are busted
[04:38] :(
[04:38] I'm sad
[04:38] the migration of the charm collection to add the env uuid occurs AFTER the migration into storage of the charms, hence no charms are imported
[04:38] a send on a closed channel panics even in a select
[04:38] oh?
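
For reference, the playground test thumper describes comes down to something like the standalone sketch below (plain Go, not juju code): the select does not skip a send case whose channel has been closed; the case is still chosen and the send panics.

    package main

    import "fmt"

    func main() {
        defer func() {
            // Without this recover the program dies with
            // "panic: send on closed channel".
            fmt.Println("recovered:", recover())
        }()

        ch := make(chan int)
        other := make(chan struct{}) // never ready
        close(ch)

        select {
        case ch <- 1:
            // Never reached: the send panics as soon as this case is chosen.
            fmt.Println("sent")
        case <-other:
            fmt.Println("other")
        }
    }
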
[04:39] you asked "was there another place where we need to care about closing the channels" [04:39] and I looked [04:39] i shouldn't have asked :-) [04:39] there are go routines that are started to send values down the subscriber channels [04:39] which try to send for a while, then timeout [04:39] if there are any of these running when we close the subscriber channels [04:39] panic [04:39] i guess we need to include <-chan in the select [04:40] it is worse than that [04:40] we need to make sure they are all dead before we close the subscribers [04:40] which is a bitch to write and probably worse to tests [04:40] yuk [04:40] well, isn't too hard to write [04:40] just a wait group [04:41] thumper: save me looking, can you remember, are upgrade steps run serially in the order defined? i would have hoped so [04:41] yes [04:41] yes they are [04:42] so how the f*ck are we running charm import in the middle of env uuid fixing [04:42] NFI [04:42] is it not an upgrade step? [04:42] it is [04:43] is it doing things asynchronously? [04:43] not that i can see, i'll have to look closer [04:43] * thumper stabs the lease code [04:44] not sure how to fix, may need to trigger the upgrade steps again somehow [04:45] ah so migrate charms into storage is a 1.21 step [04:45] add env uuid to collections is a 1.22 step [04:45] and yet the initial env uuid steps are logged prior to the 1.21 steps [04:49] oh wait, env uuid steps are split across 1.21 and 1.22 [04:50] and if 1.22 state server runs the steps, it expects an env uuid to be there [04:50] but it won't be because that is only done in a 1.22 step [04:51] and the charm migration happens in a 1.21 step [04:51] so looks like we need to force an upgrade via 1.21 \o/ [04:53] hang on... hangout so you can talk me through this [04:53] something doesn't smell right [04:53] ok [04:54] 1:1 [04:54] ack [05:12] wallyworld: you froze [05:12] and I couldn't hear you [05:12] so I hung up [05:13] thumper: sorry, bigjools wifi died [05:14] thumper: best thing to do is hang up on wallyworld [05:15] :-) [05:15] thumper: i did the same to you [05:15] fair enough [06:43] Bug #1253613 changed: Hooks want to publish information to juju status [07:24] dooferlad: hey, i've addressed your comments in the PR, if it looks ok, and you and dimiter think it is suffcient to solve the issues you've seen, let me know and i'll land [07:53] wallyworld: +1 from me. Doesn't look like dimitern is around. TheMue, could you take a look at http://reviews.vapour.ws/r/2138/ please? [07:54] ty [07:54] dooferlad: will do, just merging my branch [07:54] * wallyworld off to soccer training now anyway [07:54] wallyworld: have fun! One of us can $$merge$$ it [08:13] dooferlad: I added one question to #2138, could you answer it as a native speaker? [08:14] TheMue: sure [08:14] dooferlad: thx [08:15] dooferlad: it's just a feeling based on the German meanings of those two different words [08:18] TheMue: no problem. In this case only is the right word to use. [08:20] TheMue: I think that this is mostly because solely is a bit of a mouthful and we tend to only use it in more formal language, such as "Bob is a sole trader" . 
[08:20] dooferlad: fine, so ignore my note
[08:21] dooferlad: in German "only" has a more negative meaning while "solely" is narrowing down and positive
[08:22] dooferlad: but btw, it won't merge, it does not fix the right blocker \o/
[08:23] TheMue: :-(
[08:24] dooferlad: yep, my latest PR also knocks on the CI's doors
[08:57] dooferlad: as a fix it's accepted, but the tests failed
[09:42] wallyworld, hey, thanks for taking care of bug 1472014 !
[09:42] Bug #1472014: juju 1.24.0: wget cert issues causing failure to create containers on 14.04.2 with lxc 1.07
[10:10] Bug #1474291 opened: juju called unexpected config-change hooks after read tcp 127.0.0.1:37017: i/o timeout
[10:15] TheMue, dooferlad, so no maas call today - the guys are away it seems
[10:15] dimitern: ok, thanks for the info
[10:16] dimitern: ack
[11:54] good morning all
[12:00] axw: meeting?
[12:01] Bug #1461993 changed: support using an existing vpc
[12:01] Bug #1457575 opened: archive/tar: write too long
[12:19] oh boy, juju not supporting a vpc account without a default vpc strikes again! we should really fix this bug 1321442
[12:19] Bug #1321442: Juju does not support EC2 with no default VPC
[12:19] hazmat will like this a lot :)
[12:37] Bug # changed: 1461959, 1462417, 1463133, 1464280, 1469186
[12:50] Bug # changed: 1463870, 1464254, 1464255, 1466513, 1469184, 1469189
[12:56] Bug # opened: 1463870, 1464254, 1464255, 1466513, 1469184, 1469189
[13:05] Bug # changed: 1463870, 1464254, 1464255, 1466513, 1469184, 1469189
[13:59] Bug #1474382 opened: MeterStateSuite teardown failure on windows
[13:59] Bug #1474386 opened: Problems bootstrapping the manual provider with CentOS
[14:01] bogdanteleaga: poke me if you need anything from the windows test suite
[14:01] bogdanteleaga: I'll also proofread your message to the list about setup function calling if you like
[14:02] mgz: sure
[14:02] mgz: any eta on when the 1.24.3 thing will start testing the upgrade?
[14:04] bogdanteleaga: when we have 1.24.3 on our maas machine, I'll ask sinzui about the release timeline
[14:07] katco, natefinch, wwitzel3: standup?
[14:09] could anyone tell me how to deploy a charm on the dummy provider such that hooks get run? or is that not an option?
[14:35] natefinch: looks like you didn't add a link for the forward-port PR to bug #1370896
[14:35] Bug #1370896: juju has conf files in /var/log/juju on instances
[14:41] Bug #1474411 opened: juju --help text for upgrade is out of date
[15:10] ericsnow: thanks for reminding me
[15:10] natefinch: :)
[15:23] Bug #1472596 changed: bootstrap failed yet retry says it succeeded
[15:26] Bug #1472596 opened: bootstrap failed yet retry says it succeeded
[15:47] Bug #1472596 changed: bootstrap failed yet retry says it succeeded
[15:47] Bug #1473197 changed: openstack: juju failed to launch instance, remains pending forever
[16:31] natefinch: hey just checking in between errands. did you find something to work on? alexisb's suggestion of 1.25 bugs is not bad, but check with ericsnow to see if there's anything he needs help with to get the wpm demo ready for the mid-cycle
[16:43] mm, where is the list of environment-aware collections?
[16:44] nevermind
[16:47] Bug #1473197 opened: openstack: juju failed to launch instance, remains pending forever
[16:57] katco: yep, doing bugs, cleaning up some of my old bugs that needed forward porting
[16:57] natefinch: cool. i looked at the bugs we have flagged in our backlog, and they look valid to me? what was the problem?
[16:58] katco: there were a couple more that I deleted that were already fixed... sorry if deleting was not the correct thing to do. I didn't want to put them into done, since they weren't work that we actually did.
[16:59] katco: but they were definitely already marked as fix released in all series to which they were targeted (probably work was done after we created the cards)
[16:59] natefinch: oh, i'm referring to the two bugs in our iteration backlog
[16:59] natefinch: created on the 6th, no one assigned in LP
[17:00] natefinch: e.g. https://canonical.leankit.com/Boards/View/114568542/115913838
[17:00] katco: I... somehow completely overlooked those in favor of what was in the backlog (not iteration backlog). Sorry. Well, I'll start on one of those right away
[17:00] natefinch: cool beans. ty
[17:01] natefinch: did that not come up in the standup this morning?
[17:02] katco: no, I was doing my "cleanup assigned bug" task (which will be done when trunk opens again), and talking to eric about one of the bugs that he'd worked on. I think eric just believed me when I said all the bugs were assigned or finished ;)
[17:03] natefinch: ah, ok :)
[17:47] Bug #1473470 changed: Windows cannot ensurePassword
=== xwwt_ is now known as xwwt
=== lazyPower_ is now known as lazyPower
=== wwitzel3_ is now known as wwitzel3
[18:56] I think I found a networking problem with the KVM provider. https://bugs.launchpad.net/juju-core/+bug/1474508 - Can someone from core take a look and let me know if they need any more information?
[18:56] Bug #1474508: Rebooting the virtual machines breaks Juju networking
[18:59] Bug #1474291 changed: juju called unexpected config-change hooks after read tcp 127.0.0.1:37017: i/o timeout
[18:59] Bug #1474508 opened: Rebooting the virtual machines breaks Juju networking
[19:02] Bug #1474508 changed: Rebooting the virtual machines breaks Juju networking
[19:02] Bug #1474291 opened: juju called unexpected config-change hooks after read tcp 127.0.0.1:37017: i/o timeout
[19:03] mbruzek: attach the machine and unit log from that machine, and the state server's machine log as well, if you can.
[19:09] natefinch: I can't juju ssh or juju scp from the units any longer since Juju thinks the IP address changed.
[19:09] natefinch: I will get the logs if I can figure out another way.
[19:15] Bug #1474291 changed: juju called unexpected config-change hooks after read tcp 127.0.0.1:37017: i/o timeout
[19:15] Bug #1474508 opened: Rebooting the virtual machines breaks Juju networking
[19:27] mbruzek: juju ssh won't work, but plain old ssh should still work.
[19:28] It does, just having trouble submitting the form on Launchpad
[19:28] The button does not work on Firefox, and on Chrome I get an error
[19:32] mbruzek: maybe you should try IE ;)
[19:54] natefinch: I have tried this several different ways, I cannot upload the machine-0.log
[19:55] natefinch: because it was owned by root, not mbruzek
[19:55] natefinch: uploaded
[19:55] mbruzek: great :)
[19:56] launchpad doesn't tell you that; I can navigate to the location and select the actual file.
[19:59] mbruzek: may be a browser/OS issue where the specific problem is not well communicated.
[20:00] Yeah, I was just stating the reason I did not immediately see the resolution
=== liam_ is now known as Guest10453
[20:43] <_thumper_> fwereade: you around?
=== _thumper_ is now known as thumper
[20:44] thumper, yeah, in a few minutes
[20:44] I get the feeling we are being shafted by the mgo txn stuff
[20:44] and while I think I have grokked the problem, I'd like to discuss
[20:45] fwereade: also I was very much hoping you would cast your eye over http://reviews.vapour.ws/r/2152/diff/# as you have been doing a lot of lease work
[20:50] thumper, LGTM
[20:51] fwereade: ok
[20:52] thumper, will be back to chat in a minute
[20:53] fwereade: have standup on the hour
[20:53] fwereade: if you want to give us 15 or 20 minutes
[20:53] I'll be a few more minutes than 1 then
[20:53] it may well be worthwhile talking with menn0 and waigani then
[20:53] thumper, ping me when you're out, I should still be around
[20:53] kk
[20:54] thumper, cool, I'll just join your hangout when I get back
[20:54] thumper, I'll sit quietly ;p
[20:58] * perrito666 sees fwereade quietly sitting in a corner and hands him a PR for update-status
[21:45] thumper: did you have 5?
[21:45] wallyworld: no
[21:45] not right now
[21:45] ok
[21:50] menn0: is the rsyslog issue you found with that upgrade bug 1468653 an issue that needs its own bug?
[21:50] Bug #1468653: jujud hanging after upgrading from 1.24.0 to 1.24.1 (and 1.24.2)
[21:51] wallyworld: yes it does... or perhaps just reopen one of the previous bugs that cover this problem
[21:51] wallyworld: a new bug is probably less confusing
[21:51] menn0: yeah, could you please do that and assign to 1.24.3?
[21:51] the hook execution one was fixed elsewhere, right?
[21:55] wallyworld: i'm not sure if the hook failure one has been dealt with in 1.24 yet. thumper?
[21:55] menn0: if i recall, one of the windows guys may have been working in the area or doing a different fix that would address it? not sure
[21:56] wallyworld: fwereade said that bogdan had committed a fix so that a unit wouldn't indicate that it had started hook execution until it had the lock
[21:56] wallyworld: that needs to be backported to 1.24
[21:58] right, that's the one, thanks
[21:58] or sounds like it
[22:02] fwereade: just to check - the above hook execution fix is ok to backport
[22:02] I think so, yes
[22:02] as it appears implicated in a 1.24 upgrade issue
[22:02] ok, ta
[22:02] I remember it being clean, and I don't think there have been major changes in that area
[22:02] is there a bug for it, do you recall?
[22:03] i'll do a search
[22:04] huh
[22:04] I have what might be good or bad news
[22:04] https://github.com/juju/juju/pull/2681
[22:04] how old is the 1.24 that's experiencing this?
[22:04] hmmm, might be older than the fix i hope
[22:05] i need to check when 1.24.2 came out
[22:05] 2 july
[22:07] i'll check the source
[22:17] fwereade: damn, that code is in 1.24.2
[22:17] wallyworld, damnshit
[22:17] menn0: you definitely saw the hook bug in 1.24.2?
[22:18] wallyworld, fwereade: yep. after upgrading from 1.24.0 to the official 1.24.2 release, around half of the units had a hook failure
[22:18] wallyworld, fwereade: mostly leadership-settings-changed and leader-elected but also config-changed and others
[22:18] menn0: damn, could you add that info to a new bug?
[22:19] menn0, wallyworld: ...I think that is a property of 1.24.0 rather than 1.24.2
[22:19] wallyworld, fwereade: we can probably get access to the bootstack staging env again to recreate
[22:19] fwereade: i have no insight into this issue
[22:19] would be happy if it were 1.24 related
[22:19] menn0, wallyworld: if we want 1.24.2 to recover from that situation I think we basically need to retry failed hooks automatically
[22:20] menn0, wallyworld: and that, surprise surprise, may have tentacles
[22:20] as part of an upgrade?
[22:20] wallyworld, I think we should do it anyway
[22:20] i'd be loath to auto-retry hooks
[22:20] not in a point release
[22:20] wallyworld, UX-wise just conditioning users to press the big red retry button is less good than just doing it ourselves
[22:20] wallyworld, right
[22:21] wallyworld, and an auto-retry on upgrade might be tricky?
[22:21] so for 1.24.0 -> 1.24.2 we document that users need to retry
[22:21] wallyworld, doable actually
[22:21] hmm
[22:21] well, on upgrade we don't have the full api
[22:21] so if we retry, hooks may fail
[22:22] wallyworld, I think we just set the retry flag on any unit in an error state
[22:22] wallyworld, and the units handle the retry when they get to it
[22:22] fwereade: can we go back to the root cause of the issue? why is this even happening and why is it suddenly more of an issue
[22:23] I think I've noticed this before but it's always just been one or 2 units in an openstack deployment
[22:23] is it because with leadership we have a lot more hooks going off?
[22:24] menn0, I don't *think* that accounts for it all
[22:24] menn0, and indeed I don't understand why it wasn't happening before
[22:24] menn0, *unless*
[22:25] menn0, something about how we handle the machine lock on upgrade has changed as well
[22:26] menn0, *or* I fucked up somewhere in the uniter changes of ~jan/feb, and unwittingly changed something about how the uniter handles the lock
[22:27] menn0, but I might have expected to see that earlier?
[22:27] menn0, perhaps not?
[22:28] menn0, sorry I don't have further context, I'm flagging a bit
[22:29] menn0, I think it would be easier to eliminate the change-to-locking-on-upgrade hypothesis
[22:29] menn0, so if nothing happened there it's probably my fault
[22:29] fwereade: i'll try to do some digging
[22:30] menn0, thanks
[22:30] fwereade: it should be easy enough to repro the issue and then work from there
[22:31] wallyworld: did you create a ticket for the hook failures on upgrade issue?
[22:31] wallyworld: i'm currently doing some repro for the rsyslog issue so we have decent details
[22:32] menn0: didn't create a ticket - i'll go back through emails and dig up details. i haven't actually seen the issue first hand so am unsure how to describe it exactly
[22:33] wallyworld: I'll create the ticket. was just checking you hadn't already.
[22:33] ok, ta
[22:36] wallyworld, fwereade: the other issue that needs looking at is the leaseManager worker constantly dying and restarting due to "concurrent updates"
[22:36] oh
[22:36] didn't know about that one
[22:37] wallyworld: that is in my email as well I think
[22:37] i'll go and re-read
[22:37] my brain fifo kicked in
[22:38] sinzui: i created a 1.22.7 milestone
[22:38] wallyworld: here's what I said: 3. The lease manager dies at least once a minute, sometimes more often, due to "simultaneous lease updates occurred".
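
To make the 22:22 suggestion concrete, here is a purely illustrative sketch of "set the retry flag on any unit in an error state"; the unit interface below is an invented stand-in, not how the real state API spells it.

    package upgrades

    // unit is a hypothetical stand-in for the relevant slice of state.Unit.
    type unit interface {
        AgentStatus() (string, error)
        Resolve(retryHooks bool) error
    }

    // retryFailedHooks marks every unit currently in an error state as
    // resolved-with-retry, so its uniter re-runs the failed hook when it next
    // gets the machine lock, instead of waiting for someone to run
    // `juju resolved --retry` by hand.
    func retryFailedHooks(units []unit) error {
        for _, u := range units {
            status, err := u.AgentStatus()
            if err != nil {
                return err
            }
            if status != "error" {
                continue
            }
            if err := u.Resolve(true); err != nil {
                return err
            }
        }
        return nil
    }
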
[22:39] awesome :-(
[22:45] wallyworld: I suspect fwereade's upcoming work will fix this but I don't know if we want to do something else in the meantime
[22:46] we may need to
[22:46] it's kinda broken as is
[22:46] wallyworld: i'm not sure what the consequences of this error are
[22:47] me either without digging
[22:47] wallyworld: maybe the worker just needs to not treat this error as fatal and resync/retry instead
[22:47] worth considering
[22:48] Bug #1474588 opened: Many hook failures after upgrade
[22:50] menn0: could you post status-history for some of those?
[22:54] perrito666: that's a v good idea. i keep forgetting about that feature
[22:54] perrito666: when I repro the issue I'll do that
[22:56] menn0: tx
[23:24] NOTICE: having an issue with prodstack that is causing jujucharms.com to be unresponsive. It also means 1.24.X juju deploys will probably not be successful atm. Working with webops to keep an eye on it and correct it.
[23:29] * thumper has in-laws arriving for lunch...
[23:29] yay?
[23:29] thumper: lucky dog
[23:29] :)
[23:42] menn0: have a sec?
[23:42] * perrito666 lures menn0 into a dark alley
[23:52] * menn0 pretends everything is going to be fine
[23:52] perrito666: what's up
[23:53] I have to report a bug and need some input from you to make sure I don't lie a lot
[23:54] also, have this completely innocent-looking candy I extracted from my coat
[23:55] menn0: basically I have found that status for our entities has been left out of the envuuid work :(
[23:55] ooooh candeeee
[23:56] perrito666: do you mean the docs in the statuses collection?
[23:56] yeah, give a man the choice between possibly dangerous food or a bug in envuuid and see where he goes
[23:56] menn0: yes, statusesC
[23:57] perrito666: I remember that collection having multi-env support added
[23:57] perrito666: what are you seeing?
[23:57] menn0: so, the entries for that collection are being created with the services
[23:57] since that employs createStatusOp it all goes well because the envuuid is added
[23:58] ok
[23:58] but every subsequent setstatus uses updateStatusOp, which does an Update: bson.D{{"$set", doc}} where doc is a new status doc lacking the envuuid
[23:58] so only the first status of every entity has the envuuid
[23:59] * perrito666 can hear menn0 cursing from here
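
On the 22:47 suggestion for the lease manager, a minimal sketch (invented names, not the real worker) of treating "simultaneous lease updates occurred" as retryable rather than fatal: the loop resyncs and carries on, so the worker is not killed and restarted every minute.

    package lease

    import (
        "errors"
        "time"
    )

    // errConcurrentUpdate stands in for the "simultaneous lease updates
    // occurred" error the worker currently treats as fatal.
    var errConcurrentUpdate = errors.New("simultaneous lease updates occurred")

    type manager struct {
        dying chan struct{}
        sync  func() error // reload lease state from the database
        write func() error // try to persist locally made changes
    }

    func (m *manager) loop() error {
        for {
            select {
            case <-m.dying:
                return nil
            case <-time.After(time.Minute):
                switch err := m.write(); err {
                case nil:
                case errConcurrentUpdate:
                    // Another state server wrote first: resync and try again
                    // on the next tick instead of returning (which would kill
                    // the worker and have the runner restart it).
                    if err := m.sync(); err != nil {
                        return err
                    }
                default:
                    return err
                }
            }
        }
    }
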
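For the statusesC bug perrito666 describes at 23:57-23:58, a hedged sketch of the shape of the problem, using mgo/txn-style ops (the real juju definitions may differ in detail): the insert op stamps env-uuid on the document, but the update op $sets a freshly built doc whose env-uuid field was never filled in, so the stored value is clobbered on the first update.

    package state

    import (
        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    type statusDoc struct {
        EnvUUID    string `bson:"env-uuid"`
        Status     string `bson:"status"`
        StatusInfo string `bson:"statusinfo"`
    }

    // createStatusOp sets env-uuid on insert, which is why the first status
    // written for each entity looks fine.
    func createStatusOp(envUUID, globalKey string, doc statusDoc) txn.Op {
        doc.EnvUUID = envUUID
        return txn.Op{
            C:      "statuses",
            Id:     envUUID + ":" + globalKey,
            Assert: txn.DocMissing,
            Insert: doc,
        }
    }

    // updateStatusOp $sets a brand new doc whose EnvUUID field is empty, so
    // the stored env-uuid is blanked the first time the status is updated.
    func updateStatusOp(globalKey string, doc statusDoc) txn.Op {
        return txn.Op{
            C:      "statuses",
            Id:     globalKey, // prefixed with the env UUID elsewhere
            Assert: txn.DocExists,
            Update: bson.D{{"$set", doc}},
        }
    }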