[00:22] <bigjools> error: invalid service name "tarmac-1.4"
[00:22] <bigjools> yay
[00:26] <davecheney> probably the do
[00:26] <davecheney> pretty much only safe to use a-z, 0-9 and hyphen
[00:26] <thumper> wallyworld_: looking
[00:27] <wallyworld_> k
[00:27] <davecheney> s/do/dot
[00:31] <bigjools> davecheney: but why... :/
[00:32] <bigjools> I tried with tarmac-14 and it still complained :(
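The name rule davecheney describes can be sketched as a regexp. This is an illustration, not juju's actual validation code; the extra requirement that each hyphen-separated segment contain a letter is an assumption here, but it would also explain why tarmac-14 was rejected (an all-digit suffix would be ambiguous with unit-style names like wordpress-0):

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical check mirroring the rule discussed above: lowercase
// letters, digits and hyphens only, starting with a letter, and every
// hyphen-separated segment after the first must contain a letter.
// The exact pattern juju uses may differ.
var validService = regexp.MustCompile(`^[a-z][a-z0-9]*(-[a-z0-9]*[a-z][a-z0-9]*)*$`)

func main() {
	for _, name := range []string{"tarmac", "tarmac-1.4", "tarmac-14", "tarmac-v14"} {
		fmt.Printf("%s: %v\n", name, validService.MatchString(name))
	}
}
```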
[00:44] <wallyworld_> thumper: i didn't mention the listener because it is orthogonal to the management of keys in state. that aside, i'll continue with the current work then
[00:45] <thumper> wallyworld_: I think it is worth the few days effort now to make it easier
[00:45] <wallyworld_> me too
[00:45] <thumper> the fact that it gives you a break to do something slightly different is a bonus
[00:45] <wallyworld_> if we had the road map all sorted it would be easier to know exactly what to do next
[01:37] <wallyworld_> thumper: i just ran into that @%#^%!&@!^& issue where different values of c are used in different phases of the tests. ffs
[01:38] <thumper> how?
[01:38] <thumper> storing c?
[01:39] <wallyworld_> no, constructing a stub in the SetupTest, where the stub took c *gc.C as the arg and the called c.Assert
[01:39] <wallyworld_> the c.Assert failed and the wrong go routine was stopped
[01:39] <wallyworld_> so the test said it passed
[01:39] <thumper> heh
[01:39] <thumper> oops
[01:39] <wallyworld_> well, oops smoops
[01:39] <wallyworld_> the same value of c should be used
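The mechanism wallyworld_ hit can be shown without gocheck: c.Assert stops the current goroutine on failure (via runtime.Goexit), so an assertion made on a stale c from a helper goroutine kills only that goroutine, and the test run still reports success. A stripped-down sketch (assertTrue is a stand-in, not gocheck's API):

```go
package main

import (
	"fmt"
	"runtime"
)

// assertTrue stands in for gocheck's c.Assert: on failure it records
// the failure and stops only the *current* goroutine.
func assertTrue(cond bool, failed *bool) {
	if !cond {
		*failed = true
		runtime.Goexit() // kills the calling goroutine, nothing else
	}
}

// runTest spawns a helper goroutine (think: a stub built in SetUpTest)
// that asserts using a captured value. The failure stops the helper,
// not the test body, so the test still reports a pass.
func runTest() (helperFailed, testPassed bool) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		assertTrue(false, &helperFailed)
	}()
	<-done
	testPassed = true // the test body never learned about the failure
	return
}

func main() {
	failed, passed := runTest()
	fmt.Printf("helper failed: %v, test passed: %v\n", failed, passed)
}
```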
[01:40] <wallyworld_> also, our commands are hard to test sometimes
[01:40] <wallyworld_> cause they live in a main package
[01:40] <wallyworld_> so can't easily be mocked out from another package
[01:41] <wallyworld_> without refactoring the Run()
[01:53] <thumper> true that...
[01:59] <thumper> school run
[02:44]  * thumper is hanging out for synchronous bootstrap...
[02:44] <thumper> I'm waiting for canonistack to come back saying it is up
[02:44] <thumper> no idea what the state is right now
[02:44] <thumper> all I know is that it has started an instance
[02:58] <jam1> thumper: sinzui: We delivered a 1.16.4.1 to NEC, that I'd *really* like to be called an official 1.16.4 so we don't have to tilt our heads every time we look at it. Beyond that we're in a situation where we've got enough work queued up in the pipeline that we're afraid to release trunk, and I'd really like to get us out of that situation.
[03:12] <thumper> jam: why afraid?
[03:18] <thumper> OH FF!!!!@!
[03:20] <thumper> loaded invalid environment configuration: required environment variable not set for credentials attribute: User
[03:20] <thumper> that is on the server side, worked client side...
[03:20] <thumper> GRR...
[03:22]  * thumper cringes...
[03:28]  * thumper needs to talk this through with someone
[03:29] <thumper> axw, wallyworld_: either of you available to be a talking teddybear?
[03:29] <axw> thumper: sure, brb
[03:29] <wallyworld_> ok. i've never been a teddy bear before
[03:31] <axw> back..
[03:31] <thumper> https://plus.google.com/hangouts/_/76cpibimp3dvc1rugac12elqh8?hl=en
[03:49] <thumper> mramm: ta
[03:49] <mramm> you are most welcome
[04:15]  * thumper downloads failing machine log
[04:15] <thumper> for investigation
[05:06] <axw> jam: are you happy for me to land my synchronous bootstrap branch after addressing fwereade's comments? do you have anything else to add?
[05:07] <jam> axw: I haven't actually had a chance to look at it, though I never actually audited the patch, either. I just tested it out.
[05:07] <axw> ok
[05:07] <jam> so if someone like william has actually done code review, I'm fine with it landing
[05:08] <jam> just don't count it as "I've looked closely and I'm happy with everything in there" :)
[05:08] <jam> axw: if you have particular bits that you're more unsure of, you can point me to them, and I can give you directed feedback
[05:08] <axw> jam: no problems
[05:08] <jam> I'm just distracted with other things
[05:08] <axw> ok
[05:09] <axw> jam: I think it's ok, just wanted to make sure you didn't want to have a look before I went ahead
[08:10] <fwereade> man, it creeps me out when my phone tells me I've sent a new review of something *before* the new page renders in the browser
[08:13] <jam>  fwereade :)
[08:13] <jam> morning
[08:13] <fwereade> jam, heyhey
[08:24] <fwereade> jam, so, about 1.16.4 et al... I quite like the idea of making our current projected 1.16.5 into 1.18, and getting that out asap, with a view to the rest of the CLI API landing in 1.20 as soon as it's ready
[08:24] <fwereade> jam, drawbacks?
[08:25] <jam> fwereade: well we do have actual bugfixes for 1.16 as well, and I'm concerned about future precedent (what happens the next time we have to do a stop-the-line fix for a client using an older version of juju) but whatever we pick today can work
[08:25] <jam> fwereade: the fact that it has been ~1month since we did any kind of a release is surprising for me
[08:25] <fwereade> jam, well, the issue here is that the "fix" was really more of a "feature"
[08:25] <jam> and also makes me wonder if we've got all the balance points correct.
[08:26] <jam> fwereade: necessary work for a stable series that took precedence over other feature work
[08:26] <jam> fwereade: fwiw we did the same in 1.16.2 (maas-agent-name is terribly incompatible)
[08:26] <fwereade> jam, ha, yes
[08:27] <jam> 1.16.3 is a genuine bugfix, as it is a single line
[08:27] <jam> fwereade: so a gut feeling is that we are being inconsistent about the size of the meaning of an X+2 stable release
[08:27] <jam> 1.16 vs 1.18 is going to be tiny, but 1.18 vs 1.20 is going to be massive
[08:27] <jam> thats ok
[08:28] <jam> but it is a drawback
[08:28] <jam> fwereade: I do *like* the idea of stable being bugfixes only, but that assumes we have the features implemented that someone on a stable series actually needs
[08:28] <fwereade> jam, well, it won't be any bigger than today's 1.16->1.18 already would be
[08:28] <jam> fwereade: sure, but 1.12 -> 1.14 -> 1.16 -> ? were all similarly sized
[08:28] <fwereade> jam, I do agree that's not nice in itself
[08:30] <fwereade> jam, OTOH "size of change" is a bit subjective anyway and I'm not sure it's a great yardstick against which to measure minor version changes
[08:32] <jam> fwereade: it's very much a gut feeling about "hmmm, this seems strange" more than any sort of concrete thing. I'd rather have regular cadenced .X+2 releases (1/month? 1/3 months?) and then figure out the rest as we go
[08:32] <jam> I *do* feel that 1.17 is building up a lot of steam pressure as we keep piling stuff into it and haven't done a release
[08:32] <jam> as in, we're more likely to break lots of stuff because people haven't been able to give any incremental feedback
[08:32] <fwereade> jam, yeah, I'd like to get a release off trunk soon
[08:33] <jam> fwereade: at which point, if we had 1.17.0 released, and then we needed something for NEC, what would you prefer ?
[08:33] <jam> it feels really odd to jump from 1.16.3 => 1.18 over the 1.17 series, though obviously we can do it
[08:33] <jam> but 1.17 would then contain "all of" 1.18
[08:34] <fwereade> jam, well the problem is that 1.17 (if it existed) would contain a bunch of things that aren't in 1.18
[08:34] <fwereade> jam, so it's a good thing it doesn't ;)
[08:34] <jam> fwereade: well 1.18 would contain lots of things not in 1.17 either
[08:34] <jam> sorry
[08:34] <fwereade> jam, ;p
[08:34] <jam> 1.17 would contain lots of things not in 1.18
[08:35] <jam> as in, they are modestly unrelated
[08:35] <fwereade> jam, yeah
[08:35] <jam> and we only didn't have 1.17.0 because the release didn't go out the week before
[08:35] <jam> (probably because of test suite failures)
[08:35] <jam> CI
[08:35] <jam> I don't think I ever heard a "we can't release 1.17.0 because of X" from sinzui
[08:36] <fwereade> jam, honestly, in general, I would prefer that specific client fixes be individual hotfixes based on what they were already running -- but, yes, in this case it didn't come out that way
[08:36] <jam> fwereade: why would you rather have client-specific branches?
[08:36] <jam> it seems far more maintenance than getting the actual code out as a marked & tagged release
[08:37] <jam> fwereade: as in, NEC comes back and does "juju status" wtf version are they running ?
[08:37] <fwereade> jam, because it minimizes the risk/disruption at the client end
[08:37] <jam> if we give them a real release
[08:37] <jam> then they have an obvious match
[08:37] <jam> and the guy in #juju-core can reason about it
[08:37] <fwereade> jam, that problem already exists because we fucked up and let --upload-tools out into general use
[08:37]  * fwereade makes grumpy face
[08:38] <jam> fwereade: well, we can tell it is at least a custom build, and what it is based upon
[08:38] <fwereade> jam, well, not really
[08:39] <fwereade> jam, eg they weren't using a custom build, but they did have a .build version, because --upload-tools
[08:40] <jam> fwereade: sure, but we can tell that it is version X + epsilon (in this case epsilon = 0), that is still better than a pure hotfix that doesn't match any release at all
[08:41] <fwereade> jam, ok, my preference is predicated on the idea that we would be able to bring clients back into the usual stream once we have a real release that includes the bits they need, and that is not necessarily realistic
[08:41] <jam> fwereade: well it isn't realistic if we don't actually release their bits :)
[08:41] <fwereade> jam, the trouble is that we have no information about epsilon
[08:41] <fwereade> jam, ha
[08:41] <jam> fwereade: and there is the "will NEC use 1.18" ? Only if forced ?
[08:41] <fwereade> jam, yeah, that's the question
[08:42] <fwereade> jam, I do agree that always having clients on officially released versions is optimal
[08:43] <jam> fwereade: so we could have done, "we're going to need some big patches for them, bump our current 1.18, and prepare a new 1.18 based on 1.16"
[08:43] <jam> a concern in the future is "what if we already have 1.18.0 out the door" because they are using an old stable
[08:43] <jam> IIRC, IS is still using 1.14 in a lot of places
[08:43] <jam> but  I *think* we got them off of 1.13
[08:44] <jam> well, 1.13.2 from trunk before it was actually released, etc.
[08:44] <jam> (I remember doing help with them and 'fixing' something that was in 1.13.2 but not in their build of it :)
[08:44] <jam> fwereade: so also, "tools-url" lets you put whatever you want into "official" versions as well
[08:46] <fwereade> jam, sure, tools-url lets you lie if you're so inclined -- but --upload-tools forces a lie on you whether you want it or not
[08:46] <jam> fwereade: "want it or not" you did ask for it :)
[08:47] <fwereade> jam, well, not so much... if you just want to upload the local jujud you don't *really* also want funky versions
[08:57] <rogpeppe1> mornin' all
[09:00] <wallyworld_> fwereade: when you are free i'd like a quick chat, maybe you can ping me when convenient. i may have to pop out for a bit at some point so if i don't respond i'm not ignoring you
[09:00] <fwereade> wallyworld_, sure, 5 mins?
[09:01] <wallyworld_> ok
[09:01] <fwereade> rogpeppe1, heyhey, I think you have a couple of branches LGTMed and ready to land
[09:01] <rogpeppe1> fwereade: ah, yes. i can think of at least one.
[09:02] <rogpeppe1> fwereade: BTW i spent some time with Peter Waller yesterday afternoon trying to get his juju installation up again
[09:02] <fwereade> rogpeppe1, ah, thank you
[09:02] <rogpeppe1> fwereade: an interesting and illuminating exercise
[09:02]  * fwereade peers closely at rogpeppe1, tries to determine degree of deadpan
[09:02] <rogpeppe1> fwereade: gustavo's just added a feature to the txn package to help recover in this kind of situaton
[09:03] <rogpeppe1> fwereade: actually said totally straight
[09:03] <rogpeppe1> fwereade: we found lots of places that referred to transactions that did not exist
[09:04]  * fwereade looks nervous
[09:04] <rogpeppe1> fwereade: that's probably because we don't run in safe mode (we don't ensure an fsync before we assume a transaction has been successfully written)
[09:05] <rogpeppe1> fwereade: so when we ran out of disk space, this kind of thing can happen
[09:05] <fwereade> wtf, I could have sworn I double-checked that *months* ago
[09:05] <rogpeppe1> fwereade: we should probably run at least the transaction operations safely
[09:06] <fwereade> rogpeppe1, surely we should run safely full stop
[09:06] <rogpeppe1> fwereade: probably
[09:06] <rogpeppe1> fwereade: it'll be interesting to see how much it slows things down
[09:07] <fwereade> rogpeppe, indeed
[09:08] <rogpeppe> fwereade: yeah, i don't see any calls to SetSafe in the code
[09:11] <jam> rogpeppe: any understanding of why it isn't the default ?
[09:11] <rogpeppe> jam: i'm just trying to work out what the default actually is
[09:13] <jam> rogpeppe: you can use go and call Session.Safe()
[09:14] <rogpeppe> jam: yeah, i'm just doing that
[09:15] <jam> rogpeppe: "session.SetMode(Strong, true)" is called in DialWithInfo
[09:16] <jam> but I don't think that changes Safe
[09:16] <rogpeppe> jam: i think that's orthogonal to the safety mode
[09:16] <rogpeppe> jam: the safety mode we use is the zero one
[09:16] <jam> newSession calls "SetSafe(&Safe{})"
[09:16] <jam> but what does that actually trigger by default?
[09:17] <jam> rogpeppe: "if safe.WMode == "" => w = safe.WMode"
[09:17] <rogpeppe> jam: huh?
[09:17] <jam> rogpeppe: mgo session.go, ensureSafe() when creating a new session
[09:17] <jam> it starts with "Safe{}" mode
[09:18] <jam> then...
[09:18] <jam> I'm not 100% sure
[09:18] <jam> I thought gustavo claimed it was Write mode safe by default
[09:18] <jam> but I'm not quite seeing that
[09:18] <rogpeppe> jam: ah, you mean safe.WMode != ""
[09:18] <jam> rogpeppe: oddly enough, SetSafe calls ensureSafe which means you can never decrease the safe mode from the existing value
[09:20] <rogpeppe> jam: i'm not sure
[09:20] <rogpeppe> jam: i think that safeOp might make a difference there
[09:21] <jam> rogpeppe: so, we can set it the first time, but it looks like if we've ever set safe before (which happens when you call newSession automatically) then it gets set to the exact value
[09:21] <jam> but after that point
[09:22] <jam> it *might* be that the safe value must be greater than the existing one
[09:22] <jam> the actual logic is hard for me to sort out
[09:22] <rogpeppe> jam: yeah, i think it does - the comparison logic is only triggered if safeOp is non-nil
[09:22] <jam> rogpeppe: sure, comparison only if safeOp is non nil, but mgo calls SetSafe as part of setting up the session
[09:22] <rogpeppe> jam: SetSafe sets safeOp to nil before calling ensureSafe
[09:22] <jam> which means by the time a *user* gets to it
[09:22] <jam> ah
[09:22] <jam> k
[09:23] <jam> I missed that, fair point
[09:24] <jam> rogpeppe: so by default mgo does read back getLastError
[09:24] <jam> (as SetSafe(Safe{}) at least does the getLastError check)
[09:24] <jam> however, it doesn't actually set W or WMode
[09:25] <rogpeppe> jam: yeah, i think we should have WMode: "majority"
[09:25] <rogpeppe> jam: and FSync: true
[09:25] <jam> rogpeppe: inconsequential until we have HA, but I agree
[09:25] <rogpeppe> jam: the latter is still important
[09:26] <jam> rogpeppe: I would say if we have WMode: majority we may not need fsync, it depends on the catastrophic failure you have to worry about.
[09:26] <rogpeppe> jam: i'm not sure.
[09:26] <rogpeppe> jam: it depends if we replicate the logs to the HA servers too
[09:27] <rogpeppe> jam: if we do, then all are equally likely to run out of disk space at the same time
[09:27] <rogpeppe> jam: we really need a better story for that situation anyway though
[09:27] <jam> rogpeppe: running out of disk space should be something we address, but I don't think it relates to FSync concern for us, does it?
[09:27] <rogpeppe> jam: if you don't fsync, you don't know when you've run out of disk space
[09:29] <rogpeppe> jam: so for instance you can have a transaction that gets added and then set on the relevant document, but the transaction can be lost
[09:29] <rogpeppe> jam: which is the situation that seems to have happened in peter's case
[09:29] <jam> rogpeppe: but we only ran out of disk space because of the log exploding, right? the mgo database doesn't grow all that much on its own
[09:29] <jam> so if we fix logs, then we'll avoid that side of it
[09:29] <rogpeppe> jam: well...
[09:30] <rogpeppe> jam: there was another interesting thing that happened yesterday, and i have no current explanation
[09:30] <rogpeppe> jam: the mongo database had been moved onto another EBS device (one with more space)
[09:31] <rogpeppe> jam: with a symlink from /var/lib/juju to that
[09:31] <rogpeppe> jam: when we restarted mongo, it started writing data very quickly
[09:31] <rogpeppe> jam: and the size grew from ~3GB to ~14GB in a few minutes
[09:31] <rogpeppe> jam: before we stopped it
[09:32] <rogpeppe> jam: we fixed it by doing a mongodump/mongorestore
[09:32] <rogpeppe> jam: (the amount of data when dumped was only 70MB)
[09:32] <jam> rogpeppe: I have a strong feeling you were already in an error state that was running out of control (for whatever reason). My 10k node setup was on the order of 600MB, IIRC
[09:33] <rogpeppe> jam: quite possibly. i've no idea what kind of error state would cause that though
[09:33] <jam> rogpeppe: mgo txns that don't exist causing jujud to try to fix something that isn't broken over and over ?
[09:33] <jam> I don't really know either, I'm surprised dump restore fixed it
[09:34] <jam> as that sounds like its a bug in mongo
[09:34] <rogpeppe> jam: i'm pretty sure it wasn't anything to do with juju itself.
[09:34] <rogpeppe> jam: i'm not sure it *could* grow the transaction collection that fast
[09:34] <rogpeppe> jam: it's possible that it's some kind of mongo bug
[09:38] <jam> rogpeppe: so I think I would set the write concern to at least 1, and the Journal value to True, rather than FSync.
[09:39] <rogpeppe> jam: what are the implications of the J value? the comment is somewhat obscure to me.
[09:39] <jam> rogpeppe: "write the data to the journal" vs "fsync the whole db"
[09:39] <jam> rogpeppe: http://docs.mongodb.org/manual/reference/command/getLastError/#dbcmd.getLastError
[09:40] <jam> "In most cases, use the j option to ensure durability..." as the doc under "fsync"
[09:48]  * TheMue just used the synchronous bootstrap for the first time. feels better. but the bootstrap help text still says it's asynchronous
[09:50] <jam> rogpeppe: but I'd go for a patch that after Dial immediately calls EnsureSafe(&Safe{WMode: "majority", J: true})
[10:03] <fwereade> TheMue, well spotted, would you quickly fix it please? ;p
[10:03] <fwereade> jam, +1
[10:12] <TheMue> fwereade: yep, will do
[10:12] <fwereade> TheMue, <3
[10:14] <TheMue> but I need a sed specialist ;) how do I prefix all lines of a file with a given string?
[10:31] <jam> TheMue: well you could do: "sed -i.back -e 's/\(.*\)/STUFF\1/'"
[10:31] <jam> but I'm not sure if sed is the best fit for it
[10:32] <TheMue> jam: thx. I'm also open for other ideas, otherwise I'll use my editor
[10:33] <jam> TheMue: if you have tons of stuff, sed is fine for it; with vim: gg ^V G I STUFF <ESC>
[10:33] <jam> (go top, block insert, Go bottom, Insert all, write STUFF, ESC to finish)
[10:34] <TheMue> cool
[10:34] <TheMue> will try after proposal
[10:42] <TheMue> fwereade: dunno if my english is good enough: https://codereview.appspot.com/36520043/
[10:47] <fwereade> jam, TheMue, standup
[11:46] <wallyworld_> jam: fwereade: i'm back if you wanted to discuss the auth keys plugin. or not. ping me if you do.
[11:46] <mgz> wallyworld_: you could rejoin hangout
[11:46] <natefinch> wallyworld_: we're still in the hangout if you want to pop back in
[11:46]  * mgz wins!
[11:46]  * natefinch is too slow
[11:46] <natefinch> :)
[13:18]  * dimitern lunch
[13:58] <TheMue> Anyone interested in reviewing my Tailer, the first component for the debug log API: https://codereview.appspot.com/36540043
[13:58] <TheMue> It's intended to do filtered tailing of any ReadSeeker (like a File) into a Writer
[14:05] <rogpeppe> dimitern: ping
[14:08] <TheMue> rogpeppe: quick look on https://codereview.appspot.com/36540043 ?
[14:08] <rogpeppe> TheMue: will do
[14:09] <TheMue> rogpeppe: thanks
[14:10] <dimitern> rogpeppe, pong
[14:13] <rogpeppe> dimitern: i'm wondering about the upgrade-juju behaviour
[14:13] <rogpeppe> dimitern: in particular: when was the checking for version consistency introduced?
[14:13] <dimitern> rogpeppe, yeah?
[14:14] <dimitern> rogpeppe, recently
[14:14] <rogpeppe> dimitern: after 1.16?
[14:14] <dimitern> rogpeppe, yes
[14:14] <rogpeppe> dimitern: the other thing is: does it ignore dead machines when it's checking?
[14:15] <dimitern> rogpeppe, take a look at SetEnvironAgentVersion
[14:16] <dimitern> rogpeppe, it just checks tools versions, not the life
[14:16] <rogpeppe> dimitern: hmm, i think that's probably wrong then
[14:16] <rogpeppe> dimitern: if an agent is dead, i think we probably don't care about its version
[14:17] <dimitern> rogpeppe, perhaps we can unset agent version from dead machines anyway
[14:17] <rogpeppe> dimitern: but it's a good thing it's not released yet, because that logic won't prevent peter waller from upgrading his environment currently
[14:17] <rogpeppe> dimitern: i don't think that's necessary
[14:17] <rogpeppe> dimitern: i think setting life should set life only
[14:18] <rogpeppe> dimitern: and it's just possible that the agent version info could be useful to someone, somewhere, i guess
[14:18] <dimitern> rogpeppe, well, it's not just that actually
[14:18] <dimitern> rogpeppe, uprgade-juju does the check of version constraints before trying to change it
[14:19] <dimitern> rogpeppe, so in fact it will have helped in peter's case not to upgrade to more than 1.14
[14:20] <rogpeppe> dimitern: what do mean by the version constraints?
[14:20] <dimitern> rogpeppe, "next stable" logic
[14:20] <dimitern> rogpeppe, (or current, failing that)
[14:21] <jamespage> sinzui, hey - about to start working on 1.16.4 - I see some discussion on my observation about whether this is really a stable release - what's the outcome? what do I need todo now?
[14:22] <sinzui> jamespage, I was just writing the list to summarise jam's argument
[14:23] <sinzui> jamespage, from the dev's perspective this is a stable release because it addresses issues with how juju is currently used. Some papercuts are improvements/features, but they are always 100% compatible.
[14:23] <jamespage> sinzui, we are going to struggle with a minor release exception if that is the case
[14:24] <sinzui> jamespage, devel and minor version increments are bug features and version incompatibilities
[14:24]  * TheMue has to step out for his office appointment, will return later
[14:24] <sinzui> s/bug/big features/
[14:24] <jamespage> sinzui, I'll discuss with a few SRU people
[14:24] <dimitern> rogpeppe, and anyway the case you're describing is very unusual - dead machines with inconsistent agent versions - that would never have happened if the usual upgrade process had been followed
[14:24] <jamespage> sinzui, I did look at the changes - I think the plugin I could probably swing with as its isolated from the rest of the codebase
[14:25] <jamespage> sinzui, the provisioner safe-mode feels less SRU'able
[14:25] <rogpeppe> dimitern: why not?
[14:25] <sinzui> jamespage, I am keen to do a release, I can make this 1.18.0 is a couple of hours. The devs are a little more reticent.
[14:25] <dimitern> rogpeppe, well, unless you force it ofc
[14:25] <dimitern> rogpeppe, due to the version constraints checks
[14:25] <rogpeppe> dimitern: the machines have been around for a long time - their instances were manually destroyed from the aws console AFAIK
[14:26] <jamespage> sinzui, fwiw I'm trying to SRU all 1.16.x point releases to saucy as evidence that juju-core is ready for a MRE for trusty
[14:26] <rogpeppe> dimitern: those checks aren't in 1.16 though, right?
[14:26] <dimitern> rogpeppe, no
[14:26] <rogpeppe> dimitern: "no they aren't" or "no that's wrong" ?
[14:27] <dimitern> rogpeppe, sorry, no they aren't
[14:27] <rogpeppe> dimitern: ok, cool
[14:28] <sinzui> jamespage, That is admirable. If the devs were producing smaller features to release a stable each month, would that cause pain?
[14:28] <rogpeppe> dimitern: it might be a bit of a problem that one broken machine can prevent upgrade of a whole environment, but... can we manually override by specifying --version ?
[14:28]  * sinzui thinks enterprise customers get juju from the location CTS points to, so rapid increments are always fine
[14:29]  * jamespage thinks about sinzui's suggestion
[14:35] <mgz> jamespage: the extra plugin was actually trying to do a bug fix in a non-intrusive way... unfortunately that does mean packaging changes instead which isn't really what you want for a minor version
[14:36] <jamespage> mgz, I guess my query is about whether a feature that allows you to backup/restore a juju environment should be landing on a stable release branch
[14:37] <jamespage> mgz, (I appreciate the way the plugin was done does isolate it from the rest of the codebase - which avoids regression potentials)
[14:47] <dimitern> my main fuse tripped and it trips again when I turn it back on, unless I stop one of the other ones, so now I have no power on any outlet in the living room and had to do some trickery to get it to work from the bedroom :/
[14:48] <jamespage> mgz, hey - any plans on bug 1241674
[14:48] <_mup_> Bug #1241674: juju-core broken with OpenStack Havana for tenants with multiple networks <cts-cloud-review> <openstack-provider> <juju-core:Triaged> <https://launchpad.net/bugs/1241674>
[14:48] <jamespage> its what I get most frequently asked about these days
[14:48] <mgz> jamespage: yeah, I should post summary to that bug
[14:48] <mgz> so then those people who ask have something to read
[14:49] <jamespage> please do
[15:07] <fwereade> sounds like dimitern is having persistent power problems and we might not see him again today
[15:07] <rogpeppe> fwereade: ping
[15:08] <fwereade> rogpeppe, pong
[15:08] <rogpeppe> fwereade: would you be free for a little bit
[15:08] <fwereade> rogpeppe, maybe, but I'll have to drop off to talk to green in max 20 mins
[15:09] <rogpeppe> fwereade: that would be fine
[15:09] <fwereade> rogpeppe, consider me at your service then
[15:09] <rogpeppe> fwereade: https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig?authuser=1
[15:09] <rogpeppe> fwereade: (with peter waller)
[15:29] <sinzui> jamespage, I replied to juju 1.16.4 conversation on the list. I think you may want to correct or elaborate on what I wrote
[15:40] <rogpeppe> this is really odd
[16:18] <rogpeppe> niemeyer: ping
[16:20] <mgz> rogpeppe: I refreshed a branch you already reviewed for the update-bootstrap tweaks btw
[16:23] <rogpeppe> mgz: ok, will have a look
[16:25] <rogpeppe> mgz: am currently still trying to sort out this broken environment
[16:25] <rogpeppe> mgz: have you looked at mgo/txn at all, BTW?
[16:26] <mgz> alas no :)
[16:26] <mgz> just enough to add some operations to state
[16:26] <mgz> didn't try and understand how it was actually working
[16:27] <jam> jamespage: unfortunately I cleared my scrollback a bit, but I will say the "provisioner-safe-mode" is like *the key* bit that NEC actually needs, the rest is automation around stuff they can do manually.
[16:27] <jam> jamespage: is there a reason cloud-archive:tools is still reporting 1.16.0?
[16:27] <jam> sinzui: ^^
[16:27] <jamespage> jam: yes the SRU only just went into saucy - its waiting for processing in the cloud-tools staging PPA right now
[16:28] <jamespage> along with a few other things
[16:28] <jam> jamespage: k, jpds is having a problem with keyserver stuff and that is fixed in 1.16.2
[16:28] <sinzui> jamespage, is there anything I should be doing to speed that up?
[16:28] <jamespage> I'll poke smoser for review
[16:30] <jam> jamespage: "the SRU" of which version?
[16:30] <jamespage> 1.16.3
[16:31] <rogpeppe> jam: any idea what might be going on here? http://paste.ubuntu.com/6515237/
[16:31] <jam> jamespage: great
[16:31] <rogpeppe> jam: this is on the broken environment i mentioned in the standup
[16:31] <jam> rogpeppe: context?
[16:31] <jam> thx
[16:31] <rogpeppe> jam: note all the calls to txn.flusher.recurse
[16:32] <rogpeppe> jam: i *think* that indicates something's broken with transactions (which wouldn't actually be too surprising in this case)
[16:32] <jam> rogpeppe: the 'active' frame is the top one, right?
[16:33] <rogpeppe> jam: yes
[16:44] <niemeyer> rogpeppe: Heya
[16:44] <niemeyer> rogpeppe: So, problem solved?
[16:44] <rogpeppe> niemeyer: i'm not sure it is, unfortunately
[16:44] <niemeyer> rogpeppe: Haven't seen any replies since you've mailed him about it
[16:44] <rogpeppe> niemeyer: i've been working with him to try and bring things up again.
[16:45] <rogpeppe> niemeyer: i *thought* it was all pretty much working,
[16:45] <rogpeppe> niemeyer: but there appears to be something still up with the transaction queues
[16:45] <rogpeppe> niemeyer: this is the stack trace i'm seeing on the machine agent: http://paste.ubuntu.com/6515237/
[16:45] <rogpeppe> niemeyer: note the many calls to the recurse method
[16:46] <rogpeppe> niemeyer: it seems that nothing is actually making any progress
[16:47] <niemeyer> rogpeppe: Seems to be trying to apply transactions
[16:48] <rogpeppe> niemeyer: it does, but none seem to be actually being applied
[16:48] <niemeyer> rogpeppe: That's a side effect of having missing transactions
[16:49] <niemeyer> rogpeppe: Missing transaction documents, that is
[16:49] <niemeyer> rogpeppe: It'll refuse to make progress because the system was corrupted
[16:49] <rogpeppe> niemeyer: i thought the PurgeMissing call was supposed to deal with that
[16:49] <niemeyer> rogpeppe: So it cannot make reasonable progress
[16:49] <niemeyer> rogpeppe: Yes, it is
[16:49] <rogpeppe> niemeyer: so, we did that and it seemed to succeed
[16:50] <niemeyer> rogpeppe: Did it clean everything up?
[16:50] <niemeyer> rogpeppe: on the right database, etc
[16:50] <rogpeppe> niemeyer: yes, i believe so - it made a lot more progress (no errors about missing transactions any more)
[16:52] <niemeyer> rogpeppe: Tell him to kill the transaction logs completely, run purge-txns again
[16:52] <rogpeppe> niemeyer: ok
[16:52] <niemeyer> rogpeppe: Drop both txns and txns.log
[16:52] <niemeyer> rogpeppe: and txns.stash
[16:53] <rogpeppe> niemeyer: ok, trying that
[16:53] <niemeyer> rogpeppe: After that, purge-txns will cause a full cleanup
[16:55] <rogpeppe> niemeyer: when you say "drop", is that a specific call, or is it just something like db.txns.remove(nil)?
[16:55] <niemeyer> > db.test.drop()
[16:55] <niemeyer> true
[16:55] <niemeyer> >
[16:56] <niemeyer> rogpeppe: Be mindful.. there is no protection against doing major damage
[16:56] <rogpeppe> niemeyer: i am aware of that
[16:56] <rogpeppe> niemeyer: there is a backup though
[16:57] <niemeyer> rogpeppe: Yeah, I'm actually curious about one thing:
[16:57] <niemeyer> rogpeppe: the db dump I got.. was that the backup, or was that the one being executed live?
[16:57] <rogpeppe> niemeyer: that was a backup made at my instigation
[16:57] <sinzui> Bug #1257371 is a regression that breaks bootstrapping on aws and canonistack
[16:57] <_mup_> Bug #1257371: bootstrap fails because Permission denied (publickey) <bootstrap> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1257371>
[16:58] <rogpeppe> niemeyer: i.e. after the problems had started to occur
[16:58] <niemeyer> rogpeppe: Right, I'm pretty sure trying to run the system on that state would create quite a bit of churn in the database
[16:59] <niemeyer> rogpeppe: Depending on the retry strategies...
[17:00] <niemeyer> rogpeppe: This might explain why the database was growing
[17:00] <niemeyer> rogpeppe: and might also explain why the system is in that state you see now
[17:00] <rogpeppe> niemeyer: ok. let's hope this strategy works then
[17:00] <rogpeppe> niemeyer: just about to drop. wish me luck :-)
[17:00] <niemeyer> rogpeppe: The transactions may all be fine now.. but if you put a massive number of runners trying to finalize a massive number of pending and dependent transactions at once, it won't be great
[17:01] <niemeyer> rogpeppe: The traceback you pasted seems to corroborate that theory too
[17:01] <rogpeppe> niemeyer: collections dropped
[17:05] <rogpeppe> niemeyer: it's currently purged >10000 transactions
[17:06] <niemeyer> rogpeppe: There you go..
[17:06] <niemeyer> rogpeppe: No wonder it was stuck
[17:06] <rogpeppe> niemeyer: it's still going...
[17:06] <niemeyer> rogpeppe: That's definitely not the database I have here, by the way
[17:07] <niemeyer> rogpeppe: I did check the magnitude of proper transactions to be applied
[17:07] <rogpeppe> niemeyer: indeed not - i think they've all been started since this morning
[17:07] <rogpeppe> niemeyer: there were only a page or so this morning
[17:08] <niemeyer> rogpeppe: Well, a page of missing
[17:08] <niemeyer> rogpeppe: The problem now is a different one
[17:08] <rogpeppe> niemeyer: ah yes
[17:08] <niemeyer> rogpeppe: These are not missing or bad transactions
[17:08] <niemeyer> rogpeppe: They're perfectly good transactions that have been attempted continuously and in parallel, but unable to be applied because the system was wedged with a few transactions that were lost
[17:09] <niemeyer> rogpeppe: Then, once the system was restored to a good state, there was that massive amount of pending transactions to be applied.. and due to how juju is trying to do stuff from several fronts, there was an attempt to flush the queues concurrently
[17:10] <niemeyer> rogpeppe: Not great
[17:10] <niemeyer> rogpeppe: At the same time, a good sign that the txn package did hold the mess back instead of creating havoc
[17:11] <rogpeppe> niemeyer: yeah
[17:11] <rogpeppe> niemeyer: 34500 now
[17:12] <niemeyer> rogpeppe: Gosh
[17:13] <niemeyer> rogpeppe: How come it was running for so long?
[17:13] <niemeyer> rogpeppe: What happens when juju panics?  I guess we have upstart scripts that put it back alive?
[17:14] <rogpeppe> niemeyer: it *should* all be ok
[17:14] <rogpeppe> niemeyer: the main problem with panics is that when they recur continually, the logs fill up
[17:14] <rogpeppe> niemeyer: and that was the indirect cause of what we're seeing now
[17:15] <niemeyer> rogpeppe: Well, that's not the only problem.. :)
[17:15] <rogpeppe> niemeyer: indeed
[17:15] <rogpeppe> niemeyer: 5 whys
[17:15] <niemeyer> rogpeppe: "OMG, things are broken! Fix it!" => "Try it again!" => "OMG, things are broken! Fix it!" => "Once more!" => .....
[17:16] <niemeyer> rogpeppe: That's how we end up with tens of thousands of pending transactions :)
[17:16] <rogpeppe> niemeyer: well to be fair, we only applied one fix today
[17:17] <niemeyer> rogpeppe: Hmm.. how do you mean?
[17:17] <rogpeppe> niemeyer: we ran PurgeMissing
[17:18] <niemeyer> rogpeppe: Sorry, I'm missing the context
[17:18] <niemeyer> rogpeppe: I don't get the hook of "to be fair"
[17:18] <rogpeppe> niemeyer: ah, i thought you were talking about human intervention
[17:18] <rogpeppe> niemeyer: but perhaps you're talking about what the agents were doing
[17:19] <niemeyer> rogpeppe: No, I'm talking about the fact the system loops continuously doing more damage when we explicitly say in code that we cannot continue
[17:19] <rogpeppe> niemeyer: right
[17:19] <rogpeppe> niemeyer: it's an interesting question as to what's the best approach there
[17:20] <rogpeppe> niemeyer: i definitely think that some kind of backoff or retry limit would be good
[17:20] <niemeyer> rogpeppe: Yeah, I think we should enable that in our upstart scripts
[17:20] <niemeyer> rogpeppe: This is a well known practice, even in systems that take the fail-and-restart approach to heart
[17:20] <niemeyer> rogpeppe: (e.g. erlang)
[17:20] <niemeyer> rogpeppe: (or, Erlang OTP, more correctly)
[17:21] <rogpeppe> niemeyer: yeah
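The upstart-side mitigation being discussed could be sketched like this (the job name and numbers are illustrative, not juju's actual configuration):

```
# /etc/init/jujud-machine-0.conf (illustrative job name)
respawn
# give up if the job respawns more than 10 times in 90 seconds,
# instead of looping forever and piling up pending transactions
respawn limit 10 90
```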
[17:23] <rogpeppe> niemeyer: hmm, 70000 transactions purged so far. i'm really quite surprised there are that many
[17:31] <niemeyer> rogpeppe: Depending on how far that goes, it might be wise to start from that backup instead of that crippled live system
[17:34] <rogpeppe> niemeyer: latest is that it has probably fixed the problem
[17:34] <rogpeppe> niemeyer: except...
[17:35] <rogpeppe> niemeyer: that now amazon has rate-limited the requests because we'd restarted too often (probably)
[17:35] <rogpeppe> niemeyer: so hopefully that will have resolved by the morning
[17:35] <niemeyer> rogpeppe: Gosh..
[17:36] <rogpeppe> niemeyer: lots of instance id requests because they've got a substantial number of machines in the environment which are dead (with missing instances)
[17:37] <rogpeppe> niemeyer: and if we get a missing instance, we retry because amazon might be lying due to eventual consistency
[17:37] <rogpeppe> niemeyer: so we make more requests than we should
[17:47] <niemeyer> rogpeppe: Right
[18:22]  * rogpeppe is done for the day
[18:22] <rogpeppe> g'night all
[18:40] <smoser> hey.
[18:40] <smoser> before i write an email to juju-dev
[18:40] <smoser> can someone tell me real quick if there is some plan (or existing path) that a charm can indicate that it can or cannot run in a lxc container
[18:41] <smoser> and if so, any modules that it might need access to or devices (or kernel version or such)
[18:52] <natefinch> smoser: I don't think we have any such thing today.... I don't know of a plan to include such a thing.
[18:53] <smoser> thanks.
[20:12] <thumper> morning
[20:12] <thumper> also, WTF?
[20:12] <thumper> anyone got a working environment up?
[20:13] <thumper> I get: ERROR <nil> when I go 'juju add-machine'
[20:13] <thumper> anyone else confirm?
[20:13] <natefinch> doh
[20:13] <natefinch> thumper: lemme give it a try, half a sec, need to switch to trunk
[20:13] <thumper> kk
[20:15] <thumper> oh, and yay
[20:15] <thumper> with the kvm local provider I can create nested kvm
[20:15]  * thumper wants to try lxc in kvm in kvm
[20:15] <thumper> heh...
[20:15] <natefinch> just keep nesting until something breaks
[20:15] <thumper> also means I can fix the kvm provisioner code without needing to start canonistack
[20:15] <natefinch> awesome
[20:15] <thumper> natefinch: I've heard from robie that three deep causes problems
[20:16] <thumper> but I've not tested
[20:16] <thumper> also, memory probably an issue...
[20:16] <thumper> the outer kvm would need more ram for the inner kvm to work properly
[20:16] <natefinch> where's your sense of adventure?
[20:16] <thumper> but that too would allow me to test the hardware characteristics
[20:16]  * thumper has 16 gig of ram
[20:16] <thumper> lets do this
[20:17] <natefinch> :D
[20:17] <thumper> after I've fixed the bug that is...
[20:17] <thumper> kvm container provisioner is panicking
[20:19] <natefinch> no one on warthogs wants to talk about google compute engine evidently...
[20:19] <thumper> heh
[20:20]  * thumper goes to write a stack trace function for loggo
[20:20] <natefinch> some day juju status will return
[20:20] <natefinch> and then I can try add machine
[20:27] <natefinch> thumper: add machine works for me on trunk/ec2
[20:27] <thumper> no error?
[20:27] <natefinch> correct
[20:27] <thumper> it could well be linked to the kvm stuff
[20:27] <thumper> ta, I'll keep digging
[20:28] <natefinch> welcome
[20:28] <hazinhell> natefinch, what about it?
[20:28] <hazinhell> gce that is
[20:51] <thumper> I was wondering why my container in a container was taking so long to start
[20:51] <thumper> it seems the host is downloading the cloud image
[20:51] <thumper> what we really want is a squid cache on the host machine
[20:51] <thumper> who knows squid?
[20:55] <natefinch> ......crickets
[21:06]  * thumper hangs his head
[21:06] <thumper> damn networking
[21:06] <thumper> so, this kinda works...
[21:08]  * thumper wonders where the "br0" is coming from...
[21:08]  * thumper thinks...
[21:09] <thumper> ah
[21:09] <thumper> DefaultKVMBridge
[21:09]  * thumper tweaks the local provider to make eth0 bridged
[21:10]  * thumper wonders how crazy this is getting
[21:10] <natefinch> thumper: I can't see br0 without thinking "You mad bro?"
[21:10] <thumper> heh
[21:11] <natefinch> and usually, if I'm looking at br0, I'm mad
[21:12] <thumper> :)
[21:16] <hazinhell> thumper, varnish ftw ;-)
[21:16] <hazinhell> thumper, are we setting up lxc on a different bridge than kvm?
[21:16] <thumper> hazinhell: varnish?
[21:17] <thumper> hazinhell: well, lxc defaults to lxcbr0 and kvm to virbr0
[21:17] <thumper> the config wasn't setting one
[21:17] <thumper> and for a container inside the local provider we need to have bridged eth0
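Bridging the host's eth0 in the classic ifupdown style would look roughly like this (an illustrative fragment, not the local provider's generated config; exact details depend on the host):

```
# /etc/network/interfaces (illustrative)
auto br0
iface br0 inet dhcp
    bridge_ports eth0   # enslave eth0 to the bridge
    bridge_stp off
    bridge_fd 0
```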
[21:17] <hazinhell> thumper, varnish over squid for proxy.
[21:17] <thumper> hazinhell: docs?
[21:18] <hazinhell> thumper, varnish-cache.org.. but if you're using one of the apt proxies, afaik only squid is setup for that
[21:19] <thumper> hazinhell: what I wanted was a local cache of the lxc ubuntu-cloud image and the kvm one
[21:19] <thumper> to make creating container locally faster
[21:19] <thumper> as a new kvm instance needs to sync the images
[21:19] <thumper> to start an lxc or kvm container
[21:20] <hazinhell> thumper, lxc already caches
[21:20] <thumper> hazinhell: not for this use case
[21:20] <thumper> because it is a new machine
[21:20] <thumper> hazinhell: consider this ...
[21:20] <thumper> laptop host
[21:21] <thumper> has both kvm and lxc images cached
[21:21] <thumper> boot up kvm local provider
[21:21] <thumper> start a machine
[21:21] <thumper> uses cache
[21:21] <thumper> then go "juju add-machine kvm:1"
[21:21] <thumper> machine 1, the new kvm instance, then syncs the kvm image
[21:21] <thumper> this goes to the internet to get it
[21:21] <thumper> I want a cache on the host
[21:21] <thumper> similarly if the new machine 1 wants an lxc image
[21:21] <hazinhell> ah.. nesting with cache access
[21:21] <thumper> it goes to the internet to sync image
[21:22] <thumper> ack
[21:22] <thumper> so squid cache on the host to make it faster
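A host-side squid cache for cloud images could start from something like this (a sketch; the sizes and paths are illustrative, and cloud images are large, so the object size limit matters):

```
# /etc/squid/squid.conf additions (illustrative values)
http_port 3128
cache_dir ufs /var/spool/squid 20000 16 256
# cloud images are multi-hundred-MB tarballs/disk images,
# well above squid's default object size limit
maximum_object_size 2048 MB
```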
[21:22] <hazinhell> thumper, what about mount the host cache over
[21:22] <thumper> for new machines starting containers
[21:22] <hazinhell> thumper, read mount
[21:22] <thumper> sounds crazy :)
[21:23] <hazinhell> it does.. you need some supervision tree to share the read mounts down the hierarchy
[21:23] <hazinhell> s/supervision/
[21:23] <hazinhell> thumper,
[21:24] <hazinhell> thumper, you could just do the host object storage (provider storage) and link the cache into that
[21:24] <thumper> surely a cache on the host would be less crazy
[21:24] <hazinhell> thumper, the host already has the cache, a mount of that directly into the guests, allows all the default tools to see it without any extra work on juju's part
[21:25] <hazinhell> doing a network endpoint, means you have to interject some juju logic to pull from that endpoint into the local disk cache
[21:25] <hazinhell> and you end up with wasted space
[21:27] <hazinhell> it's kind of a shame we can't use the same for both..
[21:28] <hazinhell> ie lxc is a rootfs and kvm is basically a disk image.
[21:28] <hazinhell> hmm
[21:30] <hazinhell> sadly can't quite loop dev the img and mount it into the cache, lxc wants a tarball there, would have to set it up as a container rootfs.
[22:00] <thumper> hmm...
[22:06] <thumper> hmm...
[22:27] <hazinhell> thumper, read mount sound good?
[22:28] <hazinhell> thumper, or something else come to mind?
[22:28] <thumper> hazinhell: busy fixing the basics at the moment
[22:28] <hazinhell> ack
[22:28]  * hazinhell returns to hell
[22:40] <thumper> anyone have an idea why my kvm machine doesn't have the networking service running?
[22:40]  * thumper steps back a bit