/srv/irclogs.ubuntu.com/2013/12/03/#juju-dev.txt

bigjoolserror: invalid service name "tarmac-1.4"00:22
bigjoolsyay00:22
davecheneyprobably the do00:26
davecheneypretty much only safe to use a-z, 0-9 and hyphen00:26
thumperwallyworld_: looking00:26
wallyworld_k00:27
davecheneys/do/dot00:27
bigjoolsdavecheney: but why... :/00:31
bigjoolsI tried with tarmac-14 and it still complained :(00:32
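A sketch of the validation bigjools is hitting, assuming a rule along the lines juju uses (the exact regexp here is an approximation): service names allow only lowercase letters, digits and hyphens, and no hyphen-separated segment may be all digits -- "tarmac-14" fails too because it would be ambiguous with a unit name like "tarmac/14".

    package main

    import (
        "fmt"
        "regexp"
    )

    // Approximation of the service-name rule davecheney describes: lowercase
    // letters, digits and hyphens, with the extra twist that no hyphenated
    // segment may be all digits -- which is why "tarmac-14" also fails.
    var validService = regexp.MustCompile(`^[a-z][a-z0-9]*(-[a-z0-9]*[a-z][a-z0-9]*)*$`)

    func main() {
        for _, name := range []string{"tarmac", "tarmac-1.4", "tarmac-14", "tarmac-v14"} {
            fmt.Printf("%-12s valid=%v\n", name, validService.MatchString(name))
        }
    }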
wallyworld_thumper: i didn't mention the listener because it is orthogonal to the management of keys in state. that aside, i'll continue with the current work then00:44
thumperwallyworld_: I think it is worth the few days effort now to make it easier00:45
wallyworld_me too00:45
thumperthe fact that it gives you a break to do something slightly different is a bonus00:45
wallyworld_if we had the road map all sorted it would be easier to know exactly what to do next00:45
=== waigani_ is now known as waigani
wallyworld_thumper: i just ran into that @%#^%!&@!^& issue where different values of c are used in different phases of the tests. ffs01:37
thumperhow?01:38
thumperstoring c?01:38
wallyworld_no, constructing a stub in the SetUpTest, where the stub took c *gc.C as the arg and then called c.Assert01:39
wallyworld_the c.Assert failed and the wrong go routine was stopped01:39
wallyworld_so the test said it passed01:39
thumperheh01:39
thumperoops01:39
wallyworld_well, oops smoops01:39
wallyworld_the same value of c should be used01:39
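A minimal sketch of the pitfall wallyworld_ describes, assuming gocheck-style suites (as juju's tests use): the stub built in SetUpTest closes over SetUpTest's *gc.C, so a later assertion failure aborts the wrong context and the test can still be reported as passing.

    package mypkg_test

    import (
        "testing"

        gc "launchpad.net/gocheck"
    )

    func Test(t *testing.T) { gc.TestingT(t) }

    type suite struct {
        stub func(string)
    }

    var _ = gc.Suite(&suite{})

    func (s *suite) SetUpTest(c *gc.C) {
        // BUG: the closure captures SetUpTest's c. When the stub later fails
        // an assertion during the test proper (possibly on another goroutine),
        // the panic aborts the wrong context, so the test itself can still
        // show up as passing.
        s.stub = func(got string) {
            c.Assert(got, gc.Equals, "expected")
        }
    }

    func (s *suite) TestStub(c *gc.C) {
        // Safer: rebind the stub so assertions use this test's c.
        s.stub = func(got string) {
            c.Assert(got, gc.Equals, "expected")
        }
        s.stub("expected")
    }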
wallyworld_also, our commands are hard to test sometimes01:40
wallyworld_cause they live in a main package01:40
wallyworld_so can't easily be mocked out from another package01:40
wallyworld_without refactoring the Run()01:41
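One way out, as a hedged sketch (all names here are hypothetical, not juju's actual code): move the command's logic behind a small interface so Run() becomes thin wiring, and any package can test the logic with a fake.

    package main

    import (
        "fmt"
        "io"
        "os"
    )

    // StatusGetter is a hypothetical seam: the command depends on this
    // interface rather than on a concrete API connection.
    type StatusGetter interface {
        Status() (string, error)
    }

    // runStatus holds the testable logic; a test in another package can call
    // it with a fake StatusGetter and an in-memory writer.
    func runStatus(conn StatusGetter, out io.Writer) error {
        st, err := conn.Status()
        if err != nil {
            return err
        }
        _, err = fmt.Fprintln(out, st)
        return err
    }

    type liveConn struct{}

    func (liveConn) Status() (string, error) { return "running", nil }

    func main() {
        // The command's Run would reduce to wiring: open the real
        // connection, then delegate to runStatus.
        if err := runStatus(liveConn{}, os.Stdout); err != nil {
            os.Exit(1)
        }
    }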
thumpertrue that...01:53
thumperschool run01:59
* thumper is hanging out for synchronous bootstrap...02:44
thumperI'm waiting for canonistack to come back saying it is up02:44
thumperno idea what the state is right now02:44
thumperall I know is that it has started an instance02:44
jam1thumper: sinzui: We delivered a 1.16.4.1 to NEC, that I'd *really* like to be called an official 1.16.4 so we don't have to tilt our heads every time we look at it. Beyond that we're in a situation where we've got enough work queued up in the pipeline that we're afraid to release trunk, and I'd really like to get us out of that situation.02:58
thumperjam: why afraid?03:12
thumperOH FF!!!!@!03:18
thumperloaded invalid environment configuration: required environment variable not set for credentials attribute: User03:20
thumperthat is on the server side, worked client side...03:20
thumperGRR...03:20
* thumper cringes...03:22
* thumper needs to talk this through with someone03:28
thumperaxw, wallyworld_: either of you available to be a talking teddybear?03:29
axwthumper: sure, brb03:29
wallyworld_ok. i've never been a teddy bear before03:29
axwback..03:31
thumperhttps://plus.google.com/hangouts/_/76cpibimp3dvc1rugac12elqh8?hl=en03:31
thumpermramm: ta03:49
mrammyou are most welcome03:49
* thumper downloads failing machine log04:15
thumperfor investigation04:15
axwjam: are you happy for me to land my synchronous bootstrap branch after addressing fwereade's comments? do you have anything else to add?05:06
jamaxw: I haven't actually had a chance to look at it, though I never actually audited the patch, either. I just tested it out.05:07
axwok05:07
jamso if someone like william has actually done code review, I'm fine with it landing05:07
jamjust don't count it as "I've looked closely and I'm happy with everything in there" :)05:08
jamaxw: if you have particular bits that you're more unsure of, you can point me to them, and I can give you directed feedback05:08
axwjam: no problems05:08
jamI'm just distracted with other things05:08
axwok05:08
axwjam: I think it's ok, just wanted to make sure you didn't want to have a look before I went ahead05:09
fwereademan, it creeps me out when my phone tells me I've sent a new review of something *before* the new page renders in the browser08:10
jam fwereade :)08:13
jammorning08:13
fwereadejam, heyhey08:13
fwereadejam, so, about 1.16.4 et al... I quite like the idea of making our current projected 1.16.5 into 1.18, and getting that out asap, with a view to the rest of the CLI API landing in 1.20 as soon as it's ready08:24
fwereadejam, drawbacks?08:24
jamfwereade: well we do have actual bugfixes for 1.16 as well, and I'm concerned about future precedent (what happens the next time we have to do a stop-the-line fix for a client using an older version of juju) but whatever we pick today can work08:25
jamfwereade: the fact that it has been ~1month since we did any kind of a release is surprising for me08:25
fwereadejam, well, the issue here is that the "fix" was really more of a "feature"08:25
jamand also makes me wonder if we've got all the balance points correct.08:25
jamfwereade: necessary work for a stable series that took precedence over other feature work08:26
jamfwereade: fwiw we did the same in 1.16.2 (maas-agent-name is terribly incompatible)08:26
fwereadejam, ha, yes08:26
jam1.16.3 is a genuine bugfix, as it is a single line08:27
jamfwereade: so a gut feeling is that we are being inconsistent about how big an X+2 stable release is meant to be08:27
jam1.16 vs 1.18 is going to be tiny, but 1.18 vs 1.20 is going to be massive08:27
jamthats ok08:27
jambut it is a drawback08:28
jamfwereade: I do *like* the idea of stable being bugfixes only, but that assumes we have the features implemented that someone on a stable series actually needs08:28
fwereadejam, well, it won't be any bigger than today's 1.16->1.18 already would be08:28
jamfwereade: sure, but 1.12 -> 1.14 -> 1.16 -> ? were all similarly sized08:28
fwereadejam, I do agree that's not nice in itself08:28
fwereadejam, OTOH "size of change" is a bit subjective anyway and I'm not sure it's a great yardstick against which to measure minor version changes08:30
jamfwereade: it's very much a gut feeling about "hmmm, this seems strange" more than any sort of concrete thing. I'd rather have regular cadenced .X+2 releases (1/month?, 1/3months?) and then figure out the rest as we go08:32
jamI *do* feel that 1.17 is building up a lot of steam pressure as we keep piling stuff into it and haven't done a release08:32
jamas in, we're more likely to break lots of stuff because people haven't been able to give any incremental feedback08:32
fwereadejam, yeah, I'd like to get a release off trunk soon08:32
jamfwereade: at which point, if we had 1.17.0 released, and then we needed something for NEC, what would you prefer ?08:33
jamit feels really odd to jump from 1.16.3 => 1.18 over the 1.17 series, though obviously we can do it08:33
jambut 1.17 would then contain "all of" 1.1808:33
fwereadejam, well the problem is that 1.17 (if it existed) would contain a bunch of things that aren't in 1.1808:34
fwereadejam, so it's a good thing it doesn't ;)08:34
jamfwereade: well 1.18 would contain lots of things not in 1.17 either08:34
jamsorry08:34
fwereadejam, ;p08:34
jam1.17 would contain lots of things not in 1.1808:34
jamas in, they are modestly unrelated08:35
fwereadejam, yeah08:35
jamand we only didn't have 1.17.0 because the release didn't go out the week before08:35
jam(probably because of test suite failures)08:35
jamCI08:35
jamI don't think I ever heard a "we can't release 1.17.0 because of X" from sinzui08:35
fwereadejam, honestly, in general, I would prefer that specific client fixes be individual hotfixes based on what they were already running -- but, yes, in this case it didn't come out that way08:36
jamfwereade: why would you rather have client-specific branches?08:36
jamit seems far more maintenance than getting the actual code out as a marked & tagged release08:36
jamfwereade: as in, NEC comes back and does "juju status" wtf version are they running ?08:37
fwereadejam, because it minimizes the risk/disruption at the client end08:37
jamif we give them a real release08:37
jamthen they have an obvious match08:37
jamand the guy in #juju-core can reason about it08:37
fwereadejam, that problem already exists because we fucked up and let --upload-tools out into general use08:37
* fwereade makes grumpy face08:37
jamfwereade: well, we can tell it is at least a custom build, and what it is based upon08:38
fwereadejam, well, not really08:38
fwereadejam, eg they weren't using a custom build, but they did have a .build version, because --upload-tools08:39
jamfwereade: sure, but we can tell that it is version X + epsilon (in this case epsilon = 0), that is still better than a pure hotfix that doesn't match any release at all08:40
fwereadejam, ok, my preference is predicated on the idea that we would be able to bring clients back into the usual stream once we have a real release that includes the bits they need, and that is not necessarily realistic08:41
jamfwereade: well it isn't realistic if we don't actually release their bits :)08:41
fwereadejam, the trouble is that we have no information about epsilon08:41
fwereadejam, ha08:41
jamfwereade: and there is the "will NEC use 1.18" ? Only if forced ?08:41
fwereadejam, yeah, that's the question08:41
fwereadejam, I do agree that always having clients on officially released versions is optimal08:42
jamfwereade: so we could have done, "we're going to need some big patches for them, bump our current 1.18, and prepare a new 1.18 based on 1.16"08:43
jama concern in the future is "what if we already have 1.18.0 out the door" because they are using an old stable08:43
jamIIRC, IS is still using 1.14 in a lot of places08:43
jambut  I *think* we got them off of 1.1308:43
jamwell, 1.13.2 from trunk before it was actually released, etc.08:44
jam(I remember doing help with them and 'fixing' something that was in 1.13.2 but not in their build of it :)08:44
jamfwereade: so also, "tools-url" lets you put whatever you want into "official" versions as well08:44
fwereadejam, sure, tools-url lets you lie if you're so inclined -- but --upload-tools forces a lie on you whether you want it or not08:46
jamfwereade: "want it or not" you did ask for it :)08:46
fwereadejam, well, not so much... if you just want to upload the local jujud you don't *really* also want funky versions08:47
rogpeppe1mornin' all08:57
wallyworld_fwereade: when you are free i'd like a quick chat, maybe you can ping me when convenient. i may have to pop out for a bit at some point so if i don't respond i'm not ignoring you09:00
fwereadewallyworld_, sure, 5 mins?09:00
wallyworld_ok09:01
fwereaderogpeppe1, heyhey, I think you have a couple of branches LGTMed and ready to land09:01
rogpeppe1fwereade: ah, yes. i can think of at least one.09:01
rogpeppe1fwereade: BTW i spent some time with Peter Waller yesterday afternoon trying to get his juju installation up again09:02
fwereaderogpeppe1, ah, thank you09:02
rogpeppe1fwereade: an interesting and illuminating exercise09:02
* fwereade peers closely at rogpeppe1, tries to determine degree of deadpan09:02
rogpeppe1fwereade: gustavo's just added a feature to the txn package to help recover in this kind of situation09:02
rogpeppe1fwereade: actually said totally straight09:03
rogpeppe1fwereade: we found lots of places that referred to transactions that did not exist09:03
* fwereade looks nervous09:04
rogpeppe1fwereade: that's probably because we don't run in safe mode (we don't ensure an fsync before we assume a transaction has been successfully written)09:04
rogpeppe1fwereade: so when we ran out of disk space, this kind of thing can happen09:05
fwereadewtf, I could have sworn I double-checked that *months* ago09:05
rogpeppe1fwereade: we should probably run at least the transaction operations safely09:05
fwereaderogpeppe1, surely we should run safely full stop09:06
rogpeppe1fwereade: probably09:06
rogpeppe1fwereade: it'll be interesting to see how much it slows things down09:06
=== rogpeppe1 is now known as rogpeppe
fwereaderogpeppe, indeed09:07
rogpeppefwereade: yeah, i don't see any calls to SetSafe in the code09:08
jamrogpeppe: any understanding of why it isn't the default ?09:11
rogpeppejam: i'm just trying to work out what the default actually is09:11
jamrogpeppe: you can use go and call Session.Safe()09:13
rogpeppejam: yeah, i'm just doing that09:14
jamrogpeppe: "session.SetMode(Strong, true)" is called in DialWithInfo09:15
jambut I don't think that changes Safe09:16
rogpeppejam: i think that's orthogonal to the safety mode09:16
rogpeppejam: the safety mode we use is the zero one09:16
jamnewSession calls "SetSafe(&Safe{})"09:16
jambut what does that actually trigger by default?09:16
jamrogpeppe: "if safe.WMode == "" => w = safe.WMode"09:17
rogpeppejam: huh?09:17
jamrogpeppe: mgo session.go, ensureSafe() when creating a new session09:17
jamit starts with "Safe{}" mode09:17
jamthen...09:18
jamI'm not 100% sure09:18
jamI thought gustavo claimed it was Write mode safe by default09:18
jambut I'm not quite seeing that09:18
rogpeppejam: ah, you mean safe.WMode != ""09:18
jamrogpeppe: oddly enough, SetSafe calls ensureSafe which means you can never decrease the safe mode from the existing value09:18
rogpeppejam: i'm not sure09:20
rogpeppejam: i think that safeOp might make a difference there09:20
jamrogpeppe: so, we can set it the first time, but it looks like if we've ever set safe before (which happens when you call newSession automatically) then it gets set to the exact value09:21
jambut after that point09:21
jamit *might* be that the safe value must be greater than the existing one09:22
jamthe actual logic is hard for me to sort out09:22
rogpeppejam: yeah, i think it does - the comparison logic is only triggered if safeOp is non-nil09:22
jamrogpeppe: sure, comparison only if safeOp is non nil, but mgo calls SetSafe as part of setting up the session09:22
rogpeppejam: SetSafe sets safeOp to nil before calling ensureSafe09:22
jamwhich means by the time a *user* gets to it09:22
jamah09:22
jamk09:22
jamI missed that, fair point09:23
jamrogpeppe: so by default mgo does read back getLastError09:24
jam(as SetSafe(Safe{}) at least does the getLastError check)09:24
jamhowever, it doesn't actually set W or WMode09:24
rogpeppejam: yeah, i think we should have WMode: "majority"09:25
rogpeppejam: and FSync: true09:25
jamrogpeppe: inconsequential until we have HA, but I agree09:25
rogpeppejam: the latter is still important09:25
jamrogpeppe: I would say if we have WMode: majority we may not need fsync, it depends on the catastrophic failure you have to worry about.09:26
rogpeppejam: i'm not sure.09:26
rogpeppejam: it depends if we replicate the logs to the HA servers too09:26
rogpeppejam: if we do, then all are equally likely to run out of disk space at the same time09:27
rogpeppejam: we really need a better story for that situation anyway though09:27
jamrogpeppe: running out of disk space should be something we address, but I don't think it relates to FSync concern for us, does it?09:27
rogpeppejam: if you don't fsync, you don't know when you've run out of disk space09:27
rogpeppejam: so for instance you can have a transaction that gets added and then set on the relevant document, but the transaction can be lost09:29
rogpeppejam: which is the situation that seems to have happened in peter's case09:29
jamrogpeppe: but we only ran out of disk space because of the log exploding, right? the mgo database doesn't grow all that much on its own09:29
jamso if we fix logs, then we'll avoid that side of it09:29
rogpeppejam: well...09:29
rogpeppejam: there was another interesting thing that happened yesterday, and i have no current explanation09:30
rogpeppejam: the mongo database had been moved onto another EBS device (one with more space)09:30
rogpeppejam: with a symlink from /var/lib/juju to that09:31
rogpeppejam: when we restarted mongo, it started writing data very quickly09:31
rogpeppejam: and the size grew from ~3GB to ~14GB in a few minutes09:31
rogpeppejam: before we stopped it09:31
rogpeppejam: we fixed it by doing a mongodump/mongorestore09:32
rogpeppejam: (the amount of data when dumped was only 70MB)09:32
jamrogpeppe: I have a strong feeling you were already in an error state that was running out of control (for whatever reason). My 10k node setup was on the order of 600MB, IIRC09:32
rogpeppejam: quite possibly. i've no idea what kind of error state would cause that though09:33
jamrogpeppe: mgo txns that don't exist causing jujud to try to fix something that isn't broken over and over ?09:33
jamI don't really know either, I'm surprised dump restore fixed it09:33
jamas that sounds like it's a bug in mongo09:34
rogpeppejam: i'm pretty sure it wasn't anything to do with juju itself.09:34
rogpeppejam: i'm not sure it *could* grow the transaction collection that fast09:34
rogpeppejam: it's possible that it's some kind of mongo bug09:34
jamrogpeppe: so I think I would set the write concern to at least 1, and the Journal value to True, rather than FSync.09:38
rogpeppejam: what are the implications of the J value? the comment is somewhat obscure to me.09:39
jamrogpeppe: "write the data to the journal" vs "fsync the whole db"09:39
jamrogpeppe: http://docs.mongodb.org/manual/reference/command/getLastError/#dbcmd.getLastError09:39
jam"In most cases, use the j option to ensure durability..." as the doc under "fsync"09:40
* TheMue just used the synchronous bootstrap for the first time. feels better. but the bootstrap help text still says it's asynchronous09:48
jamrogpeppe: but I'd go for a patch that after Dial immediately calls EnsureSafe(Safe{"majority", J=True})09:50
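A minimal sketch of that patch, assuming the mgo API of the era (labix.org/v2/mgo): wait for a majority of replica-set members to acknowledge each write, and require the write to reach the journal (J) rather than fsyncing the whole database.

    package main

    import (
        "log"

        "labix.org/v2/mgo" // import path of the era; later gopkg.in/mgo.v2
    )

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Majority write concern plus journaled writes: enough to know a
        // transaction actually hit durable storage, without a full fsync.
        session.EnsureSafe(&mgo.Safe{
            WMode: "majority",
            J:     true,
        })
    }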
fwereadeTheMue, well spotted, would you quickly fix it please? ;p10:03
fwereadejam, +110:03
TheMuefwereade: yep, will do10:12
fwereadeTheMue, <310:12
TheMuebut I need a sed specialist ;) how do I prefix all lines of a file with a given string?10:14
jamTheMue: well you could do: "sed -i.back -e 's/\(.*\)/STUFF\1/' file"10:31
jambut I'm not sure if sed is the best fit for it10:31
TheMuejam: thx. I'm also open for other ideas, otherwise I'll use my editor10:32
jamTheMue: if you have tons of stuff, sed is fine for it, with vim gg^VG ^ISTUFF<ESC>10:33
jam(go top, block insert, Go bottom, Insert all, write STUFF, ESC to finish)10:33
TheMuecool10:34
TheMuewill try after proposal10:34
TheMuefwereade: dunno if my english is good enough: https://codereview.appspot.com/36520043/10:42
fwereadejam, TheMue, standup10:47
wallyworld_jam: fwereade: i'm back if you wanted to discuss the auth keys plugin. or not. ping me if you do.11:46
mgzwallyworld_: you could rejoin hangout11:46
natefinchwallyworld_: we're still in the hangout if you want to pop back in11:46
* mgz wins!11:46
* natefinch is too slow11:46
natefinch:)\11:46
=== gary_poster|away is now known as gary_poster
* dimitern lunch13:18
TheMueAnyone interested in reviewing my Tailer, the first component for the debug log API: https://codereview.appspot.com/3654004313:58
=== BradCrittenden is now known as bac
TheMueIs intended to do a filtered tailing of any ReaderSeeker (like a File) into a Writer13:58
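Not TheMue's actual implementation, but a toy sketch of that shape (file path and filter are illustrative; a real tailer would keep polling for new data after EOF rather than stopping):

    package main

    import (
        "bufio"
        "io"
        "os"
        "strings"
    )

    // tail copies lines from r to w, keeping only lines accepted by filter.
    // This sketch stops at EOF instead of following the file.
    func tail(r io.ReadSeeker, w io.Writer, filter func(string) bool) error {
        if _, err := r.Seek(0, 0); err != nil { // 0 == os.SEEK_SET
            return err
        }
        scanner := bufio.NewScanner(r)
        for scanner.Scan() {
            if line := scanner.Text(); filter(line) {
                if _, err := io.WriteString(w, line+"\n"); err != nil {
                    return err
                }
            }
        }
        return scanner.Err()
    }

    func main() {
        f, err := os.Open("/var/log/juju/machine-0.log") // path assumed
        if err != nil {
            return
        }
        defer f.Close()
        tail(f, os.Stdout, func(l string) bool { return strings.Contains(l, "ERROR") })
    }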
=== ChanServ changed the topic of #juju-dev to: https://juju.ubuntu.com | On-call reviewer: see calendar | Bugs: 8 Critical, 240 High - https://bugs.launchpad.net/juju-core/
rogpeppedimitern: ping14:05
TheMuerogpeppe: quick look on https://codereview.appspot.com/36540043 ?14:08
rogpeppeTheMue: will do14:08
TheMuerogpeppe: thanks14:09
dimiternrogpeppe, pong14:10
rogpeppedimitern: i'm wondering about the upgrade-juju behaviour14:13
rogpeppedimitern: in particular: when was the checking for version consistency introduced?14:13
dimiternrogpeppe, yeah?14:13
dimiternrogpeppe, recently14:14
rogpeppedimitern: after 1.16?14:14
dimiternrogpeppe, yes14:14
rogpeppedimitern: the other thing is: does it ignore dead machines when it's checking?14:14
dimiternrogpeppe, take a look at SetEnvironAgentVersion14:15
dimiternrogpeppe, it just checks tools versions, not the life14:16
rogpeppedimitern: hmm, i think that's probably wrong then14:16
rogpeppedimitern: if an agent is dead, i think we probably don't care about its version14:16
dimiternrogpeppe, perhaps we can unset agent version from dead machines anyway14:17
rogpeppedimitern: but it's a good thing it's not released yet, because that logic won't prevent peter waller from upgrading his environment currently14:17
rogpeppedimitern: i don't think that's necessary14:17
rogpeppedimitern: i think setting life should set life only14:17
rogpeppedimitern: and it's just possible that the agent version info could be useful to someone, somewhere, i guess14:18
dimiternrogpeppe, well, it's not just that actually14:18
dimiternrogpeppe, uprgade-juju does the check of version constraints before trying to change it14:18
dimiternrogpeppe, so in fact it will have helped in peter's case not to upgrade to more than 1.1414:19
rogpeppedimitern: what do mean by the version constraints?14:20
dimiternrogpeppe, "next stable" logic14:20
dimiternrogpeppe, (or current, failing that)14:20
jamespagesinzui, hey - about to start working on 1.16.4 - I see some discussion on my observation about whether this is really a stable release - what's the outcome? what do I need todo now?14:21
sinzuijamespage, I was just writing the list to summarise jam's argument14:22
sinzuijamespage, from the dev's perspective this is a stable release because it addresses issues with how juju is currently used. Some papercuts are improvements/features, but they are always 100% compatible.14:23
jamespagesinzui, we are going to struggle with a minor release exception if that is the case14:23
sinzuijamespage, devel and minor version increments are bug features and version incompatibilities14:24
* TheMue has to step out for his office appointment, will return later14:24
sinzuis/bug/big features/14:24
jamespagesinzui, I'll discuss with a few SRU people14:24
dimiternrogpeppe, and anyway the case you're describing is very unusual - dead machines with inconsistent agent versions - that never would've happened if the usual upgrade process is followed14:24
jamespagesinzui, I did look at the changes - I think the plugin I could probably swing with as its isolated from the rest of the codebase14:24
jamespagesinzui, the provisioner safe-mode feels less SRU'able14:25
rogpeppedimitern: why not?14:25
sinzuijamespage, I am keen to do a release, I can make this 1.18.0 in a couple of hours. The devs are a little more reticent.14:25
dimiternrogpeppe, well, unless you force it ofc14:25
dimiternrogpeppe, due to the version constraints checks14:25
rogpeppedimitern: the machines have been around for a long time - their instances were manually destroyed from the aws console AFAIK14:25
jamespagesinzui, fwiw I'm trying to SRU all 1.16.x point releases to saucy as evidence that juju-core is ready for a MRE for trusty14:26
rogpeppedimitern: those checks aren't in 1.16 though, right?14:26
dimiternrogpeppe, no14:26
rogpeppedimitern: "no they aren't" or "no that's wrong" ?14:26
dimiternrogpeppe, sorry, no they aren't14:27
rogpeppedimitern: ok, cool14:27
sinzuijamespage, That is admirable. If the devs were producing smaller features to release a stable each month, would that cause pain?14:28
rogpeppedimitern: it might be a bit of a problem that one broken machine can prevent upgrade of a whole environment, but... can we manually override by specifying --version ?14:28
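rogpeppe's suggestion, as a sketch (Machine here is a stand-in, not the real state.Machine): skip Dead machines when checking agent-version consistency, so one broken machine cannot wedge upgrade-juju for the whole environment.

    package main

    import "fmt"

    type Life int

    const (
        Alive Life = iota
        Dying
        Dead
    )

    // Machine is a stand-in for the state package's type.
    type Machine struct {
        Id           string
        Life         Life
        AgentVersion string
    }

    // checkAgentVersions requires a consistent version across live agents
    // only; a dead agent's version is irrelevant.
    func checkAgentVersions(machines []Machine, want string) error {
        for _, m := range machines {
            if m.Life == Dead {
                continue
            }
            if m.AgentVersion != want {
                return fmt.Errorf("machine %s has tools %s, want %s", m.Id, m.AgentVersion, want)
            }
        }
        return nil
    }

    func main() {
        ms := []Machine{{"0", Alive, "1.17.0"}, {"1", Dead, "1.14.0"}}
        fmt.Println(checkAgentVersions(ms, "1.17.0")) // <nil>: machine 1 is ignored
    }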
* sinzui thinks enterprise customers get juju from the location CTS points to, so rapid increments is always fine14:28
* jamespage thinks about sinzui's suggestion14:29
mgzjamespage: the extra plugin was actually trying to do a bug fix in a non-intrusive way... unfortunately that does mean packaging changes instead which isn't really what you want for a minor version14:35
jamespagemgz, I guess my query is about whether a feature that allows you to backup/restore a juju environment should be landing on a stable release branch14:36
jamespagemgz, (I appreciate the way the plugin was done does isolate it from the rest of the codebase - which avoids regression potentials)14:37
=== liam_ is now known as Guest41957
dimiternmy main fuse tripped and it trips again when I turn it back on, unless i stop one of the other ones, so now I have no power on any outlet in the living room and had to do some trickery to get it to work from the bedroom :/14:47
jamespagemgz, hey - any plans on bug 124167414:48
_mup_Bug #1241674: juju-core broken with OpenStack Havana for tenants with multiple networks <cts-cloud-review> <openstack-provider> <juju-core:Triaged> <https://launchpad.net/bugs/1241674>14:48
jamespageits what I get most frequently asked about these days14:48
mgzjamespage: yeah, I should post summary to that bug14:48
mgzso then those people who ask have something to read14:48
jamespageplease do14:49
fwereadesounds like dimitern is having persistent power problems and we might not see him again today15:07
rogpeppefwereade: ping15:07
fwereaderogpeppe, pong15:08
rogpeppefwereade: would you be free for a little bit15:08
fwereaderogpeppe, maybe, but I'll have to drop off to talk to green in max 20 mins15:08
rogpeppefwereade: that would be fine15:09
fwereaderogpeppe, consider me at your service then15:09
rogpeppefwereade: https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig?authuser=115:09
rogpeppefwereade: (with peter waller)15:09
sinzuijamespage, I replied to juju 1.16.4 conversation on the list. I think you may want to correct or elaborate on what I wrote15:29
rogpeppethis is really odd15:40
rogpeppeniemeyer: ping16:18
mgzrogpeppe: I refreshed a branch you already reviewed for the update-bootstrap tweaks btw16:20
rogpeppemgz: ok, will have a look16:23
rogpeppemgz: am currently still trying to sort out this broken environment16:25
rogpeppemgz: have you looked at mgo/txn at all, BTW?16:25
mgzalas no :)16:26
mgzjust enough to add some operations to state16:26
mgzdidn't try and understand how it was actually working16:26
jamjamespage: unfortunately I cleared my traceback a bit, but I will say the "provisioner-safe-mode" is like *the key* bit that NEC actually needs, the rest is automation around stuff they can do manually.16:27
jamjamespage: is there a reason cloud-archive:tools is still reporting 1.16.0?16:27
jamsinzui: ^^16:27
jamespagejam: yes the SRU only just went into saucy - its waiting for processing in the cloud-tools staging PPA right now16:27
jamespagealong with a few other things16:28
jamjamespage: k, jpds is having a problem with keyserver stuff and that is fixed in 1.16.216:28
sinzuijamespage, is there anything I should be doing to speed that up?16:28
jamespageI'll poke smoser for review16:28
jamjamespage: "the SRU" of which version?16:30
jamespage1.16.316:30
rogpeppejam: any idea what might be going on here? http://paste.ubuntu.com/6515237/16:31
jamjamespage: great16:31
rogpeppejam: this is on the broken environment i mentioned in the standup16:31
jamrogpeppe: context?16:31
jamthx16:31
rogpeppejam: note all the calls to txn.flusher.recurse16:31
rogpeppejam: i *think* that indicates something's broken with transactions (which wouldn't actually be too surprising in this case)16:32
jamrogpeppe: the 'active' frame is the top one, right?16:32
rogpeppejam: yes16:33
niemeyerrogpeppe: Heya16:44
niemeyerrogpeppe: So, problem solved?16:44
rogpeppeniemeyer: i'm not sure it is, unfortunately16:44
niemeyerrogpeppe: Haven't seen any replies since you've mailed him about it16:44
rogpeppeniemeyer: i've been working with him to try and bring things up again.16:44
rogpeppeniemeyer: i *thought* it was all pretty much working,16:45
rogpeppeniemeyer: but there appears to be something still up with the transaction queues16:45
rogpeppeniemeyer: this is the stack trace i'm seeing on the machine agent: http://paste.ubuntu.com/6515237/16:45
rogpeppeniemeyer: note the many calls to the recurse method16:45
rogpeppeniemeyer: it seems that nothing is actually making any progress16:46
niemeyerrogpeppe: Seems to be trying to apply transactions16:47
rogpeppeniemeyer: it does, but none seem to be actually being applied16:48
niemeyerrogpeppe: That's a side effect of having missing transactions16:48
niemeyerrogpeppe: Missing transaction documents, that is16:49
niemeyerrogpeppe: It'll refuse to make progress because the system was corrupted16:49
rogpeppeniemeyer: i thought the PurgeMissing call was supposed to deal with that16:49
niemeyerrogpeppe: So it cannot make reasonable progress16:49
niemeyerrogpeppe: Yes, it is16:49
rogpeppeniemeyer: so, we did that and it seemed to succeed16:49
niemeyerrogpeppe: Did it clean everything up?16:50
niemeyerrogpeppe: on the rigth database, etc16:50
rogpeppeniemeyer: yes, i believe so - it made a lot more progress (no errors about missing transactions any more)16:50
niemeyerrogpeppe: Tell him to kill the transaction logs completely, run purge-txns again16:52
rogpeppeniemeyer: ok16:52
niemeyerrogpeppe: Drop both txns and txns.log16:52
niemeyerrogpeppe: and txns.stash16:52
rogpeppeniemeyer: ok, trying that16:53
niemeyerrogpeppe: After that, purge-txns will cause a full cleanup16:53
rogpeppeniemeyer: when you say "drop", is that a specific call, or is it just something like db.txns.remove(nil)16:55
rogpeppe?16:55
niemeyer> db.test.drop()16:55
niemeyertrue16:55
niemeyer>16:55
niemeyerrogpeppe: Be mindful.. there is no protection against doing major damage16:56
rogpeppeniemeyer: i am aware of that16:56
rogpeppeniemeyer: there is a backup though16:56
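The recovery steps niemeyer describes, sketched with the mgo driver rather than the mongo shell (database and collection names are assumptions; as niemeyer warns, there is no safety net, so take a mongodump first):

    package main

    import (
        "log"

        "labix.org/v2/mgo"
        "labix.org/v2/mgo/txn"
    )

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()
        db := session.DB("juju")

        // Drop the transaction collections outright, as suggested.
        for _, name := range []string{"txns", "txns.log", "txns.stash"} {
            if err := db.C(name).DropCollection(); err != nil {
                log.Printf("dropping %s: %v", name, err)
            }
        }

        // Then purge references to now-missing transactions from the
        // collections juju runs transactions against (names assumed here).
        runner := txn.NewRunner(db.C("txns"))
        if err := runner.PurgeMissing("machines", "services", "units"); err != nil {
            log.Fatal(err)
        }
    }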
niemeyerrogpeppe: Yeah, I'm actually curious about one thing:16:57
niemeyerrogpeppe: the db dump I got.. was that the backup, or was that the one being executed live?16:57
rogpeppeniemeyer: that was a backup made at my instigation16:57
sinzuiBug #1257371 is a regression that breaks bootstrapping on aws and canonistack16:57
_mup_Bug #1257371: bootstrap fails because Permission denied (publickey) <bootstrap> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1257371>16:57
rogpeppeniemeyer: i.e. after the problems had started to occur16:58
rogpeppee16:58
rogpepper16:58
niemeyerrogpeppe: Right, I'm pretty sure trying to run the system on that state would great quite a bit of churn in the database16:58
niemeyerrogpeppe: s/great/create/16:58
niemeyerrogpeppe: Depending on the retry strategies...16:59
niemeyerrogpeppe: This might explain why the database was growing17:00
niemeyerrogpeppe: and might also explain why the system is in that state you see now17:00
rogpeppeniemeyer: ok. let's hope this strategy works then17:00
rogpeppeniemeyer: just about to drop. wish me luck :-)17:00
niemeyerrogpeppe: The transactions may all be fine now.. but if you put a massive number of runners trying to finalize a massive number of pending and dependent transactions at once, it won't be great17:00
niemeyerrogpeppe: The traceback you pasted seems to corroborate with that theory too17:01
rogpeppeniemeyer: collections dropped17:01
rogpeppeniemeyer: it's currently purged >10000 transactions17:05
niemeyerrogpeppe: There you go..17:06
niemeyerrogpeppe: No wonder it was stuck17:06
rogpeppeniemeyer: it's still going...17:06
niemeyerrogpeppe: That's definitely not the database I have here, by the way17:06
niemeyerrogpeppe: I did check the magnitude of proper transactions to be applied17:07
rogpeppeniemeyer: indeed not - i think they've all been started since this morning17:07
rogpeppeniemeyer: there were only a page or so this morning17:07
niemeyerrogpeppe: Well, a page of missing17:08
niemeyerrogpeppe: The problem now is a different one17:08
rogpeppeniemeyer: ah yes17:08
niemeyerrogpeppe: These are not missing or bad transactions17:08
niemeyerrogpeppe: They're perfectly good transactions that have been attempted continuously and in parallel, but unable to be applied because the system was wedged with a few transactions that were lost17:08
niemeyerrogpeppe: Then, once the system was restored to a good state, there was that massive amount of pending transactions to be applied.. and due to how juju is trying to do stuff from several fronts, there was an attempt to flush the queues concurrently17:09
niemeyerrogpeppe: Not great17:10
niemeyerrogpeppe: At the same time, a good sign that the txn package did hold the mess back instead of creating havoc17:10
rogpeppeniemeyer: yeah17:11
rogpeppeniemeyer: 34500 now17:11
niemeyerrogpeppe: Gosh17:12
niemeyerrogpeppe: How come it was running for so long?17:13
niemeyerrogpeppe: What happens when juju panics?  I guess we have upstart scripts that put it back alive?17:13
rogpeppeniemeyer: it *should* all be ok17:14
rogpeppeniemeyer: the main problem with panics is that when they recur continually, the logs fill up17:14
rogpeppeniemeyer: and that was the indirect cause of what we're seeing now17:14
niemeyerrogpeppe: Well, that's not the only problem.. :)17:15
rogpeppeniemeyer: indeed17:15
rogpeppeniemeyer: 5 whys17:15
niemeyerrogpeppe: "OMG, things are broken! Fix it!" => "Try it again!" => "OMG, things are broken! Fix it!" => "Once more!" => .....17:15
niemeyerrogpeppe: That's how we end up with tens of thousands of pending transactions :)17:16
rogpeppeniemeyer: well to be fair, we only applied one fix today17:16
niemeyerrogpeppe: Hmm.. how do you mean?17:17
rogpeppeniemeyer: we ran PurgeMissing17:17
niemeyerrogpeppe: Sorry, I'm missing the context17:18
niemeyerrogpeppe: I don't get the hook of "to be fair"17:18
rogpeppeniemeyer: ah, i thought you were talking about human intervention17:18
rogpeppeniemeyer: but perhaps you're talking about what the agents were doing17:18
niemeyerrogpeppe: No, I'm talking about the fact the system loops continuously doing more damage when we explicitly say in code that we cannot continue17:19
rogpeppeniemeyer: right17:19
rogpeppeniemeyer: it's an interesting question as to what's the best approach there17:19
rogpeppeniemeyer: i definitely think that some kind of backoff or retry limit would be good17:20
niemeyerrogpeppe: Yeah, I think we should enable that in our upstart scripts17:20
niemeyerrogpeppe: This is a well known practice, even in systems that take the fail-and-restart approach to heart17:20
niemeyerrogpeppe: (e.g. erlang)17:20
niemeyerrogpeppe: (or, Erlang OTP, more correctly)17:20
rogpeppeniemeyer: yeah17:21
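In an upstart job that might look like the following sketch (job name and paths are assumed, not juju's actual script): respawn, but give up after a bounded number of crashes per interval instead of looping forever.

    # /etc/init/jujud-machine-0.conf -- sketch only, names assumed
    description "juju machine agent"
    start on runlevel [2345]
    stop on runlevel [!2345]
    respawn
    # Give up if the agent dies more than 10 times within 60 seconds,
    # instead of hammering the datastore in a tight crash loop.
    respawn limit 10 60
    exec /var/lib/juju/tools/machine-0/jujud machine --machine-id 0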
rogpeppeniemeyer: hmm, 70000 transactions purged so far. i'm really quite surprised there are that many17:23
niemeyerrogpeppe: Depending on how far that goes, it might be wise to start from that backup instead of that crippled live system17:31
rogpeppeniemeyer: latest is that it has probably fixed the problem17:34
rogpeppeniemeyer: except...17:34
rogpeppeniemeyer: that now amazon has rate-limited the requests because we'd restarted too often (probably)17:35
rogpeppeniemeyer: so hopefully that will have resolved by the morning17:35
niemeyerrogpeppe: Gosh..17:35
rogpeppeniemeyer: lots of instance id requests because they've got a substantial number of machines in the environment which are dead (with missing instances)17:36
rogpeppeniemeyer: and if we get a missing instance, we retry because amazon might be lying due to eventual consistency17:37
rogpeppeniemeyer: so we make more requests than we should17:37
niemeyerrogpeppe: Right17:47
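A sketch of bounding those retries (all names illustrative, not juju's actual code): allow a few backed-off rechecks to cover eventual consistency, then believe the provider, rather than retrying indefinitely and tripping rate limits.

    package main

    import (
        "fmt"
        "time"
    )

    // instanceExists retries a "does this instance exist?" check a bounded
    // number of times with exponential backoff, then accepts the answer.
    func instanceExists(check func() (bool, error)) (bool, error) {
        delay := time.Second
        for attempt := 0; attempt < 5; attempt++ {
            ok, err := check()
            if err != nil {
                return false, err
            }
            if ok {
                return true, nil
            }
            time.Sleep(delay)
            delay *= 2 // back off to avoid hammering the provider API
        }
        return false, nil // still missing after bounded retries: treat as gone
    }

    func main() {
        calls := 0
        ok, _ := instanceExists(func() (bool, error) { calls++; return calls > 2, nil })
        fmt.Println(ok, calls) // true 3
    }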
=== liam_ is now known as Guest7421
* rogpeppe is done for the day18:22
rogpeppeg'night all18:22
smoserhey.18:40
smoserbefore i write an email to juju-dev18:40
smosercan someone tell me real quick if there is some plan (or existing path) that a charm can indicate that it can or cannot run in a lxc container18:40
smoserand if so, any modules that it might need access to or devices (or kernel version or such)18:41
natefinchsmoser: I don't think we have any such thing today.... I don't know of a plan to include such a thing.18:52
smoserthanks.18:53
thumpermorning20:12
thumperalso, WTF?20:12
thumperanyone got a working environment up?20:12
thumperI get: ERROR <nil> when I go 'juju add-machine'20:13
thumperanyone else confirm?20:13
natefinchdoh20:13
natefinchthumper: lemme give it a try, half a sec, need to switch to trunk20:13
thumperkk20:13
thumperoh, and yay20:15
thumperwith the kvm local provider I can create nested kvm20:15
* thumper wants to try lxc in kvm in kvm20:15
thumperheh...20:15
natefinchjust keep nesting until something breaks20:15
thumperalso means I can fix the kvm provisioner code without needing to start canonistack20:15
natefinchawesome20:15
thumpernatefinch: I've heard from robie that three deep causes problems20:15
thumperbut I've not tested20:16
thumperalso, memory probably an issue...20:16
thumperthe outer kvm would need more ram for the inner kvm to work properly20:16
natefinchwhere's your sense of adventure?20:16
thumperbut that too would allow me to test the hardware characteristics20:16
* thumper has 16 gig of ram20:16
thumperlets do this20:16
natefinch:D20:17
thumperafter I've fixed the bug that is...20:17
thumperkvm container provisioner is panicking20:17
natefinchno one on warthogs wants to talk about google compute engine evidently...20:19
thumperheh20:19
* thumper goes to write a stack trace function for loggo20:20
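Such a helper needs little more than runtime.Stack; a sketch (the loggo wiring is assumed -- e.g. the result could be emitted via logger.Errorf("%s", stackTrace())):

    package main

    import (
        "fmt"
        "runtime"
    )

    // stackTrace returns the current goroutine's stack as a string,
    // growing the buffer until the whole trace fits.
    func stackTrace() string {
        buf := make([]byte, 8192)
        for {
            n := runtime.Stack(buf, false)
            if n < len(buf) {
                return string(buf[:n])
            }
            buf = make([]byte, len(buf)*2)
        }
    }

    func main() {
        fmt.Print(stackTrace())
    }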
natefinchsome day juju status will return20:20
natefinchand then I can try add machine20:20
natefinchthumper: add machine works for me on trunk/ec220:27
thumperno error?20:27
natefinchcorrect20:27
thumperit could well be linked to the kvm stuff20:27
thumperta, I'll keep digging20:27
natefinchwelcome20:28
hazinhellnatefinch, what about it?20:28
hazinhellgce that is20:28
thumperI was wondering why my container in a container was taking so long to start20:51
thumperit seems the host is downloading the cloud image20:51
thumperwhat we really want is a squid cache on the host machine20:51
thumperwho knows squit?20:51
thumpersquid20:51
natefinch......crickets20:55
* thumper hangs his head21:06
thumperdamn networking21:06
thumperso, this kinda works...21:06
* thumper wonders where the "br0" is coming from...21:08
* thumper thinks...21:08
thumperah21:09
thumperDefaultKVMBridge21:09
* thumper tweaks the local provider to make eth0 bridged21:09
* thumper wonders how crazy this is getting21:10
natefinchthumper: I can't see br0 without thinking "You mad bro?"21:10
thumperheh21:10
natefinchand usually, if I'm looking at br0, I'm mad21:11
thumper:)21:12
hazinhellthumper, varnish ftw ;-)21:16
hazinhellthumper, are we setting up lxc on a different bridge than kvm?21:16
thumperhazinhell: varnish?21:16
thumperhazinhell: well, lxc defaults to lxcbr0 and kvm to virbr021:17
thumperthe config wasn't setting one21:17
thumperand for a container inside the local provider we need to have bridged eth021:17
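The bridged-eth0 setup thumper means might look like this /etc/network/interfaces sketch (device names assumed):

    # /etc/network/interfaces -- sketch: bridge the host's eth0 so nested
    # kvm/lxc guests share the physical network
    auto br0
    iface br0 inet dhcp
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0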
hazinhellthumper, varnish over squid for proxy.21:17
thumperhazinhell: docs?21:17
hazinhellthumper, varnish-cache.org.. but if your using one of the apt proxies, afaik only squid is setup for that21:18
thumperhazinhell: what I wanted was a local cache of the lxc ubuntu-cloud image and the kvm one21:19
thumperto make creating container locally faster21:19
thumperas a new kvm instance needs to sync the images21:19
thumperto start an lxc or kvm container21:19
hazinhellthumper, lxc already caches21:20
thumperhazinhell: not for this use case21:20
thumperbecause it is a new machine21:20
thumperhazinhell: consider this ...21:20
thumperlaptop host21:20
thumperhas both kvm and lxc images cached21:21
thumperboot up kvm local provider21:21
thumperstart a machine21:21
thumperuses cache21:21
thumperthen go "juju add-machine kvm:1"21:21
thumpermachine 1, the new kvm instance, then syncs the kvm image21:21
thumperthis goes to the internet to get it21:21
thumperI want a cache on the host21:21
thumpersimilarly if the new machine 1 wants an lxc image21:21
hazinhellah.. nesting with cache access21:21
thumperit goes to the internet to sync image21:21
thumperack21:22
thumperso squid cache on the host to make it faster21:22
hazinhellthumper, what about mount the host cache over21:22
thumperfor new machines starting containers21:22
hazinhellthumper, read mount21:22
thumpersounds crazy :)21:22
hazinhellit does.. you need some supervision tree to share the read mounts down the hierarchy21:23
hazinhells/supervision/21:23
hazinhellthumper,21:23
hazinhellthumper, you could just do the host object storage (provider storage) and link the cache into that21:24
thumpersurely a cache on the host would be less crazy21:24
hazinhellthumper, the host already has the cache; a mount of that directly into the guests allows all the default tools to see it without any extra work on juju's part21:24
hazinhelldoing a network endpoint, means you have to interject some juju logic to pull from that endpoint into the local disk cache21:25
hazinhelland you end up with wasted space21:25
hazinhellit's kind of a shame we can't use the same for both..21:27
hazinhellie lxc is a rootfs and kvm is basically a disk image.21:28
hazinhellhmm21:28
hazinhellsadly can't quite loop dev the img and mount it into the cache, lxc wants a tarball there, would have to set it up as a container rootfs.21:30
thumperhmm...22:00
thumperhmm...22:06
=== dspiteri is now known as DarrenS
hazinhellthumper, read mount sound good?22:27
hazinhellthumper, or something else come to mind?22:28
thumperhazinhell: busy fixing the basics at the moment22:28
hazinhellack22:28
* hazinhell returns to hell22:28
thumperanyone have an idea why my kvm machine doesn't have the networking service running?22:40
* thumper steps back a bit22:40
