/srv/irclogs.ubuntu.com/2013/12/03/#juju-dev.txt

bigjoolserror: invalid service name "tarmac-1.4"00:22
bigjoolsyay00:22
davecheneyprobably the do00:26
davecheneypretty much only safe to use a-z, 0-9 and hyphen00:26
thumperwallyworld_: looking00:26
wallyworld_k00:27
davecheneys/do/dot00:27
bigjoolsdavecheney: but why... :/00:31
bigjoolsI tried with tarmac-14 and it still complained :(00:32
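A sketch of the validation bigjools is hitting, assuming a rule along the lines juju uses (the exact regexp here is an approximation): service names allow only lowercase letters, digits and hyphens, and no hyphen-separated segment may be all digits -- "tarmac-14" fails too because it would be ambiguous with a unit name like "tarmac/14".

    package main

    import (
        "fmt"
        "regexp"
    )

    // Approximation of the service-name rule davecheney describes: lowercase
    // letters, digits and hyphens, with the extra twist that no hyphenated
    // segment may be all digits -- which is why "tarmac-14" also fails.
    var validService = regexp.MustCompile(`^[a-z][a-z0-9]*(-[a-z0-9]*[a-z][a-z0-9]*)*$`)

    func main() {
        for _, name := range []string{"tarmac", "tarmac-1.4", "tarmac-14", "tarmac-v14"} {
            fmt.Printf("%-12s valid=%v\n", name, validService.MatchString(name))
        }
    }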
wallyworld_thumper: i didn't mention the listener because it is orthogonal to the management of keys in state. that aside, i'll continue with the current work then00:44
thumperwallyworld_: I think it is worth the few days effort now to make it easier00:45
wallyworld_me too00:45
thumperthe fact that it gives you a break to do something slightly different is a bonus00:45
wallyworld_if we had the road map all sorted it would be easier to know exactly what to do next00:45
=== waigani_ is now known as waigani
wallyworld_thumper: i just ran into that @%#^%!&@!^& issue where different values of c are used in different phases of the tests. ffs01:37
thumperhow?01:38
thumperstoring c?01:38
wallyworld_no, constructing a stub in the SetUpTest, where the stub took c *gc.C as the arg and then called c.Assert01:39
wallyworld_the c.Assert failed and the wrong go routine was stopped01:39
wallyworld_so the test said it passed01:39
thumperheh01:39
thumperoops01:39
wallyworld_well, oops smoops01:39
wallyworld_the same value of c should be used01:39
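A minimal sketch of the pitfall wallyworld_ describes, assuming gocheck-style suites (as juju's tests use): the stub built in SetUpTest closes over SetUpTest's *gc.C, so a later assertion failure aborts the wrong context and the test can still be reported as passing.

    package mypkg_test

    import (
        "testing"

        gc "launchpad.net/gocheck"
    )

    func Test(t *testing.T) { gc.TestingT(t) }

    type suite struct {
        stub func(string)
    }

    var _ = gc.Suite(&suite{})

    func (s *suite) SetUpTest(c *gc.C) {
        // BUG: the closure captures SetUpTest's c. When the stub later fails
        // an assertion during the test proper (possibly on another goroutine),
        // the panic aborts the wrong context, so the test itself can still
        // show up as passing.
        s.stub = func(got string) {
            c.Assert(got, gc.Equals, "expected")
        }
    }

    func (s *suite) TestStub(c *gc.C) {
        // Safer: rebind the stub so assertions use this test's c.
        s.stub = func(got string) {
            c.Assert(got, gc.Equals, "expected")
        }
        s.stub("expected")
    }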
wallyworld_also, our commands are hard to test sometimes01:40
wallyworld_cause they live in a main package01:40
wallyworld_so can't easily be mocked out from another package01:40
wallyworld_without refactoring the Run()01:41
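One way out, as a hedged sketch (all names here are hypothetical, not juju's actual code): move the command's logic behind a small interface so Run() becomes thin wiring, and any package can test the logic with a fake.

    package main

    import (
        "fmt"
        "io"
        "os"
    )

    // StatusGetter is a hypothetical seam: the command depends on this
    // interface rather than on a concrete API connection.
    type StatusGetter interface {
        Status() (string, error)
    }

    // runStatus holds the testable logic; a test in another package can call
    // it with a fake StatusGetter and an in-memory writer.
    func runStatus(conn StatusGetter, out io.Writer) error {
        st, err := conn.Status()
        if err != nil {
            return err
        }
        _, err = fmt.Fprintln(out, st)
        return err
    }

    type liveConn struct{}

    func (liveConn) Status() (string, error) { return "running", nil }

    func main() {
        // The command's Run would reduce to wiring: open the real
        // connection, then delegate to runStatus.
        if err := runStatus(liveConn{}, os.Stdout); err != nil {
            os.Exit(1)
        }
    }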
thumpertrue that...01:53
thumperschool run01:59
* thumper is hanging out for synchronous bootstrap...02:44
thumperI'm waiting for canonistack to come back saying it is up02:44
thumperno idea what the state is right now02:44
thumperall I know is that it has started an instance02:44
jam1thumper: sinzui: We delivered a 1.16.4.1 to NEC, that I'd *really* like to be called an official 1.16.4 so we don't have to tilt our heads every time we look at it. Beyond that we're in a situation where we've got enough work queued up in the pipeline that we're afraid to release trunk, and I'd really like to get us out of that situation.02:58
thumperjam: why afraid?03:12
thumperOH FF!!!!@!03:18
thumperloaded invalid environment configuration: required environment variable not set for credentials attribute: User03:20
thumperthat is on the server side, worked client side...03:20
thumperGRR...03:20
* thumper cringes...03:22
* thumper needs to talk this through with someone03:28
thumperaxw, wallyworld_: either of you available to be a talking teddybear?03:29
axwthumper: sure, brb03:29
wallyworld_ok. i've never been a teddy bear before03:29
axwback..03:31
thumperhttps://plus.google.com/hangouts/_/76cpibimp3dvc1rugac12elqh8?hl=en03:31
thumpermramm: ta03:49
mrammyou are most welcome03:49
* thumper downloads failing machine log04:15
thumperfor investigation04:15
axwjam: are you happy for me to land my synchronous bootstrap branch after addressing fwereade's comments? do you have anything else to add?05:06
jamaxw: I haven't actually had a chance to look at it, though I never actually audited the patch, either. I just tested it out.05:07
axwok05:07
jamso if someone like william has actually done code review, I'm fine with it landing05:07
jamjust don't count it as "I've looked closely and I'm happy with everything in there" :)05:08
jamaxw: if you have particular bits that you're more unsure of, you can point me to them, and I can give you directed feedback05:08
axwjam: no problems05:08
jamI'm just distracted with other things05:08
axwok05:08
axwjam: I think it's ok, just wanted to make sure you didn't want to have a look before I went ahead05:09
fwereademan, it creeps me out when my phone tells me I've sent a new review of something *before* the new page renders in the browser08:10
jam fwereade :)08:13
jammorning08:13
fwereadejam, heyhey08:13
fwereadejam, so, about 1.16.4 et al... I quite like the idea of making our current projected 1.16.5 into 1.18, and getting that out asap, with a view to the rest of the CLI API landing in 1.20 as soon as it's ready08:24
fwereadejam, drawbacks?08:24
jamfwereade: well we do have actual bugfixes for 1.16 as well, and I'm concerned about future precedent (what happens the next time we have to do a stop-the-line fix for a client using an older version of juju) but whatever we pick today can work08:25
jamfwereade: the fact that it has been ~1month since we did any kind of a release is surprising for me08:25
fwereadejam, well, the issue here is that the "fix" was really more of a "feature"08:25
jamand also makes me wonder if we've got all the balance points correct.08:25
jamfwereade: necessary work for a stable series that took precedence over other feature work08:26
jamfwereade: fwiw we did the same in 1.16.2 (maas-agent-name is terribly incompatible)08:26
fwereadejam, ha, yes08:26
jam1.16.3 is a genuine bugfix, as it is a single line08:27
jamfwereade: so a gut feeling is that we are being inconsistent about how big an X+2 stable release is meant to be08:27
jam1.16 vs 1.18 is going to be tiny, but 1.18 vs 1.20 is going to be massive08:27
jamthats ok08:27
jambut it is a drawback08:28
jamfwereade: I do *like* the idea of stable being bugfixes only, but that assumes we have the features implemented that someone on a stable series actually needs08:28
fwereadejam, well, it won't be any bigger than today's 1.16->1.18 already would be08:28
jamfwereade: sure, but 1.12 -> 1.14 -> 1.16 -> ? were all similarly sized08:28
fwereadejam, I do agree that's not nice in itself08:28
fwereadejam, OTOH "size of change" is a bit subjective anyway and I'm not sure it's a great yardstick against which to measure minor version changes08:30
jamfwereade: it's very much a gut feeling about "hmmm, this seems strange" more than any sort of concrete thing. I'd rather have regular cadenced .X+2 releases (1/month?, 1/3months?) and then figure out the rest as we go08:32
jamI *do* feel that 1.17 is building up a lot of steam pressure as we keep piling stuff into it and haven't done a release08:32
jamas in, we're more likely to break lots of stuff because people haven't been able to give any incremental feedback08:32
fwereadejam, yeah, I'd like to get a release off trunk soon08:32
jamfwereade: at which point, if we had 1.17.0 released, and then we needed something for NEC, what would you prefer ?08:33
jamit feels really odd to jump from 1.16.3 => 1.18 over the 1.17 series, though obviously we can do it08:33
jambut 1.17 would then contain "all of" 1.1808:33
fwereadejam, well the problem is that 1.17 (if it existed) would contain a bunch of things that aren't in 1.1808:34
fwereadejam, so it's a good thing it doesn't ;)08:34
jamfwereade: well 1.18 would contain lots of things not in 1.17 either08:34
jamsorry08:34
fwereadejam, ;p08:34
jam1.17 would contain lots of things not in 1.1808:34
jamas in, they are modestly unrelated08:35
fwereadejam, yeah08:35
jamand we only didn't have 1.17.0 because the release didn't go out the week before08:35
jam(probably because of test suite failures)08:35
jamCI08:35
jamI don't think I ever heard a "we can't release 1.17.0 because of X" from sinzui08:35
fwereadejam, honestly, in general, I would prefer that specific client fixes be individual hotfixes based on what they were already running -- but, yes, in this case it didn't come out that way08:36
jamfwereade: why would you rather have client-specific branches?08:36
jamit seems far more maintenance than getting the actual code out as a marked & tagged release08:36
jamfwereade: as in, NEC comes back and does "juju status" wtf version are they running ?08:37
fwereadejam, because it minimizes the risk/disruption at the client end08:37
jamif we give them a real release08:37
jamthen they have an obvious match08:37
jamand the guy in #juju-core can reason about it08:37
fwereadejam, that problem already exists because we fucked up and let --upload-tools out into general use08:37
* fwereade makes grumpy face08:37
jamfwereade: well, we can tell it is at least a custom build, and what it is based upon08:38
fwereadejam, well, not really08:38
fwereadejam, eg they weren't using a custom build, but they did have a .build version, because --upload-tools08:39
jamfwereade: sure, but we can tell that it is version X + epsilon (in this case epsilon = 0), that is still better than a pure hotfix that doesn't match any release at all08:40
fwereadejam, ok, my preference is predicated on the idea that we would be able to bring clients back into the usual stream once we have a real release that includes the bits they need, and that is not necessarily realistic08:41
jamfwereade: well it isn't realistic if we don't actually release their bits :)08:41
fwereadejam, the trouble is that we have no information about epsilon08:41
fwereadejam, ha08:41
jamfwereade: and there is the "will NEC use 1.18" ? Only if forced ?08:41
fwereadejam, yeah, that's the question08:41
fwereadejam, I do agree that always having clients on officially released versions is optimal08:42
jamfwereade: so we could have done, "we're going to need some big patches for them, bump our current 1.18, and prepare a new 1.18 based on 1.16"08:43
jama concern in the future is "what if we already have 1.18.0 out the door" because they are using an old stable08:43
jamIIRC, IS is still using 1.14 in a lot of places08:43
jambut  I *think* we got them off of 1.1308:43
jamwell, 1.13.2 from trunk before it was actually released, etc.08:44
jam(I remember doing help with them and 'fixing' something that was in 1.13.2 but not in their build of it :)08:44
jamfwereade: so also, "tools-url" lets you put whatever you want into "official" versions as well08:44
fwereadejam, sure, tools-url lets you lie if you're so inclined -- but --upload-tools forces a lie on you whether you want it or not08:46
jamfwereade: "want it or not" you did ask for it :)08:46
fwereadejam, well, not so much... if you just want to upload the local jujud you don't *really* also want funky versions08:47
rogpeppe1mornin' all08:57
wallyworld_fwereade: when you are free i'd like a quick chat, maybe you can ping me when convenient. i may have to pop out for a bit at some point so if i don't respond i'm not ignoring you09:00
fwereadewallyworld_, sure, 5 mins?09:00
wallyworld_ok09:01
fwereaderogpeppe1, heyhey, I think you have a couple of branches LGTMed and ready to land09:01
rogpeppe1fwereade: ah, yes. i can think of at least one.09:01
rogpeppe1fwereade: BTW i spent some time with Peter Waller yesterday afternoon trying to get his juju installation up again09:02
fwereaderogpeppe1, ah, thank you09:02
rogpeppe1fwereade: an interesting and illuminating exercise09:02
* fwereade peers closely at rogpeppe1, tries to determine degree of deadpan09:02
rogpeppe1fwereade: gustavo's just added a feature to the txn package to help recover in this kind of situation09:02
rogpeppe1fwereade: actually said totally straight09:03
rogpeppe1fwereade: we found lots of places that referred to transactions that did not exist09:03
* fwereade looks nervous09:04
rogpeppe1fwereade: that's probably because we don't run in safe mode (we don't ensure an fsync before we assume a transaction has been successfully written)09:04
rogpeppe1fwereade: so when we ran out of disk space, this kind of thing can happen09:05
fwereadewtf, I could have sworn I double-checked that *months* ago09:05
rogpeppe1fwereade: we should probably run at least the transaction operations safely09:05
fwereaderogpeppe1, surely we should run safely full stop09:06
rogpeppe1fwereade: probably09:06
rogpeppe1fwereade: it'll be interesting to see how much it slows things down09:06
=== rogpeppe1 is now known as rogpeppe
fwereaderogpeppe, indeed09:07
rogpeppefwereade: yeah, i don't see any calls to SetSafe in the code09:08
jamrogpeppe: any understanding of why it isn't the default ?09:11
rogpeppejam: i'm just trying to work out what the default actually is09:11
jamrogpeppe: you can use go and call Session.Safe()09:13
rogpeppejam: yeah, i'm just doing that09:14
jamrogpeppe: "session.SetMode(Strong, true)" is called in DialWithInfo09:15
jambut I don't think that changes Safe09:16
rogpeppejam: i think that's orthogonal to the safety mode09:16
rogpeppejam: the safety mode we use is the zero one09:16
jamnewSession calls "SetSafe(&Safe{})"09:16
jambut what does that actually trigger by default?09:16
jamrogpeppe: "if safe.WMode == "" => w = safe.WMode"09:17
rogpeppejam: huh?09:17
jamrogpeppe: mgo session.go, ensureSafe() when creating a new session09:17
jamit starts with "Safe{}" mode09:17
jamthen...09:18
jamI'm not 100% sure09:18
jamI thought gustavo claimed it was Write mode safe by default09:18
jambut I'm not quite seeing that09:18
rogpeppejam: ah, you mean safe.WMode != ""09:18
jamrogpeppe: oddly enough, SetSafe calls ensureSafe which means you can never decrease the safe mode from the existing value09:18
rogpeppejam: i'm not sure09:20
rogpeppejam: i think that safeOp might make a difference there09:20
jamrogpeppe: so, we can set it the first time, but it looks like if we've ever set safe before (which happens when you call newSession automatically) then it gets set to the exact value09:21
jambut after that point09:21
jamit *might* be that the safe value must be greater than the existing one09:22
jamthe actual logic is hard for me to sort out09:22
rogpeppejam: yeah, i think it does - the comparison logic is only triggered if safeOp is non-nil09:22
jamrogpeppe: sure, comparison only if safeOp is non nil, but mgo calls SetSafe as part of setting up the session09:22
rogpeppejam: SetSafe sets safeOp to nil before calling ensureSafe09:22
jamwhich means by the time a *user* gets to it09:22
jamah09:22
jamk09:22
jamI missed that, fair point09:23
jamrogpeppe: so by default mgo does read back getLastError09:24
jam(as SetSafe(Safe{}) at least does the getLastError check)09:24
jamhowever, it doesn't actually set W or WMode09:24
rogpeppejam: yeah, i think we should have WMode: "majority"09:25
rogpeppejam: and FSync: true09:25
jamrogpeppe: inconsequential until we have HA, but I agree09:25
rogpeppejam: the latter is still important09:25
jamrogpeppe: I would say if we have WMode: majority we may not need fsync, it depends on the catastrophic failure you have to worry about.09:26
rogpeppejam: i'm not sure.09:26
rogpeppejam: it depends if we replicate the logs to the HA servers too09:26
rogpeppejam: if we do, then all are equally likely to run out of disk space at the same time09:27
rogpeppejam: we really need a better story for that situation anyway though09:27
jamrogpeppe: running out of disk space should be something we address, but I don't think it relates to FSync concern for us, does it?09:27
rogpeppejam: if you don't fsync, you don't know when you've run out of disk space09:27
rogpeppejam: so for instance you can have a transaction that gets added and then set on the relevant document, but the transaction can be lost09:29
rogpeppejam: which is the situation that seems to have happened in peter's case09:29
jamrogpeppe: but we only ran out of disk space because of the log exploding, right? the mgo database doesn't grow all that much on its own09:29
jamso if we fix logs, then we'll avoid that side of it09:29
rogpeppejam: well...09:29
rogpeppejam: there was another interesting thing that happened yesterday, and i have no current explanation09:30
rogpeppejam: the mongo database had been moved onto another EBS device (one with more space)09:30
rogpeppejam: with a symlink from /var/lib/juju to that09:31
rogpeppejam: when we restarted mongo, it started writing data very quickly09:31
rogpeppejam: and the size grew from ~3GB to ~14GB in a few minutes09:31
rogpeppejam: before we stopped it09:31
rogpeppejam: we fixed it by doing a mongodump/mongorestore09:32
rogpeppejam: (the amount of data when dumped was only 70MB)09:32
jamrogpeppe: I have a strong feeling you were already in an error state that was running out of control (for whatever reason). My 10k node setup was on the order of 600MB, IIRC09:32
rogpeppejam: quite possibly. i've no idea what kind of error state would cause that though09:33
jamrogpeppe: mgo txns that don't exist causing jujud to try to fix something that isn't broken over and over ?09:33
jamI don't really know either, I'm surprised dump restore fixed it09:33
jamas that sounds like it's a bug in mongo09:34
rogpeppejam: i'm pretty sure it wasn't anything to do with juju itself.09:34
rogpeppejam: i'm not sure it *could* grow the transaction collection that fast09:34
rogpeppejam: it's possible that it's some kind of mongo bug09:34
jamrogpeppe: so I think I would set the write concern to at least 1, and the Journal value to True, rather than FSync.09:38
rogpeppejam: what are the implications of the J value? the comment is somewhat obscure to me.09:39
jamrogpeppe: "write the data to the journal" vs "fsync the whole db"09:39
jamrogpeppe: http://docs.mongodb.org/manual/reference/command/getLastError/#dbcmd.getLastError09:39
jam"In most cases, use the j option to ensure durability..." as the doc under "fsync"09:40
* TheMue just used the synchronous bootstrap for the first time. feels better. but the bootstrap help text still says it's asynchronous09:48
jamrogpeppe: but I'd go for a patch that after Dial immediately calls EnsureSafe(Safe{"majority", J=True})09:50
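A minimal sketch of that patch, assuming the mgo API of the era (labix.org/v2/mgo): wait for a majority of replica-set members to acknowledge each write, and require the write to reach the journal (J) rather than fsyncing the whole database.

    package main

    import (
        "log"

        "labix.org/v2/mgo" // import path of the era; later gopkg.in/mgo.v2
    )

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Majority write concern plus journaled writes: enough to know a
        // transaction actually hit durable storage, without a full fsync.
        session.EnsureSafe(&mgo.Safe{
            WMode: "majority",
            J:     true,
        })
    }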
fwereadeTheMue, well spotted, would you quickly fix it please? ;p10:03
fwereadejam, +110:03
TheMuefwereade: yep, will do10:12
fwereadeTheMue, <310:12
TheMuebut I need a sed specialist ;) how do I prefix all lines of a file with a given string?10:14
jamTheMue: well you could do: "sed -i.back -e 's/\(.*\)/STUFF\1/' file"10:31
jambut I'm not sure if sed is the best fit for it10:31
TheMuejam: thx. I'm also open for other ideas, otherwise I'll use my editor10:32
jamTheMue: if you have tons of stuff, sed is fine for it, with vim gg^VG ^ISTUFF<ESC>10:33
jam(go top, block insert, Go bottom, Insert all, write STUFF, ESC to finish)10:33
TheMuecool10:34
TheMuewill try after proposal10:34
TheMuefwereade: dunno if my english is good enough: https://codereview.appspot.com/36520043/10:42
fwereadejam, TheMue, standup10:47
wallyworld_jam: fwereade: i'm back if you wanted to discuss the auth keys plugin. or not. ping me if you do.11:46
mgzwallyworld_: you could rejoin hangout11:46
natefinchwallyworld_: we're still in the hangout if you want to pop back in11:46
* mgz wins!11:46
* natefinch is too slow11:46
natefinch:)\11:46
=== gary_poster|away is now known as gary_poster
* dimitern lunch13:18
TheMueAnyone interested in reviewing my Tailer, the first component for the debug log API: https://codereview.appspot.com/3654004313:58
=== BradCrittenden is now known as bac
TheMueIs intended to do a filtered tailing of any ReaderSeeker (like a File) into a Writer13:58
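Not TheMue's actual implementation, but a toy sketch of that shape (file path and filter are illustrative; a real tailer would keep polling for new data after EOF rather than stopping):

    package main

    import (
        "bufio"
        "io"
        "os"
        "strings"
    )

    // tail copies lines from r to w, keeping only lines accepted by filter.
    // This sketch stops at EOF instead of following the file.
    func tail(r io.ReadSeeker, w io.Writer, filter func(string) bool) error {
        if _, err := r.Seek(0, 0); err != nil { // 0 == os.SEEK_SET
            return err
        }
        scanner := bufio.NewScanner(r)
        for scanner.Scan() {
            if line := scanner.Text(); filter(line) {
                if _, err := io.WriteString(w, line+"\n"); err != nil {
                    return err
                }
            }
        }
        return scanner.Err()
    }

    func main() {
        f, err := os.Open("/var/log/juju/machine-0.log") // path assumed
        if err != nil {
            return
        }
        defer f.Close()
        tail(f, os.Stdout, func(l string) bool { return strings.Contains(l, "ERROR") })
    }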
=== ChanServ changed the topic of #juju-dev to: https://juju.ubuntu.com | On-call reviewer: see calendar | Bugs: 8 Critical, 240 High - https://bugs.launchpad.net/juju-core/
rogpeppedimitern: ping14:05
TheMuerogpeppe: quick look on https://codereview.appspot.com/36540043 ?14:08
rogpeppeTheMue: will do14:08
TheMuerogpeppe: thanks14:09
dimiternrogpeppe, pong14:10
rogpeppedimitern: i'm wondering about the upgrade-juju behaviour14:13
rogpeppedimitern: in particular: when was the checking for version consistency introduced?14:13
dimiternrogpeppe, yeah?14:13
dimiternrogpeppe, recently14:14
rogpeppedimitern: after 1.16?14:14
dimiternrogpeppe, yes14:14
rogpeppedimitern: the other thing is: does it ignore dead machines when it's checking?14:14
dimiternrogpeppe, take a look at SetEnvironAgentVersion14:15
dimiternrogpeppe, it just checks tools versions, not the life14:16
rogpeppedimitern: hmm, i think that's probably wrong then14:16
rogpeppedimitern: if an agent is dead, i think we probably don't care about its version14:16
dimiternrogpeppe, perhaps we can unset agent version from dead machines anyway14:17
rogpeppedimitern: but it's a good thing it's not released yet, because that logic won't prevent peter waller from upgrading his environment currently14:17
rogpeppedimitern: i don't think that's necessary14:17
rogpeppedimitern: i think setting life should set life only14:17
rogpeppedimitern: and it's just possible that the agent version info could be useful to someone, somewhere, i guess14:18
dimiternrogpeppe, well, it's not just that actually14:18
dimiternrogpeppe, uprgade-juju does the check of version constraints before trying to change it14:18
dimiternrogpeppe, so in fact it will have helped in peter's case not to upgrade to more than 1.1414:19
rogpeppedimitern: what do mean by the version constraints?14:20
dimiternrogpeppe, "next stable" logic14:20
dimiternrogpeppe, (or current, failing that)14:20
jamespagesinzui, hey - about to start working on 1.16.4 - I see some discussion on my observation about whether this is really a stable release - what's the outcome? what do I need todo now?14:21
sinzuijamespage, I was just writing the list to summarise jam's argument14:22
sinzuijamespage, from the dev's perspective this is a stable release because it addresses issues with how juju is currently used. Some papercuts are improvements/features, but they are always 100% compatible.14:23
jamespagesinzui, we are going to struggle with a minor release exception if that is the case14:23
sinzuijamespage, devel and minor version increments are bug features and version incompatibilities14:24
* TheMue has to step out for his office appointment, will return later14:24
sinzuis/bug/big features/14:24
jamespagesinzui, I'll discuss with a few SRU people14:24
dimiternrogpeppe, and anyway the case you're describing is very unusual - dead machines with inconsistent agent versions - that never would've happened if the usual upgrade process is followed14:24
jamespagesinzui, I did look at the changes - I think the plugin I could probably swing with as its isolated from the rest of the codebase14:24
jamespagesinzui, the provisioner safe-mode feels less SRU'able14:25
rogpeppedimitern: why not?14:25
sinzuijamespage, I am keen to do a release, I can make this 1.18.0 in a couple of hours. The devs are a little more reticent.14:25
dimiternrogpeppe, well, unless you force it ofc14:25
dimiternrogpeppe, due to the version constraints checks14:25
rogpeppedimitern: the machines have been around for a long time - their instances were manually destroyed from the aws console AFAIK14:25
jamespagesinzui, fwiw I'm trying to SRU all 1.16.x point releases to saucy as evidence that juju-core is ready for a MRE for trusty14:26
rogpeppedimitern: those checks aren't in 1.16 though, right?14:26
dimiternrogpeppe, no14:26
rogpeppedimitern: "no they aren't" or "no that's wrong" ?14:26
dimiternrogpeppe, sorry, no they aren't14:27
rogpeppedimitern: ok, cool14:27
sinzuijamespage, That is admirable. If the devs were producing smaller features to release a stable each month, would that cause pain?14:28
rogpeppedimitern: it might be a bit of a problem that one broken machine can prevent upgrade of a whole environment, but... can we manually override by specifying --version ?14:28
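rogpeppe's suggestion, as a sketch (Machine here is a stand-in, not the real state.Machine): skip Dead machines when checking agent-version consistency, so one broken machine cannot wedge upgrade-juju for the whole environment.

    package main

    import "fmt"

    type Life int

    const (
        Alive Life = iota
        Dying
        Dead
    )

    // Machine is a stand-in for the state package's type.
    type Machine struct {
        Id           string
        Life         Life
        AgentVersion string
    }

    // checkAgentVersions requires a consistent version across live agents
    // only; a dead agent's version is irrelevant.
    func checkAgentVersions(machines []Machine, want string) error {
        for _, m := range machines {
            if m.Life == Dead {
                continue
            }
            if m.AgentVersion != want {
                return fmt.Errorf("machine %s has tools %s, want %s", m.Id, m.AgentVersion, want)
            }
        }
        return nil
    }

    func main() {
        ms := []Machine{{"0", Alive, "1.17.0"}, {"1", Dead, "1.14.0"}}
        fmt.Println(checkAgentVersions(ms, "1.17.0")) // <nil>: machine 1 is ignored
    }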
* sinzui thinks enterprise customers get juju from the location CTS points to, so rapid increments is always fine14:28
* jamespage thinks about sinzui's suggestion14:29
mgzjamespage: the extra plugin was actually trying to do a bug fix in a non-intrusive way... unfortunately that does mean packaging changes instead which isn't really what you want for a minor version14:35
jamespagemgz, I guess my query is about whether a feature that allows you to backup/restore a juju environment should be landing on a stable release branch14:36
jamespagemgz, (I appreciate the way the plugin was done does isolate it from the rest of the codebase - which avoids regression potentials)14:37
=== liam_ is now known as Guest41957
dimiternmy main fuse tripped and it trips again when I turn it back on, unless i stop one of the other ones, so now I have no power on any outlet in the living room and had to do some trickery to get it to work from the bedroom :/14:47
jamespagemgz, hey - any plans on bug 124167414:48
_mup_Bug #1241674: juju-core broken with OpenStack Havana for tenants with multiple networks <cts-cloud-review> <openstack-provider> <juju-core:Triaged> <https://launchpad.net/bugs/1241674>14:48
jamespageits what I get most frequently asked about these days14:48
mgzjamespage: yeah, I should post summary to that bug14:48
mgzso then those people who ask have something to read14:48
jamespageplease do14:49
fwereadesounds like dimitern is having persistent power problems and we might not see him again today15:07
rogpeppefwereade: ping15:07
fwereaderogpeppe, pong15:08
rogpeppefwereade: would you be free for a little bit15:08
fwereaderogpeppe, maybe, but I'll have to drop off to talk to green in max 20 mins15:08
rogpeppefwereade: that would be fine15:09
fwereaderogpeppe, consider me at your service then15:09
rogpeppefwereade: https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig?authuser=115:09
rogpeppefwereade: (with peter waller)15:09
sinzuijamespage, I replied to juju 1.16.4 conversation on the list. I think you may want to correct or elaborate on what I wrote15:29
rogpeppethis is really odd15:40
rogpeppeniemeyer: ping16:18
mgzrogpeppe: I refreshed a branch you already reviewed for the update-bootstrap tweaks btw16:20
rogpeppemgz: ok, will have a look16:23
rogpeppemgz: am currently still trying to sort out this broken environment16:25
rogpeppemgz: have you looked at mgo/txn at all, BTW?16:25
mgzalas no :)16:26
mgzjust enough to add some operations to state16:26
mgzdidn't try and understand how it was actually working16:26
jamjamespage: unfortunately I cleared my traceback a bit, but I will say the "provisioner-safe-mode" is like *the key* bit that NEC actually needs, the rest is automation around stuff they can do manually.16:27
jamjamespage: is there a reason cloud-archive:tools is still reporting 1.16.0?16:27
jamsinzui: ^^16:27
jamespagejam: yes the SRU only just went into saucy - its waiting for processing in the cloud-tools staging PPA right now16:27
jamespagealong with a few other things16:28
jamjamespage: k, jpds is having a problem with keyserver stuff and that is fixed in 1.16.216:28
sinzuijamespage, is there anything I should be doing to speed that up?16:28
jamespageI'll poke smoser for review16:28
jamjamespage: "the SRU" of which version?16:30
jamespage1.16.316:30
rogpeppejam: any idea what might be going on here? http://paste.ubuntu.com/6515237/16:31
jamjamespage: great16:31
rogpeppejam: this is on the broken environment i mentioned in the standup16:31
jamrogpeppe: context?16:31
jamthx16:31
rogpeppejam: note all the calls to txn.flusher.recurse16:31
rogpeppejam: i *think* that indicates something's broken with transactions (which wouldn't actually be too surprising in this case)16:32
jamrogpeppe: the 'active' frame is the top one, right?16:32
rogpeppejam: yes16:33
niemeyerrogpeppe: Heya16:44
niemeyerrogpeppe: So, problem solved?16:44
rogpeppeniemeyer: i'm not sure it is, unfortunately16:44
niemeyerrogpeppe: Haven't seen any replies since you've mailed him about it16:44
rogpeppeniemeyer: i've been working with him to try and bring things up again.16:44
rogpeppeniemeyer: i *thought* it was all pretty much working,16:45
rogpeppeniemeyer: but there appears to be something still up with the transaction queues16:45
rogpeppeniemeyer: this is the stack trace i'm seeing on the machine agent: http://paste.ubuntu.com/6515237/16:45
rogpeppeniemeyer: note the many calls to the recurse method16:45
rogpeppeniemeyer: it seems that nothing is actually making any progress16:46
niemeyerrogpeppe: Seems to be trying to apply transactions16:47
rogpeppeniemeyer: it does, but none seem to be actually being applied16:48
niemeyerrogpeppe: That's a side effect of having missing transactions16:48
niemeyerrogpeppe: Missing transaction documents, that is16:49
niemeyerrogpeppe: It'll refuse to make progress because the system was corrupted16:49
rogpeppeniemeyer: i thought the PurgeMissing call was supposed to deal with that16:49
niemeyerrogpeppe: So it cannot make reasonable progress16:49
niemeyerrogpeppe: Yes, it is16:49
rogpeppeniemeyer: so, we did that and it seemed to succeed16:49
niemeyerrogpeppe: Did it clean everything up?16:50
niemeyerrogpeppe: on the rigth database, etc16:50
rogpeppeniemeyer: yes, i believe so - it made a lot more progress (no errors about missing transactions any more)16:50
niemeyerrogpeppe: Tell him to kill the transaction logs completely, run purge-txns again16:52
rogpeppeniemeyer: ok16:52
niemeyerrogpeppe: Drop both txns and txns.log16:52
niemeyerrogpeppe: and txns.stash16:52
rogpeppeniemeyer: ok, trying that16:53
niemeyerrogpeppe: After that, purge-txns will cause a full cleanup16:53
rogpeppeniemeyer: when you say "drop", is that a specific call, or is it just something like db.txns.remove(nil)16:55
rogpeppe?16:55
niemeyer> db.test.drop()16:55
niemeyertrue16:55
niemeyer>16:55
niemeyerrogpeppe: Be mindful.. there is no protection against doing major damage16:56
rogpeppeniemeyer: i am aware of that16:56
rogpeppeniemeyer: there is a backup though16:56
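The recovery steps niemeyer describes, sketched with the mgo driver rather than the mongo shell (database and collection names are assumptions; as niemeyer warns, there is no safety net, so take a mongodump first):

    package main

    import (
        "log"

        "labix.org/v2/mgo"
        "labix.org/v2/mgo/txn"
    )

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()
        db := session.DB("juju")

        // Drop the transaction collections outright, as suggested.
        for _, name := range []string{"txns", "txns.log", "txns.stash"} {
            if err := db.C(name).DropCollection(); err != nil {
                log.Printf("dropping %s: %v", name, err)
            }
        }

        // Then purge references to now-missing transactions from the
        // collections juju runs transactions against (names assumed here).
        runner := txn.NewRunner(db.C("txns"))
        if err := runner.PurgeMissing("machines", "services", "units"); err != nil {
            log.Fatal(err)
        }
    }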
niemeyerrogpeppe: Yeah, I'm actually curious about one thing:16:57
niemeyerrogpeppe: the db dump I got.. was that the backup, or was that the one being executed live?16:57
rogpeppeniemeyer: that was a backup made at my instigation16:57
sinzuiBug #1257371 is a regression that breaks bootstrapping on aws and canonistack16:57
_mup_Bug #1257371: bootstrap fails because Permission denied (publickey) <bootstrap> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1257371>16:57
rogpeppeniemeyer: i.e. after the problems had started to occur16:58
rogpeppee16:58
rogpepper16:58
niemeyerrogpeppe: Right, I'm pretty sure trying to run the system on that state would great quite a bit of churn in the database16:58
niemeyerrogpeppe: s/great/create/16:58
niemeyerrogpeppe: Depending on the retry strategies...16:59
niemeyerrogpeppe: This might explain why the database was growing17:00
niemeyerrogpeppe: and might also explain why the system is in that state you see now17:00
rogpeppeniemeyer: ok. let's hope this strategy works then17:00
rogpeppeniemeyer: just about to drop. wish me luck :-)17:00
niemeyerrogpeppe: The transactions may all be fine now.. but if you put a massive number of runners trying to finalize a massive number of pending and dependent transactions at once, it won't be great17:00
niemeyerrogpeppe: The traceback you pasted seems to corroborate with that theory too17:01
rogpeppeniemeyer: collections dropped17:01
rogpeppeniemeyer: it's currently purged >10000 transactions17:05
niemeyerrogpeppe: There you go..17:06
niemeyerrogpeppe: No wonder it was stuck17:06
rogpeppeniemeyer: it's still going...17:06
niemeyerrogpeppe: That's definitely not the database I have here, by the way17:06
niemeyerrogpeppe: I did check the magnitude of proper transactions to be applied17:07
rogpeppeniemeyer: indeed not - i think they've all been started since this morning17:07
rogpeppeniemeyer: there were only a page or so this morning17:07
niemeyerrogpeppe: Well, a page of missing17:08
niemeyerrogpeppe: The problem now is a different one17:08
rogpeppeniemeyer: ah yes17:08
niemeyerrogpeppe: These are not missing or bad transactions17:08
niemeyerrogpeppe: They're perfectly good transactions that have been attempted continuously and in parallel, but unable to be applied because the system was wedged with a few transactions that were lost17:08
niemeyerrogpeppe: Then, once the system was restored to a good state, there was that massive amount of pending transactions to be applied.. and due to how juju is trying to do stuff from several fronts, there was an attempt to flush the queues concurrently17:09
niemeyerrogpeppe: Not great17:10
niemeyerrogpeppe: At the same time, a good sign that the txn package did hold the mess back instead of creating havoc17:10
rogpeppeniemeyer: yeah17:11
rogpeppeniemeyer: 34500 now17:11
niemeyerrogpeppe: Gosh17:12
niemeyerrogpeppe: How come it was running for so long?17:13
niemeyerrogpeppe: What happens when juju panics?  I guess we have upstart scripts that put it back alive?17:13
rogpeppeniemeyer: it *should* all be ok17:14
rogpeppeniemeyer: the main problem with panics is that when they recur continually, the logs fill up17:14
rogpeppeniemeyer: and that was the indirect cause of what we're seeing now17:14
niemeyerrogpeppe: Well, that's not the only problem.. :)17:15
rogpeppeniemeyer: indeed17:15
rogpeppeniemeyer: 5 whys17:15
niemeyerrogpeppe: "OMG, things are broken! Fix it!" => "Try it again!" => "OMG, things are broken! Fix it!" => "Once more!" => .....17:15
niemeyerrogpeppe: That's how we end up with tens of thousands of pending transactions :)17:16
rogpeppeniemeyer: well to be fair, we only applied one fix today17:16
niemeyerrogpeppe: Hmm.. how do you mean?17:17
rogpeppeniemeyer: we ran PurgeMissing17:17
niemeyerrogpeppe: Sorry, I'm missing the context17:18
niemeyerrogpeppe: I don't get the hook of "to be fair"17:18
rogpeppeniemeyer: ah, i thought you were talking about human intervention17:18
rogpeppeniemeyer: but perhaps you're talking about what the agents were doing17:18
niemeyerrogpeppe: No, I'm talking about the fact the system loops continuously doing more damage when we explicitly say in code that we cannot continue17:19
rogpeppeniemeyer: right17:19
rogpeppeniemeyer: it's an interesting question as to what's the best approach there17:19
rogpeppeniemeyer: i definitely think that some kind of backoff or retry limit would be good17:20
niemeyerrogpeppe: Yeah, I think we should enable that in our upstart scripts17:20
niemeyerrogpeppe: This is a well known practice, even in systems that take the fail-and-restart approach to heart17:20
niemeyerrogpeppe: (e.g. erlang)17:20
niemeyerrogpeppe: (or, Erlang OTP, more correctly)17:20
rogpeppeniemeyer: yeah17:21
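In an upstart job that might look like the following sketch (job name and paths are assumed, not juju's actual script): respawn, but give up after a bounded number of crashes per interval instead of looping forever.

    # /etc/init/jujud-machine-0.conf -- sketch only, names assumed
    description "juju machine agent"
    start on runlevel [2345]
    stop on runlevel [!2345]
    respawn
    # Give up if the agent dies more than 10 times within 60 seconds,
    # instead of hammering the datastore in a tight crash loop.
    respawn limit 10 60
    exec /var/lib/juju/tools/machine-0/jujud machine --machine-id 0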
rogpeppeniemeyer: hmm, 70000 transactions purged so far. i'm really quite surprised there are that many17:23
niemeyerrogpeppe: Depending on how far that goes, it might be wise to start from that backup instead of that crippled live system17:31
rogpeppeniemeyer: latest is that it has probably fixed the problem17:34
rogpeppeniemeyer: except...17:34
rogpeppeniemeyer: that now amazon has rate-limited the requests because we'd restarted too often (probably)17:35
rogpeppeniemeyer: so hopefully that will have resolved by the morning17:35
niemeyerrogpeppe: Gosh..17:35
rogpeppeniemeyer: lots of instance id requests because they've got a substantial number of machines in the environment which are dead (with missing instances)17:36
rogpeppeniemeyer: and if we get a missing instance, we retry because amazon might be lying due to eventual consistency17:37
rogpeppeniemeyer: so we make more requests than we should17:37
niemeyerrogpeppe: Right17:47
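A sketch of bounding those retries (all names illustrative, not juju's actual code): allow a few backed-off rechecks to cover eventual consistency, then believe the provider, rather than retrying indefinitely and tripping rate limits.

    package main

    import (
        "fmt"
        "time"
    )

    // instanceExists retries a "does this instance exist?" check a bounded
    // number of times with exponential backoff, then accepts the answer.
    func instanceExists(check func() (bool, error)) (bool, error) {
        delay := time.Second
        for attempt := 0; attempt < 5; attempt++ {
            ok, err := check()
            if err != nil {
                return false, err
            }
            if ok {
                return true, nil
            }
            time.Sleep(delay)
            delay *= 2 // back off to avoid hammering the provider API
        }
        return false, nil // still missing after bounded retries: treat as gone
    }

    func main() {
        calls := 0
        ok, _ := instanceExists(func() (bool, error) { calls++; return calls > 2, nil })
        fmt.Println(ok, calls) // true 3
    }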
=== liam_ is now known as Guest7421
* rogpeppe is done for the day18:22
rogpeppeg'night all18:22
smoserhey.18:40
smoserbefore i write an email to juju-dev18:40
smosercan someone tell me real quick if there is some plan (or existing path) that a charm can indicate that it can or cannot run in a lxc container18:40
smoserand if so, any modules that it might need access to or devices (or kernel version or such)18:41
natefinchsmoser: I don't think we have any such thing today.... I don't know of a plan to include such a thing.18:52
smoserthanks.18:53
thumpermorning20:12
thumperalso, WTF?20:12
thumperanyone got a working environment up?20:12
thumperI get: ERROR <nil> when I go 'juju add-machine'20:13
thumperanyone else confirm?20:13
natefinchdoh20:13
natefinchthumper: lemme give it a try, half a sec, need to switch to trunk20:13
thumperkk20:13
thumperoh, and yay20:15
thumperwith the kvm local provider I can create nested kvm20:15
* thumper wants to try lxc in kvm in kvm20:15
thumperheh...20:15
natefinchjust keep nesting until something breaks20:15
thumperalso means I can fix the kvm provisioner code without needing to start canonistack20:15
natefinchawesome20:15
thumpernatefinch: I've heard from robie that three deep causes problems20:15
thumperbut I've not tested20:16
thumperalso, memory probably an issue...20:16
thumperthe outer kvm would need more ram for the inner kvm to work properly20:16
natefinchwhere's your sense of adventure?20:16
thumperbut that too would allow me to test the hardware characteristics20:16
* thumper has 16 gig of ram20:16
thumperlets do this20:16
natefinch:D20:17
thumperafter I've fixed the bug that is...20:17
thumperkvm container provisioner is panicking20:17
natefinchno one on warthogs wants to talk about google compute engine evidently...20:19
thumperheh20:19
* thumper goes to write a stack trace function for loggo20:20
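Such a helper needs little more than runtime.Stack; a sketch (the loggo wiring is assumed -- e.g. the result could be emitted via logger.Errorf("%s", stackTrace())):

    package main

    import (
        "fmt"
        "runtime"
    )

    // stackTrace returns the current goroutine's stack as a string,
    // growing the buffer until the whole trace fits.
    func stackTrace() string {
        buf := make([]byte, 8192)
        for {
            n := runtime.Stack(buf, false)
            if n < len(buf) {
                return string(buf[:n])
            }
            buf = make([]byte, len(buf)*2)
        }
    }

    func main() {
        fmt.Print(stackTrace())
    }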
natefinchsome day juju status will return20:20
natefinchand then I can try add machine20:20
natefinchthumper: add machine works for me on trunk/ec220:27
thumperno error?20:27
natefinchcorrect20:27
thumperit could well be linked to the kvm stuff20:27
thumperta, I'll keep digging20:27
natefinchwelcome20:28
hazinhellnatefinch, what about it?20:28
hazinhellgce that is20:28
thumperI was wondering why my container in a container was taking so long to start20:51
thumperit seems the host is downloading the cloud image20:51
thumperwhat we really want is a squid cache on the host machine20:51
thumperwho knows squit?20:51
thumpersquid20:51
natefinch......crickets20:55
* thumper hangs his head21:06
thumperdamn networking21:06
thumperso, this kinda works...21:06
* thumper wonders where the "br0" is coming from...21:08
* thumper thinks...21:08
thumperah21:09
thumperDefaultKVMBridge21:09
* thumper tweaks the local provider to make eth0 bridged21:09
* thumper wonders how crazy this is getting21:10
natefinchthumper: I can't see br0 without thinking "You mad bro?"21:10
thumperheh21:10
natefinchand usually, if I'm looking at br0, I'm mad21:11
thumper:)21:12
hazinhellthumper, varnish ftw ;-)21:16
hazinhellthumper, are we setting up lxc on a different bridge than kvm?21:16
thumperhazinhell: varnish?21:16
thumperhazinhell: well, lxc defaults to lxcbr0 and kvm to virbr021:17
thumperthe config wasn't setting one21:17
thumperand for a container inside the local provider we need to have bridged eth021:17
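The bridged-eth0 setup thumper means might look like this /etc/network/interfaces sketch (device names assumed):

    # /etc/network/interfaces -- sketch: bridge the host's eth0 so nested
    # kvm/lxc guests share the physical network
    auto br0
    iface br0 inet dhcp
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0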
hazinhellthumper, varnish over squid for proxy.21:17
thumperhazinhell: docs?21:17
hazinhellthumper, varnish-cache.org.. but if your using one of the apt proxies, afaik only squid is setup for that21:18
thumperhazinhell: what I wanted was a local cache of the lxc ubuntu-cloud image and the kvm one21:19
thumperto make creating container locally faster21:19
thumperas a new kvm instance needs to sync the images21:19
thumperto start an lxc or kvm container21:19
hazinhellthumper, lxc already caches21:20
thumperhazinhell: not for this use case21:20
thumperbecause it is a new machine21:20
thumperhazinhell: consider this ...21:20
thumperlaptop host21:20
thumperhas both kvm and lxc images cached21:21
thumperboot up kvm local provider21:21
thumperstart a machine21:21
thumperuses cache21:21
thumperthen go "juju add-machine kvm:1"21:21
thumpermachine 1, the new kvm instance, then syncs the kvm image21:21
thumperthis goes to the internet to get it21:21
thumperI want a cache on the host21:21
thumpersimilarly if the new machine 1 wants an lxc image21:21
hazinhellah.. nesting with cache access21:21
thumperit goes to the internet to sync image21:21
thumperack21:22
thumperso squid cache on the host to make it faster21:22
hazinhellthumper, what about mount the host cache over21:22
thumperfor new machines starting containers21:22
hazinhellthumper, read mount21:22
thumpersounds crazy :)21:22
hazinhellit does.. you need some supervision tree to share the read mounts down the hierarchy21:23
hazinhells/supervision/21:23
hazinhellthumper,21:23
hazinhellthumper, you could just do the host object storage (provider storage) and link the cache into that21:24
thumpersurely a cache on the host would be less crazy21:24
hazinhellthumper, the host already has the cache; a mount of that directly into the guests allows all the default tools to see it without any extra work on juju's part21:24
hazinhelldoing a network endpoint, means you have to interject some juju logic to pull from that endpoint into the local disk cache21:25
hazinhelland you end up with wasted space21:25
hazinhellit's kind of a shame we can't use the same for both..21:27
hazinhellie lxc is a rootfs and kvm is basically a disk image.21:28
hazinhellhmm21:28
hazinhellsadly can't quite loop dev the img and mount it into the cache, lxc wants a tarball there, would have to set it up as a container rootfs.21:30
thumperhmm...22:00
thumperhmm...22:06
=== dspiteri is now known as DarrenS
hazinhellthumper, read mount sound good?22:27
hazinhellthumper, or something else come to mind?22:28
thumperhazinhell: busy fixing the basics at the moment22:28
hazinhellack22:28
* hazinhell returns to hell22:28
thumperanyone have an idea why my kvm machine doesn't have the networking service running?22:40
* thumper steps back a bit22:40
