/srv/irclogs.ubuntu.com/2014/04/15/#juju-dev.txt

hazmatsmoser, interesting.. coreos guys rewrote cloudinit in go..00:03
smoseri hadn't seen that.00:04
hazmatsmoser, its very limited subset and assumes coreos /systemd https://github.com/coreos/coreos-cloudinit00:06
hazmatits a bit much for them to call it  cloudinit... its almost zero feature set overlap00:07
perrito666did anyone see fwereade after this am? (and when I say AM I mean GMT-3 AM)00:16
davecheneyperrito666: its unusual to see him online at this time00:18
perrito666davecheney: I know, he just said that he was taking a plane and returning later and then I got disconnected00:19
davecheneyperrito666: ok, you probably know more than i then00:26
perrito666heh tx davecheney00:27
hazmathmm.. odd /bin/sh: 1: exec: /var/lib/juju/tools/unit-mysql-0/jujud: not found00:38
sinzuihazmat, looks like the last message in juju-ci-machine-0's log. Jujud just disappeared 2 weeks ago. Since that machine is the gateway into the ppc testing, we left it where it was00:41
sinzuithumper, I can hangout now00:42
hazmatsinzui, its odd its there.. the issue is deployer/simple.go00:42
hazmatit removes the symlink on failure, but afaics that method never failed, the last line is install the upstart job, and the job is present on disk.00:43
thumpersinzui: just munching00:44
thumperwith you shortly00:44
* sinzui watches ci00:44
hazmatsinzui, ie its resolvable with sudo ln -s /var/lib/juju/tools/1.18.1-precise-amd64/  /var/lib/juju/tools/unit-owncloud-000:44
hazmathmm.. its as though the removeOnErr was firing00:45
hazmateven on success00:45
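[editor's note: a minimal Go sketch of the cleanup pattern being discussed — a deferred removeOnErr(&err, ...) that deletes the symlink when the named error is non-nil. Names and paths are illustrative, not the actual deployer/simple.go source; the hazard hazmat suspects is the deferred cleanup firing even though every visible step succeeded, e.g. if err is ever shadowed below the defer.]

    package deployer

    import "os"

    // removeOnErr removes path if the caller is returning a non-nil error.
    func removeOnErr(err *error, path string) {
        if *err != nil {
            os.RemoveAll(path)
        }
    }

    // deployUnit sketches the install sequence: link the tools, then install
    // the upstart job. If a later step assigns to a shadowed "err :=", the
    // outer err the deferred func inspects no longer reflects reality.
    func deployUnit(toolsDir, unitDir string) (err error) {
        if err = os.Symlink(toolsDir, unitDir); err != nil {
            return err
        }
        defer removeOnErr(&err, unitDir)
        // ... write the upstart job; on any error the symlink is removed ...
        return nil
    }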
* sinzui nods00:47
thumpersinzui: https://plus.google.com/hangouts/_/76cpik697jvk5a93b3md4vcuc8?hl=en00:49
sinzuiwallyworld, jam: looks like all the upgrade tests are indeed fixed. I disabled the local-upgrade test for thumper. I will retest when I have the time or when the next rev lands00:50
wallyworld\o/00:50
thumpersinzui: do local upgrade and local deploy run on the same machine?00:50
thumpersinzui: can't hear you00:50
wallyworldsinzui: so if thumper actually pulls his finger out, we could release 1.19.0 real soon now?00:50
hazmatdeployer worker is a bit strange .. does it use a tombstone to communicate back to the runner?00:53
hazmatthumper, when you have a moment i'd like to chat as well..00:56
thumperhazmat: ack00:56
wallyworldhazmat: the deployer worker is similar to most others, it is created by machine agent but wrapping it inside a worker.NewSimpleWorker00:59
hazmatwallyworld, ah. thanks01:00
wallyworldnp. that worker stuff still confuses me each time i have to re-read the code01:01
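[editor's note: a rough sketch of the wrapping wallyworld describes. The worker.NewSimpleWorker signature — a func(stop <-chan struct{}) error wrapped into a Worker — is assumed from the juju-core worker package of this era; treat the import path and names as illustrative.]

    package agent

    import "launchpad.net/juju-core/worker"

    // newDeployerWorker shows the shape: the machine agent hands a plain
    // function to NewSimpleWorker, which supplies the Kill/Wait plumbing.
    func newDeployerWorker() worker.Worker {
        return worker.NewSimpleWorker(func(stop <-chan struct{}) error {
            // ... watch for units to deploy or recall until asked to stop ...
            <-stop
            return nil
        })
    }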
hazmatthe pattern is a bit different01:04
hazmattrying to figure out why i'd get 2014-04-15 00:00:42 INFO juju runner.go:262 worker: start "1-container-watcher"  .. when there are no containers.. basically my manual provider + lxc seems a bit busted with 1.1801:04
hazmatalso trying to figure out if on a simpleworker erroring, if the runner will just ignore it and move on.01:04
hazmatwith no log01:04
hazmatthe nutshell being deploy workloads gets that jujud not found01:05
* hazmat instruments01:05
thumperhazmat: whazzup?01:08
hazmatthumper, trying to debug 1.18 with lxc + manual01:11
hazmatthumper, mostly in the backlog01:11
sinzuiWow.01:11
sinzuiabentley replaced the mysql + wordpress charms with dummy charms that instrument and report what juju is up to. They have taken 2-4 minutes off of all the tests01:13
sinzuiAzure deploy in under 20 minutes01:13
sinzuiAWS is almost as fast as HP Cloud01:14
davecheneysinzui: \o/01:16
waiganiwallyworld: should I patch envtools.BundleTools in a test suite e.g. coretesting? Or should I copy the mocked function to each package that is failing and patch there?01:17
waiganiwallyworld: it's just there seem to be a lot of tests that are all affected/fixed by this patch01:18
wallyworlduse s.PatchValue01:18
waiganiwallyworld: yep I am01:18
waiganibut should I do it in a more generic suite?01:18
wallyworldso if the failures are clustered in a particular suite, you can use that in SetUpTest01:18
wallyworldnot sure it's worth doing a fixture for a one liner01:19
waiganiwallyworld: that is what I'm doing now, but aready I've done that in about 4 packages, with more to go01:19
waiganiwallyworld: oh okay, you mean just patch in each individual test?01:19
wallyworldpossibly, depends on where the failures are01:20
waiganiokay, I'll do it the verbose way and we can cleanup in review if needed01:20
wallyworldbut if the failures are in a manageable number of suites, doing the patch in SetUpTest makes sense01:20
waiganiokay01:21
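[editor's note: the SetUpTest patching wallyworld suggests, sketched with a hypothetical suite; ToolsSuite and mockBundleTools are invented names, and envtools stands for the environs/tools package of this era. s.PatchValue registers a cleanup, so the original is restored in TearDownTest.]

    func (s *ToolsSuite) SetUpTest(c *gc.C) {
        s.BaseSuite.SetUpTest(c)
        // mockBundleTools is a hypothetical stand-in matching the signature
        // of envtools.BundleTools; the one-liner runs once per test instead
        // of being repeated in every failing test body.
        s.PatchValue(&envtools.BundleTools, mockBundleTools)
    }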
thumperwhat the actual fuck!01:28
sinzuiwallyworld, CI hates the unit-tests on precise. Have you seen these tests fail consistently in pairs before? http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/run-unit-tests-amd64-precise/617/console01:33
sinzui^ The last three runs on different precise instances have the same failure01:34
thumpersinzui: I have some binaries copying to the machine01:34
wallyworldsinzui: i haven't seen those. and one of them, TestOpenStateWithoutAdmin, is the test added in the branch i landed for john to make upgrades work01:35
sinzuithank you thumper.01:35
wallyworldso it seems there's a mongo/precise issue01:35
wallyworldthumper: were you running some tests in a precise vm?01:36
thumperwallyworld: I have a real life precise machine01:37
thumperwallyworld: that it works fine01:37
thumperon01:37
thumperI've hooked up loggo to the mgo internals logging01:37
thumperso we can get internal mongo logging out of the bootstrap command01:37
thumperuploading some binaries now01:37
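[editor's note: a plausible sketch of the hookup thumper describes — mgo.SetLogger accepts anything with an Output(calldepth int, s string) error method, so a tiny adapter can forward mgo's internal logging into loggo. The logger name "juju.mgo" is invented.]

    package main

    import (
        "labix.org/v2/mgo"
        "launchpad.net/loggo"
    )

    var mgoLogger = loggo.GetLogger("juju.mgo")

    type mgoLogAdapter struct{}

    // Output satisfies the interface mgo.SetLogger expects.
    func (mgoLogAdapter) Output(calldepth int, s string) error {
        mgoLogger.Debugf("%s", s)
        return nil
    }

    func main() {
        mgo.SetLogger(mgoLogAdapter{})
        mgo.SetDebug(true) // emit mgo's internal traces through loggo
        // ... bootstrap / dial as usual ...
    }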
wallyworldhmm. so what's different on jenkins then to cause the tests to fail01:37
thumpernot sure01:37
thumpersame version of mongo01:38
thumpermy desktop is i38601:38
thumperci is amd6401:38
thumperthat is all I can come up with so far01:38
wallyworldif that is the cause then we're doomed01:38
thumper:-)01:38
thumperFSVO doomed01:38
wallyworldyeah :-)01:38
thumperthe error is that something inside mgo is explicitly closing the socket01:39
thumperwhen we ask to set up the replica set01:39
wallyworldthumper: so, one thing it could be - HA added an admin db01:39
thumperhence the desire for more logging01:39
thumperwallyworld: my binaries work locally01:39
thumperand copying up01:39
thumperif that is the case01:39
thumperand my binaries work01:39
wallyworldand the recently added test which i reference above tests that we can ignore unauth access to that db01:39
thumperit could be that01:39
* thumper nods01:40
wallyworldand that test fails01:40
thumperstill copying that file01:40
* thumper waits...01:40
* wallyworld waits too....01:40
thumperand here I was wanting to sleep01:40
thumpernot feeling too flash01:40
wallyworld:-(01:40
hazmatthumper, sinzui fwiw.  my issue was user error around series. i have trusty containers but had registered them as precise, machine agent deployed fine, unit agents didn't like it though. unsupported usage mode.01:41
thumperhaha01:41
hazmatthumper, conceivably the same happens when you dist-upgrade a machine01:41
sinzuithumper, wallyworld: the machines that run the unit tests are amd64 m1.larges for precise and trusty. We see 95% of users deploy to amd6401:41
thumperhmm...01:41
thumpersinzui: right...01:42
sinzuiwe saw numbers that showed a very small number were i386, we assume those are clients, not services01:42
* thumper nods01:42
thumperwallyworld: can I get you to try the aws reproduction?01:42
thumperwallyworld: are you busy with anything else?01:42
wallyworldi am but i can01:43
wallyworldwhat's up with aws?01:43
thumperjust trying to replicate the issues that we are seeing on CI with the local provider not bootstrapping01:43
thumperit works on trusty for me01:44
thumperand precise/i38601:44
thumperbut we should check real precise amd6401:44
wallyworldok, so you want to spin up an aws precise amd64 and try there01:44
thumperright01:45
wallyworldokey dokey01:45
thumperinstall juju / juju-local01:45
wallyworldyarp01:45
thumperprobably need to copy local 1.19 binaries01:45
thumperto avoid building on aws01:45
wallyworldright01:45
thumperugh...01:51
thumperman I'm confused01:51
thumperwallyworld: sinzui: using my extra logging http://paste.ubuntu.com/7253010/01:57
thumperso not a recent fix issue01:57
wallyworldthumper: we should just disable the replica set stuff01:58
wallyworldit has broken so much01:58
thumperperhaps worth doing for the local provider at least01:58
thumperwe are never going to want HA on local01:58
thumperit makes no sense01:58
sinzuiclosed explicitly? That's like the computer says no01:59
thumpersinzui: ack01:59
* thumper has a call now02:01
sinzuiaxw, Is there any more I should say about azure availability sets? https://docs.google.com/a/canonical.com/document/d/1BXYrLC78H3H9Cv4e_4XMcZ3mAkTcp6nx4v1wdN650jw/edit02:06
axwsinzui: otp02:12
wallyworldthumper: sinzui: i'm going to test this patch to disable the mongo replicaset setup for local provider https://pastebin.canonical.com/108522/02:18
wallyworldthis should revert local bootstrap to be closer to how it was prior to HA stuff being added02:19
wallyworldand hence it should remove the error in thumper's log above hopefully02:19
axwsinzui: can I have permissions to add comments?02:20
thumpersinzui: this line is a bit suspect 2014-04-15 02:20:44 DEBUG mgo server.go:297 Ping for 127.0.0.1:37019 is 15000 ms02:21
thumpersinzui: locally I have 0ms02:21
sinzuisorry axw I gave all canonical write access as I intended02:22
axwsinzui: ta02:22
* sinzui looks in /etc02:22
axwsinzui: availability-sets-enabled=true by default; I'll update the notes02:23
thumperwallyworld: that patch is wrong02:27
wallyworldi know02:27
wallyworldfound that out02:27
wallyworlddoing it differently02:27
thumperwallyworld: jujud/bootstrap.go line 165, return there if local02:28
wallyworldyep02:28
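[editor's note: illustrative only — the shape of the quick fix thumper points at in jujud/bootstrap.go: return before the replica-set setup when the provider is local. The names below are guesses, not the landed diff.]

    package bootstrap

    // maybeSetUpReplicaSet skips HA/replica-set initiation entirely for the
    // local provider, reverting it to pre-HA bootstrap behaviour.
    func maybeSetUpReplicaSet(providerType string) error {
        if providerType == "local" {
            return nil // local provider: no replica set, no HA
        }
        // ... initiate the replica set for all other providers ...
        return nil
    }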
axwsinzui: I updated the azure section, would you mind reading over it to see if it makes sense to you?02:37
sinzuiThank you axw. Looks great02:43
thumpersinzui: wallyworld, axw: bootstrap failure with debug mgo logs: http://paste.ubuntu.com/7253155/02:44
thumpersinzui: I don't know enough to be able to interpret the errors02:44
thumpersinzui: perhaps we need gustavo for it02:44
sinzuithanks for playing thumper02:44
wallyworldsinzui: can you re-enable local provider tests in CI? i will do a branch to try and fix it and then when landed CI can tell us if it works02:45
thumpersinzui: I'm done with the machine now02:45
sinzuiI will re-enable the tests02:45
wallyworldthanks02:46
wallyworldlet's see if the next branch i land works02:46
sinzuithumper, wallyworld . I think you had decided to disable HA on local...and how would I do HA with local...Does that other machine get proper access to my local machine that probably has died with me at the keyboard02:47
thumpersinzui: you wouldn't do HA with the local provider02:47
thumper:)02:47
wallyworldsinzui: we are trying to set up replicaset and other stuff which is just failing with local and for 1.19 t least, i can't see why we would want that02:48
sinzui:)02:48
wallyworldso to get 1.19 out, we can disable and think about it later02:48
sinzuiwallyworld, really, I don't think we ever need to offer HA for local provider.02:49
wallyworldmaybe for testing02:49
wallyworldbut i agree with you02:49
wallyworldi was being cautious in case others were attached to the idea02:49
wallyworldaxw: this should make local provider happy again on trunk: https://codereview.appspot.com/8783004403:45
axwwallyworld: was afk, looking now03:57
wallyworldta04:01
axwwallyworld: reviewed04:05
wallyworldta04:05
wallyworldaxw: everyone hates that we use local provider checks in jujud04:05
wallyworldbeen a todo for a while to fix04:06
axwyeah, I kind of wish we didn't have to disable replicasets at all though04:06
axwI know they're not needed, but if they just worked it would be nice to not have a separate code path04:06
wallyworldaxw: yeah. we could for 1.19.1, but we need 1.19 out the door and HA still isn't quite ready anyway04:07
wallyworldit is indeed a bandaid. nate added another last week also04:08
axwwallyworld: yep, understood04:09
wallyworldmakes me sad too though04:09
sinzuiwallyworld, Your hack solved local. The last probable issue is the broken unit tests for precise. I reported bug 130783605:12
_mup_Bug #1307836: Ci unititests fail on precise <ci> <precise> <test-failure> <juju-core:Triaged> <https://launchpad.net/bugs/1307836>05:12
wallyworldsinzui: yeah, i just saw that but didn't think you'd be awake05:13
sinzuiI don't want to be awake05:13
wallyworldi didn't realise we still had the precise issue :-(05:14
wallyworldi'll look at the logs05:14
wallyworldhopefully we'll have some good news when you wake up05:14
sinzuiwallyworld, azure-upgrade hasn't passed yet. It may not because azure is unwell this hour. We don't need to worry about a failure for azure. I can ask for a retest when the cloud is better05:22
wallyworldrighto05:23
* sinzui finds pillow05:23
wallyworldgood night05:23
davecheney  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND05:32
davecheney 7718 ubuntu    20   0 2513408 1.564g  25152 S  45.2 19.6   2:41.51 juju.test05:32
davecheneymemory usage for Go tests is out of control05:32
wallyworldjam1: you online?05:47
jam1morning wallyworld05:47
jam1I am05:47
wallyworldg'day05:47
wallyworldjam1: so with you branch, and one i did, CI is happy for upgrades05:47
wallyworldbut05:47
wallyworlda couple of tests fail under precise05:47
wallyworldthere's the one you added for your branch, plus TestInitializeStateFailsSecondTime05:48
jam1wallyworld: links to failing tests ?05:48
wallyworldthe error says that a connection to mongo is unauth05:48
wallyworldhttp://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/run-unit-tests-amd64-precise/621/consoleFull05:48
jam1wallyworld: and are you able to see the local provider fail with replica set stuff, because neither Tim or I could reproduce it.05:48
wallyworldyeah, i saw it05:49
wallyworldand fixed05:49
wallyworldi had to disable HA for local provider05:49
jam1and while we don't have to have replica set local, I'd prefer consistency and the ability to test out HA locally if we could05:49
wallyworldsure05:49
wallyworldbut to get 1.19 out the door i went for a quick fix05:49
wallyworldwhich we can revisit in 1.19.105:49
wallyworldcurtis was ok with that05:50
jam1wallyworld: so I certainly had a WTF about why I was able to create a machine in "admin" but not able to delete it without logging in as the admin I just created.05:50
jam1wallyworld: so it seems like some versions of Mongo don't have that security hole05:50
jam1but I can't figure out how to log in as an actual admin, but I can try digging into the TestInitialize stuff a bit more for my test.05:50
wallyworldso we are using a different mongo on precise vs trusty?05:51
jam1wallyworld: 2.4.6 vs 2.4.905:51
wallyworldok, i didn't realise that05:51
jam1Trusty is the one that lets you do WTF stuff.05:51
wallyworld:-(05:51
wallyworldthere are 2 failing tests05:51
wallyworldmaybe more, i seem to recall previous logs showing more05:51
wallyworldbut the latest run had 2 failures only05:52
wallyworldthe other one was TestInitializeStateFailsSecondTime05:52
wallyworldjam1: i gotta run to an appointment soon, but will check back when i return. if we can get this sorted, we can at least release 1.19.0 asap and deal with the workarounds for 1.19.105:53
jam1wallyworld: is your code landed?05:54
wallyworldyep05:54
jam1k05:54
wallyworldhappy to revert it if we can find a fix05:54
jam1I'll pick it up05:54
wallyworldthanks, i can look also but only found out about precise tests just before and sadly i gotta duck out05:55
jam1hmm... LP failing to load for me right now05:56
davecheneywallyworld: CI is running an ancient version of mongo05:56
davecheneythat won't help05:56
jam1davecheney: sinzui: I would think we should run mongo 2.4.6 which is the one you get from the cloud-archive:tools06:02
davecheneyjam1: agreed06:04
jam1davecheney: are they running 2.2.4 from the PPA?06:07
davecheneyjam1: good point, 2.0 was all that shipped in precise06:09
jam1I'm just trying to find a way to reproduce, and I thought there was a 2.4.0 out there for a while, but I can't find it06:09
jam1and it isn't clear *what* version they are running.06:10
davecheneyjam1: Get:40 http://ppa.launchpad.net/juju/stable/ubuntu/ precise/main mongodb-clients amd64 1:2.2.4-0ubuntu1~ubuntu12.04.1~juju1 [20.1 MB]06:10
davecheneyGet:41 http://ppa.launchpad.net/juju/stable/ubuntu/ precise/main mongodb-server amd64 1:2.2.4-0ubuntu1~ubuntu12.04.1~juju1 [5,135 kB]06:10
davecheneythis is our fault06:10
davecheneyremember that old ppa06:10
jam1yep, thanks for pointing me to it06:10
jam1well, I can at least test with it.06:10
davecheneyso, that isn't the cloud archive06:10
davecheney:emoji concerned face06:10
jam1At one point we probably wanted to maintain compat with 2.2.4, but I'm not *as* concerned with it anymore.06:10
davecheney2.2.4 never shipped in any main archive06:12
davecheneyi don't think we have a duty of compatibility06:12
davecheneyhttps://bugs.launchpad.net/juju-core/+bug/1307289/comments/106:12
davecheneyif anyone cares06:12
davecheneybtw, go test ./cmd/juju{,d}06:12
davecheneytakes an age because the test setup is constantly recompiling the tools06:12
davecheneywhy are the cmd/juju tests calling out to bzr ?06:27
davecheney FAIL: publish_test.go:75: PublishSuite.SetUpTest06:33
davecheneypublish_test.go:86:06:33
davecheney    c.Assert(err, gc.IsNil)06:33
davecheney... value *errors.errorString = &errors.errorString{s:"error running \"bzr init\": exec: \"bzr\": executable file not found in $PATH"} ("error running \"bzr init\": exec:06:33
davecheney \"bzr\": executable file not found in $PATH")06:33
davecheneywhat is this shit ?06:33
rogpeppemornin' all06:44
davecheneyhttps://bugs.launchpad.net/juju-core/+bug/130786506:47
davecheneythis seems like an obvious failure06:47
davecheneywhy does it only happen sporadically ?06:47
rogpeppedavecheney: that's been the case for over a year (tests running bzr)06:47
davecheneyrogpeppe: fair enough06:48
rogpeppedavecheney: i agree, that does seem odd06:48
jam1rogpeppe: do we have thoughts on how we would have a Provider work that didn't have storage? I know we don't particularly prefer the HTTP Storage stuff that we have.06:50
rogpeppejam1: we'd need to provide something to the provider that enabled it to fetch tools from the mongo-based storage06:51
jam1rogpeppe: so we'd have to do away with "provider-state" file as well, right?06:51
rogpeppejam1: other than that, i don't think providers rely much on storage, do they?06:51
jam1rogpeppe: we use it for charms06:51
rogpeppejam1: so... provider-state is *supposed* to be an implementation detail of a given provider06:52
jam1sure06:52
jam1it is in the "common code" path, but you wouldn't have to use it/could make that part optional06:52
rogpeppejam1: we don't really rely on it much these days06:53
jam1rogpeppe: we'd want bootstrap to cache the API creds and then we rely on it very little06:53
jam1you'd lose the fallback path06:53
rogpeppejam1: yeah, and we don't want to lose that entirely06:54
rogpeppejam1: for a provider-state replacement, i'd like to see the fallback path factored out of the providers entirely06:54
jam1well, it only works because there is a "known location" we can look in that is reasonably reliable. If a cloud doesn't provide its own storage, then any other location is just guesswork06:54
jam1anyway, switching machines now06:55
rogpeppejam1: ok06:55
rogpeppeaxw: looking at http://paste.ubuntu.com/7252280/, in the first status machines 3 and 4 are up AFAICS.06:56
rogpeppeaxw: and that's the status that i am presuming that ensure-availability was acting on06:57
axwrogpeppe: in the first one, yes, but how do you know when they went down?06:57
axwrogpeppe: my point was it could have changed since you did "juju status"06:58
rogpeppeaxw: there was a very short time between the first status and calling ensure-availability. i don't see any particular reason for it to have gone down in that time period, although of course i can't be absolutely sure06:58
axwright, that's why I asked about the log. I'm really only guessing06:58
rogpeppeaxw: luckily i still have all the machines up, so i can check the log06:59
axwrogpeppe: I see no reason why the agent would have gone down after calling ensure-availability either06:59
axwcool06:59
rogpeppeaxw: it would necessarily go down after calling ensure-availability, because mongo reconfigures itself and agents get thrown out07:00
axwrogpeppe: for *all* machines? not just the shunned ones?07:01
rogpeppeaxw: yeah07:01
rogpeppeaxw: we could really do with some logging in ensure-availability to give us some insight into why it's making the decisions it is07:01
axwyeah, fair enough07:02
=== vladk|offline is now known as vladk
rogpeppeaxw: here's the relevant log: http://paste.ubuntu.com/7252375/07:16
rogpeppeaxw: the relevant EnsureAvailability call is the second one, i think07:17
rogpeppeaxw: it's surprising that the connection goes down so quickly after that call07:17
axwrogpeppe: wrong pastebin?07:17
rogpeppeaxw: ha, yes: http://paste.ubuntu.com/7253848/07:18
axwrogpeppe: machine-3's API workers have dialled to machine-0's API server ...07:38
axwrogpeppe: not saying that's the cause, but it's strange I think07:38
rogpeppeaxw: that's not ideal, but it's understandable07:39
rogpeppeaxw: one change i want to make is to make every environ manager machine dial the API server only on its own machine07:39
axwyep07:39
jamaxw: rogpeppe: right, we originally only wrote "localhost" into the agent.conf. I think the bug is that the connection caching logic is overwriting that ?07:42
rogpeppejam: yeah - each agent watches the api addresses and caches them07:42
jamrogpeppe: I thought when we spec'd the work we were going to explicitly skip overwriting when the agents were "localhost"07:43
rogpeppejam: but also, the first API address received by a new agent is not going to be localhost07:43
jamrogpeppe: well, the thing that monitors it could just do if self.IsMaster() => localhost07:43
rogpeppejam: i don't remember that explicitly07:44
jamor not run the address poller if IsMaster07:44
jamsorry07:44
jamIsManager07:44
jamnot Master07:44
rogpeppejam: i don't think it's IsMaster - i think it's is-environ-manager07:44
rogpeppejam: right07:44
rogpeppejam: i've been thinking about whether to run the address poller if we're an environ manager07:45
rogpeppes/poller/watcher/07:45
rogpeppejam: my general feeling is that it is probably worth it anyway07:45
rogpeppejam: because machines can lose their environment manager status07:46
rogpeppejam: even though we don't fully support that yet07:46
jamrogpeppe: won't they get bounced under that circumstance?07:47
jamanyway, we can either simplify it by what we write in agent.conf, or we could detect that we are IsManager and if so force localhost at api.Open time.07:47
rogpeppejam: they'll get bounced, but if they do we want them to know where the other API hosts are07:48
rogpeppejam: i was thinking of going for your latter option above07:48
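[editor's note: a sketch of "the latter option" — forcing localhost at api.Open time when the agent is an environment manager. All names here are hypothetical; the real change would live wherever API addresses are chosen.]

    package api

    import "fmt"

    // preferLocalhost puts the machine's own API server first when the agent
    // is an environ manager, so managers always dial localhost but still keep
    // the cached peer addresses as a fallback.
    func preferLocalhost(isManager bool, cached []string, apiPort int) []string {
        if !isManager {
            return cached
        }
        local := fmt.Sprintf("localhost:%d", apiPort)
        return append([]string{local}, cached...)
    }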
axwrogpeppe: I can't really see much from the logs, I'm afraid. there is one interesting thing: "dialled mongo successfully" just after FullStatus and before EnsureAvailability07:51
rogpeppeaxw: i couldn't glean much from them either07:51
rogpeppeaxw: i'm just doing a branch that adds some logging to EnsureAvailability07:51
rogpeppeaxw: then i'll try the live tests again to see if i can see what's going on07:52
axwrogpeppe: any idea why agent-state shows up as "down" just after I bootstrap? should FullStatus be forcing a resynchronisation of state?07:55
rogpeppeaxw: i think it's because the presence data hasn't caught up07:55
axwrogpeppe: oh. I wonder if that's it? FullStatus may be reporting wrong agent state in your test too07:55
rogpeppeaxw: we should definitely look into that07:55
rogpeppeaxw: i think that FullStatus probably sees the same agent state that the ensure availability function is seeing07:56
axwrogpeppe: yeah, true07:56
axwrogpeppe: https://codereview.appspot.com/8803004308:09
rogpeppeaxw: nice one! looking.08:09
axwjam: I've reverted your change from last night that eats admin login errors; this CL adds machine-0 to the admin db if it isn't there already08:10
jamaxw: any chance that we could get the port from mongo rather than passing it in?08:11
axwrogpeppe: this is just the bare minimum, will follow up with maybeInitiateMongoServer, etc.08:11
axwjam: can do, but it requires parsing and I thought it may as well get passed in since it's already known to the caller08:11
jamaxw: well we can have mongo start on port "0" and dynamically allocate, rather than our current ugly hack of allocating a port, and then closing it and hoping we don't race.08:12
axwjam: I assume you are referring to the EnsureAdminUserParams.Port field08:12
jamaxw: if it is clumsy to parse, then we can pass it in.08:12
axwoh I see what you mean08:12
axwumm. dunno. I will take a look08:13
jamwe *can* just start on port 37017, but that means other goroutines will also think that mongo is up, and for noauth stuff, we really want as little as possible to connect to it.08:13
jamaxw: I always get thrown off by "upstart.NewService" because *no we don't want to create a new upstart service*08:14
jambut that is just "create a new memory representation of an upstart service"08:14
axwjam: heh yeah, it is a misleading name08:15
jamaxw: I'm not sure why upstart specifically throws me off.08:15
jamas I certainly know the pattern.08:15
jamaxw: can "defer cmd.Process.Kill()" do bad things if the process has already died ?08:16
jamaxw: is it possible to do EnsureAdminUser as an upgrade step rather than doing it on all boots?08:16
axwjam: if the pid got reused very quickly, yes I think so08:17
jamaxw: I'm not particularly worried about PID reuse that fast08:17
axwjam: not really feasible as an upgrade step, as they require an API connection08:17
jamI'm more wondering about a panic because the PID didn't exist08:17
axwthen there's all sorts of horrible interactions with workers dying and restarting all the others, etc.08:17
axwjam: I'm pretty certain it's safe, but I'll double check08:19
wallyworldjam: hi, any update on the precise tests failures?08:19
axwjam: late Kill does not cause a panic08:22
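[editor's note: a quick standalone check of the behaviour axw confirms — Kill on an already-finished process returns an error rather than panicking (the PID-reuse caveat aside).]

    package main

    import (
        "fmt"
        "os/exec"
    )

    func main() {
        cmd := exec.Command("true")
        if err := cmd.Start(); err != nil {
            panic(err)
        }
        cmd.Wait() // let the process exit
        err := cmd.Process.Kill()
        fmt.Println("late kill:", err) // "os: process already finished", no panic
    }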
jamwallyworld: they pass with mongo 2.4.6 from cloud-archive:tools, they fail with 2.2.4 from ppa:juju/stable08:23
jamon all machines that matter we use cloud-archive:tools08:23
jamwallyworld: so CI should be using that one08:23
wallyworldgreat, so we can look to release 1.1908:23
jamwallyworld: and axw has a patch that replaces my change anyway.08:23
jamwallyworld: the replicaset failure isn't one that I could reproduce...08:23
jamsince it is flaky08:24
wallyworldhmmm. i hate those08:24
wallyworldCI could reproduce it08:24
jamwallyworld: it is *possible* we just need to wait longer, but I hate those as well :)08:24
axwjam: this is what happens if you try to use "--port 0" in mongod: http://paste.ubuntu.com/7254007/08:24
jamaxw: bleh.... ok08:25
jamI don't think we want to use the "default mongo port of 27017" so we might as well use our own since we know we just stopped the machine08:25
jamstopped the service08:25
rogpeppeaxw: reviewed08:33
axwthanks08:34
rogpeppejam: using info.StatePort seems right to me (at least in production).08:37
jamrogpeppe: for "bring this up in single user mode so we can poke at secrets and then restart it" I'd prefer it was more hidden than that, but I can live with StatePort being good-enough.08:38
rogpeppejam: if there's someone sitting on localhost waiting for the fraction of a second during which we add the admin user, i think the user is probably not going to be happy anyway08:39
rogpeppejam: note that the vulnerability is *only* to processes running on the local machine08:40
rogpeppejam: and if there are untrusted processes running on the bootstrap machine, they're in trouble anyway08:40
jamrogpeppe: I'm actually more worried about the other goroutines in the existing process waking up, connecting, thinking to do work, and then getting shut down again.08:41
jamrogpeppe: more from a cleanliness than a "omg we broke security" perspective08:42
rogpeppejam: what goroutines would those be?08:42
jamrogpeppe: so this is more about "lets not force ourselves to think critically about everything we are doing and be extra careful that we never run something we thought we weren't". Vs "just don't expose something we don't want exposed so we can trust nothing can be connected to it."08:43
rogpeppejam: AFAIK there are only two goroutines that connect to the state - the StateWorker (which we're in, and which hasn't started anything yet) and the upgrader (which requires an API connection, which we can't have yet because the StateWorker hasn't come up yet).08:44
rogpeppejam: even if we *are* allowed to connect to the mongo, i don't think we can do anything nasty accidentally08:45
rogpeppejam: well, i suppose we could if we were malicious08:46
axwrogpeppe: I tested by upgrading from 1.18.1. that's good enough right?08:47
rogpeppeaxw: i think so, yeah08:47
waiganiwa09:26
waiganiwallyworld: branch is up: https://codereview.appspot.com/87130045 :)09:26
waiganilbox didn't update the description on codereview, but did on lp??09:27
waiganianyway, bedtime for me.09:27
waiganinight all09:27
natefinchmorning all09:44
jammorning natefinch09:47
rogpeppeaxw: https://codereview.appspot.com/8808004309:48
rogpeppeaxw: a bit of a refactoring of EnsureAvailability - hope you approve09:48
wallyworldwallyworld: ok09:48
axwrogpeppe: cooking dinner, will take a look a bit later09:49
rogpeppejam, natefinch, mgz: review of above would be appreciated09:52
natefinchrogpeppe: sure09:54
rogpeppenatefinch: have you pushed your latest revision of 041-moremongo ?10:00
rogpeppenatefinch: (i want to merge it with trunk, but i don't want us to duplicate that work, as wallyworld's recent changes produce fairly nasty conflicts)10:01
wallyworldrogpeppe: if you can fix local provider, feel free to revert my work10:01
wallyworldi only landed it to get 1.19 out the door10:01
wallyworldand local provider + HA (mongo replicasets) = fail :-(10:02
rogpeppewallyworld: it seemed to work ok for me actually10:02
wallyworldnot for me or CI sadly10:02
natefinchrogpeppe: it's pushed now10:03
rogpeppewallyworld: how did it fail?10:03
wallyworldCI has been broken for days10:03
wallyworldmongo didn't start10:03
rogpeppepwd10:03
wallyworldhence machine agent didn't come up10:03
rogpeppewallyworld: what was the error from mongo?10:03
jamespagesinzui, I think I just got an ack to use 1.16.6 via SRU to support the MRE for juju-core10:04
wallyworldum, can't recall exactly, it will be in the CI logs10:04
jamespagesinzui, I'll push forwards getting it into proposed this week10:04
wallyworldmy local dir is now blown away10:04
rogpeppewallyworld: np, just interested10:04
wallyworldsorry, i should have taken better notes10:05
wallyworldrogpeppe: i think that there wasn't much in the mongo logs from memory, tim had to enable extra logging10:05
wallyworldhe was debugging why stuff fails on precise10:05
wallyworldbut we know now that's due to 2.4.6 vs 2.4.910:06
natefinchrogpeppe: are there tests for that EnsureAvailability code?10:11
rogpeppenatefinch: yes10:11
natefinchrogpeppe:  cool10:11
rogpeppenatefinch: the semantics are unaffected, so the tests remain the same10:11
natefinchrogpeppe:  awesome, that's what I figured.10:12
axwrogpeppe: reviewed. thanks, it's a little clearer now10:19
rogpeppeaxw: thanks a lot10:22
jamwallyworld: rogpeppe: The error I saw in CI was when Initiate went to do a replicaSet operation, it would get an Explicitly Closed message.10:29
jamNote, though, that CI has been testing with mongo 2.2.4 for quite some time.10:29
jam(and still is today, AFAIK, though I'm trying to push to get them to upgrade)10:30
rogpeppejam: interestin10:30
rogpeppeg10:30
jamrogpeppe: https://bugs.launchpad.net/juju-core/+bug/130621210:30
_mup_Bug #1306212: juju bootstrap fails with local provider <bootstrap> <ci> <local-provider> <regression> <juju-core:In Progress by jameinel> <https://launchpad.net/bugs/1306212>10:30
wallyworldyes, i do recall that was one of the errors10:30
jam2014-04-10 04:57:43 INFO juju.replicaset replicaset.go:36 Initiating replicaset with config replicaset.Config{Name:"juju", Version:1, Members:[]replicaset.Member{replicaset.Member{Id:1, Address:"10.0.3.1:37019", Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil), Priority:(*float64)(nil), Tags:map[string]string(nil), SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)}}} 2014-04-10 04:58:18 ERROR juju.cmd supercommand.go:299 cannot initiat10:30
jamrogpeppe: natefinch: I wrote this patch https://code.launchpad.net/~jameinel/juju-core/log-mongo-version/+merge/215656 to help us debug that sort of thing if anyone wants to review it10:31
wallyworldalthough i'm running 2.4.9 locally and still has issues10:31
wallyworldhad10:31
jamwallyworld: interesting, as neither myself nor tim were able to reproduce it10:31
jamand I tried 2.4.9 on Trusty and 2.4.6 on Precise10:31
jamlocal bootstrap always just worked10:32
rogpeppenatefinch: i've merged trunk now - you can pull from lp:~rogpeppe/juju-core/natefinch-041-moremongo10:32
wallyworldall i know is that it didn't work, and then i disabled --replSet from the upstart script and it worked10:32
jamthough... hmmm. I did run into godeps issues once, so it is possible juju bootstrap wasn't actually the trunk I thought it was.10:32
wallyworldand that also then fixed CI10:32
natefinchjam: I think I've seen the explicitly closed bug once or twice.10:33
jamnatefinch: CI has apparently been seeing it reliably for 4+ days10:33
jamwallyworld: CI passed local-deploy in r 2628 http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/local-deploy/10:34
jamand now even with axw's 2629 patch10:34
natefinchjam: google brings up a 2012 convo with gustavo about it where the culprit seemed to be load on mongo, but not definitively.  We should mention it to him10:35
jamnatefinch: given this is during bootstrap, there should be 0 load on mongo10:35
wallyworldjam: 2628 was my patch to make it work10:35
jamwallyworld: certainly, just mentioning CI saw it and was happy again10:36
wallyworlddon't worry, i was watching it :-)10:36
jamnatefinch: so I just checked the previous 6 attempts, and all of them failed with replica set: Closed explicitly.10:36
natefinchrogpeppe: thanks10:37
jamnatefinch: note that 2.2.4 failed other tests using TestInitiate10:37
natefinchjam: the important part was: we should talk to Gustavo10:37
natefinch(where we probably means me :)10:37
jamwith being unable to handle admin logins.10:37
natefinchjam: interesting10:37
natefinchjam: any chance we can abandon 2.2.4?10:38
* natefinch loves dropping support for things10:38
jamnatefinch: hopefully. It shouldn't be used in the field. It is only in our ppa, which means only Quantal gets it.10:38
jamhttp://docs.mongodb.org/v2.4/reference/method/db.addUser/10:38
jamsays it was changed in 2.410:38
jamand is superseded in 2.6 by createUser10:39
jamnatefinch: and the mgo docs say we should be using UpsertUser: http://godoc.org/labix.org/v2/mgo#Database.UpsertUser10:39
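[editor's note: a minimal example of the UpsertUser call jam links to, against the labix.org/v2/mgo API; the address, credentials and role are placeholders.]

    package main

    import "labix.org/v2/mgo"

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            panic(err)
        }
        defer session.Close()
        // UpsertUser creates the user, or updates it if it already exists,
        // sidestepping addUser's behaviour changes across 2.2/2.4/2.6.
        admin := session.DB("admin")
        if err := admin.UpsertUser(&mgo.User{
            Username: "machine-0",
            Password: "example-password",
            Roles:    []mgo.Role{mgo.RoleDBAdmin},
        }); err != nil {
            panic(err)
        }
    }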
mgzwe drop quantal support... tomorrow10:40
jamnatefinch: seems like mongo's security model is unstable over 2.2/2.4/2.6 which doesn't bode very well for us managing compatibility10:40
mgzno, end of the week10:40
jammgz: well it wouldn't be hard to just put 2.4.6 into ppa:juju/stable10:40
jamregardless of Q10:40
natefinchjam: that seems wise10:41
jammgz: that would also "fix" CI, because they seem to install it from the PPA as well10:41
mgzwell, we'd have upgrade questions like that10:41
mgzbut yeah10:41
jamjamespage: ^^ is it possible to get 2.4.6 into ppa:juju/stable ?10:41
natefinchrogpeppe: if you want to work on that moremongo branch, I can try to get that localhost stateinfo branch in a testable state.10:43
rogpeppenatefinch: ok10:43
rogpeppenatefinch: what more needs to be done in the moremongo branch?10:43
jamespagejam: context?10:43
jamjamespage: CI and Quantal users will install MongoDB from ppa:juju/stable, but it is currently 2.2.4 which is "really old" now.10:44
jamSo if we could just grab the one in cloud-archive:tools (2.4.6) it would make our lives more consistent.10:44
jamI believe that is the version in Saucy, and Trusty has 2.4.910:44
jamespagejam: I've pushed a no-change backport of 2.4.6 for 12.04 and12.10 into https://launchpad.net/~james-page/+archive/juju-stable-testing10:54
jamespagejust to see if it works10:54
jamespageI have a suspicion that its not a no-change backport10:54
jamjamespage: we only really need it for P for the CI guys10:54
jamsince Q is going EOL10:55
jamjamespage: we can potentially just point them at cloud-archive:tools if it is a problem11:06
jamespagejam: that might be better11:28
jamespagejam: that way they will get the best mongodb that has been released with ubuntu11:28
jamjamespage: well, we need them to be testing against the version that we'll be installing more than "just the best", but given that we install from there ourselves it seems to fit.11:30
jamThere is a question about U11:30
jamgiven that it won't be in cloud-tools11:30
jamso we may have to do another PPA trick11:30
jamnatefinch: the recent failure of rogpeppe's branch in TestAddRemoveSet is interesting. It seems to be spinning on: attempting Set got error: replSetReconfig command must be sent to the current replica set primary.11:32
jamcontext: https://code.launchpad.net/~rogpeppe/juju-core/548-destroy-environment-fix/+merge/21569711:33
jamespagejam: what about U?11:34
jamjamespage: in the version after Trusty, how do we install the "best for U". For Q we had to use the ppa:juju/stable because P was the only thing in cloud-archive:tools11:35
jamwhich then got out of date11:35
jamWe didn't have to for S because the "best" was the same thing as in cloud-archive:tools11:35
jamespagejam: the best for U will be in U11:35
jamjamespage: well, it wasn't in Q11:36
jamand when V comes out, it may no longer be the best for U, right?11:36
jamespagejam: that's probably because 18 months ago this was all foobar11:36
jamespagego juju did not exist in any meaningful way11:36
jamjamespage: sure. I can just see that 2.6 is released upstream, and we may encounter another "when do we get 2.6 in Ubuntu" where the threshold is at an inconvenient point11:37
jamespagejam: you must maintain 2.4 compat as that's whats in 14.0411:37
rogpeppejam, natefinch, mgz: how about this? http://paste.ubuntu.com/7254781/11:42
mgzrogpeppe: seems reasonable11:43
mgzI prefer interpreted values to raw dumps of fields in status11:43
mgzas it's the funny mix between for-machines markup and textual output for the user11:44
natefinchrogpeppe: when does a machine get into n, n?11:44
rogpeppenatefinch: when it's deactivated by the peergrouper worker11:45
natefinchbut why would that happen?11:45
rogpeppenatefinch: ok, here's an example:11:45
rogpeppenatefinch: we have an active server (wantvote, hasvote)11:46
rogpeppenatefinch: it dies11:46
rogpeppenatefinch: we run ensure-availability11:46
rogpeppenatefinch: which sees that the machine is inactive, and marks it as !wantsvote11:46
rogpeppenatefinch: the peergrouper worker sees that the machine no longer wants the vote, and removes its vote11:47
rogpeppenatefinch: and sets its hasvote status to n11:47
rogpeppenatefinch: so our machine now has status (!wantsvote, !hasvote)11:47
rogpeppenatefinch: if we then run ensureavailability again, that machine is now a candidate for having its environ-manager status removed11:48
rogpeppenatefinch: alternatively, the machine might come back up again11:48
natefinchI see, so hasvote is actual replicaset status, and wants vote is what we want the replicaset status to be11:48
rogpeppenatefinch: yes11:48
natefinchsorry gotta run, forgot it's tuesday11:49
jammgz: so the branch up for review (which is approved) actually has the errors as a prereq11:49
mgzjam: yeah, I was sure there was something like that11:50
rogpeppenatefinch: i've dealt with a bunch more conflicts merging trunk and pushed the result: ~rogpeppe/juju-core/natefinch-041-moremongo11:54
rogpeppenatefinch: ping12:20
=== wesleymason is now known as wes_
=== wes_ is now known as wesleymason
* rogpeppe goes for lunch12:32
natefinchrogpeppe: sorry, just got back12:42
axwrogpeppe natefinch: I'll continue looking at HA upgrade - upstart rewriting and MaybeInitiateMongoServer in the machine agent. Let me know if there's anything else I should look at12:54
natefinchaxw: that seems like a good thing to do for now.  the rewriting should work as-is, once we remove the line that bypasses it12:55
axwnatefinch: it doesn't quite, because the replset needs to be initiated too12:56
axwnatefinch: and that's slightly complicated because that requires the internal addresses from the environment12:56
natefinchaxw: you should be able to get the addresses off the instance and pass it into SelectPeerAddress, and get the right one.  That's what jujud/bootstrap.go does.  Should work in the agent, too, I'd think12:58
axwnatefinch: yep, the only problem is getting the Environ. the bootstrap agent gets a complete environ config handed to it; the machine agent needs to go to state12:59
axwnatefinch: anyway, I will continue on with that. if you think of something else I can look at next, maybe just send me an email13:00
natefinchaxw: will do, and thanks13:00
axwnps13:00
rogpeppenatefinch: that's ok13:29
rogpeppenatefinch: how's localstateinfo coming along?13:30
rogpeppemgz, jam, natefinch: trivial (two line) code review anyone? fixes a sporadic test failure. https://codereview.appspot.com/8813004413:35
natefinchrogpeppe: haven't gotten far this morning.  My wife should be back any minute to take the baby off my hands, which will make things go faster13:35
rogpeppenatefinch: k13:35
jamrogpeppe: shouldn't there be an associated test sort of change ?13:36
natefinchrogpeppe: how does that change fix the test failure?13:36
rogpeppejam: the reason for the fix is a test failure13:37
rogpeppejam: i can add another test, i guess13:37
natefinchideally a test that otherwise always fails :)13:37
jamrogpeppe: so this is that sometimes during teardown we would hit this and then not restart because it was the wrong type ?13:37
rogpeppejam: the test failure was this: http://paste.ubuntu.com/7255340/13:38
rogpeppejam: i'm actually not quite sure why it is sporadic13:39
natefinchI see, we always expect it to be errterminateagent, but we were masking that along with other failures13:40
rogpeppenatefinch: yes13:40
natefinchrogpeppe: how does the defer interact with locally scoped err variables inside if statements etc?13:41
natefinchmaybe that's the problem?  It's modifying the outside err, but we're returning a different one13:41
rogpeppenatefinch: the return value is assigned to before returning13:41
natefinchahgh right13:41
rogpeppenatefinch: from http://golang.org/ref/spec#Return_statements: "A "return" statement that specifies results sets the result parameters before any deferred functions are executed."13:42
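[editor's note: the spec behaviour rogpeppe quotes, in miniature — the deferred func runs after the result parameter is set, so it sees and can rewrite the returned error. This is exactly how a blanket "add context" defer can mask a sentinel error like ErrTerminateAgent.]

    package main

    import (
        "errors"
        "fmt"
    )

    func do() (err error) {
        defer func() {
            if err != nil {
                err = fmt.Errorf("some context: %v", err) // replaces the original value
            }
        }()
        return errors.New("terminate agent")
    }

    func main() {
        fmt.Println(do()) // some context: terminate agent
    }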
jamrogpeppe: so it looks like you only run into it if you get ErrTerminate before init actually finishes13:44
rogpeppejam: i'm not sure why the unit isn't always dead for this test on entry to Uniter.init13:46
rogpeppejam: tbh i don't want to take up the rest of my afternoon grokking the uniter tests - i'll leave this alone until i have some time.13:46
rogpeppejam: (i agree that it indicates a lacking test in this area)13:47
jamrogpeppe: so LGTM for the change, though it does raise the question that if we wrapped errors without dropping context it might have worked as  well :)13:52
rogpeppejam: yeah, i know13:52
rogpeppejam: but i'd much prefer it if we have a wrap function that explicitly declares the errors that can pass back13:53
rogpeppejam: then we can actually see what classifiable errors the function might be returning13:53
rogpeppejam: there are 9 possible returned error branches in that function - it's much easier to modify the function if you know which of those might be relied on for specific errors13:56
natefinchrogpeppe: that would be pretty useful in a defer statement, since it would then be right next to the function definition, as well.13:56
rogpeppenatefinch: perhaps13:56
rogpeppenatefinch: tbh i'm not keen on ErrorContextf in general13:56
rogpeppenatefinch: it just adds context that the caller already knows13:57
natefinchrogpeppe: yes, I wouldn't have it change the message, just filter the types.  I don't want to have to troll through the code in a function to figure out what errors it can return13:57
rogpeppenatefinch: the doc comment should state what errors it can return13:58
rogpeppenatefinch: and i'd put a return errgo.Wrap(err) on each error return13:59
rogpeppenatefinch: (errgo.Wrap(err, errgo.Is(worker.ErrTerminateAgent) for paths where we care about the specific error)14:00
rogpeppenatefinch: i know what you mean about having the filter near the top of the function though14:01
natefinchrogpeppe: btw for want/hasvote, what about : non-member, pending-removal, pending-add, member?  I feel like inactive and active sound too ephemeral, like it could change at any minute, when in fact, it's likely to be a very stable state.   But maybe I'm over thinking it.14:17
jamnatefinch: fwiw I like your terms better14:18
rogpeppenatefinch: those terms aren't actually right, unfortunately.14:18
rogpeppenatefinch: there's no way of currently telling if a machine's mongo is a member of the replica set14:19
rogpeppenatefinch: even if a machine has WantVote=false, HasVote=false, it may still be a member14:20
rogpeppenatefinch: basically, every state server machine will be a member unless it's down14:21
rogpeppenatefinch: how about "activated" and "deactivated" instead of "active" and "inactive" ?14:22
natefinchrogpeppe: isn't the intended purpose of those with y/y that they're in the replicaset?  I guess if it doesn't reflect the replicaset, what does it reflect?14:24
rogpeppenatefinch: the intended purpose of those with y/y is that they are *voting* members of the replica set14:24
natefinchI see14:25
rogpeppenatefinch: we can have any number of non-voting members14:25
rogpeppenatefinch: (and that's important)14:25
natefinchmember-status: non-voting, pending-unvote, pending-vote, voting?    I know unvote is not a word, but pending-non-voting is too long and confusing.14:29
sinzuijamespage, I sent a reply about 1.16.4.14:31
jamespagesinzui, so the backup/restore bits are not actually in the 1.16 branch?14:32
sinzuijamespage, no backup14:33
jamespagehmm14:33
natefinchrogpeppe: check my last msg14:33
sinzuirestore aka update-bootstrap worked for customers who had the bash script14:33
sinzuijamespage, by not installing juju-update-bootstrap, I think we can show that no new code was introduced to the system14:35
rogpeppenatefinch: "not voting", "adding vote", "removing vote", "voting" ?14:36
perrito666jamespage: sinzui I assigned myself https://bugs.launchpad.net/juju-core/+bug/1305780?comments=all just fyi14:36
_mup_Bug #1305780: juju-backup command fails against trusty bootstrap node <backup-restore> <juju-core:Triaged by hduran-8> <https://launchpad.net/bugs/1305780>14:36
natefinchrogpeppe: sure, that's good14:36
rogpeppenatefinch: although i'm not entirely happy with the vote/voting difference14:37
rogpeppenatefinch: how about: "no vote", "adding vote", "removing vote", "has vote" ?14:37
sinzuijam, wallyworld: the precise unit tests are now running with mongo from ctools. I have a set of failures. They are different from before. CI is automatically retesting. I am hopeful14:37
natefinchrogpeppe: "voting, pending removal" "not voting, pending add"?  That makes it a little more clear that even though the machine is not going to have the vote in a little bit, it actually still does right now14:38
natefinch(and vice versa)14:38
rogpeppenatefinch: i think that it's reasonable to assume that if something says "removing x" that x is currently there to be removed14:39
rogpeppenatefinch: likewise for adding14:39
natefinchrogpeppe: fair enough14:39
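[editor's note: the labels the two settle on, as a tiny mapping; the field names are made up for the sketch — WantsVote is the intended state, HasVote what the peergrouper has actually applied.]

    package main

    import "fmt"

    func voteStatus(wantsVote, hasVote bool) string {
        switch {
        case wantsVote && hasVote:
            return "has vote"
        case wantsVote:
            return "adding vote" // vote requested, peergrouper yet to grant it
        case hasVote:
            return "removing vote" // vote being withdrawn by the peergrouper
        default:
            return "no vote"
        }
    }

    func main() {
        fmt.Println(voteStatus(true, false)) // adding vote
    }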
rogpeppeha, i've just discovered that if you call any of the Patch functions in SetUpSuite, the teardown functions never get called.14:43
mgzrogpeppe: heh, yeah, another reason teardown is generally dangerous14:44
hazmatjam, thanks for the scale testing reports14:44
rogpeppemgz: i think we should change CleanUpSuite so it just works if you do a suite-level patch14:44
natefinchrogpeppe: whoa, really?  I assumed they'd do the right thing14:44
rogpeppenatefinch: uh uh14:44
natefinchrogpeppe, mgz: yes, definitely.   Totally unintuitive otherwise14:45
rogpeppei won't do it right now, but i'll raise a bug14:45
rogpeppewhere are we supposed to raise bugs for github.com/juju/testing ?14:46
rogpeppeon the github page, or on juju-core?14:46
natefinchlast I heard we were keeping bugs on launchpad14:49
natefinch(not my idea)14:51
rogpeppenatefinch: done. https://bugs.launchpad.net/juju-core/+bug/130810114:52
_mup_Bug #1308101: juju/testing: suite-level Patch never gets restored <juju-core:New> <https://launchpad.net/bugs/1308101>14:52
natefinchrogpeppe: I very well may have made that mistake myself recently.14:53
rogpeppenatefinch: that was what caused me to investigate14:53
rogpeppenatefinch: i knew it was an error, but i thought it would get torn down at the end of the first test14:53
rogpeppenatefinch: i wondered how it was working at all14:54
natefinchrogpeppe: yeah, of the two likely behaviors, never getting torn down is definitely the worse of the two14:54
natefinchrogpeppe: but also the one least likely to be obvious14:54
rogpeppenatefinch: yup14:55
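[editor's note: the pitfall behind the bug rogpeppe goes on to file, sketched as a fragment; the suite, variable and value are hypothetical. At the time, a PatchValue made in SetUpSuite was never restored because the suite-level cleanup stack was not torn down — patching per-test in SetUpTest is the safe form.]

    type MySuite struct {
        testing.CleanupSuite // github.com/juju/testing of this era
    }

    func (s *MySuite) SetUpSuite(c *gc.C) {
        s.CleanupSuite.SetUpSuite(c)
        s.PatchValue(&somePkg.SomeVar, fakeValue) // BAD: never restored
    }

    func (s *MySuite) SetUpTest(c *gc.C) {
        s.CleanupSuite.SetUpTest(c)
        s.PatchValue(&somePkg.SomeVar, fakeValue) // OK: restored in TearDownTest
    }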
alexisbjam, sinzui any news on the bootstrap issue? https://bugs.launchpad.net/juju-core/+bug/130621214:55
_mup_Bug #1306212: juju bootstrap fails with local provider <bootstrap> <ci> <local-provider> <regression> <juju-core:In Progress by jameinel> <https://launchpad.net/bugs/1306212>14:55
sinzuialexisb, thumper and wallyworld landed a hack to remove HA from local to make tests pass...14:56
sinzuialexisb, I think devs hope to fix the real bug...14:56
natefinchalexisb, sinzui:  jam looked into it some, and it may be due to an old version of mongo (2.2.x) that we don't really need to support anyway... cloud archive has 2.4.6 I believe, which may solve the problem14:56
sinzuinatefinch, already updated the test14:56
alexisbsinzui, natefinch: can we add those updates to the bug?14:57
sinzuinatefinch, This is the current run with mongo from ctools: http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/run-unit-tests-amd64-precise/633/console14:57
alexisbsinzui, any other critical bugs blocking the 1.19.0 release?14:58
natefinchsinzui: that looks good14:58
sinzuialexisb, I don't think of the HA removal from local as a hack. It is insane to attempt HA for a local env. I may close the bug instead of deferring it to the next release14:59
natefinchsinzui: but that's with wallyworld's hack, right?  We'll need to remove that hack at some point, and it would be good not to use an old version of mongo anyway.14:59
sinzuinatefinch, there is a previous run with failures, but different failures than before. CI chose to retest assuming the usual intermittent failure14:59
alexisbsinzui, understood15:00
natefinchsinzui: the devs discussed it this morning.  Having HA on local may not actually give "HA", but it can be useful to show how juju works with HA, like you can kill a container and watch juju recover, re-ensure and watch a new one spin up, etc15:00
sinzuinatefinch, wallyworld's hack was about not try to do HA setup for local.15:00
natefinchsinzui: it's basically just like the rest of local.... it's not actually *useful* for much other than demos and getting your feet wet.... but it's really useful for that.15:01
sinzuinatefinch, alexisb unit tests are all pass15:01
sinzui\0/15:01
sinzuinatefinch, alexisb azure is ill today and the azure tests failed. I am asking for a retest. the current revision will probably be blessed for release today15:02
alexisbsinzui, awesome!15:02
natefinchsinzui: what version of mongo is that running?15:02
* natefinch doesn't know what ctools means15:03
* sinzui reads the test log15:04
natefinchsinzui: ahh, I see, I missed it somehow, looks like 2.4.615:05
sinzuinatefinch,  1:2.4.6-0ubuntu5~ctools115:05
natefinchsinzui: happy15:05
jamespagesinzui, I'm not averse to introducing a new feature - the plugins are well isolated but afaict it's not complete in the codebase15:06
jamespagesinzui, if that is the case then I agree not shipping the update-bootstrap plugin does make sense15:06
jamespageotherwise afaict I have no real way of providing backup/restore to 1.16 users right?15:06
sinzuijamespage, They aren't complete since we know that a script is needed to get tar and mongodump to do the right thing15:07
jamespagesinzui, OK - I'll drop it then15:08
natefinchsinzui: is that the version of mongo we were running before wallyworld's hack?  I'd like to know if the version of mongo is the deciding factor15:11
sinzuinatefinch, for the unittests, that version of mongo is the fix15:11
sinzuinatefinch, for the local deploy, wallyworld's hack was the fix15:11
sinzuiand jam's fix for upgrades fixed all upgrades15:12
natefinchok15:12
natefinchoh right, it was the version of mongo for upgrades that was changing how we add/remove users.15:13
sinzuinatefinch, CI has mongo from ctools though. all I did for the test harness was ensure that precise hosts add the same PPA as CI itself15:18
natefinchsinzui: ok, I thought someone had mentioned this morning that CI adds the juju/stable PPA for mongo, but I may have misunderstood or they may have been wrong15:21
sinzuinatefinch, when I added the juju stable ppa, I ensured the ctools archive was added and then manually installed it before we run make install-dependencies15:22
natefinchsinzui: I believe you know what your tools are running better than some random juju dev :)15:23
natefinch(and my memory thereof)15:24
sinzuiAzure is very ill.15:26
sinzuiThe best I can do is manually retest hoping that I catch azure whe it is better15:26
natefinchpoor azure15:28
rogpeppenatefinch: ping15:31
natefinchrogpeppe: yo15:32
rogpeppenatefinch: i've just pushed lp:~rogpeppe/juju-core/natefinch-041-moremongo/15:32
rogpeppenatefinch: all tests pass15:32
rogpeppenatefinch: could you pull it and re-propose -wip, please?15:32
natefinchrogpeppe: sure15:32
rogpeppenatefinch: i guess i could make a new CL, but it seems nicer to use the current one15:33
natefinchrogpeppe: yeah15:33
natefinchrogpeppe: wiping15:34
natefinchrogpeppe: that should be wip-ing15:34
natefinchrogpeppe: done15:35
rogpeppenatefinch: ta15:35
rogpeppenatefinch: i've pushed again. could you pull and then do a proper propose, please?15:52
rogpeppenatefinch: and then i think we can land it15:53
natefinchrogpeppe: sure15:53
natefinchrogpeppe: one sec, running tests on the other branch15:53
jam1natefinch: sinzui: so I don't know if changing mongo would have made CI happy without wallyworld's hack. wallyworld was the only one who has reproduced the replicaset Closed failure, and he did so on trusty running 2.4.9, so it seems like it *could* still be necessary.15:56
jam1natefinch: the concern is that the code is actually not different in Local, so if it is failing there it *should* be failing elsewhere15:56
jam1and maybe we just aren't seeing it yet15:56
natefinchjam1: yep, also a good reason not to have local be different15:56
jam1natefinch: I believe my stance on local HA, it doesn't provide actual HA, it is good for demos, I like common codebase, but I'm willing to slip it if we have to.15:57
sinzuijam1 I think you are forgetting the unit test failed days before upgrade failed and days before local deploy failed.15:57
natefinchjam1: yeah.  perfect is the enemy of good15:57
mgzso, I fixed the test suite hang... but still don't understand why it actually did that.15:57
sinzuiI went to sleep with unit tests and azure failing (the latter is azure, not code)15:58
rogpeppei cannot get this darn branch to pass tests in the bot: https://code.launchpad.net/~rogpeppe/juju-core/548-destroy-environment-fix/+merge/21569715:58
rogpeppeit's failed on replicaset.TestAddRemoveSet three times in a row now, and the changes it makes cannot have anything to do with that15:58
rogpeppelet's try just one more time15:58
sinzuijam, I changed the db used in http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/run-unit-tests-amd64-precise/ #632 I got a pass in #633. The errors were different between dbs15:59
jam1sinzui: so TestInitialize failing was the mongo version problem. The fact that adding --replicaSet to the mongo startup line caused local to always fail with Closed is surprising, but might be a race/load issue that local triggers that others don't.15:59
jam1natefinch: a thought, we know it takes a little while for replicaset to recover15:59
jam1IIRC, mongo defaults to having a 15s timeout15:59
jam1natefinch: is it possible that Load pushes local over the 15s timeout?16:00
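If jam1's recalled 15-second window is right, an election that outlasts it under load would surface to clients as closed sockets. A sketch of where the relevant knobs sit in the mgo driver; the helper name and durations are made up for illustration, not juju's real dial path:

    import (
        "time"

        "gopkg.in/mgo.v2"
    )

    // dialPatiently is a hypothetical helper showing mgo's timeout knobs.
    func dialPatiently(addr string) (*mgo.Session, error) {
        session, err := mgo.DialWithTimeout(addr, 15*time.Second)
        if err != nil {
            return nil, err
        }
        // SetSyncTimeout bounds how long operations wait for a reachable
        // primary; under heavy local load a longer window may be needed.
        session.SetSyncTimeout(30 * time.Second)
        session.SetSocketTimeout(30 * time.Second)
        return session, nil
    }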
rogpeppenatefinch: could you push your 043-localstateinfo branch please, so I can try to get the final branch ready for proposal?16:01
sinzuijam1 I changed the db for unit tests because it didn't match the db used by local/CI; the one from ctools is the only one I will trust now for precise16:01
jam1sinzui: right, the only place we actually use juju:ppa/stable is for Quantal and Raring16:02
natefinchjam: yes, possible. Mongo can be sporadically really slow16:02
jam1I forgot about R16:02
natefinchrogpeppe: reproposed moremongo16:02
jam1but I don't think 2.4 landed until Saucy16:02
rogpeppenatefinch: thanks16:02
rogpeppenatefinch: i'll approve it16:02
sinzuijam1, the makefile disagrees with your statement16:03
mgzR is no longer supported16:03
sinzuijam1 the makefile doesn't know about ctools16:03
jam1sinzui: so what I mean is, when you run "juju bootstrap" and Q or R is the target, we add the juju ppa16:03
jam1sinzui: ctools doesn't have Q and R builds16:03
jam1only P16:03
natefinchrogpeppe:  pushed16:03
rogpeppenatefinch: ta16:03
jam1sinzui: but as mgz points out, Q is almost dead, and R is dead, so we can punt16:04
rogpeppenatefinch: how's it going, BTW?16:04
sinzuijam1, good, because I don't test with r (obsolete) and q (obsolete in 3 days)16:04
jam1sinzui: otherwise the better fix is to get 2.4.6 into the ppa16:04
sinzuijam1, +116:04
sinzuijam1, I was not aware the versions were different until this morning16:04
jam1sinzui: I wasn't that aware either16:05
jam1sinzui: I just saw the failing and davecheney pointed out CI was using an "old" version, which I tracked down16:05
sinzuiI saw it too, but I haven't gotten enough sleep to see how the old version was selected16:06
jam1sinzui: it is nice to see so much blue on http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/16:06
sinzuijam1. I see CI has started the next revision while I was trying to get lucky with azure.16:07
natefinchrogpeppe: I think most tests pass on 043-localstateinfo now.  saw some failures from state, but they looked sporadic, haven't checked them out yet16:07
rogpeppenatefinch: cool. "most" ?16:08
mgzrogpeppe: can I request a re-look at https://codereview.appspot.com/8754004316:08
rogpeppemgz: looking16:08
mgzrogpeppe: I don't like that that fixed the hang... pretty sure it means the test is relying on actually dialing 0.1.2.3 and that failing16:09
natefinchrogpeppe: I didn't let the worker tests finish because I was impatient and they were taking forever, so possible there are failures there too16:09
natefinchrogpeppe:  state just passed for me16:09
rogpeppenatefinch: cool16:09
sinzuijam1, everyone. This is a first: lp:juju-core r2630 passed 7 minutes after CI cursed the rev, because I was forcing the retest of azure.16:10
rogpeppemgz: the code in AddStateServerMachine is kinda kooky16:11
sinzuijam1. I will start preparation for the release while the current rev is tested. I will use the new rev if it gets a natural blessing16:11
mgzAddStateServerMachine should probably just be removed16:11
rogpeppemgz: probably.16:11
mgzit's not a very useful or used helper16:11
mgzand its doc comment is wonky16:12
mgzI probably shouldn't have touched it, as stuff passes without poking there16:12
mgzbut it came up in my grep16:12
natefinchrogpeppe:  worker tests pass except for a peergrouper test - workerJujuConnSuite.TestStartStop got cannot get replica set status: cannot get replica set status: not running with --replSet16:13
rogpeppeit should probably be changed to SetStateServerAddresses16:13
rogpeppemgz: as that's the reason it was originally put there16:13
mgzright, something like that16:13
rogpeppemgz: but let's just land it and think about that stuff later16:14
mgzI'll try it on the bot16:14
rogpeppemgz: bot is chewing on it...16:15
natefinchrogpeppe: peergrouper failure was sporadic too, somehow.  All tests pass on that branch.16:19
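The "not running with --replSet" error natefinch hit is mongod's complaint when a replica-set command reaches a server started without a set name. A rough sketch of what a test fixture has to do before the peergrouper can be exercised; the paths, port, and flags here are illustrative, not juju's actual harness:

    import (
        "log"
        "os/exec"
    )

    // startReplSetMongod launches a mongod that accepts replSet* commands
    // (2.4-era flags shown); an illustrative fixture only.
    func startReplSetMongod(dbDir string) *exec.Cmd {
        cmd := exec.Command("mongod",
            "--replSet", "juju", // without a set name, replSet* commands fail as above
            "--dbpath", dbDir,
            "--port", "37017",
            "--smallfiles",
            "--noprealloc",
        )
        if err := cmd.Start(); err != nil {
            log.Fatalf("start mongod: %v", err)
        }
        // The set still needs a one-time replSetInitiate before
        // replSetGetStatus reports any members.
        return cmd
    }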
sinzuialexisb, I will release in a few hours. CI is testing a new rev that I expect to pass without intervention. I can release the previous rev, which passed with extra retests16:25
alexisbsinzui, awesome, thank you very much!16:25
mattywfolks, has anyone seen this error before when trying to deploy a local charm? juju resumer.go:68 worker/resumer: cannot resume transactions: not okForStorage16:26
rogpeppenatefinch: any chance we might get it landed today?16:26
rogpeppenatefinch: i'm needing to stop earlier today16:27
alexisbgo rogpeppe go! :)16:27
rogpeppealexisb: :-)16:27
* rogpeppe has spent at least 50% of today dealing with merge conflicts16:28
natefinchrogpeppe: yes16:31
rogpeppenatefinch: cool16:33
rogpeppenatefinch: BTW the bot is running 041-moremongo tests right now... fingers crossed16:38
rogpeppenatefinch: i've just realised that it would be quite a bit nicer to have a func stateInfoFromServingInfo(info params.StateServingInfo) *state.Info, and just delete agent.Config.StateInfo16:45
natefinchrogpeppe: yeah16:45
rogpeppenatefinch: n'er mind, we'll plough on, i think.16:46
natefinchrogpeppe: that was my thinking :)16:46
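The helper rogpeppe floats never gets written in this exchange, but its shape is easy to sketch. The struct fields below are pared-down stand-ins for params.StateServingInfo and state.Info, not the real juju definitions, and the zero-port guard anticipates the failure mode that turns up later in the integration branch:

    import "fmt"

    // Stand-in types: the real params.StateServingInfo and state.Info
    // carry certificates, API ports, tags, and more.
    type stateServingInfo struct {
        StatePort int
    }

    type stateInfo struct {
        Addrs []string
    }

    // stateInfoFromServingInfo derives mongo dial info from serving info,
    // the step that would let agent.Config.StateInfo be deleted outright.
    func stateInfoFromServingInfo(host string, info stateServingInfo) (*stateInfo, error) {
        if info.StatePort == 0 {
            // A zero port means the serving info was never populated.
            return nil, fmt.Errorf("state serving info has no state port")
        }
        return &stateInfo{
            Addrs: []string{fmt.Sprintf("%s:%d", host, info.StatePort)},
        }, nil
    }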
natefinchawww, I think my old company finally cancelled my MSDN subscription16:47
rogpeppenatefinch: i'm needing to stop in 20 mins or so. any chance of that branch being proposed before then?16:56
natefinchrogpeppe: https://codereview.appspot.com/8820004316:56
rogpeppenatefinch: marvellous :-)16:56
natefinch:)16:57
rogpeppenatefinch: reviewed17:05
natefinchrogpeppe: btw, I had to rename params to attrParams because params is a package that needed to get used in the same function17:06
rogpeppenatefinch: i know17:06
natefinchrogpeppe: oh, I misunderstood the comment, ok17:07
rogpeppenatefinch: i just suggested standard capitalisation17:07
natefinchrogpeppe: yep, cool17:07
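The rename natefinch describes is the standard dodge when a local identifier would shadow an imported package: inside the function the name refers to the variable, and the package becomes unreachable. The same collision shown with a stdlib package, purely for illustration:

    package main

    import (
        "fmt"
        "strings"
    )

    func shout(words []string) string {
        // Were this variable named "strings", the strings package below
        // would be shadowed and the calls would not compile; renaming it
        // (as natefinch did with params -> attrParams) keeps both usable.
        strs := words
        return strings.ToUpper(strings.Join(strs, " "))
    }

    func main() {
        fmt.Println(shout([]string{"hi", "there"}))
    }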
natefinchrogpeppe: why is test set password not correct anymore?  It still does that, I think?17:09
rogpeppenatefinch: oh, i probably missed it17:09
rogpeppenatefinch: you're right, i did17:11
natefinchrogpeppe: cool17:11
rogpeppenatefinch: 41-moremongo is merged...17:26
natefinchrogpeppe: awesome17:27
natefinch43-localstateinfo should be being merged now17:28
natefinchrogpeppe: what's left?17:28
rogpeppenatefinch: it's actually retrying mgz's apiaddresses_use_hostport17:28
rogpeppenatefinch: i'm trying to get tests passing on my final integration branch17:29
natefinchrogpeppe: nice17:29
rogpeppenatefinch: currently failing because of the StateInfo changes (somehow we have a StateServingInfo with a 0 StatePort)17:29
rogpeppenatefinch: would you be able to take it over from me for the rest of the day17:30
rogpeppe?17:30
natefinchrogpeppe: yeah definitely17:30
rogpeppenatefinch: it needs a test that peergrouper is called (i'm already mocking out peergrouper.New)17:30
natefinchrogpeppe: what's the branch name?17:31
rogpeppenatefinch: i haven't pushed it yet, one mo17:32
rogpeppenatefinch: bzr push --remember lp:~rogpeppe/juju-core/540-enable-HA17:33
rogpeppenatefinch: there are some debugging relics in there that need to be removed too17:34
rogpeppenatefinch: in particular, revno 2355 (cmd/jujud: print voting and jobs status of machines) needs to be reverted and proposed separately as discussed in the standup17:34
natefinchrogpeppe: ok17:35
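Backing a single revision out of a bzr branch, as rogpeppe asks for revno 2355, is a reverse merge: roughly bzr merge -r 2355..2354 . followed by a commit, after which the reverted change can be re-proposed on its own branch.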
=== vladk is now known as vladk|offline
rogpeppenatefinch: i'd prioritise the other branches though17:39
rogpeppenatefinch: i have to go now17:39
rogpeppeg'night all17:39
natefinchrogpeppe: g'night17:39
sinzuinatefinch, Do you have a moment to review https://codereview.appspot.com/8817004518:05
natefinchsinzui: done18:06
sinzuithank you natefinch18:06
BradCrittendensinzui: would you have a moment for a google hangout?18:34
=== BradCrittenden is now known as bac
sinzuibac: yes18:35
bacsinzui: cool.  let me set one up and invite you in a couple of minutes18:35
bacsinzui: https://plus.google.com/hangouts/_/canonical.com/daily-standup18:39
=== Ursinha is now known as Ursinha-afk
=== Ursinha-afk is now known as Ursinha
davecheneygood morning worker ants23:09
perrito666davecheney: my window says otherwise23:16
davecheneyperrito666: one of us is wrong23:21
davecheneyi'll roshambo you for it23:21
* perrito666 turns a very strong light on outside and says good morning to davecheney 23:21
davecheneyperrito666: it helps with the jet lag23:22
perrito666davecheney: I traveled under 20km today, I don't have that much jetlag :p23:22
