[00:09] <perrito666> sinzui: news?
[01:21] <perrito666> ok, something is very very broken with aws
[01:45] <waigani> axw: do you mean just setting state-server: true in tests where it is set to false?
[01:45] <axw> waigani: yes, the ones where it matters anyway
[01:45] <axw> leaking details of the dummy provider into the command is not great
[01:46] <axw> AFAIK, the state-server thing is just there to speed up the tests
[01:46] <axw> but if we need it, then the tests should change
[01:46] <waigani> axw: yeah for sure, it's ugly. I assumed the state-server was set to false in tests for a reason
[01:47] <axw> waigani: there's lots of wrong in the code, don't be afraid to question it ;)
[01:47] <waigani> axw: so you're saying the state-server setting should be deprecated?
[01:48] <axw> no, just change it to true for the tests where we run the bootstrap command
[01:48] <axw> alternatively, do the TODO now... not sure which one is less work really
[01:49] <waigani> sigh
[01:49] <waigani> kludges are less work ;)
[01:49] <waigani> I had to look that up
[01:49] <axw> yes, often less work in the short term
[01:49] <waigani> and found this image: http://en.wikipedia.org/wiki/Kludge#mediaviewer/File:Miles_Glacier_Bridge,_damage_and_kludge,_1984.jpg
[01:50] <axw> lol
[01:50] <waigani> sums it up pretty nicely haha
[01:50] <waigani> I should put that on my coding profile ;)
[02:43]  * wwitzel3 flips his desk
[02:48] <axw> wwitzel3: what'cha doing?
[02:49] <wwitzel3> trying to get rsyslog rotation for all-machines.log working
[02:49] <axw> ah, fun times
[02:49] <wwitzel3> since all day
[02:49] <wwitzel3> :/
[02:50] <wwitzel3> I was in bed attempting to go to sleep and I was having weird dreams about rsyslog in human for bullying and taunting me.
[02:50] <wwitzel3> s/for/form
[02:50] <wwitzel3> lol
[02:51] <axw> well, that's not cool...  I don't recall having personified a software package before
[02:51] <wwitzel3> haha, happens to me all the time
[02:52] <axw> did it look like this guy? https://plus.google.com/+RainerGerhards
[02:52] <axw> :)
[02:53] <wwitzel3> though my favorite dream was I was traveling through code and I kept hitting this unhandled exception (which was the same one that was happening in production). And when I woke up I had an instant thought of how to fix it and got it resolved in 5 minutes.
[02:53] <menn0_> axw: bad news... the "State remove setmongopassword" change is what's causing this CI blocker: bug #1355320
[02:53] <axw> god damnit
[02:53] <menn0_> axw: I've been bisecting my way through recent changes and that's the one
[02:53] <axw> thanks menn0_
[02:54] <axw> I'll take it over if you like
[02:54] <menn0_> sure
[02:54] <menn0_> I'll update the ticket with the best repro details I have
[02:54] <menn0_> give me a few minutes to see if I can simplify them a bit further
[02:55] <axw> menn0_: thanks
[03:09] <axw> weird, I'm certain I tested this...
[03:10] <waigani> menn0_: how do you bisect your way through the changes?
[03:11] <menn0_> axw: after issuing ensure-availability did you wait for the new state servers to hit "started"?
[03:12] <axw> menn0_: pretty sure I did, but it was a little while ago and my memory isn't great
[03:12] <menn0_> I'm seeing the new machine agents getting stuck in pending because they can't connect to the API
[03:12] <axw> yeah, I see the same thing now
[03:15] <menn0_> axw: I've updated the ticket and assigned it to you. All yours!
[03:16] <axw> menn0_: thanks
[03:16] <menn0_> waigani: the process I used is:
[03:16] <menn0_> reproduce the problem with master
[03:16] <menn0_> use git log to see the changes between 1.20.1 and master
[03:17] <menn0_> git checkout <some rev in between>
[03:17] <menn0_> go install ./...
[03:17] <menn0_> try to reproduce the problem
[03:17] <menn0_> does it exist?
[03:17] <menn0_> no: the problem revision is after this one
[03:17] <menn0_> yes: the problem revision is before this one (or it IS this one)
[03:17] <menn0_> you can do a straight binary search
[03:18] <menn0_> but if you have some idea of where the problem lies, like in this case, you can be a bit smarter about picking revs to narrow it down more quickly.
[03:19] <menn0_> keeping good notes on what you've done helps a lot too
[03:19] <waigani> menn0_: ah right, thanks for the explanation
[03:20] <menn0_> if the reproduction steps were completely automated (which they can be but I didn't bother) then you could get "git bisect" to do all the work for you.
[03:21] <menn0_> what i've described is exactly what it does
[03:23] <menn0_> actually, looking at the docs "git bisect" can also be used to assist with manual searches. I should have used it.
[03:23] <menn0_> previously I've only ever used "git bisect run" which does automated searching
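The manual search menn0_ describes is a binary search over an ordered list of revisions. A minimal Python sketch, assuming the oldest revision is known good, the newest is known bad, and every revision after the first bad one is also bad. `first_bad_revision` and `is_bad` are hypothetical names; `is_bad` stands in for "check out the revision, build it, and try to reproduce the problem".

```python
from typing import Callable, Sequence


def first_bad_revision(revs: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first revision for which is_bad() is true.

    Assumes revs is ordered oldest to newest, revs[0] is good,
    revs[-1] is bad, and every revision after the first bad one is bad.
    """
    lo, hi = 0, len(revs) - 1  # invariant: revs[lo] is good, revs[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revs[mid]):
            hi = mid  # the problem revision is this one or an earlier one
        else:
            lo = mid  # the problem revision is after this one
    return revs[hi]
```

Each call to `is_bad` is one build-and-reproduce cycle, so the number of cycles grows with the logarithm of the number of revisions rather than linearly.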
[03:24] <waigani> menn0_: yeah I was thinking that should be able to be automated
[03:26] <menn0_> waigani: the fiddly part in this case is writing some code to wait at the right times and parse the status output
[03:27] <menn0_> waigani: all doable though and I'm sure the helper functionality in juju-ci-tools would help too.
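The "wait at the right times" part menn0_ mentions could be sketched as a small polling helper. This is only an illustration: `wait_for` and `all_started` are hypothetical names, and the sketch assumes the status output has already been parsed into a plain `{machine id: agent state}` dict (the parsing itself is not shown, and nothing here comes from juju-ci-tools).

```python
import time
from typing import Callable, Dict


def wait_for(check: Callable[[], bool], timeout: float = 600.0,
             interval: float = 5.0) -> bool:
    """Poll check() until it returns True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def all_started(agent_states: Dict[str, str]) -> bool:
    """True when the map is non-empty and every agent state is 'started'."""
    return bool(agent_states) and all(
        state == "started" for state in agent_states.values()
    )
```

A reproduction script would call something like `wait_for(lambda: all_started(fetch_states()))`, where `fetch_states` is whatever function reads and parses the status.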
[03:28] <waigani> menn0_: setting up a process that could automatically find the offending commit for a CI bug would save us a lot of time
[03:28] <waigani> potentially reverting it too
[03:29] <menn0_> I think a semi-automatic process is probably best
[03:29] <menn0_> the CI tests often fail for infrastructure related reasons
[03:30] <waigani> right, well it could add suspicious commits to the bug comments
[03:30] <menn0_> you wouldn't want these to trigger reverts or long running investigations
[03:30] <menn0_> yes
[03:36] <stokachu> are there any examples where juju actions are being used in practice?
[03:37] <menn0_> axw: you associated bug 1347715 with the "don't enter upgrade mode unnecessarily" PR
[03:37] <axw> menn0_: yep
[03:37] <menn0_> that PR wasn't intended to solve that bug :)
[03:38] <axw> menn0_: see the latest error message at the bottom of the bug
[03:38] <axw> that LP bug has morphed
[03:38] <axw> the latest one was due to "upgrade in progress"
[03:38] <menn0_> all that PR does is allow the machine agent workers to start up a little faster, and avoid some unnecessary log noise
[03:39] <menn0_> I can see how it might help with that bug
[03:39] <menn0_> but I wonder if that's the whole story
[03:39] <menn0_> if it is then, great
[03:39] <stokachu> there doesn't seem to be an action-set either; is this feature not complete yet?
[03:39] <menn0_> but I'm not sure
[03:40] <axw> stokachu: still a work in progress AFAIK
[03:40] <stokachu> axw, ok cool
[03:42] <menn0_> axw: is it worth reverting 3b6da1d429bff627a636ca4512c1ae8230f26539 or do you think you'll have this sorted quickly
[03:42]  * menn0_ would like to get CI unblocked
[03:44] <axw> menn0_: I guess it's no big deal to revert it in the mean time
[03:44] <axw> I'll do that...
[03:45] <jcw4> stokachu: yes, it's still in-progress
[03:45] <jcw4> stokachu: bodie_ is primarily working on action-set and action-get
[04:15] <axw> menn0: reverted
[04:15] <menn0> axw: yep, saw that. cheers
[04:15] <menn0> axw: does the bug get marked as "Fix committed" now?
[04:16] <axw> menn0: yeah I've done that
[04:16] <axw> I'll keep looking into it tho
[04:16] <menn0> cool
[04:16] <menn0> of course
[04:16] <axw> this is one time where I really wish I hadn't rebased
[04:17] <axw> pretty sure I tested it and it all worked, then I rebased and it's broken since then
[04:19] <menn0> axw: so you think it's some interaction with this change and some other change that got rebased in?
[04:20] <axw> I think so
[04:20] <menn0> axw: in other news, that manual provider problem is still there despite the "don't enter upgrade mode unnecessarily" change
[04:20] <axw> sigh
[04:20] <axw> all good news today :)
[04:21]  * menn0 reopens the bug
[04:21] <axw> hold up
[04:21] <axw> it's still publishing?
[04:21] <axw> menn0: ^^
[04:21] <axw> or did you test it manually?
[04:21] <axw> eh, never mind
[04:21] <axw> hadn't refreshed
[04:22] <menn0> http://juju-ci.vapour.ws:8080/job/manual-deploy-trusty-ppc64/523/
[04:26] <menn0> axw: shall I take 1347715 for a bit?
[04:27] <axw> menn0: that'd be great, thank you
[04:27] <menn0> axw: cool
[04:27] <menn0> I need to go out and pick up some furniture but I'll keep going with it once I'm back
[05:26] <menn0> axw: I'm back but need to deal with the kids for a bit
[05:26] <axw> menn0: no worries
[05:26] <menn0> I'll continue with this manual provider issue later this evening
[05:27]  * axw hopes it isn't something else he's broken
[06:25] <axw> menn0: I'm getting "auth fails" even without my change
[06:25] <axw> :/
[06:41] <axw> hrm, 2nd shot worked
[06:41]  * axw scratches head
[08:16] <mattyw> morning all
[08:26] <TheMue> heya
[08:26] <dimitern> morning
[08:28] <voidspace> TheMue: morning
[08:28] <voidspace> dimitern: morning
[08:55] <menn0> axw: I'm back again (kids were a nightmare to get to bed)
[08:56] <axw> menn0: heya. fun :(
[08:56] <axw> menn0: so I've reworked my previous PR and got something working now
[08:56] <axw> ignore my comment from before, I think I hadn't uploaded the right tools
[08:56] <menn0> axw: ok that's good to hear
[08:57] <menn0> axw: let me know if you need a review
[08:57] <axw> the problem with the old one was that mongo.SetAdminMongoPassword was also doing Login straight after
[08:57] <dimitern> so is anyone working on the blocker bug https://bugs.launchpad.net/juju-core/+bug/1355324 ?
[08:57] <axw> menn0: it can wait, it's well after EOD there...
[08:57] <axw> not I
[08:58] <menn0> dimitern: no I don't think so
[08:58] <menn0> other blockers have been keeping us busy
[08:58] <dimitern> menn0, how about the other one assigned to you?
[08:58] <voidspace> launchpad won't even show me the bug
[08:59] <dimitern> voidspace, oh, which one?
[08:59] <voidspace> I don't think it's my internet being rubbish this time
[08:59] <voidspace> 1355324
[08:59] <voidspace> page won't load
[08:59] <dimitern> voidspace, :( aw
[08:59] <voidspace> just launchpad being slow I think
[08:59] <menn0> voidspace: weird. it's working for me
[08:59] <voidspace> odd
[09:00] <menn0> I'm going to look at #1347715 for a bit longer
[09:00] <menn0> but someone else might have to take over depending how far I get with it
[09:03] <mattyw> voidspace, I'm having trouble with lp, canonical.com and ubuntu.com this morning
[09:03] <mattyw> voidspace, you on bt?
[09:03] <voidspace> mattyw: over here in Romania? I doubt it :-)
[09:03] <voidspace> mattyw: don't know who the isp is to be honest
[09:04] <voidspace> but today most of the internet works well, which is better than previous days, *except* for launchpad
[09:04] <mattyw> voidspace, cool - how long you there for?
[09:04] <voidspace> I'll try canonical.com and ubuntu.com as well
[09:04] <voidspace> mattyw: till the end of August
[09:04] <voidspace> mattyw: visiting wife's family
[09:04] <mattyw> voidspace, nice, hope you have a great time
[09:04] <voidspace> canonical.com loads fine
[09:04] <voidspace> mattyw: thanks
[09:05] <voidspace> mattyw: I'm mostly working
[09:05] <voidspace> mattyw: we're taking a week at the end of August to visit seaside and mountains though
[09:05] <voidspace> should be good
[09:05] <voidspace> yep, launchpad not loading for me at all at the moment
[09:05] <voidspace> tried two browsers
[09:11] <voidspace> I wonder if it's a dns issue, I can't ping it either
[09:13] <mattyw> voidspace, I was able to get to canonical.com by changing dns servers but everything else is down for me
[09:43] <menn0> axw: well this looks like a problem. using the manual provider:
[09:43] <menn0> 2014-08-12 09:36:54 INFO juju.mongo open.go:104 dialled mongo successfully on address "127.0.0.1:37017"
[09:43] <menn0> 2014-08-12 09:36:54 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
[09:43] <menn0> 2014-08-12 09:36:55 ERROR juju.worker runner.go:219 exited "machiner": machine-0 failed to set status started: cannot get machine 0: EOF
[09:43] <axw> menn0: yep, but what would be causing the EOF?
[09:44] <menn0> that's the last thing in the machine-0 log following bootstrap
[09:44] <axw> I saw it, but it's not a very enlightening error message ;)
[09:44] <menn0> I'm going to crank up the log level during bootstrap and see what I can see
[09:44] <axw> cool
[09:44] <menn0> the error happens after the root logger goes to WARNING
[09:44] <axw> ah right
[09:45] <axw> yeah, I just set my environments' logging-config="<root>=DEBUG"
[09:45] <menn0> and I might have to do the bisect thing again to see if I can track down the rev
[09:46] <menn0> this one has been broken for a while so there could be a lot of revs...
[10:00] <mattyw> folks - is someone able to help me out with a couple of things around the unit type?
[10:00] <mattyw> ^^ it appears (at least in some tests) that the CharmURL in a unit doesn't have a value (== nil)
[10:01] <menn0> axw: alright
[10:01] <menn0> now I have a nice long panic
[10:01] <axw> menn0: cool
[10:02] <axw> I have to run
[10:02] <perrito666> morning
[10:02] <menn0> axw: no problems
[10:02] <menn0> I need to stop soon too
[10:02] <axw> menn0: if you get sick of looking at that, can you please attach the panic to the bug and I'll take a look tomorrow
[10:02] <menn0> axw: I'll try to get someone else to look in the mean time too
[10:03] <menn0> but I will add everything I have to the bug either way
[10:03] <axw> cool
[10:38] <TheMue> dimitern: a cable technician just came, could be that I’m a bit later at our hangout
[10:38] <dimitern> TheMue, ok, np
[10:50] <voidspace> dimitern: are we postponing?
[10:50] <dimitern> voidspace, let's wait for TheMue a few minutes
[10:50] <voidspace> dimitern: sure
[10:56] <TheMue> dimitern: I’m there, are you done?
[10:57] <voidspace> TheMue: we didn't start
[11:24] <menn0> right... I'm done
[11:24] <menn0> who wants to take over this CI blocker: 1347715
[11:24] <menn0> I've done a lot of the legwork. There's repro instructions and other findings attached to the ticket.
[11:25] <menn0> I'm pretty close but I haven't quite nailed the actual bug yet.
[11:43] <menn0> no takers regarding the above?
[11:51] <tasdomas> could somebody take a look at https://github.com/juju/juju/pull/490 ?
[11:52] <mgz> menn0: I read through the bug, did you narrow down when it was introduced, or not get that far yet?
[11:52] <menn0> mgz: well as per my last comment, it's definitely since 1.20.2
[11:52] <menn0> but I haven't narrowed it further than that
[11:53] <menn0> given that we have a fairly simple repro, git bisect could help?
[11:54] <mgz> yeah, just a bit slow, right, as it involves reprovisioning a machine?
[11:54] <mgz> or can you do it on a dirty one?
[11:54] <menn0> focussing on changes that involved the manual provider might also help narrow things down more quickly
[11:54] <menn0> no you don't need to reprovision
[11:54] <menn0> because it's the manual provider
[11:54] <menn0> just make sure you do a juju destroy-environment --force --yes after each run
[11:55] <menn0> that's pretty good at getting rid of most of the previous run
[11:55] <menn0> it's not exactly fast (the bootstrap still takes a while) but it's manageable
[11:56] <menn0> limiting the upload series to just trusty also helps a bit
[11:56] <menn0> it's midnight. I really need to go to bed
[11:57] <mgz> menn0: okay, good night
[11:57] <menn0> have a good one
[11:59] <fwereade> tasdomas, LGTM
[12:20] <tasdomas> fwereade, thanks
[13:41] <fwereade> natefinch, ping -- when you have a moment I'd like to chat about transactions
[13:41] <natefinch> fwereade: how about now? :)
[13:42] <fwereade> natefinch, perfect :)
[13:42] <fwereade> natefinch, in irc for a bit in case anyone else sees something interesting go past?
[13:42] <natefinch> fwereade:  sure
[13:42] <fwereade> natefinch, basically I'm worried about interactions between the two systems as we switch
[13:43] <natefinch> fwereade: yep
[13:43] <fwereade> natefinch, (oh wait, it might actually not be a problem, let's keep going for a bit)
[13:43] <fwereade> natefinch, in particular, assertions
[13:44] <fwereade> natefinch, now I think that we will write to every affected document --including those we just assert on -- as part of a transaction
[13:44] <fwereade> natefinch, if that's the case I think we're good
[13:44] <fwereade> natefinch, the problem isn't at crossover time
[13:44] <fwereade> natefinch, it's with using toku on its own
[13:45] <fwereade> natefinch, in particular, assertions about particular document states
[13:45] <fwereade> natefinch, I think that within any transaction we need to write to any document on whose state we depend
[13:47] <natefinch> fwereade: I haven't actually done a lot of research into how toku transactions work.... do you not get a consistent view of the database once you enter a transaction?  Or only the ones that you write to?
[13:47] <fwereade> natefinch, and that, best-case, this is going to lead to frequent document deadlocks
[13:48] <fwereade> natefinch, consistent state doesn't help though, I think
[13:48] <natefinch> ahh right, if someone else changes the doc out from under you
[13:48] <fwereade> natefinch, it's about needing all that state to still satisfy certain conditions when the txn lands
[13:48] <natefinch> right
[13:49] <fwereade> natefinch, writing to the document will help there, it pulls it into the transaction if you like
[13:49] <fwereade> natefinch, but I think it *seriously* increases the already-present risks of deadlock in the DB
[13:50] <fwereade> natefinch, I read docs X and Y, you read Y and X
[13:50] <natefinch> right
[13:50] <natefinch> classic deadlock
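The X-then-Y versus Y-then-X pattern can be shown with two plain locks. The sketch below is a general illustration of the deadlock natefinch names and of one standard way around it (always taking locks in a fixed order); it is not a description of how TokuMX or mgo/txn behave, and `with_docs` and `run_both` are hypothetical names.

```python
import threading

# One lock per "document".
locks = {"X": threading.Lock(), "Y": threading.Lock()}


def with_docs(names, action):
    """Run action() while holding the locks for names, taken in sorted order."""
    ordered = sorted(set(names))
    for name in ordered:
        locks[name].acquire()
    try:
        return action()
    finally:
        for name in reversed(ordered):
            locks[name].release()


def run_both(rounds=1000):
    """Two workers ask for the same docs in opposite orders; both finish."""
    done = []

    def worker(names):
        for _ in range(rounds):
            with_docs(names, lambda: None)
        done.append(tuple(names))

    threads = [
        threading.Thread(target=worker, args=(names,))
        for names in (["X", "Y"], ["Y", "X"])
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(done)
```

If each worker instead acquired the locks in the order it was given, one could hold X while waiting for Y and the other hold Y while waiting for X, and neither would ever proceed.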
[13:51] <natefinch> fwereade: have you read up on their transactions?  I don't feel qualified to talk about them without doing some reading.  I'd just been doing some testing to make sure it was even possible to use their DB.... but hadn't actually researched the way their transactions worked.  hazmat had done the work there, I believe.
[13:52] <fwereade> natefinch, I've read like 1 small white paper
[13:52]  * natefinch is reading the white paper now
[13:54] <fwereade> natefinch, so anyway -- I think it will work fine with mgo txns underneath, for a given value of fine, if not actually very performant
[13:54] <fwereade> natefinch, although
[13:55] <hazmat> fwereade, its mvcc fwiw
[13:55] <hazmat> consistent reads, so a simple query satisfies mgo conditions
[13:56] <hazmat> fwereade, more of concern is the use of write hotspots in the current codebase
[13:56] <fwereade> hazmat, are you thinking more of "presence" or of things like service documents?
[13:56] <hazmat> fwereade, service docs
[13:57] <fwereade> hazmat, right, I think those things are a nexus of fuckery in either txn system
[13:57] <hazmat> fwereade, they exist due to async gc ? or..
[13:58] <hazmat> fwereade, latent gc can use simple queries afaics.. i tended to regard them as offshoots of the extant txn sys in conjunction w/ lifecycle management
[13:58] <fwereade> hazmat, well service documents get cleaned up with the last unit to be removed from a dying service
[13:58] <fwereade> hazmat, but that's managed by a refcount on the service document
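The refcount scheme fwereade describes can be sketched with a toy in-memory store. All names here (`ServiceDoc`, `Store`, `unit_count`) are hypothetical and this is not juju's state code; in particular, removing a unit-less service immediately on destroy is an assumption made to keep the example complete.

```python
class ServiceDoc:
    """Toy stand-in for a service document carrying a unit refcount."""

    def __init__(self, name):
        self.name = name
        self.life = "alive"
        self.unit_count = 0


class Store:
    def __init__(self):
        self.services = {}

    def add_service(self, name):
        self.services[name] = ServiceDoc(name)

    def add_unit(self, service):
        self.services[service].unit_count += 1

    def destroy_service(self, service):
        doc = self.services[service]
        if doc.unit_count == 0:
            del self.services[service]  # assumption: nothing left to wait for
        else:
            doc.life = "dying"  # the last unit to go will clean up

    def remove_unit(self, service):
        doc = self.services[service]
        doc.unit_count -= 1
        # The last unit removed from a dying service removes the service too.
        if doc.life == "dying" and doc.unit_count == 0:
            del self.services[service]
```

Every unit removal writes to the same service document, which is why such documents behave as the write hotspots hazmat raises.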
[13:59] <hazmat> fwereade, why not just a async gc cleaner job in the state server?
[13:59] <Beret> wallyworld, any update on that flag to prevent a failed bootstrap from tearing itself down?
[14:00] <hazmat> fwereade, doesn't that mean machine loss of last unit leads to a stuck service otherwise?
[14:01] <hazmat> minus using destroy-machine --force to try and post clean up after the fact from the api server
[14:01] <fwereade> hazmat, no? when a unit gets removed either by the machine agent or by forced removal of the machine, the last one cleans up anything that depends on it
[14:01] <fwereade> hazmat, or indeed "yes" depending on perspective
[14:02] <natefinch> Beret: we've been fighting some critical bugs lately, and most of the team leads were on a sprint last week, so it hasn't gotten done yet, AFAIK.  But I think it could be done this week if we have someone free to work on it.
[14:03] <Beret> natefinch, ok, thanks
[14:03] <natefinch> Beret: we'll try to get it in... it's something we want for ourselves, too.
[14:03] <Beret> natefinch, it's not more important than real bugs, I just wanted to make sure it hadn't gotten in and we just didn't know it
[14:03] <fwereade> hazmat, to step back a moment
[14:03]  * hazmat nods
[14:03] <hazmat> meeting.. bbiab
[14:04] <fwereade> hazmat, in either txn system, our performance is largely dependent on the areas of document space touched by a given txn -- thus designing db operations that minimise overlap is important either way
[14:04] <hazmat> fwereade, agreed
[14:05] <fwereade> hazmat, we have occasionally done a really poor job of designing these things in the past
[14:05] <fwereade> hazmat, but now we have schema changes we have a path to mitigate these, and we can expect to see benefits from doing this in *either* system
[14:06] <fwereade> hazmat, however, we won't be able to eliminate overlaps
[14:07] <fwereade> hazmat, our job is structuring things such that overlaps are (1) rare and (2) not too disruptive
[14:09] <fwereade> hazmat, and I am at the moment mainly thinking about the impact on how we actually have to write code
[14:10] <fwereade> hazmat, natefinch: so if we wrap a mgo txn in a toku txn...
[14:12] <fwereade> hazmat, natefinch: let's pretend for a moment we get all our info from a consistent snapshot, and craft a txn assuming the truth of that stuff, and then we execute that txn
[14:13] <fwereade> hazmat, natefinch: only when we execute that txn do we hit the docs and grab the locks, and if we deadlock at that point we may have trouble
[14:14] <fwereade> hazmat, natefinch: if the txn runner fails out for this reason, what exactly is the impact on the code trying to craft the txn and retry in the face of contention?
[14:16] <fwereade> hazmat, go back to the beginning of the whole set of operations and start crafting txns again from scratch?
[14:17] <fwereade> hazmat, in particular I fear *that* is going to be a major source of gray goo transactions that end up locking large chunks of the DB, deadlocking, failing out, and doing the same thing over and over again
[14:18] <fwereade> natefinch, sorry, I am also interested in your thoughts on the above
[14:18] <natefinch> hah ok
[14:18] <natefinch> I was going to give my thoughts anyway ;)
[14:18] <fwereade> natefinch, just realised I'd stopped badging you at some point
[14:20] <fwereade> (but that doesn't mean I'm going to stop badgering you, ho ho)
[14:20]  * fwereade looks shamefaced
[14:20] <natefinch> haha
[14:21] <natefinch> fwereade: I gotta run in a couple minutes unfortunately..... but... do we need mgo transactions if we have toku transactions?  Can we just stop using mgo transactions?
[14:22] <fwereade> natefinch, yes, *but* that's a lot of stuff to rewrite rather than wrap
[14:22] <natefinch> right
[14:22] <fwereade> natefinch, there will certainly be things that are easier to do in a pure-toku world
[14:22] <natefinch> yep
[14:23] <fwereade> natefinch, I'm just worried that mixing the two will actually have a pathologically awful impact
[14:23] <natefinch> I gotta run, sorry, I'll keep thinking and will read the channel history.. Back in an hour-ish
[14:23] <fwereade> natefinch, cheers, take care
[14:24] <hazmat> fwereade, per the acid doc, we could divert mgo transactions to toku
[14:27] <fwereade> hazmat, I can sort of see it but it's a bit fuzzy in my mind
[14:27] <fwereade> hazmat, we still have to do what I said, I think?
[14:27] <fwereade> hazmat, we can be a bit more proactive about writing to docs to take their locks a bit earlier than we could otherwise
[14:28] <fwereade> hazmat, and I daresay we can convert failure to do so into ErrRefresh, and make all the transactions handle it?
[14:29] <fwereade> hazmat, but wouldn't we have to actually start a fresh toku txn *anyway*?
[14:30] <fwereade> hazmat, so in a multi-step toku operation, if the Nth step fails we back *everything* out and start again?
[14:33] <fwereade> hazmat, heh, maybe even track and try to grab locks on everything we thought we needed last time through?
[14:34] <fwereade> hazmat, it all feels potentially *really* yucky
[14:40] <fwereade> natefinch, more blithering above ^^
[14:44] <hazmat> fwereade, sorry still in meeting, bbiab
[14:44] <fwereade> hazmat, np, I'm around for a bit I think
[16:01] <natefinch> fwereade: back
[16:23] <fwereade> natefinch, I need to be off soon, but read back and see if anything resonates or inspires a response
[16:24] <fwereade> natefinch, just type at me in here if I'm gone, I'll see it soon
[16:26] <hazmat> fwereade, sorry my meeting ran over time.. and into another meeting.. i'll capture notes for discussion tomorrow if you're not around in a bit
[16:29] <hazmat> fwereade, in a nutshell, locks are implicit with mods to a doc in a txn.. basically we just take an op runner and do it in a toku txn (begin, end txn) with catch on lock and retry behavior.
[16:30] <natefinch> fwereade: no problem.  Trying to do a few things at once here.   I don't know that grabbing locks early has any effect in toku... if two threads are both trying to grab locks early, it doesn't really buy us much.  My preference would be to just make the code straightforward and do what it's supposed to do, and if something else gets into the DB first, the TX will have to retry.
[16:30] <hazmat> fwereade, re backout, it's implicit at that commit, failed lock is rollback, and error to app to handle, at which point we retry similar to runner loop now
[16:31] <hazmat> there is no grabbing a lock, it's implied by writing to a doc during a txn.
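The begin / write / commit loop with rollback and retry that hazmat outlines could look roughly like the sketch below. It runs against a toy in-memory store, not TokuMX: `ToyStore`, `ToyTxn`, `LockConflict` and `run_in_txn` are all hypothetical names, and the `locked` set simply simulates documents held by some other transaction.

```python
class LockConflict(Exception):
    """Raised by the toy store when another transaction holds a document."""


class ToyStore:
    def __init__(self):
        self.docs = {}
        self.locked = set()  # documents currently held by someone else

    def begin(self):
        return ToyTxn(self)


class ToyTxn:
    def __init__(self, store):
        self.store = store
        self.pending = {}

    def write(self, key, value):
        # Writing is what takes the lock; there is no separate lock call.
        if key in self.store.locked:
            raise LockConflict(key)
        self.pending[key] = value

    def commit(self):
        self.store.docs.update(self.pending)

    def rollback(self):
        self.pending.clear()


def run_in_txn(store, ops, max_attempts=3):
    """Run ops(txn) in a transaction, restarting from scratch on a conflict.

    Returns the number of attempts it took.
    """
    for attempt in range(1, max_attempts + 1):
        txn = store.begin()
        try:
            ops(txn)
            txn.commit()
            return attempt
        except LockConflict:
            txn.rollback()  # back everything out, then try again
    raise LockConflict("gave up after %d attempts" % max_attempts)
```

Note that `ops` is re-run in full on every attempt, which matches fwereade's earlier question about backing everything out when the Nth step fails.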
[16:31] <fwereade> hazmat, natefinch: ok, this sounds broadly sane, I am worried that we have some funky layering issues to work around wrt actually managing rollbacks
[16:31] <natefinch> that's certainly possible
[16:31] <fwereade> hazmat, natefinch: well I am contending that we do need document locks for those docs that in mgo/txn we would assert on
[16:31] <fwereade> hazmat, natefinch: but that we will have to lock them ourselves by, as you say, writing to them in a transaction
[16:32] <hazmat> fwereade, we don't because we're operating in a mvcc world with implicit write level locks.
[16:32] <fwereade> hazmat, right
[16:32] <hazmat> ie read committed, or serialized if preferred (both options avail)
[16:32] <fwereade> hazmat, which is great for ensuring that conditions hold for the documents we're writing
[16:32] <hazmat> fwereade, right.. toku should be a nice runner for the extant mgo transactions since conditions are explicitly specified and re-runnable
[16:32] <natefinch> fwereade: right... my reading of MVCC is... you don't have to worry about basing your behavior on reads that get out of date, because if they get out of date, one of the two transactions aborts and retries
[16:34] <fwereade> hazmat, natefinch: so once you start a transaction, toku tracks everything you read? and aborts your txn if someone writes to one of those docs?
[16:34] <fwereade> hazmat, natefinch: that's the only way I could see it aborting a transaction in that situation
[16:35] <hazmat> fwereade, again depends on isolation level.. read serialized you'll read the copy that was current at the time of txn begin
[16:35] <hazmat> fwereade, read committed, means you read other docs current as of the time of read
[16:35] <hazmat> as opposed to our read dirty model now.. ie. read partially committed state
[16:36] <fwereade> hazmat, natefinch: I don't see how either read serialized *or* read committed actually helps us -- I'm not interested in either of those two states
[16:36] <hazmat> rephrasing .. current == latest committed revision of the doc
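The two read behaviours hazmat contrasts can be illustrated with a toy multi-version store. This only models what a reader sees under each behaviour as described in the conversation; it is not TokuMX's implementation and involves no locking. `VersionedStore` and `Reader` are hypothetical names.

```python
class VersionedStore:
    """Toy multi-version store: keeps every committed value of each document."""

    def __init__(self):
        self.clock = 0
        self.history = {}  # key -> list of (commit number, value), oldest first

    def commit(self, key, value):
        self.clock += 1
        self.history.setdefault(key, []).append((self.clock, value))

    def read(self, key, as_of=None):
        """Latest committed value, or the latest one at or before as_of."""
        versions = self.history.get(key, [])
        if as_of is not None:
            versions = [v for v in versions if v[0] <= as_of]
        return versions[-1][1] if versions else None


class Reader:
    def __init__(self, store, snapshot):
        self.store = store
        # A snapshot reader pins the commit number at begin;
        # a read-committed reader always sees the latest commit.
        self.as_of = store.clock if snapshot else None

    def read(self, key):
        return self.store.read(key, self.as_of)
```

With a document committed as "alive", then two readers started, then the document committed as "dying", the snapshot reader still sees "alive" while the read-committed reader sees "dying". Neither reader ever sees an uncommitted value.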
[16:37] <hazmat> fwereade, they're rather important .. but perhaps we should take a step back.. what's the concern?
[16:37] <fwereade> hazmat, if I have a txn that should only go ahead if a property continues to hold for some doc not in the txn
[16:38] <fwereade> hazmat, then it's philosophically impossible for me to read a document in either of those situations and be certain that it can't change under me
[16:38] <fwereade> hazmat, but if I *write* to that document I can
[16:38] <fwereade> hazmat, because I take a lock
[16:39] <fwereade> hazmat, and guarantee the failure of either myself or the other txn that wanted to write it
[16:39] <fwereade> hazmat, (sucks if that one just wanted to read it too, but anyway)
[16:39] <fwereade> hazmat, making sense?
[16:39] <hazmat> fwereade, serializable would give you that semantic
[16:40] <fwereade> hazmat, ok, so that does take locks on read?
[16:40] <fwereade> hazmat, (hopefully happy smart read locks not nasty actually-triggering-serialization write locks?)
[16:40] <hazmat> fwereade, it does.. but it may not cover every use case.. ie inserting new doc in collection.. since you're asserting a negative read
[16:41] <fwereade> hazmat, yeah, we depend on that a bit
[16:41]  * hazmat attention is gripped by tosca meeting
[16:42] <fwereade> hazmat, natefinch: ok, modulo d- asserts, we sound like we'll be fine with the serializable level
[16:43]  * fwereade needs to go out anyway
[16:43] <fwereade> thanks for clearing that up, I didn't get that impression from the descriptions
[16:43] <fwereade> I guess we need to be careful about what we *do* read when we're in a txn then..?
[16:44] <natefinch> fwereade: well, generally you don't just read stuff for no reason.  You read stuff because the logic depends on the contents
[16:44] <fwereade> natefinch, ok, but sometimes you read an easy-to-express superset of what you need and extract in code
[16:44] <hazmat> fwereade, it may be we need to do separate collection level locks for insert
[16:44] <natefinch> it's funny, Toku has an office in Lexington.. I could like, drop down there and say hi
[16:45] <fwereade> natefinch, if that's going to quietly take a bunch of locks we should be aware of it
[16:45] <natefinch> fwereade: that's true
[16:45] <hazmat> fwereade, i'd like to define the common scenarios and patterns we have
[16:46] <fwereade> hazmat, I can live with that, I'm just fretting because I want to be sure we've got answers for all these things ;)
[16:46] <hazmat> and then verify solutions.. common idioms for them with tokumx
[16:46] <fwereade> hazmat, that would probably be the smart thing to do, indeed -- natefinch, are you ok to try to build those up?
[16:46] <fwereade> natefinch, would be happy to discuss in more detail
[16:46] <fwereade> natefinch, ...but not now
[16:47]  * fwereade disappears in a puff of smoke
[16:47] <hazmat> fwereade, natefinch should i setup a call for tomorrow?
[16:47] <natefinch> fwereade: see ya
[16:47] <natefinch> fwereade: yeah
[16:47] <hazmat> fwereade, cheers
[16:48] <hazmat> done
[17:01] <jcw4> mgz, cmars fix for lp-1355521 https://github.com/juju/juju/pull/500 , ptal
[17:03] <mgz> jcw4: you can jfdi that through if you want
[17:03] <ericsnow> natefinch: one-on-one?
[17:03] <jcw4> mgz: tx
[17:06] <jcw4> mgz tx again :)
[17:07] <mgz> :)
[17:12] <natefinch> ericsnow: sorry, coming
[17:31] <wwitzel3> woohoo! I'm down to permission issues
[17:41] <wwitzel3> victory is mine!
[17:50] <natefinch> wwitzel3: nice!
[17:53] <perrito666> sinzui: I would pretty much guess that the cause for restore to no longer finish is https://github.com/juju/juju/commit/55a9507924dea63658598361797ec864b9879e84#diff-d41d8cd98f00b204e9800998ecf8427e
[17:54] <wwitzel3> natefinch: yeah, the outchannel command cannot take arguments, so you have to do it as a script. also if outchannel encounters a non-zero exit code from the script, it assumes the channel is bad and stops sending to it. Until you restart/reload.
[17:54] <natefinch> huh ok, so your script wasn't returning 0 I guess?
[17:54] <wwitzel3> natefinch: and lastly, the logrotate conf itself, must have the right set of permissions.
[17:55] <wwitzel3> natefinch: right, it wasn't because of a permission issue, which was just being thrown away
[17:55] <natefinch> ahh
[17:55] <sinzui> perrito666, I just got permission to merge my log recovery changes. The next runs of the recovery tests will try to get logs from that machine that restore created
[17:56] <sinzui> perrito666, yes, the commit does look like it relates to the vague error in the jenkins console
[17:56] <wwitzel3> natefinch: so it had 3 issues, but I just had to discover them in the right order.
[17:57] <wwitzel3> natefinch: for example, at first I was calling logrotate as my command and passing it arguments ... logrotate was being called and returning 0 so logging continued, but the actual conf wasn't being passed.
[17:58] <perrito666> voidspace: care to shed some light on the commit?
[17:58] <perrito666> it has your name
[17:58] <wwitzel3> natefinch: then finally, buried in the documentation for outchannel, I found a small italic note about the fact that the outchannel command can't take any arguments.
[17:59] <wwitzel3> oh and the issue with logrotate's default state file path, that was easy though once I was actually able to start getting debug output from the rsyslog outchannel executing the command.
[17:59] <natefinch> yeesh
[18:00] <natefinch> I think you should write all that up and put it on juju-dev, so that someone else has a chance of understanding everything later.... and actually, a big comment in the code about it wouldn't hurt either.
[18:01] <wwitzel3> natefinch: so the rsyslog cert and key, live in the log folder even though they aren't logs.
[18:01] <natefinch> or in the docs somewhere... something
[18:01] <natefinch> wwitzel3: nice
[18:01] <wwitzel3> natefinch: so I assume it is ok for the logrotate.conf to live there too?
[18:02] <wwitzel3> natefinch: also the logrotate helper script that runs logrotate with the juju conf.
[18:03] <wwitzel3> natefinch: yeah I was planning to document the behavior of outchannel and ref the docs as well as the permission requirements for the logrotate.conf and state file.
[18:04] <wwitzel3> natefinch: since I also need to document the existence of all-machines.log.1 and the max size, etc.
[18:04] <natefinch> ahh yep, definitely
[18:04] <natefinch> and yeah, we can put more stuff in there if there's already non-logs in there
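
The three rsyslog pitfalls wwitzel3 describes (the outchannel command takes no arguments, a non-zero exit stops the channel, and logrotate is picky about the conf's permissions) can be sketched as below. This is a rough illustration, not juju's code: every name, path, and size is a made-up placeholder, the `/usr/sbin/logrotate` location is an assumption, and mode 0644 for the conf is an assumption about what logrotate accepts.

```python
import os
import stat
import tempfile


def outchannel_line(name, log_path, max_bytes, script_path):
    """Render a legacy-format rsyslog $outchannel directive.

    The last field must be a bare executable path, because the
    outchannel command cannot take arguments.
    """
    if " " in script_path:
        raise ValueError("outchannel command cannot take arguments")
    return f"$outchannel {name},{log_path},{max_bytes},{script_path}"


def helper_script(conf_path, state_path):
    """Render a wrapper script that passes logrotate its arguments."""
    return (
        "#!/bin/sh\n"
        "# A non-zero exit makes rsyslog stop writing to the channel,\n"
        "# so a permission error here silently halts logging.\n"
        f"exec /usr/sbin/logrotate -s {state_path} {conf_path}\n"
    )


def write_file(path, content, mode):
    """Write content to path and set an explicit permission mode."""
    with open(path, "w") as f:
        f.write(content)
    os.chmod(path, mode)
```

A caller would write the conf with something like `write_file(conf_path, conf_text, 0o644)` and the wrapper with `0o755`, then point the outchannel at the wrapper instead of at logrotate itself.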
[18:05] <tych0> h
[18:23] <hazmat> wwitzel3, you get your rsyslog issues resolved?
[18:53] <wwitzel3> hazmat: yep, thanks :)
[19:13]  * sinzui kicks CI to test ha and restore now
[19:15] <natefinch> sinzui: I have a possible fix for the other bug too
[19:15] <natefinch> sinzui: https://github.com/juju/juju/pull/501
[19:18] <natefinch> perrito666, ericsnow, wwitzel3: one of you want to review? ^^  really simple change with a test and everything.  I don't actually know why the code worked before and then suddenly stopped working, but at the very least this is a change that won't break anything and has a test to make sure it continues working.
[19:19] <natefinch> super simple change
[19:19] <ericsnow> natefinch: sure
[19:20] <natefinch> test verified to panic on the old code, just like in comment #18 here: https://bugs.launchpad.net/juju-core/+bug/1347715
[19:20] <natefinch> (and test verified not to panic with the new code)
[19:21] <natefinch> aside:  I wish the comments on launchpad bugs were anchors, so I could do <bug-url>#comment-18 and have it jump directly to the comment. Instead the comment numbers just open up a separate window with no context which is completely useless.
[20:48] <thumper> morning folks
[20:53] <thumper> sinzui: you around?
[20:54] <sinzui> I am
[20:56] <thumper> sinzui: what's the tl;dr on CI status?
[20:57] <sinzui> thedac, 2 blockers, natefinch's fix for 1 is queued to play in the next hour
[20:58] <natefinch> thumper: here's the fix... though I admittedly don't know why it worked before: https://github.com/juju/juju/pull/501
[20:58] <sinzui> thumper, my effort to get more logs from the failed restore tests also failed. no new data
[20:58]  * thumper takes a quick look
[20:58]  * sinzui wishes --debug didn't leak certs, keys, users, and passwords
[20:59] <natefinch> sinzui: me too
[20:59] <natefinch> I gotta run
[20:59] <natefinch> good luck everyone, and good night
[21:00] <thumper> sinzui: so the remaining CI failure is a restore one?
[21:01] <sinzui> thumper, yes https://bugs.launchpad.net/juju-core/+bug/1355324
[21:01] <sinzui> thumper, the test is playing right now. I hope to ssh in at the right moment and cat the logs
[21:01] <thumper> kk
[21:13] <waigani> thumper: welcome back!
[21:13] <menn0> thumper, waigani: good morning
[21:13] <thumper> o/
[21:13] <waigani> morning :)
[21:16]  * thumper headdesks
[21:51] <sinzui> perrito666, I added logs to bug 1355324. Tomorrow I am going to investigate redirecting all stderr to a private location on disk to try --debug
[21:59] <perrito666> sinzui: the rev I pointed to is the culprit; it appears that we no longer have mongo listening on StatePort, and therefore restore cannot connect to it. I need to get voidspace or fwereade to find out more about it
[22:00] <thumper> perrito666: why isn't mongo listening?
[22:00] <sinzui> perrito666, okay. that's right, I have too much going on to keep my work list clear
[22:00] <perrito666> thumper: I don't think it's not listening; we are no longer publicizing the port, I believe
[22:01] <perrito666> thumper: https://github.com/juju/juju/pull/449/files
[22:01] <sinzui> thumper, we agreed some months ago that nothing is allowed to talk to mongo directly...but restore does
[22:01] <ericsnow> perrito666: from the PR it sounds like it's not even listening on external ports
[22:02] <perrito666> ericsnow: I really need to dig more into this otherwise I am talking in educated guesses
[22:03] <thumper> ah...
[22:03] <thumper> didn't I see a change where mongo only listened internally?
[22:03] <ericsnow> perrito666: https://github.com/juju/juju/pull/449#issuecomment-50825367
[22:03] <thumper> like only on localhost?
[22:03] <ericsnow> perrito666: "I've also confirmed that with current master I can telnet to port 37017, and that with this branch I can't."
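
The telnet check quoted from the PR can be approximated in a few lines. This is a minimal sketch of a plain TCP reachability test; it only tells you whether something accepts connections on the port, not whether it is mongo, and the function name is invented for illustration.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

Running `port_is_reachable(state_server_address, 37017)` from a client machine against master and against the branch would reproduce the comparison described in the PR comment (`state_server_address` is a placeholder).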
[22:03] <perrito666> not even a day running ubuntu on this machine and I already have unity behaving oddly.. that must be a record for a fresh install
[22:03] <thumper> ericsnow: that's the one
[22:04] <perrito666> thumper: sounds to me that not even, which puzzles me a bit unlessss
[22:04] <perrito666> api is using direct connection
[22:04] <perrito666> I dont know if direct is the right name
[22:04] <ericsnow> perrito666: yeah, that's what I find weird
[22:04] <ericsnow> thumper: yep, that's the PR for the changeset that broke restore
[22:05] <perrito666> thinkpads ability to swap fn/ctrl on the bios is marvelous
[22:06] <thumper> ericsnow, perrito666: simplest thing IMO is to revert PR 449
[22:07] <thumper> and take a fresh look later at closing the port
[22:07] <thumper> better than trying to work out now how to selectively open it
[22:09] <perrito666> I guess there is no other option, also by doing this I can get the original author to grep again for the use of StatePort :p
[22:10] <ericsnow> perrito666: yeah, tell him about that grep thing ;)
[22:12] <ericsnow> thumper, perrito666: +1 on reverting
[22:12] <ericsnow> (too bad another blocker will show up before I have a chance to merge anything tomorrow <wink>)
[22:15] <perrito666> ok, let me have a late evening snack and I'll propose the revert
[22:15] <perrito666> btw, did I mention I have the eating habits of a hobbit?
[22:40] <perrito666> so, to revert we do a reverse pr or use the "sorry I screwed up, please undo" feature from github
[22:44] <thumper> perrito666: um... there is a github feature?
[22:44] <thumper> perrito666: I'd have just done a reverse PR
[22:44]  * thumper goes to make a coffee
[22:44] <perrito666> thumper: I saw github offering me to undo with a button the other day
[23:02] <thumper> waigani: standup hangout time
[23:11] <perrito666> how do I reference a ticket/pr on a pr description
[23:11] <perrito666> ?
[23:16] <perrito666> sinzui: or anyone
[23:16] <perrito666> the $$fixes tag is to be added into the $$merge or in the pr description?
[23:17] <sinzui> perrito666, fixes-nnnnn
[23:17] <sinzui> perrito666, in a comment by itself or with other text
[23:17] <perrito666> sinzui: yup, but, that is to be added into the pr body or into the merge comment?
[23:17] <perrito666> my question is, will that trigger the merge?
[23:17] <sinzui> merge comment perrito666
[23:18] <sinzui> perrito666, it doesn't trigger a merge.
[23:18] <sinzui> perrito666, $$merge$$ to trigger the merge and fixes-nnnnnn to explain why the merge is valid
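
The convention sinzui describes, `$$merge$$` to trigger the merge and `fixes-nnnnnn` to justify it, could be recognized in a comment body roughly as follows. This is a hypothetical sketch and not the actual merge bot's logic; the function name and return shape are assumptions.

```python
import re

MERGE_TRIGGER = "$$merge$$"
# Matches tags such as "fixes-1355324" and captures the bug number.
FIXES_PATTERN = re.compile(r"\bfixes-(\d+)\b")


def parse_merge_comment(body):
    """Extract the merge trigger and any fixes-N bug numbers from a comment."""
    return {
        "merge_requested": MERGE_TRIGGER in body,
        "fixes": [int(n) for n in FIXES_PATTERN.findall(body)],
    }
```

Note that the two markers are independent here: a `fixes-` tag alone does not request a merge, matching the statement that it "doesn't trigger a merge".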
[23:19] <perrito666> tx sinzui
[23:19] <perrito666> ok anyone ptal https://github.com/juju/juju/pull/503
[23:29] <ericsnow> I'm signing off, but if anyone is able, ptal @ https://github.com/juju/utils/pull/16 https://github.com/juju/utils/pull/19 https://github.com/juju/juju/pull/462 https://github.com/juju/juju/pull/453
[23:29] <sinzui> perrito666, I just discovered the first comment is not a comment according to github api. I can see ericsnow is the first comment, and my test comment is the second
[23:30] <perrito666> sinzui: that is why I did not add the $$ fixes part, I will add it in the merge comment
[23:40] <sinzui> perrito666, okay. I am a little disappointed because I think it is nice to state in the first comment why the branch should be merged :/
[23:52] <menn0> testing... please ignore: bug 1347715
[23:52] <mup> Bug #1347715: Manual provider does not respond after bootstrap <bootstrap> <ci> <regression> <juju-core:In Progress by natefinch> <https://launchpad.net/bugs/1347715>
[23:52] <menn0> \o/ (fixed!)
[23:53] <perrito666> thumper: that is not going to work
[23:53] <perrito666> you did not add the $$fixes-###$$