thumperdavecheney: I'm looking at the peergrouper tests; they now seem to pass when run individually, but fail when I run all the juju tests01:03
thumperdavecheney: I'm expecting it is impacted by load01:03
thumperdavecheney: using your stress.sh script01:03
thumperbut I'm needing something to stress either cpu or disk01:03
thumperdo you have something?01:03
thumperwallyworld: we need to chat, re: simplestreams and lxd01:04
wallyworld"we need to talk"01:04
wallyworldi hate those 4 words01:04
thumper"please come to the office?"01:05
wallyworldthumper: did you want to talk now?01:05
thumpernot necessarily now01:05
wallyworldthumper: ok, give me 15?01:05
thumpersure, np01:05
wallyworldthumper: talk now?01:33
thumperI have found the race in the peergrouper02:28
davecheneythumper: ??02:33
davecheneydo tell02:33
thumperthere are timing issues between the various go routines it starts up02:34
davecheneyyup, so when you run go test ./...02:34
davecheneyyou have 4/8 other test jobs running at one time02:34
davecheneytiming goes off02:34
thumpersometimes under heavy load, the peer grouper will attempt to determine the leader before it realises it has any machines02:34
thumperon successful runs, the machine watchers have fired before the other change02:35
thumperso it knows about the machines02:35
thumperon unsuccessful runs, it doesn't02:35
thumperso all the machines are "extra"02:35
thumperand have nil vote, so fail02:35
thumperI'm now attempting to work out the best place to sync the workers...02:38
thumperand the best way to do it...02:38
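Below is a minimal sketch of one way to order the two inputs thumper describes. It assumes a hypothetical `groupState` that learns the machine set only from a watcher event. Until that first event arrives, the decision step returns a "not ready" error. A timer that fires early then causes a skipped round, instead of every member being treated as "extra" with a nil vote. None of these names come from the peergrouper code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReady is returned when a decision is requested before the first
// machine-watcher event has been seen. Hypothetical, for illustration.
var errNotReady = errors.New("machine set not yet known")

// groupState is a hypothetical stand-in for the worker's view of the world.
type groupState struct {
	machinesKnown bool
	machines      map[string]bool // machine id -> wants a vote
}

// onMachinesChanged records the machine set delivered by the watcher.
func (s *groupState) onMachinesChanged(ids map[string]bool) {
	s.machines = ids
	s.machinesKnown = true
}

// decide refuses to compute voting members until the initial machine event
// has arrived, rather than treating every member as unknown.
func (s *groupState) decide(members []string) ([]string, error) {
	if !s.machinesKnown {
		return nil, errNotReady
	}
	var voting []string
	for _, m := range members {
		if s.machines[m] {
			voting = append(voting, m)
		}
	}
	return voting, nil
}

func main() {
	var s groupState
	if _, err := s.decide([]string{"0", "1"}); err != nil {
		fmt.Println("skipping this round:", err) // timer fired first: wait, don't fail
	}
	s.onMachinesChanged(map[string]bool{"0": true, "1": true})
	voting, _ := s.decide([]string{"0", "1"})
	fmt.Println("voting:", voting)
}
```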
thumperpretty sure that's a bit bollocks02:42
thumperdavecheney: using a *machine as a key in a map?02:42
davecheneycould be reasonable02:46
davecheneyassuming that nobody ever creates a machine02:46
davecheneywhich could be a problem02:47
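The risk with a pointer key is identity. Go compares pointer map keys by address, so two separately allocated values that describe the same machine count as different keys. Here is a small sketch with a hypothetical `machine` type:

```go
package main

import "fmt"

// machine is a hypothetical stand-in for a machine record.
type machine struct{ id string }

func main() {
	a := &machine{id: "0"}
	b := &machine{id: "0"} // same contents, different allocation

	votes := map[*machine]bool{a: true}
	fmt.Println(votes[a]) // true: same pointer
	fmt.Println(votes[b]) // false: pointer keys compare by address, not by id

	// Keying by a stable identifier avoids the problem.
	byID := map[string]bool{a.id: true}
	fmt.Println(byID[b.id]) // true
}
```

Keying by the machine id means it no longer matters who allocated the value.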
thumperI can't work out how to sync these things03:06
thumperdavecheney: got a few minutes?03:06
davecheneythumper: hey04:21
davecheneysorry, i was at the shops04:22
davecheneystill there ?04:22
thumperyeah, but sent an email04:22
thumperI've given up on the peergrouper04:22
thumperit is a big pile of assumptions I don't understand04:22
davecheney /o\04:22
davecheneyit sounds like it needs more synchronisation04:23
davecheneyif parts of the peergrouper assume something04:23
thumperI added what I thought would be enough04:23
thumperbut no04:23
davecheneythat needs to be replaced with explicit coordination04:23
thumperyes, I agree with that last statement04:23
davecheneythe worrying part is i think we can assume it will fail ~100% of the time in the field04:23
davecheneygiven it only just passes under controlled conditions04:23
davecheneythis should probably be a build blocker04:24
thumperthe big problem, as best I can tell, is that whenever the timer goes off for it to update itself, it assumes it knows the current state of the machines04:24
thumperwhich it does not04:24
davecheneythat's impossible04:24
thumperbecause those changes come in asynchronously04:24
thumperand it isn't querying04:25
davecheney /me facepalm04:25
thumperI think that what it should do, is explicitly query all machines at the point of trying to decide04:25
thumperand not rely on just change notifications04:25
davecheneyi think it's worse than that04:26
davecheneyyou cannot query a machine04:26
davecheneythen do something with that information04:26
davecheneyan unlimited amount of time can pass between statements04:26
davecheneyany information you retrieve has to be assumed to be stale04:26
thumperwell, in practice, it isn't infinite04:26
davecheneyyou have a distributed locking problem04:26
thumperbut it certainly isn't zero04:26
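davecheney's point that anything read may already be stale is often handled with optimistic concurrency. You record the revision you read, make the write conditional on that revision being unchanged, and retry from a fresh read when it has changed. The sketch below uses a hypothetical in-memory `store` with a toy rule (every machine votes). It illustrates the general technique only and is not how juju or mongo implement it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errStale = errors.New("state changed since it was read")

// store is a hypothetical versioned store; every successful write bumps rev.
type store struct {
	mu       sync.Mutex
	rev      int
	machines []string
	voters   []string
}

// snapshot returns a copy of the machine list and the revision it was read at.
func (s *store) snapshot() ([]string, int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.machines...), s.rev
}

// setVotersIf applies the decision only if nothing changed since rev.
func (s *store) setVotersIf(rev int, voters []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.rev != rev {
		return errStale
	}
	s.voters = voters
	s.rev++
	return nil
}

// addMachine simulates a change made by someone else.
func (s *store) addMachine(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.machines = append(s.machines, id)
	s.rev++
}

// updateVoters treats what it read as possibly stale and retries from a
// fresh snapshot whenever the conditional write is rejected.
func updateVoters(s *store, attempts int) error {
	for i := 0; i < attempts; i++ {
		machines, rev := s.snapshot()
		if err := s.setVotersIf(rev, machines); err != errStale {
			return err
		}
	}
	return errStale
}

func main() {
	s := &store{machines: []string{"0"}}
	_, rev := s.snapshot()
	s.addMachine("1")                              // a change lands between read and write
	fmt.Println(s.setVotersIf(rev, []string{"0"})) // rejected as stale
	if err := updateVoters(s, 3); err != nil {
		fmt.Println("gave up:", err)
	}
	fmt.Println(s.voters) // decided from a fresh snapshot
}
```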
thumperI think that for any point in time, it should ask for the current state of the machines it cares about, and use that consistent information to make the decision04:27
thumperto the best of its ability04:27
thumperrather than the inconsistent picture it currently has04:27
thumperbut I have no more fucks to give04:28
davecheneyis there a way to query the state of all machines atomically04:29
davecheneyor is it N+1 style ?04:29
thumperyes, I believe there is an API call to get the machine info04:29
thumperif there isn't it is easy to add one04:29
thumperas atomically as mongo gives us04:30
thumperdirty reads and all that :)04:30
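thumper mentions an API call that returns the machine info, but its real name and shape are not given here. The sketch therefore uses a hypothetical `fakeAPI` to contrast two patterns. In the first, each machine costs one round trip, and each call may observe a different moment. In the second, a single call returns all machines. Even the single call is only as consistent as the backing store's reads, which is the dirty-read caveat above.

```go
package main

import "fmt"

// machineInfo is a hypothetical summary of one machine's state.
type machineInfo struct {
	ID       string
	WantVote bool
}

// fakeAPI stands in for a hypothetical API facade; calls counts round trips.
type fakeAPI struct {
	machines map[string]machineInfo
	calls    int
}

// Machine fetches a single machine per round trip.
func (a *fakeAPI) Machine(id string) machineInfo {
	a.calls++
	return a.machines[id]
}

// AllMachines returns every machine from one round trip, so all entries
// reflect roughly the same moment.
func (a *fakeAPI) AllMachines() []machineInfo {
	a.calls++
	out := make([]machineInfo, 0, len(a.machines))
	for _, m := range a.machines {
		out = append(out, m)
	}
	return out
}

func main() {
	api := &fakeAPI{machines: map[string]machineInfo{
		"0": {"0", true}, "1": {"1", true}, "2": {"2", false},
	}}
	// Per-machine style: one call each, each possibly seeing a different state.
	for _, id := range []string{"0", "1", "2"} {
		_ = api.Machine(id)
	}
	fmt.Println("per-machine calls:", api.calls)

	api.calls = 0
	all := api.AllMachines()
	fmt.Println("batched calls:", api.calls, "machines:", len(all))
}
```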
davecheneyso, this is going to work 99% of the time04:31
davecheneyexcept the time when it fails because everything is going up and down like a yoyo04:31
davecheneyin the 99% case, you don't need atomics or any of that jazz 'cos it's approximately steady state04:31
davecheneyin the 1% case, when we _really_ need it to work04:32
davecheneyit's not going to04:32
davecheneyat all04:32
davecheneythis is a poor outcome04:32
* thumper nods04:33
thumperdavecheney: the problem, as I see it, is that on any server under load, as it probably will be at startup, the peergrouper will fail the first time through its loop and get restarted04:35
thumpereventually it'll probably get settled04:35
thumperbut geez04:35
thumperhow not to do something04:35
davecheneyyeah, that's what I was grasping at04:37
davecheneyunder steady state, it'll work just fine04:37
davecheneywhich is useless04:37
davecheneyand under load, it'll freak out04:37
davecheneywhich is useless04:37
* thumper is done04:45
wallyworldaxw_: if you have time at any point, could you take a look at http://reviews.vapour.ws/r/3046/ and http://reviews.vapour.ws/r/3104 for me? not urgent, just if/when you have some time05:44
axw_wallyworld: ok, probably not till later on05:45
wallyworldnp, no rush05:45
axw_just wrapping up azure changes to support legacy05:45
wallyworldawesome, can definitely wait till after that05:45
=== urulama__ is now known as urulama
axw_wallyworld: are you around?08:24
axw_wallyworld: never mind, self approving my merge of master into azure-arm-provider08:27
axw_mgz_: are you able to add "azure-arm-provider" as a feature branch to CI?08:28
axw_mgz_: or is it automatic...?08:29
wallyworldaxw_: it's automatic, but unless you ask, it won't get to the top of the queue08:38
wallyworldaxw_: sorry, was eating08:38
axw_wallyworld: thanks08:44
axw_wallyworld: FYI, PR to merge the azure-arm provider into the feature branch: https://github.com/juju/juju/pull/370109:07
axw_wallyworld: warning, it's extremely large09:07
wallyworldaxw_: ty, will look09:11
mwhudsonoh, not that arm09:29
frobwaredimitern, ping 1:1?09:34
dimiternfrobware, hey, oops - omw09:34
dimiternvoidspace, jam, fwereade, dooferlad, standup?10:00
dimiternjamespage, gnuoy, juju/os call?10:31
jamespagedimitern, 2 mins10:31
frobwaredimitern, I moved the openstack meeting to 16:30, but that may be too late for you.11:15
dimiternfrobware, it's fine for me as scheduled11:24
frobwaredimitern, thanks & appreciated11:24
dimiternvoidspace, reviewed11:31
frobwaredimitern, voidspace, dooferlad: http://reviews.vapour.ws/r/3102/11:56
dimiternfrobware, looking11:56
dimiternfrobware, btw updated http://reviews.vapour.ws/r/3088/ to fix the mac address issue with address-allocation enabled for kvm12:07
dimiternand tested it to work12:07
frobwaredimitern, just saw it. checking my change against voidspace's change at the moment.12:07
dimiternfrobware, hmm, so you decided to go for the full mile there - always using addresses instead of hostnames if possible, even in status12:11
frobwaredimitern, if there's an address that is not resolvable you cannot connect to the machine.12:12
dimitern(rather than just for mongo peer host/ports)12:12
frobwaredimitern, we're fixing the wrong bug, IMO. We need to fix maas.12:12
frobwaredimitern, see the commit message for why you need to drop unresolvable names12:13
dimiternfrobware, yeah, fair enough12:13
dimiternfrobware, there is however a ResolveOrDropHostnames that does almost the same thing in hostport.go12:14
frobwaredimitern, the trouble is that it resolves12:14
frobwaredimitern, let's chat instead. HO?12:15
dimiternfrobware, ok, I'm joining the standup one12:15
frobwaredimitern, voidspace, dooferlad: "maas-spaces" feature branch created12:53
dimiternfrobware, awesome! let's get cranking :)12:53
frobwaredimitern, T-3 weeks...12:58
dimiternfrobware, yeah, it's not a lot, is it :/13:06
=== akhavr1 is now known as akhavr
mupBug #1514857 opened: cannot use version.Current (type version.Number) as type version.Binary <juju-core:Incomplete> <juju-core lxd-provider:Triaged> <https://launchpad.net/bugs/1514857>14:31
dimiterndamn...why did I spend almost a week fixing 1.2414:49
dimiternno 1.24.8 :(14:49
voidspacedimitern: yeah, shame15:04
voidspacedimitern: and there won't be a version of 1.24 with containers and "ignore-machine-addresses" working15:04
dimiternvoidspace, if 1.24 dies quickly, that won't be a big deal :)15:05
voidspacedimitern: hopefully15:05
mupBug #1502306 changed: cannot find package gopkg.in/yaml.v2 <blocker> <ci> <regression> <juju-core:Invalid> <juju-core lxd-provider:Fix Released> <https://launchpad.net/bugs/1502306>15:19
mupBug #1514874 opened: Invalid entity name or password error, causes Juju to uninstall <sts> <juju-core:New> <https://launchpad.net/bugs/1514874>15:19
mupBug #1514877 opened: Env not found immediately after bootstrap <blocker> <ci> <regression> <test-failure> <juju-core:Incomplete> <juju-core controller-rename:Triaged> <https://launchpad.net/bugs/1514877>15:19
katcofwereade: hey ran into bug 1503039 last friday while writing a reactive charm. any reason not to set that env. variable all the time?15:21
mupBug #1503039: JUJU_HOOK_NAME does not get set <charms> <docs> <hooks> <juju-core:Triaged> <https://launchpad.net/bugs/1503039>15:21
fwereadekatco, nah, go ahead and set it always15:26
fwereadekatco, it was originally just for debug-hooks, when you wouldn't know15:27
fwereadekatco, and you *can* always look at argv[0]15:27
fwereadekatco, but better just to be consistent across the board15:27
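fwereade notes that a hook can recover its name from argv[0] even when JUJU_HOOK_NAME is unset, because the hook executable is named after the hook. Below is a small Go sketch of that fallback. `hookName` is a hypothetical helper, and reactive charms would normally do this in Python or shell. The environment lookup is passed in so the logic can be tested without a real hook environment.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hookName prefers JUJU_HOOK_NAME and falls back to the basename of argv[0].
// Hypothetical helper for illustration.
func hookName(getenv func(string) string, argv0 string) string {
	if name := getenv("JUJU_HOOK_NAME"); name != "" {
		return name
	}
	return filepath.Base(argv0)
}

func main() {
	fmt.Println(hookName(os.Getenv, os.Args[0]))
}
```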
katcofwereade: kk ty just wanted to check15:27
fwereadekatco, cheers15:28
* fwereade gtg out, back maybe rather later15:28
* katco waves15:28
katcoericsnow: did you use git mv for your cleanup patch?15:33
ericsnowkatco: yep15:34
ericsnowkatco: the GH diff is a little easier to follow15:34
katcoericsnow: i wish RB would detect that and show just the diffs instead of all green15:34
ericsnowkatco: yep, me too15:35
perrito666ahh RB the source of most of our wishes :p15:35
marcoceppi_alexisb: is anyone working on this? https://bugs.launchpad.net/juju-core/+bug/1488139 will it actually make it to alpha2?15:38
mupBug #1488139: juju should add nodes IPs to no-proxy list <network> <proxy> <juju-core:Triaged> <https://launchpad.net/bugs/1488139>15:38
alexisbcherylj, ^^^15:38
voidspacedimitern: ping15:59
voidspacedimitern: for "pick provider first" for addresses the upgrade step is AddPreferredAddressesToMachine16:00
voidspacedimitern: that's the same upgrade function used to add preferred addresses to machines in the first place16:00
dooferladpro tip: if you uninstall maas, make sure that you get rid of maas-dhcp16:00
voidspacedimitern: 1.25 already calls this as an upgrade step, so I assert that the backport to 1.25 doesn't need to add a new upgrade step...16:00
voidspacedooferlad: :-)16:00
dooferladtwo DHCP servers on the same network results in such fun :-(16:00
voidspacedooferlad: there are about seven billion maas packages16:00
dooferladvoidspace: indeed. I think it didn't get maas-dhcp when I uninstalled because by default it isn't installed with the maas metapackage16:01
dimiternvoidspace, that sounds good16:19
perrito666well, in a whole new way of creepiness, google now adds the flight to your personal calendar when you get your plane tickets via email16:33
perrito666even though the email was not the usual plain text reservation16:34
marcoceppi_we need help, our websocket connection keeps dying during a deployment, tanking charm testing for power8.16:41
marcoceppi_these are the last few lines of the log16:41
marcoceppi_INFO juju.rpc server.go:328 error closing codec: EOF16:42
marcoceppi_what does that mean^?16:42
natefinchmarcoceppi_: I think in this case, EOF should be treated like "not an error"16:48
natefinchmarcoceppi_: yeah, looking at the code, that just means it probably was already closed16:48
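natefinch reads `error closing codec: EOF` as meaning that the connection was already gone when the close happened. Below is a minimal sketch of that policy, using a hypothetical `closeQuietly` helper rather than juju's rpc code. It does not explain why marcoceppi_'s websocket was dropped in the first place.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// closeQuietly closes c but treats io.EOF (including wrapped EOFs) as
// "already closed" rather than as a failure.
func closeQuietly(c io.Closer) error {
	if err := c.Close(); err != nil && !errors.Is(err, io.EOF) {
		return err
	}
	return nil
}

// eofCloser is a fake codec whose Close reports EOF, as if the peer had
// already gone away.
type eofCloser struct{}

func (eofCloser) Close() error { return io.EOF }

func main() {
	fmt.Println(closeQuietly(eofCloser{})) // <nil>
}
```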
marcoceppi_well, we've been wrestling with this for a few days now, and we're stuck: every time, after a few mins, the websocket abruptly closes and tanks the python websocket library, which kills python-jujuclient, which kills amulet16:51
marcoceppi_so we're unable to run charm tests on our power8 maas16:51
marcoceppi_I'm prepared to provide anyone willing to help logs or whatever else is needed. I've exhausted my troubleshooting16:52
natefinchalexisb: ^16:54
alexisbmarcoceppi_, is this related to the bug you pointed at earlier?16:58
marcoceppi_alexisb: it's a machine running behind the great canonical firewall, we've got some things punched through and the rest we're using an http proxy. It seems this breakage always happens around the same time so we're removing as much of the proxy to test further16:59
marcoceppi_alexisb: long story short, not sure if this is related, we've manually no-proxy listed /everything/ for the environment so while that bug will help, it's not likely going to resolve whatever we're hitting17:00
alexisbso marcoceppi_ do you have a system we can triage?17:07
marcoceppi_alexisb: yes, but it's behind the vpn and some special grouping, though I may be able to get someone access if they aren't in that group17:08
marcoceppi_alexisb: ignore, yes we have a system to triage17:08
alexisbkatco, can you get someone on your team to work w/ marcoceppi_ please17:09
alexisbmarcoceppi_, we will need to make sure there is a bug open to track status17:09
katcoalexisb: yep17:10
katcomarcoceppi_: is there already a bug for this?17:10
marcoceppi_alexisb: I'll file a bug, though I'm not sure what to call the problem17:10
marcoceppi_we're not even able to diagnose the source of the problem17:10
katcomarcoceppi_: that's ok, we can iterate on the title :)17:11
marcoceppi_katco: https://bugs.launchpad.net/juju-core/+bug/151492217:12
mupBug #1514922: Deploying to maas ppc64le with proxies kills websocket <juju-core:New> <https://launchpad.net/bugs/1514922>17:12
katcomarcoceppi_: can you also update the bug with the details of what you've been discussing here, and any relevant logs?17:12
katcomarcoceppi_: (ty for filing a bug)17:13
rick_h__urulama: frankban ^ did we see something with the websockets closing on us?17:14
rick_h__urulama: frankban please see if this sounds familiar at all, with our 'ping' and such17:14
urulamawell, it was through apache ... not sure what is meant by proxy in the bug? apache reverseproxy?17:15
marcoceppi_katco: updated17:15
katcomarcoceppi_: ty sir17:15
frankbanrick_h__, urulama it does not look familiar17:16
rick_h__frankban: ok, thanks17:16
marcoceppi_fwiw, I can connect and deploy just fine to the environment, it's when we keep a persistent websocket connection open that it tanks after a few mins of websocketing, or whatever websockets do17:17
marcoceppi_this build script works without issue on all other testing substrates17:17
katcomarcoceppi_: so it's *just* ppc?17:18
marcoceppi_katco: well it's the only maas we have access to17:18
marcoceppi_it just so happens to be ppc64le17:18
katcomarcoceppi_: gotcha... what do you mean when you say it works on all other testing substrates?17:18
marcoceppi_katco: gce, aws, openstack, etc17:19
marcoceppi_katco: this job runs all our other charm testing substrates, which are public clouds and local17:19
katcomarcoceppi_: ah ok17:19
natefinchit's unfortunate that our only MAAS environment is also on a wacky architecture17:20
marcoceppi_well, it's not the only maas environment for testing, juju ci has a few they use. It's the only maas environ we have for charm testing and it's maas because no public cloud has power8 yet17:20
natefinchmarcoceppi_: it's a shame it's the only MAAS environment *you* have for testing, then :)17:21
marcoceppi_hah, yes.17:21
mupBug #1514922 opened: Deploying to maas ppc64le with proxies kills websocket <juju-core:New> <https://launchpad.net/bugs/1514922>17:25
marcoceppi_katco: this seems to be related to http-proxy juju environment stuff. We removed all but the apt-*-proxy keys and the websocket didn't die17:44
katcomarcoceppi_: hm ok thanks that helps17:45
mupBug #1514616 opened: juju stateserver does not obtain updates to availability zones <kanban-cross-team> <landscape> <juju-core:New> <https://launchpad.net/bugs/1514616>17:55
marcoceppi_katco: it appears setting apt-http-proxy and other env variables does not do what is expected17:59
marcoceppi_katco alexisb this isn't a priority for today, there are too many sharp sticks in our eyes to get a clear enough vision on this18:08
marcoceppi_for today anymore*18:09
marcoceppi_but it's very much a problem we will need fixed for 1.2618:09
marcoceppi_If getting on a hang out to explain this more helps, lmk18:09
cmarsperrito666, can I get a review of http://reviews.vapour.ws/r/3041/ ? it's a bugfix for LP:#1511717 backported to 1.2518:24
mupBug #1511717: Incompatible cookie format change <blocker> <ci> <compatibility> <regression> <juju-core:Fix Released by cmars> <juju-core 1.25:In Progress by cmars> <juju-core 1.26:Fix Committed by cmars> <https://launchpad.net/bugs/1511717>18:24
katcomarcoceppi_: just lmk when you get a better idea of what's going on18:31
marcoceppi_katco: we have no idea what's going on. We just know it's not getting resolved in 2 hours time18:32
cheryljericsnow: ping?18:34
ericsnowcherylj: hey18:34
cheryljhey ericsnow :)  got a question for you about systemd18:34
ericsnowcherylj: sure18:34
cheryljericsnow: was there a reason you linked the service files, rather than copying them over?  just out of curiosity18:35
ericsnowcherylj: was trying to stick just to the systemd API rather than copying any files18:36
cheryljericsnow: okay, I was just wondering.  I've seen 2 bugs from people doing things we wouldn't expect that cause problems with just using links.18:36
cheryljI'm okay with making those special cases work around juju :)18:37
ericsnowcherylj: sounds good18:37
ericsnowcherylj: np :)18:37
perrito666cmars: sure you can18:55
perrito666sorry was afk for a moment18:55
perrito666cmars: shipit18:58
cmarsperrito666, thanks!19:12
=== urulama is now known as urulama__
natefinchand.... master is blocked, dangit20:56
natefinchericsnow, wwitzel3: can you guys review http://reviews.vapour.ws/r/3103/ real quick?  It's best to look at the PR (https://github.com/juju/juju/pull/3698) rather than reviewboard, because 99% of the code has already been reviewed, only a few small tweaks need to be reviewed (everything but the cherry-picked merge).21:02
ericsnownatefinch: looking21:04
natefinchI just made the worker into a singular worker and updated a test to check that.  The last commit is really just redoing work in the first commit, because I cherry-picked the merge afterward (a result of me doing things in the wrong order, but seemed like not worth the trouble to redo it in the right order)21:06
ericsnownatefinch: LGTM21:06
natefinchericsnow: thanks :)21:07
natefinchkatco, ericsnow: ug, looking at the failures on the lxd branch, I think it's just that some stuff changed out from underneath us... but when I rebase, I get 332 merge conflicts :/21:37
ericsnownatefinch: the patch I have up for review fixes most of those errors21:38
ericsnownatefinch: http://reviews.vapour.ws/r/3101/21:38
katconatefinch: we rebased off the last bless of master21:38
katconatefinch: i.e. we're intentionally behind master21:38
katcoericsnow: the patch you have up fixes the things that cursed our branch?21:43
ericsnowkatco: several of them21:43
ericsnowkatco: oh21:44
natefinchthe one I was looking at was this one: https://bugs.launchpad.net/juju-core/+bug/151485721:44
ericsnowkatco: not that cursed our branch21:44
mupBug #1514857: cannot use version.Current (type version.Number) as type version.Binary <blocker> <ci> <regression> <test-failure> <juju-core:Incomplete> <juju-core lxd-provider:Triaged> <https://launchpad.net/bugs/1514857>21:44
ericsnowkatco: rather, the Wily test failures (which will curse our branch soon enough)21:44
katcoericsnow: ah ok. natefinch looks like you're still good to look at the curses21:45
natefinchkatco: ok21:45
natefinchmy problem is figuring out why there's a compile issue. Seems like we got half of a change or something21:50
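The error in bug 1514857 is a plain Go type mismatch: a `version.Number` was passed where a `version.Binary` was expected. The hypothetical sketch below uses local stand-in types whose fields are assumptions, not juju's definitions. It shows the shape of the problem when one side of such a change lands without the other: the caller still holds the narrower value and now has to build the wider one explicitly. It does not claim to be the fix that was actually applied.

```go
package main

import "fmt"

// Number and Binary are local stand-ins named after the types in the
// compile error; their fields are assumptions for illustration only.
type Number struct{ Major, Minor, Patch int }

type Binary struct {
	Number
	Series string
	Arch   string
}

func (n Number) String() string { return fmt.Sprintf("%d.%d.%d", n.Major, n.Minor, n.Patch) }

func (b Binary) String() string { return fmt.Sprintf("%s-%s-%s", b.Number, b.Series, b.Arch) }

// current is a plain Number, as after a change that narrowed its type.
var current = Number{1, 26, 0}

// toolsFor expects a full Binary; passing current directly would fail to
// compile with "cannot use current (type Number) as type Binary".
func toolsFor(b Binary) string { return "juju-" + b.String() }

func main() {
	// The caller now has to supply the missing series and arch itself.
	fmt.Println(toolsFor(Binary{Number: current, Series: "trusty", Arch: "amd64"}))
}
```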
ericsnownatefinch: looks like katco didn't use the merge bot :P21:54
katcoericsnow: i did not. is it causing problems?21:55
ericsnowkatco: yeah, the merge broke some code21:55
ericsnowkatco: the merge bot would have caught it21:55
katcoericsnow: oh oops :( sorry natefinch21:57
katcoericsnow: natefinch: the idea was to get a bless anyway, so skipped the bot. shouldn't have done that21:57
natefinchkatco, ericsnow: http://reviews.vapour.ws/r/3110/21:59
ericsnownatefinch: LGTM22:00
natefinchI gotta run, it's time o'clock, as my 2 year old would say.   But I can land this later, or someone else can $$merge$$ as they wish22:01
natefincheverything compiles now... there are some maas timeouts, but I'm guessing those are spurious.22:02
=== natefinch is now known as natefinch-afk
natefinch-afkback later.  have a lot of work time left for today.22:02
mupBug #1515016 opened: action argument with : space is incorrectly interpreted as json <juju-core:New> <https://launchpad.net/bugs/1515016>22:20
=== akhavr1 is now known as akhavr
katcoericsnow: wwitzel3: natefinch-afk: please don't forget to update your bugs with status for the day22:37
ericsnowkatco: will do22:37
wwitzel3katco: rgr22:38
wallyworldaxw_: perrito666: give me a minute23:15
anastasiamacwallyworld: k23:15
