jcw4 | wallyworld_: looking at the error messages I was wondering if it was an issue with stale .a files on the test machine... is that possible? | 00:34 |
---|---|---|
wallyworld_ | jcw4: maybe, but i can reproduce locally using compiler=gccgo | 00:34 |
wallyworld_ | i get lots of segfaults as well | 00:34 |
jcw4 | wallyworld_: ah.. I misunderstood... I thought you couldn't repro locally. | 00:34 |
wallyworld_ | but i can get the test failure | 00:34 |
jcw4 | wallyworld_: is there any way to debug if you don't have a ppc machine? | 00:35 |
wallyworld_ | i can't reproduce unless i just run one test at a time | 00:35 |
wallyworld_ | you have to know how gccgo works i think | 00:35 |
wallyworld_ | i have no idea :-( | 00:35 |
jcw4 | I see | 00:35 |
jcw4 | :) | 00:35 |
wallyworld_ | davecheney: you around? | 00:35 |
davecheney | wallyworld_: ack | 00:37 |
wallyworld_ | davecheney: bug 1365480 is blocking ci. it appears to be a gccgo issue because it fails to run the hooks used to mock out method calls | 00:38 |
mup | Bug #1365480: ppc64el unit tests fail in many ways <ci> <ppc64el> <regression> <juju-core:Triaged by wallyworld> <https://launchpad.net/bugs/1365480> | 00:38 |
wallyworld_ | i have no idea how to fix | 00:38 |
wallyworld_ | this is failing to work | 00:39 |
wallyworld_ | cleanup := s.srv.Service.Nova.RegisterControlPoint( | 00:39 |
wallyworld_ | "addFloatingIP", | 00:39 |
wallyworld_ | func(sc hook.ServiceControl, args ...interface{}) error { | 00:39 |
wallyworld_ | return fmt.Errorf("failed on purpose") | 00:39 |
wallyworld_ | }, | 00:39 |
wallyworld_ | ) | 00:39 |
wallyworld_ | the register func uses the stackframe to figure out what to do | 00:39 |
wallyworld_ | i guess it's broken again - i think it was broken before at some point? | 00:40 |
wallyworld_ | there's several tests affected | 00:40 |
davecheney | yeah, it breaks a bit | 00:42 |
davecheney | has the version of gccgo on the builder machine changed ? | 00:43 |
wallyworld_ | nfi | 00:43 |
wallyworld_ | i was running gccgo (Ubuntu 4.9.1-10ubuntu2) 4.9.1 | 00:43 |
wallyworld_ | the build machine had gccgo (Ubuntu 4.9.1-12ubuntu2) 4.9.1 | 00:43 |
wallyworld_ | i updated to -12 locally | 00:44 |
davecheney | ok, and it only repro's on ppc ? | 00:44 |
wallyworld_ | i can repro locally using -compiler=gccgo | 00:44 |
wallyworld_ | but i get LOTS of segfaults | 00:44 |
wallyworld_ | i have to specify each test one at a time | 00:44 |
wallyworld_ | and yes, ci fails when running on ppc | 00:44 |
wallyworld_ | davecheney: am i able to ask you to look into this a bit? i have no idea where to start with regard to gccgo | 00:55 |
davecheney | wallyworld_: can I fix it on monday ? | 00:59 |
wallyworld_ | davecheney: it's blocking landings sadly, unless we can get the regression tag removed | 01:00 |
davecheney | i recommend removing the regression tag | 01:01 |
davecheney | if this is a compiler fix | 01:01 |
davecheney | we can't do that at critical level | 01:01 |
wallyworld_ | sinzui: bug 1365480 looks like a gccgo issue, is there any way we can remove the regression tag? | 01:01 |
mup | Bug #1365480: ppc64el unit tests fail in many ways <ci> <ppc64el> <regression> <juju-core:Triaged by wallyworld> <https://launchpad.net/bugs/1365480> | 01:01 |
wallyworld_ | davecheney can do a compiler fix but not till monday | 01:01 |
sinzui | wallyworld_, your made | 01:01 |
davecheney | i can look at it on monday | 01:02 |
sinzui | mad | 01:02 |
davecheney | i can't promise a fix | 01:02 |
sinzui | wallyworld_, the old version of juju works, and now it doesn't. | 01:02 |
wallyworld_ | sinzui: i can prove that code which has not been touched for ages fails because gccgo does not register the monkey patch being applied | 01:03 |
sinzui | wallyworld_, we can retest an older revision, maybe the one that passed. If the test fails like the new revision then we know something other than juju changes | 01:03 |
wallyworld_ | gccgo can be fragile when it comes to looking at the call stack | 01:04 |
wallyworld_ | which is how the monkey patching stuff works | 01:04 |
wallyworld_ | fragile = different to golanggo | 01:04 |
sinzui | I will retest the last passing revision, if it fails the same way then you are vindicated | 01:05 |
wallyworld_ | sinzui: was gccgo updated recently? | 01:05 |
wallyworld_ | on the test vm? | 01:05 |
sinzui | wallyworld_, We would see that in the first test that failed | 01:05 |
sinzui | wallyworld_, you lose, http://juju-ci.vapour.ws:8080/job/run-unit-tests-trusty-ppc64el/1213/console clearly states that gcc was already the latest version and that no packages were installed for the test | 01:06 |
wallyworld_ | and yet the tests that are failing have not changed and the failure is clearly due to gccgo not executing monkey patched code that the tests rely on to pass | 01:08 |
wallyworld_ | i put a panic in the code and it did not trigger | 01:08 |
sinzui | wallyworld_, there is a difference, but it is not seen in installs... | 01:08 |
sinzui | The passing one has | 01:09 |
sinzui | go version xgcc (Ubuntu 4.9.1-10ubuntu3) 4.9.1 linux/ppc64 | 01:09 |
sinzui | The failing one has | 01:09 |
sinzui | go version xgcc (Ubuntu 4.9.1-12ubuntu2) 4.9.1 linux/ppc64 | 01:09 |
wallyworld_ | yes, that's what i used to have here till i upgraded | 01:09 |
wallyworld_ | i'm on utopic now and it doesn't give me an option to go back to -10 | 01:09 |
wallyworld_ | wait | 01:10 |
wallyworld_ | yes it does | 01:10 |
sinzui | wallyworld_, I can look into this after I avert the disaster that really cannot be averted | 01:10 |
wallyworld_ | ok | 01:10 |
wallyworld_ | i'll try testing with -10 | 01:10 |
davecheney | ok, this is not good | 01:11 |
davecheney | -12 must be the new version in proposed which fixes a different bug | 01:11 |
sinzui | FU&CKI | 01:12 |
sinzui | wallyworld_, even after using s3cmd to sync the tools that are on aws, I still get different filesizes from streams.canonical.com | 01:13 |
wallyworld_ | i can't seem to get apt to allow me to downgrade to -10 to test | 01:13 |
wallyworld_ | wot :-( | 01:13 |
wallyworld_ | sinzui: that is not good :-( | 01:14 |
davecheney | wallyworld_: juju bootstrap && juju deploy cs:ubuntu | 01:14 |
sinzui | wallyworld_, Am I experiencing this because i finally reported the versioning issue as a bug https://bugs.launchpad.net/juju-core/+bug/1365633 | 01:15 |
mup | Bug #1365633: cannot rebuild replacement tools for streams <ci> <juju-core:Triaged> <https://launchpad.net/bugs/1365633> | 01:15 |
wallyworld_ | looking | 01:15 |
sinzui | wallyworld_, We have lived with this since February, I report the bug and now I need the fix | 01:15 |
sinzui | wallyworld_, tools that should be identical are not, I cannot give them extra version information to differentiate their origin to avoid confusion or outright malign intent | 01:16 |
wallyworld_ | sinzui: simplestreams supports versioning using dates | 01:17 |
sinzui | wallyworld_, that is not helping the users | 01:17 |
wallyworld_ | new tools tarballs with different names could be uploaded | 01:17 |
wallyworld_ | and new metadata with a newer date added | 01:18 |
sinzui | wallyworld_, I am going to remake this data, and now I can expect users to complain that tools of the same name don't match | 01:18 |
wallyworld_ | the tarball name used to matter before simplestreams but it doesn't now | 01:18 |
sinzui | wallyworld_, but these tools from two different machines that should be the same have different sums | 01:18 |
wallyworld_ | the tools tarball could be called juju-1.20.6-release1-precise-amd64 and juju-1.20.6-release2-precise-amd64. | 01:19 |
wallyworld_ | which one to use comes from the simplestreams metadata | 01:20 |
wallyworld_ | the latter one would be in the metadata with a later date, so that would be picked up if juju asks for which tools to use for series/arch/release | 01:21 |
wallyworld_ | maybe i'm missing something | 01:21 |
sinzui | wallyworld_, That would help. when I tested alternate names for tools, the metadata command ignored them :( | 01:21 |
wallyworld_ | that may be a limitation of that command :-( | 01:22 |
wallyworld_ | which needs to be fixed | 01:22 |
sinzui | I have done evil things to preserve the greater good | 01:22 |
wallyworld_ | that command i think from memory does use the filename to suck stuff in | 01:23 |
wallyworld_ | it could be made smarter | 01:23 |
sinzui | yeah, the convention is convenient for many people copying tools. | 01:23 |
wallyworld_ | or made so it can be called from a script, passing in the required tarball and params | 01:24 |
sinzui | wallyworld_, maybe... | 01:24 |
wallyworld_ | sinzui: we are moving to a shared tarball across series | 01:24 |
wallyworld_ | ie one tarball only for precise/trusty/utopic | 01:24 |
wallyworld_ | since they are the same | 01:25 |
sinzui | I have just reconciled the diffs from what was last in the CPCs and my own machine to make a json that describes what was there and what I am now uploading. | 01:25 |
wallyworld_ | so the filename will become less relevant | 01:25 |
sinzui | wallyworld_, I would like to do that. The number of tools we make and publish do take a lot of time | 01:26 |
wallyworld_ | yes indeed :-( | 01:26 |
wallyworld_ | it's sorta happening now as part of moving tools into mongo storage | 01:26 |
wallyworld_ | and removing the need for cloud storage | 01:27 |
sinzui | wallyworld_, I think if this command worked for azure, we might have prevented my misadventure | 01:27 |
sinzui | juju metadata validate-tools --juju-version 1.20.7 | 01:27 |
wallyworld_ | oh, azure doesn't currently support custom metadata | 01:28 |
wallyworld_ | because there's no central storage we can use, from memory | 01:29 |
sinzui | wallyworld_, but joyent does. I don't understand? isn't the command getting the json and answering the version question? | 01:29 |
wallyworld_ | like we have for aws and hp cloud | 01:29 |
wallyworld_ | it's been ages since i looked at that stuff - from memory it's because there's no support for a custom search path on azure, i can't recall why | 01:30 |
wallyworld_ | i'll have to go digging in the code | 01:30 |
wallyworld_ | and even if no custom tools location is supported, i would think the metadata command should still work | 01:30 |
wallyworld_ | don't know why it doesn't :-( | 01:31 |
sinzui | wallyworld_, oh yes, now I understand. I faced some of that using their python adk | 01:31 |
sinzui | sdk | 01:31 |
sinzui | wallyworld_, We add md5 and shasum metadata to each tool we upload to azure and manta because we wrote our own rsync tools to do what real storage systems do | 01:32 |
wallyworld_ | ok | 01:32 |
sinzui | manta still sucks though. there is a 5 minute period where we make 1000+ calls to look up the sums because it doesn't support bulk queries | 01:33 |
sinzui | well swift doesn't either, but the web/xml interface does | 01:33 |
wallyworld_ | 1000+ !! | 01:34 |
wallyworld_ | sinzui: we will soon not need cloud storage for juju | 01:34 |
=== Ursinha is now known as Ursinha-afk | ||
sinzui | wallyworld_, indeed...part of the tools problem is that each machine is downloading tools from one or more sources and that allows for mismatches | 01:35 |
wallyworld_ | yeah, so soon all machines will get tools from the state server | 01:35 |
wallyworld_ | the tools are loaded into the state server on bootstrap | 01:35 |
sinzui | wallyworld_, I am 1. starting a rebuild of the last good master rev. I am 2, looking for the old packages to revert one of the machines to | 01:39 |
wallyworld_ | ty | 01:40 |
wallyworld_ | i've updated the bug with my thoughts | 01:40 |
sinzui | wallyworld_, I might be able to go back to what was in place on Aug 31 http://ports.ubuntu.com/pool/universe/g/gcc-4.9/ | 01:42 |
wallyworld_ | sinzui: that would be great. you may also find that gcc-base and other packages need downgrading also | 01:43 |
sinzui | wallyworld_, yeah, that is what makes this hard | 01:44 |
wallyworld_ | indeed :-( | 01:44 |
=== Ursinha-afk is now known as Ursinha | ||
thumper | wallyworld_: I'm going to see if I can fix this bug: https://bugs.launchpad.net/juju-core/+bug/1348477 | 02:08 |
mup | Bug #1348477: userAuthenticatorSuite.TearDown failure <ci> <intermittent-failure> <regression> <test-failure> <juju-core:Triaged by cmars> <https://launchpad.net/bugs/1348477> | 02:08 |
thumper | wallyworld_: I have a plan | 02:08 |
wallyworld_ | thumper: awesome, can we catch up in a sec, i'm otp with axw | 02:09 |
sinzui | wallyworld_, you are vindicated by the replay of the passing tarball | 02:27 |
sinzui | wallyworld_, I am too tired to install the old packages. Maybe I shouldn't because I am not awake enough to know that this is stupid | 02:28 |
wallyworld_ | sinzui: \o/ does that mean we can remove the regression tag and unblock landings? | 02:28 |
wallyworld_ | sinzui: we do need to fix the compiler still | 02:28 |
wallyworld_ | dave can look at that on monday | 02:28 |
sinzui | wallyworld_, I am going to take the test's voting rights away. if it starts passing, then we can assume the code or the compiler are in agreement and restore the vore | 02:28 |
sinzui | vote | 02:28 |
wallyworld_ | great,sounds good, | 02:29 |
sinzui | I can do this now, and then add the real source for the bug | 02:29 |
wallyworld_ | sinzui: is there an eta then on landings being unblocked? | 02:29 |
sinzui | wallyworld_, I will lower the priority of the bug because obviously we cannot do anything now that it is out of our power...let me fix the vote first | 02:30 |
wallyworld_ | ok, thank you :-) | 02:30 |
sinzui | oh, actually. I cannot go to sleep until this test completes | 02:31 |
wallyworld_ | :-( | 02:31 |
sinzui | wallyworld_, on the other hand the apiserver.metrics might actually have problems. but without a safe compiler, we won't know | 02:32 |
wallyworld_ | thumper: did you want to talk about your plan? | 02:32 |
thumper | wallyworld_: yeah, cause it isn't working | 02:32 |
wallyworld_ | ok, see you in onyx standup hangout? | 02:33 |
thumper | ok | 02:33 |
=== ChanServ changed the topic of #juju-dev to: https://juju.ubuntu.com | On-call reviewer: see calendar | Blocking bugs: None | ||
thumper | https://github.com/juju/juju/pull/683 anyone? refactoring work still from this week's mega branch being broken up | 03:24 |
thumper | bug fix coming for auth failed | 03:24 |
thumper | wallyworld_: https://github.com/juju/juju/pull/685 | 03:26 |
wallyworld_ | looking | 03:26 |
katco | wallyworld_: hey thanks for landing all my branches :) take away the fun part why don't ya! | 03:35 |
katco | and now i'm off to bed. night all. | 03:43 |
thumper | axw: could I get you to cast your eyes over https://github.com/juju/juju/pull/642 again? | 03:51 |
axw | looking | 03:51 |
thumper | axw: I've updated it based on recent changes and your suggestions | 03:51 |
axw | thumper: line 20 can be dropped I think | 03:54 |
thumper | sure | 03:54 |
thumper | will do | 03:54 |
thumper | and pushed | 03:55 |
axw | thumper: reviewed, thank you | 03:56 |
thumper | nm | 03:56 |
=== urulama_afk is now known as urulama | ||
axw | wallyworld_: https://github.com/axw/juju/compare/state-tools-take2 if you're interested in seeing the core changes | 05:43 |
axw | fixing tests again now | 05:43 |
wallyworld_ | sure, looking | 05:43 |
axw | wallyworld_: apiserver/common/tools.go and apiserver/tools.go are probably of most interest | 05:44 |
axw | wallyworld_: also cmd/jujud/bootstrap.go | 05:44 |
wallyworld_ | kk, just got a phone call, will look soon | 05:44 |
wallyworld_ | axw: ToolsStorager NOOOOOOOOOOO | 06:14 |
axw | heh | 06:15 |
wallyworld_ | not funny :) | 06:15 |
axw | ToolsStorageProvider? it's really a very minor thing, I don't really care | 06:15 |
axw | Getter is just as horrible to me | 06:16 |
wallyworld_ | is there already a "ToolStorage", can't recall | 06:16 |
axw | yes, but this is a thing that has a ToolsStorage method | 06:16 |
wallyworld_ | otherwise ToolsStorageProvider | 06:17 |
axw | ok | 06:17 |
wallyworld_ | sorry, i HATE that particular Go idiom | 06:17 |
=== liam_ is now known as Guest22638 | ||
wallyworld_ | fwereade: do you have a moment? | 07:08 |
mattyw | morning all | 07:23 |
TheMue | morning | 07:35 |
axw | wallyworld_: did you find anything obviously wrong, apart from that name? | 07:36 |
axw | morning TheMue | 07:36 |
wallyworld_ | axw: no, looked ok. i got distracted a bit by a bug report, let me just give it one more look | 07:36 |
axw | no worries | 07:37 |
axw | wallyworld_: doesn't need to be too deep, just wanted a glance over before I get too stuck into fixing tests | 07:37 |
axw | which reminds me, tests | 07:37 |
wallyworld_ | axw: nothing jumped out, but i didn't go over the find in storage logic too closely | 07:38 |
axw | ok | 07:39 |
axw | thanks | 07:39 |
=== Tribaal_ is now known as Tribaal | ||
wallyworld_ | dimitern: hi there | 08:18 |
dimitern | wallyworld_, hey | 08:20 |
wallyworld_ | dimitern: i backported your fix for allowing maas to disable network config to 1.20. the 1.20 branch is a little different to trunk. could you please review my back port? and type $$merge$$ if you are happy as i have to head to soccer https://github.com/juju/juju/pull/687 | 08:20 |
dimitern | wallyworld_, sure, looking | 08:20 |
wallyworld_ | thank you | 08:21 |
* wallyworld_ heads out to soccer | 08:21 | |
TheMue | dimitern: heya, mind another look at https://github.com/juju/juju/pull/626 ? | 09:59 |
TheMue | dimitern: it now also covers the simulation and testing of a V0 machiner API. | 10:00 |
dimitern | TheMue, cheers, will have a look | 10:00 |
TheMue | dimitern: great, thanks | 10:01 |
TheMue | dimitern, voidspace: hangout? | 10:45 |
voidspace | TheMue: omw | 10:46 |
voidspace | dimitern: after changing TIME_WAIT I haven't seen the tests fail... | 11:29 |
voidspace | dimitern: not conclusive, but they were failing regularly before | 11:30 |
voidspace | dimitern: I'll go to 2MB rate limit (used to fail every time) and see if they now pass | 11:30 |
dimitern | voidspace, good news then :) | 11:37 |
voidspace | dimitern: ah no, fail :-/ | 11:39 |
dimitern | voidspace, too bad.. but hey, it's some progress at least | 11:40 |
TheMue | so, back from lunch | 11:55 |
TheMue | dimitern: thanks for review | 11:56 |
TheMue | dimitern: only regarding the test for the providers I don't like to change | 11:56 |
TheMue | dimitern: simply so that all providers, also future ones, always follow the same approach | 11:57 |
dimitern | TheMue, well, I really don't like passing an opaque array of booleans | 11:57 |
TheMue | dimitern: I recognized it as advantage in the moment I added the testing for the V0 | 11:57 |
TheMue | dimitern: and I don't like to do everything the same way but only ... | 11:58 |
TheMue | dimitern: these exceptions always make it more difficult for later maintainers | 11:58 |
TheMue | dimitern: but I could change it that I define the standard behavior as a const (ok, it's a var), so the tests read better | 11:59 |
dimitern | TheMue, it will be difficult for anyone to see what [16]bool{true,true,true,false,false,...} actually means | 11:59 |
dimitern | TheMue, that sounds better, yes | 11:59 |
TheMue | var ExpectedStandardBehavior = [16]bool { ... } | 11:59 |
dimitern | TheMue, btw why [16]bool and not []bool ? | 12:00 |
TheMue | dimitern: OK, that's a compromise for me | 12:00 |
TheMue | dimitern: hey, we all love Go for its type safety. so why open a door to pass too few or too many values? | 12:00 |
TheMue | dimitern: only to save two chars? | 12:01 |
dimitern | TheMue, ok, as long as the [16]bool is hidden behind a var, I'm fine for the time being | 12:04 |
gsamfira | hello folks. If anyone has some time, can I get a review on: https://github.com/juju/utils/pull/27/ ? | 12:05 |
TheMue | dimitern: will hide it | 12:07 |
perrito666 | natefinch: fetching aurics, brt | 13:30 |
perrito666 | ericsnow: wwitzel3 do we? | 14:04 |
voidspace | so I can confirm that CurrentStatus will report members in PrimaryState/SecondaryState even when primary renegotiation is happening and the replica set is unstable | 14:32 |
voidspace | although it looks like it sets Uptime to 0 when that happens | 14:44 |
voidspace | who wrote the replicaset code? | 14:47 |
voidspace | It's part of juju not mgo | 14:47 |
natefinch | voidspace: I wrote the replicaset code. | 15:10 |
voidspace | natefinch: ok | 15:11 |
voidspace | natefinch: I've butchered the applyRelSetConfig code | 15:11 |
voidspace | natefinch: I don't think the loop inside that does quite what it looks like it does | 15:11 |
voidspace | natefinch: however I've got rid of it anyway, so my question is now moot | 15:11 |
natefinch | voidspace: heh ok | 15:11 |
voidspace | natefinch: I have a new WaitForMajorityHealthy function which we can use to tell when the replica set is stable | 15:12 |
voidspace | natefinch: so far it's mostly working - except for the times when it doesn't... | 15:12 |
sinzui | alexisb, I am going to delay 1.21-alpha1 until Monday. There are too many changes to write up as release notes in a single day. I honestly don't know what features are in this release and how to explain to users how to use them | 15:12 |
natefinch | voidspace: That was definitely not the finest code in the world. I wish there were better ways to do pretty much everything in that code... mostly around querying mongo for "WTF are you doing right now?" | 15:12 |
voidspace | natefinch: it's the fact that you change cmd to "Ping" | 15:12 |
alexisb | sinzui, understood, no one is pining for it today | 15:12 |
voidspace | natefinch: which is only useful if you re-enter the block "if err == io.EOF" | 15:13 |
alexisb | sinzui, you and I and Ian need to sync on release roadmap for 1.21 though | 15:13 |
voidspace | natefinch: which almost certainly isn't what Ping returns | 15:13 |
voidspace | natefinch: and even if Ping is successful we retry the loop instead of breaking | 15:13 |
voidspace | natefinch: as there's no check for err == nil | 15:13 |
voidspace | natefinch: if my function is reliable, it will look like this instead | 15:14 |
voidspace | natefinch: http://pastebin.ubuntu.com/8260487/ | 15:14 |
natefinch | voidspace: hmm yeah that's not good. That code has been tweaked by a lot of people who were trying to make it more reliable... it's quite possible there were some screw ups along the way. A lot of it was trial and error trying to figure out what mongo will do at any particular time. | 15:15 |
sinzui | alexisb, agreed | 15:15 |
natefinch | voidspace: can you show me waitformajorityhealthy? That's the key part that I had difficulty writing myself. | 15:16 |
natefinch | voidspace: also, when does session.Run return EOF? We should comment why that's an ok error to get | 15:17 |
voidspace | natefinch: http://pastebin.ubuntu.com/8260518/ | 15:17 |
voidspace | natefinch: I should add back a comment about that | 15:17 |
voidspace | natefinch: it's when changing the config causes primary re-negotiation so existing connections are dropped | 15:17 |
voidspace | natefinch: it's fine - we just need to refresh | 15:18 |
voidspace | natefinch: which WaitFor... does | 15:18 |
voidspace | natefinch: this is currently not stable - I'm sometimes seeing WaitFor... timeout, so I need to add some debugging | 15:18 |
voidspace | this is what I'm doing now | 15:18 |
voidspace | it *mostly* works | 15:18 |
natefinch | voidspace: thanks for putting in time on this, it'll make our code a lot more robust, and hopefully fix a lot of mongo related errors in the tests | 15:21 |
voidspace | maybe... :-/ | 15:21 |
voidspace | it's been dead end after dead end so far | 15:21 |
voidspace | this looks really promising, but I'm still seeing timeouts | 15:22 |
natefinch | sinzui: is amazon sick today? one of my PR's failed in a weird way: http://juju-ci.vapour.ws:8080/job/github-merge-juju/546/console | 15:35 |
sinzui | natefinch, that indeed looks like aws failed to provide an instance | 15:37 |
sinzui | natefinch, I saw messages yesterday that clearly state there weren't any instances of the size requested for the AZ :( | 15:38 |
natefinch | sinzui: I suppose AWS could just be busy | 15:39 |
perrito666 | mattyw: hey, are you around? | 15:55 |
mattyw | perrito666, yep | 16:00 |
hazmat | sinzui, that's a bug imo, juju should recover and try a different az | 16:07 |
hazmat | although that's different than what natefinch's build says | 16:08 |
perrito666 | mattyw: did you see axw's last pr? | 16:09 |
mattyw | perrito666, removing the call to setadminmongopassword? | 16:10 |
perrito666 | yup, I applied that and ran with and without your patch | 16:10 |
perrito666 | that seems to at least fix half of the errors yet the error related to presence is still there | 16:11 |
mattyw | perrito666, my patch? | 16:11 |
perrito666 | http://paste.ubuntu.com/8227111/ | 16:11 |
perrito666 | "patch" | 16:11 |
mattyw | perrito666, does axw branch make use of the change that thumper landed overnight? | 16:12 |
perrito666 | yes | 16:12 |
perrito666 | https://github.com/juju/juju/pull/688 | 16:12 |
natefinch | how the hell are you supposed to use juju run? I can't for the life of me figure out how to get it to do anything but say "unrecognized args <stuff in the command to run>" | 16:43 |
wesleymason | natefinch: juju run --service <servicename> 'command here' | 16:43 |
wesleymason | for example | 16:43 |
natefinch | in quotes? | 16:44 |
wesleymason | yeah, in single quotes so bash/zsh etc. doesn't interpolate first | 16:44 |
wesleymason | recommended anyway | 16:44 |
natefinch | ahh that was it. I was trying with -- to keep it from parsing flags.... we really need better help on that command | 16:45 |
natefinch | or like ONE example would be nice | 16:46 |
wesleymason | +1 | 16:46 |
natefinch | I'll work on that. bad help is a pet peeve of mine | 16:47 |
voidspace | natefinch: do you know how to debug "no reachable servers" errors? | 16:56 |
natefinch | voidspace: when initiating the replicaset? | 16:57 |
voidspace | natefinch: no, after applying a config change or during a Dial | 16:57 |
voidspace | natefinch: but in both cases I have a replicaset with several members | 16:57 |
natefinch | voidspace: either they're all still trying to come up, or the addresses are internal to the cloud, not public... | 16:59 |
voidspace | natefinch: it's during tests, so not a cloud issue | 16:59 |
voidspace | natefinch: and I'd like to know *how* to tell whether or not they're trying to come up | 16:59 |
natefinch | voidspace: I wish I knew | 16:59 |
voidspace | natefinch: as I've waited five minutes and CurrentStatus is failing | 16:59 |
voidspace | because of the connection error | 16:59 |
voidspace | hah, right | 16:59 |
natefinch | niemeyer: ^^ | 16:59 |
natefinch | niemeyer: we're trying to make our code more robust with respect to Mongo, especially when initiating a replicaset and when bringing up instances of mongo during testing. We get what appear to be random failures where sometimes they either never come up or take a really long time, or initiating takes a really long time. Part of the problem is that we don't really know how to figure out what state mongo is in... all we | 17:01 |
natefinch | can do is dial and see if it responds within a timeout. Is there some better way we can do this? | 17:01 |
voidspace | I'm seeing a lot of errors like: | 17:02 |
voidspace | [LOG] 6:43.772 DEBUG juju.testing tls.Dial(127.0.0.1:35846) failed with dial tcp 127.0.0.1:35846: connection refused | 17:02 |
voidspace | Even with session.Refresh() and waiting for (up to) five minutes | 17:02 |
niemeyer | natefinch: Yes, you can always ask the server for its status | 17:03 |
voidspace | niemeyer: how specifically? | 17:03 |
voidspace | calling CurrentStatus(session) is failing with connection refused | 17:03 |
niemeyer | voidspace: http://docs.mongodb.org/manual/reference/command/replSetGetStatus/ | 17:04 |
voidspace | niemeyer: that's precisely what CurrentStatus is doing | 17:04 |
niemeyer | voidspace: If the connection is refused, you know the status :) | 17:04 |
voidspace | niemeyer: any idea *why* sometimes our connections die like that and just don't come back | 17:04 |
niemeyer | voidspace: Okay, that's not what Nate said above | 17:04 |
niemeyer | voidspace: Hmm | 17:04 |
niemeyer | voidspace: Die with connection refused? | 17:05 |
voidspace | [LOG] 6:43.772 DEBUG juju.testing tls.Dial(127.0.0.1:35846) failed with dial tcp 127.0.0.1:35846: connection refused | 17:05 |
natefinch | niemeyer: sorry... what I mean is - we tell it to initiate... and then can never get it to respond | 17:05 |
niemeyer | voidspace: The TCP port is not open.. | 17:05 |
niemeyer | natefinch: Look at the logs | 17:06 |
niemeyer | natefinch: I've never seen anything similar before | 17:06 |
niemeyer | natefinch: the test suite of mgo routinely shoots servers down and brings them back up | 17:06 |
natefinch | niemeyer: it's the single most common failure for our tests - mongo going away and never coming back | 17:06 |
natefinch | niemeyer: it's quite likely we're just doing something wrong, we just don't know what that is. | 17:07 |
niemeyer | natefinch: That makes no sense.. a connection refusal is a TCP port not open, which in general means MongoDB is not even running | 17:07 |
niemeyer | natefinch: I'd look at the logs to see why | 17:07 |
natefinch | niemeyer: it's not always connection refusal... that's the problem this time, often times the dial will just time out eventually | 17:07 |
niemeyer | natefinch: Heh.. | 17:08 |
voidspace | that particular failure was during a call to instance.MustDialDirect() - *after* waiting for CurrentStatus to report all members up | 17:08 |
niemeyer | natefinch: First thing to do is make up your mind about what the symptom is :) | 17:08 |
voidspace | well, I just did another test run and got the same symptom | 17:08 |
voidspace | [LOG] 6:43.764 DEBUG juju.testing tls.Dial(127.0.0.1:37222) failed with dial tcp 127.0.0.1:37222: connection refused | 17:08 |
niemeyer | Yeah, that's a server down.. the logs will say why | 17:09 |
natefinch | voidspace: I think you'll need to hack the code a little to prevent gocheck from cleaning up the mongo directory, so you can look at the logs | 17:10 |
voidspace | niemeyer: do you know where the logs should be? I've got a horrible feeling we redirect mongo logging somewhere useless. | 17:10 |
voidspace | natefinch: ah, right | 17:10 |
voidspace | natefinch: when we start mongo don't we get it to log to standard out so we can parse the logs... | 17:10 |
voidspace | natefinch: meaning we get no logs | 17:10 |
voidspace | natefinch: or does it log to the directory as well? | 17:11 |
niemeyer | voidspace, natefinch: -check.work will prevent it from being removed, and display it as well | 17:11 |
voidspace | niemeyer: cool, thanks | 17:11 |
natefinch | niemeyer: oh, awesome, thanks | 17:11 |
niemeyer | voidspace: But I don't know where the logs are being sent to | 17:11 |
natefinch | voidspace: I'm pretty sure mongo's logs are still written to disk, but I honestly don't remember | 17:11 |
voidspace | we're still using the launchpad version of gocheck of course | 17:15 |
voidspace | wasn't there a thread about that? | 17:15 |
voidspace | yeah, looks like we're about to update | 17:16 |
natefinch | niemeyer: is that check.work flag available on launchpad's gocheck? I can't find docs on the flags it takes | 17:18 |
niemeyer | natefinch: -gocheck.work, likely | 17:18 |
niemeyer | natefinch: -help on the test binary, or just passing a wrong flag, will print the options | 17:19 |
voidspace | I don't think it is available | 17:19 |
voidspace | we're at the latest revision of launchpad | 17:19 |
voidspace | natefinch: copying the gopkg.in one over the top of the launchpad one seems to work though | 17:22 |
voidspace | :-p | 17:22 |
natefinch | voidspace: heh, we're lucky we always rename the package, otherwise that wouldn't work | 17:33 |
voidspace | right | 17:35 |
wwitzel3 | woo, I have passing tests! | 17:35 |
natefinch | anyone know why I'd get "cannot open ports 80-80/tcp on machine 5 due to conflict" when I re-ran my install hook? Shouldn't open-port be idempotent? | 17:39 |
gsamfira | natefinch: there was a discussion on the mailing list about this a while back. Subject was "Port ranges - restricting opening and closing ranges". Not sure of the conclusion on that though | 17:44 |
gsamfira | https://lists.ubuntu.com/archives/juju-dev/2014-August/003131.html | 17:45 |
=== sebas538_ is now known as sebas5384 | ||
=== hatch__ is now known as hatch | ||
perrito666 | anyone knows the difference between using net.Listen("tcp", "localhost:0") and net.Listen("tcp", ":0") ? | 21:57 |
=== viperZ28_ is now known as viperZ28 |
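
perrito666's closing question about `net.Listen("tcp", "localhost:0")` versus `net.Listen("tcp", ":0")` goes unanswered in the log. As far as the Go `net` documentation goes, an empty host means the listener binds to all available local addresses, while a host name such as `localhost` binds to at most one of that host's addresses (normally a loopback address); in both cases port `0` asks the OS to pick a free port. The sketch below prints what each form actually binds to; `listenAddr` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net"
)

// listenAddr opens a TCP listener on addr, closes it again, and reports
// the concrete address the OS assigned to it.
func listenAddr(addr string) (*net.TCPAddr, error) {
	l, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr), nil
}

func main() {
	for _, addr := range []string{"localhost:0", ":0"} {
		a, err := listenAddr(addr)
		if err != nil {
			fmt.Println(addr, "error:", err)
			continue
		}
		fmt.Printf("%-12s -> %v (loopback=%v, unspecified=%v)\n",
			addr, a, a.IP.IsLoopback(), a.IP.IsUnspecified())
	}
}
```

For tests that only need a throwaway local port, the loopback form avoids exposing the listener on external interfaces.
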