[00:01] yeah, i was profiling the linker and it stabs you in the face really
[00:02] it's all work that the 1.4 compilers never did
[00:02] indeed
[00:02] and in my testing, that accounts for 3x slowdown
[00:03] GOGC=off makes things about 40% faster in my tests i think
[00:03] varies a bit with case, of course
[00:03] yup
[00:03] not sure what a good sort of gc is for this sort of thing really
[00:03] none, it's a pathological case
[00:04] gc only works for allocations you intend to free
[00:04] preferably shortly
[00:04] yeah
[00:04] i suspect the biggest problem with GOGC=off or using off-heap memory will be political
[00:04] generational would help a bit, but all the new Nodes get old->young pointers immediately
[00:04] which kinda screws over the generational hypothesis
[00:05] generational collectors are good for avoiding heap fragmentation
[00:05] apart from that, they actually don't help
[00:05] as anything which uses a lot of memory, by definition, has big data structures which are long lived
[00:05] think redis, memcache, cassandra
[00:05] they help if you generate piles of short lived garbage
[00:05] all of those get promoted, or are reachable from the promoted set
[00:05] e.g. if you are writing python
[00:06] they are good for request/response servers
[00:06] * mwhudson spots some java experience
[00:06] where they generate a lot of allocations relative to the incoming connection, then free them at the end
[00:06] mwhudson: i'll show you the place that Java touched me, later
[00:10] welp, keith is interested at least
[00:10] i don't really care about the solution
[00:10] only that they engage with the idea that the gc is not helping the compiler
[00:39] menn0: thumper http://paste.ubuntu.com/11234864/
[00:39] uh oh
[00:39] this is our old friend
[00:39] * menn0 looks
[00:40] davecheney: where are you seeing that?
[00:40] failure on ppc64
[00:40] .../jujud/agent
[00:54] davecheney: is it consistent?
=== natefinch-afk is now known as natefinch
[01:39] menn0: yup
[01:40] i smell a data race
[01:41] anastasiamac, wallyworld: I need to run off in about 10 minutes, pepper is booked in for a haircut
[01:41] :)
[01:52] davecheney: turns out being in a hangout makes compiles even slower
[01:55] hangouts DEFINITELY slow down compilation (and everything else)
[01:55] menn0: go test -race .../jujud/agent
[01:55] OK: 80 passed, 1 skipped
[01:55] PASS
[01:55] Found 13 data race(s)
[01:56] since it's a prime number, they cancel out, right?
[02:00] natefinch: so, should I add more races, or take some away ?
[02:00] davecheney: no no, if you fix one, it won't be prime, and they won't cancel out anymore
[02:00] * natefinch is sure that's a thing.
[02:01] github.com/juju/juju/apiserver.(*changeCertConn).Read()
[02:01] :37 +0xa3
[02:01] did someone recently add support to the apiserver to change certificates on the fly ?
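The GOGC=off experiment described at the top of the log can also be done from inside the process rather than via the environment. A minimal sketch using only the standard library's runtime/debug package; the batch-tool framing comes from the discussion, the code itself is illustrative:

```go
// Disabling the collector in-process, equivalent to running with GOGC=off.
// Relevant to batch tools like the linker that allocate heavily and free
// nothing until exit, where GC is pure overhead.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// A negative percentage disables garbage collection entirely;
	// the previous setting is returned so it can be restored.
	old := debug.SetGCPercent(-1)
	fmt.Printf("GC was at %d%%, now disabled\n", old)

	// ... allocation-heavy batch work goes here ...

	debug.SetGCPercent(old) // restore if the process keeps running
}
```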
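The updateCert code itself never appears in the log, so the following is only a hypothetical reconstruction of the race class that stack trace points at: a connection wrapper whose certificate is swapped by one goroutine while Read runs on another, plus the usual mutex fix. Everything except the changeCertConn name (taken from the trace) is invented for illustration:

```go
// Hypothetical sketch of the data race reported above; the real apiserver
// code is not shown in the log, so fields and methods are illustrative.
package main

import (
	"crypto/tls"
	"net"
	"sync"
)

type changeCertConn struct {
	net.Conn
	mu   sync.Mutex // guards cert; without it Read races with setCert
	cert *tls.Certificate
}

// setCert is called from a watcher goroutine when a new certificate
// arrives. An unlocked write here is the kind of access -race reports.
func (c *changeCertConn) setCert(cert *tls.Certificate) {
	c.mu.Lock()
	c.cert = cert
	c.mu.Unlock()
}

// Read takes the same lock before touching the certificate, so a swap can
// never be observed half-written.
func (c *changeCertConn) Read(p []byte) (int, error) {
	c.mu.Lock()
	cert := c.cert
	c.mu.Unlock()
	_ = cert // the record layer would use this snapshot
	return c.Conn.Read(p)
}

func main() {
	client, server := net.Pipe()
	conn := &changeCertConn{Conn: client}
	go conn.setCert(&tls.Certificate{}) // concurrent cert swap
	go func() { server.Write([]byte("x")); server.Close() }()
	buf := make([]byte, 1)
	conn.Read(buf) // clean under -race with the mutex in place
}
```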
[02:01] davecheney: on the upside, coverage of jujud/agent is better than I thought, 78.5% - which means we're probably only missing ~3 data races not covered in tests
[02:02] natefinch: superb
[02:09] davecheney: wallyworld added the cert swapping thing
[02:09] it's needed to support upgrade IIRC
[02:09] recently = 1.22
[02:09] it's needed to support secure connections to state servers from cloud nodes
[02:11] excellent
[02:11] it's not clear if those races are just in the tests
[02:11] * davecheney throws table at mocking functions at test time
[02:12] but it's certainly a big problem
[02:12] data race == program is unknowable
[02:12] that could explain the runtime crashes we see
[02:17] https://bugs.launchpad.net/juju-core/+bug/1456851
[02:17] Bug #1456851: cmd/jujud/agent: multiple data races detected
=== kadams54 is now known as kadams54-away
[02:42] Bug #1222413 changed: openstack provider Instances suppresses errors
[02:42] Bug #1450129 changed: vsphere provider is missing firewaller, networking implementations
[02:42] Bug #1450701 changed: Juju CLI compatibility option
[02:42] Bug #1451283 changed: deployer sometimes fails with a unit status not found error
[02:42] Bug #1452114 changed: Unnecessary errors emitted during init system discovery
[02:42] Bug #1452535 changed: default storage constraints are not quite correct
[02:42] Bug #1453801 changed: /var/spool/rsyslog grows without bound
[02:42] Bug #1454043 changed: InstancePoller compares wrong Address list and always requests updated state Addresses
[02:42] Bug #1454676 changed: failed to retrieve the template to clone - 500 Internal Server error - error creating container juju-trusty-lxc-template -
[02:42] Bug #1454829 changed: 1.20.x client cannot communicate with 1.22.x env Committed by wallyworld>
[02:42] Bug #1456851 was opened: cmd/jujud/agent: multiple data races detected
[02:44] thumper: ok to merge this? https://github.com/juju/txn/pull/10
[02:44] no landing bot it seems
[02:44] thumper: i have permission to merge, just want someone else to agree
[02:46] thumper: um, we have a serious problem
[02:46] the cert change listener basically doesn't work
[02:47] and cannot be fixed in its current form
[02:53] thumper: menn0 https://bugs.launchpad.net/juju-core/+bug/1456857
[02:53] Bug #1456857: apiserver: updateCert has data race, corrupts certificate information
[02:54] wallyworld: ^^^
[03:12] Bug #1415176 changed: debug-hooks exit 1, doesn't mark hook as failed
[03:12] Bug #1420057 changed: agents see "too many open files" errors after many failed API attempts by dave-cheney>
[03:12] Bug #1429790 changed: debug-hooks not working with manually provisioned machines
[03:12] Bug #1437266 changed: Bootstrap node occasionally panicing with "not a valid unit name"
[03:12] Bug #1441206 changed: Container destruction doesn't mark IP addresses as Dead
[03:12] Bug #1441913 changed: juju upgrade-juju failed to configure mongodb replicasets
[03:12] Bug #1442012 changed: persist iptables rules / routes for addressable containers across host reboots
[03:12] Bug #1444861 changed: Juju 1.23-beta4 introduces ssh key bug when used w/ DHX
[03:12] Bug #1446264 changed: joyent machines get stuck in provisioning
[03:12] Bug #1449301 changed: storage: storage cannot be destroyed
[03:12] Bug #1449390 changed: storage: charms must wait for storage to be attached before running "install" hook
[03:12] Bug #1449822 changed: storage: storage-detached should be storage-detaching
[03:12] Bug #1450118 changed: vsphere provider should use OVA instead of OVF from cloud images.
[03:12] Bug #1451674 changed: Broken DB field ordering when upgrading to Juju compiled with Go 1.3+
[03:12] Bug #1452113 changed: log files are lost when agents are restarted under systemd
[03:12] Bug #1452207 changed: worker/uniter: charm does not install properly if storage isn't provisioned before uniter starts
[03:12] Bug #1452511 changed: jujud does not restart after upgrade-juju on systemd hosts
[03:12] Bug #1454481 changed: juju log spams ERROR juju.worker.diskmanager lsblk.go:111 error checking if "sr0" is in use: open /dev/sr0: no medium found
[03:12] Bug #1454599 changed: firewaller gets an exception if a machine is not provisioned Committed by hduran-8>
[03:12] Bug #1454870 changed: Client last login time writes should not use mgo.txn
[03:12] Bug #1456857 was opened: apiserver: updateCertificate has data race, corrupts certificate information
[03:12] spammy mup
[03:13] menn0: my guess is no bot
[03:13] menn0: so just merge it (assuming all tests pass :-)
[03:19] thumper: yep they do
[03:30] protip: whenever you use PatchValue, you're probably creating a data race
[03:30] please don't use PatchValue
[03:34] davecheney: what do you mean doesn't work? if it didn't work, lxc image caching would not work
[03:35] data race != doesn't work
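The PatchValue protip above deserves a concrete illustration. PatchValue (a juju testing helper) swaps a package-level value at test time, and that write is unsynchronized with any goroutine still reading the old value. A self-contained stand-in showing the race pattern, not the helper itself:

```go
// Sketch of why patching shared package state in tests races: one
// goroutine reads the hook while the "test" rewrites it. go test -race
// flags exactly this pattern. Names here are invented for illustration.
package main

import (
	"fmt"
	"sync"
)

// nowFunc stands in for a package-level hook a test might patch.
var nowFunc = func() string { return "real" }

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	// A worker goroutine reads the hook...
	go func() { defer wg.Done(); fmt.Println(nowFunc()) }()
	// ...while the test swaps it: an unsynchronized write, i.e. a race.
	go func() { defer wg.Done(); nowFunc = func() string { return "patched" } }()
	wg.Wait()
}
```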
[03:35] hey, emergency for demo
[03:35] getting this error
[03:35] WARNING failed to load charm at "/home/ubuntu/charms/trusty/rally": YAML error: line 20: did not find expected key
[03:35] kind of cryptic error
[03:36] what version of juju? what is charm yaml?
[03:37] 1.24-beta3
[03:37] https://github.com/juju-solutions/rally
[03:39] marcoceppi: sadly the error is from inside the yaml lib and it has been provided with no context :-(
[03:40] marcoceppi: looks like action yaml
[03:40] wallyworld: yeah, missing "
[03:41] wallyworld: thanks!
[03:41] never have seen that error before
[03:41] np
[03:41] and proof didn't pick it up
[03:41] something to fix :-)
[03:42] Bug #1454466 was opened: Deployment times out waiting for relation convergence - neutron-gateway in installing state
[03:46] waigani_: you around?
[03:47] natefinch: yep
[03:47] waigani_: #1 - thanks for fixing my dumb windows bug in the log rotation tests
[03:47] natefinch: np :)
[03:48] waigani_: #2 - since you wrote a script to run CI, maybe you can help me get my CI script running.... what should "job_name" be? my new test is log_rotation.py - does that mean the job name should be log_rotation? or log_rotation.py or something else entirely?
[03:49] natefinch: from what I can see, job_name is just used to name the environment
[03:49] natefinch: so the CI guys can keep track of what envs are for what jobs I'm guessing
[03:50] ahh ok.. do you know if there's a way to run just my one test, and not all of CI?
[03:50] natefinch: what's your one test?
[03:51] menn0: I had a couple quick thoughts about your scanner patch, are you around?
[03:51] natefinch: is your script up somewhere I can have a squizz?
[03:51] jam: yep I'm here
[03:52] waigani_: I wrote a python script following the pattern set by some of the other tests, like assess_bootstrap.py
[03:52] waigani_: here's the code: https://gist.github.com/natefinch/e377eacd6b2316b2a884
[03:52] menn0: so one thing I was thinking about is that since we have to read the whole DB it really is a bit too often to do it every 2hrs (i think). so I was trying to think of ways to make it more logical
[03:52] menn0: one that I thought could be really good is to track how many txns are in the collection and only prune when they grow by a certain amount
[03:52] like say 2x
[03:53] menn0: as this is really "don't let TXNs grow without bound and take up 99.99% of the total DB size"
[03:53] but since there is one TXN for every other doc, it is fair to expect TXN to be as much as 50% of the total DB size.
[03:53] ideally we would record the size of the collection after the last pruning
[03:53] waigani_: it requires a new test charm that exists here for now: https://github.com/natefinch/fill-logs
[03:53] and then only once it has bloated do the next GC
[03:54] as a poor man's approximation we could track that info in memory
[03:54] waigani_: just needs to be manually copied into repository (again, just for now)
[03:54] (always GC on the first inspection, track the final size of the DB, and then only GC again once the count() is 2x the original count())
[03:54] or whatever is the cheapest thing to measure.
[03:55] We could do db.txn.stats() but I'm not sure if that shrinks after a big prune
[03:55] jam: hmmm ok
[03:55] waigani_: ah crap, the script requires modifications to the ci-tools that aren't pushed yet
[03:55] jam: i'm pretty sure the stats tell you the allocated size and the in-use size
[03:55] natefinch: so, when it's ready, the test charm should probably live here: lp:juju-ci-tools/repository
[03:56] menn0: so db.collection.stats() if it accurately tracks what pages are in use would probably be a very cheap check
[03:56] waigani_: yep, got that. But I need to be able to test it to know if it's ready :) Actually, I've done a lot of manual testing on the charm, so pretty sure it's fine.
[03:56] menn0: db.txn.count() *could* be cheap depending on how mongo tracks documents.
[03:56] jam: count is very cheap
[03:56] jam: it's tracked separately
[03:56] menn0: I think in mongo 3 because of MVCC it changes to be not so cheap
[03:57] waigani_: deploy_job is giving me ImportError: No module named boto .... where do I get boto, pip?
[03:57] jam: that makes sense
[03:57] natefinch: sure. With the JES stuff, I took the approach of writing a new job - using deploy_job.py as a template
[03:57] menn0: anyway, if count is cheap today we can go with it
[03:57] waigani_: oh interesting
[03:57] as we're a fair bit off, perhaps a code comment to evaluate if this stays cheap.
[03:57] * menn0 nods
[03:57] * natefinch grumbles that this whole "write a CI test" thing would be a hell of a lot easier if there were documentation about how to do it.
[03:58] natefinch: so I've got deploy_jes_job.py - which builds on top of deploy_job.py
[03:58] jam: well the pruning change - mostly as you reviewed it - is merging for 1.22 as we speak :)
[03:58] jam: but i can iterate
[03:58] natefinch: so maybe you could do something similar?
[03:59] jam: basically what you're after is: only prune if there's actual useful gains to be made
[03:59] so that we're not loading the whole DB unnecessarily
[03:59] menn0: k. so my thoughts are generally that we want to have *some* GC so that things don't grow without bound, but obviously we can functionally cope with a fair amount of garbage, and we don't want to saturate our system just checking for garbage that isn't there.
[03:59] menn0: right
[03:59] menn0: the fact that our current GC is really expensive because we aren't doing incremental
[04:00] menn0: I was going to say we could just drop the poll time to 1/day or 1/week even
[04:00] but doing it when we expect to be able to clean things seems a better path
[04:00] waigani_: yeah, I can look into that
[04:01] I love the way every time something goes wrong with a python script I get a huge useless stack trace
[04:01] jam: so to recap: track the count of the txns collection after each prune, and only try to prune if the count grows to 2x the previous value
[04:02] jam: (and prune the first time if there's no count recorded)
[04:03] natefinch: as opposed to a panic in go? the stack trace is useful for diagnosis :-)
[04:03] so, the CI script gives me "ImportError: No module named boto" pip install boto gives me "ImportError: cannot import name IncompleteRead" ... my kingdom for a statically linked binary that just f'ing works.
[04:03] menn0: right. IMO ideally we would save the count after the last GC so that we don't always GC on startup
[04:03] but we can live with that
[04:03] wallyworld: a stack trace from pip is useless to the end user
[04:03] (i.e. me)
[04:03] it's just ugly
[04:04] so are go panics
[04:04] menn0: also, I wanted to make sure that you don't GC immediately on startup (while load is the greatest on the machine), but I'm pretty sure you don't
[04:04] wallyworld: your code shouldn't panic unless there's something hugely drastically wrong
[04:04] menn0: can you make sure there is a test that you don't GC immediately ?
[04:04] jam: the first prune doesn't happen until 2hrs after startup anyway
[04:04] wallyworld: like, programmer error, generally
[04:04] wallyworld: python scripts throw exceptions if you look at them the wrong way
[04:04] natefinch: same with python - the programmer is just lazy not to deal with the error
[04:04] jam: and if the count-at-last-prune is kept in the DB then it'll only happen when we want it to
[04:04] wallyworld: then almost every python programmer ever is lazy
[04:05] menn0: so in a healthy system (once we've fixed the address updater bug), I don't think we'll generate much garbage.
[04:05] menn0: like, I would expect it to take us weeks to actually grow to 2x
[04:05] natefinch: and go programmers aren't?
[04:05] jam: agreed
[04:05] menn0: I'd probably like INFO level logs that a GC is actually started (since we wouldn't be GC every 2 hrs)
[04:05] menn0: the only reason not to record it in the DB is that we don't have a great place to put the info.
[04:06] jam: there is a log at debug but I can bump it
[04:06] wallyworld: sure they are. But the default in go is not to show a huge useless ugly stack trace to your users.
[04:06] jam: I might add a txns.prune collection or something
[04:06] wallyworld: regardless.... do you know how to fix this problem? I presume the proper way to get boto is through pip, and yet my pip seems sad.
[04:06] jam: or better yet txns.gc
[04:06] natefinch: same with python if you wrap main in a try catch, all of 3 lines of code
[04:07] natefinch: as far as running one test, looking at assess_bootstrap.py you can execute that directly
[04:07] menn0: so I think an actually informative single-line message every 2 hrs would be ok. "checked for pruning transactions, found 1M current vs 0.5M old"
[04:07] jam: with just a single doc
[04:07] natefinch: python assess_bootstrap.py $(which juju) local
[04:07] natefinch: i've not used boto sorry
[04:07] natefinch: that works for me - it's currently bootstrapping on my local machine
[04:07] menn0: so if we're going to create a table, I think recording the results of actual GC runs would also be useful
[04:07] for being able to track "how fast am I growing garbage, how often is GC actually running, etc"
[04:07] waigani_: ah, ok. I tried to ask mgz and sinzui that... but I think I confused them by not wanting to run everything
[04:07] menn0: *not* recording every 2hrs, but recording the actual successful runs.
[04:08] menn0: thoughts?
[04:08] waigani_: do you know how to get boto?
[04:08] jam: yeah I guess that would be nice
[04:08] jam: although that collection would grow without bounds :)
[04:08] menn0: so we *could* just put that info into logs and then get it from log scraping
[04:09] menn0: but with log rotation you are likely to never have enough history to be useful.
[04:09] jam: yeah... i'm not sure it'll be that valuable in the db
[04:09] menn0: it does, hence the "only record successful runs"
[04:09] menn0: I would also be ok with capping the amount of history if we are worried about it
[04:09] natefinch: I must have already had that, I didn't hit that err. Have you tried pip?
[04:09] say 1000 successful GC runs should be big enough for anyone :)
[04:10] waigani_: yeah, my pip is broken, too. Google/stackoverflow says easy_install -U pip .... which is hilarious
[04:10] jam: until it's not :)
[04:10] menn0: hence why I used that phrasing
[04:10] jam: but seriously... that's not a major concern
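A sketch of the "only prune when the collection has grown enough" check jam recaps above, written against the mgo driver's Collection.Count. The 2x factor, the prune-on-first-check rule, and the collection name all come from the conversation; the implementation that actually merged may differ:

```go
// Count-based pruning trigger, as discussed above. Assumes gopkg.in/mgo.v2;
// the threshold and names follow the conversation, not shipped juju code.
package main

import (
	"log"

	"gopkg.in/mgo.v2"
)

const growthFactor = 2 // prune once the txns collection has doubled

// shouldPrune reports whether a prune is worthwhile. count is cheap in
// mongo 2.x (tracked separately); see the MVCC caveat above for mongo 3.
func shouldPrune(txns *mgo.Collection, lastPrunedCount int) (bool, int, error) {
	n, err := txns.Count()
	if err != nil {
		return false, 0, err
	}
	if lastPrunedCount == 0 {
		return true, n, nil // no baseline recorded: prune now
	}
	return n >= lastPrunedCount*growthFactor, n, nil
}

func main() {
	session, err := mgo.Dial("localhost")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	prune, n, err := shouldPrune(session.DB("juju").C("txns"), 0)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("txns count=%d, prune=%v", n, prune)
}
```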
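jam's point that the count-at-last-prune should live in the DB, together with menn0's "txns.prune collection ... with just a single doc" idea, could look roughly like the following upsert. The collection name is menn0's suggestion and the field names are invented, so treat this as a sketch rather than the code that landed:

```go
// Persisting the count-at-last-prune in a single-document collection, per
// the suggestion above. Assumes gopkg.in/mgo.v2; all names hypothetical.
package main

import (
	"log"
	"time"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

type pruneRecord struct {
	ID        string    `bson:"_id"`
	TxnsCount int       `bson:"txns-count"`
	Completed time.Time `bson:"completed"`
}

// savePruneRecord upserts the tracking doc after a successful prune, so a
// restarted state server doesn't re-prune immediately on startup.
func savePruneRecord(db *mgo.Database, count int) error {
	_, err := db.C("txns.prune").Upsert(
		bson.M{"_id": "last"},
		pruneRecord{ID: "last", TxnsCount: count, Completed: time.Now()},
	)
	return err
}

// lastPruneCount returns 0 when nothing has been recorded yet, which the
// caller treats as "prune now" (see the shouldPrune sketch above).
func lastPruneCount(db *mgo.Database) (int, error) {
	var rec pruneRecord
	err := db.C("txns.prune").Find(bson.M{"_id": "last"}).One(&rec)
	if err == mgo.ErrNotFound {
		return 0, nil
	}
	return rec.TxnsCount, err
}

func main() {
	session, err := mgo.Dial("localhost")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	db := session.DB("juju")
	if err := savePruneRecord(db, 100000); err != nil {
		log.Fatal(err)
	}
	n, _ := lastPruneCount(db)
	log.Printf("count at last prune: %d", n)
}
```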
[04:10] haha
[04:10] menn0: so some of it is protecting us against the unknown
[04:10] knowing that X grows without bound and we may have an env that runs for 10 years
[04:11] menn0: also multi-environment state servers magnify this sort of problem.
[04:11] menn0: that's actually one of the bigger reasons to not read-the-world every 2 hrs
[04:11] menn0: as you start removing the constraint that the DB is small.
[04:11] jam: yep - agreed
[04:11] https://twitter.com/gardaud/status/357638468572151808
[04:12] natefinch: how do you get easy_install "apt-get install easy_install" :)
[04:12] (not actually true)
[04:12] jam: right? :) I forget how I got easy_install... I'm sure I got it just to install pip
[04:14] natefinch: I believe easy_install is a single python file so many people 'wget' it
[04:14] natefinch: in proper security fashion: wget https://bootstrap.pypa.io/ez_setup.py -O - | python
[04:14] https://pypi.python.org/pypi/setuptools
[04:15] natefinch: I guess at least now they have you download it over HTTPS; it used to be raw HTTP I believe.
[04:15] jam: well, at least that
[04:15] natefinch: though they also give you the "--no-check-certificate" and "--insecure" version of installation just to make sure you can root yourself to the world.
[04:17] menn0: does it seem reasonable to do the "only GC if big enough" ? I don't want to spend huge amounts of time on it, but it seemed a pretty cheap win
[04:17] jam: i think it's worth doing
[04:18] jam: i'd feel a lot less worried about this going out if that was in place
[04:20] jam: aside from the potential i/o load i was concerned about what this would do to mongodb cached pages. it could hurt performance for a while.
[04:20] (after each prune)
[04:28] menn0: did gustavo ever comment on the pruning changes to mgo/txn ?
[04:28] menn0: IIRC we also had some patches to PruneMissing that would also GC the txn-queues
[04:28] rogpeppe I think had done that.
[04:28] anyway /me needs to go run some errands. be back later
[04:29] jam: unfortunately i haven't been able to get a response from gustavo so far
[04:30] jam: i have a PR that fixes PurgeMissing for the "huge txn-queues" situation (and various drive-by fixes)
[04:30] jam: and he hasn't looked at the pruning fixes
[04:55] axw_: reviewed, i've been adding todo cards to the backlog, some of which your branch now obsoletes. there's a few still to add. one i highlighted as high priority is an api compatibility issue if we ship 1.24 with it
[04:56] wallyworld: thanks. I will take a look after I address your comments
=== axw_ is now known as axw
[04:56] axw_: ok, np, i'm relocating so will be afk for a bit
[04:57] okey dokey
[05:14] ahh, python... http://cdn.meme.am/instances/500x/62360284.jpg
=== urulama__ is now known as urulama
[05:26] if python were compiled, I'd be in bed by now :/
[05:29] sonofa .... juju action evidently does not take the -e flag
[05:30] er, environment, not provider
[07:26] mgz: hiya
[08:45] mgz, ping
[09:02] hmm, got no camera and sound, will restart browser
[09:03] aaargh, fan spins up and browser blocks :(
[09:28] Bug #1456957 was opened: rsyslog worker should not add machines that are not ready yet
[10:26] dooferlad: is that you http://www.reddit.com/r/golang/comments/36lf6o/golang_and_openid/ ?
[10:26] TheMue: no
[10:26] dooferlad: sadly so far no good answer there
[10:49] Bug #1456989 was opened: cloud-init 0.6.3 on precise generates invalid apt-get install command line
[11:14] dimitern: hey
[11:14] mgz, hey, I've filed the bug above ^^ which might be causing some CI failures wrt precise
[11:16] dimitern: interesting - for real deployment right, not unit tests?
[11:16] mgz, yes
[11:16] mgz, also I wanted to ask about that PR you reverted yesterday re init discovery
[11:17] 1.24/master have been fine on CI since the centos change - and we do have precise testing charms
[11:17] dimitern: see the last 1.23 run for full breakage
[11:18] I only reverted on 1.24 for the release, so 1.23 and master runs will still be borked
[11:18] mgz, I'm running into an issue where I get http://paste.ubuntu.com/11243099/
[11:18] http://reports.vapour.ws/releases/2666
[11:18] dimitern: yup, that's the deployment breakage
[11:18] mgz, so far only on 1.24 - and seems related to https://github.com/juju/juju/pull/2359
[11:19] mgz, what's interesting is that the same code is also on master (https://github.com/juju/juju/pull/2358), but I'm not seeing the same issue
[11:19] dimitern: I don't see that *after* the revert
[11:19] mgz, the "[[: not found" error?
[11:19] dimitern: fun, we probably need a master through CI then
[11:20] dimitern: the breakage at all, the last 1.24 run was clean (ish - precise unit tests still failed, known flakiness)
[11:20] mgz, yeah, I'm currently trying to file a bug for that, but I'm having a bit of trouble pinning it down exactly
[11:21] dimitern: one reason I just did the revert... I should have filed a bug after though
[11:21] mgz, it seems that when I reproduce the issue, replacing [[ and ]] with [ and ] in that init discovery script solves the errors
[11:21] I wonder if this is a shebang issue
[11:22] mgz, and the reason is simple - http://paste.ubuntu.com/11243163/
[11:22] I *thought* we were careful to use bash for everything though
[11:22] mgz, the script is rendered with a #!/usr/bin/env bash shebang
[11:22] mgz, however runcmd in cloud-init starts with a #!/bin/sh shebang
[11:23] you see where this is getting..
[11:23] mehe, okay, well, that's the bug then
[11:24] the repercussions are potentially enormous - any runcmd script that requires bash in cloud-init user-data has to be pre-rendered somewhere and executed, rather than included inline like $(...)
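The shebang mismatch dimitern describes is easy to reproduce: the rendered script declares bash, but cloud-init's runcmd executes under /bin/sh, which on Ubuntu is dash and has no `[[`. A small Go demonstration, assuming /bin/sh is dash as on a stock Ubuntu install:

```go
// Reproduces the "[[: not found" failure mode described above: the same
// test works under bash but fails under a POSIX /bin/sh (dash on Ubuntu).
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	script := `[[ -x /bin/ls ]] && echo bashism ok`

	for _, shell := range []string{"/bin/bash", "/bin/sh"} {
		out, err := exec.Command(shell, "-c", script).CombinedOutput()
		fmt.Printf("%s: %q err=%v\n", shell, out, err)
	}
	// bash prints "bashism ok"; dash errors with "[[: not found".
	// The portable fix discussed above: use plain [ -x /bin/ls ] instead.
}
```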
[11:24] we can also just not use bashisms
[11:25] damn right :)
[11:56] mgz, FYI - bug 1456989
[11:56] Bug #1456989: cloud-init 0.6.3 on precise generates invalid apt-get install command line
[11:56] dimitern: I saw, thanks
[11:57] mgz, no, sorry not that one
[11:57] so, my inclination is to go ahead and back the change out on 1.23 and master as well
[11:57] I saw mup say it in another channel :)
[11:57] bug 1457011
[11:57] Bug #1457011: init system discovery script fails with: [[: not found
[11:57] mgz, that one yes
[12:01] Bug #1457011 was opened: init system discovery script fails with: [[: not found
[12:31] Bug #1457022 was opened: state server panic: "rescanned document misses transaction in queue" 1.22:In Progress by fwereade>
[12:52] perrito666, https://github.com/juju/charm/pull/129
[12:56] mattyw: you are losing communication skills :p
[12:56] ah, you want me to merge that :p
[12:57] perrito666, just letting you know it's been updated :)
[12:57] perrito666, I've pinged you about others that I think can be closed
[12:58] mattyw: yes, sorry notifications of github get lost in the sea of notifications
[12:58] * perrito666 sees no other pings from mattyw
[12:58] perrito666, I prefer calling them github "notifications"
[13:00] mattyw: I call them spam
[13:01] Bug #1457031 was opened: Juju cannot deploy to any substrate
[13:01] mattyw: done
[13:04] well look at that, this call is full
[13:06] perrito666: yes, just tried it too
[13:06] odd
[13:07] perrito666: have been in another meeting so far and now cannot jump into this one
[13:07] seems that google is not that smart about letting you know if you invite more people than possible
[13:08] and there isn't a way to be just a spectator
[13:10] Bug #1457031 changed: Juju cannot deploy to any substrate
[13:11] ohhh, what now
[13:13] perrito666: see canonical #juju, they try something with bundling lines
[13:16] Bug #1457031 was opened: Juju cannot deploy to any substrate
[13:28] fyi perrito666's internet has gone down
[13:40] katco, ericsnow, wwitzel3: sorry, can we push our meeting back 30 mins please?
[13:42] fwereade: certainly
[13:46] Bug #1457022 changed: state server panic: "rescanned document misses transaction in queue" by fwereade>
[13:57] python peeps... is there some static analysis tool that'll tell me when I've typoed function names etc? You know, the stuff you get for free from a compiler?
[13:57] natefinch: pyflakes
[13:57] flake8
[13:57] tests?
[13:57] natefinch: yeah, that one's better ^^
[13:58] natefinch: as a bonus, flake8 can do McCabe cyclomatic complexity metrics
[13:58] wwitzel3: +1 :)
[13:58] wwitzel3: I was waiting for that one... but I'm *writing* tests... where do the tests end?
[13:58] how do I test the tests?
[13:58] of the tests
[13:58] natefinch: it's tests all the way down
[13:58] evidently
[13:58] Bug #1457022 was opened: state server panic: "rescanned document misses transaction in queue" by fwereade>
[13:59] if I have to do sudo pip install, does that mean that I've screwed up my environment?
[14:00] or is that correct?
[14:00] natefinch: no, it just means you haven't explicitly isolated your environment and you are installing packages into the system Python
[14:00] if I don't sudo, I get some massive traceback
[14:01] Bug #1457022 changed: state server panic: "rescanned document misses transaction in queue" by fwereade>
[14:02] yey, internet is sort of back
[14:06] natefinch, pip is supposed to be run inside a virtualenv
[14:06] natefinch, that might be the problem
[14:09] * dimitern waves at voidspace
[14:09] I really don't care enough to mess with virtualenv
[14:09] and of course, you look for something about vi in google and it returns answers about rick_h_ very often
[14:10] btw rick_h_ your screencast about bundle juggler is down
[14:10] natefinch: you will be sorry whenever you try to do something else and your system python is all screwed
[14:11] perrito666: I'll just complain about how much python sucks and make you fix it ;)
[14:11] s/make you/ask you nicely to/
[14:12] you cannot do that anymore, now wallyworld does
[14:12] :p
[14:12] what?
[14:12] :D
[14:12] python is awesome, way better than Go
[14:12] it feels like it's friday
[14:12] ouch
[14:12] wallyworld: I don't need 3 different package installers / environment handlers to run simple go code
[14:13] me either
[14:13] * TheMue fetches some cakes and coke and then watches the fight
[14:13] easy_install, pip, virtualenv
[14:13] you don't have to have all those
[14:13] nope, just the first two - the last is not a package manager
[14:13] i never have
[14:14] except people complain that I don't have it, implying that to do it the "right way" I should be using it.
[14:14] you can always use tarballs, like in the good old slackware days
[14:14] at least python has such things available
[14:14] wallyworld: go doesn't have them because you don't need them
[14:14] lol
[14:15] yeah, just pull everything from tip, what could possibly go wrong
[14:15] who needs package management
[14:15] or config management
[14:15] or versioning
[14:15] wallyworld: to be fair, go's solution to that is linking
[14:16] wallyworld: it just hasn't landed yet
[14:16] don't get me started on static linking
[14:16] wallyworld: what do you mean by config management?
[14:16] katco: nanan, invalid point, if it is not there it's not there
[14:16] perrito666: it's there, just not in a release yet :)
[14:16] just wait until 1.5 is out
[14:16] and you can still statically link bad code without proper versioning
[14:17] * perrito666 has dealt with academics too much to accept the answer "theoretically this is the solution, we just need to wait until the computer able to run it exists"
[14:17] not only that - you would be able to do it on ppc64 and arm64 as well :D
[14:17] * wallyworld is too tired to argue anymore, need sleep
[14:18] tc wallyworld
[14:18] next time we can discuss over drinks :-)
[14:18] :)
[14:18] just when it began to get funny
[14:19] TheMue: it's past 12am here :-) you can pick up the discussion and preach how good erlang is :-)
[14:20] i'm outta here :D
[14:20] wallyworld: good idea, or pony (but it's very young)
[14:20] wallyworld: thanks for this great idea
[14:20] the fastest way to ruin a perfect language is to program something in it
[14:20] so *, I'm open
[14:20] lol
[14:21] did I ever mention Smalltalk?
[14:21] *duck*
[14:22] I <3 Smalltalk, my first real programming job was Smalltalk.
[14:22] * TheMue hugs wwitzel3
[14:23] mm, I think I lost the chance to insert the classic "C is the only real language"
[14:23] and get the classic answer
[14:23] C is ASM for the weak
[14:23] i'm enjoying learning common lisp. it's a neat language
[14:29] I've once done Scheme and liked it. Always wanted to do Common Lisp too.
[14:29] wasn't that a Katy Perry song?
[14:29] "I did Scheme and I liked it."
[14:30] TheMue: if you do, do yourself a favor and get quicklisp first: https://www.quicklisp.org/beta/
[14:30] katco: will try to remember
[14:30] TheMue: CL has some cruft from the spec being ratified in the 80's, but it's actually a very practical language
[14:30] lots of libraries
[14:30] katco: currently I'm looking into pony http://www.ponylang.org/
[14:31] yeah i saw your tweet and took a peek at it
[14:31] katco: that's an actor based language, very clean
[14:31] ericsnow: hey sorry for the time change. standup time
[14:32] katco: trying...hangouts is misbehaving for me
[14:32] ericsnow: ah ok
[14:46] Bug #1457068 was opened: bootstrap failed, no tools available
[14:51] fwereade: we're ready to argue!
[14:51] ;)
[14:51] https://plus.google.com/hangouts/_/canonical.com/moonstone?authuser=1
[15:05] fwereade: there?
[15:37] evilnickveitch: you around?
[15:38] Bug #1457089 was opened: reboot request in charm hook context is silently ignored in the case of actions
[15:42] natefinch, yup
[15:44] evilnickveitch: nevermind, my question was answered by this bug: https://github.com/juju/docs/issues/405 Add "1.23" docs and update "devel" to 1.24.
[15:44] ok, cool
=== redelmann is now known as rudi|ding|dnd
=== rudi|ding|dnd is now known as rudi|deploying|d
=== rudi|deploying|d is now known as rudi|deploying
[15:51] wwitzel3, ericsnow: is there documentation for GCE that should be going into jujucharms.com/docs?
[15:51] natefinch: just what's in the release notes
[15:51] ericsnow: we need to convert that into a markdown document to put up on the webpage
[15:53] katco: ^^
[15:53] katco: sorry, the lack of docs is my fault, since it happened on my watch.
[15:56] natefinch: thx for the ping, i'll add it to the backlog
[16:05] ericsnow: wwitzel3: fwereade: such good conversation/work. i feel good about this direction.
[16:14] katco: me too
[16:16] jam: we never got a chance to meet with all the back-to-back meetings
[16:27] juju cannot be installed because of a possible issue with the python-pygment package.
[16:29] mattyw: ping
[16:29] perrito666: :(
[16:29] ericsnow, pong
[16:29] mattyw: could you take another look at http://reviews.vapour.ws/r/1733/?
[16:30] mattyw: also http://reviews.vapour.ws/r/1728/
[16:30] ericsnow, would be my pleasure
[16:30] mattyw: thanks!
[16:32] wwitzel3: I'm going to take lunch early so I'll be back in about 1.5 hours
[16:33] ericsnow, you mention an upcoming proper fix in the pr. when you say in that comment it just needs to be non-windows, for now, do you mean until that fix?
[16:33] mattyw: correct
[16:33] ericsnow: sounds good
[16:34] ericsnow, can you mention the bug number in that comment, and say it will change when a fix for that bug lands?
[16:34] mattyw: sure
[16:34] ericsnow, I'll take another quick look after that but basically LGTM
[16:35] mattyw: thanks again
[16:43] ericsnow: are you looking into bug 1457031?
[16:43] Bug #1457031: Juju cannot deploy to any substrate
[16:43] wwitzel3: has anyone reached out to natefinch to help with those CI tests?
[16:43] katco: I can after I finish stuffing my face with food
[16:44] wwitzel3: your priorities are correct sir! ;)
[16:44] food sounds like a good idea, I'll do that too
[16:45] katco: I will after lunch
[16:49] ericsnow, reviewed
[16:50] ericsnow: cheers
[16:51] evilnickveitch, ping?
[16:56] * perrito666 tries for week 3 to obtain a more decent internet provider.... the one I tried today cannot give me details over the internet; they asked me to make a phone call...
[16:56] Bug #1457122 was opened: local data dir handling for init services should be handled independently
[16:56] Bug #1457124 was opened: Panic: FilterSuite.TearDownTest
[17:00] mattyw, pong
=== rudi|deploying is now known as redelmann
[17:30] natefinch: ping
[17:31] wwitzel3: let's jump on moonstone
[17:31] natefinch: sounds good
[17:49] wwitzel3: let me know if you have any questions or need help getting the environment set up
[17:49] natefinch: yeah, I closed the hangout because it was sucking CPU
[17:49] wwitzel3: basically the test just deploys my charm, runs the "add 300 megs of data to the unit agent log" action, and then runs the "return the size of all unit agent logs" action... and verifies the output
[17:49] wwitzel3: totally understand
[17:50] natefinch: so what should I be passing as an env?
[17:51] natefinch: also JUJU_REPOSITORY, I assume that is pointing to your charm?
[17:55] wwitzel3: so, I just run from the juju-ci-tools directory. You need to checkout lp:juju-ci-tools/repository under the juju-ci-tools directory
[17:55] then JUJU_REPOSITORY=./repository works
[17:55] wwitzel3: you need to copy the charm dir under ./repository/trusty
[17:56] er, the fill-logs charm dir that is
[17:56] and env is the name of an environment in your environments.yaml that you would like to deploy to
[17:57] natefinch: got it, ok, running it now
[17:57] natefinch: and what is the issue that needs resolving? it isn't clear from the LP ticket
[17:58] wwitzel3: this is just a CI test for log rotation that I'm writing... right now it's timing out while running one of the actions... probably just not waiting long enough for the action to finish
=== kadams54 is now known as kadams54-away
[18:03] wwitzel3: there's action_fetch and action_do that I added in jujupy.py which could potentially have problems as well... though they seem to be fine.
[18:03] wwitzel3: just pushed a fix to the juju-ci-tools branch I'm working on
[18:05] wwitzel3: ha, now it just passes entirely
[18:05] natefinch: nice :)
[18:07] natefinch: I'm getting a regex issue atm, but haven't tried your latest fix
[18:07] wwitzel3: huh, I thought I fixed all the regex issues
[18:08] natefinch: Exception: Rotated unit log name '/var/log/juju/unit-fill-logs-0.log' does not match pattern '/var/log/juju/unit-fill-logs-0-(.+?)\.log'.
[18:09] wwitzel3: oh yeah, that was what I fixed
[18:09] haha sorry
[18:09] natefinch: ok, cool, running it now
[18:09] natefinch: if I get a successful run then LGTM
[18:10] natefinch: and I did .. no failures here
[18:10] I have to add the machine log rotation checks, but that'll be mostly copy and paste
[18:10] (and modify the regexes etc)
[18:11] and/or abstract out the differences
[18:12] wwitzel3: anyway, I can finish that up. Thanks for verifying
[18:13] wwitzel3: yeah, the meetings today took up our normal slots. are you around now?
[18:14] jam: yep
[18:14] natefinch: cool, np
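The pattern quoted in that exception only matches rotated files, which carry a timestamp suffix; the live log has none, which is what tripped the test. A Go illustration of the distinction (the actual CI test is Python in juju-ci-tools, and the exact timestamp format here is an assumption):

```go
// Illustrates the rotated-log-name matching from the exchange above. The
// live log has no timestamp suffix, so the rotated-file pattern must not
// be applied to it. The timestamp format is assumed for the example.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	rotated := regexp.MustCompile(`/var/log/juju/unit-fill-logs-0-(.+?)\.log`)

	for _, name := range []string{
		"/var/log/juju/unit-fill-logs-0.log",                     // live log: no match
		"/var/log/juju/unit-fill-logs-0-2015-05-20T01-02-03.log", // rotated: matches
	} {
		if m := rotated.FindStringSubmatch(name); m != nil {
			fmt.Printf("%s rotated at %s\n", name, m[1])
		} else {
			fmt.Printf("%s is the live log\n", name)
		}
	}
}
```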
[18:15] natefinch: let me know if you need me to do any more verifying
[18:15] jam: https://plus.google.com/hangouts/_/canonical.com/moonstone?authuser=0
=== kadams54 is now known as kadams54-away
[18:50] wwitzel3: back
[18:50] wwitzel3: I have a couple bugs to look at really quickly
[18:56] ericsnow: in moonstone talking to jam about the spec
[18:56] wwitzel3: k
[19:59] I just heard this from a sales person at my ISP: "We will need to replace your modem for one larger so the 50M we will provide you fit" (it is not awkwardly translated; in Spanish the person actually spoke about volume)
[20:00] * perrito666 was forced to buy the biggest residential connection available to get 7M upload
[20:01] heh
[20:01] sorry
[20:01] I think the CI test I just wrote actually found a bug in lumberjack
[20:03] quite by accident, but still.. handy
[20:23] https://github.com/natefinch/lumberjack/issues/12
[20:23] well, guess I know what I'm working on tonight
[20:23] Going to go make dinner, will be back in ~4.5 hours
[20:23] (when the kids are asleep)
=== natefinch is now known as natefinch-afk
[20:32] Bug #1457205 was opened: Subordinate charm Action data not reported by API
=== kadams54 is now known as kadams54-away
[21:01] wwitzel3: hey where did you and natefinch-afk leave the CI tests?
=== kadams54-away is now known as kadams54
=== kadams54 is now known as kadams54-away
=== kadams54-away is now known as kadams54
[21:33] menn0: cherylj: is your work for a. txn fixes, and b. file handle leaks committed to 1.22?
[21:33] i see a txn fix merged to 1.22
[21:34] wallyworld: see #juju@can
[21:34] wallyworld: yes I committed a fix for the txns issue which is good enough
[21:34] wallyworld: as it was merging jam started talking to me about some improvements
[21:34] menn0: ok, i'll mark the bug as fix committed, ty
[21:34] for 1.22 at least
[21:35] wallyworld: no hang on :)
[21:35] ok
[21:35] wallyworld: i've almost got the improvements ready
[21:35] rightio, we are waiting on another fix anyway
[21:35] wallyworld: i think it's worth getting those in to 1.22 as well
[21:35] ack
[21:36] it significantly lowers the performance hit of the pruning change
[21:36] wallyworld: are we aiming for the next 1.22 release today-ish?
[21:36] menn0: i don't think so
[21:37] menn0: sorta - we are waiting for william's fix so it will likely be a bit later than just 1 day
[21:38] wallyworld, katco: cool. well this is my top priority regardless. I'll definitely be done with this today. (for 1.22 at least if not all the branches)
[21:38] ty :-)
[21:38] menn0: you are, as always, awesome :D
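For context on the library that bug was filed against: lumberjack is natefinch's rolling-file writer, the piece doing the size-based rotation the CI test exercises. A minimal sketch of ordinary v2 usage, not of how juju wires it in (the values are illustrative):

```go
// Typical lumberjack v2 usage: a size-rotated log file behind the standard
// log package. Values here are illustrative, not juju's configuration.
package main

import (
	"log"

	"gopkg.in/natefinch/lumberjack.v2"
)

func main() {
	log.SetOutput(&lumberjack.Logger{
		Filename:   "/var/log/juju/unit-fill-logs-0.log",
		MaxSize:    300, // megabytes written before rotating
		MaxBackups: 2,   // rotated files to keep
		MaxAge:     28,  // days to retain backups
	})
	log.Println("logging through a size-rotated file")
}
```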
[21:41] Bug #1457218 was opened: failing windows unit tests
[21:59] could I have someone look at the patches I have up for review (for critical bugs): http://reviews.vapour.ws/r/1737/ and http://reviews.vapour.ws/r/1738/
[21:59] I could also use a review on http://reviews.vapour.ws/r/1728/
[22:35] Bug #1457225 was opened: Upgrading from 1.20.9 to 1.23.3 works, but error: runner.go:219 exited "machiner": machine-0 failed to set status started: cannot set status of machine "0": not found or not alive
[22:39] why, why oh god why is it so hard to write a proper unit test :(
[22:47] Bug #1457225 changed: Upgrading from 1.20.9 to 1.23.3 works, but error: runner.go:219 exited "machiner": machine-0 failed to set status started: cannot set status of machine "0": not found or not alive
[22:56] Bug #1457225 was opened: Upgrading from 1.20.9 to 1.23.3 works, but error: runner.go:219 exited "machiner": machine-0 failed to set status started: cannot set status of machine "0": not found or not alive
=== kadams54 is now known as kadams54-away
[23:39] Can I get a review for the file handle leak bug 1454687: http://reviews.vapour.ws/r/1740/
[23:39] Bug #1454687: add NX 842 hw compression patches
[23:39] oops, wrong bug
[23:39] bug 1454697
[23:39] Bug #1454697: jujud leaking file handles
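The log doesn't show where jujud was leaking handles, so the following is only the generic shape of the bug class behind bug 1454697 (and the "too many open files" symptom from bug 1420057 earlier in the log), with the usual defer fix:

```go
// Generic file/socket handle leak pattern and its fix; not the actual
// jujud code, which the log does not show.
package main

import (
	"io/ioutil"
	"log"
	"net/http"
)

func fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	// Without this defer every call leaks a connection, and the process
	// eventually hits "too many open files".
	defer resp.Body.Close()
	return ioutil.ReadAll(resp.Body)
}

func main() {
	body, err := fetch("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read %d bytes", len(body))
}
```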