[00:00] <wgrant> StevenK:   File "/srv/buildbot/slaves/launchpad/lucid-devel/build/orig_sourcecode/eggs/auditorfixture-0.0.3-py2.6.egg/auditorfixture/server.py", line 100, in _start
[00:00] <wgrant>     raise Exception("Timeout waiting for auditor to start.")
[00:01] <wgrant> Exception: Timeout waiting for auditor to start.
[00:02] <StevenK> wgrant: I don't think the new testbrowser does send a referer.
[00:03] <StevenK> wgrant: Hmmmmm. What? That's passed a few other buildbot runs.
[00:03] <wgrant> StevenK: Yes. It's a nice new intermittent failure.
[00:03] <StevenK> Bleh
[00:03] <StevenK> First one, I think
[00:04] <StevenK> wgrant: It looks like most of the test failures will be sorted by fixing NoReferrerError.
[00:07] <wgrant> StevenK: There are surely more than 21 tests that use testbrowser to submit forms, so there must be something special about yours.
[00:09] <StevenK> wgrant: http://pastebin.ubuntu.com/1138745/ is the diff. 17481 tests run in 4:24:53.047311, 21 failures, 2 errors
[00:09] <wgrant> StevenK: Sure, but what's special about the tests that failed?
[00:35] <wgrant> StevenK: How goes the QA?
[00:42] <StevenK> wgrant: You offered a project I could push to
[00:43] <StevenK> wgrant: All of the failing tests are doctests.
[00:44] <StevenK> So maybe the problem is in the browser objects we toss into the doctests
[00:51] <wgrant> StevenK: But there are hundreds of doctests that work.
[00:53] <wgrant> StevenK: You now have APG for python-oops-tools/private, and branches default to private
[00:53] <wgrant> (using a sharing policy, not BVPs)
[00:54] <wgrant> Bah, actually, that won't work
[00:54] <wgrant> Need to use a BVP
[00:54] <wgrant> sec
[00:54] <wgrant> There
[00:55] <StevenK> wgrant: So, there are 675 doctests. Only 11 use getControl(..).click()
[00:56] <wgrant> StevenK: No
[00:56] <wgrant> $ bzr grep -l 'getControl.*click()' | wc -l
[00:56] <wgrant> 331
[00:56] <wgrant> $ bzr grep -l 'getControl.*click()' | grep txt$ | wc -l
[00:56] <wgrant> 311
[00:57] <StevenK> for i in $(find . -name '*.txt' | grep -E '(doc|tests)') ; do grep -l '.click()' $i; done | wc -l
[00:57] <StevenK> 11
[00:57] <wgrant> pagetests live in stories
[00:57] <wgrant> not doc or tests
[00:57] <StevenK> Ah
[00:58] <StevenK> I thought I might be missing one
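The missed files can be illustrated with a search that includes stories/ alongside doc/ and tests/. This is a hypothetical helper assuming a Launchpad-style tree layout, not a tool from the tree:

```python
import os

def find_click_tests(root):
    """Find .txt tests that click controls, including the pagetests
    under 'stories' directories that a doc/tests-only grep misses."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        parts = dirpath.split(os.sep)
        # pagetests live in 'stories', not just 'doc' or 'tests'
        if not any(p in ('doc', 'tests', 'stories') for p in parts):
            continue
        for name in filenames:
            if not name.endswith('.txt'):
                continue
            path = os.path.join(dirpath, name)
            with open(path) as f:
                if '.click()' in f.read():
                    hits.append(path)
    return hits
```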
[00:58] <StevenK> wgrant:  bzr push lp://qastaging/~stevenk/python-oops-tools/foo-1
[00:58] <StevenK> It's been too long since I had to push a branch to qas.
[00:59] <wgrant> Hm, I can't see it...
[01:00] <wallyworld_> sinzui: sadly there is no native css support for multi-line text truncation but i have found a really neat little yui module which works perfectly and does all the internal calcs to simply allow a requested number of lines to be specified. it also falls back to native support if only one line is required.
[01:00] <StevenK> Sigh, pasted it but didn't hit enter
[01:01] <sinzui> wallyworld_: go on?
[01:01] <StevenK> wgrant: Which gives me an error, so there has to be something wrong with that URL
[01:01] <wgrant> StevenK: What's the error? That /+branch-id/foo is not a branch?
[01:01] <wallyworld_> sinzui: i'd like to use it. the js then is Y.all('.ellipsis').ellipsis({'lines': 2})
[01:02] <StevenK> bzr: ERROR: Server sent an unexpected error: ('error', 'NotBranchError', 'Not a branch: "chroot-67089488:///+branch-id/519259/".')
[01:02] <wgrant> StevenK: Right, that's fine. The stacked-on branch doesn't exist on qas.
[01:02] <wallyworld_> sinzui: for example
[01:02] <wgrant> StevenK: The branch is created now. You can test.
[01:03] <wallyworld_> sinzui: i haven't looked, but the tooltip could also display all the text as well i think
[01:03] <wallyworld_> sinzui:  so it seems like a nice solution for a few hundred lines of 3rd party code, much like we do for sortable
[01:03] <sinzui> wallyworld_: since you are adding it to things that use the title editor, I don't see a reason why you would hesitate
[01:04] <wallyworld_> sinzui: extra code, so thought i would check
[01:05] <sinzui> wallyworld_: add it, I can take a look if you want
[01:05] <StevenK> wgrant: Okay, unsubscribed and no redirect.
[01:05] <StevenK> wgrant: Remove the BVP?
[01:05] <wgrant> StevenK: You mean APG?
[01:05] <wallyworld_> sinzui: ok, will tidy up the prototype and add it properly, then update the mp. no rush, since it won't be deployed till next week anyway
[01:06] <StevenK> wgrant: Er, yeah.
[01:06] <wgrant> You still want it to be private, don't you?
[01:06] <wgrant> Right
[01:06] <wgrant> Gone
[01:06] <StevenK> wgrant: Right, so now we create foo-2 which should have me with an AAG?
[01:07] <wgrant> StevenK: Or I subscribe you to foo-1 again
[01:07] <wgrant> That's probably better
[01:07] <wgrant> StevenK: You are subscribed
[01:07] <StevenK> Too late, I pushed foo-2, but I get Forbidden on foo-1
[01:07]  * StevenK refreshes
[01:07] <wgrant> And +sharing confirms you have access
[01:07] <StevenK> Okay, unsubscribing
[01:08] <StevenK> wgrant: Redirected to https://code.qastaging.launchpad.net/python-oops-tools
[01:08] <wgrant> Excellent
[01:08]  * StevenK marks as qa-ok
[01:08]  * wgrant deploys
[01:08] <StevenK> wgrant: Oh, with the notification too, so it's excellent
[01:17] <StevenK> wgrant: Sigh, it looks like the poll doctests make use of POST
[01:32] <StevenK> wgrant: I bet that auditor failure was the usual omg-port-is-in-use-panic-and-catch-fire failure
[01:40] <wgrant> StevenK: If you haven't worked out what's broken, throw me a list of errors and I'll try to find out
[01:43] <StevenK> wgrant: I've fixed a few.
[01:44] <wgrant> StevenK: Any that were clicking submit?
[01:44] <wgrant> Or are those still proving troublesome?
[01:44] <StevenK> wgrant: Nope, I've dropped that for now.
[01:44] <StevenK> wgrant: I can scp the subunit stream to lillypilly if you wish
[01:44] <wgrant> StevenK: testr failing | utilities/paste
[01:45] <StevenK> wgrant: http://pastebin.ubuntu.com/1138836/
[01:48] <wgrant> StevenK: A lot of them seem to be posting manually
[01:48] <wgrant> Another one calls goBack just before the failure, which is possibly relevant
[01:50] <StevenK> wgrant: Yeah, I've started on xx-productseries.txt converting it to browser.post
[01:51] <wgrant> It looks like the actual change must be in zope.app.testing, rather than mechanize or testbrowser
[01:51] <wgrant> I think
[01:52] <wgrant> Since http() uses zope.app.testing's HTTPCaller directly.
[01:52] <wgrant> In fact
[01:53] <wgrant> I think all the browser-based failures are immediately after a goBack
[02:48] <lifeless> anyone up for a review - https://code.launchpad.net/~lifeless/python-oops-amqp/misc/+merge/119076 ?
[02:50]  * sinzui awards wgrant a gold ★
[02:50] <wgrant> sinzui: What have I done?
[02:50] <wgrant> lifeless: Looking
[02:50] <sinzui> made bugs public
[02:50] <wgrant> sinzui: Ah, yeah
[02:51] <wgrant> sinzui: I also removed ~launchpad-security's subscriptions to about 400 bugs
[02:51] <wgrant> That were public
[02:51] <wgrant> We can hopefully do away with the team soon.
[02:52] <sinzui> I just made qastaging launchpad/+sharing look more like I expect production to be.
[02:52] <wgrant> Great.
[02:52] <sinzui> We have a lot of bots that I left in place.
[02:52] <wgrant> I'm trying to clean things up so that we can use sharing as we intend sharing to be used :)
[02:52] <sinzui> I left many teams in place to keep subscriptions, then I looked at the bugs, saw they were closed, so I unsubscribed a lot of teams
[03:00] <wgrant> lifeless: Do you deliberately depend on both bsons?
[03:01] <lifeless> wgrant: james_w is doing a migration to the 'real' one across everything, incrementally.
[03:01] <lifeless> wgrant: so this is just dealing with that more or less
[03:01] <lifeless> wgrant: I've pushed up the fixes james_w asked for
[03:02] <wgrant> lifeless: Looks good, then.
[03:02] <wgrant> Thanks.
[03:38]  * StevenK stabs these tests
[03:44] <wgrant> StevenK: 'sup?
[03:45] <StevenK> wgrant: My switch from POST to browser.post() is not going well.
[03:45] <wgrant> Ah
[03:46] <StevenK> browser.post() is also triggering NoReferrerError
[03:46] <wgrant> Is there a referer?
[03:47] <StevenK> browser.open() ... browser.post() should set one?
[03:47] <StevenK> Or is my understanding of zope.testbrowser bonkers?
[03:48] <wgrant> I'm not sure that post uses the current context
[03:48] <wgrant> It's probably similar to open in that respect
[03:49] <StevenK> I'm not sure that sprinkling browser.addHeader('Referer', ...) into the test is the right behaviour either
[03:50] <wgrant> There's hopefully a prettier way to do that.
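Pending a prettier way, the addHeader workaround can at least be wrapped in a small helper. The StubBrowser below is a stand-in so the sketch runs on its own; the only real zope.testbrowser API assumed is addHeader(key, value) and post(url, data):

```python
class StubBrowser:
    """Minimal stand-in for zope.testbrowser.Browser (illustration only)."""
    def __init__(self):
        self.url = None
        self.headers = {}
        self.posts = []
    def open(self, url):
        self.url = url
    def addHeader(self, key, value):
        self.headers[key] = value
    def post(self, url, data):
        self.posts.append((url, data, dict(self.headers)))

def post_with_referer(browser, url, data):
    """Post, explicitly sending the current page as the Referer,
    since post() does not appear to set one from the browsing
    context the way form submission does."""
    if browser.url is not None:
        browser.addHeader('Referer', browser.url)
    browser.post(url, data)
```

Usage: open the page first, then `post_with_referer(browser, url, data)` instead of `browser.post(url, data)`.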
[04:10] <StevenK> ValueError: line 123 of the docstring for xx-productseries.txt lacks blank after ...: '  ...Support E4X in EcmaScript...'
[04:10] <StevenK> Bleh
[04:13] <StevenK> What the heck does 'lacks blank' even mean, anyway? :-(
[04:13] <wgrant> StevenK: It thinks it's a continuation of the previous statement
[04:15] <StevenK> ... how
[04:15] <wgrant> Like that, yes.
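The "lacks blank" complaint comes from doctest's parser: a line beginning with "..." directly under a ">>>" line is read as a PS2 continuation prompt, and the parser insists on a space after the "...". A minimal reproduction:

```python
import doctest

def parse_error(text):
    """Return doctest's parser complaint for malformed text, or None."""
    try:
        doctest.DocTestParser().parse(text, 'xx-example.txt')
    except ValueError as e:
        return str(e)
    return None

# Expected output that itself starts with "..." is mistaken for a
# continuation of the previous statement, triggering "lacks blank".
bad = ">>> print('fixed')\n...Support E4X in EcmaScript...\n"
```

Here `parse_error(bad)` returns a message containing "lacks blank after ...", while a genuine continuation line with a space after the prompt parses cleanly.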
[04:27] <StevenK>  lib/lp/blueprints/stories/blueprints/xx-productseries.txt
[04:27] <StevenK>   Ran 1 tests with 0 failures and 0 errors in 4.433 seconds.
[04:28] <StevenK> However, that was by adding .addHeader('Referer', ...)
[04:56] <lifeless> wgrant: stub is curious about the sso link removal project status
[04:57] <wgrant> lifeless: I have an SSO branch. It works. It has no tests, and IIRC it doesn't handle failure very well, and due to SSO's view structure it makes several XML-RPC requests
[04:57] <wgrant> So the whole thing needs a lot of refactoring
[04:57] <wgrant> Some of which landed three months after I proposed the branch
[04:57] <wgrant> But the LP side works fine, and fundamentally the SSO side is fairly easily doable.
[04:57] <lifeless> ok so some moderate work to do
[04:57] <wgrant> Yes
[04:58] <wgrant> And I think elmo will cry less if LP is never down :)
[05:13] <stub> Is the LP side a separate service or just the appserver?
[05:15] <stub> Now that I think of it, if we tear out the slony code from the appserver then I think it will happily respond to read-only requests when the master is down, because it doesn't need the master to calculate lag.
[05:16] <wgrant> stub: xmlrpc-private is always master-only at present, but indeed
[05:16] <stub> Might need a little polish, like ignoring the last write timestamp in the cookie, and no master only mode if lag > 2 minutes
[05:16] <wgrant> stub: Well
[05:16] <wgrant> Maybe
[05:17] <wgrant> Slave-capable things should use the slave if possible. If the slave is lagging too much, it should fall back to the master.
[05:17] <wgrant> If the master is unavailable, does it want to fail the request, or use the lagging slave?
[05:17] <stub> Use the lagging slave
[05:17] <wgrant> Right.
[05:17] <wgrant> That's my suspicion.
[05:18] <wgrant> It means we need to tweak things to only request master if they really need it
[05:18] <stub> We are interested in using up to date data. If the master is down, the lagged slave is still by definition the most up to date data we have
[05:18] <wgrant> Most XML-RPC requests are probably OK with a slave
[05:18] <wgrant> Actually
[05:18] <wgrant> Hm
[05:18] <wgrant> It might be similar to the API
[05:18] <wgrant> Where we want to use the master if at all possible
[05:19] <wgrant> Because why not be consistent and up to date
[05:19] <stub> Well, that is a bug
[05:19] <wgrant> So we want really up to date data, but we don't want to fail if we can avoid it
[05:19] <wgrant> So MasterPleaseIfAtAllPossibleDatabasePolicy :)
[05:19] <stub> By the time you receive your data, it might be out of date. We can never guarantee consistency, even from the master
[05:19] <wgrant> True.
[05:19] <wgrant> And I guess slave lag should be pretty minimal nowadays.
[05:19] <wgrant> "Nowadays" being since Monday.
[05:19] <spm> if the master is down, will the lag result actually show as lagged? I assume yes, but...
[05:19] <stub> So instead we just give data that is 'recent enough', which can come from a slave.
[05:20] <wgrant> spm: Yes
[05:20] <wgrant> spm: The "lag" is the age of the last WAL replayed from the master.
[05:20] <wgrant> stub: Right.
[05:20] <stub> The cutoff we care about is 'is the lag greater than my last write', which means we need a session identifier.
[05:20] <spm> cool. just conscious that the process on the master is what does the laggy updates, aiui
[05:20] <wgrant> spm: In the old world, yeah
[05:21] <spm> oh this is new shiny? nm then.
[05:21] <wgrant> spm: (although the old stuff also wouldn't break in this case: it stored lag plus a last update time, IIRC)
[05:21] <stub> spm: It's designed so that the master isn't particularly aware of the slaves, who they are or what they are doing.
[05:21] <spm> right
[05:21] <wgrant> So if the master goes away, the last update time lags, and clients can notice
[05:23] <wgrant> stub: So I think there's a place for a MasterPlease policy, which is used for eg. recently-POSTed web sessions and xmlrpc-private and all API sessions (until we have a reliable session identifier for API clients), which uses the master unless it's disappeared.
[05:23] <wgrant> Real write requests would still use the classical Master policy
[05:23] <wgrant> So would fail during fdt.
[05:24] <stub> yeah, sounds about right.
[05:24] <StevenK> MasterIfPossibleDatabasePolicy ?
[05:24] <StevenK> MasterWithFallbackDatabasePolicy perhaps
[05:26] <stub> We might be able to just tweak the existing LP policy.
[05:27] <stub> If the master is unavailable but asked for, give out a slave. If the POST or XML-RPC request or whatever attempts to UPDATE, it will fail with a read only violation. Put some lipstick on that, make it a 503 status code and we might be good.
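The fallback behaviour being discussed can be sketched as a toy selection function. The name choose_store and the MasterPlease idea are illustrative labels from this conversation, not Launchpad's real database policy classes:

```python
MAX_ACCEPTABLE_LAG = 120  # seconds; "no master-only mode if lag > 2 minutes"

def choose_store(master_up, slave_lag, wants_master):
    """Pick a database store for a read request (toy model)."""
    if wants_master:
        # "MasterPlease": use the master unless it has disappeared.
        if master_up:
            return 'master'
        # A lagged slave is still the freshest data we have.
        return 'slave'
    # Slave-capable requests use the slave unless it lags too far
    # behind, then fall back to the master if it is available.
    if slave_lag <= MAX_ACCEPTABLE_LAG or not master_up:
        return 'slave'
    return 'master'
```

A request routed to a slave that then attempts an UPDATE would fail with a read-only violation, which is the case stub suggests dressing up as a 503.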
[05:28] <lifeless> stub: btw, do we set the feedback setting on the slaves?
[05:28] <stub> lifeless: yes, I didn't want to mess with behaviour too much just yet
[05:29] <stub> lifeless: it is probably how we will keep it too.
[05:30] <lifeless> hot_standby_feedback is the one I mean; defaults off but looks like we may want it on
[05:42] <stub> it is on for us, yes
[05:56] <lifeless> cool cool
[07:53] <adeuring> good morning
[10:02] <ev> is there anyone I need to notify if I'm going to do a large number of lplib API calls in a test?
[10:13] <wgrant> ev: How many is 'large', what sort of calls, and can you do it on (qa)staging instead/first?
[10:14] <ev> wgrant: apols, I just realized that was hopelessly vague. So I have 81,455 crashes. I'm going to get the package for each and all of the relevant dependency packages. I then need to call getPublishedBinaries for each of those (but I'll cache calls on a key of package + series)
[10:14] <ev> and yes, I can do it on staging first. Is that woefully slow by comparison?
[10:15] <cjwatson> Is this a one-off or in a frequently-run test suite?
[10:15] <wgrant> I forget whether getPublishedBinaries is terrible or not
[10:15] <wgrant> It should be reasonably fast even on staging if it doesn't do any stupid substring matching
[10:15] <ev> cjwatson: it will be run daily, but for right this moment it's just a one off to get some basic data
[10:15] <ev> wgrant: I have exact_match set, though I do realize there could be substring matches elsewhere
[10:16] <wgrant> I think that should be most of it
[10:16] <wgrant> So, I'd try it on staging first.
[10:16] <wgrant> It should be fairly quick once it's warmed up
[10:16] <ev> excellent
[10:16] <wgrant> Although
[10:16] <ev> wgrant: should I notify webops as well?
[10:16] <wgrant> At 82k crashes
[10:16] <cjwatson> getPublishedBinaries - which publication statuses?
[10:16] <wgrant> Presumably there's lots of deps?
[10:16] <wgrant> For each?
[10:16] <wgrant> Ah, but if you cache...
[10:17] <lifeless> ev: no need to notify webops if you are doing these calls serially.
[10:17] <lifeless> ev: it's less than a 1% increase in traffic.
[10:17] <ev> lifeless: it is serially, and hi :)
[10:17] <lifeless> ev: if you're doing it in parallel, that's another matter :)
[10:17] <lifeless> ev: oh hi :)
[10:17] <cjwatson> If it's just "Published", it would be better to just get the relevant Packages files from a mirror and parse locally ...
[10:17] <wgrant> It's not going to be a problem, but we can probably make it faster :)
[10:17] <wgrant> Yeah
[10:17] <wgrant> That's the thing
[10:17] <lifeless> ev: btw, I did get you commit access to lp:python-oopsrepository right?
[10:17] <wgrant> I don't see why you don't just use the normal indices
[10:18] <cjwatson> If you're trying to get historical publication information for some reason, that would be different
[10:18] <ev> lifeless: yes, I've been terrible and haven't merged it back yet. Will do today.
[10:18] <cjwatson> Like when they were superseded or something
[10:19] <lifeless> ev: we've got webops, u1, ca and LP all using one python-oops-tools system now
[10:19] <lifeless> ev: so the interest in migrating to a cassandra backend is growing.
[10:19] <ev> lifeless: though I'll throw it up as a MP first, just so people have a chance to tell me no before I merge it in
[10:19] <ev> lifeless: excellent!
[10:19] <ev> lifeless: yeah, I've had brief conversations with james_w about it, and you I believe :)
[10:20] <ev> cjwatson: historical information. It's about creating an "ideal" crash line
[10:20] <ev> that is, crashes where every package in the dependency chain that apport lists was up to date at the time of the crash
[10:21] <ev> cjwatson: I've forwarded you a mail I sent to lifeless explaining the basic idea
[10:22] <ev> the code will be something akin to this http://paste.ubuntu.com/1139323/  (at least for the test)
[10:23] <cjwatson> Can you use created_since_date
[10:23] <cjwatson> ?
[10:23] <ev> since we now have to calculate the unique users seen in the past 90 day period for the denominator, and that's not a calculation that can be done quickly, the whole thing will be calculated once a day for the day that's passed
[10:24] <cjwatson> Consider ordered=False too, since you don't appear to need ordered results
[10:24] <ev> (with the "actual" line being total crashes divided by unique users in 90 days and the "ideal" line being total crashes that were on up to date systems divided by unique users in 90 days)
[10:24] <ev> cjwatson: created_since_date doesn't work as far as I can tell for the reason mentioned in the code comment. But maybe I'm wrong?
[10:25] <ev> cjwatson: ordered=False> excellent, will do
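The package+series cache ev mentions is a straightforward memoisation. Here `fetch` stands in for a call like archive.getPublishedBinaries(binary_name=..., exact_match=True, ordered=False); the wrapper's shape is an assumption, not ev's actual code:

```python
def make_cached_lookup(fetch):
    """Memoise publication lookups on a (package, series) key so
    repeated dependencies only hit the API once per run."""
    cache = {}
    def lookup(package, series):
        key = (package, series)
        if key not in cache:
            cache[key] = fetch(package, series)
        return cache[key]
    return lookup
```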
[10:25] <wgrant> ev: Would you be better served by maintaining a full set of when each (name, version, arch) first appeared?
[10:25] <wgrant> Rather than querying most of Ubuntu's history every day
[10:26] <cjwatson> Yeah, surely there's some kind of inter-run caching possible here
[10:26] <ev> wgrant: so cache the package name, version, and arch tuple into cassandra?
[10:26] <cjwatson> It's not like binary_package_version or date_published on past publications are going to change
[10:26] <wgrant> Right
[10:26] <wgrant> The history won't change
[10:26] <lifeless> ev: is that 81K distinct crash signatures?
[10:26] <lifeless> ev: or 81K reports ?
[10:26] <ev> yeah, sure
[10:26] <cjwatson> date_superseded might of course, but you aren't looking at that
[10:26] <wgrant> You can easily keep a local copy of the relevant bits of history
[10:26] <ev> lifeless: 81K reports for a day period
[10:27] <ev> which seems about average
[10:27] <lifeless> so few
[10:27] <wgrant> And use created_since_date to just bring in all the new records every $interval
[10:27] <cjwatson> It might even be worth doing one getPublishedBinaries call with created_since_date for the whole interval, rather than one per binary name?
[10:27] <ev> cjwatson: right, just when it was published
[10:27] <wgrant> cjwatson: Exactly.
[10:28] <wgrant> You keep a local database of (name, version, arch, date_published/date_created), then every $interval ask for all the new publications since the last time you asked - a bit
[10:28] <cjwatson> It's per-series so the initial setup will have some giant returned result set, but only back <six months
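The scheme wgrant and cjwatson outline might look roughly like this. `fetch_since` stands in for a getPublishedBinaries call with created_since_date; the class is a sketch of the idea, not real client code:

```python
class PublicationCache:
    """Local record of when each (name, version, arch) binary first
    appeared, refreshed incrementally so each daily run only pulls
    publications created since the previous sync."""

    def __init__(self):
        self.first_seen = {}   # (name, version, arch) -> date_created
        self.last_sync = None  # None means "do the big initial fetch"

    def update(self, fetch_since, now):
        # fetch_since(since) stands in for something like
        # archive.getPublishedBinaries(created_since_date=since, ...)
        # yielding (name, version, arch, date_created) tuples.
        for name, version, arch, created in fetch_since(self.last_sync):
            key = (name, version, arch)
            if key not in self.first_seen or created < self.first_seen[key]:
                self.first_seen[key] = created
        self.last_sync = now
```

Since past publication history never changes, the cache only ever grows, and the daily query shrinks from "most of Ubuntu's history" to "yesterday's new publications".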
[10:29] <ev> where interval is the daily run of this code to generate the totals for the ideal line for the day past, right?
[10:29] <mpt> lifeless, once this publishing history discussion ^ is sorted, we have a fun question about calculation of that "ideal" line
[10:29] <wgrant> ev: Well, it doesn't really have to be this code
[10:29] <wgrant> ev: The update process is separate.
[10:29] <lifeless> mpt: cool
[10:30] <wgrant> ev: You can rapidly query your local cache of the relevant info whenever.
[10:30] <lifeless> FWIW I don't care whether ev caches the data or not.
[10:30] <lifeless> datastores are data stores.
[10:30] <ev> wgrant: okay, but still iterating over the same data set, right? My point is that it's not building a cache for packages it's not going to care about. Just ones for the oopses and their dependencies that we've seen day by day
[10:31] <ev> lifeless: you'd argue for talking directly to LP without a cache?
[10:31] <lifeless> LP can trivially handle the load; the current API may be inefficient, but it's got no intrinsic reason to be so.
[10:31] <wgrant> lifeless: Datastores are datastores, but the LP API is about as inefficient as it gets.
[10:31] <lifeless> ev: I would start with the simplest thing possible.
[10:31] <wgrant> Cache locally => hundreds of times faster
[10:31] <lifeless> ev: and add complexity only when I had to.
[10:32] <wgrant> ev: How many packages is that? The closure of dependencies could be fairly large.
[10:32] <ev> wgrant: I can calculate an approximation based off a days run
[10:32] <lifeless> ev: e.g. start by just talking to LP; then the next step either make the LP API faster (often easy, lots of unoptimised stuff) or add a local store.
[10:33] <ev> I'll just add that to the set of things to count in this sample
[10:33] <ev> lifeless: okay
[10:33] <lifeless> wgrant and cjwatson may be entirely correct that doing it via LP will be terrible (and the hidden HTTP requests launchpadlib does are likely to prove them right :P)
[10:34] <lifeless> but it's still better to deliver something soon and then iterate IMNSHO
[10:34] <ev> absolutely
[10:35] <cjwatson> I wouldn't be making these suggestions if I thought they were hard to implement :)
[10:35] <cjwatson> FWIW
[10:35] <lifeless> cjwatson: sure, and I don't think they are necessarily wrong.
[10:36] <cjwatson> as in, it's what I'd do and I expect writing the code for it would be quicker than waiting for the initial "easy" but slow version to complete
[10:36] <lifeless> I'm just rather aware of the political side of getting this data as soon as possible, due to that u-r thread of doom
[10:36] <cjwatson> so I think in this case the "easy" version is a false economy
[10:37] <lifeless> cjwatson: I'm not suggesting using launchpadlib directly because it's easier, but because it has less moving parts.
[10:37] <cjwatson> even so
[10:37] <ev> I suspect I'm going to lose the thread of doom. There's no way I can get the changes to apport for a single pair of dialogs done by the end of the day. Well, I can probably have the code done, but then there's getting pitti to magically appear and review it, and it's quite deep.
[10:37] <lifeless> FWIW my initial suggestion to ev was a dedicated API to do the heavy lifting in LP.
[10:39] <ev> indeed, I wasn't expecting to get to such optimization just yet as this conversation started off discussing an initial test
[10:39] <ev> so, mpt. Maths
[10:39] <ev> gah, NOT A PLURAL WORD
[10:39] <mpt> Road works
[10:39] <lifeless> Mathematics
[10:40] <ev> :)
[10:40] <mpt> lifeless, so. The graph aims to show the average number of crashes per calendar day. (Making it per 24 hours of uptime, to eliminate the spike during weekends, is a problem we've tabled for now.)
[10:41] <mpt> lifeless, to do that we take the number of errors reported each day, and divide it by an estimate of the number of machines from which errors would be reported if they happened.
[10:42] <mpt> As an estimate of "the number of machines from which errors would be reported", we use "the number of machines that reported at least one error any time in the 90 days up to that day".
[10:43] <mpt> That slightly under-counts because of machines that were active but lucky enough not to have any errors. And it slightly over-counts because of machines that were destroyed or had Ubuntu removed from them during that 90-day period.
[10:43] <mpt> Hopefully the under-count and over-count cancel each other out.
[10:43] <mpt> Anyway.
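The calculation mpt has just described can be written out as a sketch, with integer day indices and toy data rather than the production query:

```python
def actual_line(reports_per_day, machines_by_day, days, window=90):
    """The 'actual' line: errors reported each day, divided by the
    number of distinct machines that reported at least one error in
    the `window` days up to and including that day."""
    line = {}
    for day in days:
        population = set()
        for d in range(day - window + 1, day + 1):
            population |= machines_by_day.get(d, set())
        # Guard against an empty window in toy data.
        line[day] = reports_per_day.get(day, 0) / max(len(population), 1)
    return line
```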
[10:43] <lifeless> uh,
[10:43] <lifeless> it massively undercounts
[10:44] <lifeless> but thats a different point
[10:44] <mpt> ok, why does it massively undercount?
[10:44] <lifeless> you want 'size of the population of machines with error reporting turned on and users that don't always hit no'
[10:45] <mpt> "users that would usually hit yes", but yes.
[10:45] <lifeless> you are getting '90-day sliding observation of [machines with error reporting turned on and users that don't always hit no] that encountered 1 or more errors and reported them'
[10:45] <lifeless> mpt: how often does the error reporting message come up for you
[10:46] <mpt> lifeless, about three or four times a week.
[10:46] <lifeless> mpt: so for me it comes up -maybe- once a month. I think twice since precise released.
[10:46] <mpt> lifeless, if it turns out that the average is anywhere close to 1/90, then we'll need to increase the 90-day period to more than that.
[10:46] <lifeless> mpt: the underreport is due to all the machines that don't encounter errors at all
[10:47] <lifeless> and you can't tell how big the under report is because the sample you have is only from reporting machines.
[10:47] <mpt> So? So is the numerator.
[10:47] <lifeless> I mean, machines that are biased to report at a frequency of 1 in 90 days or greater.
[10:48] <lifeless> mpt: I don't follow how that matters
[10:48] <lifeless> mpt: you said "As an estimate of "the number of machines from which errors would be reported"
[10:48] <mpt> yes
[10:48] <lifeless> mpt: I'm saying that it seems likely to me that your estimate is very low. We can test this theory.
[10:49] <lifeless> ev: whats the current unique reporting machine count for the last 90 days ?
[10:49] <mpt> We're assuming the number of errors/day is a unimodal distribution (probably Poisson), and that there aren't a lot of machines that have zero errors in a 90-day period
[10:50] <ev> lifeless: I don't have that yet. The query was taking more than 12 hours to back-populate so I need to come up with a quicker approach.
[10:50] <ev> but
[10:50] <mpt> (where "aren't a lot" = "are fewer than 1.1%")
[10:51] <ev> oh, nevermind. I thought I had a quick way to get the unique machines for all releases for the past 90 days
[10:51] <lifeless> mpt: For a single individual, the distribution should be poisson, unless the way they use their machine influences crash rates, in which case it won't be.
[10:51] <ev> but it's not as quick as I thought
[10:53] <lifeless> mpt: this is a distraction; we can investigate whether we have a underestimate or not separately.
[10:53] <mpt> lifeless, I think if we are massively under-counting, then the next rollout of the graph will probably show an average of close to 0.01 errors/day, because it was dominated by machines that reported only one error in that 90-day period.
[10:54] <mpt> Anyway, distraction, yes.
[10:54] <mpt> For the ideal line, we want to show the effect of people installing updates or not.
[10:54] <lifeless> mpt: that doesn't necessarily follow. We know our estimate should match some fraction of the precise userbase, where that fraction is the number of users that leave the tick box on and click continue.
[10:55] <mpt> Not how quickly they are, but how much their promptness/tardiness affects Ubuntu's reliability in the wild.
[10:55] <lifeless> mpt: we can independently estimate that fraction, multiply by the separate estimate of Precise desktop users, and compare to the errors.ubuntu.com estimate.
[10:55] <lifeless> mpt: if they differ substantially, one or more of the estimators is wrong.
[10:55] <ev> (so I can quickly get the number of unique users that have ever reported crashes, so something just over 120 days, and that's 1,975,010)
[10:56] <lifeless> mpt: 81K*90 = 7.2M, which is too low I believe.
[10:56] <lifeless> ev: great.
[10:56] <lifeless> That means we're massively underestimating :)
[10:56] <mpt> lifeless, it's much less than 81K*90, because many of those machines are the same in multiple days
[10:57] <lifeless> mpt: sure, I used 81K*90 as an upper bound
[10:57] <lifeless> mpt: because if it was still too low, there is no way that any answer ev gave could be higher.
[10:57] <mpt> sure
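lifeless's bound can be written out in a couple of lines. The values come from the conversation; the comparison against the real Precise desktop population is the part that can't be computed here:

```python
# Even if every report on every one of 90 days came from a distinct
# machine, the 90-day reporting population cannot exceed this bound.
reports_per_day = 81_455
upper_bound = reports_per_day * 90   # lifeless's rough "81K*90 = 7.2M"
observed_unique = 1_975_010          # ev's ~120-day unique machine count

print(upper_bound)                   # 7330950
print(observed_unique < upper_bound) # True
```

If the actual Precise desktop population (times the fraction of users who report) is much larger than ~7.3M, then an estimate bounded above by it, and observed at ~2M, is a substantial underestimate.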
[10:57] <lifeless> ok, so ideal line.
[10:58] <lifeless> so you want to show the number of crashes per day that would be saved if users updated ?
[10:58] <mpt> If we calculate it right, the "ideal" line will be like a smooth + lagged version of the "actual" line.
[10:58] <mpt> wait, no, other way around.
[10:58] <mpt> The "actual" line will be like a smooth + lagged version of the "ideal" line.
[10:59] <mpt> If we issue a fix for an error that's causing 50% of the errors reported, the "ideal" line will drop down to half its previous level immediately, and the "actual" line will drift down slowly to meet it.
[11:00] <mpt> Conversely, if something goes wrong and we issue a really crashy update, the (now-misnamed) "ideal" line will spike up, and the "actual" line will drift up to meet it.
[11:00] <lifeless> Sure
[11:00] <lifeless> s/ideal/projected/
[11:00] <lifeless> potential
[11:00] <lifeless> possible
[11:00] <mpt> something like that.
[11:01] <mpt> "If all updates were installed", we call it in the page currently
[11:01] <mpt> just for clarity :-)
[11:01] <mpt> Now, for this we do the same kind of division as before
[11:01] <lifeless> yeah
[11:02] <mpt> The numerator is the number of error reports on that day, for which that package and all its dependencies were up to date
[11:02] <mpt> But we're not sure what the denominator should be.
[11:02] <mpt> I thought that it should be the same denominator as the "actual" line, the estimate of all machines that would typically report errors if they encountered them.
[11:03] <mpt> That passes the sanity test that if every Ubuntu machine was perfectly up to date, the "actual" and "potential" lines would be exactly the same.
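Both lines with the same denominator, as mpt proposes, can be sketched with toy report tuples; what counts as "up to date" (package plus all apport-listed dependencies current at crash time) is computed elsewhere and just a boolean here:

```python
def actual_and_ideal(reports, population):
    """Return the 'actual' and 'ideal' values for one day.
    Each report is (machine_id, was_up_to_date); `population` is the
    estimated count of machines that would report errors."""
    actual = len(reports) / population
    ideal = sum(1 for _, up_to_date in reports if up_to_date) / population
    return actual, ideal
```

With this definition the sanity check holds: if every machine were perfectly up to date, every report's flag would be True and the two lines would be identical.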
[11:03] <lifeless> uhm
[11:03] <lifeless> why calculate it from scratch ?
[11:04] <lifeless> I mean, the way you expressed it: 'issue a fix for an error that's causing 50% of the errors reported, the "ideal" line will drop down to half its previous level'
[11:04] <mpt> If you mean the denominator, I'm not suggesting calculating it from scratch
[11:04] <mpt> oh
[11:05] <lifeless> when everything is up to date / there are no fixes available then ideal == actual
[11:06] <mpt> Because errors are not evenly distributed over updates. For example, there are a bunch of machines out there (we don't know how many) that install only security updates, not other updates, and security updates may be more or less likely than average to fix reportable errors.
[11:06] <mpt> s/over updates/over updated packages/
[11:06] <jml> any buildout folks around?
[11:08] <lifeless> jml: passingly familiar
[11:09] <lifeless> mpt: ideally all machines install all updates right?
[11:09] <mpt> lifeless, anyway, I *think* (though I'm not sure) that using the total number of 90-day-active machines may cause the "projected" line to be too low. The slower people install updates, the lower the number of errors will be from up-to-date packages, but that doesn't mean those up-to-date packages are more reliable.
[11:10] <mpt> lifeless, ideally, yes.
[11:10] <jml> we have a Python project that comes with executable files. buildout (and pretty much anything that installs from eggs) loses the executable bit. afaict, this is due to a bug in stdlib zipfile where the external_attr part of the ZipInfo is ignored on extraction.
[11:11] <mpt> lifeless, but remember, we're not trying to measure the proportion of machines that are all up to date, we're trying to measure how much reliability is affected by packages being out of date.
[11:11] <lifeless> ok, so lets talk reliability estimators
[11:11] <mpt> If a package is way out of date but doesn't generate any error reports, that's fine as far as this graph is concerned.
[11:11] <jml> my immediate plan is to carry a temporary fork of distribute that corrects extract_zipfile in setuptools.archive_utils to chmod after extraction.
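The workaround jml describes can be sketched in a few lines: `zipfile` stores Unix permissions in the high 16 bits of `ZipInfo.external_attr` but ignores them on extraction, so the fix is to chmod each file afterwards. A minimal sketch (the function name is hypothetical, not distribute's actual code):

```python
import os
import zipfile

def extract_preserving_modes(zip_path, dest):
    """Extract a zip archive and restore Unix permission bits.

    zipfile's extract() ignores ZipInfo.external_attr, so we chmod
    each extracted file ourselves using the mode stored in the
    archive's high 16 bits.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            path = zf.extract(info, dest)
            mode = (info.external_attr >> 16) & 0o7777
            if mode and not info.is_dir():
                os.chmod(path, mode)
```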
[11:11] <lifeless> jml: executable files in the package? or scripts that should be executable after install
[11:12] <jml> lifeless: the first one.
[11:12] <lifeless> jml: that's frowned upon. lintian will whinge, for instance.
[11:12] <lifeless> jml: why do you want that ?
[11:13] <jml> lifeless: I am not changing it.
[11:13] <lifeless> jml: the question is too abstract.
[11:13] <jml> lifeless: but, since you asked, it's because pkgme's interface to backends is by spawning sub-processes
[11:13] <jml> a backend is just a couple of executables
[11:14] <jml> if those backends happen to be written in Python, then distributing them is very tedious, thanks to this bug in zipfile
[11:14] <lifeless> jml: I don't think its a bug.
[11:14] <jml> lifeless: tarfile preserves permissions
[11:15] <cjwatson> lifeless: lintian> that rather depends on where the files in question are installed, and whether they include #!
[11:15] <jml> lifeless: why should zipfile not?
[11:15] <lifeless> jml: the interface for running things from a python package is python -m foo.bar
[11:15] <lifeless> jml: not /usr/lib/pythonx.y/dist-packages/foo/bar.py
[11:15] <jml> lifeless: it's a script
[11:15] <lifeless> jml: if its a script, the installation of it should be putting it in the right bin directory, updating the interpreter and making it executable for you.
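The install behaviour lifeless describes (script placed in the bin directory, interpreter line rewritten, executable bit set) is what setuptools `console_scripts` entry points provide. A hypothetical sketch; the project and module names are illustrative, not pkgme's:

```python
# setup.py -- hypothetical sketch; names are illustrative.
from setuptools import setup

setup(
    name="example-backend",
    packages=["backend"],
    entry_points={
        "console_scripts": [
            # On install this generates an executable "example-backend"
            # wrapper in the environment's bin directory, pointing at
            # the right interpreter, with the executable bit set.
            "example-backend = backend.cli:main",
        ],
    },
)
```

This also sidesteps the zipfile permission problem entirely, since the wrapper is generated at install time rather than extracted from the egg.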
[11:16] <lifeless> cjwatson: 'in a python package' is well defined, and #! in those files also warns IIRC.
[11:16] <cjwatson> Oh, you meant Python package, right, you just said package :)
[11:17] <jml> lifeless: pkgme works by searching a list of paths that contain backends and then running the executables it finds there.
[11:17] <cjwatson> Though I'm not sure I believe your IIRC without proof, as I don't immediately see evidence of that in lintian
[11:22] <lifeless> cjwatson: I may well be horribly mistaken
[11:24] <lifeless> jml: so, with buildout, that won't work unless those scripts have no dependencies that buildout is supplying.
[11:24] <lifeless> jml: you need to run them via bin/py <path to python file> or bin/py -m <python module path>
[11:25] <lifeless> jml: otherwise you will get the system interpreter path.
[11:25] <jml> lifeless: yes, this is true of virtualenv too
[11:25] <lifeless> jml: To me, this makes the issue you are actually facing irrelevant.
[11:25] <lifeless> jml: Have I missed something ?
[11:26] <jml> we have some work-around for that atm
[11:28] <jml> which I'm having trouble locating
[11:29] <jml> ah yes. it's hideous and won't work with buildout.
[11:30] <lifeless> so, we can talk about the mode bug
[11:30] <jml> but I know something worse that will
[11:30] <lifeless> but I don't think it will help you will it ?
[11:33] <lifeless> jml: its late, I need to go. I'd be happy to design something simple that will work for you, just not now.
[11:33] <jml> lifeless: ok, thanks.
[11:53] <jam> trying to go to: https://launchpad.net/projects/+review-licenses is timing out for me. It worked for a bit yesterday (until I reviewed a bunch of projects and then reloaded)
[11:54] <jam> I submitted a bug, is there much else to do (I'm trying to take care of the review queue, etc)
[11:56] <wgrant> jam: It's working for me. Tried refreshing a couple of times?
[11:56] <wgrant> My superpowers may be causing permission checks to be skipped though
[11:57] <rick_h_> yea, 6 loads here all timeouts
[11:59] <rick_h_> timeout backs to the is_valid_person ValidPersonCache.get and all storm from there on out
[11:59] <StevenK> Sounds like preloading is in order then
[12:00] <jam> rick_h_: yeah, my OOPS shows 1.7s run in a single query, though running it on staging completes in 17ms...
[12:01] <wgrant> jam: OOPS ID?
[12:02] <jam> wgrant: https://oops.canonical.com/oops/?oopsid=OOPS-3abca09f555663402bbd26a37805e0a0
[12:03] <wgrant> Ah
[12:03] <wgrant> I bet it's the private team privacy adapter
[12:03] <wgrant> But sooooo many queries
[12:04] <wgrant> Indeed
[12:04] <wgrant> The insane private team privacy rules
[12:05] <wgrant> jam: Can you see it now?
[12:06] <wgrant> I've removed the only obvious private team from the listing
[12:06] <jam> wgrant: timeout again
[12:06] <jam> wgrant: new oops: https://oops.canonical.com/oops/?oopsid=OOPS-5c67f643d64f10768d671842a80680e0
[12:06] <wgrant> Ah, there's another one
[12:06] <wgrant> Try now?
[12:07] <rick_h_> loads now
[12:07] <wgrant> Yeah
[12:07] <wgrant> There were two private teams on Canonical projects
[12:07] <jam> yay
[12:07] <jam> though there still seems to be death-by-thousand-cuts on that page
[12:07] <rick_h_> good to know, should we get a hard timeout exception for the review page for now?
[12:07] <rick_h_> so it's usable for maint?
[12:07] <jam> If you look at the new oops, it has one query repeated 56 times.
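StevenK's suggestion of preloading is the standard fix for a query repeated once per row: replace the 56 identical per-person lookups with one set-based query before rendering. A generic sketch using sqlite3; the table and column names are hypothetical, not Launchpad's actual ValidPersonCache schema:

```python
import sqlite3

def preload_valid_people(conn, person_ids):
    # One round trip for the whole listing, instead of one
    # validity-cache lookup per person while rendering.
    ids = list(person_ids)
    if not ids:
        return set()
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        "SELECT id FROM validpersoncache WHERE id IN (%s)" % placeholders,
        ids,
    ).fetchall()
    return {row[0] for row in rows}
```

The page render then consults the returned set instead of the database, turning N queries into one.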
[12:07] <wgrant> jam: Yeah, fortunately they're pretty small cuts in this case.
[12:08] <wgrant> rick_h_: Maybe. But AIUI czajkowski is back next week, so it's not maintenance's responsibility any more
[12:08] <wgrant> And she is immune to these issues
[12:08] <rick_h_> ah, wasn't aware she was immune
[12:09] <wgrant> This is useful ammunition in my war against lifeless' overcomplicated private team visibility rules :)
[12:10] <jam> wgrant: weird, I see the helensburgh project in there, even though it succeeded in loading... :)
[12:10] <wgrant> jam: It's driven by a private team, not owned by one
[12:10] <wgrant> That page only shows owners
[12:10] <wgrant> (well, and registrants)
[12:18] <rick_h_> huwshimi: ping, do we have a login for the web balsmiq?
[12:18] <rick_h_> huwshimi: or did you just get the air version running?
[12:19] <huwshimi> rick_h_: I think so, I'll forward the details.
[12:19] <rick_h_> huwshimi: ty
[12:21] <cjwatson> wgrant: Care to re-review https://code.launchpad.net/~cjwatson/launchpad/archive-getallpermissions/+merge/117606 ?  I'm becoming good friends with StormStatementRecorder.
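For readers who haven't met it: StormStatementRecorder captures the SQL statements issued inside a block, so a test can assert that the code under review stays within a fixed query budget. A minimal sketch of the same idea using sqlite3's trace hook (class name borrowed, implementation hypothetical, not Launchpad's):

```python
import sqlite3

class StatementRecorder:
    """Record every SQL statement a connection executes, so a test
    can assert a query count (e.g. that a listing page does not issue
    one query per row)."""

    def __init__(self, conn):
        self.statements = []
        self._conn = conn
        conn.set_trace_callback(self.statements.append)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Stop recording when the block exits.
        self._conn.set_trace_callback(None)
        return False
```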
[12:23] <wgrant> cjwatson: Looks good, thanks.
[12:24] <wgrant> cjwatson: (although you might want to rename allPermissions to getPermissionsForArchive or permissionsForArchive or something)
[12:25] <cjwatson> Mm, yeah.  The almost-but-not-quite-the-same names between Archive and ArchivePermission are a tad confusing.
[12:25] <wgrant> Yeah
[12:25] <wgrant> ArchivePermissionSet's are wrong
[12:26] <wgrant> Because method names are meant to be verbs
[12:26] <wgrant> But consistency might be best there for now
[15:09] <deryck> rick_h_, hey, how about at 15 after?  in roughly 5 minutes, for our call?
[15:10] <rick_h_> sure thing
[18:44] <lifeless> wgrant: unoptimised thing is slow isn't a surprise
[19:46] <Ergo^> evening