[06:50] <vorlon> fwiw I'm rebooting the autopkgtest master to see if that helps with the i/o issues (the system has been up for quite a while and I don't know that I trust the kernel).  However I didn't count on systemd insisting on nuking all of /tmp on boot before it lets me back in, and that's where all of the autopkgtest units' working files are, so it'll be a while before it's back up
[06:50] <vorlon> (hopefully it'll come back before I timeout and have to go to bed, but there's really no telling at this point)
[07:04] <vorlon> Laney, juliank: ^^ I have no idea, maybe it'll come up and be healthy, or maybe it'll need some babysitting when it comes up (to kill off instances left running?), but I can't stay up to wait for it indefinitely; if either of you happen to be at keyboard today when it comes up and want to have a look, great, otherwise I'll check in in ~8h
[07:10] <Laney> vorlon: might be around a bit later on (out for a few hours now). fwiw (if rebooting doesn't help), if you reduce parallelism to say 5 units per arch then it'll probably decrease load enough to be able to chew through the chunky jobs, and it can then be increased again afterwards
[09:35] <juliank> everything seems ok
[09:35] <juliank> 163 tar processes running, fun, but machine is responsive
[09:37] <Laney> check if the big jobs (binutils, libreoffice, chromium, k*) are actually being processed rather than timing out repeatedly
[09:37] <Laney> if you see them bunched up copying back to the controller, that is a suspicious sign
[09:38] <Laney> maybe watch journal for a bit to see what's actually going on
[09:38]  * Laney out again
[09:39] <juliank> I see exceed quota, but no timeouts so far, will look in from time to time
[09:40] <juliank> Ah I think the chromium ones are running right now
[09:53] <juliank>                                                                                                                 autopkgtest [09:49:23]: ERROR: testbed failure: sent `copyup /tmp/autopkgtest.7oD5ck/build.OEX/src/ /tmp/autopkgtest-work.emdsp_b7/out/tests-tree/', got `timeout', expected
[09:53] <juliank> yeah, so did not help
[09:59] <juliank> I guess decreasing parallelity it is
[10:00] <juliank> * parallelism
[10:12] <acheronuk> juliank: it did a few days ago. bigger questions is though why this problem now? the infra did used to cope. granted that some things slowed progress a fair bit, but there was not this timeout -> respawn job to front of queue paralysis. somthing on the test machines/infra/whatever has regressed
[11:06] <juliank> acheronuk: I don't know
[11:06] <juliank> acheronuk: I did decrease #workers to 5 per-arch
[11:06] <juliank> and it seems we reduced tar processes by 50%
[11:07] <juliank> I/O load hence seems lower now, which should hopefully mean we don't timeout copying anymore
[11:08] <juliank> What I think we'd have to do is ensure that we only have a certain number of needs-build jobs running at the same time
[11:11] <juliank> even ls /tmp works now
[11:12] <juliank> it might be useful splitting that up a bit
[11:12] <juliank> /tmp/autopkgtest-{fifo,work,ssh}.$randomstring
[11:12] <juliank> we have that for each running autopkgtest
[11:13] <juliank> that means /tmp has a lot of files in that dir
[11:13] <juliank> it might make sense to build hierarchies
[11:14] <juliank> e.g. /tmp/autopkgtest-{fifo,ssh,work}.1rdoc1m_ => /tmp/autopkgtest.1r/do/c1/m_/autopkgtest-{fifo,ssh,work}.1rdoc1m_
[11:14] <juliank> like git does
[11:14] <juliank> to not have that dir grow to a huge ton of entries
[11:14] <juliank> we should see whether chromium copy worked in ~ 40 mins
[11:30] <acheronuk> juliank: fine. but fact remains that this is a recent phenomena, or one whose severity and prevalence has dramatically increased lately
[11:31] <acheronuk> now I think on it, I did notice that towards the end of disco cycle, some jobs did appear to struggle
[11:33] <acheronuk> e.g kwin, and plasma-workspace jobs did start to at some time seem to stall for hrs, then restart, without any explanation to me where all I could do was look at user facing logs
[11:35] <acheronuk> annoying, but only occasional, even when the infra was very loaded up with K* tests, and they mostly did finish the apparent subsequent go, so I did not flag it up
[11:35] <acheronuk> it is possible other *BIG* tests not KDE did the same, but I did not watch those
[11:37] <acheronuk> however, something has changed recently that means that, this is now more likely than not under load, and hitting a wider range of tests
[11:42] <acheronuk> anyway. that is how I have observed things change recently (and I do watch a lot when many KDE tests are queued), whatever that means
[12:05] <juliank> I'm sorry to inform you that copying timed out again
[12:06] <juliank> oh, but one succeeded
[12:06] <juliank> (of chromium-browser/s390x)
[12:07] <juliank> ah no, arm64 succeeded, s390x timed out
[12:08] <juliank> https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-eoan/eoan/arm64/c/chromium-browser/20190427_115905_661f8@/log.gz
[12:08] <juliank> wooho
[12:08] <juliank> why am I not raising the copy timeout?
[12:09] <juliank> it would be better than retrying again and again
[12:10] <juliank> but it already is a 100 mins, so...
[12:12] <juliank> it seems likely raising it by 20 mins solves some of those issues
[12:15] <juliank> I increased it by 20 mins, and requested restarts once current jobs are complete
[12:19] <juliank> FWIW, solving this means setting up a second cloud worker, which probably means we should do a staging cloud worker first
[12:19] <juliank> that's in progress, but will probably take some time
[13:18] <Laney> acheronuk: can you please stop theorising that something changed and we're missing it?
[13:18] <Laney> it's simply not true (we have in fact had this problem before)
[13:19] <Laney> the last time was when we created the card to work on the parallelism stuff that juliank has been doing this week
[13:41] <Laney> juliank: WDYT about tagging some of the messages from the worker with their type? So we can just see like acknowledged/received request messages or something
[13:42] <Laney> . o O ( journalctl ADT_MESSAGE_TYPE=ack ADT_MESSAGE_TYPE=upload ADT_MESSAGE_TYPE=running-autopkgtest ... )
[15:43] <acheronuk> Laney: no, I will not. as things have clearly regressed. stating that fact is not theorising
[15:45] <acheronuk> if it it a recurring problem that occurred at a time when I was not aware, then obviously it is relevant to look at what circumstances are similar to now
[15:46] <acheronuk> however no matter what you say, it is a clear regression over behaviour during the vast majority of the time I have ever observed behaviour if the tests
[15:48] <acheronuk> Laney: sorry. I am just trying to be helpful, by adding my observational input as I tend to tarck the running tests quite closely when mine are in play. only as in the spirit of trying to add data to help sort this
[15:49] <acheronuk> *tend to track
[15:58] <juliank> Laney: hmm, not sure how useful that would be, then you need to find the unit to query context
[15:59] <juliank> Laney, acheronuk FWIW, there do not seem to be any timeouts anymore AFACIT
[15:59] <juliank> it's been about 4 hours now, so I'd hope that's useful data
[16:00] <acheronuk> juliank: :)
[16:01] <juliank> FWIW, a cron job ran and failed to send email https://pastebin.canonical.com/p/VmZgzBCG9y/
[16:21] <Laney> acheronuk: I'll say it as clearly as I can, one last time: your input is not needed here.
[16:22] <Laney> juliank: thanks, just taking aAAAAAAAaaaaaaaaaaaaages for things to run
[16:23] <acheronuk> Laney: I will says it as clearly as I can, I am just adding observations that you are ignoring in the spirit of trying to help worl out how they hell to get past this
[16:24] <acheronuk> as I have a vested interest in how this goes, I will not stay quiwt when it seems they are being overlooked
[16:24] <Laney> We don't need external observations. We have internal ones. We're doing all we can. Thanks.
[16:29] <acheronuk> Laney: I appreciate that. I am not trying to be awkward deliberately. I want this sorted as much as anyone
[16:32] <acheronuk> frustration with the current situation, and an apparent mismatch between the the 'explained situation" and observed reality, means I feel the need to at least say so
[16:32] <acheronuk> I will try to dial back the 1st, but the 2nd I will still query. Politely of course
[16:32] <acheronuk> thank you for helping
[16:35] <Laney> OK, thanks for that. I've not been involved in any conversations about this outside of here, FWIW, but it may be difficult to follow.
[16:37] <Laney> What happens is that autopkgtest-virt-ssh runs copy the source tree back to the controller, so that it can be sent to new instances, if the tests require them. If the test has Restrictions: build-needed, that is a *built* source tree.
[16:37] <Laney> These jobs have a timeout (--copy-timeout).
[16:38] <Laney> We can very rarely get into an unlucky situation where lots of jobs (we run more than a hundred workers in parallel normally, I think) are doing this copying stage at the same time. The machine has limited IO and network bandwidth, and if this gets capped out then the transfers start becoming very slow indeed.
[16:38] <Laney> Then the job hits the copy timeout and is restarted.
[16:39] <Laney> If one is going slowly then they're all going slowly, so they are all likely to timeout.
[16:39] <Laney> Then they start copying again and the same problem occurs. It's difficult to get out of that hole once you're in it.
[16:39] <acheronuk> yes, that is something I gathered from the last week's discussions. thanks
[16:39] <Laney> So we've reduced parallelism so that fewer of these copy jobs are happening simultaneously, meaning that they can actually complete.
[16:40] <acheronuk> why are we hitting that now, when we did not before during even high load of build needed jobs?
[16:40] <Laney> If a log says "Removing autopkgtest-satdep (0) ..." or something like that, then it's actually doing the copy. autopkgtest doesn't ATM output anything in that case.
[16:41] <Laney> Accident of timing in a way that is bad enough to make it happen. It's happened only once before to my knowledge. Usually things wouldn't line up in a way to make this problem happen - the copying is fast in the normal case, and so jobs don't tend to starve each other.
[16:42] <Laney> It's only relatively few packages are actually large enough to be noticable in this way
[16:42] <Laney> e.g. check that there are loads of libreoffice jobs running now - that's one of them
[16:44] <acheronuk> 'accident of timing' does not seem right. there are many occasions when I have landed KDE stuff all at once, requiring that and have not seen this happen
[16:44] <acheronuk> Laney: true. I did note the bazillion libreoffice jobs!
[16:44] <Laney> It is right.
[16:45] <Laney> Almost all of the time we manage to slip these copying operations past each other, but if enough packages lose the race with each other then there's this storm effect
[16:45] <Laney> It's why juliank's work on being able to split the controller into multiple is probably going to help, since then copy jobs won't fight with each other so much
[16:46] <acheronuk> so why has this happened twice on the last week, and not in the many KDE* uploads I have done in the past?
[16:46] <Laney> It's not twice this week, it's the same one
[16:47] <Laney> I shouldn't have turned the parallelism back up before the problem was fully drained
[16:47] <Laney> KDE alone probably isn't enough to trigger it; this time we had binutils, libreoffice × many, chromium, ... at the same time
[16:47] <Laney> all of those are the very chunky packages
[16:47] <Laney> Gotta go now, hope that helps
[16:49] <acheronuk> Laney: ok. I am still surprised if that is true that had not hit that before during the time I have been doing this, but hey odd things happen
[20:55] <vorlon> Laney, juliank: independent of whether we need to increase the number of dispatcher instances, I think the copy timeout handling is clearly pathological; it makes sense to have a timeout to detect a dead worker, but to deal with a slow transfer by killing the jo and starting all over just makes the storm worse (and from what I've seen, neither the instances nor the transfers are stopped when the
[20:55] <vorlon> timeout is hit)
[20:56] <vorlon> it would be saner to just let the copies complete, however long they take, instead of restarting them and ensuring they never complete
[20:57] <vorlon> looks like ppc64el and s390x queues are empty now, so I'm going to bump the number of runners on the remaining archs to 8 so that they complete faster
[21:02] <juliank> vorlon: Probably should just warn about copy timeouts in cron job somehow rather than fail
[21:02] <juliank> "Oh, this tar process has been running for 2 hours"
[21:04] <juliank> probably want to ionice this to a lower priority after like 2 hous so that other jobs can work while long copies are going on
[21:25] <vorlon> hmm, I'm not sure ionice would improve the overall throughput... since in most cases where this becomes an issue, it's because most of your units are all stuck in this state of doing source tree copies
[22:15] <Laney> Maybe.
[22:16] <Laney> Increasing the timeout might be sensible. Having jobs which take interminably long to run because they're copying at 1KB/s, hmm...
[22:16] <Laney> A backoff solution might actually provide for better throughput.
[22:20] <Laney> Looks like one hump might be over, but it's too late for me to decide whether to increase workers now...