vorlonfwiw I'm rebooting the autopkgtest master to see if that helps with the i/o issues (the system has been up for quite a while and I don't know that I trust the kernel).  However I didn't count on systemd insisting on nuking all of /tmp on boot before it lets me back in, and that's where all of the autopkgtest units' working files are, so it'll be a while before it's back up06:50
vorlon(hopefully it'll come back before I timeout and have to go to bed, but there's really no telling at this point)06:50
vorlonLaney, juliank: ^^ I have no idea, maybe it'll come up and be healthy, or maybe it'll need some babysitting when it comes up (to kill off instances left running?), but I can't stay up to wait for it indefinitely; if either of you happen to be at keyboard today when it comes up and want to have a look, great, otherwise I'll check in in ~8h07:04
Laneyvorlon: might be around a bit later on (out for a few hours now). fwiw (if rebooting doesn't help), if you reduce parallelism to say 5 units per arch then it'll probably decrease load enough to be able to chew through the chunky jobs, and it can then be increased again afterwards07:10
juliankeverything seems ok09:35
juliank163 tar processes running, fun, but machine is responsive09:35
Laneycheck if the big jobs (binutils, libreoffice, chromium, k*) are actually being processed rather than timing out repeatedly09:37
Laneyif you see them bunched up copying back to the controller, that is a suspicious sign09:37
Laneymaybe watch journal for a bit to see what's actually going on09:38
* Laney out again09:38
juliankI see exceed quota, but no timeouts so far, will look in from time to time09:39
juliankAh I think the chromium ones are running right now09:40
juliankautopkgtest [09:49:23]: ERROR: testbed failure: sent `copyup /tmp/autopkgtest.7oD5ck/build.OEX/src/ /tmp/autopkgtest-work.emdsp_b7/out/tests-tree/', got `timeout', expected09:53
juliankyeah, so did not help09:53
juliankI guess decreasing parallelity it is09:59
juliank* parallelism10:00
acheronukjuliank: it did a few days ago. the bigger question, though, is why this problem now? the infra did used to cope. granted that some things slowed progress a fair bit, but there was not this timeout -> respawn job to front of queue paralysis. something on the test machines/infra/whatever has regressed10:12
juliankacheronuk: I don't know11:06
juliankacheronuk: I did decrease #workers to 5 per-arch11:06
juliankand it seems we reduced tar processes by 50%11:06
juliankI/O load hence seems lower now, which should hopefully mean we don't timeout copying anymore11:07
juliankWhat I think we'd have to do is ensure that we only have a certain number of needs-build jobs running at the same time11:08
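A minimal sketch of that idea, assuming the worker can tell up front whether a job needs a build: cap only those jobs with a semaphore and let everything else run freely. `BuildSlotLimiter` and its parameters are hypothetical names for illustration, not part of the autopkgtest worker.

```python
import threading


class BuildSlotLimiter:
    """Allow at most max_build_jobs build-needed jobs to run at once."""

    def __init__(self, max_build_jobs):
        # BoundedSemaphore raises if released more often than acquired.
        self._slots = threading.BoundedSemaphore(max_build_jobs)

    def run(self, job, needs_build):
        # Jobs without a build step never wait for a slot.
        if not needs_build:
            return job()
        # Build-needed jobs block here until a slot is free.
        with self._slots:
            return job()
```

Each worker thread would call `limiter.run(job, needs_build)`; jobs that need a build queue up behind the cap while the others proceed.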
juliankeven ls /tmp works now11:11
juliankit might be useful splitting that up a bit11:12
juliankwe have that for each running autopkgtest11:12
juliankthat means /tmp has a lot of files in that dir11:13
juliankit might make sense to build hierarchies11:13
julianke.g. /tmp/autopkgtest-{fifo,ssh,work}.1rdoc1m_ => /tmp/autopkgtest.1r/do/c1/m_/autopkgtest-{fifo,ssh,work}.1rdoc1m_11:14
julianklike git does11:14
juliankto not have that dir grow to a huge ton of entries11:14
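The hierarchy proposed above could be sketched like this. It assumes the random suffix after the last dot is split into two-character chunks, as in the example path; `sharded_path` is a hypothetical helper, not an existing autopkgtest function.

```python
import posixpath


def sharded_path(name, root="/tmp", width=2):
    """Map e.g. 'autopkgtest-work.1rdoc1m_' into a nested directory tree."""
    # Split off the random suffix after the last dot.
    _, suffix = name.rsplit(".", 1)
    # Cut the suffix into fixed-width chunks: '1rdoc1m_' -> 1r, do, c1, m_
    chunks = [suffix[i:i + width] for i in range(0, len(suffix), width)]
    # The first chunk becomes part of the top-level directory name.
    first = "autopkgtest." + chunks[0]
    return posixpath.join(root, first, *chunks[1:], name)


print(sharded_path("autopkgtest-work.1rdoc1m_"))
```

Note that git itself only shards on the first two characters of an object name; the deeper nesting here follows the example path in the message above.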
juliankwe should see whether chromium copy worked in ~ 40 mins11:14
acheronukjuliank: fine. but the fact remains that this is a recent phenomenon, or one whose severity and prevalence has dramatically increased lately11:30
acheronuknow I think on it, I did notice that towards the end of disco cycle, some jobs did appear to struggle11:31
acheronuke.g. kwin and plasma-workspace jobs did at some point start to stall for hrs, then restart, without any explanation visible to me, since all I could do was look at the user-facing logs11:33
acheronukannoying, but only occasional, even when the infra was very loaded up with K* tests, and they mostly did finish the apparent subsequent go, so I did not flag it up11:35
acheronukit is possible other *BIG* tests not KDE did the same, but I did not watch those11:35
acheronukhowever, something has changed recently that means that, this is now more likely than not under load, and hitting a wider range of tests11:37
acheronukanyway. that is how I have observed things change recently (and I do watch a lot when many KDE tests are queued), whatever that means11:42
juliankI'm sorry to inform you that copying timed out again12:05
juliankoh, but one succeeded12:06
juliank(of chromium-browser/s390x)12:06
juliankah no, arm64 succeeded, s390x timed out12:07
juliankwhy am I not raising the copy timeout?12:08
juliankit would be better than retrying again and again12:09
juliankbut it already is 100 mins, so...12:10
juliankit seems likely raising it by 20 mins solves some of those issues12:12
juliankI increased it by 20 mins, and requested restarts once current jobs are complete12:15
juliankFWIW, solving this means setting up a second cloud worker, which probably means we should do a staging cloud worker first12:19
juliankthat's in progress, but will probably take some time12:19
Laneyacheronuk: can you please stop theorising that something changed and we're missing it?13:18
Laneyit's simply not true (we have in fact had this problem before)13:18
Laneythe last time was when we created the card to work on the parallelism stuff that juliank has been doing this week13:19
Laneyjuliank: WDYT about tagging some of the messages from the worker with their type? So we can just see like acknowledged/received request messages or something13:41
Laney. o O ( journalctl ADT_MESSAGE_TYPE=ack ADT_MESSAGE_TYPE=upload ADT_MESSAGE_TYPE=running-autopkgtest ... )13:42
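A small sketch of how such a query could be assembled, assuming the worker attached an `ADT_MESSAGE_TYPE` journal field to its messages (the field is only proposed above and does not exist yet). `journalctl_args` and the unit name in the usage line are hypothetical. journalctl treats several matches on the same field as alternatives, and matches on different fields as a conjunction.

```python
def journalctl_args(message_types, unit=None):
    """Build a journalctl argv that selects the given message types."""
    args = ["journalctl"]
    if unit is not None:
        # -u restricts the output to one systemd unit.
        args += ["-u", unit]
    # Repeated matches on one field are ORed together by journalctl.
    args += ["ADT_MESSAGE_TYPE=" + t for t in message_types]
    return args


# "autopkgtest-worker.service" is a placeholder unit name.
print(" ".join(journalctl_args(["ack", "upload"], unit="autopkgtest-worker.service")))
```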
acheronukLaney: no, I will not. as things have clearly regressed. stating that fact is not theorising15:43
acheronukif it is a recurring problem that occurred at a time when I was not aware, then obviously it is relevant to look at what circumstances are similar to now15:45
acheronukhowever no matter what you say, it is a clear regression over behaviour during the vast majority of the time I have ever observed behaviour of the tests15:46
acheronukLaney: sorry. I am just trying to be helpful, by adding my observational input as I tend to tarck the running tests quite closely when mine are in play. only as in the spirit of trying to add data to help sort this15:48
acheronuk*tend to track15:49
juliankLaney: hmm, not sure how useful that would be, then you need to find the unit to query context15:58
juliankLaney, acheronuk FWIW, there do not seem to be any timeouts anymore AFAICT15:59
juliankit's been about 4 hours now, so I'd hope that's useful data15:59
acheronukjuliank: :)16:00
juliankFWIW, a cron job ran and failed to send email https://pastebin.canonical.com/p/VmZgzBCG9y/16:01
Laneyacheronuk: I'll say it as clearly as I can, one last time: your input is not needed here.16:21
Laneyjuliank: thanks, just taking aAAAAAAAaaaaaaaaaaaaages for things to run16:22
acheronukLaney: I will say it as clearly as I can, I am just adding observations that you are ignoring in the spirit of trying to help work out how the hell to get past this16:23
acheronukas I have a vested interest in how this goes, I will not stay quiet when it seems they are being overlooked16:24
LaneyWe don't need external observations. We have internal ones. We're doing all we can. Thanks.16:24
acheronukLaney: I appreciate that. I am not trying to be awkward deliberately. I want this sorted as much as anyone16:29
acheronukfrustration with the current situation, and an apparent mismatch between the 'explained situation' and observed reality, means I feel the need to at least say so16:32
acheronukI will try to dial back the 1st, but the 2nd I will still query. Politely of course16:32
acheronukthank you for helping16:32
LaneyOK, thanks for that. I've not been involved in any conversations about this outside of here, FWIW, but it may be difficult to follow.16:35
LaneyWhat happens is that autopkgtest-virt-ssh runs copy the source tree back to the controller, so that it can be sent to new instances, if the tests require them. If the test has Restrictions: build-needed, that is a *built* source tree.16:37
LaneyThese jobs have a timeout (--copy-timeout).16:37
LaneyWe can very rarely get into an unlucky situation where lots of jobs (we run more than a hundred workers in parallel normally, I think) are doing this copying stage at the same time. The machine has limited IO and network bandwidth, and if this gets capped out then the transfers start becoming very slow indeed.16:38
LaneyThen the job hits the copy timeout and is restarted.16:38
LaneyIf one is going slowly then they're all going slowly, so they are all likely to timeout.16:39
LaneyThen they start copying again and the same problem occurs. It's difficult to get out of that hole once you're in it.16:39
acheronukyes, that is something I gathered from the last week's discussions. thanks16:39
LaneySo we've reduced parallelism so that fewer of these copy jobs are happening simultaneously, meaning that they can actually complete.16:39
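The storm described above can be put into rough numbers. This is a toy model, assuming every copy starts at the same moment and the controller's bandwidth is shared equally; the tree size and bandwidth in the usage lines are illustrative, not measured. Only the 100-minute copy timeout comes from the discussion.

```python
def copy_time_minutes(jobs, tree_gb, bandwidth_mb_s):
    """Minutes until all concurrent copies finish under equal sharing."""
    total_mb = jobs * tree_gb * 1024
    return total_mb / bandwidth_mb_s / 60


def max_concurrent_copies(timeout_min, tree_gb, bandwidth_mb_s):
    """Largest number of simultaneous copies that still beat the timeout."""
    budget_mb = timeout_min * 60 * bandwidth_mb_s
    return int(budget_mb // (tree_gb * 1024))


# Illustrative inputs: 10 GB built trees, 100 MB/s shared, 100 min timeout.
limit = max_concurrent_copies(100, 10, 100)
print(limit, copy_time_minutes(limit, 10, 100), copy_time_minutes(limit + 1, 10, 100))
```

In this model, one copy over the limit pushes every job past the timeout together, and because all of them retry, the next round starts in the same state. Reducing parallelism gets the number of concurrent copies back under the limit.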
acheronukwhy are we hitting that now, when we did not before during even high load of build needed jobs?16:40
LaneyIf a log says "Removing autopkgtest-satdep (0) ..." or something like that, then it's actually doing the copy. autopkgtest doesn't ATM output anything in that case.16:40
LaneyAccident of timing in a way that is bad enough to make it happen. It's happened only once before to my knowledge. Usually things wouldn't line up in a way to make this problem happen - the copying is fast in the normal case, and so jobs don't tend to starve each other.16:41
LaneyOnly relatively few packages are actually large enough to be noticeable in this way16:42
Laneye.g. check that there are loads of libreoffice jobs running now - that's one of them16:42
acheronuk'accident of timing' does not seem right. there are many occasions when I have landed KDE stuff all at once, requiring that and have not seen this happen16:44
acheronukLaney: true. I did note the bazillion libreoffice jobs!16:44
LaneyIt is right.16:44
LaneyAlmost all of the time we manage to slip these copying operations past each other, but if enough packages lose the race with each other then there's this storm effect16:45
LaneyIt's why juliank's work on being able to split the controller into multiple is probably going to help, since then copy jobs won't fight with each other so much16:45
acheronukso why has this happened twice in the last week, and not in the many KDE* uploads I have done in the past?16:46
LaneyIt's not twice this week, it's the same one16:46
LaneyI shouldn't have turned the parallelism back up before the problem was fully drained16:47
LaneyKDE alone probably isn't enough to trigger it; this time we had binutils, libreoffice × many, chromium, ... at the same time16:47
Laneyall of those are the very chunky packages16:47
LaneyGotta go now, hope that helps16:47
acheronukLaney: ok. I am still surprised, if that is true, that we had not hit that before during the time I have been doing this, but hey odd things happen16:49
vorlonLaney, juliank: independent of whether we need to increase the number of dispatcher instances, I think the copy timeout handling is clearly pathological; it makes sense to have a timeout to detect a dead worker, but to deal with a slow transfer by killing the job and starting all over just makes the storm worse (and from what I've seen, neither the instances nor the transfers are stopped when the20:55
vorlontimeout is hit)20:55
vorlonit would be saner to just let the copies complete, however long they take, instead of restarting them and ensuring they never complete20:56
vorlonlooks like ppc64el and s390x queues are empty now, so I'm going to bump the number of runners on the remaining archs to 8 so that they complete faster20:57
juliankvorlon: Probably should just warn about copy timeouts in cron job somehow rather than fail21:02
juliank"Oh, this tar process has been running for 2 hours"21:02
juliankprobably want to ionice this to a lower priority after like 2 hours so that other jobs can work while long copies are going on21:04
vorlonhmm, I'm not sure ionice would improve the overall throughput... since in most cases where this becomes an issue, it's because most of your units are all stuck in this state of doing source tree copies21:25
LaneyIncreasing the timeout might be sensible. Having jobs which take interminably long to run because they're copying at 1KB/s, hmm...22:16
LaneyA backoff solution might actually provide for better throughput.22:16
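One common shape for such a backoff is exponential delay with random jitter, so that jobs which timed out together do not all retry together. This is a generic sketch, not something the worker implements; `backoff_delay` and its default values are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt, base=60.0, cap=3600.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with every attempt up to `cap`; the actual delay
    is drawn uniformly from [0, ceiling] to spread the retries out.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt)))
```

The jitter is the part that matters for the storm: a fixed delay would simply move the whole batch of copies to a later point in time.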
LaneyLooks like one hump might be over, but it's too late for me to decide whether to increase workers now...22:20
