/srv/irclogs.ubuntu.com/2019/04/27/#ubuntu-release.txt

vorlon	fwiw I'm rebooting the autopkgtest master to see if that helps with the i/o issues (the system has been up for quite a while and I don't know that I trust the kernel). However I didn't count on systemd insisting on nuking all of /tmp on boot before it lets me back in, and that's where all of the autopkgtest units' working files are, so it'll be a while before it's back up	06:50
vorlon	(hopefully it'll come back before I timeout and have to go to bed, but there's really no telling at this point)	06:50
vorlon	Laney, juliank: ^^ I have no idea, maybe it'll come up and be healthy, or maybe it'll need some babysitting when it comes up (to kill off instances left running?), but I can't stay up to wait for it indefinitely; if either of you happen to be at keyboard today when it comes up and want to have a look, great, otherwise I'll check in in ~8h	07:04
Laney	vorlon: might be around a bit later on (out for a few hours now). fwiw (if rebooting doesn't help), if you reduce parallelism to say 5 units per arch then it'll probably decrease load enough to be able to chew through the chunky jobs, and it can then be increased again afterwards	07:10
juliank	everything seems ok	09:35
juliank	163 tar processes running, fun, but machine is responsive	09:35
Laney	check if the big jobs (binutils, libreoffice, chromium, k*) are actually being processed rather than timing out repeatedly	09:37
Laney	if you see them bunched up copying back to the controller, that is a suspicious sign	09:37
Laney	maybe watch journal for a bit to see what's actually going on	09:38
* Laney out again		09:38
juliank	I see exceed quota, but no timeouts so far, will look in from time to time	09:39
juliank	Ah I think the chromium ones are running right now	09:40
juliank	autopkgtest [09:49:23]: ERROR: testbed failure: sent `copyup /tmp/autopkgtest.7oD5ck/build.OEX/src/ /tmp/autopkgtest-work.emdsp_b7/out/tests-tree/', got `timeout', expected	09:53
juliank	yeah, so did not help	09:53
juliank	I guess decreasing parallelity it is	09:59
juliank	* parallelism	10:00
acheronuk	juliank: it did a few days ago. bigger questions is though why this problem now? the infra did used to cope. granted that some things slowed progress a fair bit, but there was not this timeout -> respawn job to front of queue paralysis. somthing on the test machines/infra/whatever has regressed	10:12
juliank	acheronuk: I don't know	11:06
juliank	acheronuk: I did decrease #workers to 5 per-arch	11:06
juliank	and it seems we reduced tar processes by 50%	11:06
juliank	I/O load hence seems lower now, which should hopefully mean we don't timeout copying anymore	11:07
juliank	What I think we'd have to do is ensure that we only have a certain number of needs-build jobs running at the same time	11:08
juliank	even ls /tmp works now	11:11
juliank	it might be useful splitting that up a bit	11:12
juliank	/tmp/autopkgtest-{fifo,work,ssh}.$randomstring	11:12
juliank	we have that for each running autopkgtest	11:12
juliank	that means /tmp has a lot of files in that dir	11:13
juliank	it might make sense to build hierarchies	11:13
juliank	e.g. /tmp/autopkgtest-{fifo,ssh,work}.1rdoc1m_ => /tmp/autopkgtest.1r/do/c1/m_/autopkgtest-{fifo,ssh,work}.1rdoc1m_	11:14
juliank	like git does	11:14
juliank	to not have that dir grow to a huge ton of entries	11:14
juliank	we should see whether chromium copy worked in ~ 40 mins	11:14
acheronuk	juliank: fine. but fact remains that this is a recent phenomena, or one whose severity and prevalence has dramatically increased lately	11:30
acheronuk	now I think on it, I did notice that towards the end of disco cycle, some jobs did appear to struggle	11:31
acheronuk	e.g kwin, and plasma-workspace jobs did start to at some time seem to stall for hrs, then restart, without any explanation to me where all I could do was look at user facing logs	11:33
acheronuk	annoying, but only occasional, even when the infra was very loaded up with K* tests, and they mostly did finish the apparent subsequent go, so I did not flag it up	11:35
acheronuk	it is possible other BIG tests not KDE did the same, but I did not watch those	11:35
acheronuk	however, something has changed recently that means that, this is now more likely than not under load, and hitting a wider range of tests	11:37
acheronuk	anyway. that is how I have observed things change recently (and I do watch a lot when many KDE tests are queued), whatever that means	11:42
juliank	I'm sorry to inform you that copying timed out again	12:05
juliank	oh, but one succeeded	12:06
juliank	(of chromium-browser/s390x)	12:06
juliank	ah no, arm64 succeeded, s390x timed out	12:07
juliank	https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-eoan/eoan/arm64/c/chromium-browser/20190427_115905_661f8@/log.gz	12:08
juliank	wooho	12:08
juliank	why am I not raising the copy timeout?	12:08
juliank	it would be better than retrying again and again	12:09
juliank	but it already is a 100 mins, so...	12:10
juliank	it seems likely raising it by 20 mins solves some of those issues	12:12
juliank	I increased it by 20 mins, and requested restarts once current jobs are complete	12:15
juliank	FWIW, solving this means setting up a second cloud worker, which probably means we should do a staging cloud worker first	12:19
juliank	that's in progress, but will probably take some time	12:19
Laney	acheronuk: can you please stop theorising that something changed and we're missing it?	13:18
Laney	it's simply not true (we have in fact had this problem before)	13:18
Laney	the last time was when we created the card to work on the parallelism stuff that juliank has been doing this week	13:19
Laney	juliank: WDYT about tagging some of the messages from the worker with their type? So we can just see like acknowledged/received request messages or something	13:41
Laney	. o O ( journalctl ADT_MESSAGE_TYPE=ack ADT_MESSAGE_TYPE=upload ADT_MESSAGE_TYPE=running-autopkgtest ... )	13:42
acheronuk	Laney: no, I will not. as things have clearly regressed. stating that fact is not theorising	15:43
acheronuk	if it it a recurring problem that occurred at a time when I was not aware, then obviously it is relevant to look at what circumstances are similar to now	15:45
acheronuk	however no matter what you say, it is a clear regression over behaviour during the vast majority of the time I have ever observed behaviour if the tests	15:46
acheronuk	Laney: sorry. I am just trying to be helpful, by adding my observational input as I tend to tarck the running tests quite closely when mine are in play. only as in the spirit of trying to add data to help sort this	15:48
acheronuk	*tend to track	15:49
juliank	Laney: hmm, not sure how useful that would be, then you need to find the unit to query context	15:58
juliank	Laney, acheronuk FWIW, there do not seem to be any timeouts anymore AFACIT	15:59
juliank	it's been about 4 hours now, so I'd hope that's useful data	15:59
acheronuk	juliank: :)	16:00
juliank	FWIW, a cron job ran and failed to send email https://pastebin.canonical.com/p/VmZgzBCG9y/	16:01
Laney	acheronuk: I'll say it as clearly as I can, one last time: your input is not needed here.	16:21
Laney	juliank: thanks, just taking aAAAAAAAaaaaaaaaaaaaages for things to run	16:22
acheronuk	Laney: I will says it as clearly as I can, I am just adding observations that you are ignoring in the spirit of trying to help worl out how they hell to get past this	16:23
acheronuk	as I have a vested interest in how this goes, I will not stay quiwt when it seems they are being overlooked	16:24
Laney	We don't need external observations. We have internal ones. We're doing all we can. Thanks.	16:24
acheronuk	Laney: I appreciate that. I am not trying to be awkward deliberately. I want this sorted as much as anyone	16:29
acheronuk	frustration with the current situation, and an apparent mismatch between the the 'explained situation" and observed reality, means I feel the need to at least say so	16:32
acheronuk	I will try to dial back the 1st, but the 2nd I will still query. Politely of course	16:32
acheronuk	thank you for helping	16:32
Laney	OK, thanks for that. I've not been involved in any conversations about this outside of here, FWIW, but it may be difficult to follow.	16:35
Laney	What happens is that autopkgtest-virt-ssh runs copy the source tree back to the controller, so that it can be sent to new instances, if the tests require them. If the test has Restrictions: build-needed, that is a built source tree.	16:37
Laney	These jobs have a timeout (--copy-timeout).	16:37
Laney	We can very rarely get into an unlucky situation where lots of jobs (we run more than a hundred workers in parallel normally, I think) are doing this copying stage at the same time. The machine has limited IO and network bandwidth, and if this gets capped out then the transfers start becoming very slow indeed.	16:38
Laney	Then the job hits the copy timeout and is restarted.	16:38
Laney	If one is going slowly then they're all going slowly, so they are all likely to timeout.	16:39
Laney	Then they start copying again and the same problem occurs. It's difficult to get out of that hole once you're in it.	16:39
acheronuk	yes, that is something I gathered from the last week's discussions. thanks	16:39
Laney	So we've reduced parallelism so that fewer of these copy jobs are happening simultaneously, meaning that they can actually complete.	16:39
acheronuk	why are we hitting that now, when we did not before during even high load of build needed jobs?	16:40
Laney	If a log says "Removing autopkgtest-satdep (0) ..." or something like that, then it's actually doing the copy. autopkgtest doesn't ATM output anything in that case.	16:40
Laney	Accident of timing in a way that is bad enough to make it happen. It's happened only once before to my knowledge. Usually things wouldn't line up in a way to make this problem happen - the copying is fast in the normal case, and so jobs don't tend to starve each other.	16:41
Laney	It's only relatively few packages are actually large enough to be noticable in this way	16:42
Laney	e.g. check that there are loads of libreoffice jobs running now - that's one of them	16:42
acheronuk	'accident of timing' does not seem right. there are many occasions when I have landed KDE stuff all at once, requiring that and have not seen this happen	16:44
acheronuk	Laney: true. I did note the bazillion libreoffice jobs!	16:44
Laney	It is right.	16:44
Laney	Almost all of the time we manage to slip these copying operations past each other, but if enough packages lose the race with each other then there's this storm effect	16:45
Laney	It's why juliank's work on being able to split the controller into multiple is probably going to help, since then copy jobs won't fight with each other so much	16:45
acheronuk	so why has this happened twice on the last week, and not in the many KDE* uploads I have done in the past?	16:46
Laney	It's not twice this week, it's the same one	16:46
Laney	I shouldn't have turned the parallelism back up before the problem was fully drained	16:47
Laney	KDE alone probably isn't enough to trigger it; this time we had binutils, libreoffice × many, chromium, ... at the same time	16:47
Laney	all of those are the very chunky packages	16:47
Laney	Gotta go now, hope that helps	16:47
acheronuk	Laney: ok. I am still surprised if that is true that had not hit that before during the time I have been doing this, but hey odd things happen	16:49
vorlon	Laney, juliank: independent of whether we need to increase the number of dispatcher instances, I think the copy timeout handling is clearly pathological; it makes sense to have a timeout to detect a dead worker, but to deal with a slow transfer by killing the jo and starting all over just makes the storm worse (and from what I've seen, neither the instances nor the transfers are stopped when the	20:55
vorlon	timeout is hit)	20:55
vorlon	it would be saner to just let the copies complete, however long they take, instead of restarting them and ensuring they never complete	20:56
vorlon	looks like ppc64el and s390x queues are empty now, so I'm going to bump the number of runners on the remaining archs to 8 so that they complete faster	20:57
juliank	vorlon: Probably should just warn about copy timeouts in cron job somehow rather than fail	21:02
juliank	"Oh, this tar process has been running for 2 hours"	21:02
juliank	probably want to ionice this to a lower priority after like 2 hous so that other jobs can work while long copies are going on	21:04
vorlon	hmm, I'm not sure ionice would improve the overall throughput... since in most cases where this becomes an issue, it's because most of your units are all stuck in this state of doing source tree copies	21:25
Laney	Maybe.	22:15
Laney	Increasing the timeout might be sensible. Having jobs which take interminably long to run because they're copying at 1KB/s, hmm...	22:16
Laney	A backoff solution might actually provide for better throughput.	22:16
Laney	Looks like one hump might be over, but it's too late for me to decide whether to increase workers now...	22:20

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!