/srv/irclogs.ubuntu.com/2018/03/28/#ubuntu-kernel.txt

=== aaa is now known as Hofman
=== Hofman is now known as slopisadictator
=== slopisadictator is now known as aaa
xnoxcking, how silly is this udev rule - https://github.com/ibm-s390-tools/s390-tools/blob/master/etc/udev/rules.d/60-readahead.rules#L10 ? why is read_ahead_kb 128? should the kernel read_ahead_kb be bumped? is there a sysctl to set the default read_ahead_kb in the kernel?11:22
* xnox is grepping the kernel and getting lost11:22
xnoxcan i bump it to 512 everywhere?11:23
xnox 916        q->backing_dev_info->ra_pages =11:24
xnox 917                        (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;11:24
xnoxoooh11:24
cking512 is quite large for all systems isn't it?11:24
xnoxhttp://paste.ubuntu.com/p/btw3vKkx8X/11:25
xnoxis my thinking....11:25
xnoxcking, are you saying 128kb should be enough for everyboy - Bill Gates?11:25
xnoxcking, are you saying 128kb should be enough for everybody - Bill Gates?11:25
ckingheh, well, today's drives are getting faster and memory is larger so a larger readahead makes some sense I suppose11:26
xnoxcking, if that's too large for all systems, i want this exposed as a /all/ sysfs ctl such that I can configure the default at runtime and apply to all.11:26
xnoxcause at the moment, i can either set it on each device as they appear.... which is racy. Or recompile the kernel.11:26
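A minimal sketch of the per-device approach xnox describes, assuming the usual sysfs layout where each block device exposes `queue/read_ahead_kb` under `/sys/block`. The function names and the `sys_block` parameter are hypothetical and exist only so the code can be pointed at a test directory; writing the knob on a real system needs root, and devices that appear later are not covered, which is the race xnox mentions.

```python
from pathlib import Path


def read_ahead_settings(sys_block="/sys/block"):
    """Return {device: read_ahead_kb} for every block device under sys_block."""
    settings = {}
    for knob in sorted(Path(sys_block).glob("*/queue/read_ahead_kb")):
        device = knob.parent.parent.name  # <sys_block>/<device>/queue/read_ahead_kb
        settings[device] = int(knob.read_text().strip())
    return settings


def set_read_ahead(kb, sys_block="/sys/block"):
    """Write kb to every read_ahead_kb knob that exists right now."""
    for knob in Path(sys_block).glob("*/queue/read_ahead_kb"):
        knob.write_text(f"{kb}\n")


# Example (hypothetical usage, as root):
#   print(read_ahead_settings())
#   set_read_ahead(512)
```

This only touches devices present at the moment it runs; a udev rule or a kernel default would still be needed for devices added afterwards.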
xnoxcking, given 4k native drives, bumping the readahead 4 times makes sense to keep the constant readahead number of sectors, no?11:28
ckingxnox,  the problem is that the default may not be ideal for all use cases. for example, doing lots of random I/O you don't want a really large readahead11:28
ckinglarge readahead makes sense when you are streaming data in sequentially11:29
xnoxon 4k native drives, the cost should be the same as it used to be with 128kb & 512 sector drives.... unless i'm a clueless muppet =)11:29
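The sector arithmetic behind this exchange can be checked with a small sketch. It assumes readahead is a plain byte budget divided by the logical sector size; `readahead_sectors` is a hypothetical helper, not a kernel function.

```python
def readahead_sectors(read_ahead_kb, sector_bytes):
    """Number of whole sectors covered by a readahead window of read_ahead_kb KiB."""
    return read_ahead_kb * 1024 // sector_bytes


# Compare the current default with the proposed bump on both sector sizes.
for kb, sector in [(128, 512), (128, 4096), (512, 4096), (1024, 4096)]:
    count = readahead_sectors(kb, sector)
    print(f"{kb:>4} KiB with {sector:>4}-byte sectors = {count} sectors")
```

Under this simple model, 128 KiB covers 256 sectors of 512 bytes, while 512 KiB covers 128 sectors of 4096 bytes, so matching the sector count exactly would take 1024 KiB.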
* xnox looks at the histogram of all reads an "average" machine does.11:29
* xnox notes there is no such graph11:30
ckingwell, you still have bus bandwidth limitations, large readahead will impact even with large sectors, so the choice is a good compromise, which is hard to guess11:30
xnoxi'm wondering if this is true in general, given all of this is hidden inside the bcache / md / filesystem layers, and most of that stuff is very extents-happy these days, etc.11:31
xnoxcking, and ibm is bumping that up, on s390x, because everything is big on big iron?11:31
ckingxnox, s390x can do I/O *really* fast11:31
* xnox notes that it looks like that udev rule they commit alongside the bootloader is meant to run everywhere.11:31
ckingit's one of its key wins11:32
xnoxoh, so if I condition this to s390, people will complain? =)11:32
* xnox about to make s390x bloody fast; and make all the other kids complain11:32
ckingxnox, whatever is chosen is going to be sub-optimal for somebody11:33
xnoxhowever, ibm recommends this and all of us ship that udev rule, so for everybody we go with ibm recommendation on this one.....11:33
xnoxone can overcommit z/vms too11:34
ckingvery true11:34
xnoxi really don't like how it is *1024/PAGE_SIZE. Surely a default of a specific number of sectors or pages makes more sense. And systems with larger disk sectors / memory page_size get more caching.11:35
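A short sketch of the formula xnox quotes, to show what `* 1024 / PAGE_SIZE` does on machines with different page sizes. It assumes `VM_MAX_READAHEAD` is 128 (KiB), matching the 128 read_ahead_kb default mentioned earlier in the log; `default_ra_pages` is a hypothetical name.

```python
VM_MAX_READAHEAD = 128  # KiB; assumed from the 128 read_ahead_kb default in the log


def default_ra_pages(page_size, max_readahead_kb=VM_MAX_READAHEAD):
    """Mirror of (VM_MAX_READAHEAD * 1024) / PAGE_SIZE using integer division."""
    return (max_readahead_kb * 1024) // page_size


for page_size in (4096, 16384, 65536):
    pages = default_ra_pages(page_size)
    print(f"PAGE_SIZE={page_size:>5}: ra_pages={pages:>2} ({pages * page_size // 1024} KiB)")
```

The byte budget stays at 128 KiB in every case and only the page count shrinks, whereas a fixed page count would instead scale the readahead window with the page size.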
xnoxbtrfs does funky stuff to the device.11:36
xnox2727        sb->s_bdi->ra_pages = VM_MAX_READAHEAD * SZ_1K / PAGE_SIZE;11:36
xnox2728        sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);11:36
xnox2729        sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);11:36
ckingnice, so a device specific setting behaves differently depending on the file system, yat11:37
cking*yay11:37
xnoxbtrfs goes to 1111:37
xnoxcking, well, btrfs does take over the whole device, if the whole device is added to the pool....11:38
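The three btrfs lines xnox pasted can be replayed as plain arithmetic. This is a sketch only: it assumes `VM_MAX_READAHEAD` is 128 (KiB) and a 4096-byte page by default, and `btrfs_ra_pages` is a hypothetical Python stand-in for the quoted C statements.

```python
SZ_1K = 1024
SZ_4M = 4 * 1024 * 1024


def btrfs_ra_pages(num_devices, page_size=4096, vm_max_readahead=128):
    """Replay the quoted statements: per-device default, scaled by device count, floored at 4 MiB."""
    ra_pages = vm_max_readahead * SZ_1K // page_size
    ra_pages *= num_devices
    return max(ra_pages, SZ_4M // page_size)


for devices in (1, 4, 32, 64):
    pages = btrfs_ra_pages(devices)
    print(f"{devices:>2} device(s): ra_pages={pages} ({pages * 4096 // 1024} KiB)")
```

With these assumptions the 4 MiB floor dominates until the pool has more than 32 devices, so a single-device btrfs filesystem already gets 32 times the generic default.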
xnoxmd layer does stuff with stripe sizes11:39
xnoxnfs too11:40
* cking wonders what zfs does11:41
dijuremotyhicks: I was running: ii  intel-microcode                             3.20180108.0+really20170707ubuntu16.04.111:51
dijuremoI have installed your package, intel-microcode (3.20180312.0~ubuntu16.04.1)11:51
dijuremoAnd will conduct testing and report back.11:51
f_gcking: it sets ra_pages to 011:53
f_gah no, to 1 actually11:54
ckingf_g, that's not what I expected for sure11:54
f_gcourtesy of having its own prefetcher11:54
ckingah, that makes sense11:54
f_gbc17f1047a83cc8c4065e0ef84333a0d9b9d73aa11:54
f_g:)11:54
dijuremotyhicks: My first test was to update the microcode, reboot, and keep the old firmware A14. Machine seemed fine, though it fully hung on my first test. The test was to log in, run glxgears and Matlab: bench(5) then log off. The machine hung up after login off at the login screen where a darker shade appeared over the whole screen. Num Lock would not work on the keyboard. I force restarted the machine, repeated the same process 3 times with 12:19
dijuremotyhicks: My second test was to update to A16, which is the latest firmware for the DELL T3610. I booted up, was able to log in, run glxgears and Matlab, log off and log back in and repeat the test several times (3+) with no problem. The machine was crashing (screen hung) right after entering the password and hitting enter when I was using the outdated microcode. Any idea when the new microcode package is going to be made available in the m12:32
tyhicksdijuremo: thanks for testing! I believe that the security team plans to publish the new intel-microcode this week14:02
tyhicksdijuremo: could you paste the output of 'iucode-tool --scan-system' when you get a chance?14:02
dijuremotyhicks: oh well, spoke too soon, now that we have added gnome-desktop per user request, the machine is back to crashing either at the login screen or if we log out after starting a graphical session... sigh...17:15
dijuremoiucode-tool: system has processor(s) with signature 0x000306e417:17
dijuremoIs there a need to rebuild the nvidia kernel module with the latest Intel Microcode in place?17:19
tyhicksdijuremo: no, there shouldn't be17:20
tyhicksdijuremo: can you pastebin the output of 'dmesg -t | grep -i microcode'? are you still running firmware A16?17:22
tyhicksdijuremo: also, can you elaborate on what you mean by "crashing"?17:23
dijuremoThe system freezes completely, only a hard reboot from the power button will restart it. Numlock stops turning on/off on the system17:24
dijuremomicrocode: sig=0x306e4, pf=0x1, revision=0x42c 17:24
dijuremomicrocode: Microcode Update Driver: v2.2.17:24
dijuremoOur testing earlier today involved two separate users logging in and running the usual Matlab and glxgears for tests. It was all working well up until the point we added gnome-desktop to the machine. It seems the issues are related to GUI/Nvidia, but I cannot be sure.17:27
tyhicksthat matches the microcode revision in the latest blob from intel:17:27
tyhicks01/126: sig 0x000306e4, pf mask 0xed, 2018-01-25, rev 0x042c, size 1536017:27
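The comparison tyhicks makes by eye can be sketched in code. The two line formats are taken from the pastes above (the kernel's `microcode:` line and the blob listing line); the function names are hypothetical, and treating `pf` as a bit tested against `pf mask` is an assumption about how the platform flags are meant to be matched.

```python
import re

HEX = r"(0x[0-9a-fA-F]+)"


def parse_dmesg_microcode(line):
    """Parse 'microcode: sig=..., pf=..., revision=...'; return None for other lines."""
    m = re.search(rf"sig={HEX}, pf={HEX}, revision={HEX}", line)
    if m is None:
        return None
    sig, pf, revision = (int(group, 16) for group in m.groups())
    return {"sig": sig, "pf": pf, "revision": revision}


def parse_blob_entry(line):
    """Parse a listing line such as '01/126: sig 0x..., pf mask 0x.., DATE, rev 0x....'."""
    m = re.search(rf"sig {HEX}, pf mask {HEX}, (\d{{4}}-\d{{2}}-\d{{2}}), rev {HEX}", line)
    if m is None:
        return None
    return {
        "sig": int(m.group(1), 16),
        "pf_mask": int(m.group(2), 16),
        "date": m.group(3),
        "revision": int(m.group(4), 16),
    }


def blob_matches(cpu, blob):
    """True when the blob targets this CPU and the running revision equals the blob's."""
    same_cpu = cpu["sig"] == blob["sig"] and (cpu["pf"] & blob["pf_mask"]) != 0
    return same_cpu and cpu["revision"] == blob["revision"]


cpu = parse_dmesg_microcode("microcode: sig=0x306e4, pf=0x1, revision=0x42c")
blob = parse_blob_entry("01/126: sig 0x000306e4, pf mask 0xed, 2018-01-25, rev 0x042c, size 15360")
print(blob_matches(cpu, blob))
```

Parsing the values as integers avoids false mismatches from zero padding, such as `0x042c` versus `0x42c`.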
tyhicksdijuremo: which kernel are you running?17:29
dijuremo4.13.0-37-generic17:31
dijuremoBIOS Information 17:34
dijuremo        Vendor: Dell Inc. 17:34
dijuremo        Version: A16 17:34
dijuremo        Release Date: 02/05/201817:34
tyhicksdijuremo: at this point I think we're better off tracking all the failing combinations in a bug report17:42
tyhicksdijuremo: could you file one?17:43
dijuremoPoint me in the right direction, never done one for Ubuntu...17:44
tyhicksdijuremo: also, I think it'd be very helpful to confirm one more time that lockups do not happen after removing intel-microcode, downgrading to firmware A14, and then rebooting into 4.13.0-37-generic17:44
tyhicksdijuremo: I think you should initially file it against intel-microcode: https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+filebug17:45
dijuremoLet me try going to A14 without even removing gnome-desktop and I will report back. The system was stable with BIOS A14 and new microcode.17:45
tyhicksdijuremo: please include that info in the bug report17:45
=== himcesjf_ is now known as him-cesjf
tyhicksdijuremo: at this point, it is sounding like a bad BIOS update17:46
dijuremotyhicks: This behaviour is exactly the same on several DELL systems and one Lenovo system. Machines boot up just fine and freeze after attempting to log in. Or if you are logging off, right after log off, I think when lightdm restarts.17:47
dijuremoIn all cases, we have fixed the issues by downgrading the BIOS.17:47
tyhicksdijuremo: including the 'iucode-tool --scan-system' output from all affected systems would be helpful17:48
dijuremoBut in the other DELL machines I had not tested the latest microcode update17:48
dijuremoNeither the Lenovo M93p17:48
tyhicksdijuremo: I don't have a machine with 0x000306e4 so I'm unable to reproduce with that particular processor17:48
tyhicksdijuremo: big thanks for helping so much to test out all the different combinations17:49
dijuremoThe Lenovo had an i7, would that be a good test you can check?17:49
dijuremoThe T series are all Xeons. Seen it on the 5810 which has the E5-1680 and the 5820 with the newer W-2XXX cpus17:50
tyhicksno i7s but let me check my xeons17:50
dijuremoI am going to see what I can do, I have given these systems to end users, so not so sure they will be happy with me hijacking them for a few hours...17:51
tyhicksfwiw, I've been running the updated microcode for weeks now without any issue17:51
tyhicksI haven't applied any BIOS updates to those machines, though17:51
dijuremoDo you have Nvidia cards?17:51
tyhicksno17:51
tyhicksall machines are either intel graphics or headless17:51
dijuremoSo the BIOS is pretty much what breaks them...17:51
dijuremoAnd I think it is highly related as well to the nvidia driver. though we had one set of laptops with the freezing issue which was solved by running the 4.4.0-116 kernel I think.17:52
dijuremoIt was a couple of weeks ago, so I do not remember much. But bought two laptops, one came with an older firmware and took 4.4.13-036 without issue. The other machine had the newer firmware and would freeze.17:53
dijuremoI tried downgrading the BIOS on the laptop that was freezing, but the downgrade was blocked by DELL :(17:53
dijuremoAt that time I did not know of pti=off or nospectre_v2 or updating microcode, so I have not tried any of those....17:54
dijuremotyhicks: FWIW, the machine is currently up, without logging into it. So it seems that the graphical login triggers the freezes, though one time earlier today it fully froze during boot.17:59
dijuremotyhicks:  I am ssh-ing it from another machine and have a shell on it..18:00
tyhicksdijuremo: ok, so you have the A14 firmware with intel-microcode uninstalled?18:00
dijuremoNope, I have A16 without having logged in at all in the GUI18:00
dijuremoStill A16 microcode .deb from march I got from your PPA18:01
tyhicksdijuremo: I think we've established that the A16 BIOS locks up in all cases18:01
dijuremotyhicks: A16 locks up in all cases after login via the GUI. Sometimes as you log in, sometimes when you try to log off and lightdm restarts18:02
dijuremoThat is with the microcode from March (your PPA deb package).18:02
tyhicksdijuremo: I wonder if we'll see any microcode revisions reported by the kernel with A16 but with intel-microcode uninstalled18:02
dijuremoWithout the microcode, it freezes as soon as lightdm starts18:03
tyhickshuh18:03
tyhicksthat makes me think that the BIOS has the microcode that Intel pulled due to QA issues18:03
dijuremoWith the older 201801 microcode from the main Ubuntu repo18:03
tyhicksIntel briefly released updated microcode and then pulled it because users were experiencing issues18:04
dijuremoThese BIOS releases are from March, so supposedly vendors had not published the first problematic ones or removed them from their sites.18:04
dijuremoBut some of the 5820 and 5810 BIOS are very freshly released... 18:04
tyhicksI don't know how to tell which microcode revision is provided in a BIOS update18:04
dijuremoThe T3610 is from this month as well.18:05
dijuremoA16 Dell Precision Workstation T3610 System BIOS  BIOS06 Mar 201818:06
tyhickstiming doesn't tell us if they are shipping a possibly bad revision18:06
dijuremoI do not know how to check for good/bad18:07
dijuremoMy understanding was they were not releasing the ones with the bad fixes...18:07
dijuremoHow could one figure out what those are?18:07
tyhicksI'm not sure - thinking about that now...18:08
dijuremoFor the 5820 there is a newer BIOS released yesterday.... Dell Precision 5820 Tower System BIOS BIOS27 Mar 2018 1.4.018:08
tyhicksdijuremo: you could try uninstalling intel-microcode, rebooting, and then see if the 'dmesg -t | grep -i microcode' command shows anything18:08
dijuremoWill do that now18:09
dijuremoWe do not want the older microcode from 201801, right?18:09
tyhicksdijuremo: 'sudo cat /sys/devices/system/cpu/cpu0/microcode/version' would also be interesting18:09
tyhicksdijuremo: that's right18:09
dijuremoWith A16 cat /sys/devices/system/cpu/cpu0/microcode/version 18:09
dijuremo0x42c18:09
dijuremoWe want this to use the original CPU microcode, right?  ->   apt purge intel-microcode18:10
tyhicksdijuremo: right - we want to see what revision is reported when just using the microcode from the A16 BIOS 18:11
dijuremodmesg -t  | grep micro 18:13
dijuremomicrocode: sig=0x306e4, pf=0x1, revision=0x42c 18:13
dijuremomicrocode: Microcode Update Driver: v2.2.18:13
dijuremocat /sys/devices/system/cpu/cpu0/microcode/version  18:13
dijuremo0x42c18:13
dijuremoroot@localhost:~# dpkg -l | grep intel-microcode 18:13
dijuremoroot@localhost:~# 18:13
tyhicksdijuremo: that was after a reboot?18:13
dijuremoYes18:13
dijuremoroot@localhost:~# uptime 18:14
dijuremo 14:14:03 up 2 min,  1 user,  load average: 0.43, 0.53, 0.2418:14
tyhickshmm18:14
tyhicksI'm stumped18:14
dijuremoShould I downgrade to A14 and repeat?18:15
dijuremoWe should get a different microcode then, right?18:15
tyhicksthe intel-microcode deb and the A16 BIOS deliver the same microcode revision yet the lockup only happens with A1618:15
tyhicksdijuremo: yes, that's a good test18:15
tyhicksdijuremo: downgrade to A14, check what microcode revision is being reported, ensure that you can't trigger the lockup, then install the intel-microcode from the PPA and reboot, then see if you can trigger the lockup18:16
tyhicksthis is maddening but very helpful since we're trying to determine how widely to push out the new intel-microcode updates18:17
dijuremoWith the A14 BIOS18:21
dijuremoroot@localhost:~# cat /sys/devices/system/cpu/cpu0/microcode/version  18:21
dijuremo0x42818:21
tyhicksok, so it was downgraded18:22
dijuremodmesg -t  | grep micro 18:22
dijuremomicrocode: sig=0x306e4, pf=0x1, revision=0x428 18:22
dijuremomicrocode: Microcode Update Driver: v2.2.18:22
dijuremoI am going to log into GUI and log out a few times and also change desktops from unity to gnome with A14.18:22
tyhickssounds good18:22
dijuremoLogin works well for Unity and Gnome under my account and I asked someone else to log in to their account to gnome. Ran Matlab and glxgears no issue18:31
dijuremoSo total of 3 GUI logins and logouts without freezes on A14 BIOS and 4.13.0-37-generic kernel18:32
dijuremo 18:32
dijuremoWould the next test be to install your PPA intel-microcode package and test again?18:32
dijuremoSee if we can get it to crash?18:32
tyhicksdijuremo: yes, please18:33
dijuremoI also do not know that running matlab and glxgears are the best test, but since the freezes happened on a GUI, I thought those would be at least decent apps to run.18:34
dijuremoIf you can think of any other apps to run, let me know.18:34
tyhicksI've heard of one other report that involves starting up a VM18:34
tyhicksit seems like you have a fairly reliable reproducer though18:35
dijuremoI would have to install virtualbox and a VM... do not have one there at the moment.18:35
tyhicksno worries18:36
dijuremoInstalled intel-microcode from your PPA and now rebooting...18:36
ricotzsforshee, hi, are you about to rebase the bionic next branch to 4.15.14?18:36
dijuremoAfter reboot:18:37
dijuremodmesg -t  | grep micro 18:37
dijuremomicrocode: microcode updated early to revision 0x42c, date = 2018-01-25 18:37
dijuremomicrocode: sig=0x306e4, pf=0x1, revision=0x42c 18:37
dijuremomicrocode: Microcode Update Driver: v2.2.18:37
dijuremocat /sys/devices/system/cpu/cpu0/microcode/version  18:38
dijuremo0x42c18:38
dijuremodmidecode | grep -i version | grep 14 18:38
dijuremo        Version: A1418:38
dijuremo Bios is still A1418:38
* tyhicks nods18:39
dijuremoSo far no freezes... A14 BIOS with your Microcode Logged in 3 times user1 both Unity and Gnome, user2 to Gnome, logged out after running matlab bench(5) and glxgears, machine is still OK, no freezes.18:48
dijuremoDoes that tell us something else in the BIOS not related to the CPU microcode is to blame?18:49
tyhickspossibly18:49
tyhicksI'd say the only thing left to do is upgrade the BIOS to A16 and try once more18:49
dijuremoWill do, that should reproduce the freeze easily...18:49
sforsheericotz: that must have just come it. We'll undoubtedly get to that shortly.18:55
sforshee*out18:55
TJ-dijuremo: when the BIOS is updated do you reset to factory defaults or leave the existing BIOS config in place?18:55
ricotzsforshee, yeah, a couple hours, just notice you were pushing changes, so I was curious about this last release18:56
TJ-dijuremo: I've seen instances where a later BIOS changes the layout of the config NVRAM space and inherited values can be read for some other setting (due to different offsets, etc.)18:57
dijuremoAfter BIOS update I have not reset to factory defaults. I can try that. The first GUI login worked well, I ran Matlab and Glxgears, and then after logging out, the machine froze. Numlock key is no longer turning on/off and I cannot ssh into the machine 19:00
tyhickswhile maybe not definitive, I think it points at the BIOS update as introducing the issue19:04
tyhicksTJ- has a good idea19:05
dijuremoFor a moment I thought it worked, because this time I was able to log in and log out once, but during the second GUI login the machine froze and Numlock key does not respond nor does ssh...19:12
dijuremoNot sure if it would matter, but the BIOS is set to EFI mode.19:13
dijuremoHas been for all the testing....19:13
dijuremoSo resetting BIOS defaults with A16 did not really fix the problem either.19:13
tyhickson one hand I'm glad we were able to narrow it down to the new BIOS but on the other hand, I'm stumped at what the next steps would be19:18
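The narrowing-down reasoning can be written out as a small sketch. The rows are the combinations on the T3610 with kernel 4.13.0-37-generic whose outcome is stated plainly in the log; the table layout, field names, and `discriminating_factors` are hypothetical, and combinations whose outcome is ambiguous in the log are left out.

```python
from collections import defaultdict

# BIOS version, where the running microcode came from, reported revision, and outcome.
observations = [
    {"bios": "A14", "microcode": "BIOS only", "revision": 0x428, "froze": False},
    {"bios": "A14", "microcode": "PPA package", "revision": 0x42C, "froze": False},
    {"bios": "A16", "microcode": "PPA package", "revision": 0x42C, "froze": True},
]


def discriminating_factors(rows, factors=("bios", "microcode", "revision")):
    """Return the factors for which every value was seen with only one outcome."""
    result = []
    for factor in factors:
        outcomes = defaultdict(set)
        for row in rows:
            outcomes[row[factor]].add(row["froze"])
        if all(len(seen) == 1 for seen in outcomes.values()):
            result.append(factor)
    return result


print(discriminating_factors(observations))
```

Revision 0x42c shows up with and without freezes, so in this small sample only the BIOS version lines up with the outcome, which is consistent with tyhicks pointing at the BIOS update rather than the microcode revision.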
dijuremotyhicks: And this happens with several models and also a Lenovo M93p19:18
dijuremoExact same type of freezing with the later BIOS from manufacturers. Fix is to downgrade the BIOS where possible.19:19
dijuremoAnd in that DELL 7480 laptop with the latest BIOS, the fix was to stick to 4.4.0 kernel because 4.13.0 would freeze the machine.19:19
dijuremoI guess we could try downgrading the kernel to 4.4.0 here and test with A16. Would that provide any useful information?19:20
tyhicksit would be a good data point19:21
dijuremoThat may be tomorrow since it will involve clearing and rebuilding the nvidia driver, etc and I am getting close to leaving and have to wrap up other things. Will keep you posted.19:21
tyhicksthat's understandable - thanks for all the debugging you've done so far19:22
dijuremoThanks for all the help!!! I really appreciate it!19:22
TJ-tyhicks: can of worms!19:31
TJ-tyhicks: one question I've not been able to find answered in the Intel architecture docs is this: when doing a CPU warm reset does the loaded microcode get reset back to CPU-ROM version, or does a loaded microcode persist. Inference says it gets cleared on CPU reset but it isn't documented anywhere I've been able to find.19:39
tyhicksTJ-: hmm... I would think it gets cleared but I'm really not sure20:06
TJ-Aha.. Intel 64 and IA-32 Architecture Software Developers Manual vol 3A; System Programming Guide; Part 1 9.11.6.1: "The effects of a loaded update are cleared from the processor upon a hard reset. Therefore, each time a hard reset is asserted during the BIOS POST, the update must be reloaded on all processors that observed the reset. The effects of a loaded update are, however, maintained across a processor INIT."20:45
