=== aaa is now known as Hofman
=== Hofman is now known as slopisadictator
=== slopisadictator is now known as aaa
[11:22] cking, how silly is this udev rule - https://github.com/ibm-s390-tools/s390-tools/blob/master/etc/udev/rules.d/60-readahead.rules#L10 ? why is read_ahead_kb 128? should the kernel read_ahead_kb be bumped? is there a sysctl to set the default read_ahead_kb in the kernel?
[11:22] * xnox is grepping the kernel and getting lost
[11:23] can i bump it to 512 everywhere?
[11:24] q->backing_dev_info->ra_pages =
[11:24] (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
[11:24] oooh
[11:24] 512 is quite large for all systems isn't it?
[11:25] http://paste.ubuntu.com/p/btw3vKkx8X/
[11:25] is my thinking....
[11:25] cking, are you saying 128kb should be enough for everybody - Bill Gates?
[11:26] heh, well, today's drives are getting faster and memory is larger so a larger readahead makes some sense I suppose
[11:26] cking, if that's too large for all systems, i want this exposed as a /all/ sysfs ctl such that I can configure the default at runtime and apply to all.
[11:26] cause at the moment, i can either set it on each device as they appear.... which is racy. Or recompile the kernel.
[11:28] cking, given 4k native drives, bumping the readahead 4 times makes sense to keep the readahead at a constant number of sectors, no?
[11:28] xnox, the problem is that the default may not be ideal for all use cases. for example, doing lots of random I/O you don't want a really large readahead
[11:29] large readahead makes sense when you are streaming data in sequentially
[11:29] on 4k native drives, the cost should be the same as it used to be with 128kb & 512 sector drives.... unless i'm a clueless muppet =)
[11:29] * xnox looks at the histogram of all reads an "average" machine does.
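The arithmetic behind the kernel expression quoted above can be checked with a short sketch. The function names are hypothetical, and the 4096-byte page size is an assumption (the real PAGE_SIZE depends on architecture and kernel configuration); only the formula (VM_MAX_READAHEAD * 1024) / PAGE_SIZE comes from the discussion.

```python
# Sketch of the readahead arithmetic discussed above.
# Assumption: PAGE_SIZE is 4096 bytes; function names are illustrative only.

def ra_pages_from_kb(read_ahead_kb: int, page_size: int = 4096) -> int:
    """Mirror the kernel expression (VM_MAX_READAHEAD * 1024) / PAGE_SIZE."""
    return (read_ahead_kb * 1024) // page_size


def readahead_sectors(read_ahead_kb: int, sector_size: int) -> int:
    """Number of disk sectors covered by a readahead window of read_ahead_kb."""
    return (read_ahead_kb * 1024) // sector_size


if __name__ == "__main__":
    for kb in (128, 512):
        print(
            f"{kb} KB: {ra_pages_from_kb(kb)} pages, "
            f"{readahead_sectors(kb, 512)} x 512-byte sectors, "
            f"{readahead_sectors(kb, 4096)} x 4096-byte sectors"
        )
```

Under these assumptions, 128 KB covers 256 sectors of 512 bytes, while 512 KB covers 128 sectors of 4096 bytes. A 4x bump therefore halves the sector count on 4k native drives rather than keeping it constant; an 8x bump would match it.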
[11:30] * xnox notes there is no such graph
[11:30] well, you still have bus bandwidth limitations, large readahead will impact even with large sectors, so the choice is a good compromise, which is hard to guess
[11:31] i'm wondering if this is true in general, given all of this is hidden inside the bcache / md / filesystem layers, and most of that stuff is very extents-happy these days, etc.
[11:31] cking, and ibm is bumping that up, on s390x, because everything is big on big iron?
[11:31] xnox, s390x can do I/O *really* fast
[11:31] * xnox notes that it looks like that udev rule they commit alongside the bootloader is meant to run everywhere.
[11:32] it's one of its key wins
[11:32] oh, so if I condition this to s390, people will complain? =)
[11:32] * xnox about to make s390x bloody fast; and make all the other kids complain
[11:33] xnox, whatever is chosen is going to be sub-optimal for somebody
[11:33] however, ibm recommends this and all of us ship that udev rule, so for everybody we go with ibm recommendation on this one.....
[11:34] one can overcommit z/vms too
[11:34] very true
[11:35] i really don't like how it is *1024/PAGE_SIZE. Surely a default of a specific number of sectors or pages makes more sense. And systems with larger disk sectors / memory page_size get more caching.
[11:36] btrfs does funky stuff to the device.
[11:36] sb->s_bdi->ra_pages = VM_MAX_READAHEAD * SZ_1K / PAGE_SIZE;
[11:36] sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);
[11:36] sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);
[11:37] nice, so a device specific setting behaves differently depending on the file system, yay
[11:37] btrfs goes to 11
[11:38] cking, well, btrfs does take over the whole device, if the whole device is added to the pool....
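The three btrfs lines quoted above can be traced with a small sketch. It assumes VM_MAX_READAHEAD is 128 (the value under discussion) and a 4096-byte page size; the function name is hypothetical and this is only a transcription of the quoted expressions, not btrfs code.

```python
# Sketch of the btrfs ra_pages computation quoted above.
# Assumptions: VM_MAX_READAHEAD = 128 (KB) and PAGE_SIZE = 4096 bytes.

SZ_1K = 1024
SZ_4M = 4 * 1024 * 1024
VM_MAX_READAHEAD = 128  # in KB, as discussed in the channel


def btrfs_ra_pages(num_devices: int, page_size: int = 4096) -> int:
    """Follow the three quoted statements step by step."""
    ra_pages = VM_MAX_READAHEAD * SZ_1K // page_size  # per-device default
    ra_pages *= num_devices                           # scale by device count
    return max(ra_pages, SZ_4M // page_size)          # never below 4 MiB


if __name__ == "__main__":
    for n in (1, 4, 32, 64):
        pages = btrfs_ra_pages(n)
        print(f"{n} device(s): {pages} pages = {pages * 4096 // 1024} KB")
```

With these numbers the 4 MiB floor dominates until the filesystem spans more than 32 devices, so a single-device btrfs would end up with a 4096 KB readahead regardless of the 128 KB default.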
[11:39] md layer does stuff with stripe sizes
[11:40] nfs too
[11:41] * cking wonders what zfs does
[11:51] tyhicks: I was running: ii intel-microcode 3.20180108.0+really20170707ubuntu16.04.1
[11:51] I have installed your package, intel-microcode (3.20180312.0~ubuntu16.04.1)
[11:51] And will conduct testing and report back.
[11:53] cking: it sets ra_pages to 0
[11:54] ah no, to 1 actually
[11:54] f_g, that's not what I expected for sure
[11:54] courtesy of having its own prefetcher
[11:54] ah, that makes sense
[11:54] bc17f1047a83cc8c4065e0ef84333a0d9b9d73aa
[11:54] :)
[12:19] tyhicks: My first test was to update the microcode, reboot, and keep the old firmware A14. Machine seemed fine, though it fully hung on my first test. The test was to log in, run glxgears and Matlab: bench(5) then log off. The machine hung up after logging off at the login screen where a darker shade appeared over the whole screen. Num Lock would not work on the keyboard. I force restarted the machine, repeated the same process 3 times with
[12:32] tyhicks: My second test was to update to A16, which is the latest firmware for the DELL T3610. I booted up, was able to log in, run glxgears and Matlab, log off and log back in and repeat the test several times (3+) with no problem. The machine was crashing (screen hung) right after entering the password and hitting enter when I was using the outdated microcode. Any idea when the new microcode package is going to be made available in the m
[14:02] dijuremo: thanks for testing! I believe that the security team plans to publish the new intel-microcode this week
[14:02] dijuremo: could you paste the output of 'iucode-tool --scan-system' when you get a chance?
[17:15] tyhicks: oh well, spoke too soon, now that we have added gnome-desktop per user request, the machine is back to crashing either at the login screen or if we log out after starting a graphical session... sigh...
[17:17] iucode-tool: system has processor(s) with signature 0x000306e4
[17:19] Is there a need to rebuild the nvidia kernel module with the latest Intel Microcode in place?
[17:20] dijuremo: no, there shouldn't be
[17:22] dijuremo: can you pastebin the output of 'dmesg -t | grep -i microcode'? are you still running firmware A16?
[17:23] dijuremo: also, can you elaborate on what you mean by "crashing"?
[17:24] The system freezes completely, only a hard reboot from the power button will reboot it. Numlock stops turning on/off on the system
[17:24] microcode: sig=0x306e4, pf=0x1, revision=0x42c
[17:24] microcode: Microcode Update Driver: v2.2.
[17:27] Our testing earlier today involved two separate users logging in and running the usual Matlab and glxgears for tests. It was all working well up until the point we added gnome-desktop to the machine. It seems the issues are related to GUI/Nvidia, but I cannot be sure.
[17:27] that matches the microcode revision in the latest blob from intel:
[17:27] 01/126: sig 0x000306e4, pf mask 0xed, 2018-01-25, rev 0x042c, size 15360
[17:29] dijuremo: which kernel are you running?
[17:31] 4.13.0-37-generic
[17:34] BIOS Information
[17:34]   Vendor: Dell Inc.
[17:34]   Version: A16
[17:34]   Release Date: 02/05/2018
[17:42] dijuremo: at this point I think we're better off tracking all the failing combinations in a bug report
[17:43] dijuremo: could you file one?
[17:44] Point me in the right direction, never done one for Ubuntu...
[17:44] dijuremo: also, I think it'd be very helpful to confirm one more time that lockups do not happen after removing intel-microcode, downgrading to firmware A14, and then rebooting into 4.13.0-37-generic
[17:45] dijuremo: I think you should initially file it against intel-microcode: https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+filebug
[17:45] Let me try going to A14 without even removing gnome-desktop and I will report back. The system was stable with BIOS A14 and new microcode.
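The comparison made above ("that matches the microcode revision in the latest blob from intel") can be sketched in a few lines. The regular expressions only target the two line shapes pasted in the channel, the function names are hypothetical, and the matching rule (equal signature, plus an overlap between the CPU's platform flags and the blob's pf mask) is stated here as an assumption about how updates are selected rather than taken from the log.

```python
import re

# Line shapes taken from the pastes above; anything else is not handled.
KERNEL_RE = re.compile(
    r"sig=0x([0-9a-fA-F]+), pf=0x([0-9a-fA-F]+), revision=0x([0-9a-fA-F]+)"
)
BLOB_RE = re.compile(
    r"sig 0x([0-9a-fA-F]+), pf mask 0x([0-9a-fA-F]+), "
    r"(\d{4}-\d{2}-\d{2}), rev 0x([0-9a-fA-F]+)"
)


def parse_kernel_line(line: str) -> dict:
    """Parse a 'microcode: sig=..., pf=..., revision=...' dmesg line."""
    m = KERNEL_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognised kernel line: {line!r}")
    sig, pf, rev = (int(x, 16) for x in m.groups())
    return {"sig": sig, "pf": pf, "rev": rev}


def parse_blob_line(line: str) -> dict:
    """Parse an 'NN/MMM: sig ..., pf mask ..., date, rev ...' blob listing line."""
    m = BLOB_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognised blob line: {line!r}")
    sig, pf_mask, date, rev = m.groups()
    return {"sig": int(sig, 16), "pf_mask": int(pf_mask, 16),
            "date": date, "rev": int(rev, 16)}


def blob_applies(cpu: dict, blob: dict) -> bool:
    """Assumed rule: same signature and overlapping platform flags."""
    return cpu["sig"] == blob["sig"] and (cpu["pf"] & blob["pf_mask"]) != 0


if __name__ == "__main__":
    cpu = parse_kernel_line("microcode: sig=0x306e4, pf=0x1, revision=0x42c")
    blob = parse_blob_line(
        "01/126: sig 0x000306e4, pf mask 0xed, 2018-01-25, rev 0x042c, size 15360"
    )
    print("blob applies to this CPU:", blob_applies(cpu, blob))
    print("running revision equals blob revision:", cpu["rev"] == blob["rev"])
```

Parsing the hex fields into integers avoids false mismatches from zero padding, such as 0x000306e4 versus 0x306e4 or 0x042c versus 0x42c.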
[17:45] dijuremo: please include that info in the bug report
=== himcesjf_ is now known as him-cesjf
[17:46] dijuremo: at this point, it is sounding like a bad BIOS update
[17:47] tyhicks: This behaviour is the exact same in several DELL and one Lenovo System. Machines boot up just fine and freeze after attempting to log in. Or if you are logging off, right after log off, I think when lightdm restarts.
[17:47] In all cases, we have fixed the issues by downgrading the BIOS.
[17:48] dijuremo: including the 'iucode-tool --scan-system' output from all affected systems would be helpful
[17:48] But in the other DELL machines I had not tested the latest microcode update
[17:48] Nor on the Lenovo M93p
[17:48] dijuremo: I don't have a machine with 0x000306e4 so I'm unable to reproduce with that particular processor
[17:49] dijuremo: big thanks for helping so much to test out all the different combinations
[17:49] The Lenovo had an i7, would that be a good test you can check?
[17:50] The T series are all Xeons. Seen it on the 5810 which has the E5-1680 and the 5820 with the newer W-2XXX cpus
[17:50] no i7s but let me check my xeons
[17:51] I am going to see what I can do, I have given these systems to end users, so not so sure they will be happy with me hijacking them for a few hours...
[17:51] fwiw, I've been running the updated microcode for weeks now without any issue
[17:51] I haven't applied any BIOS updates to those machines, though
[17:51] Do you have Nvidia cards?
[17:51] no
[17:51] all machines are either intel graphics or headless
[17:51] So the BIOS is pretty much what breaks them...
[17:52] And I think it is highly related as well to the nvidia driver. though we had one set of laptops with the freezing issue which was solved by running the 4.4.0-116 kernel I think.
[17:53] It was a couple of weeks ago, so I do not remember much. But bought two laptops, one came with an older firmware and took 4.4.13-036 without issue.
The other machine had the newer firmware and would freeze.
[17:53] I tried downgrading the BIOS on the laptop that was freezing, but the downgrade was blocked by DELL :(
[17:54] At that time I did not know of pti=off or nospectre_v2 or updating microcode, so I have not tried any of those....
[17:59] tyhicks: FWIW, the machine is currently up, without logging into it. So it seems that the graphical login triggers the freezes, though one time earlier today it fully froze during boot.
[18:00] tyhicks: I am ssh-ing into it from another machine and have a shell on it..
[18:00] dijuremo: ok, so you have the A14 firmware with intel-microcode uninstalled?
[18:00] Nope, I have A16 without having logged in at all in the GUI
[18:01] Still A16 microcode .deb from march I got from your PPA
[18:01] dijuremo: I think we've established that the A16 BIOS locks up in all cases
[18:02] tyhicks: A16 locks up in all cases after login via the GUI. Sometimes as you log in, sometimes when you try to log off and lightdm restarts
[18:02] That is with the microcode from March (your PPA deb package).
[18:02] dijuremo: I wonder if we'll see any microcode revisions reported by the kernel with A16 but with intel-microcode uninstalled
[18:03] Without the microcode, it freezes as soon as lightdm starts
[18:03] huh
[18:03] that makes me think that the BIOS has the microcode that Intel pulled due to QA issues
[18:03] With the older 201801 microcode from the main Ubuntu repo
[18:04] Intel briefly released updated microcode and then pulled it because users were experiencing issues
[18:04] These BIOS releases are from March, so supposedly vendors had not published the first problematic ones or removed them from their sites.
[18:04] But some of the 5820 and 5810 BIOS are very freshly released...
[18:04] I don't know how to tell which microcode revision is provided in a BIOS update
[18:05] The T3610 is from this month as well.
[18:06] A16 Dell Precision Workstation T3610 System BIOS BIOS 06 Mar 2018
[18:06] timing doesn't tell us if they are shipping a possibly bad revision
[18:07] I do not know how to check for good/bad
[18:07] My understanding was they were not releasing the ones with the bad fixes...
[18:07] How could one figure out what those are?
[18:08] I'm not sure - thinking about that now...
[18:08] For the 5820 there is a newer BIOS released yesterday.... Dell Precision 5820 Tower System BIOS BIOS 27 Mar 2018 1.4.0
[18:08] dijuremo: you could try uninstalling intel-microcode, rebooting, and then see if the 'dmesg -t | grep -i microcode' command shows anything
[18:09] Will do that now
[18:09] We do not want the older microcode from 201801, right?
[18:09] dijuremo: 'sudo cat /sys/devices/system/cpu/cpu0/microcode/version' would also be interesting
[18:09] dijuremo: that's right
[18:09] With A16 cat /sys/devices/system/cpu/cpu0/microcode/version
[18:09] 0x42c
[18:10] We want this to use the original CPU microcode, right? -> apt purge intel-microcode
[18:11] dijuremo: right - we want to see what revision is reported when just using the microcode from the A16 BIOS
[18:13] dmesg -t | grep micro
[18:13] microcode: sig=0x306e4, pf=0x1, revision=0x42c
[18:13] microcode: Microcode Update Driver: v2.2.
[18:13] cat /sys/devices/system/cpu/cpu0/microcode/version
[18:13] 0x42c
[18:13] root@localhost:~# dpkg -l | grep intel-microcode
[18:13] root@localhost:~#
[18:13] dijuremo: that was after a reboot?
[18:13] Yes
[18:14] root@localhost:~# uptime
[18:14] 14:14:03 up 2 min, 1 user, load average: 0.43, 0.53, 0.24
[18:14] hmm
[18:14] I'm stumped
[18:15] Should I downgrade to A14 and repeat?
[18:15] We should get a different microcode then, right?
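The sysfs check used above only looks at cpu0. A minimal sketch for reading the same file for every CPU follows; the path layout is the one pasted in the channel, while the function name and the idea of flagging mixed revisions are additions for illustration. On some systems the file may only be readable by root, hence the 'sudo' in the suggested command.

```python
from pathlib import Path


def microcode_versions(cpu_root: str = "/sys/devices/system/cpu") -> dict:
    """Return {cpu_name: revision} from <cpu_root>/cpuN/microcode/version."""
    versions = {}
    for vfile in sorted(Path(cpu_root).glob("cpu[0-9]*/microcode/version")):
        cpu_name = vfile.parent.parent.name           # e.g. "cpu0"
        versions[cpu_name] = int(vfile.read_text().strip(), 16)
    return versions


if __name__ == "__main__":
    found = microcode_versions()
    for cpu, rev in found.items():
        print(f"{cpu}: {rev:#x}")
    if len(set(found.values())) > 1:
        print("warning: CPUs report different microcode revisions")
```

Taking the root directory as a parameter keeps the function testable against a temporary directory that mimics the sysfs layout.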
[18:15] the intel-microcode deb and the A16 BIOS deliver the same microcode revision yet the lockup only happens with A16
[18:15] dijuremo: yes, that's a good test
[18:16] dijuremo: downgrade to A14, check what microcode revision is being reported, ensure that you can't trigger the lockup, then install the intel-microcode from the PPA and reboot, then see if you can trigger the lockup
[18:17] this is maddening but very helpful since we're trying to determine how widely to push out the new intel-microcode updates
[18:21] With the A14 BIOS
[18:21] root@localhost:~# cat /sys/devices/system/cpu/cpu0/microcode/version
[18:21] 0x428
[18:22] ok, so it was downgraded
[18:22] dmesg -t | grep micro
[18:22] microcode: sig=0x306e4, pf=0x1, revision=0x428
[18:22] microcode: Microcode Update Driver: v2.2.
[18:22] I am going to log into GUI and log out a few times and also change desktops from unity to gnome with A14.
[18:22] sounds good
[18:31] Login works well for Unity and Gnome under my account and I asked someone else to log in to their account to gnome. Ran Matlab and glxgears no issue
[18:32] So total of 3 GUI logins and logouts without freezes on A14 BIOS and 4.13.0-37-generic kernel
[18:32] Would the next test be to install your PPA intel-microcode package and test again?
[18:32] See if we can get it to crash?
[18:33] dijuremo: yes, please
[18:34] I also do not know that running matlab and glxgears are the best test, but since the freezes happened on a GUI, I thought those would be at least decent apps to run.
[18:34] If you can think of any other apps to run, let me know.
[18:34] I've heard of one other report that involves starting up a VM
[18:35] it seems like you have a fairly reliable reproducer though
[18:35] I would have to install virtualbox and a VM... do not have one there at the moment.
[18:36] no worries
[18:36] Installed intel-microcode from your PPA and now rebooting...
[18:36] sforshee, hi, are you about to rebase the bionic next branch to 4.15.14?
[18:37] After reboot:
[18:37] dmesg -t | grep micro
[18:37] microcode: microcode updated early to revision 0x42c, date = 2018-01-25
[18:37] microcode: sig=0x306e4, pf=0x1, revision=0x42c
[18:37] microcode: Microcode Update Driver: v2.2.
[18:38] cat /sys/devices/system/cpu/cpu0/microcode/version
[18:38] 0x42c
[18:38] dmidecode | grep -i version | grep 14
[18:38]   Version: A14
[18:38] Bios is still A14
[18:39] * tyhicks nods
[18:48] So far no freezes... A14 BIOS with your Microcode. Logged in 3 times: user1 to both Unity and Gnome, user2 to Gnome, logged out after running matlab bench(5) and glxgears, machine is still OK, no freezes.
[18:49] Does that tell us something else in the BIOS not related to the CPU microcode is to blame?
[18:49] possibly
[18:49] I'd say the only thing left to do is upgrade the BIOS to A16 and try once more
[18:49] Will do, that should reproduce the freeze easily...
[18:55] ricotz: that must have just come out. We'll undoubtedly get to that shortly.
[18:55] dijuremo: when the BIOS is updated do you reset to factory defaults or leave the existing BIOS config in place?
[18:56] sforshee, yeah, a couple hours, just noticed you were pushing changes, so I was curious about this last release
[18:57] dijuremo: I've seen instances where a later BIOS changes the layout of the config NVRAM space and inherited values can be read for some other setting (due to different offsets, etc.)
[19:00] After BIOS update I have not reset to factory defaults. I can try that. The first GUI login worked well, I ran Matlab and Glxgears, and then after logging out, the machine froze.
Numlock key is no longer turning on/off and I cannot ssh into the machine
[19:04] while maybe not definitive, I think it points at the BIOS update as introducing the issue
[19:05] TJ- has a good idea
[19:12] For a moment I thought it worked, because this time I was able to log in and log out once, but during the second GUI login the machine froze and Numlock key does not respond nor does ssh...
[19:13] Not sure if it would matter, but the BIOS is set to EFI mode.
[19:13] Has been for all the testing....
[19:13] So resetting BIOS defaults with A16 did not really fix the problem either.
[19:18] on one hand I'm glad we were able to narrow it down to the new BIOS but on the other hand, I'm stumped at what the next steps would be
[19:18] tyhicks: And this happens with several models and also a Lenovo M93p
[19:19] Exact same type of freezing with the later BIOS from manufacturers. Fix is to downgrade the BIOS where possible.
[19:19] And in that DELL 7480 laptop with the latest BIOS, the fix was to stick to 4.4.0 kernel because 4.13.0 would freeze the machine.
[19:20] I guess we could try downgrading the kernel to 4.4.0 here and test with A16. Would that provide any useful information?
[19:21] it would be a good data point
[19:21] That may be tomorrow since it will involve clearing and rebuilding the nvidia driver, etc and I am getting close to leaving and have to wrap up other things. Will keep you posted.
[19:22] that's understandable - thanks for all the debugging you've done so far
[19:22] Thanks for all the help!!! I really appreciate it!
[19:31] tyhicks: can of worms!
[19:39] tyhicks: one question I've not been able to find answered in the Intel architecture docs is this: when doing a CPU warm reset does the loaded microcode get reset back to the CPU-ROM version, or does a loaded microcode persist? Inference says it gets cleared on CPU reset but it isn't documented anywhere I've been able to find.
[20:06] TJ-: hmm...
I would think it gets cleared but I'm really not sure
[20:45] Aha.. Intel 64 and IA-32 Architectures Software Developer's Manual vol 3A; System Programming Guide; Part 1 9.11.6.1: "The effects of a loaded update are cleared from the processor upon a hard reset. Therefore, each time a hard reset is asserted during the BIOS POST, the update must be reloaded on all processors that observed the reset. The effects of a loaded update are, however, maintained across a processor INIT.