=== himcesjf_ is now known as him-cesjf
[07:39] jsalisbury: I/we don't have any affected hardware for testing, and our production kernel has the problematic commit reverted currently. if alkisg does not report back, feel free to ping me and I can see about spinning up a test kernel and attempting to convince our affected users to try it out ;)
[07:40] It's rainy today, I'm not sure I'll be able to bike to the affected school...
[08:16] OK, I got remote access, testing http://kernel.ubuntu.com/~jsalisbury/lp1742630/i386/linux-image-4.13.0-25-generic_4.13.0-25.29~lp1742630_i386.deb
[08:17] apw: the 4.4 origin/pti branch is missing GPR scrubbing on vmexit for Intel / VMX
[08:18] the upstream mainline commit has both Intel and AMD in one commit, your branch contains one by AMD for AMD only
[08:18] a1c61c3a6dec6fca6380fa7aa294978dc84e616c in xenial's pti branch
[08:20] 0cb5b30698fdc8f6b4646012e3acb4ddce430788 in mainline
[08:59] jsalisbury: in cooperation with the school teacher, he tells me that 4.10.0-42 boots, your 4.13.0-25 does NOT boot, and the stock 4.13.0-26 does NOT boot either
[09:00] Also, now I have 3 schools affected
[09:08] alkisg, i can't tell if you're saying that jsalisbury's test kernel also does not boot
[09:08] bjf: exactly, jsalisbury's test kernel also does not boot
[09:09] I can go visit some school in person if that can somehow help
[09:18] E.g. maybe I could boot with `debug` and get some better error message than "it reboots"... :)
=== jdstrand_ is now known as jdstrand
[09:28] alkisg jsalisbury: we've had positive feedback from affected users for our kernel with the revert, and Debian went the same route in their 4.9 (which got the buggy commit via -stable it seems?). maybe that would be the better short-term option given the lack of response upstream, especially with users being forced to 4.13 from a previously working 4.10..
[09:29] f_g: I don't know your case exactly, but if you have a kernel with some reverted commit that I could test, different from the one of jsalisbury's, I'd be glad to
[09:29] (i386 only here though)
[09:30] f_g, what bug number is that ...
[09:30] alkisg, if someone can tell me what patch we are reverting i can give you a kernel to test at least
[09:30] alkisg: it's a derivative, so it's not a stock Ubuntu kernel. amd64 only as well, so that is likely not much help to you. you can just do a git revert of the offending commit, bump the version and build your own (for testing purposes). the ubuntu wiki has details ;)
[09:31] apw: the scsi/libsas one? #1726519
[09:31] f_g, the one you and alkisg are talking about
[09:31] apw: jsalisbury mentioned https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742630
[09:31] Launchpad bug 1742630 in linux (Ubuntu Artful) "Booting from 4.13 leads to Oops: NULL pointer dereference - RIP: isci_task_abort_task+0x30/0x3e0 [isci]" [High,Triaged]
[09:31] both are the same bug ;)
[09:31] I tested his kernel but it did NOT solve the issue
[09:32] alkisg, his was adding the "likely fix" i assume
[09:33] f_g, and 909657615d9b ("scsi: libsas: allow async aborts") is what you reverted with success ?
[09:33] apw: exactly
[09:33] apw: he's mentioning what he did in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742630 comment #5
[09:33] Launchpad bug 1742630 in linux (Ubuntu Artful) "Booting from 4.13 leads to Oops: NULL pointer dereference - RIP: isci_task_abort_task+0x30/0x3e0 [isci]" [High,Triaged]
[09:34] alkisg, and did you test the kernel he refers to in comment #34, which to my eye claims the revert is applied ?
[09:34] apw: exactly, although he had to rebuild for i386 just for me
[09:34] bah, i need to smack him, there are no patches in that directory
[09:35] alkisg, and it did not fix you is that also true ?
[09:35] Hehe
[09:35] apw: the school teacher that tested jsalisbury's kernel reported that it did not boot
[09:35] I installed it for him, he said he got a black screen when he selected it in grub, and then he booted with the previous 4.10 kernel
[09:36] apw: if you spin an i386 kernel for me, I can test it as soon as it's ready
[09:36] (in person)
[09:36] no it appears if you had an i386 built then it wasn't that one
[09:37] apw, that one: http://kernel.ubuntu.com/~jsalisbury/lp1742630/i386/
[09:37] alkisg, right, that actually #34 points to this one, which has no i386 ... http://kernel.ubuntu.com/~jsalisbury/lp1726519/
[09:38] alkisg, so i don't think you have actually tested the revert, but some other "likely fix"
[09:38] apw: jsalisbury's prepared an amd64 build with the revert. I told him I don't have amd64 to test with that, and he issued another build of that just for me
[09:38] apw: that likely fix comment is by me - I based that on the Fixes stanza and commit message, like I said I don't have any affected hardware to test.
[09:39] apw: so afaik they both have the same fix; but jsalisbury would know more...
[09:39] alkisg, right but the one you just pasted to me, is not the same build ... it is a build pointed to by the "likely fix" commit
[09:39] ok
[09:39] there is one user reporting that the patched kernel fixes the issue, so maybe alkisg's "does not boot" is unrelated
[09:39] alkisg, so i would say basically we have no idea because jsalisbury didn't include the patches in the results so we cannot tell ... *grrrr*
[09:40] alkisg, so all i can suggest is i build you an i386 with that definitively reverted
[09:40] apw, if you can build/point me to an i386 build with the revert, I can test it immediately
[09:40] ok
[09:40] Cool
[09:40] alkisg, so this is a xenial linux-hwe ... right ?
[09:40] Right
[09:41] alkisg: if you go there in person, it would be good to check if you get an oops that matches the description.
all the reports I have print the oops loud and clear right at boot
[09:42] f_g: I can arrange to be there in half an hour; the problem is I can't stay there for a loooong time, let's say half/one hour
[09:42] So, it'll be best if we gather whatever I need to test before going there
[09:42] alkisg: yes, having the kernel with revert at hand for testing is a good idea :)
[09:43] alkisg, you can test this kernel remotely i think ?
[09:43] apw: sure, but I don't mind going there, in case I can get more info with "debug" or something
[09:43] alkisg, lets see if it is this i guess :)
[09:43] ok
[10:00] alkisg, http://people.canonical.com/~apw/lp1726519-xenial/
[10:00] apw: ty
[10:05] Eh the teacher left the school, I'll go there and test in 30'
[10:06] how annoying
[10:07] you need a school in your house, the only sensible solution :/
[11:18] apw: missing the GPR scrub on vmexit for VMX in 4.13/pti as well (compared to mainline again)
[11:42] f_g, yep, thanks, will get that replaced, what a mess the world is right now
[11:49] indeed.
[11:49] * apw wonders how alkisg is getting on
[11:50] f_g, pti branch is updated with those now, pending them passing some kind of testing
[11:52] alkisg, any luck?
we are running out of runway to include anything in this respin
[11:52] apw: we've been running with them for a week already without any negative reports
[11:53] f_g, great, the vmx one i am pretty confident with as they were clean applications
[11:53] it is the other thing that is worrying me right now, but it may become moot fairly soon
[11:54] yep, the vmx part is from google to fix a google PoC - I guess that part is already pretty battle-tested ;)
[12:57] apw: initial limited testing looks good for both 4.4 and 4.13
[12:58] I hope re-integrating the pti branches with mainline RETPOLINE won't cause too many problems - lots of users out there still running affected CPUs that will never get IBRS and IBPB
[13:27] f_g, upstream will be adding ibrs/ibpb on top of their retpoline, once that exists and stops moving we'll be wanting to flip to that
[13:34] apw: sorry, I went to the school but there were many students in the computer lab and it was very hard to reboot. I did install the kernel, and I'm waiting for the result tomorrow morning.
[13:35] (eh, I didn't explain that we're using a server/netbooted client model, and the problem happened on the server)
[13:35] okies, then we'll ignore that one for now
[15:11] alkisg, f_g I'm reviewing the scrollback on this channel now. I'll review the bug comments too, but it sounds like the kernel posted in comment #5 of bug 1742630 does not boot for you, alkisg? Did the kernel apw built for you resolve the bug?
[15:11] bug 1742630 in linux (Ubuntu Artful) "Booting from 4.13 leads to Oops: NULL pointer dereference - RIP: isci_task_abort_task+0x30/0x3e0 [isci]" [High,Triaged] https://launchpad.net/bugs/1742630
[15:31] jsalisbury: correct, your kernel didn't boot, I'll know tomorrow about apw's kernel
[15:32] alkisg, ok, thanks!
[15:32] Thank you too :)
[15:32] alkisg, if you can, can you grab a screen shot of digital picture of my kernel when it does not boot?
[15:32] s/of/or/
[15:32] jsalisbury: sure; will "debug" help more?
[15:33] I'll check both anyways
[15:33] alkisg, it would be good to see if there is a panic or stack trace for my test kernel
[15:33] ok
[15:33] alkisg, I'd like to send that info upstream as well, and provide feedback on the patch.
[21:15] IDENTIFY jsalisbury
[21:15] he shoots ... and misses
=== apw_ is now known as apw
[21:16] heh