[12:28] <dupondje> anyone could give me some pointers to debug the following?
[12:28] <dupondje> [ 2990.419420] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[12:28] <dupondje> [ 2990.419434] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[12:28] <dupondje> [ 2990.419441] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[12:28] <dupondje> [ 2990.419446] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
[12:29] <dupondje> already tested latest daily built kernel, but same errors
[12:35] <apw> dupondje, are you seeing any other symptoms with that error, as that just says it is a corrected error
[12:36] <apw> ie is it a one off, does it repeat, does the machine crater after it
[12:41] <dupondje> apw: it repeats, sometimes after 1 minute, sometimes only 1 per hour
[12:41] <dupondje> it's random
[12:41] <dupondje> but nothing seems to hang/lock/die whatever :)
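The id=00e8 field in the log above is a 16-bit PCIe requester ID packed as bus (8 bits), device (5 bits), and function (3 bits). A minimal Python sketch of unpacking it follows. The function name decode_requester_id is made up for illustration and is not part of any kernel tool.

```python
def decode_requester_id(id_hex: str) -> str:
    """Turn a 16-bit PCIe requester ID (hex string) into bus:device.function."""
    value = int(id_hex, 16)
    bus = (value >> 8) & 0xFF      # upper 8 bits
    device = (value >> 3) & 0x1F   # next 5 bits
    function = value & 0x07        # lowest 3 bits
    return f"{bus:02x}:{device:02x}.{function:x}"


# The ID from the pcieport message in the log.
print(decode_requester_id("00e8"))  # prints 00:1d.0
```

Decoded this way, 00e8 is the same 00:1d.0 root port that reports the error, so the port is naming itself as the transmitter.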
[12:42] <apw> and i assume you didn't see them on older kernels
[12:42] <apw> if so it may well be we learned how to report them
[12:43] <dupondje> well I saw it on stock 17.10 kernel and on daily kernel
[12:43] <dupondje> it's on a new laptop, so no idea about older kernels :)
[12:46] <dupondje> also seeing:
[12:46] <dupondje> [   54.376166] nvme 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0400(Receiver ID)
[12:46] <dupondje> [   54.376169] nvme 0000:04:00.0:   device [1c5c:1284] error status/mask=00000081/0000e000
[12:46] <dupondje> [   54.376172] nvme 0000:04:00.0:    [ 0] Receiver Error         (First)
[12:46] <dupondje> [   54.376174] nvme 0000:04:00.0:    [ 7] Bad DLLP   
[12:46] <dupondje> but way less frequent
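The "error status/mask" pairs in both reports are bit fields, and the bracketed numbers the kernel prints ([12], [ 0], [ 7]) are the set bits of the status word. A rough Python sketch of the same decoding is below. The table of names is written from the PCIe AER correctable-error register layout as an assumption; only bits 0, 7, and 12 are confirmed by the log itself, and decode_status is a hypothetical helper name.

```python
# Bit positions in the AER Correctable Error Status register.
# Only bits 0, 7 and 12 appear in the log; the rest are assumed from the spec layout.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_status(status_mask: str):
    """Split 'status/mask' (hex) and return the names of the bits set in each."""
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)

    def names(value):
        return [CORRECTABLE_BITS.get(bit, f"bit {bit} (unknown)")
                for bit in range(32) if value & (1 << bit)]

    return names(status), names(mask)


for pair in ("00001000/00002000", "00000081/0000e000"):
    reported, masked = decode_status(pair)
    print(pair, "reported:", reported, "masked:", masked)
```

For the root port this gives a single reported bit, Replay Timer Timeout, and for the nvme device it gives Receiver Error plus Bad DLLP, matching the lines the kernel printed.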
[12:46] <apw> was that nvme installed when you bought it ?
[12:47] <apw> both of those errors are reporting that transactions on PCIE are failing, that would often imply h/w issues
[12:47] <dupondje> yes
[12:49] <dupondje> brand new device
[12:49] <dupondje> it's of course possible something is wrong with it ...
[12:52] <apw> or indeed that "occasional" errors are to be expected as long as they are corrected
[12:52] <apw> you really need to find another machine of the same type and find out if it has the issue
[12:52] <apw> what kind of machine is it
[12:53] <dupondje> Dell Precision 5520
[12:53] <TJ-> It could be an ASPM issue
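One way to follow up on the ASPM suggestion is to look at which ASPM policy the kernel is currently using. The sketch below assumes the policy is exposed at /sys/module/pcie_aspm/parameters/policy with the active entry in square brackets, which is the case on kernels built with ASPM support; the helper name active_aspm_policy is invented for this example.

```python
from pathlib import Path

# Assumed location of the ASPM policy knob; absent if the kernel lacks ASPM support.
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text: str):
    """Return the bracketed (active) entry from the policy file contents, or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    try:
        print("ASPM policy:", active_aspm_policy(POLICY_PATH.read_text()))
    except OSError as exc:
        print("could not read ASPM policy:", exc)
```

This only reports the current setting. Whether changing it makes the corrected errors go away would still have to be tested on the machine.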
[12:54] <dupondje> could try booting 16.04 on a liveusb
[12:54] <dupondje> and see if it happens there also ...
[12:56] <TJ-> dupondje: is it a XPS 9560 ?
[12:56] <TJ-> oh no, sorry, you said!
[12:57] <dupondje> TJ-: it's not, but XPS 9560 is actually (exactly?) the same hardware ...
[12:57] <dupondje> afaik
[12:57] <TJ-> dupondje: I see some reports with Dell + Hynix M.2 NVMe 
[12:58] <dupondje> Model Number:                       PC300 NVMe SK hynix 512GB
[12:58] <dupondje> :D
[12:58] <dupondje> TJ-: link?
[12:58] <TJ-> there's a workaround here but it sounds a bit drastic, maybe apw can comment? https://bbs.archlinux.org/viewtopic.php?id=229682
[13:01] <TJ-> dupondje: it may be the ACPI DSDT isn't correctly configuring the device
[13:02] <dupondje> but that's a bug in the kernel? Guess I'd better open a bug report then?
[13:06] <TJ-> If it's ACPI it'll be a firmware bug. There's a common change I often recommend where there are unusual hardware issues, and it's very successful. See http://iam.tj/prototype/enhancements/Windows-acpi_osi.html
[13:09] <dupondje> still that's a workaround then :) Needs to be fixed in the NVMe firmware then?
[13:11] <TJ-> dupondje: no. if the issue is ACPI and that fixes it, the bug is in the Dell PC firmware assuming it is running with Windows and not fully configuring the system when Linux is the OS
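When testing an ACPI override like this, it helps to confirm which acpi_osi= settings the running kernel was actually booted with. A minimal Python sketch that pulls them out of a kernel command line follows. The sample strings in the test line are illustrative only and are not taken from the linked page; acpi_osi_args is a hypothetical helper name.

```python
import shlex


def acpi_osi_args(cmdline: str):
    """Return the values of every acpi_osi= parameter on a kernel command line."""
    prefix = "acpi_osi="
    # shlex handles quoted values that contain spaces.
    return [tok[len(prefix):] for tok in shlex.split(cmdline) if tok.startswith(prefix)]


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            print("acpi_osi settings:", acpi_osi_args(fh.read()))
    except OSError as exc:
        print("could not read /proc/cmdline:", exc)
```

An empty list means no override is in effect, so any difference in behaviour would come from something else.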
[13:14] <dupondje> guess I should poke Dell then :)
[13:14] <dupondje> they deliver the laptop with Ubuntu 16.04 on it, so you would expect that it works fine
[13:18] <TJ-> Does it work without that error on 16.04? I think you said you're using 17.10 on it ?
[13:18] <TJ-> Recent kernels have tightened up the ACPI implementation so we see a lot more of this kind of issue as a result
[13:22] <dupondje> just booted into livecd of 16.04 now
[13:25] <dupondje> but seems to be fine on 16.04, no such errors
[13:35] <dupondje> So the conclusion is that it's a BIOS/ACPI bug that is visible only in recent kernels?
[13:39] <TJ-> dupondje: the hint is there, yes
[13:42] <dupondje> hmmmm, guess Dell doesn't have a bugtracker :P