[03:08] <happyaron> I want to know how I can debug apparmor for an lxc/lxd related issue?
[04:43] <jjohansen> happyaron: https://gitlab.com/apparmor/apparmor/wikis/AppArmor_Failures
[04:44] <jjohansen> in addition there is a debug mode
[04:44] <jjohansen> as root echo 1 > /sys/module/apparmor/parameters/debug
[04:45] <jjohansen> this will cause a few extra messages to the kernel ring buffer (dmesg) that may help
[04:47] <jjohansen> another one to watch for is seccomp no_new_privs which can block apparmor from making domain transitions
[04:48] <jjohansen> apparmor does emit a message for this, but you may need the debug setting turned on
[05:03] <happyaron> jjohansen: thanks! looking into that
[12:46] <jcdutton> Hi. How is the bad bug coming along? Have we found out how to un-protect the SPI?
[13:04] <sladen> jcdutton: the protection actually seems to be on the flash chip itself, which boots up read-only
[13:06] <sladen> jcdutton: and would normally get unlocked by some other means when required
[13:12] <sladen> ypwong: Mika is going to contact you about testing leaving FSMIE as-is
[13:15] <sladen> jcdutton: you disappeared, but yes, people are working on it
[13:18] <jcdutton> Maybe it has changed the Protected Regions?
[13:21] <sladen> jcdutton: my current theory is that the SMM firmware normally takes care of either  (a) clearing the protect on first access;  (b) going into lockdown when something unexpected is encountered
[13:22] <sladen> jdstrand: however we are dependent on those that have access to hardware to test.  And they are currently having dinner (because of the timezone) and will return to testing after a short break
[13:22] <sladen> jcdutton: ^^
[13:25] <jcdutton> I read that a reboot 3 times might help.  My research is that the reboot requires complete power off. I.e. no mains, no battery, power off, then on again
[13:27] <sladen> jcdutton: yes, battery out, power off
[13:27] <sladen> jcdutton: and does it?
[13:27] <sladen> jcdutton: what did you discover?
[13:28] <sladen> jcdutton: probably should have written 'power cycle'
[13:28] <jcdutton> I don't have a problem Laptop, but the datasheets say that the write-lock is only cleared on a complete power cycle.
[13:31] <ypwong> sladen, will do. If I break it I will have to go back to ODM to unbrick it.
[13:34] <sladen> ypwong: ta.  what would also be useful are before/after dumps of the flash.  This would help to confirm if it is eg. a broken checksum (not) being updated, and which is causing the firmware not to unlock the chip during boot the next time
[13:36] <sladen> ypwong: second would be the debug information that the intel-spi driver dumps out during boot, to see what the initial state of the SPI status and control registers are
[13:36] <sladen> ypwong: but take your time and have a well-earned break!
[13:47] <jcdutton> ypwong, What do ODM do to it to unbrick it?
[13:56] <ypwong> jcdutton, they use a spi flasher writer to change the CMP bit from 1 to 0
[13:56] <ypwong> jcdutton, https://www.dediprog.com/pd/spi-flash-solution/sfdk01
[13:56] <ypwong> sladen, will get to that after my meeting that's 4 mins from now :)
[14:02] <jcdutton> ypwong, sounds a bit odd to me. what would have set that to 1?
[14:02] <ypwong> jcdutton, that's what we are finding out
[14:02] <ypwong> kernel code looks sane but somehow that bit changed to 1
[14:05] <sladen> jcdutton: several lines of inquiry, latest speculation is related to  https://github.com/torvalds/linux/commit/9d63f17661e25fd28714dac94bdebc4ff5b75f09  which if it's not in the running kernel might leave junk in the FIFO from the previous access
[14:07] <sladen> jcdutton: though my reading of the documentation is that the flash chip comes up with CMP=1, and needs to be explicitly cleared, or WSP=1 needs setting to use a different protection scheme
[14:09] <jcdutton> ypwong, the problem with CMP being wrong, is that next to it are Write-Once bits, that if they are written, will perm brick the laptop.
[14:12] <sladen> jcdutton: yup, the latest speculation is that although the code does a read-modify-write (to set something else in that register);  if the 'read' was for the incorrect data, then the write will also be for the incorrect data
[14:13] <sladen> jcdutton: https://www.winbond.com/resource-files/w25q64fw_revd_032513.pdf  is the Winbond doc
[14:15] <jcdutton> That commit is crazy code.
[14:16] <sladen> jcdutton: why so?
[14:16] <jcdutton> It does not check for len = 0
[14:18] <sladen> mmm, that's certainly a good way to set all-1s
[14:22] <jcdutton> Does it need to do write-posting on that write?
[14:27] <sladen> this (risky) code could certainly do with some more error/sanity checking
[14:27] <sladen> ie. validating every single bit in what is going to be written back
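A minimal sketch of the kind of sanity check sladen describes here, assuming the register is read back as one 32-bit value. The helper name `write_back_is_safe` and the `allowed_mask` parameter are hypothetical and are not taken from the intel-spi or spi-nor code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of any real driver.
 * Returns true only if writing new_val over read_val looks plausible:
 *  - the read did not come back as all-ones (often a sign of a failed read)
 *  - every bit that differs between the two values lies inside allowed_mask
 */
static bool write_back_is_safe(uint32_t read_val, uint32_t new_val,
                               uint32_t allowed_mask)
{
    if (read_val == UINT32_MAX)
        return false;

    uint32_t changed = read_val ^ new_val;
    return (changed & ~allowed_mask) == 0;
}
```

A caller would skip the write entirely when this returns false, so a corrupted read cannot turn into a corrupted write.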
[14:29] <jcdutton> Is this code actually used for anything, apart from re-flashing ?  In which case, it should not write anything until someone actually wishes to re-flash
[14:33] <jcdutton> sladen, that patch does fix a bug in the read statement, that is a good fix, but I don't think that would cause the problems we are seeing.
[14:35] <sladen> jcdutton: the *whole problem* is that is collateral from the _init()/probe code. ...which should not be doing *anything* invasive/risky
[14:35] <sladen> jcdutton: the code isn't even used
[14:36] <sladen> jcdutton: so in theory "nothing" is happening
[14:39] <jcdutton> Surely some of the writel in the init need write-posting?
[14:40] <jcdutton> although, better if it never did a write in that init code.
[14:45] <sladen> jcdutton: write-posting?
[14:52] <jcdutton> With PCI, in order to actually write something, you have to read it afterwards, to force a PCI transaction to actually pass the data across the bus
[14:52] <jcdutton> so, a writel without a following readl is unlikely to behave as expected
[14:55] <jcdutton> I was asking, in case this chatting with the SPI is not via a PCI bus, in which case it does not matter about write-posting
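The write-then-read-back pattern jcdutton is describing can be sketched as follows. This is a userspace illustration only: the function name is hypothetical, and a kernel driver would use `writel()` followed by `readl()` on an `ioremap()`ed address rather than dereferencing a volatile pointer.

```c
#include <stdint.h>

/*
 * Write a value to a memory-mapped register, then read the same register
 * back. On a bus that posts (buffers) writes, the read cannot complete
 * until the earlier write has reached the device.
 */
static uint32_t write_then_read_back(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;   /* may be posted by a bridge on the way to the device */
    return *reg;  /* the read-back forces the write to complete first */
}
```

Whether this is needed depends on the register sitting behind a bus with posted writes, which is the question being asked in the chat.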
[15:09] <sladen> jcdutton: no, but I do have a feeling that some of this code is failing to triple check whether the SPI is ready, and not in the middle of something else
[20:53] <jcdutton> sladen, where in the code is it writing  the CMP bit?
[21:03] <sladen> jcdutton: the sr2 register in ...
[21:03] <jcdutton> I cannot see a write to sr2 in the init function
[21:04] <sladen> jcdutton: wait, trying to find it
[21:05] <jcdutton> I see, sr1-4 is all one 32 bit reg
[21:05] <sladen> jcdutton: it's in spi_nor_init()  write_sr(nor, 0);
[21:06] <sladen> jcdutton: the intention is to _clear_ the protection bits
[21:06] <sladen> jcdutton: however
[21:06] <sladen> jcdutton: and the really worrying thing about this (as you've already noted) is that the same register has write-once fuses in it as well
[21:06] <sladen> as at the very least that routine should mask those out
[21:07] <sladen> so that it is impossible for an _init_ routine to write anything dangerous, even if the reading got screwed up
[21:07] <jcdutton> Agreed
[21:08] <sladen> this is one avenue of investigation
[21:08] <sladen> another is that the BIOS System Management Firmware normally clears it
[21:09] <jcdutton> A simple AND 0xffff00ff  would do it
[21:10] <jcdutton> But that only works if the QUAD and CMP should be 0. In some cases that might not be true.
[21:11] <jcdutton> Maybe 0xffff83ff  would be safer, in the sense that it would be reversible in software. Those other bits are not reversible.
[21:12] <sladen> preferably   ~(DANGEROUSBIT(X) | DANGEROUSBIT(Y) | DANGEROUSBIT(Z))  where the enums make it less easy to screw up
[21:12] <sladen> and more obvious what is happening
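A sketch of the named-bit mask being proposed, built so that it reproduces jcdutton's 0xffff83ff. It assumes SR1 sits in bits 0-7 and SR2 in bits 8-15 of the 32-bit value; the bit names follow the Winbond datasheet linked earlier, but the positions and the identifiers `SR_NEVER_SET` and `strip_dangerous_bits` are assumptions that should be checked against the actual part.

```c
#include <stdint.h>

/* Assumed SR2 layout, shifted up by 8 because SR1 occupies bits 0-7. */
enum sr_bits {
    SR2_SRP1 = 1 << 8,
    SR2_QE   = 1 << 9,
    SR2_RSVD = 1 << 10,
    SR2_LB1  = 1 << 11,  /* LB1..LB3: write-once lock bits */
    SR2_LB2  = 1 << 12,
    SR2_LB3  = 1 << 13,
    SR2_CMP  = 1 << 14,
    SR2_SUS  = 1 << 15,
};

/* Bits an init routine should never be able to set (hypothetical policy). */
#define SR_NEVER_SET \
    ((uint32_t)(SR2_RSVD | SR2_LB1 | SR2_LB2 | SR2_LB3 | SR2_CMP))

/* Clear the dangerous bits from a value before it is written back. */
static uint32_t strip_dangerous_bits(uint32_t val)
{
    return val & ~SR_NEVER_SET;
}
```

With this layout `~SR_NEVER_SET` equals 0xffff83ff, so SRP1, QE and SUS pass through unchanged while the lock bits and CMP are always written as 0.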
[21:12] <sladen> think the code is trying to enable the Quad mode
[21:14] <jcdutton> I think the only real solution to this problem is an intel-spi driver with a mass of quirks in it, to undo the damage it has done.
[21:14] <sladen> jcdutton: something like that is in the process of being tested...
[21:14] <sladen> jcdutton: but needs to not make the situation any worse
[21:15] <jcdutton> it can get worse???
[21:15] <sladen> (if it is only CMP getting set it is reversible)
[21:15] <sladen> and in software
[21:15] <sladen> if any of the write-protect fuses get blown, then it is bad
[21:17] <jcdutton> Maybe we need to get a sample. write a version of the driver that is 100% safe, and have it dump the sr2 to the syslog. and gather results. I think netboot works for everyone, so one could create a netboot image that just prints the results, without causing more damage.
[21:18] <sladen> booting with 'debug' should be enough
[21:19] <jcdutton> Let the tool offer to fix the CMP bit, but explain that if the Write-once bits are set, it is a return to base fix.
[21:19] <jcdutton> No-one in their right mind is going to run that intel-spi driver in its current state.
[21:20] <jcdutton> Unless the ~dangerous code is added.
[21:20] <jcdutton> I also found out that this is a PCI device
[21:21] <jcdutton> so they also need to add the write-post bits.
[21:23] <sladen> knock up a proposed patch for both
[21:23] <sladen> along with citations explaining the why
[21:24] <sladen> given the number of impacted people (probably 1000x those who have reported it)
[21:25] <sladen> it needs to be done slowly and carefully with review
[21:25] <jcdutton> Agreed. I offer to review the proposed fix.
[21:27] <jcdutton> sladen, FYI, I used to be a kernel developer, but don't have a lot of time for it now.
[21:29] <jcdutton> sladen, have there been any reports of 3 power cycles fixing the problem?
[21:30] <sladen> jcdutton: no, that was my guess on the basis of (1) first time blacklisting the driver (2) letting the firmware come up and get upset and broken checksums etc + cleanup (3) hopefully come up in a better state
[21:31] <sladen> jcdutton: however, if it's not the SMI disabling, but is instead CMP in SR2 on the flash getting set because of a corrupt FIFO read before, then that is probably not going to help
[21:31] <sladen> jcdutton: and the latest information points to the need to reset CMP in SR2 on the flash, in software
[21:31] <sladen> jcdutton: the difficulty for me and others is the lack of hardware to test
[21:33] <jcdutton> Agreed. particularly with the possibility of making it worse due to the write-once bits.
[21:33] <jcdutton> You really need someone with the laptop and a flash programmer to hand, who can test and fix if needed.
[21:35] <sladen> which we have
[21:35] <sladen> sort-of
[21:35] <sladen> and my approach to this would be working out how to reliably brick it, in order to unbrick it by not doing that
[21:37] <jcdutton> Where is the fifo you speak of?
[21:43] <jcdutton> A stepping stone towards that could be a bit of code that simply reports: "Software fixable. CMP flipped"  or "Hardware replacement needed" if the write-once are set.
[21:43] <jcdutton> and put that in a bootable image
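A rough sketch of the read-only reporting step jcdutton suggests, assuming the SR2 byte has already been read safely and that the "write-once" bits are LB1-LB3 (bits 3-5) with CMP at bit 6, as in the Winbond layout. All names here are hypothetical, and the "no problem" case is an addition for completeness.

```c
#include <stdint.h>

enum flash_verdict {
    FLASH_OK,
    FLASH_SOFTWARE_FIXABLE,
    FLASH_NEEDS_HARDWARE,
};

#define SR2_LB_MASK 0x38u  /* assumed: LB1..LB3, write-once lock bits */
#define SR2_CMP_BIT 0x40u  /* assumed: CMP, can be cleared in software */

/* Classify an SR2 value without writing anything to the chip. */
static enum flash_verdict classify_sr2(uint8_t sr2)
{
    if (sr2 & SR2_LB_MASK)
        return FLASH_NEEDS_HARDWARE;
    if (sr2 & SR2_CMP_BIT)
        return FLASH_SOFTWARE_FIXABLE;
    return FLASH_OK;
}

/* Map a verdict to the messages proposed in the chat. */
static const char *verdict_message(enum flash_verdict v)
{
    switch (v) {
    case FLASH_SOFTWARE_FIXABLE:
        return "Software fixable. CMP flipped";
    case FLASH_NEEDS_HARDWARE:
        return "Hardware replacement needed";
    default:
        return "No problem detected";
    }
}
```

Checking the write-once bits first means a chip with both CMP and a lock bit set is reported as needing hardware work, which is the more cautious answer.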