tsimonq2 | under "What do you need help with?" on https://wiki.ubuntu.com/Kernel/GettingInvolved , it shows an outdated link and I don't know where the real link would be. | 02:42 |
---|---|---|
dsmythies | tsimonq2, I fixed the link. Try it now. | 03:59 |
apw | dsmythies, it is a consequence of CONFIG_ZONE_DEVICE turning off CONFIG_ZONE_DMA yes | 08:29 |
apw | dsmythies, not convinced the constraints are reasonable, at least for x86_64 where you don't have HIGHMEM so i think there is room for both though the config constraints prevent that | 09:06 |
apw | dsmythies, is that card used for something we are going to notice, like qemu ? | 09:06 |
tsimonq2 | thanks dsmythies | 11:38 |
rtg | caribou: could you have a look at bug #1528101 and tell me if kexec is functioning as designed ? | 13:26 |
ubot5 | bug 1528101 in linux (Ubuntu) "ISST-LTE: kdump failed: second kernel booting hangs after /scripts/init-bottom when large min_free_kbytes value being set" [Undecided,New] https://launchpad.net/bugs/1528101 | 13:26 |
caribou | rtg: sure | 13:27 |
apw | caribou, not sure if we have any sane way generically to know which options are good to keep and which are not for a kdump kernel, i wonder if we almost want a separate directory of overrides for kdump. | 13:28 |
caribou | apw: I've been running tests to try to come up with some sensible rules for the definition of crashkernel | 13:29 |
caribou | but there are so many variables that come into play that it is kinda hard | 13:30 |
caribou | for instance, I've been testing on VM with up to 22Gb : the default 128M works fine | 13:30 |
apw | so we might just say that this one is a specific override to be something sane like 1% | 13:31 |
apw | and be happy | 13:31 |
caribou | apw: but when I test on real hardware with 128G of memory, crashkernel=128M comes short not because of the memory but because of the fact that initrd is bigger and it OOMs | 13:31 |
apw | caribou, ok, even with your magic smaller one ? | 13:32 |
apw | so an =dep initrd is still too big for 128M / | 13:33 |
apw | ? | 13:33 |
caribou | apw: that was on trusty so the change was not there; I went on to rebuild with MODULES=dep & got bitten by the initramfs-tools bug that I fixed | 13:33 |
caribou | apw: it works now with a smaller initrd | 13:33 |
apw | ahh ok | 13:33 |
apw | of course this bug might be moot with that fix too, hrm | 13:34 |
caribou | apw: but in this case, the boot sequence fails before kdump even comes into play | 13:34 |
caribou | apw: partly, on system with SSDs, the kernel hook would fail when building the smaller initrd file | 13:34 |
apw | which we should fix i assume | 13:35 |
apw | for this bug, they are setting min_free somewhere, presumably in /etc/sysctl.d.conf and we really should not be honouring that in the constrained environment | 13:36 |
apw | it's not clear how one would know that though programmatically | 13:36 |
caribou | apw: LP: #1532146 is waiting for a sponsor if you have a minute ;-) | 13:36 |
ubot5 | Launchpad bug 1532146 in initramfs-tools (Ubuntu) "update-initramfs fails for MODULES=dep when root is on nvme device" [Medium,In progress] https://launchpad.net/bugs/1532146 | 13:36 |
apw | caribou, yeah it's on my todo list :) i have another initramfs-tools bodge to ram in, just updating aufs for a nasty hang bug | 13:37 |
caribou | apw: that's fine, as long as it is queued to be fixed somewhere | 13:37 |
caribou | apw: in that case (LP: #1532146), it doesn't even make it to kdump & fails before | 13:38 |
ubot5 | Launchpad bug 1532146 in initramfs-tools (Ubuntu) "update-initramfs fails for MODULES=dep when root is on nvme device" [Medium,In progress] https://launchpad.net/bugs/1532146 | 13:38 |
caribou | apw: so you're right, we should find a way to boot the kexec kernel w/o using any tailored sysctl values | 13:38 |
caribou | oups, wrong bug : bug #1528101 | 13:39 |
ubot5 | bug 1528101 in linux (Ubuntu) "ISST-LTE: kdump failed: second kernel booting hangs after /scripts/init-bottom when large min_free_kbytes value being set" [Undecided,New] https://launchpad.net/bugs/1528101 | 13:39 |
caribou | rtg: so for your question : kexec doesn't even get activated so yes, works as designed ;-) | 13:39 |
rtg | caribou: since I don't really understand the fine points of the discussion, I'll bug apw and make him explain it. | 13:41 |
caribou | apw: everything in /etc/sysctl.conf + /etc/sysctl.conf.d should not be honoured when the kexec kernel boots | 13:43 |
apw | caribou, should not as in you think they are not, or you think they are and shouldn't ? | 13:44 |
caribou | apw: looks like they are (if indeed they are changing vm.min_free_kbytes with sysctl) | 13:45 |
caribou | apw: so when the kexec kernel boots post-panic, the kernel should boot w/o implementing what's in sysctl.conf[.d] if possible | 13:46 |
caribou | apw: the kexec kernel is only there to allow for the capture of the kernel dump | 13:47 |
caribou | apw: so there is no reason to apply any user-defined customization there | 13:47 |
caribou | am I being clear ? | 13:47 |
rtg | caribou: that makes sense to me | 13:48 |
caribou | apw: rtg: it should be trivial to add a check in /etc/init.d/procps to verify if we have booted in the kexec kernel & not apply the changes | 13:49 |
rtg | caribou: how would you preserve the original sysctl values ? | 13:49 |
rtg | nm, the new kernel would have them as part of its image. | 13:50 |
caribou | apw: rtg: /proc/vmcore is only present if we are booted on the kexec kernel post-panic. This is how kdump-config knows that it has to capture the content | 13:51 |
rtg | caribou: what agent applies sysctl values to the kexec kernel ? kdump ? the initrd ? | 13:51 |
tseliot | rtg: I've just fixed fglrx in xenial. It was pretty easy this time | 13:52 |
caribou | rtg: /usr/sbin/procps apparently | 13:52 |
caribou | rtg: no, /etc/init.d/procps | 13:52 |
rtg | tseliot, thanks, it was the primary show stopper for a 4.4 kernel in xenial | 13:52 |
tseliot | good :) | 13:52 |
ricotz | rtg, tseliot, nice :) | 13:55 |
apw | caribou, yeah, that sounds like a sensible option, i might suggest it ought to possibly have a kdump.conf.d or something which is applied instead, just in case there is something you need | 14:00 |
caribou | apw: makes sense; the override would only apply kdump.conf if it exists | 14:01 |
apw | caribou, then in the unlikely event there is something you need in both you link the file into both .d's and be happy | 14:02 |
caribou | apw: though, it does nothing for this bug as I doubt that touching something as central as procps would qualify for SRU | 14:03 |
apw | caribou, there is nothing to say you can't change things like that, we might need to make it opt in or something | 14:03 |
apw | caribou, so like if there is a kdump.conf we use that _rather_ than the normal ones, and they can make it empty | 14:03 |
apw | sysctl-kdump.conf | 14:04 |
apw | so that the default doesn't change perhaps, something like that | 14:04 |
apw | you are our dump expert, and if you say it's wrong that kinda makes it wrong and wrong things can be fixed in SRUs | 14:05 |
caribou | apw: well the argument that it only applies in the context of the kexec post-panic kernel and never in a normal user context might be sufficient | 14:05 |
caribou | apw: am I ???? :-D | 14:05 |
apw | caribou, clearly :) | 14:05 |
caribou | ok, let me run with that & see what I can propose; I'll also comment in the bug | 14:06 |
caribou | apw: ^ | 14:06 |
rtg | caribou: would you like me to take a stab at the patch and run it by you ? | 14:06 |
apw | rtg, or you could sponsor it for him | 14:07 |
rtg | apw, that to | 14:07 |
rtg | too* | 14:07 |
apw | sweet sorted | 14:07 |
caribou | rtg: let me try it, then I'll run it by you guys | 14:07 |
rtg | ack | 14:07 |
caribou | apw: rtg: is it possible to force vm.min_free_kbytes as a boot parameter (for a temporary workaround) ? | 14:14 |
apw | caribou, it is possible yes | 14:18 |
apw | i can't say for certain without looking though | 14:18 |
* rtg does not see anything promising in Documentation/kernel-parameters.txt | 14:20 | |
caribou | apw: don't bother, I'll look it up; thought you might know it off the top of your head | 14:20 |
apw | ak | 14:21 |
apw | ack | 14:21 |
caribou | apw: fedora fixed it a while ago : https://lists.fedoraproject.org/pipermail/kexec/2014-November/001478.html | 14:25 |
caribou | apw: not too fond of how they fixed it though | 14:25 |
apw | caribou, no, but then there is precedent, perhaps we can do that in SRUs and something cleverer in the new releases | 14:27 |
caribou | apw: indeed | 14:28 |
apw | certainly i'd not want to hard code it like that, ugg | 14:28 |
apw | and that only fixes it if it is in /etc/sysctl.conf at that, not sysctl.d ... hrm | 14:29 |
rtg | eeew! | 14:29 |
caribou | tbh, I'm not sure if the fix went in but it was reported | 14:30 |
caribou | rtg: apw: here is a question for you : does the kernel rely on sysctl settings to boot correctly (i.e. do we use that mechanism to fix some behaviors) ? | 14:32 |
apw | caribou, well as they are applied in root, they cannot be used to fix root disk related behaviours | 14:33 |
rtg | caribou: there might be a lot more console noise | 14:33 |
caribou | rtg: I also see some hardening things in there we might want to keep | 14:34 |
rtg | caribou: doesn't kexec immediately reboot ? | 14:34 |
rtg | after acquiring the dump | 14:35 |
caribou | rtg: yes | 14:35 |
rtg | in which case hardening might not be so important | 14:35 |
apw | caribou, and we don't bring up things like networking do we | 14:39 |
caribou | apw: yes, especially if we want to do remote dumps | 14:39 |
caribou | apw: but it is not even dependent on remote dumps : network comes up systematically | 14:40 |
caribou | made my life soo much easier when I developed remote dumps | 14:40 |
apw | caribou, then there are options there which may be necessary, hrm | 14:42 |
apw | caribou, how about then having a sysctl-kdump.conf which is loaded _after_ all the others, which can set things "back" | 14:43 |
caribou | apw: interesting; the only thing is to know _what_ to set "back" | 14:44 |
apw | caribou, well we literally only know of one thing which should not be set, so it could start with that | 14:44 |
apw | (if we know what a sane value is) | 14:45 |
caribou | apw: in the worst case scenario, it would be the sysadmin's responsibility to add to this file in case of problem; at least the mechanism is there | 14:46 |
apw | caribou, yeah, something like that | 14:46 |
caribou | apw: & we override the values for things we know would cause problems | 14:46 |
apw | the only issue really is knowing what to put in that one | 14:46 |
rtg | perhaps just vm.min_free_kbytes for now (which we _know_ is causing problems) | 14:48 |
caribou | rtg: that's my idea for a starte | 14:48 |
caribou | starter | 14:48 |
caribou | rtg: apw: as I don't think we have an easy way to identify what was the original value in the kernel's namelist before sysctl changed it | 14:49 |
apw | caribou, yeah, and its magic essentially | 14:49 |
apw | asses | 14:49 |
caribou | apw: ok let's start with that | 14:50 |
rtg | if it is never set after kexec boots, then it is by default set to the kernel value, right ? | 14:50 |
rtg | which ought to be OK | 14:50 |
caribou | rtg: the kexec kernel will apply the sysctl changes just like the regular kernel | 14:51 |
caribou | rtg: just trying to find a way to avoid disabling sysctl altogether | 14:51 |
rtg | caribou: except that you are going to modify procps to do something different (I thought) | 14:51 |
apw | caribou, of course we know how much memory we have, it is 128M so a fixed value for that is "ok" | 14:51 |
caribou | apw: let me run a first pass at it & see what I can come up with | 14:54 |
apw | rtg, the point here is that sysctls can be required for networking and the like | 14:54 |
apw | rtg, and knowing which are ok and which not is hard | 14:54 |
rtg | understood | 14:54 |
apw | caribou, i assume what happens now is we copy these into the initrd, apply them all there, and then again once we mount root (for non kdump boots) | 14:55 |
apw | caribou, which is why eliding them by name works for the fedora case | 14:55 |
caribou | apw: no, apparently the sysctl stuff is taken from the real root & not the initrd | 14:56 |
apw | caribou, well that code is copying them into the initrd and they elide the problem one at that point | 14:57 |
apw | anyhow have a go and see what happens | 14:57 |
caribou | apw: sure | 14:58 |
dsmythies | apw: I do not actually have the sound card (Creative Labs SB Audigy) and came here on behalf of someone else, after helping them isolate the issue (at least to the point where I needed help here). Yes, his AlsaMixer program isn't working, as the card isn't being detected. I also wonder if some of those other sound card differences in the kernel configuration file are for the same reason (ref: http://paste.ubuntu.com/14481736/ ) | 15:00 |
apw | dsmythies, oddness, the delta i was shown was just the one card | 15:01 |
apw | rtg, ^ ? | 15:01 |
apw | rtg, did we investigate whether we could remove that config snippet for CONFIG_ZONE_DEVICE ? | 15:01 |
tjaalton | 4.4 doesn't see my nvme drive.. boot fails because of that | 15:03 |
tjaalton | mainlines at least | 15:08 |
tjaalton | are the soon-to-hit xenial packages somewhere? | 15:08 |
apw | tjaalton, ckt unstable ppa | 15:09 |
rtg | apw, I have not. | 15:09 |
tjaalton | apw: thx | 15:09 |
apw | rtg, i think we need to resolve what we are doing about that one before we release 4.4 as well | 15:11 |
rtg | apw, maybe have a diff setting for lowlatency ? | 15:11 |
apw | rtg, well whether we can in fact turn on both for the case we care about, amd64 | 15:13 |
dsmythies | apw, rtg: This issue is on both lowlatency and generic, but yes amd64. | 15:13 |
apw | rtg, obviously the config constraints don't let us, but i think on amd64 we have no HIGHMEM so we have one more spare bit, and if we do then we can have zone_dma and zone_device together | 15:14 |
apw | that means i386 can't have both, but we're not caring about those | 15:14 |
rtg | apw, yeah, I'm not too concerned about NVDIMM on 32 bit | 15:15 |
dsmythies | tjaalton: Have a look at this: http://askubuntu.com/questions/709794/dell-xps-13-9350-compatibility/719947#719947 I do not know if it is true or not that CONFIG_BLK_DEV_NVME=y is needed. | 15:16 |
manjo | I have been modifying debian/rules.d/2-binary-arch.mk to skip modules signing if the private key is not available .. is there a way to tell builds in LP not to fail if the private key is not available? or what is the correct way of building it ? | 15:18 |
manjo | apw, ^ ? | 15:20 |
apw | manjo, we generate the private key during the build ? | 15:22 |
tjaalton | dsmythies: sounds like initramfs fail to me | 15:23 |
manjo | install-%: MODSECKEY=$(builddir)/build-$*/certs/signing_key.pem | 15:23 |
manjo | apw, ^ looks like it expects the key to be there | 15:23 |
manjo | apw, Maybe I am missing something .. I have been doing if [[ -f "$(MODSECKEY)" ]] ; then \ around the signing code ... which I don't think is the right thing to do | 15:24 |
apw | manjo, right, it expects it to be there because it makes it | 15:25 |
apw | manjo, so it should be making it, why is yours not | 15:25 |
manjo | apw, ppisati ran into the same issue with arm builds (many months ago) | 15:25 |
apw | maybe so, if so he may remember what he was doing wrong | 15:25 |
apw | manjo, but i say again, the build makes that key | 15:25 |
manjo | apw, ok let me build in PPA and get it to fail... that will give me an idea what is going on .. | 15:26 |
apw | manjo, see certs/Makefile | 15:27 |
apw | @echo "### Now generating an X.509 key pair to be used for signing modules." | 15:27 |
apw | it is that key which we then use to _resign_ the modules after stripping them | 15:27 |
manjo | apw, I have been blindly carrying this patch in my PPA builds and maybe it is not needed anymore ... let me build without it and see if the keys are generated | 15:28 |
apw | manjo, we have always built our kernels in PPAs and never needed such a thing | 15:29 |
rtg | manjo, there is this patch for newer kernels: 'UBUNTU: [Debian] Update to new signing key type and location' | 15:29 |
manjo | rtg, ok so that must have fixed it for me .. I will drop this mod to skip signing | 15:30 |
tjaalton | dsmythies: 4.4.0-0 finds the drive, but cryptsetup fails with "lvm is not available" | 15:30 |
caribou | apw: rtg: back to the vm.min_free_kbytes | 15:31 |
rtg | hmm | 15:31 |
apw | tjaalton, you'd think we would have noticed that lvm wasn't available in initrd ? | 15:32 |
apw | caribou, ? | 15:32 |
caribou | apw: rtg: how about I add an option to kdump-config that would re-apply values as found in /etc/default/kdump-tools ? | 15:32 |
caribou | apw: rtg: this option would be called by /etc/init.d/procps if /proc/vmcore is found | 15:33 |
apw | /etc/default/kdump-tools is in shell format right? what would you propose to encode in there ? | 15:33 |
rtg | caribou: what would be the packaged defaults in /etc/default/kdump-tools ? Anything ? | 15:34 |
caribou | apw: yes, I would add SYSCTL=vm.min_free_kbytes=blah and loop through each SYSCTL variable found | 15:35 |
caribou | rtg: we can put anything we know can cause problems & sysadmin can add to it | 15:35 |
rtg | seems reasonable | 15:35 |
apw | caribou, or perhaps SYSCTL_UPDATES=/etc/sysctl-kdump.conf or something and use that file | 15:35 |
caribou | this way I restrict modifications outside of the kdump-tools realm to a minimum | 15:35 |
apw | caribou, but i like the idea of passing the problem to kdump-config in procps | 15:36 |
caribou | apw: the advantage of looping through the existing /etc/default/kdump-tools file is that I don't need an extra file | 15:36 |
apw | i think making that file overly complex is a mistake because things like systemd read them for you and convert them to environment variables and the like | 15:37 |
apw | caribou, well mangling existing defaults which are there is difficult especially with that new non-standard format | 15:38 |
apw | by making that a different sort of entry you make it hard to migrate configs later etc | 15:38 |
apw | caribou, also why do we even need to change procps if you do it via kdump-tools | 15:38 |
caribou | apw: true; especially since kdump-config sources the default file at the beginning. I'll use your option | 15:39 |
caribou | apw: because, like in that specific bug, we don't get as far as running kdump-config, it fails before that | 15:39 |
caribou | I can reproduce it btw | 15:39 |
caribou | apw: that was my initial idea, but it has to happen right when procps runs | 15:39 |
apw | caribou, but can we just get in before it, by being earlier | 15:40 |
apw | kdump-config is running in initramfs or in real root when it runs | 15:40 |
apw | ? | 15:40 |
caribou | apw: real root, so is procps | 15:40 |
caribou | apw: so the procps modification is only to run sysctl -p /etc/kdump-config.conf if /proc/vmcore exists | 15:41 |
apw | caribou, i presume the issue then is that the act of setting it to a large value ooms the box immediately | 15:44 |
apw | caribou, so you can't then set it back, you have already OOMed | 15:44 |
caribou | apw: exactly | 15:44 |
dsmythies | apw, rtg: For my own curiosity, could you point me to "the config snippet for CONFIG_ZONE_DEVICE". So far, I haven't been able to find it. | 15:45 |
rtg | mm/Kconfig:config ZONE_DEVICE | 15:45 |
apw | config ZONE_DEVICE | 15:45 |
apw | bool "Device memory (pmem, etc...) hotplug support" if EXPERT | 15:45 |
apw | default !ZONE_DMA | 15:45 |
apw | depends on !ZONE_DMA | 15:45 |
apw | we might also need to fix the ZONE_SHIFT calculation, but, i can't see why this won't work | 15:46 |
tjaalton | apw: I'll check the diff between 4.3 & 4.4 initrd's and see what's going on there | 15:46 |
apw | tjaalton, lack of -extras perhaps ? | 15:47 |
apw | in your testing ? | 15:47 |
tjaalton | well this same thing is with mainline builds | 15:47 |
apw | caribou, so are you proposing to only apply that one file if you find it, and let kdump-config generate it or ? | 15:47 |
tjaalton | -extra is installed | 15:47 |
apw | caribou, as we cannot apply the real changes and then apply this one back over the top, as that will likely oom us in the middle | 15:47 |
caribou | apw: looks like it (I'm testing it at the same time) | 15:48 |
caribou | apw: I was thinking of applying the content of /etc/kdump-config.conf over the existing | 15:48 |
apw | i think you'll find the first will blammo on you | 15:49 |
caribou | apw: looks like the simple fact of applying vm.min_free_kbytes triggers the OOM | 15:49 |
apw | which makes sense when you look at the kernel code, it forcefully empties memory to make the boundary | 15:49 |
caribou | apw: then no matter what, we need to avoid this change to even happen | 15:50 |
apw | ugg | 15:51 |
apw | caribou, does the kernel know it is in kdump mode ? | 15:52 |
caribou | apw: yes, as it exposes /proc/vmcore | 15:52 |
caribou | apw: but isn't changing the kernel because one silly setup triggers OOM a bit of an overkill | 15:53 |
caribou | ? | 15:53 |
rtg | apw, fix this in the kernel by ignoring vm.min_free_kbytes if kexec'ed ? | 15:53 |
apw | rtg, i am thinking about it, yes | 15:54 |
apw | if /proc/vmcore is valid in fact, but yes | 15:54 |
rtg | hmm | 15:54 |
apw | as kexecing is a valid way to reboot too | 15:54 |
apw | and in the normal case we don't care, but if we are exposing a dump file to uspace then we could ignore uspace | 15:54 |
apw | caribou, rtg, though there is a separate way of looking at this | 15:55 |
caribou | userland is changing a kernel setup in a limited memory context, I would rather avoid changing this value if we very well know that we're in a limited context | 15:55 |
apw | that the request for memory can _never_ succeed, the request for reserve is larger than total ram | 15:55 |
caribou | limited memory context | 15:55 |
apw | returning -ENOWAY instead because the request is ridiculous might be sane | 15:55 |
caribou | apw: true | 15:56 |
rtg | instead of OOM'ing | 15:56 |
apw | right ooming is rarely useful | 15:56 |
rtg | apw, doing it that way means we don't have to care if /proc/vmcore is valid | 15:57 |
apw | rtg, so the code as it stands divides the limit you specify between all the non-highmem zones you have | 16:00 |
apw | so if you say 1000 pages and you have 10 in DMA and 90 in NORMAL | 16:00 |
apw | it will demand 100 in DMA and 900 in NORMAL even though that is patently nuts | 16:00 |
rtg | apw, seems like refusing the sysctl is a reasonable action | 16:02 |
apw | rtg, though even if i make it say you need enough memory in your zones to cope there are likely to be combinations which are just numerically valid and not viable | 16:03 |
apw | rtg, and telling those apart is tricky at best | 16:03 |
apw | caribou, eliding specific entries in userspace falls to procps, to sysctl itself detecting /proc/vmcore and ignoring some names textually (me thinks) which is a little vile, but no viler than doing so in the kernel | 16:08 |
tjaalton | apw: weird, 4.2.0-16 doesn't have sbin/lvm either, but it still works | 16:09 |
caribou | apw: ok, I'm looking into that | 16:09 |
caribou | apw: and at the same time, asking for the kernel to keep a minimum of 312M free and setting crashkernel=128M makes no sense, they should just increase crashkernel | 16:10 |
apw | caribou, well they claim to be using 768M for crashkernel i think, but either way its bonkers | 16:11 |
tjaalton | apw: oh, nvme.ko is not shipped with 4.4 initrd, because the module location changed | 16:13 |
apw | tjaalton, i thought initramfs-tools looked at those by module name only ? | 16:14 |
tjaalton | dunno, but it should be in kernel/drivers/nvme/host/nvme.ko now | 16:15 |
apw | oh perhaps it is shipping all "block/*" or something | 16:15 |
tjaalton | instead of kernel/drivers/block/nvme.ko | 16:15 |
apw | tjaalton, can you file me a bug against initramfs-tools for that please | 16:15 |
rtg | tjaalton, that means I should fix debian.master/control.d/generic.inclusion-list:drivers/block/nvme.ko | 16:16 |
apw | and give me the # and i'll batch that up with the other two critical fixes | 16:16 |
apw | rtg, needs fixing there too yes | 16:16 |
rtg | apw, I'll take care of it | 16:16 |
tjaalton | rtg: you'll file the bug too? | 16:16 |
rtg | tjaalton, nope, its dev kernel. I'll just fix it | 16:17 |
apw | i think he is saying he will fix the kernel bit, but i would love a bug for initramfs-tools for that bit | 16:17 |
tjaalton | right but initramfs-tools needs fixing too | 16:17 |
tjaalton | yeah I'll file that bit | 16:17 |
apw | copy_modules_dir kernel/drivers/block | 16:18 |
apw | yeah it copies on all block devices, and that just moved out of there | 16:18 |
tjaalton | fixed in debian https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=807000 | 16:18 |
ubot5 | Debian bug 807000 in initramfs-tools "initramfs-tools: module nvme not included in block modules on kernel > 4.2" [Normal,Fixed] | 16:18 |
apw | tjaalton, not sure that that fix makes a whole heap of sense in the face of the change the kernel is making, long term, but hey | 16:21 |
tjaalton | apw: what do you mean? cryptsetup works again with that, just tested | 16:25 |
tjaalton | apw: heh, two bugs already filed about this | 16:28 |
tjaalton | https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/1524879 https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/1532146 | 16:29 |
ubot5 | Launchpad bug 1524879 in initramfs-tools (Ubuntu) "initramfs-tools, Xenial is missing NVME kernel driver" [High,New] | 16:29 |
ubot5 | Launchpad bug 1532146 in initramfs-tools (Ubuntu) "update-initramfs fails for MODULES=dep when root is on nvme device" [Medium,In progress] | 16:29 |
tjaalton | the first already properly assigned :P | 16:30 |
* xnox thought apw was going to sponsor that cherrypick | 16:32 | |
xnox | as asked on irc some friday back. | 16:32 |
rtg | ppisati, I forgot to mention that Ubuntu-4.4.0-1000.6 works well with your image | 16:34 |
apw | xnox, then i am already doing it indeed :) | 16:34 |
ppisati | rtg: good to know | 16:35 |
dsmythies | apw, rtg: I am not understanding the config dependency trail, or what to change in order to be able to compile a test kernel. Do I just delete that ZONE_DEVICE snippet? | 16:35 |
rtg | ppisati, hopefully it will work with Snappy as well | 16:35 |
apw | dsmythies, no, we want to test building with the config requirement for !ZONE_DMA removed and that turned back on | 16:36 |
ppisati | rtg: it will | 16:36 |
apw | and see if tht builds on amd64 | 16:36 |
ppisati | rtg: though i might have found a problem with xenial/arm64 | 16:37 |
ppisati | rtg: so, no matter if i cross or native compile | 16:37 |
rtg | ppisati, the module load problem ? | 16:37 |
ppisati | rtg: right | 16:37 |
ppisati | rtg: if i compile under wily it is fine | 16:37 |
rtg | ppisati, I noticed that it seems to be a toolchain issue | 16:37 |
ppisati | rtg: while it blows in a xenial chroot | 16:37 |
rtg | dannf found that one, right ? | 16:38 |
ppisati | rtg: it happens if i crosscompile (xenial/amd64 -> arm64) or native (my arm64 ppa) | 16:38 |
ppisati | rtg: i found it, and dannf confirmed he sees that too in xenial 4.3 | 16:38 |
ppisati | dannf: ^ | 16:38 |
rtg | ppisati, what is Linaro using ? | 16:38 |
dsmythies | apw: O.K. so delete this line: "depends on !ZONE_DMA" then. (sorry for being a bit think on thick on this one). | 16:39 |
ppisati | rtg: no idea | 16:39 |
ppisati | rtg: you mean the toolchain | 16:39 |
rtg | yeah, maybe Salvetti has some thoughts | 16:39 |
dannf | ppisati, rtg: my next step was to rebuild gcc w/o linaro patches to see if that fixes it | 16:40 |
dannf | (otp, sorry for lag) | 16:40 |
rtg | dannf, are there patches in gcc specific to that arm64 errata ? | 16:41 |
dsmythies | I meant to say "sorry for being a bit thick on this one". | 16:44 |
apw | dsmythies, yes, then readd CONFIG_ZONE_DMA=y to the config and run fakeroot debian/rules updateconfigs | 16:44 |
apw | and then see if the build builds any more, i suspect it may not | 16:44 |
apw | but it may be resolvable | 16:44 |
apw | (on amd64) | 16:44 |
dannf | rtg: yeah | 16:46 |
dsmythies | apw: I will report back, but it might be awhile (like hours). | 16:47 |
jmux | How can I build a kernel containing a ddeb package? I just have a linux-lts-trusty-tools ddeb as an output of my build. | 17:31 |
apw | jmux, how are you building them ? | 17:32 |
jmux | My problem is actually http://ddebs.ubuntu.com/pool/main/l/linux-lts-trusty/, which doesn't contain debug symbols for the range from 15-Jul-2014 to 10-Nov-2015 | 17:32 |
arges | jmux: depending on the version it may be archived in launchpad | 17:33 |
jmux | apw: sbuild -A -s -d tramp --arch=i386 --add-depends=pkg-create-dbgsym linux-lts-trusty_3.13.0-34.60~precise1.dsc | 17:34 |
apw | right, we moved ddebs into the librarian around that time, though i would expect them to be in there | 17:34 |
apw | jmux, if you aren't on a real builder it will optimise away the .ddebs because they take an hour to package | 17:35 |
jmux | apw: I get the ddeb for the linux-lts-trusty-tools. I even set AUTOBUILD=1 | 17:36 |
arges | apw: they are gone: https://launchpad.net/ubuntu/+source/linux-lts-trusty/3.13.0-34.60~precise1 | 17:36 |
apw | you need something like full_build=true passed to debian/rules | 17:36 |
arges | this is how i've done it: http://chrisarges.net/2015/10/02/building-ubuntu-kernels-with-debug-symbols.html | 17:37 |
apw | arges, full_build=true is the approved incantation there, though the effect is similar | 17:38 |
arges | apw: noted | 17:38 |
jmux | Hmm missed full_build in debian/rules.d/0-common-vars.mk | 17:41 |
apw | jmux, it isn't something that one expects (like the spanish inquisition) | 17:45 |
jmux | Probably another problem "man xz: Multithreaded compression and decompression are not implemented yet, so this option has no effect for now." | 17:45 |
jmux | So even if the build finishes in 10 minutes, the compression will take ages, I guess | 17:46 |
apw | it takes ages for sure, like 10 on our massively fast boxes, hours on others | 17:46 |
apw | they make one cry in the main, which is why they get turned off by default for local builds | 17:47 |
apw | did you look and see if the one you needed was in the librarian though ? | 17:47 |
jmux | Is there a way to query librarian directly? The ddebs are in the build log… | 17:48 |
apw | i'd say from the .changes from the buildd that the .ddebs were collected under the old system | 17:49 |
apw | arges, does sts still keep history on those? | 17:49 |
arges | apw: i don't think so | 17:49 |
arges | i didn't find it in the librarian so best bet is to regenerate them. | 17:49 |
* jmux is not sure the debug symbols will actually help with MCE like http://paste.debian.net/365584/ | 17:53 | |
apw | jmux, nope, did you try what it suggests ? | 17:54 |
jmux | Yup - but didn't help me | 17:54 |
jmux | http://paste.debian.net/365586/ | 17:55 |
apw | jmux, what are you hoping it will tell you? that says that when we were handling a page fault the internal state of the CPU was found to be wrong and unfixable | 17:56 |
jmux | But these machines were working flawlessly on lucid for years. No idea, why they stop working on precise with trusty stack | 17:56 |
apw | well they may be throwing errors all the time which the old kernel doesn't understand at all (which being lucid it would not) | 17:57 |
apw | those newer kernels almost certainly have the ability to read MCEs at all whereas the older ones did not i suspect | 17:59 |
jmux | apw: might be. I've just started collecting information. Not sure if a backtrace contains info about the origin of the page fault. | 18:00 |
apw | jmux, if that EIP is accurate (and it implies it is not but hey) that is the very first byte of the exception handler | 18:00 |
apw | jmux, and is this a 32 bit kernel ? | 18:01 |
jmux | it might be a really unnoticed HW bug - no idea yet | 18:01 |
jmux | Yup 32bit | 18:02 |
apw | if the thing is occurring often and reproducibly you could try removing the MCE handler (if it is a module) and see | 18:04 |
apw | if the machine is symptomless afterwards | 18:04 |
apw | as it was on lucid, it may just be that they are bogus, given it never exploded before | 18:04 |
jmux | Problem is I really don't know what triggers it. I got the exact HW of a known problematic box and it took 12 hours to trigger. | 18:06 |
apw | jmux, it is possible nomce on the kernel command line will shut it off for that purpose | 18:06 |
apw | of course if it is real, all bets are off in that case | 18:06 |
jmux | This happens for random people | 18:06 |
jmux | About once in a week | 18:07 |
jmux | Support has actually played with Intel C states in BIOS which fixed it for some people | 18:18 |
apw | oh lovely | 18:18 |
jmux | That's why I still hope it can be a kernel bug, I'll also update the kernel. But for those random bugs without a real trigger… | 18:20 |
jmux | I had the crazy idea of the new kernel using more power, probably because of new features… unstable system… all I know is Lucid worked for years and the problem started with the precise updates | 18:22 |
* jmux hopes that shipping trusty with xenial stack will help, but that's half a year away | 18:23 | |
apw | jmux, they likely would have simply not been detected before in lucid | 18:36 |
apw | even if they were happening | 18:37 |
jmux | And something like "MCA: Internal unclassified error: 402" doesn't help | 18:38 |
jmux | I'll also try latest microcode updates | 18:39 |
jmux | If I get the same error tomorrow, I probably have at least one valid data point. And there are BIOS updates. | 18:40 |
* jmux shivers thinking about BIOS updating a few thousand machines… | 18:41 | |
apw | jmux, a job for life ... literally | 18:43 |
jmux | Oh - and I just got the MCE error via serial null modem connection | 18:43 |
apw | jmux, a google of the specific error shows people sending their CPUs back for replacement to good effect o.O | 18:46 |
apw | (which if it was true would make upgrading the BIOSen look like a fun game) | 18:47 |
jmux | apw: hmm If all machines would show this MCE, but currently they just have a blank screen. Without the null modem, I wouldn't have seen this. | 18:50 |
apw | jmux, yeah the "kms will guarantee you can see panics" promise, not so much | 18:51 |
jmux | Probably disabling DPMS for known machines would help here | 18:51 |
jmux | Probably installing mcelog will also help | 18:56 |
apw | good luck | 18:56 |
jmux | Everything linux wise I can deploy automatically. mcelog definitely makes sense, as this would probably allow us to catch the MCE errors | 18:58 |
jmux | Hmm https://access.redhat.com/solutions/1161573 has the same HW. Guess I'll have to contact some server guys tomorrow | 19:01 |
jmux | "Following MCE logs were collected from serial consle." | 19:01 |
jmux | It's not page_fault but intel_idle, but otherwise… | 19:03 |
apw | jmux, i note it talks about using "nomodeset" which if you find a box which reproduces regularly is worth a shot | 19:03 |
jmux | The "reproduces regularly" is the problem. So actually a kernel update might help, or the solution will just state "get a new CPU" | 19:06 |
jmux | I hope my local crash will happen again in the next 12 hours | 19:06 |
* jmux hopes for a software solution | 19:07 | |
jmux | apw: thanks for your support. /me will leave for the cinema in a few minutes | 19:08 |
dsmythies | apw: The build failed, and for the reason you predicted, ZONES_SHIFT, which cascades into a bunch of other errors. | 22:08 |
dsmythies | apw: #error ZONES_SHIFT -- too many zones configured adjust calculation | 22:09 |
apw | dsmythies, yep thanks, will have a look at it perhaps in the morning -- if the other plates stay up | 22:18 |
=== JanC_ is now known as JanC |