[01:22] <rtg> sbeattie, re bug #1551894 - I'm building a test kernel with 3dfb7d8cdbc7ea0c2970450e60818bb3eefbad69 applied. I'll post in the bug when it's done.
[11:50] <Odd_Bloke> apw: Don't know if you're around (and have a minute), but if you could check my working in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1551419/comments/8 to confirm that my reading of things is correct, it'd be much appreciated.
[11:51] <Odd_Bloke> apw: (This is super-hot, because it breaks all Azure trusty instances on post-kernel-upgrade reboot)
[12:14] <apw> Odd_Bloke, is that the one where cloud-init loses its mind and thinks scorched-earth on a clearly installed instance is a good idea 
[12:16] <apw> Odd_Bloke, you sound convincing .... as this is stables, i reckon we want bjf to chime in
[12:16] <apw> Odd_Bloke, i assume we are going to also fix cloud-init to not juju the machine on reboot anyhow
[12:20] <Odd_Bloke> apw: Yeah, I've asked bjf to chime in already; just wanted to make sure that I hadn't got the wrong end of the stick as soon as possible. :p
[12:20] <Odd_Bloke> apw: The problem with fixing it in cloud-init is that we support using a snapshot of an instance as an image.
[12:21] <Odd_Bloke> apw: Which means that distinguishing between 'first boot of a snapshotted image' and 'second boot of any image' is difficult.
[12:21] <apw> but applying scorched earth should be something that snapshot has to tell you surely, else there is a high risk
[12:21] <Odd_Bloke> apw: So each data source provides a way of uniquely identifying instances.
[12:21] <apw> of exactly the kind of thing that occurred here, an unintended consequence of another bug is "world ending"
[12:21] <apw> which is _never_ acceptable, think of ps4.5
[12:22] <apw> that'd be like newfs'ing / because i think it is blank
[12:22] <apw> without being told it is really expected to be blank
[12:23] <apw> Odd_Bloke, what i am saying is those snapshots should be marked, and it should have said "this is not a snapshot, but appears to have a different machine ID, this is utterly wrong, refusing to boot" not "yeah let's eat your machine"
[12:25] <Odd_Bloke> Yeah, I've never been wild about supporting the snapshot-as-image-without-modification workflow.
[12:26] <Odd_Bloke> So we should probably re-visit this.
[12:26] <apw> i can see the convenience, i can also see the potential for disaster for all instances it creates
[12:26] <apw> even if the original bug is a kernel one
[12:27] <Odd_Bloke> Yeah, agreed.
[12:29] <Odd_Bloke> We'll need to think about it, though, because I'm pretty sure changing this behaviour in trusty will break a lot of people; I expect that 'launch base image; perform customisation; snapshot' is the most common way that people produce derivative images, and they'd have to add a step in there.
[12:30] <apw> Odd_Bloke, which series are affected by this, is it primarily trusty ?
[12:30] <Odd_Bloke> apw: wily is fine, I haven't checked precise.
[12:30] <Odd_Bloke> I will do so now.
[12:31] <apw> great.  it being only one release makes life a heck of a lot less upsetting
[12:31] <apw> Odd_Bloke, the problem with fixing this is anyone running a newly created image with -8 or indeed anyone affected and rebooted and zapped
[12:32] <apw> Odd_Bloke, will  suffer the same problem again on update to the next version without the bug, right?  how are you going to mitigate that
[12:32] <Odd_Bloke> Will my tears fix it?
[12:32] <Odd_Bloke> :p
[12:32] <apw> s/-8/the broken version/
[12:33] <Odd_Bloke> So I was thinking about this yesterday.
[12:33] <apw> i think we need a paired fix for cloud-init which knows how to reconstruct the machine id for the broken kernel version which is only applied on the broken kernel version
[12:33] <Odd_Bloke> Yeah, that.
[12:33] <Odd_Bloke> I _think_ it could happen in cloud-init's packaging, rather than in cloud-init itself.
[12:34] <rtg> apw, is there a way to tell that the kernel is broken other than version number ? I was thinking about folks that try mainline crack, etc.
[12:34] <apw> as we will need that anyway, and such a fix would immediately mitigate the broken kernel version, and would let us confirm the fix and roll it out kernel side in a more leisurely manner
[12:35] <apw> rtg, we might be able to say its "upstream x.y.z-cktN" which is fail perhaps
[12:35] <apw> though i doubt there are a lot of mainline kernel crack runners in VMs
[12:35] <Odd_Bloke> And we could make it work both ways: current_endian_version = $(cat /sys/...); reverse_endian_version = ...; if [ -d /var/lib/cloud/instances/$reverse_endian_version ]; then mv <that> <fixed path>; fi
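[editor's note] The fragment above can be sketched in Python as follows. This is only an illustration of the idea, not cloud-init's actual code: the function name `migrate_instance_dir` and its parameters are hypothetical, and how the two instance IDs are obtained is deliberately left to the caller because the sysfs path is elided in the log.

```python
import os
import tempfile


def migrate_instance_dir(instances_root, current_id, swapped_id):
    """Rename the instance directory recorded under the other byte order.

    Hypothetical helper: if a directory named after the opposite-endian ID
    exists and no directory exists yet for the ID reported now, move the old
    one into place. Returns True if a rename happened, False otherwise.
    """
    old_path = os.path.join(instances_root, swapped_id)
    new_path = os.path.join(instances_root, current_id)
    # Never clobber data that already exists under the current ID.
    if os.path.isdir(old_path) and not os.path.exists(new_path):
        os.rename(old_path, new_path)
        return True
    return False


if __name__ == "__main__":
    # Demonstrate against a throwaway directory rather than the real
    # /var/lib/cloud/instances tree.
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "11223344-5566-7788-DEAD-BEEFDEADBEEF"))
    moved = migrate_instance_dir(
        root,
        current_id="44332211-6655-8877-DEAD-BEEFDEADBEEF",
        swapped_id="11223344-5566-7788-DEAD-BEEFDEADBEEF",
    )
    print(moved, os.listdir(root))
```

Because the check only looks at which directory exists, it works in both directions, which matches the "both ways" intent of the shell fragment.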
[12:36] <apw> Odd_Bloke, as if we rush out the kernel change and it's wrong again, we just have two wrongs to mitigate.
[12:36] <Odd_Bloke> Yeah.
[12:36] <apw> Odd_Bloke, i also think we should be getting that out "now" regardless
[12:36] <apw> if we are blowing people up
[12:36] <apw> as we have to get that out before the fixed kernel can go out safely too
[12:37] <rtg> apw, "that" being the cloud_init fix ?
[12:37] <rtg> and how do we fix the kernel besides reverting the endian patch ?
[12:38] <apw> rtg, yes if we don't mitigate the broken kernel, then updating the kernel to fixed will re-blow up the instances and zap them a second time
[12:38] <apw> rtg, so we have to mitigate that by fixing cloud-init, and that has to occur sooner than the kernel
[12:39] <rtg> agreed
[12:41] <Odd_Bloke> We also have to handle this in case Azure decide to upgrade their reported SMBIOS version to >=2.6 (probably without changing the endian-ness of what they report).
[12:45] <apw> Odd_Bloke: so we should add a cloud-init task to the bug and coordinate that getting out before the kernel
[12:46] <Odd_Bloke> apw: Yep, task already added and I'm looking in to it.
[12:46] <apw> are you on the hook for that?
[12:46] <Odd_Bloke> apw: I think so.
[12:46] <apw> ok good, that makes it make sense to try and get this fix into the next upload but not expedite it out
[12:46] <apw> bjf
[12:47] <apw> bjf: fyi discussion above
[12:49] <apw> Odd_Bloke: did you say we might be able to detect the malformed version? rather than thinking in terms of kernel versions? in your code fragment above. that would be safer for aure
[12:50] <apw> ahh yes you are saying that, good
[12:51] <Odd_Bloke> apw: Yeah, so my plan is to assume that if cloud-init currently has an instance id of "11223344-5566-7788-DEAD-BEEFDEADBEEF" and it now thinks it should have "44332211-6655-8877-DEAD-BEEFDEADBEEF" then we assume it's the same system.
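[editor's note] A minimal Python sketch of the comparison described above, using the example IDs from the message. The function names `swap_uuid_endianness` and `is_same_system` are hypothetical and this is not cloud-init's implementation; it only shows that reversing the byte order of the first three UUID fields maps one example ID onto the other.

```python
def swap_uuid_endianness(uuid_str):
    """Reverse the byte order of the first three fields of a UUID string.

    The last two fields are left untouched, as in the example IDs above.
    """
    fields = uuid_str.split("-")
    if len(fields) != 5:
        raise ValueError("expected a UUID with five dash-separated fields")

    def reverse_bytes(hex_field):
        # Split into two-character (one-byte) chunks and reverse their order.
        pairs = [hex_field[i:i + 2] for i in range(0, len(hex_field), 2)]
        return "".join(reversed(pairs))

    return "-".join([reverse_bytes(f) for f in fields[:3]] + fields[3:])


def is_same_system(previous_id, current_id):
    """Treat IDs as the same system if equal or opposite-endian variants."""
    prev, cur = previous_id.upper(), current_id.upper()
    return prev == cur or swap_uuid_endianness(prev) == cur


if __name__ == "__main__":
    old = "11223344-5566-7788-DEAD-BEEFDEADBEEF"
    new = "44332211-6655-8877-DEAD-BEEFDEADBEEF"
    print(swap_uuid_endianness(old))
    print(is_same_system(old, new))
```

The swap is its own inverse, so the same check covers an instance moving from the old byte order to the new one and back again.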
[12:52] <apw> does that account for the shift as well?
[12:52] <apw> but i think the plan is sound
[12:53] <Odd_Bloke> apw: "the shift"?
[12:54] <Odd_Bloke> (There is a remote possibility that this could produce false positives, but (I think) only on a snapshotted image which was then launched with a UUID that appeared exactly oppositely-endian in its first three fields)
[12:55] <Odd_Bloke> But that's unlikely to happen before the heat death of the universe. :p
[12:57] <ruchlos> never underestimate the mind of a computer that wants to punish humans to trigger such unlikely scenarios ;)
[13:03] <Odd_Bloke> Ah, hmph, doing this in cloud-init packaging would mean it would only work once per cloud-init version.
[13:04] <Odd_Bloke> So I do need to do it in cloud-init proper.
[13:04] <Odd_Bloke> On the bright side, I won't have to work out how to reverse endian-ness of parts of a UUID in shell. :p
[15:24] <apw> heh
[15:26] <apw> Odd_Bloke, you are able to test a kernel for us I assume to confirm a fix ...
[15:27] <Odd_Bloke> apw: Yep.
[16:27] <xnox> hello =)
[16:27] <xnox> apw, got a fresh bug for udebs
[16:27] <xnox> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1552314
[16:27] <xnox> apw, just what you wanted to hear right? =))))) enough info to fix it?
[16:28] <rtg> xnox, it'll do
[16:29] <xnox> rtg, tah.