smoser | Odd_Bloke: the case is just "I have to manually clean out state". Which I think is limited to /var/lib/cloud/instance link. | 12:50 |
---|---|---|
smoser | Odd_Bloke: i do see your point about reading /var/lib/cloud/cloud-config.txt as config and then basically persisting that across, but i don't think that's really a big deal. do you see a case where it is? | 12:53 |
Odd_Bloke | I could imagine a use case (perhaps in the broken cloud scenario, but perhaps also more generally) where users might want to be able to temporarily "unlock" an instance (so it's in "check" mode) when they are performing a capture, and then lock it back again (to "trust" mode) after they've captured the image. | 13:16 |
Odd_Bloke | And I think you could do this today manually, but you'd have to know when to move the state aside and when to move it back. | 13:18 |
Odd_Bloke | I guess there's also the case where you mistakenly passed manual_cache_clean when launching an instance; there's no good way to undo that once the instance has launched. | 13:20 |
Odd_Bloke | So, to be clear, I do think what `manual_cache_clean` does today is reasonable and, in some cases, desirable; if you pass it as user-data, you really do have no choice but to clean the cache (or fake a clean with mv) to switch from "trust" to "check", and that ensures that your cache state will never leak out into a captured image. | 13:24 |
Odd_Bloke | I do think there's a gap of sorts for a more easily reversible version of this. | 13:27 |
Odd_Bloke | By "in some cases, desirable", I mean that for many cases it is sufficient (but people might prefer a reversible option), and there are some cases where it is exactly what people want (because they don't want it to be easy to undo). | 13:28 |
Odd_Bloke | smoser: ^ | 13:29 |
smoser | Odd_Bloke:but why would you ever want to "undo that once the instance is launched" | 13:40 |
smoser | it really only affects *future* instances. | 13:40 |
smoser | i do not have a use case for flipping it on and off for *this* instance. | 13:41 |
smoser | Odd_Bloke: you can (i think) get what you were after by writing a config file somewhere. | 13:44 |
smoser | a.) launch instance with userdata setting of 'manual_cache_clean' | 13:44 |
smoser | b.) write file in /etc/cloud/cloud.cfg.d/manual_clean.cfg (manual_cache_clean: true) | 13:44 |
smoser | c.) now you're in manual clean mode next time | 13:45 |
smoser | d.) to disable this setting, you now just have to 'rm /etc/cloud/cloud.cfg.d/manual_clean.cfg /var/lib/cloud/instance/manual-cache-clean' | 13:46 |
smoser | i think your use case essentially boils down to "i can't change the content of cached user-data during the lifetime of an instance". | 13:46 |
smoser | which is true for *all* settings there. | 13:46 |
smoser | manual-cache-clean is just made more annoying because of the marker file. | 13:47 |
smoser | that marker file exists only so that ds-identify can avoid parsing cloud config files itself. | 13:47 |
Odd_Bloke | smoser: Should that be "without" in (a)? | 14:09 |
smoser | yes | 14:13 |
* smoser curses his feeble mind | 14:13 | |
Odd_Bloke | OK, good, then I agree. | 14:13 |
smoser | and actually.. | 14:14 |
smoser | i think the "enable" for this instance can be simplified to just | 14:14 |
Odd_Bloke | I think the confusing part is that manual_cache_clean modifies "the lifetime of an instance" and I wasn't thinking about it in that way. | 14:14 |
smoser | touch /var/lib/cloud/instance/manual-cache-clean | 14:14 |
smoser | i wouldn't want to advertise that interface (because it's only there to help ds-identify) | 14:14 |
Odd_Bloke | (I'm not saying that it confusing me means anything needs changing, to be clear. :) | 14:15 |
smoser | Odd_Bloke:well... although it should not necessarily be true, *userdata* modifies "the lifetime of an instance" | 14:15 |
paride | hey falcojr3, the ubuntu-sru manual verification PRs LGTM. Can I go ahead and merge? | 14:16 |
paride | falcojr3, we should be able to tick many boxes in the manual verification cards then :) | 14:16 |
Odd_Bloke | smoser: I'm not sure I followed that point, could you expand on it? | 14:17 |
smoser | your complaint is that user-data modified the instance "permanently" (you can't change the user-data ... /var/lib/cloud/instance/cloud-config.txt is an artifact of user-data, right?) | 14:18 |
smoser | on AWS at least, user-data *can* be changed within the lifetime of an instance. | 14:18 |
smoser | but in cloud-init such changes will not be recognized. | 14:19 |
Odd_Bloke | I wouldn't characterise it as a complaint: I just meant that I wasn't understanding what manual_cache_clean was doing (and why) because I wasn't thinking about it in those terms. | 14:20 |
smoser | ok. but my point is that manual_cache_clean (when set in user-data) has the same lifetime as *all* things set in user-data. | 14:20 |
smoser | i'm not certain that is true, but i think so. | 14:20 |
Odd_Bloke | I think it's true that manual_cache_clean (regardless of value) has the same lifetime as the other user-data that is specified alongside it. But it affects what that lifetime is: if it is false (i.e. "check"), then the lifetime ends once a new instance ID is detected, but if it's true then the lifetime is the same as the lifetime of the state directory (i.e. until a manual cache clean). | 14:27 |
Odd_Bloke | And I think that makes total sense, it's literally named "manual cache clean" (as you pointed out yesterday afternoon). | 14:29 |
smoser | +1. i think that is exactly the point. | 14:29 |
smoser | :) | 14:29 |
smoser | yeah. | 14:29 |
smoser | but you wrote that very well. | 14:29 |
Odd_Bloke | Thanks. :) | 14:29 |
Odd_Bloke | So I think where I got confused is how you would switch the lifetime _back_ to being scoped to instance ID. | 14:30 |
Odd_Bloke | And right now, I think you can't do that (non-hackily) if you've specified manual_cache_clean in user-data. | 14:30 |
Odd_Bloke | (But that's separate, and I was conflating the two.) | 14:30 |
smoser | yeah. | 14:31 |
smoser | and i think the reason for that is... that user-data cannot be changed period in a non-hacky way | 14:31 |
smoser | manual_cache_clean has the additional hack marker file. but other than that, it would be the same as other settings. | 14:32 |
smoser | possibly a better name would just make this all obvious | 14:32 |
Odd_Bloke | Yes and no: _if_ (and I'm not proposing we change to this, to be very clear) manual_cache_clean was processed only by writing out the manual-cache-clean file based on the configuration determined from a datasource (i.e. _not_ using the cached user-data at all), and _only_ the flag file were used to determine what mode we were in, then it would be true that you couldn't modify user-data, but you also | 14:34 |
Odd_Bloke | wouldn't need to. | 14:34 |
Odd_Bloke | Because if the flag were removed, then the old user-data would be disregarded entirely. | 14:35 |
Odd_Bloke | So I think we could implement something like this that wouldn't require hacking at user-data to modify, if we wanted to. | 14:36 |
Odd_Bloke | Do we want to implement such a thing? I don't think so ATM. | 14:37 |
smoser | +1. | 14:37 |
falcojr3 | paride: Yes, thank you | 14:41 |
paride | falcojr3, merging! | 14:41 |
smoser | Odd_Bloke:thank you for pushing/investigating this. | 14:50 |
Odd_Bloke | smoser: Sure thing! I'll ping you for doc review once I've figured out the best way of capturing the above. | 14:57 |
AnhVoMSFT | what does it typically mean when udevadm settle failed within cloud-init https://paste.ubuntu.com/p/rYfhBxyvbZ/ | 15:42 |
smoser | i'd say "typically" udevadm-settle doesn't fail | 15:45 |
smoser | my only 2 guesses for why: | 15:46 |
smoser | a.) timeout of some disk io | 15:46 |
smoser | b.) i forget | 15:47 |
AnhVoMSFT | So this error happened during ephemeral dhcp | 15:47 |
AnhVoMSFT | Found unstable nic names ['eth0']; calling udevadm settle | 15:47 |
AnhVoMSFT | util.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True) | 15:48 |
AnhVoMSFT | util.py[DEBUG]: Waiting for udev events to settle took 120.147 seconds | 15:48 |
AnhVoMSFT | Then there's a Traceback on ProcessExecutionError. I think after 2 minutes udevadm timed out and returned error 1 | 15:48 |
smoser | yeah, it clearly timed out. | 15:48 |
smoser | maybe dmesg has some info | 15:48 |
AnhVoMSFT | let me check | 15:54 |
smoser | i really suspect that there are very slow disks attached. or network attached disks and a bad network. | 16:04 |
AnhVoMSFT | looks like the vm got rebooted sometime after that and the dmesg log that was collected was from after the reboot | 16:10 |
Odd_Bloke | AnhVoMSFT: I would expect `journalctl -k` to include (most? IDK exactly) of the dmesg logs from previous boots. | 16:11 |
AnhVoMSFT | it's from one of our automated nightly run - the VM is already gone. What can we collect in cloud-init log to make it easier to root cause the issue next time? | 16:12 |
Odd_Bloke | falcojr3: What can I pick up SRU-wise? It looks like I could do SoftLayer or OpenStack, based on the board, but I know you mentioned you were looking at OpenStack access so I don't want to start on that if you're already partially through it. :) | 16:13 |
Odd_Bloke | (Also "Falco Junior the Third"?? ;) | 16:13 |
smoser | AnhVoMSFT: running 'cloud-init collect-logs' | 16:16 |
smoser | as that would contain the journal for the current boot | 16:17 |
falcojr3 | Odd_Bloke yes, I'll pick up openstack. Pretty much anything else including any of the bugs in the bug card would be good | 16:32 |
falcojr3 | and yeah, rebooted the box my instance of the lounge is hosted on and now I'm magically falcojr3 | 16:33 |
blackboxsw | ok falcojr3 I'm back. and ready to start cloud-init SRU verification work. what would you like me to work? | 16:34 |
blackboxsw | smoser I finished a driveby xenial cloud-utils PR that you probably have a *lot* more context on (daily maas image builds broke yesterday), hence my absence from cloud-init stuff. | 16:34 |
blackboxsw | falcojr3: I'll grab SRU reviews first | 16:35 |
blackboxsw | and then get manual SRU verification tasks | 16:35 |
smoser | blackboxsw:link ? | 16:44 |
blackboxsw | smoser: https://code.launchpad.net/~chad.smith/ubuntu/+source/cloud-utils/+git/cloud-utils/+merge/390318 | 16:44 |
blackboxsw | backport of two of your separate overlayfs fixes into xenial | 16:44 |
blackboxsw | could have combined them, but thought maybe separate cherry-pick backports may be easier to read/review | 16:45 |
blackboxsw | cloud-init SRU-wise just grabbed the cloud-init query decode user-data test | 16:46 |
* blackboxsw wonders if we should make our SRU trello process board public, so external folks could get visibility to the verification process (and contribute manual SRU tests for some of the one off bugs) | 16:48 | |
Odd_Bloke | falcojr3: Aha, right; now we have checklist assignment we aren't creating separate cards for all of those, right? | 16:48 |
blackboxsw | right I believe Odd_Bloke we assign our avatar to each checklist item | 16:49 |
blackboxsw | and check it off once done | 16:49 |
Odd_Bloke | blackboxsw: falcojr3: Ack, I've updated our template to reflect this for next SRU: https://trello.com/c/6ym50IN3/9-create-trello-cards-for-each-commit-that-could-represent-a-functional-change-to-ubuntu | 16:50 |
blackboxsw | +1 Odd_Bloke I just updated the top checklist item there so we get a trello formatted log2dch output, which makes it easier for us to link to the individual issues from the card | 16:59 |
blackboxsw | log2dch --trello | 16:59 |
Odd_Bloke | Nice, thanks! | 17:01 |
smoser | blackboxsw: i ack'd. but i would appreciate a fix for 'new' to 'knew' | 17:27 |
blackboxsw | smoser: thanks! checking and will address it | 17:27 |
lucasmoura | blackboxsw, falcojr3 it seems that PRs #357 and #335 are already in cloud-init 20.2 | 19:16 |
lucasmoura | I checked that #357 was already verified in the last SRU, but could not find verification for #335 | 19:16 |
blackboxsw | double checking too | 19:16 |
blackboxsw | lucasmoura: git describe 7dceb9882590fb738ac0ff3429908cc6c945485a | 19:18 |
blackboxsw | 20.2-3-g7dceb9882 | 19:18 |
blackboxsw | yep looks like it was in that SRU and already released. | 19:18 |
blackboxsw | lucasmoura: probably don't need to recheck that content unless you want to | 19:19 |
blackboxsw | it's *just* schema validation which generates a warning log at best if people are providing invalid schema | 19:19 |
lucasmoura | I think we can remove them from the sru list | 19:20 |
blackboxsw | so might just add ~ before around the text in that checklist item or remove it | 19:20 |
blackboxsw | yep | 19:20 |
blackboxsw | save yourself time there | 19:20 |
lucasmoura | Got it. Also, I think we can skip this PR too: https://github.com/canonical/cloud-init/pull/443 | 19:21 |
lucasmoura | What do you think ? | 19:21 |
lucasmoura | Oh wait, I have found a place where it is used, maybe I can still directly test that | 19:23 |
blackboxsw | lucasmoura: I think, right, we need to test the actual logic change that is using that call with the False param | 19:23 |
lucasmoura | ack | 19:24 |
blackboxsw | minor tooling improvement for log2dch | 19:57 |
blackboxsw | to create links we can click in the bug verification checklist https://github.com/canonical/uss-tableflip/pull/61 | 19:57 |
blackboxsw | I'm updating that trello card now | 19:57 |
blackboxsw | with the markdown output by log2dch --trello | 19:57 |
blackboxsw | using this branch | 19:58 |
falcojr3 | a lot of our manual sru scripts look specifically for "Trace" | 20:34 |
falcojr3 | shouldn't that be "TRACE"? | 20:34 |
falcojr3 | or is that looking for a specific "Trace" message? | 20:34 |
=== falcojr3 is now known as falcojr | ||
Odd_Bloke | falcojr: That's looking for a Python traceback, I believe. | 20:44 |
Odd_Bloke | (i.e. "Traceback (most recent call last)") | 20:44 |
falcojr | Ah | 20:45 |
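
The lifetime rule Odd_Bloke spells out at 14:27 can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the log, not cloud-init's actual code: the function name `cached_state_is_current` and its parameters are hypothetical, and the handling of an empty cache is an assumption added so the example is complete.

```python
from typing import Optional


def cached_state_is_current(
    manual_cache_clean: bool,
    cached_instance_id: Optional[str],
    current_instance_id: str,
) -> bool:
    """Return True if previously cached state should still be used.

    Hypothetical helper; it only models the rule described in the log.
    """
    if cached_instance_id is None:
        # Assumption: nothing is cached (for example after a manual clean),
        # so there is no state to trust.
        return False
    if manual_cache_clean:
        # "trust" mode: the cache lives as long as the state directory,
        # so the current instance ID is not consulted.
        return True
    # "check" mode: the cache lifetime ends once a new instance ID is detected.
    return cached_instance_id == current_instance_id
```

In this model, a captured image booted as a new instance keeps using the old state only when `manual_cache_clean` is true, which is why the cache has to be cleaned (or moved aside) by hand before capture.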
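
The "Trace" check discussed between 20:34 and 20:44 can be illustrated with a small Python sketch. It assumes, as Odd_Bloke suggests, that the scripts are looking for the header line Python prints at the start of a traceback. The helper `find_tracebacks` is a hypothetical name, not something taken from the SRU scripts, which are not shown in the log.

```python
from typing import List

# Header line that Python prints at the start of every traceback.
TRACEBACK_MARKER = "Traceback (most recent call last)"


def find_tracebacks(log_text: str) -> List[int]:
    """Return the 1-based line numbers where a Python traceback starts."""
    return [
        number
        for number, line in enumerate(log_text.splitlines(), start=1)
        if TRACEBACK_MARKER in line
    ]
```

Because "Trace" is a prefix of "Traceback", a case-sensitive search for "Trace" matches this header, while a search for "TRACE" would not.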