[15:31] <bilsch> hey looking for some help on cloud-init and growfs / resizefs.
[15:31] <bilsch> ```
growpart:
    ignore_growroot_disabled: false
    mode: auto
    devices:
      - "/"
      - "/home"
      - "/var"
      - "/var/log"
      - "/var/log/audit"
      - "/var/tmp"
    resizefs: true
    resize_rootfs: true
```
[15:31] <bilsch> err, no markdown raw or I suck at webchat ;)
[15:33] <bilsch> so, maybe another way to ask - does resizefs need anything special to grow filesystems other than /? I can see growpart has resized the partitions, but the filesystems were not grown. Not sure if this is os / cloud-init version specific, but this is on rhel7 and cloud-init version 18.5.6 ( yea, rhel7 :shame: but ... )
[16:08] <Odd_Bloke> bilsch: I'm not 100% sure I understand the question.  Could you perhaps use https://paste.ubuntu.com/ to paste your configuration, and describe what it is you want/expect it to do?
[16:21] <blackboxsw> https://github.com/canonical/cloud-init/pull/375 landed. thanks lucasmoura
[16:21]  * blackboxsw is reviewing the vmware PR now
[16:22] <blackboxsw> Odd_Bloke: sorry looks like you just rebased https://github.com/canonical/cloud-init/pull/464
[16:22] <blackboxsw> I think it's stale again because of my merge
[16:52] <bilsch> Odd_Bloke https://paste.ubuntu.com/p/rhhrBnVKY7/
[17:27] <blackboxsw> hi smoser/rharper: I added the following response to the vmware review https://github.com/canonical/cloud-init/pull/441/files#r447132910 suggesting maybe a datasource config option to override vmware's default image customization behavior. If either of you disagree with approach, just let me or the PR know.
[17:27] <blackboxsw> smoser: had reviewed the earlier invocation of that PR, but I think very little has changed from the first attempt
[17:28] <blackboxsw> at least as far as overrides from cloud-init user-data image customization side
[17:42] <Odd_Bloke> bilsch: Thanks!  Can you paste the output of `mount`?
[17:45] <rharper> blackboxsw: ok
[18:25] <powersj> blackboxsw, eta on the SRU?
[18:25] <bilsch> Odd_Bloke https://paste.ubuntu.com/p/VJRfjt7sJV/
[18:29] <bilsch> looking at the properties for cc_resizefs.py I don't think it even takes an array / list or looks up devices / mounts. I also don't see where growpart calls a resizefs etc. I kinda wonder if this works at all or is just cryptic. I'm hoping it's as easy as "yea dummy set this config / yaml key" ;)
[18:52] <Odd_Bloke> bilsch: This line suggests to me that your configuration isn't being consumed by cloud-init: cc_growpart.py[DEBUG]: No 'growpart' entry in cfg.  Using default: {'ignore_growroot_disabled': False, 'mode': 'auto', 'devices': ['/']}
[18:53] <Odd_Bloke> So that would probably be the next thing to debug. :)
[19:23] <bilsch> it's set as user-data for the vm in ec2 ...
[19:24] <blackboxsw> powersj: only waiting on the cdoqa test run. it's queued, but I don't think it's run yet
[19:25] <blackboxsw> powersj: /me needs to attach all our verification logs. cloud-init side of testing is complete
[19:25] <powersj> blackboxsw, awesome, thanks!
[19:26] <Odd_Bloke> bilsch: What does `sudo cloud-init query userdata` give you?
[19:31] <blackboxsw> good one Odd_Bloke. bilsch: if the following was your full user-data, it's missing a leading header line containing "#cloud-config"
[19:31] <blackboxsw> https://paste.ubuntu.com/p/rhhrBnVKY7/
[19:33] <blackboxsw> but that query cmd mentioned would tell you for certain
[19:43] <bilsch> ah, I did not paste the full file. That "#cloud-config" header is there
[19:43] <bilsch> ```
sudo cloud-init query userdata
 #cloud-config
 runcmd:
   - yum -y remove ansible
 growpart:
   ignore_growroot_disabled: false
   mode: auto
   devices:
     - "/"
     - "/home"
     - "/var"
     - "/var/log"
     - "/var/log/audit"
     - "/var/tmp"
   resizefs: true
   resize_rootfs: true
```
[19:44] <bilsch> bah newlines and such
[19:44] <bilsch> https://paste.ubuntu.com/p/SZwVc9mMRj/
[19:45] <bilsch> would the spacing in there cause issues? It's technically valid YAML but not sure how strict the parser is
[19:48] <blackboxsw> bilsch: the newlines/whitespace is probably what's breaking cloud-init's interpretation of the growpart key. Try `grep "Failed at merging" /var/log/cloud-init.log`; I presume if it were invalid cloud-config or YAML you'd get that message.
[19:49] <bilsch> the grep returned nothing
[19:49] <bilsch> though, I think I just re-created without the leading spaces
[19:49] <blackboxsw> also something I sometimes do: `sudo cloud-init query userdata > my.yaml; cloud-init devel schema --config-file my.yaml`
[19:49] <bilsch> oh thats handy thanks
[19:50] <bilsch> yea those leading spaces are a problem
[19:50] <blackboxsw> yeah we are building that schema validation cmdline utility up, so it's still considered a 'devel' subcommand.  It'll at least validate proper yaml (and all keys and config values for about 10 of the cloud-config modules)
[19:50] <bilsch> Cloud config schema errors: format-l1.c1: File my.yaml needs to begin with "#cloud-config"
[19:51] <blackboxsw> ok there it is. silly white space
[19:51] <blackboxsw> and about 90% of the problems in cloud-init deploys that people have.... darn you YAML
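What the schema error above is pointing at can be shown in a few lines of Python (a sketch; `looks_like_cloud_config` is an illustrative stand-in, not cloud-init's actual function): indented user-data is still plausible YAML, but "#cloud-config" is no longer the literal start of the file, so cloud-init never routes it to the cloud-config handler.

```python
def looks_like_cloud_config(user_data: str) -> bool:
    # cloud-init picks a handler from the first bytes of user-data;
    # "#cloud-config" has to sit flush at column one of line one.
    return user_data.startswith("#cloud-config")

good = '#cloud-config\ngrowpart:\n  mode: auto\n'
bad = ' #cloud-config\n growpart:\n   mode: auto\n'  # leading spaces from a heredoc

print(looks_like_cloud_config(good))  # True
print(looks_like_cloud_config(bad))   # False: header never seen
```

That matches the `format-l1.c1` error: the check is positional, so even a single leading space defeats it.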
[19:52] <bilsch> yea we need another file markup language. someone should fix that!
[19:52] <blackboxsw> and strict yaml cloud-init processing :)
[19:52] <bilsch> yea, I assume you guys know of yamllint? saves so much time ( and yea I know... I should have done that )
[19:52] <blackboxsw> bilsch: I'm still curious about your deploy (while we have something 'broken'). I'm wondering why cloud-init's log didn't match the expected failure message
[19:53] <blackboxsw> do you have a match from `grep Trace /var/log/cloud-init.log`?
[19:53] <bilsch> yea sure whats up
[19:53] <bilsch> yea 2 tracebacks
[19:53] <blackboxsw> ok those should point to the specific type of trace in trying to process invalid user-data
[19:54] <blackboxsw> unfortunately, cloud-init tries hard to succeed, even if vendor data or user-data is "broken" so the VMs still come up. We are generally trying to move it from "bring the system up as best you can" approach to "complain loudly because nobody looks in logs for warnings :)"
[19:54] <bilsch> so, both look like permission denied. These boxes have selinux on them
[19:55] <blackboxsw> ok. I'll get a test system with leading whitespace and reproduce locally then. Thanks for peeking
[19:55] <bilsch> yea I second that motion make it fail so I learn and fix my broken crap
[19:55] <bilsch> yea for sure thanks for the help!
[19:55] <blackboxsw> yeah same for me too. I don't want to dig into a log to find out why I'm not 100% in line with my config
[19:55] <blackboxsw> surely thx Odd_Bloke
[19:56] <bilsch> yea I prefer determinism - if something is not configured right say so, break, seg fault the kernel whatever
[19:56] <blackboxsw> agreed
[19:56] <bilsch> oh ha terraform was happily waiting for me to say yes to test the fixed yaml ( heredoc with spacing proper to the tf file )
[20:00] <bilsch> are the resizefs and resize_rootfs top-level or nested within growpart?
[20:04] <bilsch> also rather than constantly re-creating a vm, is there a way to just apply the modified yaml locally?
[20:04] <bilsch> eg, save the yaml like before, tweak / whatever and use cloud-init to do all the things?
[20:06] <blackboxsw> bilsch: can you run cloud-id ( I think terraform == NoCloud datasource type right?)
[20:06] <blackboxsw> or "cloud-init status --long"
[20:06] <bilsch>  cloud-init status --longstatus: donetime: Mon, 29 Jun 2020 20:03:25 +0000detail:DataSourceEc2
[20:06] <bilsch> bunch of new lines in there, want it in pastebin?
[20:06] <blackboxsw> that 2nd command will tell you (if on NoCloud, where your seed directory is coming from)
[20:07] <blackboxsw> nah it's good
[20:07] <bilsch> so is growpart intended to also automagically expand the filesystems?
[20:08] <bilsch> I see the devices expanded but not the filesystems
[20:08] <bilsch> post init of a fresh vm
[20:08] <bilsch> data blocks changed from 523776 to 5242619
[20:08] <bilsch> that is only after I ran xfs_growfs.  cloud-init had expanded the partition just fine
[20:13] <blackboxsw> bilsch: the hammer that lets you re-run everything in cloud-init is `cloud-init clean --logs --reboot`; that'll wipe the system and re-run. The problem with the ec2 datasource is that you have already set the user-data on the metadata service in ec2 I think. So, even though you cleaned cloud-init, it'll still grab the original user-data.
[20:13] <bilsch> ahh
[20:13] <bilsch> ok
[20:13] <blackboxsw> nocloud datasource is different in that it has a seed directory that you can re-write after the fact and re-run with new user-data content
[20:13] <bilsch> no biggie just looking to debug in place / reduce cycle time to get it right is all
[20:14] <bilsch> almost as much time to go mess with ec2 console and muck with it vs just letting tf rebuild it so
[20:14] <blackboxsw> yeah we reduce cycles by using 'lxc launch ubuntu-daily:focal mylocalvm'    which also has images:centos/7 sles/* etc
[20:14] <bilsch> ah ok that makes sense
[20:19] <bilsch> https://github.com/canonical/cloud-init/blob/master/cloudinit/config/cc_resizefs.py#L242-L247 so I'm back to questioning if resizefs works for anything other than the root filesystem....
[20:23] <blackboxsw> https://cloudinit.readthedocs.io/en/latest/topics/modules.html#growpart so what you want is  https://github.com/canonical/cloud-init/blob/master/cloudinit/config/cc_growpart.py#L276 I think?
[20:35] <bilsch> well, growpart and resizefs appear to be separate modules / tools
[20:36] <bilsch> growpart is just the partition table change
[20:36] <bilsch> resizefs looks for / runs the proper command to grow the filesystem for an already-expanded partition / volume etc
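That split can be sketched as the per-fstype dispatch the resize side needs (a toy table, not code lifted from cc_resizefs.py): once growpart has enlarged a partition, something still has to run the right grow tool for the filesystem on it.

```python
def resize_cmd(fstype, mount_point, device):
    """Pick an online-resize command for an already-enlarged partition.

    Illustrative only; cc_resizefs.py keeps a similar fstype-to-tool
    mapping, but applies it to the root filesystem.
    """
    if fstype == "xfs":
        return ["xfs_growfs", mount_point]   # xfs grows via the mount point
    if fstype in ("ext2", "ext3", "ext4"):
        return ["resize2fs", device]         # extN grows via the block device
    raise ValueError("no resize handler for %s" % fstype)

print(resize_cmd("xfs", "/var", "/dev/sda3"))    # ['xfs_growfs', '/var']
print(resize_cmd("ext4", "/home", "/dev/sda2"))  # ['resize2fs', '/dev/sda2']
```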
[20:41] <Odd_Bloke> blackboxsw: bilsch: I'm catching up, but the complication with completely aborting a boot is that it makes it harder for people to get log files off of it to diagnose the issue; obviously kernel panics have a similar effect, but you generally can't cause a kernel panic by passing misformatted YAML to your cloud provider. ;)
[20:42] <bilsch> heh - yea even kill -9 1 does not work anymore ;(
[20:43] <bilsch> and yea it does make sense that you won't want to make the pain too bad, gotta give people a chance to find the information
[20:44] <bilsch> I've tried a few incantations on the growpart devices - I get the partition expanded via growpart but only / via resizefs
[20:44] <AnhVoMSFT> @blackboxsw @rharper at what point during cloud-init-local.service does systemd trigger other units that are marked After=cloud-init-local.service?
[20:47] <rharper> AnhVoMSFT: since it's in oneshot mode, after the first Exec= line is complete,
[20:47] <rharper> AnhVoMSFT: so cloud-init init --mode=local must exit before units depending on it can start
[20:48] <AnhVoMSFT> I see - thanks @rharper. We're seeing this strange issue in RHEL with cloud-init (18.5) where if an NFS mount exists in /etc/fstab, cloud-init will hang in "mount -a" during deallocate/restart of a VM
[20:48] <rharper> https://www.freedesktop.org/software/systemd/man/systemd.service.html  ; "Behavior of oneshot is similar to simple; however, the service manager will consider the unit up after the main process exits. It will then start follow-up units. RemainAfterExit= is particularly useful for this type of service. Type=oneshot is the implied default if neither Type= nor ExecStart= are specified. Note that if this option is used without
[20:48] <rharper> RemainAfterExit= the service will never enter "active" unit state, but directly transition from "activating" to "deactivating" or "dead" since no process is configured that shall run continously. In particular this means that after a service of this type ran (and which has RemainAfterExit= not set) it will not show up as started afterwards, but as dead.
[20:49] <rharper> AnhVoMSFT: is the mount entry marked with _net ... what's the bit
[20:49] <rharper> _netdev
[20:50] <AnhVoMSFT> the mount is added manually by the customer to /etc/fstab , not through cloud-init mounts config
[20:50] <rharper> cloud-init local does not call  mount -a, that happens in cloud-init init (network mode);
[20:50] <rharper> ok, it must have _netdev in the options field if it depends on networking
[20:51] <rharper> this informs systemd-fstab-generator which creates .mount files to set them to run After=network-online.target
[20:52] <AnhVoMSFT> it does not have _netdev in the options field
[20:52] <rharper> that said, mount -a only runs in cloud-init init (stage 2) and networking should be up
[20:52] <rharper> so I'm not sure why it would hang; so I suspect that maybe networking isn't coming all the way up (or no route to the mount)
[20:53] <AnhVoMSFT> the mount unit indicates type=nfs, which will have after=network-online.target added automatically by systemd I believe
[20:53] <rharper> yep
[20:54] <AnhVoMSFT> yeah, but you're right it's in cloud-init's init phase, not init-local
[20:54] <rharper> but a mount -a will force mounting of all entries when it's run; meant to bring up any new entries added since fstab-generator ran
[20:54] <rharper> the fstab generator runs before cloud-init local does;  so if we add a new mount, then we trigger a mount -a ;  the ephemeral disk in azure's case
[20:55] <rharper> but ... it should come up; so that means networking issues (or possibly missing nfs client)
[20:55] <AnhVoMSFT> if I move "mounts" to the config phase it works fine
[20:55] <rharper> I don't think the ubuntu image has the nfs-common package included by default
[20:55] <rharper> sounds like networking isn't fully up
[20:55] <AnhVoMSFT> this issue does not happen in Ubuntu, but in RHEL only, which is strange
[20:55] <rharper> I suspect it's network-manager related
[20:55] <AnhVoMSFT> it does look like some sort of issue with networking
[20:55] <rharper> I know otubo was chasing NM "being all the way up" issues
[20:56] <rharper> this was maybe 6 months ago, but I thought the workaround there was to ensure the Network-Manager-wait-online.service was also waited upon by cloud-init.service
[20:58] <AnhVoMSFT> the init phase also runs before network-online?
[20:59] <AnhVoMSFT> https://paste.ubuntu.com/p/h4yftHdYbC/
[20:59] <AnhVoMSFT> that's what the cloud-init.service looks like in RHEL
[21:00] <rharper> that doesn't look right to me
[21:01] <rharper> upstream we run After=networking.service NetworkManager.service; and I thought otubo added a drop-in to include NetworkManager-wait-online.service;
[21:02] <rharper> basically cloud-init.service runs after OS networking is up; but before network-online.target;  which means that cloud-init knows that networking is up; and can fetch networking based #include cloud-configs, which need to be present before we run cloud-config.service
[21:05] <AnhVoMSFT> let us try that quickly, then we can open a support ticket on Redhat and get that fixed
[21:09] <rharper> https://bugs.launchpad.net/cloud-init/+bug/1869181/comments/12  ;  I poked around with getting NM to fully come up in Ubuntu and it needed more work;   especially tricky  w.r.t the ordering NM needs dbus and strange things happen (boot dep cycle)
[21:10] <rharper> AnhVoMSFT: the DefaultDependencies=no to both NM and NM-wait-online.service I think, and then adding the After=NetworkManager-wait-online.service helped; that may be enough on Centos/RHEL. On Ubuntu the netplan bits converting to NM config weren't quite there; on Cent/RH they use the sysconfig rh plugin, so I don't think you'll see the rest of the issues I saw in that bug
[21:11] <AnhVoMSFT> we tried adding the After=NetworkManager-wait-online.service to cloud-init.service but that did not help
[21:11] <AnhVoMSFT> i did not add DefaultDependencies=no
[21:14] <rharper> AnhVoMSFT: journalctl -b -o short-monotonic -u NetworkManager.service -u NetworkManager-wait-online.service -u cloud-init.service -u network-online.target ;
[21:14] <rharper> that should print in timestamp order ... if you see cloud-init.service dumping the netinfo table and not everything is then, something isn't ordered correctly (or NM is failng to bring everything online)
[21:14] <rharper> s/then/there
[21:21] <AnhVoMSFT> adding the DefaultDependencies=no also did not help
[21:21] <AnhVoMSFT> let me check the journalctl output to see what is missing