[16:26] <larsks> smoser: if you're around, what is the thought behind having DefaultDependencies=no in cloud-init.service?  It turns out that this is causing cloud-init to start before dbus on fedora, so things like setting the hostname fall over.  The fix may be an explicit dependency on dbus.(service|socket)...
[16:37] <smoser> larsks, 1 minute.
[16:39] <larsks> (the root cause is that DefaultDependencies=no means there is no implicit dependency on basic.target)
[16:52] <smoser> more than one minute.
[16:52] <smoser> ok
[16:53] <smoser> larsks, yeah, so we had issues with that too, starting before dbus.
[16:53] <larsks> If we removed DefaultDependencies=no it would in theory Just Work.
[16:54] <larsks> ...unless there was a specific problem that was meant to solve.
[16:54] <smoser> well, no. then cloud-init is not running early enough in order to block networking.
[16:54] <larsks> Ah, I see.
[16:54] <smoser> theres a series of bugs, probably referenced in commits.
[16:54] <smoser> let me look
[16:54] <larsks> So you think an explicit dependency on dbus.service is probably the way to go.
[16:54] <smoser> and this is, agreed, really a pita
[16:54] <smoser> i dont think that will actually work. let me see.
[16:58] <smoser> larsks, there is a lot of information in git log systemd/
[16:58] <smoser> https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1629797 is what had us adding the dbus.socket
[16:58] <larsks> Indeed, I've had to look at several of those.
[16:58] <smoser> it was resolved.
[16:59] <larsks> Looking...
[16:59] <smoser> we could ask pitti for advice...
[17:00] <smoser> he's the one who holds most knowledge in his head about systemd and aided in this slew of bugs.
[17:00] <smoser> pitti works for rh now
[17:01] <larsks> So https://git.launchpad.net/cloud-init/commit/?id=6e45ffb21e9622780585b4fe15890f009ca8fa71 added Before=dbus.socket, but that appears to have been removed by your commit e568aec31051674901047ee577f6e229785cbfc3
[17:02] <larsks> I will bug pitti for thoughts.
[17:03] <smoser> ask him to join here, nice to have conversation logged if we do have irc conversation.
[17:05] <smoser> larsks, so you're suggesting
[17:05] <smoser>  http://paste.ubuntu.com/23864232/
[17:05] <smoser> right?
[17:06] <larsks> Or After=dbus.socket, maybe, but yeah.  I'm going to test that later today and see if it corrects out issue.
[17:06] <larsks> s/out/our/
[17:08] <larsks> Well, that doesn't work:
[17:08] <larsks> Breaking ordering cycle by deleting job cloud-init.service/start
[17:15] <smoser> yeah, i saw that quickly in a lxd container (well, for some reason  journalctl doesnt show the error... but cloud-init.service did not run)
[17:15] <rharper> larsks: we switched dbus for sysinit.target
[17:15] <larsks> rharper: yes, I spotted that.
[17:17] <rharper> typically, you need to run after dbus.socket (which is required for networkd/timesyncd) but before sysinit.target
[17:18] <rharper> larsks: which fedora release?
[17:18] <smoser> larsks, what is the ordering cycle that you see ?
[17:18] <smoser> can you paste journalctl ?
[17:19] <larsks> The problem we have is that we have Before=sysinit.target, which means we run before dbus.service.  But adding After=dbus.service results in an ordering loop: http://chunk.io/f/b9c11198c8c14c6b84dccec25cd1f3cc
[17:20] <larsks> This is F25 right now, using cloud-init 0.7.9.
[17:21] <larsks> Note that those log entries are what we get after adding After=dbus.socket to cloud-init.service.
[17:21] <larsks> I can produce logs for other scenarios if you would like.
[17:26] <rharper> larsks: yes, that sounds familiar, it should be ok to run before dbus.service; just not before dbus.socket;   we had to split the  gap so-to-speak;  we also have a resolv service that does libnss before dbus.service is up (which then resolv takes over)
[17:26] <larsks> rharper: the point is that After=dbus.socket results in the ordering loop.
[17:27] <smoser> right, and larsks needs to run after dbus.socket so that he can update the hostname on the system
[17:27] <rharper> yeah, we'll need to detangle that
[17:27] <smoser> because obviously calling a kernel api should go through a user space daemon with a dbus socket
[17:27] <rharper> and something else may need a 'DefaultDependencies=no'
[17:27] <smoser> well i think its just that you cant be before sysinit.target and after dbus.socket
[17:27] <larsks> Right.
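The conflict smoser describes can be sketched as a cycle search over After= edges. This is a minimal illustration, not how systemd itself resolves jobs. The function name and the edge list are made up for the example, and the edge placing dbus.socket after sysinit.target is an assumption based on the default dependencies systemd documents for socket units.

```python
from collections import defaultdict


def find_ordering_cycle(after):
    """Return one cycle in an After= graph as a list of units, or None.

    `after` maps a unit name to the units it must start after.
    (Hypothetical helper, not part of systemd or cloud-init.)
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every unit starts WHITE (unvisited)
    stack = []

    def visit(unit):
        colour[unit] = GREY
        stack.append(unit)
        for dep in after.get(unit, ()):
            if colour[dep] == GREY:
                # dep is already on the current path: we closed a loop
                return stack[stack.index(dep):] + [dep]
            if colour[dep] == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        colour[unit] = BLACK
        return None

    for unit in list(after):
        if colour[unit] == WHITE:
            cycle = visit(unit)
            if cycle:
                return cycle
    return None


# Simplified edges; every Before= is rewritten as the matching After=.
after = {
    # cloud-init.service has Before=sysinit.target
    "sysinit.target": ["cloud-init.service"],
    # ASSUMPTION: dbus.socket keeps default dependencies (After=sysinit.target)
    "dbus.socket": ["sysinit.target"],
    # the proposed addition: After=dbus.socket in cloud-init.service
    "cloud-init.service": ["dbus.socket"],
}

cycle = find_ordering_cycle(after)
print(" -> ".join(cycle))
```

Removing the last edge from the mapping makes the function return None, which matches the observation that the unit only gets dropped once After=dbus.socket is added.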
[17:28] <larsks> But I'm still not clear on why we need defaultdependencies=no.  We have explicit ordering on most of the network stuff now, I think.
[17:28] <smoser> if you drop defaultdependencies=no, then you get defaultdependencies
[17:28] <smoser> which include sysinit.target
[17:28] <smoser> i think
[17:28] <larsks> Yeeeeeees.
[17:28] <rharper> it's because default adds more deps that do not allow the correct ordering for placing before networking
[17:29] <larsks> rharper: do you know exactly what the problem is?  Because there are now explicit dependencies on several network units.
[17:29] <larsks> Are those insufficient?
[17:29] <rharper> it's the additional deps that push things with default deps further up in the cycle
[17:29] <larsks> What do we hit if we come after sysinit.target?  Is there a test case that demonstrates an actual problem?
[17:29] <rharper> we cannot block networking
[17:30] <smoser> we so need better integration test. :-(
[17:30] <smoser> its coming.
[17:30] <larsks> I understand the problem is "we cannot block networking", but why, and why are the explicit Before= deps not sufficient to permit us to block networking?
[17:30] <rharper> cloud-init local needs to be able to write out a network config
[17:30] <rharper> they are
[17:30] <larsks> rharper: sure, but this isn't about cloud-init-local
[17:30] <rharper> but defaultdeps bring in *MORE* deps
[17:30] <larsks> This is cloud-init.service.
[17:30] <larsks> There is no problem with cloud-init-local having defaultdeps=no
[17:30] <smoser> i think that the issue is probably that After sysinit.target (which is added unless you are Defaultdependencies=no) will mean that we run After dbus.socket
[17:30] <rharper> which runs right before networking is considered online
[17:31] <smoser> which, on ubuntu, with resolved , means dns queries block until timeout
[17:31] <smoser> because resolved wasnt up but the socket was
[17:31] <larsks> Since it doesn't sound like there is any evidence that defaultdeps=no is necessary, I'd like to produce some so that we can actually test things. smoser, what is a scenario that requires that we "block networking"...something passing in a static network config?
[17:32] <larsks> I would like to find something that will fail somewhere (fedora/ubuntu/whatever) if I remove the defaultdependencies line.
[17:32] <smoser> it is necessary
[17:32] <smoser> otherwise you run After=sysinit.target
[17:32] <smoser> which is After=dbus.socket
[17:33] <larsks> This discussion has developed an ordering problem :).  I am just asking for some sort of scenario that would demonstrate the problem you and rharper have described.
[17:33] <rharper> openstack boot
[17:34] <larsks> But a normal openstack boot doesn't require cloud-init to do *anything* w/r/t networking.
[17:34] <rharper> when cloud-init.service runs it will attempt to poke at network based metadata services
[17:34] <smoser> right
[17:34] <rharper> it can and usually does
[17:34] <larsks> So it needs to run *after* networking in that case.
[17:34] <smoser> and it will attempt dns lookup on the gce .internal  name
[17:34] <rharper> but before it's online
[17:34] <smoser> and that blocks
[17:34] <rharper> a cloud may provide network configuration via metadata services
[17:34] <larsks> rharper: I think I just missed a distinction there.
[17:35] <rharper> so all other units that need network, run after 'network-online.target' is reached
[17:35] <larsks> Ah, I see.  So in that case, you expect cloud-init to...bring up interfaces manually first, in order to contact the metadata service?
[17:35] <larsks> I mean, how do the interfaces come up in that situation to permit access to the network metadata service, if we're running before the system brings up networking?
[17:35] <smoser> hey, i'm really sorry, but i have got to work on some other things .
[17:35] <rharper> smoser: np
[17:36] <larsks> Yeah sure.  I just want to understand the problem we're trying to solve with these dependency settings.
[17:36] <larsks> I also have other things I need to work on :)
[17:36] <rharper> larsks: we use the host's network service (so ifupdown or networkd); a fallback network config (typically dhcp on first nic) is done
[17:36] <larsks> But aren't those going to depend on, e.g., networkmanager already running?
[17:36] <rharper> right
[17:36] <rharper> but not reaching the network-online.target
[17:36] <rharper> it's really threading a needle
[17:37] <rharper> cloud-init is expected to do things with the network which could affect network-based services (like sshd host key gen)
[17:37] <larsks> Hmmmm.  But we already have Before=network-online.target, right?
[17:37] <rharper> we don't want sshd to be running (it runs after network-online.target is reached)
[17:37] <larsks> So even if we exclude default dependencies, we're still okay.
[17:37] <larsks> We also have Before=sshd.service
[17:38] <rharper> what we really need is a list of default deps that get added unless you add DD=no
[17:38] <larsks> I am pretty sure that means sysinit.target and basic.target.  But I suppose you mean you'd like the transitive deps in that case?
[17:38] <rharper> then we can walk each of those to see if they order themselves after network-online.target or something else that forces cloud-init.service to run later than we need
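The walk rharper proposes could look roughly like the sketch below: a breadth-first search over After= edges that reports a chain from one unit to a target, if any exists. The helper name is hypothetical, and the graph is invented to show the mechanics. In particular, `example-early.service` and its edges are made up, and real data would have to come from the system under test (for example by reading the After= property of each unit with `systemctl show`).

```python
from collections import deque


def path_ordering_after(after, start, target):
    """Return a chain start -> ... -> target following After= edges, or None.

    `after` maps a unit name to the units it must start after.
    (Hypothetical helper for illustration only.)
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        unit = queue.popleft()
        if unit == target:
            # Rebuild the chain by following parent links back to start
            path = []
            while unit is not None:
                path.append(unit)
                unit = parents[unit]
            return path[::-1]
        for dep in after.get(unit, ()):
            if dep not in parents:
                parents[dep] = unit
                queue.append(dep)
    return None


# HYPOTHETICAL graph: the last two entries are invented so the search finds a hit.
after = {
    "cloud-init.service": ["sysinit.target", "basic.target"],
    "basic.target": ["sysinit.target", "sockets.target"],
    "sockets.target": ["dbus.socket"],
    "dbus.socket": ["sysinit.target"],
    "sysinit.target": ["example-early.service"],
    "example-early.service": ["network-online.target"],
}

path = path_ordering_after(after, "cloud-init.service", "network-online.target")
print(" -> ".join(path) if path else "no ordering chain found")
```

A returned chain would name the specific units that push cloud-init.service past the target, which is the evidence the discussion is looking for.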
[17:39] <rharper> right, DefaultDeps is larger than just those two right ?
[17:39] <larsks> No.  From systemd.service: Unless DefaultDependencies= in the "[Unit]" is set to false, service units will implicitly have dependencies of type Requires= and After= on sysinit.target, a dependency of type After= on basic.target as well as dependencies of type Conflicts= and Before= on shutdown.target.
[17:39] <larsks> What exactly is implied by those depends a lot on how other services are ordered w/r/t basic.target and sysinit.target
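As a rough sketch of the rule quoted from systemd.service above, the implicit dependencies of a service unit can be written as a small table and merged with the explicit ones. The function names are hypothetical, and the explicit list contains only the three Before= lines mentioned in this conversation, not the full unit file.

```python
def implicit_service_dependencies(default_dependencies=True):
    """Implicit dependencies of a service unit, per the quoted man page text."""
    if not default_dependencies:
        return {}
    return {
        "Requires": ["sysinit.target"],
        "After": ["sysinit.target", "basic.target"],
        "Conflicts": ["shutdown.target"],
        "Before": ["shutdown.target"],
    }


def effective_dependencies(explicit, default_dependencies=True):
    """Merge explicit [Unit] dependencies with the implicit ones (sketch only)."""
    merged = {kind: list(units) for kind, units in explicit.items()}
    implicit = implicit_service_dependencies(default_dependencies)
    for kind, units in implicit.items():
        for unit in units:
            if unit not in merged.setdefault(kind, []):
                merged[kind].append(unit)
    return merged


# Only the Before= lines named in the discussion; other directives are omitted.
explicit = {
    "Before": ["network-online.target", "sshd.service", "sysinit.target"],
}

with_defaults = effective_dependencies(explicit, default_dependencies=True)
# Units that end up on both sides of the ordering
conflict = set(with_defaults["Before"]) & set(with_defaults["After"])
print(sorted(conflict))
```

Under these assumptions, dropping DefaultDependencies=no while keeping Before=sysinit.target leaves sysinit.target in both the Before= and After= lists, so the two settings cannot simply be combined.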
[17:40] <rharper> right, so the default deps of sysinit.target and basic.target make cloud-init.service run too late
[17:40] <larsks> Maaaybe.  I've noticed that between ubuntu and rhel/fedora there are substantial differences in service ordering.  And even between fedora and rhel, I think.
[17:40] <rharper> very likely
[17:40] <rharper> the list of units and ordering is massively fragile
[17:41] <larsks> So we still don't have a clear test case that demonstrates an actual problem.  It sounds like you are suggesting that an openstack config that passes in an explicit network configuration should help demonstrate one?
[17:41] <larsks> I can try putting that together later this week.
[17:41] <rharper> you don't even need a network config
[17:42] <rharper> just use the default metadata service (ie, not a configdrive)
[17:42] <rharper> in our case, if you run systemctl list-dependencies sysinit.target
[17:43] <rharper> honestly; we can try moving it back in as well and see what breaks
[17:43] <rharper> in general, it's such a mess that it gets paged out of my memory once things work as expected
[17:44] <larsks> Yeah, the thing is, running cloud-init *after* network-online.target will work just fine in that case (since it doesn't need to touch the network config).
[17:44] <larsks> Since networking is up, it will have no problem contacting the metadata service.
[17:44] <rharper> right
[17:44] <larsks> I am trying to produce a failure :)
[17:49] <larsks> Speaking of paging things out, I should get lunch before my next meeting and the meeting after that.  I've also pointed pitti at the problem, although he's got devconf going on right now and may not be able to look at things until next week or so.
[17:49] <rharper> cool
[21:49] <rharper> smoser:  https://code.launchpad.net/~raharper/cloud-init/+git/cloud-init/+merge/315633