[12:07] <mitsuhiko> does anyone have any ideas why upstart sometimes is unable to pick up changes?
[12:07] <mitsuhiko> how does one go about debugging upstart?
[12:29] <mitsuhiko> does this ring a bell to anyone? https://gist.github.com/5f9061af79bb8b38d240
[12:29] <mitsuhiko> service start just hangs
[12:31] <mitsuhiko> i can confirm it does not even manage to start the executable
[12:31] <mitsuhiko> god knows what it waitpids for
[12:32] <mitsuhiko> i replaced the config file with one that just logs something
[12:32] <mitsuhiko> and it still does not do anything
[12:33] <mitsuhiko> force initctl reload-configuration does not do anything either
[12:34] <mitsuhiko> what the fucking fuck. i just renamed the service and that fixes it
[12:44] <jodh> mitsuhiko: is 'salt-master' a daemon by any chance? If you haven't already done so, I'd recommend reading http://upstart.ubuntu.com/cookbook/#precepts-for-creating-a-job-configuration-file.
[12:45] <jodh> mitsuhiko: also, if you are creating this .conf from scratch, don't add respawn until you are convinced the job is working as you expect (http://upstart.ubuntu.com/cookbook/#respawn).
[12:45] <jodh> mitsuhiko: also, you can remove all that redirection and either specify "console none" or just remove it and have Upstart auto-log any output to /var/log/upstart/<job>.log.
[12:46] <jodh> mitsuhiko: another point - you don't need the script/end-script - just use 'exec'.
[12:47] <jodh> mitsuhiko: to understand what is going on with your original job, 'initctl status <job>'. I suspect salt-master is forking but you haven't told Upstart that it does that (respawn stanza), which will lead to interesting results.
[12:47] <jodh> mitsuhiko: oops - I meant 'expect' stanza, not 'respawn' above.
[12:59] <mitsuhiko> jodh: the file works just fine
[12:59] <mitsuhiko> i renamed it and it works
[13:00] <mitsuhiko> so something in upstart is broken
[13:00] <mitsuhiko> also the file worked for two months without changes
[13:00] <mitsuhiko> jodh: the script/end-script was there for debugging purposes
[13:00] <mitsuhiko> before it was expect daemon and salt-master -d as only exec line
[13:01] <jodh> mitsuhiko: Upstart is not broken. The problem I think is that your original unrenamed job is not working as you expect, and since Upstart was unable to track the PID of the initial job, you were not able to start it (as Upstart thought it was still running).
[13:02] <mitsuhiko> jodh: so how do i fix the unrenamed job?
[13:02] <mitsuhiko> and no, the script was always correct
[13:02] <mitsuhiko> i verified by spawning instances of all permutations of scripts i tried under the new name
[13:02] <mitsuhiko> all of them work
[13:02] <jodh> mitsuhiko: well, if you had 'expect daemon' and tried to start the job but salt-master does *not* fork, that too will confuse Upstart, hence your problem.
[13:02] <mitsuhiko> jodh: as i said, all permutations work
[13:03] <mitsuhiko> i tried expect daemon with -d and no expect with the redirected script block
[13:03] <mitsuhiko> both work fine as expected
[13:03] <mitsuhiko> my salt-master.conf file still does not start
[13:03] <mitsuhiko> the same file as salt-wtf.conf does start
[13:03] <jodh> mitsuhiko: and you verified that the PID Upstart reported in 'status job' reflected the real pid?
[13:03] <mitsuhiko> jodh: the service is not running
[13:03] <mitsuhiko> it *does not start*
[13:03] <mitsuhiko> it hangs
[13:03] <mitsuhiko> as shown by the strace above
[13:04] <jodh> mitsuhiko: what does 'status job' show?
[13:04] <mitsuhiko> $ status salt-master
[13:04] <mitsuhiko> salt-master stop/killed, process 7240
[13:04] <jodh> mitsuhiko: and does pid 7240 exist on your system?
[13:04] <mitsuhiko> no
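The check jodh walks through above (read the PID from the status output, then see whether that process still exists) can be sketched in Python. This is only an illustration, not part of Upstart: the helper names `parse_status`, `pid_exists` and `tracked_pid_is_stale` are hypothetical, and the regular expression assumes the simple one-line format pasted in this log (job name, goal/state, optional ", process N"), so it does not handle instance names or multi-line output.

```python
import os
import re

# Assumed format, taken from the line pasted in this log:
#   "salt-master stop/killed, process 7240"
STATUS_RE = re.compile(
    r"^(?P<job>\S+) (?P<goal>\w+)/(?P<state>[\w-]+)"
    r"(?:, process (?P<pid>\d+))?$"
)


def parse_status(line):
    """Split one status line into (job, goal, state, pid or None)."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised status line: %r" % line)
    pid = match.group("pid")
    return (
        match.group("job"),
        match.group("goal"),
        match.group("state"),
        int(pid) if pid is not None else None,
    )


def pid_exists(pid):
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        raise ValueError("pid must be positive")
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def tracked_pid_is_stale(line):
    """True if the status line names a PID that no longer exists."""
    _, _, _, pid = parse_status(line)
    return pid is not None and not pid_exists(pid)
```

Calling `tracked_pid_is_stale("salt-master stop/killed, process 7240")` on the affected machine would report whether the job is stuck on a PID that is gone, which is the situation described in this exchange.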
[13:05] <mitsuhiko> it might have at one point but it's hard to say because upstart is completely unable to do anything with salt-master.conf at this point
[13:05] <mitsuhiko> i cannot start it
[13:05] <mitsuhiko> i cannot stop it
[13:05] <mitsuhiko> yet if i rename it to salt-wtf.conf i can start and stop it properly
[13:05] <mitsuhiko> initctl reload-configuration does not help
[13:05] <jodh> mitsuhiko: right, so I think you or someone else has at some point changed the original conf file and attempted to start that job. Upstart is unable to track the pid as 'expect' probably wasn't specified correctly. Hence, you cannot start that job as it's in a bad state. Copying the job file creates a brand new job, so is not encumbered by that problem ;)
[13:06] <mitsuhiko> jodh: trust me, there was nothing wrong with the file in the first place
[13:06] <mitsuhiko> jodh: but assuming it was, how do you let upstart forget about the job?
[13:07] <jodh> mitsuhiko: unfortunately, currently you can't without using gross hacks (or rebooting): the whole point is that Upstart is supposed to be supervising your services so it should not be possible to say "just forget about this one". That said, we are considering adding a feature as this does catch folks out occasionally.
[13:08] <mitsuhiko> and you're telling me that is not an upstart bug
[13:08] <mitsuhiko> seriously. i fix this by rebooting?
[13:08] <jodh> mitsuhiko: ultimately, building up a new .conf file step-by-step should allow you to be assured the job is behaving correctly such that you'd never get into that scenario.
[13:08] <mitsuhiko> jodh: seriously. that file was never wrong
[13:09] <mitsuhiko> feel free to mistrust me on that one, but it was always correct
[13:09] <mitsuhiko> how do i know? because we used this for two months
[13:09] <mitsuhiko> what changed? i did a service stop, upgrade on salt, service start
[13:09] <mitsuhiko> stopped functioning
[13:09] <jodh> mitsuhiko: I admit, it's not ideal, but I've explained the rationale. Patches welcome of course ;-)
[13:09] <jodh> mitsuhiko: but did you test the job with stop/start/restart followed by killing the PID to see how it behaves on respawn?
[13:09] <mitsuhiko> jodh: it does not start
[13:09] <mitsuhiko> there is no pid to kill
[13:10] <mitsuhiko> it does not start
[13:10] <mitsuhiko> and in case you have not noticed it, i am kinda pissed off right now because i am debugging this problem for two hours by now
[13:10] <mitsuhiko> and now it turns out to be a bug in upstart with the only solution being a … restart?
[13:11] <mitsuhiko> there is also zero information that upstart gives even at highest debug levels
[13:11] <mitsuhiko> i wonder if telinit c would fix it, but i am too scared to try that
[13:11] <jodh> mitsuhiko: I don't know the full details of what has happened on your system. However, the limitation in Upstart is that it is not currently possible to rectify a problem caused by job misconfiguration.
[13:11] <mitsuhiko> there was no job misconfiguration
[13:12] <mitsuhiko> jodh: what would the gross hack be?
[13:12] <mitsuhiko> i much rather not reboot that machine
[13:14] <jodh> mitsuhiko: it's not something I would consider - it involves exhausting the PID namespace until you get back to the PID shown in 'status <job>' to allow that (pid) to be stopped.
[14:02] <mitsuhiko> jodh: exhausting does not work
[14:02] <mitsuhiko> it did not even try to issue a signal
[14:05] <mitsuhiko> readlink("/proc/7340/root", "/", 4096)  = 1
[14:05] <mitsuhiko> wat
[14:16] <mitsuhiko> ah yes.  recorded the wrong pid
[14:16] <mitsuhiko> 7340 != 7240.  Not sure why upstart showed the wrong one on the status message
[14:16] <mitsuhiko> i suppose what happened is that on salt upgrade the process did not daemonize properly and upstart responded badly to it.
[14:16] <mitsuhiko> race?
[22:59] <freerobby> Can anybody explain why this works in a shell, but when I start a process via torquebox upstart, it doesn't honor the system ulimits? https://gist.github.com/7fff364c0bca20c27aa5
[22:59] <freerobby> https://gist.github.com/6b22f5e1742f2bd63a75