=== hatchetation_ is now known as hatchetation
=== balkamos_ is now known as balkamos
[01:33] thanks again xnox, you rock
=== SpamapS_ is now known as SpamapS
=== PaulePan1er is now known as PaulePanter
[15:32] hi guys. what's the proper way to execute something when an app crashes but hasn't reached the respawn limit? post-stop is also run when the process is manually stopped so that's not working for me. is there an event i can have another task listen to?
[15:34] like, maybe i can check $PROCESS on the 'stopped' event?
[16:10] basically looking for `start on respawned `
[16:47] guise?
[17:47] dstokes: I'm a newbie so I may not have the proper guidance available, but you might be able to `initctl emit your-event-name` within your stop routine and then `start on your-event-name`
[17:49] styol: afaik there is no way to detect a process failure / respawn in stop routines. only difference in a respawn is that pre-stop isn't run
[17:53] dstokes: ah I see what you're saying
[17:54] easy to detect when the process stops, not easy to make a distinction btwn `stop ` and process crashing
[17:54] dstokes: this post suggests PROCESS=respawn might be able to be used https://bugs.launchpad.net/upstart/+bug/716802
[17:55] styol: process=respawn indicates a process that reached its respawn limit. it's triggered once after n unsuccessful respawn attempts. i'm currently using a task to monitor those failures.
[17:56] goal now is to detect when apps crash but successfully restart, for debugging purposes
[22:33] styol: dstokes: the events emitted are - stopped/failed JOB='test' INSTANCE='' RESULT='failed' PROCESS='respawn'
[22:33] styol: which is emitted after e.g. the main process segfaulted.
[22:34] styol: after that there will be new starting/started events from respawning.
[22:34] wait no, that one is when the respawn limit is reached
[22:34] ah i see
[22:34] dstokes: ping
[22:35] yup
[22:36] respawn does not represent a respawn, but reaching the respawn limit
[22:36] styol: dstokes: for the intermediate failures one gets: started/failed JOB='test' INSTANCE=''
[22:36] styol: dstokes: let me paste the log of events
[22:37] job failure is indicative of the upstart job failing, not the managed process right? pretty sure i tested that case..
[22:46] dstokes: so without respawn the event i see is - stopped JOB=test EXIT_STATUS=1 PROCESS=main RESULT=failed
[22:46] dstokes: or e.g. - stopped JOB=test RESULT=failed PROCESS=main EXIT_SIGNAL=FPE
[22:46] (floating point exception, core-dump)
[22:47] dstokes: but with respawn one gets less information. You could leave out respawn, and instead have a second job - e.g. monitor.conf which is "start on stopped mainjob" which then can act appropriately and do "$ start --no-wait mainjob" to "respawn" and/or do other clean-up things
[22:48] i'm only seeing that event after respawn limit with - stopped JOB=test PROCESS=respawn RESULT=failed
[22:48] dstokes: looks ugly, and i think it's a bug that /less/ info is passed.
[22:48] maybe i need to check my exit code..
[22:48] xnox: then you have to hack together your own respawn limit right?
[22:48] dstokes: "so without respawn the event i see is" as in if the main job _does not have "respawn" stanza_ I see more info in the failed events.
[22:48] dstokes: yeap.
[22:49] i think it's a bug that we don't emit as to /why/ we failed.
[22:49] xnox: sry, main job _does_ have respawn stanza, along with limit
[22:49] i'm only seeing stopped and stopping emitted at the end of several respawn attempts (when the limit is reached)
[22:49] dstokes: respawn -> only one failed event, when respawn limit reached without info; without respawn -> more details as to why main process failed.
[22:50] xnox: i see. so that's just the way it is ;)
[22:50] to work around, should be writing my own respawn task
[22:51] often times a process will fail, then startup successfully. those are the cases i'm after so i can debug why it failed before the logs fill up etc
[22:51] dstokes: it is weird. I'll open a bug about it, but probably will not help much as it will take a while for such a fix to be created.
[22:51] right, thx for your help anyway. happy to at least confirm that it's not misconfiguration on my end
[22:51] dstokes: oh, i see. I wonder if you can just crank up upstart logging to get that.
[22:51] dstokes: so do you actually want to run any automated scripts/jobs upon failures? or are you simply looking to collect data?
[22:52] dstokes: after $ initctl log-level debug (the most debugging)
[22:52] the ideal scenario is: setup main job to respawn a process when it fails, setup secondary task to curl when the process is respawned successfully (curl for notification)
[22:53] dstokes: I see - http://paste.ubuntu.com/7232773/
[22:53] i suppose i could watch the log, but that's a little more involved than i want to get ;)
[22:55] dstokes: i think we can meet your requirements!
[23:02] xnox++
[23:03] dstokes: one sec, testing.
[23:05] for context: main job http://paste.ubuntu.com/7232809/
[23:06] and associated task: http://paste.ubuntu.com/7232813/
[23:06] alrdy have the respawn limit task working properly. notifies me when a process fails to respawn (after limit)
[23:10] dstokes: so my "main" job simply does "int main() { pause(); }"
[23:10] dstokes: that's the process, and then externally i send FPE (kill -8) to it.
[23:10] $ cat /etc/init/test.conf
[23:10] respawn
[23:10] exec /tmp/a.out
[23:10] that's main job.
[23:10] and here is my "monitor"
[23:10] $ cat /etc/init/monitor.conf
[23:10] start on stopping test
[23:10] stop on stopped test
[23:10] script
[23:10] sleep 2; echo "Main job respawned successfully"
[23:10] end script
[23:10] ..
[23:11] dstokes: so when the job under test fails (stopping test) monitor kicks in. If the job is getting normally stopped or reached respawn limit, the monitor will be stopped.
[23:12] dstokes: however, if respawn succeeds and the job stays alive for 2 seconds then in /var/log/upstart/monitor.log i get notification that respawn was successful.
[23:12] dstokes: but the sleep 2 needs to be adjusted. You could do better without a sleep
[23:13] clever..
[23:13] dstokes: e.g. "stop on stopped test or started test"
[23:13] dstokes: and then instead of script, you'd have a post-stop script -> which checks the stop event environment.
[23:13] dstokes: if the reason for getting stopped is "started test" it means a successful respawn happened.
[23:14] let me code that.
[23:21] dstokes: nah, needs sleep / appropriate matching for respawn limit nonetheless.
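
Editor's sketch, not from the log: the checks discussed above could be written as two small shell helpers. The names classify_stopped and was_restarted are hypothetical. The first reads the variables quoted in the `stopped` events (RESULT, PROCESS, EXIT_STATUS, EXIT_SIGNAL); the second looks for a `started` event in a stop-event list, on the assumption that a post-stop script sees that list in UPSTART_STOP_EVENTS. Treat it as a starting point under those assumptions rather than a confirmed solution, since the last message in the log says the approach still needs a sleep or matching against the respawn limit.

```shell
#!/bin/sh
# Sketch only. classify_stopped and was_restarted are hypothetical helper
# names; they are not part of Upstart.

# Classify a "stopped" event from the values seen in the log:
#   $1 = RESULT, $2 = PROCESS, $3 = EXIT_STATUS, $4 = EXIT_SIGNAL
classify_stopped() {
    result=$1
    process=$2
    exit_status=$3
    exit_signal=$4
    if [ "$result" != "failed" ]; then
        echo "clean-stop"
    elif [ "$process" = "respawn" ]; then
        # Per the log, PROCESS=respawn means the respawn limit was reached.
        echo "respawn-limit-reached"
    elif [ -n "$exit_signal" ]; then
        echo "crashed-by-signal:$exit_signal"
    elif [ -n "$exit_status" ]; then
        echo "exited-with-status:$exit_status"
    else
        echo "failed-unknown"
    fi
}

# Succeed when the space-separated event list in $1 contains "started".
# Assumption: the caller passes "$UPSTART_STOP_EVENTS" from a post-stop script.
was_restarted() {
    case " $1 " in
        *" started "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A job with "start on stopped test" could call classify_stopped "$RESULT" "$PROCESS" "$EXIT_STATUS" "$EXIT_SIGNAL", and a monitor with "stop on stopped test or started test" could call was_restarted "$UPSTART_STOP_EVENTS" in its post-stop script. As noted in the log, with the respawn stanza enabled the intermediate failures may carry less information than this sketch expects.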