=== hatchetation_ is now known as hatchetation | ||
=== balkamos_ is now known as balkamos | ||
styol | thanks again xnox, you rock | 01:33 |
---|---|---|
=== SpamapS_ is now known as SpamapS | ||
=== PaulePan1er is now known as PaulePanter | ||
dstokes | hi guys. what's the proper way to execute something when an app crashes but hasn't reached the respawn limit. post-stop is also run when the process is manually stopped so that's not working for me. is there an event i can have another task listen to? | 15:32 |
dstokes | like, maybe i can check $PROCESS on the 'stopped' event? | 15:34 |
dstokes | basically looking for `start on respawned <job>` | 16:10 |
dstokes | guise? | 16:47 |
styol | dstokes: I'm a newbie so I may not have the proper guidance available, but you might be able to `initctl emit your-event-name` within your stop routine and then `start on your-event-name` | 17:47 |
dstokes | styol: afaik there is no way to detect a process failure / respawn in stop routines. only difference in a respawn is that pre-stop isn't run | 17:49 |
styol | dstokes: ah I see what you're saying | 17:53 |
dstokes | easy to detect when the process stops, not easy to make a distinction btwn `stop <job>` and process crashing | 17:54 |
styol | dstokes: this post suggests PROCESS=respawn might be able to be used https://bugs.launchpad.net/upstart/+bug/716802 | 17:54 |
dstokes | styol: process=respawn indicates a process that reached its respawn limit. it's triggered once after n unsuccessful respawn attempts. i'm currently using a task to monitor those failures. | 17:55 |
dstokes | goal now is to detect when apps crash but successfully restart, for debugging purposes | 17:56 |
xnox | styol: dstokes: the events emitted are - stopped/failed JOB='test' INSTANCE='' RESULT='failed' PROCESS='respawn' | 22:33 |
xnox | styol: which is emitted after e.g. the main process segfaulted. | 22:33 |
xnox | styol: after that there will be new starting/started events from respawning. | 22:34 |
xnox | wait no, that one is when the respawn limit is reached | 22:34 |
styol | ah i see | 22:34 |
styol | dstokes: ping | 22:34 |
dstokes | yup | 22:35 |
dstokes | respawn does not represent a respawn, but reaching the respawn limit | 22:36 |
xnox | styol: dstokes: for the intermediate failures one gets: started/failed JOB='test' INSTANCE='' | 22:36 |
xnox | styol: dstokes: let me paste the log of events | 22:36 |
dstokes | job failure is indicative of the upstart job failing, not the managed process right? pretty sure i tested that case.. | 22:37 |
xnox | dstokes: so without respawn the event i see is - stopped JOB=test EXIT_STATUS=1 PROCESS=main RESULT=failed | 22:46 |
xnox | dstokes: or e.g. - stopped JOB=test RESULT=failed PROCESS=main EXIT_SIGNAL=FPE | 22:46 |
xnox | (floating point exception, core-dump) | 22:46 |
xnox | dstokes: but with respawn one gets less information. You could leave out respawn, and instead have a second job - e.g. monitor.conf which is "start on stopped mainjob" which then can act appropriate and do "$ start --no-wait mainjob" to "respawn" and/or do other clean-up things | 22:47 |
dstokes | i'm only seeing that event after respawn limit with - stopped JOB=test PROCESS=respawn RESULT=failed | 22:48 |
xnox | dstokes: looks ugly, and i think it's a bug that /less/ info is passed. | 22:48 |
dstokes | maybe i need to check my exit code.. | 22:48 |
dstokes | xnox: then you have to hack together your own respawn limit right? | 22:48 |
xnox | dstokes: "so without respawn the event i see is" as in if the main job _does not have "respawn" stanza_ I see more info in the failed events. | 22:48 |
xnox | dstokes: yeap. | 22:48 |
xnox | i think it's bug that we don't emit as to /why/ we failed. | 22:49 |
dstokes | xnox: sry, main job _does_ have respawn stanza, along with limit | 22:49 |
dstokes | i'm only seeing stopped and stopping emitted at the end of several respawn attempts (when the limit is reached) | 22:49 |
xnox | dstokes: respawn -> only one failed event, when respawn limit reached without info; without respawn -> more details as to why main process failed. | 22:49 |
dstokes | xnox: i see. so that's just the way it is ;) | 22:50 |
dstokes | to workarnd, should be writing my own respawn task | 22:50 |
dstokes | often times a process will fail, then startup successfully. those are the cases i'm after so i can debug why it failed before the logs fill up etc | 22:51 |
xnox | dstokes: it is weird. I'll open a bug about it, but probably will not help much as it will take a while for such a fix to be created. | 22:51 |
dstokes | right, thx for your help anyway. happy to at least confirm that it's not misconfiguration on my end | 22:51 |
xnox | dstokes: oh, i see. I wonder if you can just crank up upstart logging to get that. | 22:51 |
xnox | dstokes: so do you actually want to run any automated scripts/jobs upon failures? or are you simply looking to collect data? | 22:51 |
xnox | dstokes: after $ initctl log-level debug (the most debugging) | 22:52 |
dstokes | the ideal scenario is: setup main job to respawn a process when it fails, setup secondary task to curl when the process is respawned successfully (curl for notification) | 22:52 |
xnox | dstokes: I see - http://paste.ubuntu.com/7232773/ | 22:53 |
dstokes | i suppose i could watch the log, but that's a little more involved than i want to get ;) | 22:53 |
xnox | dstokes: i think we can succeed your requirements! | 22:55 |
dstokes | xnox++ | 23:02 |
xnox | dstokes: one sec, testing. | 23:03 |
dstokes | for context: main job http://paste.ubuntu.com/7232809/ | 23:05 |
dstokes | and associated task: http://paste.ubuntu.com/7232813/ | 23:06 |
dstokes | alrdy have the respawn limit task working properly. notifies me when a process fails to respawn (after limit) | 23:06 |
xnox | dstokes: so my "main" job simply does "main() { pause(); }" | 23:10 |
xnox | dstokes: that's the process, and then externally i send FPE (kill -8) to it. | 23:10 |
xnox | $ cat /etc/init/test.conf | 23:10 |
xnox | respawn | 23:10 |
xnox | exec /tmp/a.out | 23:10 |
xnox | that's main job. | 23:10 |
xnox | and here is my "monitor" | 23:10 |
xnox | $ cat /etc/init/monitor.conf | 23:10 |
xnox | start on stopping test | 23:10 |
xnox | stop on stopped test | 23:10 |
xnox | script | 23:10 |
xnox | sleep 2; echo "Main job respawned successfully" | 23:10 |
xnox | end script | 23:10 |
xnox | .. | 23:10 |
xnox | dstokes: so when job under test fails (stopping test) monitor kicks in. If the job is getting normally stopped or reached respawn limit, the monitor will be stopped. | 23:11 |
xnox | dstokes: however, if respawn succeeds and the job stays alive for 2 seconds then in /var/log/upstart/monitor.log i get notification that respawn was successful. | 23:12 |
xnox | dstokes: but the "sleep 2" needs to be adjusted. You could do better without a sleep | 23:12 |
dstokes | clever.. | 23:13 |
xnox | dstokes: e.g. "stop on stopped test or started test" | 23:13 |
xnox | dstokes: and then instead of script, you'd have a post-stop script -> which checks the stop event environment. | 23:13 |
xnox | dstokes: if the reason for getting stopped is "started test" it means a successful respawn happened. | 23:13 |
xnox | let me code that. | 23:14 |
xnox | dstokes: nah, needs sleep / appropriate matching for respawn limit none-the-less. | 23:21 |
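
For reference, the alternative xnox describes at 22:47 (drop the "respawn" stanza from the main job and let a second job react to its failure) could look roughly like the sketch below. This is only a sketch under assumptions: the file name mainjob-monitor.conf and the job name mainjob are hypothetical placeholders, and it relies on the `stopped` event carrying RESULT, PROCESS, EXIT_STATUS and EXIT_SIGNAL as shown in the log at 22:46 when the main job has no "respawn" stanza. As dstokes points out at 22:48, there is no respawn limit here; that would have to be added by hand.

```
# /etc/init/mainjob-monitor.conf  (hypothetical file name)
# Assumes the main job is called "mainjob" and has NO "respawn" stanza,
# so that the stopped event still carries the failure details.

# Fire only when the main process itself failed. A manual "stop mainjob"
# is expected to report RESULT=ok and so should not match.
start on stopped JOB=mainjob RESULT=failed PROCESS=main

# Run once per matching event, then exit.
task

script
    # The event's variables are exported into this job's environment.
    # Only one of EXIT_STATUS / EXIT_SIGNAL is normally set.
    logger -t mainjob-monitor "mainjob crashed: status=${EXIT_STATUS:-none} signal=${EXIT_SIGNAL:-none}"

    # A notification (for example the curl call dstokes mentions) would go here.

    # "Respawn" the main job by hand. Nothing here limits the restart rate,
    # so a crash loop will restart forever unless a limit is added.
    start --no-wait mainjob
end script
```

Compared with the sleep-based monitor above, this trades Upstart's built-in respawn limit for an event that says why the process died.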