monit doesn't run stop action

Marc Rossi
I'm looking through the source right now, but figured I'd throw it out to the list to see if this is something obvious I'm doing wrong.

Long-time monit user, but on a few of our apps we have recently been having problems with the shutdown action possibly not running.

For the app that DOES shut down properly logs show the following:

[CST Mar  4 17:00:02] info     : 'foo' stop on user request
[CST Mar  4 17:00:02] info     : Monit daemon with PID 17733 awakened
[CST Mar  4 17:00:02] info     : Awakened by User defined signal 1
[CST Mar  4 17:00:02] info     : 'foo' stop: '/usr/bin/pkill -u nobody -f /usr/local/bin/foo.py'
[CST Mar  4 17:00:02] info     : 'foo' stop action done

For the app that is not stopping properly logs show the following:

[CST Mar  4 15:15:01] info     : 'bar' stop on user request
[CST Mar  4 15:15:01] info     : Monit daemon with PID 17733 awakened
[CST Mar  4 15:15:01] info     : Awakened by User defined signal 1
[CST Mar  4 15:15:01] info     : 'bar' stop action done

Could be a red herring, but where is the 'stop:' line (the one showing the stop command) in the second log excerpt? The shutdown commands are indeed different between foo & bar, but I would still expect to see the stop action listed.

TIA
Marc


--
To unsubscribe:
https://lists.nongnu.org/mailman/listinfo/monit-general

Re: monit doesn't run stop action

martinp@tildeslash.com
Hi,

could you please post the configuration of the "foo" and "bar" services?

Possible reasons include, for example:

1.) the "bar" service is a process check and monit detected that the process is not running - in this case it takes a fast path and the stop program is skipped (there is nothing to stop)

2.) there was a bug when "check program" was used in combination with the "every" statement ... fixed in monit 5.25.3: https://bitbucket.org/tildeslash/monit/issues/759
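
To check the first possibility, one could look at how monit currently sees the service before issuing the stop. This is only a sketch: it assumes a monit version whose CLI accepts a service name for "status" and provides "procmatch", and the MONIT override variable and the function name are made up for illustration.

```shell
#!/bin/sh
# MONIT can be overridden (e.g. MONIT=echo) for a dry run; this variable is an assumption.
MONIT="${MONIT:-monit}"

diagnose_stop() {
    name=$1
    pattern=$2
    # Show the service's current state, e.g. whether it is monitored at all.
    $MONIT status "$name"
    # Ask monit which running processes the "matching" pattern selects.
    $MONIT procmatch "$pattern"
}
```

Usage would be something like: diagnose_stop bar '^/usr/local/bin/bar'. If the status shows the service as not monitored, or the pattern matches nothing, a skipped stop program would be consistent with reason 1.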

Best regards,
Martin



Re: monit doesn't run stop action

Marc Rossi
Yeah, I was looking through the code and saw the call that checks whether the process is running before issuing stop (ProcessTree_findProcess), so that was the only thought I had as well.

check process foo matching /usr/local/bin/foo.py
      start program = "/bin/bash -l -c 'nohup /usr/local/bin/foo.py &'" as uid "nobody"
      stop program = "/usr/bin/pkill -u nobody -f /usr/local/bin/foo.py" as uid "nobody"
      if uptime > 11 hours then alert
      if uptime > 12 hours then exec "/usr/bin/pkill -u nobody -f -9 /usr/local/bin/foo.py" as uid "nobody"
      if 2 restarts within 3 cycles then timeout
      group apps
      depends foo.py

check process bar matching ^/usr/local/bin/bar
      start program = "/bin/bash -lc 'HOME=/home/someuser nohup /usr/local/bin/bar.sh > /tmp/bar-startup.out 2>&1 &'"
      stop program = "/bin/bash -c '/usr/bin/pkill -f ^/usr/local/bin/bar; sleep 1; /usr/bin/pkill -f ^/usr/local/bin/bar'"
      onreboot nostart
      if uptime > 12 hours then exec "/usr/bin/pkill -9 -f ^/usr/local/bin/bar"
      group apps
      mode passive

Here are the logs from yesterday and today regarding "bar":

[CST Mar  1 15:15:01] info     : 'bar' stop action done
[CST Mar  4 07:02:01] info     : 'bar' start on user request
[CST Mar  4 07:02:01] info     : 'bar' start action done
[CST Mar  4 07:02:01] error    : 'bar' uptime test failed for /usr/local/bin/bar-- current uptime is 259177 seconds
<we get above since it failed to shutdown on 3/1>
[CST Mar  4 07:02:01] info     : 'bar' exec: '/usr/bin/pkill -9 -f /usr/local/bin/bar'
[CST Mar  4 07:02:21] error    : 'bar' process is not running
<above line repeats every 20 seconds until we manually start it via monit>
[CST Mar  4 07:51:11] info     : 'bar' start: '/bin/bash -lc HOME=/home/someuser nohup /usr/local/bin/bar.sh > /tmp/bar-startup.out 2>&1 &'
[CST Mar  4 07:51:11] info     : 'bar' start action done
[CST Mar  4 07:51:11] info     : 'bar' process is running with pid 4897
[CST Mar  4 07:51:11] info     : 'bar' uptime test succeeded [current uptime = 1 seconds]
[CST Mar  4 15:15:01] info     : 'bar' stop on user request
[CST Mar  4 15:15:01] info     : 'bar' stop action done
<below same thing repeats itself the following morning>
[CST Mar  5 07:02:01] info     : 'bar' start on user request
[CST Mar  5 07:02:01] info     : 'bar' start action done
[CST Mar  5 07:02:01] error    : 'bar' uptime test failed for /usr/local/bin/bar-- current uptime is 83451 seconds
[CST Mar  5 07:02:01] info     : 'bar' exec: '/usr/bin/pkill -9 -f /usr/local/bin/bar'

Thanks again for looking. Worst case I'll just build a debug version of monit with some extra logging to see what is going on.




Re: monit doesn't run stop action

Szépe Viktor
BTW, it is highly dangerous to monitor interpreted software without a pid file in Monit: a "matching" pattern can select a different process than the one you intended, and you may run into some unwanted incidents.

Try implementing a pid file in your scripts.
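
One common way to do that is a small wrapper that records its own PID and then execs the real program, so the PID in the file stays valid. This is a minimal sketch, not something from the configuration above: the pid file path, the PIDFILE variable and the function names are assumptions, and the path has to be writable by the user the service runs as.

```shell
#!/bin/sh
# Hypothetical wrapper script; PIDFILE and its default path are assumptions.
PIDFILE="${PIDFILE:-/var/run/foo.pid}"

write_pidfile() {
    # $$ is the PID of this shell; after the exec below, foo.py keeps that PID.
    echo "$$" > "$1"
}

start_foo() {
    write_pidfile "$PIDFILE" || exit 1
    # exec replaces the shell with the program instead of forking a child.
    exec /usr/local/bin/foo.py
}
```

The wrapper would call start_foo as its last step, and the Monit side could then use "check process foo with pidfile /var/run/foo.pid" instead of "matching".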

All the best!


SZÉPE Viktor, Running your application
https://github.com/szepeviktor/debian-server-tools/blob/master/CV.md
~~~
hotline: +36-20-4242498  [hidden email]  skype: szepe.viktor
Budapest, 3rd district

Re: monit doesn't run stop action

Marc Rossi
Agreed on the pidfile stuff (and we have run into those "somewhat unwanted incidents" by not using them). We usually do use them, but sometimes it is out of my control and you can't fight 'em all.


Re: monit doesn't run stop action

Marc Rossi
Just to follow up: I figured out what was causing the shutdown issue. The process giving me shutdown issues (foo) depends on a different process (bar) whose startup/shutdown I do not control. So the config looks as follows:

   check process foo
       ...
       depends bar

Before "bar" is shut down by a method outside of my control, I issue "monit unmonitor bar". What I was unaware of is that issuing this command on the "bar" process results in it being issued internally for all other processes that depend on it. A minute later, when I issue the "monit stop foo" command, it does nothing because monit no longer believes the "foo" process is running.

I would argue that in this situation monit should perform the "stop program" action for safety instead of the "unmonitor" action, as "foo" shouldn't be running if "bar" isn't, and monit no longer knows whether "bar" is running.

So my options are to either flip the steps (stop foo, then unmonitor bar) or just remove the dependency. I'll probably go with the first option, as the second could give us some bad outcomes.
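
The flipped order could be scripted roughly like this. It is only a sketch: the MONIT override variable and the function name are assumptions for illustration, and since the monit daemon carries out the actions after the CLI hands them over, it may be worth confirming (e.g. with "monit summary") that foo is really stopped before the second step.

```shell
#!/bin/sh
# MONIT can be overridden (e.g. MONIT=echo) for a dry run; this variable is an assumption.
MONIT="${MONIT:-monit}"

shutdown_sequence() {
    # 1. Stop foo while monit still monitors it, so its stop program actually runs.
    $MONIT stop foo || return 1
    # 2. Only then take bar out of monitoring, before it is shut down externally.
    $MONIT unmonitor bar
}
```

Calling shutdown_sequence replaces the old order, in which the unmonitor on bar was issued first and was propagated to foo.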

Marc


Re: monit doesn't run stop action

Lutz Mader
Hello Marc,
yes, this is ugly, but this is how Monit works.
Unfortunately, Monit handles stopped resources like unmonitored resources,
and vice versa. "Not monitored" resources are ignored by Monit; only
"start" or "monitor" commands are honored.

> Before "bar" is shutdown by a method outside of my control I issue a "monit
> unmonitor bar". What I was unaware of is issuing this command on the "bar"
> process results in it being issued internally for all other processes that
> are dependent on it. A minute later when I issue the "monit stop foo"
> command it does nothing as it no longer believes the "foo" process is
> running.

To fix problems like this, I remove the dependencies for some resources
or add additional resources to define implicit dependencies. I also
check the application log files to find applications/resources that are
started/stopped by other tools.

The old way:
"proc3" depends on "proc2", which depends on "proc1"

check process proc3
   if failed port 1234 then alert
   depends proc2

My new way:
"proc1"
"proc2" depends on "host1"
"proc3" depends on "host2"

check process proc3
   depends host2

check host host2
   if failed port 1234 then alert

As long as nobody disables the host check, the applications/resources that
depend on the host/port are ready to start once the port becomes available.
To start one or all applications/resources I use "monitor", and Monit
will start the applications/resources step by step. "stop" is used
to stop the applications/resources, but I never stop the host checks.

A suggestion only,
Lutz
