Monit not restarting a service reliably

Jan Rychter
Hi,

I'm looking for help, because I can't figure out what I'm doing wrong. I have a simple monit setup, which is supposed to monitor a web server and restart it if anything seems wrong.

This seems to work, but not always. Monit restarts the service on the first failure, but on subsequent failures it only logs that the service isn't working and takes no further action.

Example from the log, where the service was restarted, but went down again, and monit didn't do anything:

[CEST May 31 06:44:11] info     : 'triac.mysite.com' Monit 5.16 started
[CEST May 31 09:36:29] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:37:39] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:37:39] info     : 'mysite.com' exec: /usr/bin/supervisorctl
[CEST May 31 09:38:49] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:39:59] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:41:09] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:42:19] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:43:29] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:44:39] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:45:50] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:47:00] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
[CEST May 31 09:48:10] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable

The net result is that the service doesn't work and monit just sits there, knowing that the service failed the protocol test, but doing nothing about it.

I suspect this is because monit never sees the service as OK in the brief moment after the restart, so it never registers another transition from OK to failed.

Here is the relevant part of the configuration (nearly all of it):

set daemon 60
check host mysite.com with address mysite.com
if failed
  port 443
  protocol https
  with ssl options {verify: enable}
  for 2 cycles
then exec "/usr/bin/supervisorctl restart mysite"
if 20 restarts within 60 cycles then unmonitor

Is there a way to achieve unconditional actions? E.g. "even though I haven't seen the service transition from failed to working, restart it anyway after 60 seconds if it is still in the failed state".

Any help would be much appreciated.

--J.


--
To unsubscribe:
https://lists.nongnu.org/mailman/listinfo/monit-general
Re: Monit not restarting a service reliably

martinp@tildeslash.com
Hi,

Since monit 5.16.0, the exec action is executed only on a state change. In your case the service never transitioned back to the "succeeded" state, so the exec action wasn't repeated.

If you want to retry the exec action while the service remains in the failed state, you can use the "repeat" option.

Here is the relevant entry from the monit 5.16.0 changelog, which provides more details:

--8<--
New: The exec action is now executed only once, on state change, same way as the alert
action. The new "repeat" option allows to repeat the exec action after given number of
cycles if the error persists.  Syntax:
        if <test> then exec <script> repeat every <x> cycles
If you want to get the old behaviour, use "repeat every 1 cycle". Example:
        if failed port 1234 then exec "/usr/bin/myscript.sh" repeat every 5 cycles
--8<--
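
Applied to the configuration from the original post, the "repeat" option might look roughly like the sketch below. It reuses the host, port and supervisorctl command from that post. The interval of 1 cycle is an assumption, chosen to match the "restart it anyway after 60 seconds" request given "set daemon 60"; a larger value would give the service more time to come up between attempts. Only the connection test is shown.

```
# Sketch only, not tested against a live monit instance.
set daemon 60    # one cycle = 60 seconds

check host mysite.com with address mysite.com
  # "repeat every 1 cycle" is an assumed interval: the restart command
  # is run again on each cycle for as long as the test keeps failing.
  if failed
    port 443
    protocol https
    with ssl options {verify: enable}
    for 2 cycles
  then exec "/usr/bin/supervisorctl restart mysite" repeat every 1 cycle
```

The control file syntax can be checked with "monit -t" before reloading monit.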

Best regards,
Martin


> On 31 May 2019, at 19:14, Jan Rychter <[hidden email]> wrote:
> [...]

Re: Monit not restarting a service reliably

Jan Rychter
Hi,

Thanks for the information; this is exactly what I needed. This setting will make monit work for me again :-)

best regards,
--Jan

> On 2019-06-03, at 08:37, [hidden email] wrote:
> [...]