Thread: No Service Down Notification after Host Recovery

  1. #1
    Junior Member
    Join Date
    Sep 2012
    Posts
    15

    No Service Down Notification after Host Recovery

    I had a situation recently where a host went down and recovered, but not all of its services recovered. Shinken didn't send a service down alert, and we didn't know about the service outage until the customer reported it :x

    I found an issue logged on GitHub that may be related: https://github.com/naparuba/shinken/issues/857

    Is it a known issue that service down alerts don't get sent after a host down/recovery?

    Here is the alert history for this host:

    Timestamp Event Detail Service Message
    01/13/2014 18:18:54 Service Critical smtp-simple CRITICAL;SOFT;1;Connection refused
    01/13/2014 18:19:51 Host Down DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
    01/13/2014 18:20:10 Service Unknown smtp-simple UNKNOWN;SOFT;2;(Service Check Timed Out)
    01/13/2014 18:20:24 Host Down DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
    01/13/2014 18:21:05 Service Critical http-simple CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
    01/13/2014 18:21:23 Service Critical smtp-simple CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds
    01/13/2014 18:22:07 Host Up UP;HARD;2;PING OK - Packet loss = 0%, RTA = 31.83 ms
    01/13/2014 18:22:08 Service Critical http-simple CRITICAL;SOFT;2;Connection refused
    01/13/2014 18:23:22 Service Unknown http-simple UNKNOWN;HARD;3;(Service Check Timed Out)
    01/13/2014 18:25:27 Service Ok http-simple OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 955 bytes in 1.081 second response time
    01/13/2014 18:59:20 Service started flapping smtp-simple STARTED; Service appears to have started flapping (32.0% change >= 30.0% threshold)
    01/13/2014 19:28:26 Service stopped flapping smtp-simple STOPPED; Service appears to have stopped flapping (19.0% change < 20.0% threshold)
    01/13/2014 19:39:31 Service started flapping smtp-simple STARTED; Service appears to have started flapping (33.4% change >= 30.0% threshold)
    01/13/2014 22:24:12 Service stopped flapping smtp-simple STOPPED; Service appears to have stopped flapping (18.8% change < 20.0% threshold)
    01/13/2014 22:48:43 Service started flapping smtp-simple STARTED; Service appears to have started flapping (33.2% change >= 30.0% threshold)
    01/14/2014 02:40:31 Service stopped flapping smtp-simple STOPPED; Service appears to have stopped flapping (18.8% change < 20.0% threshold)
    01/14/2014 02:51:37 Service started flapping smtp-simple STARTED; Service appears to have started flapping (33.1% change >= 30.0% threshold)
    01/14/2014 08:12:07 Service Ok smtp-simple OK;HARD;3;SMTP OK - 0.204 sec. response time
    01/14/2014 08:17:01 Service Critical http-simple CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
    01/14/2014 08:17:16 Host Down DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
    01/14/2014 08:17:58 Host Down DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
    01/14/2014 08:18:15 Service Critical http-simple CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
    01/14/2014 08:18:27 Service Critical smtp-simple CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
    01/14/2014 08:19:30 Service Critical http-simple CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds
    01/14/2014 08:19:38 Service Critical smtp-simple CRITICAL;SOFT;2;Connection refused
    01/14/2014 08:19:38 Host Up UP;HARD;2;PING OK - Packet loss = 83%, RTA = 33.68 ms
    01/14/2014 08:20:52 Service Critical smtp-simple CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds
    01/14/2014 08:21:37 Service Ok http-simple OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 955 bytes in 4.338 second response time
    01/14/2014 08:22:57 Service Ok smtp-simple OK;HARD;3;SMTP OK - 0.463 sec. response time
    01/14/2014 08:54:51 Service Critical http-simple CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
    01/14/2014 08:55:55 Service Ok http-simple OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 955 bytes in 0.068 second response time
    01/14/2014 08:56:07 Service stopped flapping smtp-simple STOPPED; Service appears to have stopped flapping (17.8% change < 20.0% threshold)

  2. #2
    Administrator Frescha's Avatar
    Join Date
    May 2011
    Posts
    183

    Re: No Service Down Notification after Host Recovery

    • Shinken version (or, if you're pulling directly from the Git repo, your current commit SHA; use git rev-parse HEAD)?
    • OS version?

  3. #3
    Junior Member
    Join Date
    Sep 2012
    Posts
    15

    Re: No Service Down Notification after Host Recovery

    I suppose I should include those ;D

    - Shinken 1.2
    - CentOS 6.3

  4. #4
    Junior Member
    Join Date
    Sep 2012
    Posts
    15

    Re: No Service Down Notification after Host Recovery

    I've been digging into this issue some more, thinking it's perhaps a fault in my configuration, but I haven't had anything turn up wrong from what I can tell.

    I do receive notifications for service up/down. This problem only seems to occur when the host goes down and then recovers without its services. The services never send a down alert for the recovered host.

    I'm going to lab this with some more verbose logging to see if this is a dependency logic problem somewhere. Suggestions welcome!

  5. #5
    Junior Member
    Join Date
    Sep 2012
    Posts
    15

    Re: No Service Down Notification after Host Recovery

    I've been doing some more testing and have it nailed down to this:

    If the service goes into soft down before the host goes into hard down, a service down notification is not sent when the host recovers and that service remains down. Here are the two scenarios I tested:

    Scenario #1: host goes soft down first and then hard down, host recovers, service (smtp) remains down, and the alert gets sent

    [1389895447] HOST ALERT: SECCO-TEST-VM;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
    [1389895457] SERVICE ALERT: SECCO-TEST-VM;http-simple;CRITICAL;SOFT;1;No route to host
    [1389895464] HOST ALERT: SECCO-TEST-VM;DOWN;HARD;2;CRITICAL - Host Unreachable (10.10.1.224)
    [1389895464] HOST NOTIFICATION: UTIL-dlist;SECCO-TEST-VM;DOWN;notify-host-by-email;CRITICAL - Host Unreachable (10.10.1.224)
    [1389895513] SERVICE ALERT: SECCO-TEST-VM;smtp-simple;CRITICAL;SOFT;1;No route to host
    [1389895522] SERVICE ALERT: SECCO-TEST-VM;http-simple;CRITICAL;SOFT;2;No route to host
    [1389895581] SERVICE ALERT: SECCO-TEST-VM;smtp-simple;CRITICAL;SOFT;2;No route to host
    [1389895581] HOST ALERT: SECCO-TEST-VM;UP;HARD;2;PING OK - Packet loss = 80%, RTA = 1.21 ms
    [1389895581] HOST NOTIFICATION: UTIL-dlist;SECCO-TEST-VM;UP;notify-host-by-email;PING OK - Packet loss = 80%, RTA = 1.21 ms
    [1389895591] SERVICE ALERT: SECCO-TEST-VM;http-simple;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 1641 bytes in 3.391 second response time
    [1389895645] SERVICE ALERT: SECCO-TEST-VM;smtp-simple;CRITICAL;HARD;3;Connection refused
    [1389895645] SERVICE NOTIFICATION: UTIL-dlist;SECCO-TEST-VM;smtp-simple;CRITICAL;notify-service-by-email;Connection refused

    Scenario #2: service (smtp) goes soft down before the host goes hard down, host recovers, service (smtp) remains down, but no alert is sent

    [1389896265] SERVICE ALERT: SECCO-TEST-VM;smtp-simple;CRITICAL;SOFT;1;Connection refused
    [1389896339] SERVICE ALERT: SECCO-TEST-VM;smtp-simple;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
    [1389896340] SERVICE ALERT: SECCO-TEST-VM;http-simple;CRITICAL;SOFT;1;No route to host
    [1389896344] HOST ALERT: SECCO-TEST-VM;DOWN;SOFT;1;CRITICAL - Host Unreachable (10.10.1.224)
    [1389896354] HOST ALERT: SECCO-TEST-VM;DOWN;HARD;2;CRITICAL - Host Unreachable (10.10.1.224)
    [1389896354] HOST NOTIFICATION: UTIL-dlist;SECCO-TEST-VM;DOWN;notify-host-by-email;CRITICAL - Host Unreachable (10.10.1.224)
    [1389896406] SERVICE ALERT: SECCO-TEST-VM;smtp-simple;CRITICAL;HARD;3;No route to host
    [1389896406] SERVICE ALERT: SECCO-TEST-VM;http-simple;CRITICAL;SOFT;2;No route to host
    [1389896470] SERVICE ALERT: SECCO-TEST-VM;http-simple;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 1641 bytes in 0.005 second response time
    [1389896477] HOST ALERT: SECCO-TEST-VM;UP;HARD;2;PING OK - Packet loss = 0%, RTA = 1.46 ms
    [1389896477] HOST NOTIFICATION: UTIL-dlist;SECCO-TEST-VM;UP;notify-host-by-email;PING OK - Packet loss = 0%, RTA = 1.46 ms

    Logs taken from /usr/local/shinken/var/nagios.log
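
    For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that scans log lines in the format shown above and lists services that reached a HARD non-OK state with no SERVICE NOTIFICATION logged before their next OK (or before the end of the log). The function name and the matching rules are my own assumptions for illustration, not part of Shinken. A hit is not necessarily a bug, since a notification can also be suppressed for legitimate reasons (downtime, flapping, notification options), so treat the output as a list of things to look at.

```python
import re
import sys

# Matches the Nagios-style lines shown in this thread, e.g.
# [1389895645] SERVICE ALERT: HOST;service;CRITICAL;HARD;3;output
LINE_RE = re.compile(r"^\[(\d+)\] (SERVICE ALERT|SERVICE NOTIFICATION): (.*)$")


def find_unnotified_problems(lines):
    """Return sorted (host, service, timestamp) tuples for HARD non-OK
    service alerts with no problem notification before the next OK."""
    pending = {}  # (host, service) -> timestamp of the un-notified HARD alert
    missed = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # host alerts, flapping lines, etc. are ignored here
        ts, kind, fields = int(m.group(1)), m.group(2), m.group(3).split(";")
        if kind == "SERVICE ALERT":
            # Fields: host;service;state;state_type;attempt;output
            host, service, state, state_type = fields[:4]
            key = (host, service)
            if state == "OK":
                # Recovered without a problem notification ever being logged
                if key in pending:
                    missed.append((host, service, pending.pop(key)))
            elif state_type == "HARD" and key not in pending:
                pending[key] = ts
        else:
            # Fields: contact;host;service;state;command;output
            host, service, state = fields[1:4]
            if state != "OK":
                pending.pop((host, service), None)
    # Anything still pending at the end of the log was never notified
    missed.extend((h, s, ts) for (h, s), ts in pending.items())
    return sorted(missed, key=lambda item: item[2])


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is passed as the first argument
    with open(sys.argv[1]) as log:
        for host, service, ts in find_unnotified_problems(log):
            print(f"{host}/{service}: HARD problem at {ts}, no notification")
```

    Save it under any name you like and pass the log path (/usr/local/shinken/var/nagios.log here) as the first argument. Run against the two excerpts above, it should report nothing for scenario #1 and flag smtp-simple for scenario #2.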
