Hi there,

I'm running Shinken 1.4.1 and use two Shinken masters for HA.
One master (FOO) is usually the active one and BAR is the spare, waiting for the active one to die.

I set the timeout for each Shinken master daemon to 3 seconds. With 3 check attempts, a dead daemon should be declared unreachable after about 9 seconds:
Code:
define scheduler {
 data_timeout 3
 check_interval 30
 weight 2
 skip_initial_broks 0
 modules RedisRetention_bs
 spare 0
 timeout 3
 address FOO
 scheduler_name scheduler-FOO
 max_check_attempts 3
 realm All
 port 7768
}
This is what happens when I reboot the active master:
  • Quite quickly the spare detects that the other master is dead and tries to dispatch its config
  • This results in a sequential pinging of each Shinken daemon listed in shinken-specific.cfg, which includes the daemons on the previous master FOO!
  • Each attempt to reach a daemon on the dead master then stalls for about one minute, twice in a row, for some reason (see the log below)


Code:
2014-06-23 15:54:59,442 [1403531699] Info :  Arbiter Master is dead. The arbiter Arbiter-Master-itinfra-mon-bap01 take the lead
2014-06-23 15:54:59,442 [1403531699] Info :  Begin to dispatch configurations to satellites
2014-06-23 15:54:59,442 [1403531699] Info :  Pinging scheduler-FOO
2014-06-23 15:54:59,444 [1403531699] Info :   (PYROLOC://FOO:7768/ForArbiter)
2014-06-23 15:56:02,540 [1403531762] Warning : Add failed attempt to scheduler-FOO (1/3) connection failed
2014-06-23 15:57:05,645 [1403531825] Info :  Pinging scheduler-satellite2
2014-06-23 15:57:05,647 [1403531825] Info :   (PYROLOC://satellite2:7768/ForArbiter)

2014-06-23 15:57:07,216 [1403531827] Info :  Pinging reactionner-FOO
2014-06-23 15:57:07,217 [1403531827] Info :   (PYROLOC://FOO:7769/ForArbiter)
2014-06-23 15:58:10,348 [1403531890] Warning : Add failed attempt to reactionner-FOO (1/3) connection failed
2014-06-23 15:59:13,453 [1403531953] Info :  Pinging reactionner-BAR
2014-06-23 15:59:13,453 [1403531953] Info :   (PYROLOC://BAR:7769/ForArbiter)
2014-06-23 15:59:13,455 [1403531953] Info :  Pinging poller-FOO
2014-06-23 15:59:13,456 [1403531953] Info :   (PYROLOC://FOO:7771/ForArbiter)
2014-06-23 16:00:16,556 [1403532016] Warning : Add failed attempt to poller-FOO (1/3) connection failed
2014-06-23 16:01:19,661 [1403532079] Info :  Pinging poller-satellite2
This goes on for each Shinken daemon on the dead master, which results in a long, unnecessary downtime for the monitoring service.
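
To put numbers on it, here is a small Python sketch that reads the epoch timestamps (the bracketed values) from the scheduler-FOO lines of the log above and compares the stall with the 9 seconds I expected from the config. The helper names (`epochs`, `gaps`) are only for illustration and are not part of Shinken.

```python
import re

# Excerpt of the arbiter log above (scheduler-FOO lines only).
LOG = """\
2014-06-23 15:54:59,442 [1403531699] Info :  Pinging scheduler-FOO
2014-06-23 15:54:59,444 [1403531699] Info :   (PYROLOC://FOO:7768/ForArbiter)
2014-06-23 15:56:02,540 [1403531762] Warning : Add failed attempt to scheduler-FOO (1/3) connection failed
2014-06-23 15:57:05,645 [1403531825] Info :  Pinging scheduler-satellite2
"""

EPOCH = re.compile(r"\[(\d+)\]")


def epochs(log):
    """Return the bracketed epoch timestamp of every non-empty log line."""
    return [int(EPOCH.search(line).group(1))
            for line in log.splitlines() if line.strip()]


def gaps(log):
    """Return the delay in seconds between consecutive log lines."""
    ts = epochs(log)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


# Values taken from the scheduler definition above.
timeout = 3
max_check_attempts = 3
expected = timeout * max_check_attempts   # what I expected per daemon
observed = sum(gaps(LOG))                 # what the log actually shows

print("gaps between lines:", gaps(LOG))
print(f"expected at most {expected} s, observed {observed} s for one daemon")
```

So a single unreachable daemon holds up the dispatch for two stalls of 63 seconds each, instead of the 9 seconds the config suggests.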

Is this 1-minute delay a generic Python timeout, or did I miss a config setting somewhere?