Hi!

I followed the tutorial on shinken.io to turn Shinken into a high-availability supervision system. It seems to be working, but I have warnings and errors in my logs and I am not sure whether they are relevant. I have not yet dealt with the pending-state issue (storing the pending states in MongoDB and using the retention-mongodb module).

My infrastructure:
centos-shinken1 IP XXX.XXX.XXX.XXX, master arbiter
ubuntu-server YYY.YYY.YYY.YYY, just a server to monitor (I will also install MongoDB on it to fix the pending-state issue)
centos-shinken2 ZZZ.ZZZ.ZZZ.ZZZ, spare arbiter

So, I followed the tutorial, and failover seems to work: when I stop the Shinken service on centos-shinken1, the WebUI on centos-shinken1 becomes unavailable and the one on centos-shinken2 becomes available. When I then restart Shinken on centos-shinken1, the WebUI on centos-shinken2 automatically becomes unavailable and the one on centos-shinken1 becomes available again.

I am not sure HA is working properly, though, because I should see a "Dispatch OK" in the logs. That is not the case, and the log files are filled with warnings and errors:

- root@centos-shinken2 $ cat /var/log/shinken/arbiter.log
Code:
....
2014-07-01 16:02:58,481 [1404223378] Error :  Failed sending configuration for scheduler-master: Connexion error to http://XX.XX.XX.XX:7768/ : couldn't connect to host
2014-07-01 16:02:58,482 [1404223378] Warning : [All] configuration dispatching error for scheduler scheduler-master
2014-07-01 16:02:58,719 [1404223378] Error :  [All] Dispatching failed for receiver receiver-master
2014-07-01 16:03:58,871 [1404223438] Warning : Add failed attempt to scheduler-master (2/3) Connexion error to http://XX.XX.XX.XX:7768/ : couldn't connect to host
2014-07-01 16:03:58,875 [1404223438] Warning : Add failed attempt to reactionner-master (2/3) Connexion error to http://XX.XX.XX.XX:7769/ : couldn't connect to host
2014-07-01 16:03:58,882 [1404223438] Warning : Add failed attempt to poller-master (2/3) Connexion error to http://XX.XX.XX.XX:7771/ : couldn't connect to host
2014-07-01 16:03:58,883 [1404223438] Warning : Add failed attempt to broker-master (2/3) Connexion error to http://XX.XX.XX.XX:7772/ : couldn't connect to host
2014-07-01 16:03:58,887 [1404223438] Warning : Add failed attempt to receiver-master (2/3) Connexion error to http://XX.XX.XX.XX:7773/ : couldn't connect to host
2014-07-01 16:04:59,033 [1404223499] Warning : Add failed attempt to scheduler-master (3/3) Connexion error to http://XX.XX.XX.XX:7768/ : couldn't connect to host
2014-07-01 16:04:59,033 [1404223499] Warning : Setting the satellite scheduler-master to a dead state.
2014-07-01 16:04:59,038 [1404223499] Warning : Add failed attempt to reactionner-master (3/3) Connexion error to http://XX.XX.XX.XX:7769/ : couldn't connect to host
2014-07-01 16:04:59,038 [1404223499] Warning : Setting the satellite reactionner-master to a dead state.
2014-07-01 16:04:59,045 [1404223499] Warning : Add failed attempt to poller-master (3/3) Connexion error to http://XX.XX.XX.XX:7771/ : couldn't connect to host
2014-07-01 16:04:59,045 [1404223499] Warning : Setting the satellite poller-master to a dead state.
2014-07-01 16:04:59,047 [1404223499] Warning : Add failed attempt to broker-master (3/3) Connexion error to http://XX.XX.XX.XX:7772/ : couldn't connect to host
2014-07-01 16:04:59,048 [1404223499] Warning : Setting the satellite broker-master to a dead state.
2014-07-01 16:04:59,052 [1404223499] Warning : Add failed attempt to receiver-master (3/3) Connexion error to http://XX.XX.XX.XX:7773/ : couldn't connect to host
2014-07-01 16:04:59,052 [1404223499] Warning : Setting the satellite receiver-master to a dead state.
2014-07-01 16:54:21,162 [1404226461] Warning : Printing stored debug messages prior to our daemonization

- root@centos-shinken1 $ cat /var/log/shinken/arbiter.log
Code:
2014-07-01 16:33:47,466 [1404225227] Warning : Add failed attempt to scheduler-spare (2/3) Connexion error to http://ZZ.ZZ.ZZ.ZZ:7768/ : couldn't connect to host
2014-07-01 16:33:47,470 [1404225227] Warning : Add failed attempt to reactionner-spare (2/3) Connexion error to http://ZZ.ZZ.ZZ.ZZ:7769/ : couldn't connect to host
2014-07-01 16:33:47,471 [1404225227] Warning : Add failed attempt to poller-spare (2/3) Connexion error to http://ZZ.ZZ.ZZ.ZZ:7771/ : couldn't connect to host
2014-07-01 16:33:47,477 [1404225227] Warning : Add failed attempt to broker-spare (2/3) Connexion error to http://ZZ.ZZ.ZZ.ZZ:7772/ : couldn't connect to host
2014-07-01 16:33:47,481 [1404225227] Warning : Add failed attempt to receiver-spare (2/3) Connexion error to http://ZZ.ZZ.ZZ.ZZ:7773/ : couldn't connect to host
2014-07-01 16:33:47,483 [1404225227] Warning : Add failed attempt to arbiter-spare (2/3) Connexion error to http://ZZ.ZZ.ZZ.ZZ:7770/ : couldn't connect to host
2014-07-01 16:33:47,484 [1404225227] Error :  Failed sending configuration for arbiter-spare: Connexion error to http://ZZ.ZZ.ZZ.ZZ:7770/ : couldn't connect to host
2014-07-01 16:33:48,542 [1404225228] Error :  Failed sending configuration for arbiter-spare: Connexion error to http://ZZ.ZZ.ZZ.ZZ:7770/ : couldn't connect to host
2014-07-01 16:33:49,599 [1404225229] Error :  Failed sending configuration for arbiter-spare: Connexion error to http://ZZ.ZZ.ZZ.ZZ:7770/ : couldn't connect to host
2014-07-01 16:33:50,656 [1404225230] Error :  Failed sending configuration for arbiter-spare: Connexion error to http://ZZ.ZZ.ZZ.ZZ:7770/ : couldn't connect to host
2014-07-01 16:33:51,713 [1404225231] Error :  Failed sending configuration for arbiter-spare: Connexion error to http://ZZ.ZZ.ZZ.ZZ:7770/ : couldn't connect to host
... errors till the end of the log file
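
To make the two excerpts above easier to compare, here is a minimal sketch that tallies the failed attempts and "dead state" messages per daemon. The regular expressions are written against the log lines quoted in this post only, so they are an assumption about the format and may need adjusting for other Shinken versions. The function name `summarize` is just illustrative.

```python
import re
from collections import defaultdict

# Patterns modelled on the arbiter log lines quoted in this post.
FAILED = re.compile(r"Add failed attempt to (\S+) \((\d+)/(\d+)\)")
DEAD = re.compile(r"Setting the satellite (\S+) to a dead state")
URL = re.compile(r"Connexion error to (\S+) :")


def summarize(lines):
    """Return {daemon: {"attempts": int, "dead": bool, "url": str or None}}."""
    summary = defaultdict(lambda: {"attempts": 0, "dead": False, "url": None})
    for line in lines:
        failed = FAILED.search(line)
        if failed:
            entry = summary[failed.group(1)]
            # Keep the highest attempt counter seen for this daemon.
            entry["attempts"] = max(entry["attempts"], int(failed.group(2)))
            url = URL.search(line)
            if url:
                entry["url"] = url.group(1)
        dead = DEAD.search(line)
        if dead:
            summary[dead.group(1)]["dead"] = True
    return dict(summary)


# Two lines copied from the centos-shinken2 excerpt, as a small demo.
demo = [
    "2014-07-01 16:04:59,033 [1404223499] Warning : Add failed attempt to "
    "scheduler-master (3/3) Connexion error to http://XX.XX.XX.XX:7768/ : "
    "couldn't connect to host",
    "2014-07-01 16:04:59,033 [1404223499] Warning : Setting the satellite "
    "scheduler-master to a dead state.",
]
for name, info in sorted(summarize(demo).items()):
    state = "dead" if info["dead"] else "failing"
    print(name, state, "attempts=%d" % info["attempts"], info["url"])
```

To run it on a real file, pass an open file object instead of the demo list, for example `summarize(open("/var/log/shinken/arbiter.log"))`.
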
- arbiter-master.cfg
Code:
define arbiter {
  arbiter_name  arbiter-master
  host_name    centos-shinken1    ; CHANGE THIS if you have several Arbiters
  address     XX.XX.XX.XX  ; DNS name or IP
  port      7770
  spare      0      ; 1 = is a spare, 0 = is not a spare

  ## Interesting modules:
  # - named-pipe       = Open the named pipe nagios.cmd
  # - mongodb         = Load hosts from a mongodb database
  # - PickleRetentionArbiter = Save data before exiting
  # - nsca          = NSCA server
  # - VMWare_auto_linking   = Lookup at Vphere server for dependencies
  # - import-glpi       = Import configuration from GLPI (need plugin monitoring for GLPI in server side)
  # - TSCA          = TSCA server
  # - MySQLImport       = Load configuration from a MySQL database
  # - ws-arbiter       = WebService for pushing results to the arbiter
  # - Collectd        = Receive collectd perfdata
  # - SnmpBooster       = Snmp bulk polling module, configuration linker
  # - import-landscape		= Import hosts from Landscape (Ubuntu/Canonical management tool)
  # - AWS			= Import hosts from Amazon AWS (here EC2)
  # - ip-tag			= Tag an host based on it's IP range
  # - FileTag			= Tag an host if it's on a flat file
  # - CSVTag			= Tag an host from the content of a CSV file

  modules  	 named-pipe
  #modules   named-pipe, mongodb, nsca, VMWare_auto_linking, ws-arbiter, Collectd, import-landscape, SnmpBooster, AWS

  # Enable https or not
  use_ssl	     0
  # enable certificate/hostname check, will avoid man in the middle attacks
  hard_ssl_name_check  0

  ## Uncomment these lines in a HA architecture so the master and slaves know
  ## how long they may wait for each other.
  timeout       3  ; Ping timeout
  data_timeout     120 ; Data send timeout
  max_check_attempts  3  ; If ping fails N or more, then the node is dead
  check_interval    60 ; Ping node every N seconds
}
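
As a sanity check on the HA timing values in this file, here is a rough sketch of how long it should take before a node is declared dead. It assumes one ping per `check_interval` and that the node is marked dead on the `max_check_attempts`-th consecutive failure. That is my reading of the comments in the config, not something I have confirmed in the Shinken source. The helper name `seconds_until_dead` is made up for the example.

```python
def seconds_until_dead(check_interval, max_check_attempts):
    """Approximate delay between the first failed ping and the dead state.

    Assumption: one ping every check_interval seconds, and the node is
    marked dead on the max_check_attempts-th consecutive failure.
    """
    return check_interval * (max_check_attempts - 1)


# Values from the arbiter definition: check_interval 60, max_check_attempts 3.
print(seconds_until_dead(60, 3))  # prints 120
```

This matches the centos-shinken2 log: the first error is at 16:02:58 and the "dead state" messages appear at 16:04:59, about 120 seconds later.
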
- arbiter-spare.cfg
Code:
define arbiter {
  arbiter_name  arbiter-spare
  host_name    centos-shinken2    ; CHANGE THIS if you have several Arbiters
  address     ZZ.ZZ.ZZ.ZZ  ; DNS name or IP
  port      7770
  spare      1      ; 1 = is a spare, 0 = is not a spare

  ## Interesting modules:
  # - named-pipe       = Open the named pipe nagios.cmd
  # - mongodb         = Load hosts from a mongodb database
  # - PickleRetentionArbiter = Save data before exiting
  # - nsca          = NSCA server
  # - VMWare_auto_linking   = Lookup at Vphere server for dependencies
  # - import-glpi       = Import configuration from GLPI (need plugin monitoring for GLPI in server side)
  # - TSCA          = TSCA server
  # - MySQLImport       = Load configuration from a MySQL database
  # - ws-arbiter       = WebService for pushing results to the arbiter
  # - Collectd        = Receive collectd perfdata
  # - SnmpBooster       = Snmp bulk polling module, configuration linker
  # - import-landscape		= Import hosts from Landscape (Ubuntu/Canonical management tool)
  # - AWS			= Import hosts from Amazon AWS (here EC2)
  # - ip-tag			= Tag an host based on it's IP range
  # - FileTag			= Tag an host if it's on a flat file
  # - CSVTag			= Tag an host from the content of a CSV file

  modules  	 named-pipe
  #modules   named-pipe, mongodb, nsca, VMWare_auto_linking, ws-arbiter, Collectd, import-landscape, SnmpBooster, AWS

  # Enable https or not
  use_ssl	     0
  # enable certificate/hostname check, will avoid man in the middle attacks
  hard_ssl_name_check  0

  ## Uncomment these lines in a HA architecture so the master and slaves know
  ## how long they may wait for each other.
  timeout       3  ; Ping timeout
  data_timeout     120 ; Data send timeout
  max_check_attempts  3  ; If ping fails N or more, then the node is dead
  check_interval    60 ; Ping node every N seconds
}
About the connection errors: when Shinken is running, the ports are all open (checked with nmap). That is why I don't understand the logs.
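
For anyone who wants to reproduce the port check without nmap, here is a minimal sketch using plain TCP connections. The port numbers are the ones that appear in the arbiter logs above. The names `DAEMON_PORTS`, `port_is_open` and `check_host` are made up for the example. It only tests whether a TCP connection can be opened, not whether the daemon answers correctly.

```python
import socket

# Daemon ports as they appear in the arbiter logs in this post.
DAEMON_PORTS = {
    "scheduler": 7768,
    "reactionner": 7769,
    "arbiter": 7770,
    "poller": 7771,
    "broker": 7772,
    "receiver": 7773,
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_host(host, ports=DAEMON_PORTS, timeout=3.0):
    """Map each daemon name to True/False for the given host."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}


# Example (replace the placeholder with the real address of the other node):
# print(check_host("ZZ.ZZ.ZZ.ZZ"))
```

It is probably most meaningful when run from the other node (for example from centos-shinken1 against the address of centos-shinken2), since that is the direction in which the arbiter connects according to the logs.
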

What do you think? Can you help me?

If needed, I can provide log and/or cfg files.

tOHTor


Edit: here are my log files. I now have a poller named "poller-2" running on centos-shinken3 with IP WW.WW.WW.WW.

- centos-shinken1, "arbiterd.log"
http://pastebin.com/X8Pm8ANS

- centos-shinken2, "arbiterd.log"
http://pastebin.com/0ZKvPYeb

- centos-shinken1, "schedulerd.log"
http://pastebin.com/WZpSYkhH

- centos-shinken2, "schedulerd.log"
http://pastebin.com/WF7MuZDL

- centos-shinken3, "pollerd.log"
http://pastebin.com/7QuZkkne

What I really don't understand is all the errors and warnings (saying "Dispatch failed") in the arbiterd.log files, even though HA seems to be working when I stop Shinken on centos-shinken1. I have never had a "Dispatch OK" in my log files, even though failover seems to work.