
Thread: 2 datacenters and distributed monitoring

  1. #1

    2 datacenters and distributed monitoring

    Hello,

    I'm currently working on a project that aims to migrate from Nagios to Shinken.
    We have two datacenters and we want to test the modular architecture provided by Shinken.
    I would like to ask your advice about the planned architecture:

    So far I have managed to set up 2 nodes as follows:

    node1 :
    arbiter (realm all, includes sitey & sitez)
    scheduler-1 (realm sitey)
    poller-1 (realm sitey)
    broker-1 (LiveStatus, ndoToMysql, NPCDMOD, Simplelog) (realm sitey)
    reactionner (realm all)
    receiver (to be honest, I don't know its purpose yet) (realm all)

    node2 :
    scheduler-2 (realm sitez)
    poller-2 (realm sitez)
    broker-2 (realm sitez)

    Apparently the configuration dispatching is working well. The realms are managing their own hosts.
    My last test consisted of killing node2 to see what happens: the arbiter failed to connect to scheduler-2, and all the hosts on node2/sitez disappeared from my Thruk UI.
    Is this the expected behavior?
    If I have a service that depends on hosts in both realms/sites, how is this managed? Is it still possible to set up business rules with hosts/services from both sites?

    I have just read that setting up two brokers (ndoToMysql) with one DB is a mistake; I'll change that on Monday to match this picture: http://wiki.monitoring-fr.org/_detai...d-architecture.

    NB: High availability will be handled by the VMware HA features. What do you think about that? Is it still advisable to set up Shinken spares?

    PS: I'm working for an old French firm, and I hope Shinken will be adopted. Today there's a single Nagios server on sitey.

    Thank you,
    Sam

  2. #2
    Administrator
    Join Date
    Dec 2011
    Posts
    278

    Re: Realms over 2 datacenters.

    You have misunderstood the purpose of Realms.

    Realms are for segregating processing for logical, physical, or commercial reasons. Read the documentation about Realms.

    In your case you are interested in redundancy, which happens WITHIN a given realm.

    Redo your configuration with a single realm with a master and a spare for each daemon type you wish.

    The Receiver's job is to receive passive input from protocols like NSCA, Web service (HTTP), TSCA (Thrift), etc. The same modules that can be defined on an Arbiter can also be defined on a Receiver. The Receiver can be configured to send the input directly to the correct Scheduler for correlation and processing, completely bypassing the Arbiter, whose job is purely administrative. For small installations you do not need a Receiver, but for load-balanced and HA setups you do.

    Cheers,

    xkilian
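
The single-realm advice above (one realm, a master and a spare for each daemon type) can be checked mechanically before touching the real configuration. The following is a minimal Python sketch, not Shinken code: the `Daemon` record, the `missing_redundancy` helper, the realm name `All` and the daemon/node names are hypothetical and only model the layout.

```python
from collections import namedtuple

# Hypothetical record for one daemon entry; this is not a Shinken API.
Daemon = namedtuple("Daemon", ["name", "kind", "node", "realm", "spare"])


def missing_redundancy(daemons, kinds):
    """Map each daemon kind to a problem string if it lacks a master or spare."""
    problems = {}
    for kind in kinds:
        masters = [d for d in daemons if d.kind == kind and not d.spare]
        spares = [d for d in daemons if d.kind == kind and d.spare]
        if not masters:
            problems[kind] = "no master"
        elif not spares:
            problems[kind] = "no spare"
        elif set(d.node for d in spares) <= set(d.node for d in masters):
            # A spare on the master's own node does not survive a node failure.
            problems[kind] = "spare shares a node with the master"
    return problems


KINDS = ["arbiter", "scheduler", "poller", "broker", "reactionner"]

# Assumed layout: masters on node1, spares on node2, everything in one realm.
layout = [
    Daemon("arbiter-1", "arbiter", "node1", "All", False),
    Daemon("arbiter-2", "arbiter", "node2", "All", True),
    Daemon("scheduler-1", "scheduler", "node1", "All", False),
    Daemon("scheduler-2", "scheduler", "node2", "All", True),
    Daemon("poller-1", "poller", "node1", "All", False),
    Daemon("poller-2", "poller", "node2", "All", True),
    Daemon("broker-1", "broker", "node1", "All", False),
    Daemon("broker-2", "broker", "node2", "All", True),
    # Deliberately left without a spare so the check reports something.
    Daemon("reactionner-1", "reactionner", "node1", "All", False),
]

print("realms in use:", sorted(set(d.realm for d in layout)))
print("problems:", missing_redundancy(layout, KINDS))
```

With this layout the only reported problem is the reactionner, which has no spare; adding one on node2 makes the result empty.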

  3. #3

    Re: Realms over 2 datacenters.

    Hi xkilian,

    Thanks for the answer. I'll redo the configuration without realms, as we don't need that kind of segmentation.
    A single realm should indeed be what we need.

    I'll use "poller_tags" to dispatch my hosts across the 2 DCs.

    I'll let you know.

  4. #4

    Re: Realms over 2 datacenters.

    I set up the configuration in the attached files and I got the following errors in my arbiterd.log:

    Code:
    2012-12-31 16:40:30,174 [1356968430] Critical : Back trace of it: Traceback (most recent call last):
     File "/syststsup01/shinken/shinken/daemons/skonfdaemon.py", line 442, in main
      self.do_mainloop()
     File "/syststsup01/shinken/shinken/daemon.py", line 244, in do_mainloop
      self.do_loop_turn()
     File "/syststsup01/shinken/shinken/daemons/skonfdaemon.py", line 463, in do_loop_turn
      self.run()
     File "/syststsup01/shinken/shinken/daemons/skonfdaemon.py", line 566, in run
      srv = run(host=self.http_host, port=self.http_port, server=self.http_backend)
     File "/syststsup01/shinken/shinken/webui/bottle.py", line 2203, in run
      res = server.run(app)
     File "/syststsup01/shinken/shinken/webui/bottle.py", line 2088, in run
      return sa(self.host, self.port, **self.options).run(handler)
     File "/syststsup01/shinken/shinken/webui/bottle.py", line 1907, in run
      srv.serve_forever()
     File "/usr/lib64/python2.6/SocketServer.py", line 224, in serve_forever
      r, w, e = select.select([self], [], [], poll_interval)
    error: (4, 'Interrupted system call')
    2012-12-31 16:40:31,245 [1356968431] Warning : Printing stored debug messages prior to our daemonization
    2012-12-31 16:40:43,873 [1356968443] Warning : Printing stored debug messages prior to our daemonization
    2012-12-31 16:40:44,004 [1356968444] Critical : I got an unrecoverable error. I have to exit
    2012-12-31 16:40:44,005 [1356968444] Critical : You can log a bug ticket at https://github.com/naparuba/shinken/issues/new to get help
    2012-12-31 16:40:44,007 [1356968444] Critical : Exception trace follows: Traceback (most recent call last):
     File "/syststsup01/shinken/shinken/daemons/arbiterdaemon.py", line 553, in main
      self.do_mainloop()
     File "/syststsup01/shinken/shinken/daemon.py", line 244, in do_mainloop
      self.do_loop_turn()
     File "/syststsup01/shinken/shinken/daemons/arbiterdaemon.py", line 587, in do_loop_turn
      self.run()
     File "/syststsup01/shinken/shinken/daemons/arbiterdaemon.py", line 685, in run
      self.dispatcher.check_dispatch()
     File "/syststsup01/shinken/shinken/dispatcher.py", line 146, in check_dispatch
      arb.put_conf(self.conf.whole_conf_pack)
     File "/syststsup01/shinken/shinken/satellitelink.py", line 133, in put_conf
      self.con.put_conf(conf)
     File "/usr/lib/python2.6/site-packages/Pyro/core.py", line 384, in __call__
      return self.__send(self.__name, args, kwargs)
     File "/usr/lib/python2.6/site-packages/Pyro/core.py", line 459, in _invokePYRO
      return self.adapter.remoteInvocation(name, Pyro.constants.RIF_VarargsAndKeywords, vargs, kargs)
     File "/usr/lib/python2.6/site-packages/Pyro/protocol.py", line 440, in remoteInvocation
      return self._remoteInvocation(method, flags, *args)
     File "/usr/lib/python2.6/site-packages/Pyro/protocol.py", line 501, in _remoteInvocation
      answer.raiseEx()
     File "/usr/lib/python2.6/site-packages/Pyro/errors.py", line 73, in raiseEx
      raise self.excObj
    NameError: global name 'cPickle' is not defined
    I don't know what that means. Could you please help me?

    Regards,
    Sam
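
The last frames of the second traceback (`answer.raiseEx()` followed by `raise self.excObj` inside Pyro) suggest that the `NameError` was raised on the remote side of the `put_conf` call and then shipped back to the calling arbiter, so the fault probably lies with the receiving daemon rather than the one that logged it. A minimal sketch of that pattern using only the standard library; `remote_put_conf`, `serve` and `call` are hypothetical stand-ins, not Pyro or Shinken functions:

```python
import pickle


def remote_put_conf(conf):
    # Stand-in for the remote daemon: fails the way code with a missing
    # import would, because cPickle is never imported in this module.
    return cPickle.loads(conf)  # noqa: F821 (deliberate NameError)


def serve(func, *args):
    """'Remote' side: run func and return a pickled result or exception."""
    try:
        return pickle.dumps(("ok", func(*args)))
    except Exception as exc:
        return pickle.dumps(("error", exc))


def call(func, *args):
    """Caller side: unpickle the reply and re-raise a remote exception locally."""
    status, payload = pickle.loads(serve(func, *args))
    if status == "error":
        raise payload
    return payload


try:
    call(remote_put_conf, b"conf")
except NameError as exc:
    # The exception surfaces in the caller although the bug is in the callee.
    print("raised locally, caused remotely: %s" % exc)
```

The practical consequence is that the log of the daemon receiving the configuration is the more useful place to look.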

  5. #5

    Re: Realms over 2 datacenters.

    Basics:

    Are your Pyro, Python and Shinken versions the same on both of your servers?

    Shinken core daemons use *distributed* programming (like .Net), so they need a consistent base across all servers running Shinken daemons (arbiter, scheduler, reactionner, poller, broker, receiver), or else kaput.

    You can disable skonf for now.

    Cheers,

    xkilian
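
One way to compare the "consistent base" between servers is to run the same small script on each node and diff the results. The sketch below is an assumption-laden helper, not part of Shinken: `module_info`, `fingerprint` and `diff` are hypothetical names, and the fallback to `constants.VERSION` is based only on the Pyro 3 attribute used elsewhere in this thread. It records module paths as well as versions, and it is intended to run on both Python 2.6 and Python 3.

```python
import sys


def module_info(name):
    """Return (version, path) for an importable module, or None if missing."""
    try:
        module = __import__(name)
    except ImportError:
        return None
    version = getattr(module, "__version__", None)
    if version is None:
        # Pyro 3 exposes its version as Pyro.constants.VERSION.
        version = getattr(getattr(module, "constants", None), "VERSION", None)
    return (version, getattr(module, "__file__", None))


def fingerprint(names=("Pyro", "shinken")):
    """Collect the interpreter version plus version and path of each module."""
    info = {"python": "%d.%d.%d" % sys.version_info[:3]}
    for name in names:
        info[name] = module_info(name)
    return info


def diff(a, b):
    """Return the sorted keys whose values differ between two fingerprints."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# Run this on every node and compare the printed lines (or feed two
# collected dictionaries to diff()).
for key, value in sorted(fingerprint().items()):
    print("%s: %r" % (key, value))
```

Because the path is part of the fingerprint, two nodes with identical versions but different installation prefixes still show up as different.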

  6. #6

    Re: Realms over 2 datacenters.

    Hello xkilian,

    As you suggested, I disabled skonf for the moment.

    The 2 servers have exactly the same versions:

    Code:
    [master]~ # cat /etc/redhat-release
    Red Hat Enterprise Linux Server release 5.8 (Tikanga)
    
    [master] # for d in arbiter broker receiver reactionner poller; do /usr/local/shinken/bin/shinken-${d} --version;done
    shinken-arbiter: 1.2.2   with pyro: 3.10
    shinken-broker 1.2.2
    shinken-receiver 1.2.2
    shinken-reactionner 1.2.2
    shinken-poller 1.2.2
    
    [shinken@master]~ $ python -V
    Python 2.6.8
     
    [shinken@master]~ $ python
    >>> import Pyro
    >>> print Pyro.constants.VERSION
    3.10
    Do you have any other ideas, please?

    Thanks,
    Sam

  7. #7
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: 2 datacenters and distributed monitoring

    Do you have a spare arbiter somewhere?
    No direct support by personal message. Please open a thread so everyone can see the solution

  8. #8

    Re: 2 datacenters and distributed monitoring

    Hi naparuba,

    Yes, I configured a spare arbiter on node2.
    I think I figured out what my mistake was:
    the installation prefix is different between node1 and node2.
    For this kind of distributed architecture it should be the same on both servers.
    Am I right? I just created a symlink and the error disappeared.

    I'm currently recreating the VMs with the standard location (/usr/local).
    I'll let you know.

    Sam

  9. #9

    Re: 2 datacenters and distributed monitoring

    Fine! I finished reinstalling Shinken.

    First thing I noticed: the arbiter doesn't stop...

    Code:
    [node01]/usr/local/shinken/var # service shinken stop
    Stopping skonf
    Stopping scheduler                     [ OK ]
    Stopping poller                      [ OK ]
    Stopping reactionner                    [ OK ]
    Stopping broker                      [ OK ]
    Stopping receiver                     [ OK ]
    Stopping arbiter                      [ OK ]
    [node01]/usr/local/shinken/var # ps aux | grep shinken
    shinken  2946 0.0 0.0 30656  824 ?    S  16:03  0:00 /usr/local/pnp4nagios/bin/npcd -d -f /usr/local/pnp4nagios/etc/npcd.cfg
    shinken  3937 0.0 0.7 233528 31592 ?    S  16:04  0:00 /usr/bin/python26 /usr/local/shinken/bin/shinken-arbiter -d -c /usr/local/shinken/etc/nagios.cfg -c /usr/local/shinken/etc/shinken-specific.cfg
    root   4848 0.0 0.0 61184  748 pts/0  S+  16:07  0:00 grep shinken

    And I still have the same error:

    Code:
    2013-01-02 16:08:37,442 [1357139317] Critical : Exception trace follows: Traceback (most recent call last):
     File "/usr/local/shinken/shinken/daemons/arbiterdaemon.py", line 553, in main
      self.do_mainloop()
     File "/usr/local/shinken/shinken/daemon.py", line 244, in do_mainloop
      self.do_loop_turn()
     File "/usr/local/shinken/shinken/daemons/arbiterdaemon.py", line 587, in do_loop_turn
      self.run()
     File "/usr/local/shinken/shinken/daemons/arbiterdaemon.py", line 685, in run
      self.dispatcher.check_dispatch()
     File "/usr/local/shinken/shinken/dispatcher.py", line 146, in check_dispatch
      arb.put_conf(self.conf.whole_conf_pack)
     File "/usr/local/shinken/shinken/satellitelink.py", line 133, in put_conf
      self.con.put_conf(conf)
     File "/usr/lib/python2.6/site-packages/Pyro/core.py", line 384, in __call__
      return self.__send(self.__name, args, kwargs)
     File "/usr/lib/python2.6/site-packages/Pyro/core.py", line 459, in _invokePYRO
      return self.adapter.remoteInvocation(name, Pyro.constants.RIF_VarargsAndKeywords, vargs, kargs)
     File "/usr/lib/python2.6/site-packages/Pyro/protocol.py", line 440, in remoteInvocation
      return self._remoteInvocation(method, flags, *args)
     File "/usr/lib/python2.6/site-packages/Pyro/protocol.py", line 501, in _remoteInvocation
      answer.raiseEx()
     File "/usr/lib/python2.6/site-packages/Pyro/errors.py", line 73, in raiseEx
      raise self.excObj
    NameError: global name 'cPickle' is not defined
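
For the first problem in this post (a daemon surviving `service shinken stop`), the `ps aux` listing can be filtered programmatically so that unrelated matches such as npcd or the `grep` itself are ignored. A minimal sketch, assuming the usual 11-column `ps aux` layout; `surviving_daemons` is a hypothetical helper and `SAMPLE` is the listing from the post above:

```python
SAMPLE = """\
shinken  2946 0.0 0.0 30656  824 ?    S  16:03  0:00 /usr/local/pnp4nagios/bin/npcd -d -f /usr/local/pnp4nagios/etc/npcd.cfg
shinken  3937 0.0 0.7 233528 31592 ?    S  16:04  0:00 /usr/bin/python26 /usr/local/shinken/bin/shinken-arbiter -d -c /usr/local/shinken/etc/nagios.cfg -c /usr/local/shinken/etc/shinken-specific.cfg
root   4848 0.0 0.0 61184  748 pts/0  S+  16:07  0:00 grep shinken
"""


def surviving_daemons(ps_output):
    """Return (pid, daemon) pairs for shinken-* processes in `ps aux` text."""
    found = []
    for line in ps_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something that is not a process row
        # Only look at the interpreter and the script, not at later
        # arguments such as .../shinken-specific.cfg.
        for word in fields[10].split()[:2]:
            base = word.rsplit("/", 1)[-1]
            if base.startswith("shinken-"):
                found.append((int(fields[1]), base))
                break
    return found


for pid, daemon in surviving_daemons(SAMPLE):
    print("still running: %s (pid %d)" % (daemon, pid))
```

On the listing above this reports only the arbiter process with PID 3937.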

  10. #10

    Re: 2 datacenters and distributed monitoring

    I added the information to the pre-requisites and the FAQ.

    Cheers

    xkilian
