The company I work for does server and network monitoring for our clients. We set up one or more monitoring servers inside each client's network. Until now we've used Nagios, but at our larger customers it was having trouble scaling.

So, when we were presented with a new client that would be our biggest monitoring setup so far, we figured we should try Shinken for its superior scalability. We installed Scientific Linux 6.5 (essentially the same as CentOS 6.5) on a server, installed Shinken 2.0.3 and Thruk from EPEL, and started adding hosts and services. Everything was going well until we had added around 1000 hosts and 3000 or so services. Then we noticed that the scheduler process was using a huge amount of RAM -- I was seeing it at nearly 4GB, and it was dying periodically because the kernel was killing it when the system ran out of memory. We added memory to the server, which helped for a bit, but we weren't done setting up hosts and services to be monitored either.

Next, the arbiter started timing out talking to the scheduler, and so it would pull the configurations back from the broker and other daemons since there weren't enough schedulers.

I then set up a scheduler and poller on a second server, but the arbiter was still having trouble. Its log would have something like this:
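For reference, here's roughly what the daemon definitions in my shinken-specific.cfg look like. This is a sketch, not a copy of our real config: the second server's hostname and most of the values are illustrative (the timeout/data_timeout/max_check_attempts values are, as far as I can tell, the defaults that ship with the package, and they line up with the "3000 milliseconds" and "(1/3)" in the log below):

Code:
# Sketch only -- scheduler-2 and poller-2 and their hostname are examples
define scheduler {
    scheduler_name      scheduler-1
    address             shinken01.prod.dc1.example.com
    port                7768
    spare               0
    weight              1
    timeout             3       ; 3 s = the 3000 ms timeout in the arbiter log
    data_timeout        120
    max_check_attempts  3       ; the (1/3) retry counter in the log
}

define scheduler {
    scheduler_name      scheduler-2
    address             shinken02.prod.dc1.example.com   ; second server, name assumed
    port                7768
    spare               0
    weight              1
    timeout             3
    data_timeout        120
    max_check_attempts  3
}

define poller {
    poller_name         poller-2
    address             shinken02.prod.dc1.example.com   ; name assumed
    port                7771
    spare               0
}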

[tt]Warning : Add failed attempt to scheduler-1 (1/3) Connexion error to http://shinken01.prod.dc1.example.com:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received[/tt]

After that it would try to send the config again, time out again, and then pull the configs back from the other processes. It looks like adding the second scheduler didn't help much, because the arbiter didn't split the config very evenly:

Code:
[1412562753] Info :  [Arbiter] Serializing the configurations...
[1412562753] Info :  Using the default serialization pass
[1412562753] Debug :  [All] Serializing the configuration 0
[1412562761] Debug :  [config] time to serialize the conf All:0 is 8.23942995071
[1412562761] Debug :  PICKLE LEN : 30600087
[1412562761] Debug :  [All] Serializing the configuration 1
[1412562761] Debug :  [config] time to serialize the conf All:1 is 0.00839400291443
[1412562761] Debug :  PICKLE LEN : 68267
[1412562770] Debug :  [config] time to serialize the global conf : 8.8419418335
TOTAL serializing in 17.0954930782
[1412562770] Info :  Configuration Loaded
So, my questions:

[list]
[li]How much memory should we expect the scheduler to use?[/li]
[li]How does the arbiter decide how to split up the config? Do I have to use poller_tags, or can it be more dynamic?[/li]
[li]Is there a general rule of thumb for how many hosts/services per scheduler?[/li]
[/list]