
Thread: Shinken capacity planning

  1. #1
    Junior Member
    Join Date
    Oct 2014
    Posts
    6

    Shinken capacity planning

    The company I work for does server and network monitoring for our clients. We set up one or more monitoring servers on each client's network. Until now we've used Nagios, but it was having trouble scaling at our larger customers.

    So, when we were presented with a new client that would be our biggest monitoring setup so far, we figured we should try Shinken for its superior scalability. We set up a Scientific Linux 6.5 server (essentially the same as CentOS 6.5), installed Shinken 2.0.3 and Thruk from EPEL, and started adding hosts and services. Everything was going well until we had added around 1,000 hosts and 3,000 or so services. Then we noticed that the scheduler process was using a huge amount of RAM -- I was seeing it at nearly 4GB -- and it was dying periodically because the kernel's OOM killer was terminating it when the system ran out of memory. We added memory to the server, which helped for a bit, but we weren't done setting up hosts and services to be monitored either.
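
    In case it helps anyone, something like the following is enough to see the scheduler's memory use and confirm the OOM kills (a rough sketch, not the exact commands we ran):

    Code:
    # resident memory of the Shinken daemons, largest first
    ps -eo rss,pid,cmd | grep '[s]hinken' | sort -rn | head
    # confirm the OOM killer was involved
    grep -i "out of memory" /var/log/messages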

    Next, the arbiter started timing out when talking to the scheduler, and so it would pull the configurations back from the broker and the other daemons since there weren't enough schedulers.

    I then set up a scheduler and a poller on another server, but the arbiter was still having trouble. Its log would show something like this:

    Code:
    Warning : Add failed attempt to scheduler-1 (1/3) Connexion error to http://shinken01.prod.dc1.example.com:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received

    After that it would try to send the config again, time out again, and then pull the configs back from the other processes. It looks like adding the second scheduler didn't help much, because the arbiter didn't split the config very evenly:

    Code:
    [1412562753] Info :  [Arbiter] Serializing the configurations...
    [1412562753] Info :  Using the default serialization pass
    [1412562753] Debug :  [All] Serializing the configuration 0
    [1412562761] Debug :  [config] time to serialize the conf All:0 is 8.23942995071
    [1412562761] Debug :  PICKLE LEN : 30600087
    [1412562761] Debug :  [All] Serializing the configuration 1
    [1412562761] Debug :  [config] time to serialize the conf All:1 is 0.00839400291443
    [1412562761] Debug :  PICKLE LEN : 68267
    [1412562770] Debug :  [config] time to serialize the global conf : 8.8419418335
    TOTAL serializing in 17.0954930782
    [1412562770] Info :  Configuration Loaded
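
    For reference, the second scheduler and poller are declared along these lines. This is a rough sketch based on Shinken's default daemon definitions; the hostname is an example, and your paths and parameters may differ:

    Code:
    # /etc/shinken/schedulers/scheduler-2.cfg (example path)
    define scheduler {
        scheduler_name  scheduler-2
        address         shinken02.prod.dc1.example.com  ; example hostname
        port            7768
        spare           0
        realm           All
        weight          1   ; schedulers with a higher weight should get a bigger share of the hosts
    }

    # /etc/shinken/pollers/poller-2.cfg (example path)
    define poller {
        poller_name     poller-2
        address         shinken02.prod.dc1.example.com
        port            7771
        spare           0
        realm           All
    }
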
    So, my questions:

    • How much memory should we expect the scheduler to use?
    • How does the arbiter decide how to split up the config? Do I have to use poller_tags, or can it be more dynamic?
    • Is there a general rule of thumb for how many hosts/services per scheduler?

  2. #2
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: Shinken capacity planning

    1: 4K elements => ~900MB, I think.
    2: Remove the /var/lib/shinken/pack_distribution.dat file. It causes more problems than it solves; I'll remove it in the next release.
    3: It's quite linear for memory, but CPU usage depends on your scheduling and the number of brokers.
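
    Concretely, that just means deleting the file and restarting the arbiter, along these lines (the service name here assumes the EPEL init scripts and may differ on your setup):

    Code:
    # stop the arbiter, drop the cached pack distribution, then start it again
    service shinken-arbiter stop
    rm -f /var/lib/shinken/pack_distribution.dat
    service shinken-arbiter start
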
    No direct support by personal message. Please open a thread so everyone can see the solution

  3. #3
    Junior Member
    Join Date
    Oct 2014
    Posts
    6

    Re: Shinken capacity planning


    Removing the pack_distribution.dat file fixed my problem:


    Code:
    [1412608838] Info :  [Arbiter] Serializing the configurations...
    [1412608838] Info :  Using the default serialization pass
    [1412608838] Debug :  [All] Serializing the configuration 0
    [1412608846] Debug :  [config] time to serialize the conf All:0 is 8.04555106163
    [1412608846] Debug :  PICKLE LEN : 30471951
    [1412608846] Debug :  [All] Serializing the configuration 1
    [1412608855] Debug :  [config] time to serialize the conf All:1 is 8.77394795418
    [1412608855] Debug :  PICKLE LEN : 30471021
    [1412608864] Debug :  [config] time to serialize the global conf : 9.01173591614
    TOTAL serializing in 25.8404281139
    [1412608864] Info :  Configuration Loaded
    And it turns out we have a lot more service checks than I initially thought (around 15,000), so the 4GB I was seeing with a single scheduler sounds about right.
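    (Rough math against the ~900MB per 4K elements figure above: about 16,000 hosts and services works out to 16,000 / 4,000 x 900MB, i.e. roughly 3.6GB, which matches what I was seeing.)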

    Now that the configuration is split more evenly, things are working a lot better. Thanks for your help!

  4. #4
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: Shinken capacity planning

    You're welcome

    I'll remove the pack_distribution usage from the master version; it's useless with distributed retention after all.
    No direct support by personal message. Please open a thread so everyone can see the solution

