Thread: large Setup config problem

  1. #1
    Junior Member
    Join Date
    Mar 2012
    Posts
    5

    large Setup config problem

    Hi,

    First of all, great work on Shinken, and a really nice concept. I really like it,

    but I have some trouble setting up a "large setup" of Shinken.
    Currently we have about 5.5k hosts and 55k checks.
    All checks are active, and 30k of them are NRPE checks.

    Current setup:
    OS: Debian Squeeze, Pyro 4, Python 2.6, Shinken 1.0.1 (also tried Pyro 3.9 without luck)
    Shinken is installed via setup.py and Thruk from source.

    We have our master server with 4 vCPUs and 10 GB RAM, running Arbiter, Scheduler-1, Broker, Receiver and Reactionner.
    Slave1 has 4 vCPUs and 4 GB RAM and runs poller-1.
    Slave2 has 4 vCPUs and 4 GB RAM and runs scheduler-2 and poller-2 for testing.
    More hardware is no problem, but I don't think hardware is the bottleneck.

    Current memory usage with 5k hosts:
    Arbiter: 1.5 GB
    Scheduler-1: 1.8 GB, Scheduler-2: 1.8 GB
    Broker: 3 x 1.5 GB and 1 x 900 MB

    Config:
    Currently all notifications, dependencies, packs, escalations, etc. are disabled. We just have a flat config containing:
    commands, contactgroups, contacts, hosts, hostgroups, services, templates and timeperiods.
    Services are assigned to hostgroup_names.

    The shinken-specific.cfg is nearly untouched. I disabled all unneeded modules and enabled distributed polling and nrpebooster. No retention at the moment.
    The nagios.cfg is nearly default too. "Large_installations_tweaks" is enabled.

    Problem:
    The major problem is the size of the config. The arbiter needs more than 2 minutes to send the config to one scheduler.
    With two schedulers it takes 4 minutes. (With 1k hosts in the config it takes 7 seconds to send the config.)
    After all satellites have their config, it takes several more minutes until the broker is ready to accept results.
    As the arbiter and scheduler each seem to use only one core, the only possibility seems to be making the config smaller.
    If I add one more host and restart the arbiter, it takes 14 minutes until all satellites are ready again and checks are run.
    Within these 14 minutes no checks are done.
    Once everything works again, the check rate is much faster than in Nagios 3 (11k/min vs. 5k/min).

    Question:
    My first question is: how can I shrink or split the config? With normal schedulers, all of them get the "same" config;
    maybe with realms or tagged schedulers the config is really split?
    Does the broker always keep forking itself until memory is full? Thruk is getting slow as well with that number of hosts.

    I would really like to switch to Shinken, but with 14 minutes without checks my boss will kill me.
    So it would be really nice if I could solve this problem with tagged schedulers or realms.
    If there are any important config parameters I should add, please let me know.

    best, noone123

  2. #2
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    Hi,

    So 5 times the elements means 100 times the duration; that does not sound good. What is the activity while waiting? Mainly CPU for the arbiter, then CPU for the scheduler?

    Can you edit shinken-arbiter and activate the profiling mode (at the end of the file):

    # Protect for windows multiprocessing that will RELAUNCH all
    if __name__ == '__main__':
        daemon = Arbiter(debug=opts.debug_file is not None, **opts.__dict__)
        #daemon.main()
        # For perf tuning: run the daemon under cProfile and dump the stats to a file
        import cProfile
        cProfile.run('''daemon.main()''', '/tmp/arbiter.profile')

    Then launch it, wait for the schedulers to have their conf, and stop it (init.d/shinken-arbiter stop). Then we will look at the profile file.
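For anyone following along: a profile file written by cProfile can be read back with the standard pstats module. This is only a minimal sketch in Python 3. The helper name `summarize_profile` and the toy workload are made up for illustration, and the demo profiles a stand-in function rather than the arbiter, so that it runs without Shinken installed.

```python
import cProfile
import io
import os
import pstats
import tempfile


def summarize_profile(path, top=15):
    """Return the `top` entries of a cProfile dump, sorted by cumulative time."""
    out = io.StringIO()
    stats = pstats.Stats(path, stream=out)
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return out.getvalue()


def toy_workload():
    # Stand-in for daemon.main(); this is NOT Shinken code.
    return sum(i * i for i in range(200000))


def demo():
    """Profile the toy workload into a temporary file and summarize it."""
    fd, path = tempfile.mkstemp(suffix=".profile")
    os.close(fd)
    try:
        profiler = cProfile.Profile()
        profiler.runcall(toy_workload)
        profiler.dump_stats(path)
        return summarize_profile(path)
    finally:
        os.remove(path)


report = demo()
print(report)
```

On the real box the call would simply be `print(summarize_profile('/tmp/arbiter.profile'))`. Sorting by cumulative time puts the call chains that dominate the config send at the top of the listing.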
    No direct support by personal message. Please open a thread so everyone can see the solution

  3. #3
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    I gave it a try on my dev box by generating 5K hosts with 10 services each, and sending the scheduler conf takes 30s (plus another 30s for the broker to be OK with all data, WebUI active).

    That is with Pyro 4.10 and Python 2.7.

    We should look at where the time is lost in your configuration. It may be CPU on both sides: the arbiter serializes the configuration with the Python pickle module, and the scheduler receives and unserializes it. Both are CPU-consuming operations that cannot be tuned further, as the module is already used with its maximum parameters. If the CPU is idle, it may be something else that we must hunt down.

    For memory consumption, I get values like yours, and they sound about right (I think we can reduce it a bit, but not a lot).
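To get a rough feel for the serialization cost described above, one can time a pickle round trip on synthetic data. This is a sketch under loud assumptions: the dicts below are invented stand-ins for service objects, not Shinken's real classes, and it is written for Python 3, so the absolute numbers will not match a Python 2.6 arbiter. It only shows the shape of the measurement.

```python
import pickle
import time


def make_fake_services(n_hosts=500, services_per_host=10):
    """Build plain dicts standing in for service objects (hypothetical fields)."""
    return [
        {
            "host_name": "host-%05d" % h,
            "service_description": "check-%02d" % s,
            "check_command": "check_nrpe!check-%02d" % s,
            "check_interval": 5,
            "contacts": ["admin", "oncall"],
        }
        for h in range(n_hosts)
        for s in range(services_per_host)
    ]


def time_round_trip(objects, protocol):
    """Serialize and unserialize `objects`, returning blob size and timings."""
    start = time.perf_counter()
    blob = pickle.dumps(objects, protocol)
    mid = time.perf_counter()
    restored = pickle.loads(blob)
    end = time.perf_counter()
    assert restored == objects  # the round trip must be lossless
    return {
        "protocol": protocol,
        "bytes": len(blob),
        "dump_s": mid - start,
        "load_s": end - mid,
    }


services = make_fake_services()
results = [time_round_trip(services, p) for p in (0, pickle.HIGHEST_PROTOCOL)]
for r in results:
    print("protocol %(protocol)d: %(bytes)d bytes, "
          "dump %(dump_s).3fs, load %(load_s).3fs" % r)
```

Both halves run on a single core, which fits the one-busy-CPU pattern discussed in this thread. Raising `n_hosts` shows how the cost grows with the number of services.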

  4. #4
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    My bad, the scheduler/poller send takes around a minute, but the broker takes a bit longer (for WebUI/LS to be fully loaded). I've got:
    T = 0: start
    T + 40s: arbiter check ends (can you have a look at this part?)
    T + 1min20s: scheduler is loaded
    T + 2min: broks are sent to the brokers, the monitoring job is fully loaded (so for about 1min30 there are no checks)
    T + 4min: WebUI and LS are loaded (this time is quite long; I'll try to see if we can improve this part)

    Can you check what your times are?

    Thanks,


    Jean

  5. #5
    Junior Member
    Join Date
    Mar 2012
    Posts
    5

    Re: large Setup config problem

    Hi naparuba,

    thanks for your reply. I'm on vacation until Tuesday, so I can't test much. I hope I can test the settings tomorrow.

    I guess my times are close to yours when starting Shinken for the first time. The really big problems come up when I change the config and restart only the arbiter.

    I'll check the exact times, load, etc. as soon as possible and report back.

    Just one more question: will the config really be split into smaller parts when using realms or tagged schedulers?

    have a nice weekend!
    noone123

  6. #6
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    Yes, with realms, hosts and services will be split. The other elements (contacts or commands) are common. But in fact the time is spent almost entirely on services. I pushed some tweaks in the latest code; it's a bit better (20~30% I think), so you can also give it a try. But the time analysis is still important, so we will know whether it's pure CPU time or whether we need to tweak something else.
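The split described above can be pictured with a toy model. This is not Shinken's dispatcher code. The function name, the dict layout and the realm names are all invented for illustration. It only shows the idea that hosts and services are partitioned per realm while shared objects such as contacts and commands go to every part.

```python
from collections import defaultdict


def split_by_realm(hosts, services, shared):
    """Partition hosts and services per realm; `shared` is given to every part.

    hosts:    list of dicts with 'host_name' and 'realm'
    services: list of dicts with 'host_name'
    shared:   objects common to all realms (contacts, commands, ...)
    """
    realm_of = {h["host_name"]: h["realm"] for h in hosts}
    parts = defaultdict(lambda: {"hosts": [], "services": [], "shared": shared})
    for h in hosts:
        parts[h["realm"]]["hosts"].append(h)
    for s in services:
        # A service follows its host into that host's realm.
        parts[realm_of[s["host_name"]]]["services"].append(s)
    return dict(parts)


# Hypothetical data: 6 hosts over 3 realms, 10 services per host.
hosts = [{"host_name": "host-%d" % i, "realm": "realm-%d" % (i % 3)}
         for i in range(6)]
services = [{"host_name": h["host_name"], "service_description": "svc-%d" % n}
            for h in hosts for n in range(10)]
shared = {"contacts": ["admin"], "commands": ["check_ping"]}

parts = split_by_realm(hosts, services, shared)
for realm in sorted(parts):
    print(realm, len(parts[realm]["hosts"]), "hosts,",
          len(parts[realm]["services"]), "services")
```

Since the services make up most of the objects, each part in this model is roughly a third of the full configuration.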

  7. #7
    Junior Member
    Join Date
    Mar 2012
    Posts
    5

    Re: large Setup config problem

    Hi again,

    sorry for the late response... I had no time until today to run the timing checks.
    I get the following times with Shinken 1.0.1 (Xen VM with 4 vCPUs and 10 GB RAM; the pollers are on an external machine):

    T=0 (start)
    T+45sec Arbiter check ends
    T+2min10sec Scheduler has its conf.
    T+2min12sec All other satellites have their conf.
    Now I immediately get timeouts (1/3) from all satellites.
    T+2min50sec Broker has its conf ready and knows about the other satellites.
    T+4min Reactionner gets 3/3 timeouts and the arbiter resends the config
    T+5min Scheduler receives the first check results.
    T+5min Broker gets a timeout (3/3) --> arbiter resends the config
    T+8min Broker has created services and sent broks. (Thruk shows the first check results)

    I never saw one process consuming more than 67% CPU (htop). Most of the time it was 57%.
    Dstat shows me 85% idle until everything is done. For a short time (a few seconds) the broker and scheduler work side by side on 2 vCPUs, but the rest of the time only one process takes CPU time.

    One more question:
    When working with, for example, 3 realms, is it possible to use one tagged poller for all realms? For example, if I want to ping hosts, check HTTP, etc. from different routes?


  8. #8
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    Huge times indeed. Can you try with the latest master code? It should be a bit better. We will see how the times come down. It's strange, because there should be at least one CPU at 100% the whole time (most of the time goes to object serialization).

    For the poller, you can try to put the poller in the top-level realm with manage_sub_realms 1; that should do the trick. A common poller for several realms is not really what it was designed for, but it should work.
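As a way to picture the suggestion above, here is a toy model of a realm tree in which a poller placed in the top realm also serves everything below it when sub-realm management is switched on. It is a sketch only. The realm names, the `children` mapping and the function are hypothetical, and Shinken's actual realm handling may differ in its details.

```python
def realms_served(poller_realm, manage_sub_realms, children):
    """Return the set of realms a poller covers in this toy model.

    children maps a realm name to the list of its direct sub-realms.
    """
    served = {poller_realm}
    if manage_sub_realms:
        # Walk the tree below the poller's realm.
        stack = list(children.get(poller_realm, []))
        while stack:
            realm = stack.pop()
            if realm not in served:
                served.add(realm)
                stack.extend(children.get(realm, []))
    return served


# Hypothetical realm tree: All -> (Europe -> Paris), US
children = {"All": ["Europe", "US"], "Europe": ["Paris"]}

with_subs = realms_served("All", True, children)
without_subs = realms_served("All", False, children)
print(sorted(with_subs))
print(sorted(without_subs))
```

In this model the same poller placed in "Europe" would cover only "Europe" and "Paris", which is why the top-level realm is the natural place for a shared poller.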

  9. #9
    Junior Member
    Join Date
    Mar 2012
    Posts
    5

    Re: large Setup config problem

    OK, I'll check your new code soon.

    I just tried removing 2 hosts and restarting the arbiter in the running Shinken environment.
    As we have to modify our config ~5 times a day, this is also really important.

    Now I got these times (only an arbiter restart in a fully working Shinken environment; setup as above, still 2 GB RAM free):
    T=0 arbiter stopped
    T+46sec arbiter check ends
    T+3min scheduler has config
    T+4min Reactionner has config
    T+4min8sec poller has conf
    T+6min52sec broker has conf
    T+8min broker has conf.

    Well, you're right: it was always 100% on one core. My htop shows me strange values, but top works fine.
    While restarting the arbiter, no checks are made or submitted from T+46sec to T+8min.
    I'll check the new code, but with these times there's no chance to switch to Shinken at the moment, because we have to submit a new conf several times a day.
    As I already told you, once it's running, it is much faster than our current Nagios 3 setup, which has much more hardware power.
    I'll also see if I get the time to test a realm setup to decrease the config size, but with one broker for all realms the big config might cause trouble again, and I guess distinct realms can't handle "global pollers" that manage sub-realms, and that's important for us.
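To put the numbers above in perspective, the blind window can be worked out from the reported times. The sketch uses only figures from this thread (T+46sec to T+8min, about 5 config changes a day, and the 11k checks/min steady-state rate from the first post). The "missed checks" figure is an assumption-laden upper bound, since it supposes checks would otherwise have run at the full steady-state rate.

```python
BLIND_START_S = 46           # T+46sec: arbiter check ends, checks stop
BLIND_END_S = 8 * 60         # T+8min: broker has its conf, checks resume
RESTARTS_PER_DAY = 5         # "~5 times a day"
CHECKS_PER_MINUTE = 11000    # steady-state rate quoted in the first post

# Seconds without any check per arbiter restart.
blind_per_restart_s = BLIND_END_S - BLIND_START_S

# Seconds without any check per day at the stated restart frequency.
blind_per_day_s = blind_per_restart_s * RESTARTS_PER_DAY

# Upper bound on checks not run during one restart (integer division).
missed_per_restart = blind_per_restart_s * CHECKS_PER_MINUTE // 60

print("blind window per restart: %d s" % blind_per_restart_s)
print("blind window per day: %d s (%.1f min)"
      % (blind_per_day_s, blind_per_day_s / 60.0))
print("checks not run per restart (upper bound): %d" % missed_per_restart)
```

That comes to 434 seconds per restart and a bit over 36 minutes a day without monitoring.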

  10. #10
    Administrator
    Join Date
    Dec 2011
    Posts
    278

    Re: large Setup config problem

    The time spent not monitoring is indeed pretty critical for a system whose primary job is to monitor the state of things (performance and availability).

    I think it might be time to review the configuration update mechanism.
    • Consider delivering incremental changes to the various satellites instead of reloading the whole configuration.
    • Consider continuing to use the old configuration until the new configuration has been transferred to all satellites, and then flush and load.

    As noone123 has mentioned, this is a show stopper.
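The two ideas in the list above can be sketched in a few lines. This is purely illustrative and not how Shinken works today. `StagedConfig` and `host_delta` are invented names, and a config is reduced to a list of host names. The first part keeps the old configuration active while a new one is staged and swaps only on commit. The second computes what actually changed between two configurations.

```python
import threading


class StagedConfig:
    """Keep serving the active config while a new one is staged, then swap."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._active = initial
        self._staged = None

    def active(self):
        with self._lock:
            return self._active

    def stage(self, new_config):
        """Store the new config without touching the active one."""
        with self._lock:
            self._staged = new_config

    def commit(self):
        """Swap in the staged config; return False if nothing was staged."""
        with self._lock:
            if self._staged is None:
                return False
            self._active, self._staged = self._staged, None
            return True


def host_delta(old_hosts, new_hosts):
    """Return the host names added and removed between two configs."""
    old, new = set(old_hosts), set(new_hosts)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Hypothetical change: two hosts removed, as in the test described earlier.
old = ["web-1", "web-2", "db-1", "db-2"]
new = ["web-1", "db-1"]

holder = StagedConfig(old)
holder.stage(new)               # the slow transfer would happen here
still_old = holder.active()     # checks keep running on the old config
swapped = holder.commit()       # quick switch once everything has arrived
delta = host_delta(old, holder.active())
print(still_old, swapped, delta)
```

With a delta like this, only the two removed hosts would need to reach the satellites instead of the whole configuration.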
