
Thread: large Setup config problem

  1. #11
    Senior Member
    Join Date
    Oct 2011
    Posts
    139

    Re: large Setup config problem

    I really agree with xkilian: Shinken's major failings are now on the startup side (pushing the conf). I think we should all consider focusing on this. I can help by testing various configurations and publishing statistics, so you can have a better analysis.


  2. #12
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    Yes, it's important. And it's even quite easy to do with tools that have few links between elements. But in the Shinken core, links are nearly everywhere! Incremental changes will be a real nightmare to manage. Don't expect it in a few days of code.

    I think the first launch will be as long as now, there will be no miracle (most of the time is spent on object serialization for the exchanges between daemons).

    It could in fact be done easily with threads (prepare on one side, work on the other), but that's nearly useless in Python because of the GIL.

    One possible enhancement is to have more schedulers and reduce the impact of a configuration change, so adding a host would only imply changing one pack, and so resending one pack too. For each change, only a tiny part of the packs would change, and the overall "reload" impact would be low (only one or two schedulers to reload, not a big deal). It's the spirit of the pack cutting after all ;D

    I'm currently working on an "element hash" so we can see whether an element changed or not. Then the difficulty will be to build "packs" with the same elements between restarts (ids and order will change, and so will the packs). I think I'll need to save this somewhere between restarts (not a lot of data, so not a real perf problem).
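
    Something like this is the idea, as a rough sketch (the function, the property list and the choice of sha1 are just illustrative, not the real Shinken code): hash the attributes of the fully computed object in a fixed order, so the same resolved host/service gives the same value between two configuration reads.

    [code]
import hashlib

def element_hash(item, properties):
    # Hash the *computed* values of an item (sketch only).
    # `properties` is the list of attribute names that define the item;
    # hashing the raw config entry would miss inherited values, so we
    # hash the final, resolved attributes instead.
    parts = []
    for prop in sorted(properties):          # fixed order => stable hash
        parts.append('%s=%r' % (prop, getattr(item, prop, None)))
    return hashlib.sha1('|'.join(parts).encode('utf-8')).hexdigest()
    [/code]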

    One other key time is the time between "the scheduler gets a new conf" and "the broks are sent" (then the pollers take the checks and monitoring is launched). What I don't understand in noone's case is the gap from T+3min (scheduler has its config) to T+4min (reactionner/poller config). Or maybe it is the time for sending the other schedulers their jobs?

    The problem is not that the broker doesn't have all the data (that's a UI problem, not a monitoring one), but that during the sending, the scheduler is blocked. A solution could be to send the data in smaller packs than the whole 100K broks for the initial data. 10K blocks would be good enough (latency vs throughput, in fact); a rough sketch of such a chunked send is below.
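
    Something like this, just to illustrate (the function names are hypothetical, only the slicing matters):

    [code]
def send_in_chunks(broks, send_batch, chunk_size=10000):
    # Instead of one blocking call with the full initial dump (~100K broks),
    # hand the transport fixed-size slices so the scheduler can interleave
    # other work between two batches. `send_batch` stands in for whatever
    # actually ships a list of broks to the broker.
    for start in range(0, len(broks), chunk_size):
        send_batch(broks[start:start + chunk_size])
    [/code]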

    Not easy to deal with, though.
    No direct support by personal message. Please open a thread so everyone can see the solution

  3. #13
    Senior Member
    Join Date
    Oct 2011
    Posts
    139

    Re: large Setup config problem

    Well, here is an old but interesting post about serialisation performance:

    http://kbyanc.blogspot.fr/2007/07/py...enchmarks.html

    and a more recent benchmark

    http://www.tablix.org/~avian/blog/ar...ase_for_cjson/
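
    For what it's worth, that kind of micro-benchmark is easy to redo locally with something like this (the sample dict is made up, and the numbers depend on machine and library versions):

    [code]
# Tiny serialisation timing sketch: pickle vs json on a flat, brok-like dict.
import json
import pickle
import timeit

sample = {'host_name': 'srv-0042',
          'state': 0,
          'output': 'OK - load average: 0.10, 0.15, 0.20',
          'perf_data': 'load1=0.10 load5=0.15 load15=0.20'}

for name, dump in (('pickle', lambda: pickle.dumps(sample, pickle.HIGHEST_PROTOCOL)),
                   ('json', lambda: json.dumps(sample))):
    print('%s: %.2fs for 100k dumps' % (name, timeit.timeit(dump, number=100000)))
    [/code]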

  4. #14
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    The problem is that we have "complex" items with objects and loops. For hosts and services, pickle looks like the only way, but for the broks maybe we can find something more efficient like cjson.
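
    That is the root of it: pickle follows an arbitrary object graph, loops included, while json only accepts plain types. A tiny illustration (the classes are invented for the example):

    [code]
import json
import pickle

class Host(object):
    def __init__(self, name):
        self.name = name
        self.services = []

class Service(object):
    def __init__(self, desc, host):
        self.desc = desc
        self.host = host                  # back-reference: host <-> service loop

h = Host('srv-db-01')
h.services.append(Service('cpu', h))

data = pickle.dumps(h)                    # fine: pickle walks the whole graph
try:
    json.dumps(h)                         # fails: class instances are not JSON types
except TypeError as exc:
    print('json refuses it: %s' % exc)
    [/code]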
    No direct support by personal message. Please open a thread so everyone can see the solution

  5. #15
    Junior Member
    Join Date
    Mar 2012
    Posts
    5

    Re: large Setup config problem

    The T+4min just shows that sending the config from the arbiter to the satellites takes longer while they are running.
    If I do a clean start of the complete Shinken, sending the config to all satellites is done in seconds (besides the scheduler).

    Yesterday I set up different realms with one main broker to keep the config smaller.
    In my case the schedulers get their conf faster, but the broker still needs a lot of time to manage the config. If I restart the arbiter, it again takes really long until the broker is ready.
    I haven't tried distinct realms yet, but having pollers, scheduler and broker on the same machine seems to be a problem when restarting the arbiter and resending the config,
    because the pollers are still running while the scheduler gets its config... so they consume a lot of CPU for queued checks. Same for the broker.
    And setting up 2 machines in each realm to keep the pollers separate, or using more cores etc... in the end costs too much hardware.

    I really, really like Shinken, but at the moment I have to look for an alternative monitoring solution.
    I'll keep an eye on your revisions; hopefully you'll find a way to speed up the config part of Shinken so larger environments can use it, too.
    Let me know if I can still help you with some tests in my free time etc.

    Thanks a lot for all your help and keep going!!! noone123



  6. #16
    Administrator
    Join Date
    Dec 2011
    Posts
    278

    Re: large Setup config problem

    Further splitting the configuration across multiple schedulers is just a band-aid, and so is faster data transfer between the arbiter and the scheduler/broker/etc.

    It does not address the core issue, which is that with large configurations there is a monitoring gap. This seems to be a core design issue. The thinking caps need to go on. I am not saying the solution will be simple.

    - incremental updates
    - partial flush and reload of configurations
    - transfers independent from flush and load
    - a central distributed database, e.g. memcache or mongodb, to hold the configuration
    - push or pull model for the schedulers
    - simplejson for brok messages to brokers (that seems to be an interesting idea) to make it more efficient, but it does not solve the initial configuration load
    - others

    Cheers,

    Keep up the good work!

  7. #17
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    Yop,

    [quote author=xkilian link=topic=344.msg1870#msg1870 date=1334280078]
    - incremental updates
    - partial flush and reload of configurations
    [/quote]
    I think those are the two main points. The main issue is serialisation, so the main solution is to do the same work fewer times, i.e. not send the same hosts/services again and again to the schedulers, and then to the brokers. I'm pushing "hash" computation into the arbiter, so we will know whether a host/service "changed" since the last configuration read (with inheritance, just looking at the configuration object is useless; we need to hash the real computed object). I think with a hash we can get a huge boost for "reload" (the first launch will still be slow, there is no magic here, but if the slow first launch only happens once per upgrade, that's 2-3 min every 3 months, not a big deal ).

    I'm not sure it will all be ready for 1.2, not the whole incremental stuff, but I think there'll be a big boost already ;D
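
    To give the idea of the "save somewhere between restarts" part (the file path and format here are just an example, any small local storage would do): keep the previous run's hashes on disk, and on reload only elements whose hash moved need to be handled again.

    [code]
import json

def changed_elements(current_hashes, state_file='element_hashes.json'):
    # `current_hashes` maps element name -> hash of its fully computed object.
    try:
        with open(state_file) as f:
            previous = json.load(f)
    except (IOError, ValueError):
        previous = {}                       # first launch: everything counts as new

    changed = [name for name, h in current_hashes.items()
               if previous.get(name) != h]

    with open(state_file, 'w') as f:        # remember the hashes for the next reload
        json.dump(current_hashes, f)
    return changed
    [/code]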

    [quote author=xkilian link=topic=344.msg1870#msg1870 date=1334280078]
    - transfers independent from flush and load
    [/quote]
    Hum... what?

    [quote author=xkilian link=topic=344.msg1870#msg1870 date=1334280078]
    - centralized distributed database, ex. memcache or mongodb to hold the configuration
    [/quote]
    The problem is serialisation, and adding a new layer between arbiter/scheduler will just add new problems (it was about putting the conf in memcache between arbiter and scheduler, wasn't it?)

    [quote author=xkilian link=topic=344.msg1870#msg1870 date=1334280078]
    - push or pull model for the schedulers
    [/quote]
    So the scheduler could get its conf from the arbiter instead of the arbiter pushing the conf? It's a cool thing for distributed setups in a customer LAN, but I don't see the point for perfs.

    [quote author=xkilian link=topic=344.msg1870#msg1870 date=1334280078]
    - simplejson for brok messages to brokers (that seems to be an interesting idea) to make it more efficient, but does not solve the initial configuration load.
    [/quote]
    I was thinking more about "marshal", but that would require simplifying the objects a lot (no class objects for json, only simple types). I don't think we can use json for all objects, but for the most time-consuming ones (hosts and services) it could be a good thing, even if it will complexify the "regenerator" pass (not a big deal, just time to code and relink).
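
    Roughly, the "simplify the objects" part would look like this (the attribute handling is illustrative, not the real brok code): keep simple types as they are, and replace any embedded object by a name reference before handing the result to json.

    [code]
import json

SIMPLE = (str, int, float, bool, type(None))

def to_simple_dict(item, properties):
    # Flatten an item to JSON-friendly types only (sketch).
    out = {}
    for prop in properties:
        value = getattr(item, prop, None)
        if isinstance(value, SIMPLE):
            out[prop] = value
        elif isinstance(value, (list, tuple)) and all(isinstance(v, SIMPLE) for v in value):
            out[prop] = list(value)
        else:
            # embedded object (e.g. a timeperiod): keep only a name reference
            out[prop] = getattr(value, 'get_name', lambda: str(value))()
    return out

# e.g. json.dumps(to_simple_dict(host, ['host_name', 'address', 'check_period']))
    [/code]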


    [quote author=xkilian link=topic=344.msg1870#msg1870 date=1334280078]
    - others
    [/quote]
    Let's try to apply all of the above first.


    Jean
    No direct support by personal message. Please open a thread so everyone can see the solution

  8. #18
    Administrator
    Join Date
    Dec 2011
    Posts
    278

    Re: large Setup config problem

    The current issue is described as:

    - In a running system, updating the configuration causes a very noticeable interruption of the monitoring processes.

    - transfers independent from flush and load
    What I mean by this is transferring the data from the arbiter without first stopping the running scheduler. That is, transferring the new configuration is independent from stopping the scheduler process from what it is currently doing. Thus, during the transfer of a large configuration, the scheduler would keep on running. Once the data has been transferred to the scheduler and prepared for use, then, and only then, would the scheduler *flush* its current config and load the new one. This may involve synchronizing with the arbiter for when it should load the new configuration, e.g. once at least one of every process has received its configuration and is ready, *flush* the current configuration and load the new one. This way, if there is a problem, you don't stop the running system and bork (fail).
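
    As a rough sketch of that sequencing (all names invented here; the point is only that the heavy un-serialisation runs beside the old config, and the switch itself is a single cheap swap):

    [code]
import threading

class ConfHolder(object):
    # Keep scheduling on the old conf while the new one is being prepared.
    def __init__(self, initial_conf):
        self.active = initial_conf
        self._lock = threading.Lock()

    def receive(self, raw_conf, deserialize):
        # The long part (un-pickling the new conf) runs in a background
        # thread; the scheduler keeps working on self.active meanwhile.
        t = threading.Thread(target=self._prepare, args=(raw_conf, deserialize))
        t.start()
        return t

    def _prepare(self, raw_conf, deserialize):
        new_conf = deserialize(raw_conf)
        with self._lock:
            self.active = new_conf          # "flush + load" becomes one swap
    [/code]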

    Quote from: xkilian on April 13, 2012, 03:21:18 am
    - simplejson for brok messages to brokers (that seems to be an interesting idea) to make it more efficient, but does not solve the initial configuration load.
    I was thinking more about "marshal", but that would require simplifying the objects a lot (no class objects for json, only simple types). I don't think we can use json for all objects, but for the most time-consuming ones (hosts and services) it could be a good thing, even if it will complexify the "regenerator" pass (not a big deal, just time to code and relink).

    Marshal and cjson are not viable options for a number of reasons, such as support (marshal, cjson), stability (cjson), maintainership (cjson), and not being meant for this use (marshal). Even if they are a wee bit slower, a well-supported implementation of json is probably a better idea, whichever it is. Anyhow, I think your suggestion to use json is a good one, but it is probably not material to the current issue (interruption of the monitoring process due to a configuration update!).

    I hope I have clarified what I meant.

    ps.
    • Push/pull model. I hear you. :-)
    • MongoDB, etc. I hear you. :-)



    Have a good evening.

    X


  9. #19
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    I'm working on the pre-serialize thing for the scheduler part. The main idea is to limit the arbiter's blocking time when sending a configuration, so with several schedulers, the first one can start working quickly instead of waiting several minutes. I also boosted the brok generation time by removing some useless and huge objects from the broks and replacing them with names instead (like not sending the full timeperiod object, but just giving the name; the timeperiod is sent in its own object, and only one time!) :-*
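
    To show the shape of the change (the brok keys and attribute names here are illustrative, not the exact ones): each timeperiod goes out once in its own brok, and host broks only carry the name.

    [code]
def initial_broks(hosts, timeperiods):
    # Sketch of the "send names, not objects" change.
    broks = []
    for tp in timeperiods:                          # each timeperiod goes out once...
        broks.append({'type': 'initial_timeperiod_status',
                      'data': {'timeperiod_name': tp.timeperiod_name}})
    for host in hosts:                              # ...host broks only carry its name
        broks.append({'type': 'initial_host_status',
                      'data': {'host_name': host.host_name,
                               'check_period': host.check_period.timeperiod_name}})
    return broks
    [/code]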

    It gives a huge boost in scheduler and broker job time. Take the 50K sample, now:
    * the arbiter -> scheduler send went from 40s to 3s 8) (pre-serialisation is done in the arbiter during the check phase, while the old arbiter is still alive, so this time is not important).
    * but it adds 10s in the scheduler "un-serialize" phase. So the load time went from 40s to 13s. I think the unserialize can be put in a thread, but with caution so the old jobs can still run.
    * in the scheduler, brok generation now takes 9s (was 35s before); the broker send is something like 5s (but by then the pollers are already working, so it's not a real problem).
    * broks are sent to the modules' Queue() in 8s (was 30s before), and the same amount of time is saved in the modules when loading them.

    Oh, and the whole memory consumption for the broker and each of its external modules went from 3GB to 1.5GB! (Python is not dropping all allocated memory - stupid, but it's by design... - so useless loaded objects like timeperiods were not totally dropped by the garbage collector.)

    In the end, we get a "fresh start -> WebUI ok" in 2min30, a restart in 1min if I'm not wrong, and a scheduling "stop" time of less than 30s (I think we can gain another 10s). And that is without even the incremental send (it will be quite hard to get a bug-free incremental reload with all the links between elements...). These times are quite a bit better than before ;D

    I just need to finish the arbiter pre-serialization, to put it in the pre-kill-the-old-arbiter phase, and then you will be able to test all of this.
    No direct support by personal message. Please open a thread so everyone can see the solution

  10. #20
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: large Setup config problem

    Ok, the last commit is done, you can give it a try with the new master version. It should be quicker for large setups, especially with several schedulers.
    No direct support by personal message. Please open a thread so everyone can see the solution
