
Thread: Problem with performance and check latency

  1. #1
    Junior Member
    Join Date
    Sep 2011
    Posts
    5

    Problem with performance and check latency

    Hello,

    I was trying to migrate from Nagios to Shinken with the Thruk GUI; the feature list looks very interesting.

    The installation went without problems and I could start Shinken with my Nagios configuration. For the test installation I'm using Debian on XEN (2 GB memory, 2 VCPUs at 2 GHz). The Nagios installation consists of 500 hosts and about 2000 checks; as we're monitoring network devices, half of the checks are slow SNMP requests.

    The first quite uncomfortable thing is the launch time: the arbiter takes about 2-3 minutes to parse all config files, and there is no "reload" start option, which is not great for production. OK, we have a lot of small config files, but Nagios takes about 5 seconds to start.

    The second thing is the performance. The feature list said it should be much better than Nagios, but after the first launch the load on the server went up to 25 (!) and stayed there the whole time, and check latency rose up to 500 seconds.
    After some changes in the config files (number of workers for the scheduler and poller), the load stays quite stable at about 3-4 (that's OK) but sometimes goes up to 6-7. But the latency is still very, very high: more than 100 seconds.
    Our Nagios installation shows a latency of about 10-20 seconds (mainly because the performance data processing is quite slow) and a load of 1-2 (on a 2-CPU 2.8 GHz server); on Shinken I don't process performance data.

    If somebody has an idea where I should start to tune, it would be very helpful for me.

    Update: I've increased the number of workers in the poller to 32 per CPU (before it was 16), disabled notifications, and changed some commands so that they don't send any SNMP requests anymore. The server load is still 2-4 and the average latency has dropped to 10 seconds.
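    For reference, the worker settings live in the poller definition in shinken-specific.cfg. A sketch of the kind of block that gets tuned (names and values here are illustrative, from memory of the 0.x-era defaults, not the poster's exact config):

```
define poller {
    poller_name          poller-1
    address              localhost
    port                 7771
    min_workers          4      ; 0 means "one per CPU"
    max_workers          32
    processes_by_worker  256    ; checks each worker may run
}
```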

  2. #2
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: Problem with performance and check latency

    Hi,

    I have never seen such a long time to read a configuration. How is it organized? One file per host? For the load, I have already seen Shinken end up with a higher load average, but that was against a Nagios that had a lot of latency (so it just didn't launch everything we asked of it).

    Can you give us a top output so we can see what's going on?

    Thanks,


    Jean
    No direct support by personal message. Please open a thread so everyone can see the solution

  3. #3
    Junior Member
    Join Date
    Sep 2011
    Posts
    5

    Re: Problem with performance and check latency

    Hi Jean,

    exactly, we use one config file per host (for some reasons it's necessary), and we have around 600 configuration files. But if we start the arbiter in foreground mode (to see the debug output), we can see that the parsing itself doesn't take too much time; after the last config file it freezes for some minutes. It looks like it tries to run the check_commands to see if they end successfully.

    I don't think the top output will be really helpful, but here it is:
    Code:
    top - 14:16:00 up 1 day, 6:02, 3 users, load average: 4.41, 4.43, 4.00
    Tasks: 130 total,  6 running, 124 sleeping,  0 stopped,  0 zombie
    Cpu(s): 47.6%us, 51.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
    Mem:  2092692k total, 1458136k used,  634556k free,  135324k buffers
    Swap:    0k total,    0k used,    0k free,  631296k cached
    
     PID USER   PR NI VIRT RES SHR S %CPU %MEM  TIME+ COMMAND
    19212 shinken  20  0 40080 33m 1312 R  50 1.6 10:19.40 shinken-broker
    19213 shinken  20  0 15240 9772 1336 R  25 0.5  2:37.62 shinken-poller
    19111 shinken  20  0 104m 99m 2164 S  14 4.9  2:52.51 shinken-schedul
    11510 shinken  20  0 7052 5456 1700 R  11 0.3  0:00.35 check_rping.pl
    19128 shinken  20  0 64088 50m 2312 R  7 2.5  1:17.62 shinken-poller
    19132 shinken  20  0 33812 12m 1444 S  5 0.6  0:22.37 shinken-poller
    19211 shinken  20  0 41432 35m 1972 S  4 1.8  1:20.79 shinken-broker
    11509 shinken  20  0 2760 1220 952 S  2 0.1  0:00.06 check_snmp_ping
    11500 shinken  20  0 1924 756 636 S  2 0.0  0:00.05 check_ping
    19170 shinken  20  0 56828 35m 2584 S  2 1.7  0:23.54 shinken-broker
    11444 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11447 shinken  20  0 1920 756 636 S  1 0.0  0:00.04 check_ping
    11448 shinken  20  0 1920 756 636 S  1 0.0  0:00.04 check_ping
    11454 shinken  20  0 1920 756 636 S  1 0.0  0:00.04 check_ping
    11458 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11461 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11463 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11464 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11469 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11479 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11482 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11485 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11486 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11492 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11493 shinken  20  0 1920 760 636 S  1 0.0  0:00.04 check_ping
    11511 shinken  20  0 1920 748 632 S  1 0.0  0:00.04 check_ping
    11498 shinken  20  0 1920 760 636 S  1 0.0  0:00.03 check_ping
    13316 root   20  0 2464 1196 892 S  1 0.1  2:22.55 top
    19013 root   20  0 2484 1196 892 R  1 0.1  1:11.99 top
    11422 root   20  0 2484 1192 892 R  1 0.1  0:00.06 top
    19149 shinken  20  0 13736 9128 2272 S  1 0.4  0:03.22 shinken-reactio
    19204 shinken  20  0 51116 44m 1704 S  0 2.2  0:06.59 shinken-arbiter
      1 root   20  0 2052 460 364 S  0 0.0  0:04.11 init
      2 root   20  0   0  0  0 S  0 0.0  0:00.00 kthreadd
      3 root   RT  0   0  0  0 S  0 0.0  0:10.73 migration/0
      4 root   20  0   0  0  0 S  0 0.0  0:03.97 ksoftirqd/0
      5 root   RT  0   0  0  0 S  0 0.0  0:00.00 watchdog/0
      6 root   RT  0   0  0  0 S  0 0.0  0:11.11 migration/1
      7 root   20  0   0  0  0 S  0 0.0  0:02.94 ksoftirqd/1
      8 root   RT  0   0  0  0 S  0 0.0  0:00.00 watchdog/1
      9 root   20  0   0  0  0 S  0 0.0  0:10.83 events/0
      10 root   20  0   0  0  0 S  0 0.0  0:11.11 events/1
      11 root   20  0   0  0  0 S  0 0.0  0:00.00 cpuset
      12 root   20  0   0  0  0 S  0 0.0  0:00.00 khelper
      13 root   20  0   0  0  0 S  0 0.0  0:00.00 netns
      14 root   20  0   0  0  0 S  0 0.0  0:00.00 async/mgr
    I've activated the SNMP checks again and can see that the latency became very unstable: sometimes it's only about 2-3 seconds, and 5 minutes later it rises up to 60 seconds. I would ignore the latency altogether, but we are planning to use the performance data for MRTG graphs, and such instability will make the MRTG graphs represent the data incorrectly.

  4. #4
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: Problem with performance and check latency

    The arbiter thing is strange. It does not launch commands after reading the configuration, just in-memory parsing of the objects, and having lots of files or not is not a problem in this phase because everything ends up in one buffer. Can you edit the arbiter file and change the end of the file from:

    Code:
    # Protect for windows multiprocessing that will RELAUNCH all
    if __name__ == '__main__':
      daemon = Arbiter(debug=opts.debug_file is not None, **opts.__dict__)
      daemon.main()
    # For perf tuning :
    #import cProfile
    #cProfile.run('''daemon.main()''', '/tmp/arbiter.profile')

    To :

    Code:
    # Protect for windows multiprocessing that will RELAUNCH all
    if __name__ == '__main__':
      daemon = Arbiter(debug=opts.debug_file is not None, **opts.__dict__)
    #  daemon.main()
    # For perf tuning :
      import cProfile
      cProfile.run('''daemon.main()''', '/tmp/arbiter.profile')
    It will launch the arbiter in "trace" mode, and so when you stop it (not with kill -9; a simple kill will be good, or even a stop with the service) we will get an arbiter.profile file that we can read and analyse with a profiling tool (RunSnakeRun for Python; you can send me the profile file and I'll analyse it).
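    If RunSnakeRun is not at hand, the same profile file can also be read with Python's standard pstats module. A minimal, self-contained sketch; the parse_config function here is just a hypothetical stand-in workload so the example runs on its own:

```python
import cProfile
import os
import pstats
import tempfile

def parse_config():
    # stand-in workload for the arbiter's configuration parsing
    return sum(i * i for i in range(100000))

profile_path = os.path.join(tempfile.gettempdir(), "arbiter.profile")

# Profile the workload and dump the stats, like the patched arbiter does
profiler = cProfile.Profile()
profiler.enable()
parse_config()
profiler.disable()
profiler.dump_stats(profile_path)

# Load the dump and print the five most expensive call chains
stats = pstats.Stats(profile_path)
stats.sort_stats("cumulative").print_stats(5)
```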

    For the latency, what Shinken version are you using? I remember adding a "smooth" algorithm in the poller so that it dispatches the load better across all workers and gets far fewer load "spikes". You can see it with shinken-arbiter --version.

    A test with a recent git version can be great if you can.


    Jean

  5. #5
    Junior Member
    Join Date
    Sep 2011
    Posts
    5

    Re: Problem with performance and check latency

    Hi Jean,

    thank you for the answer!

    here are my version info and the arbiter.profile trace file:
    Code:
    shinken-arbiter --version
    shinken-arbiter : 0.6.5   with pyro : 3.9.1

    The latency is now stable at 3-5 seconds, which is OK, but to reach this I've disabled all CRITICALs. I got a lot of them because the test monitoring system is in the wrong subnet and doesn't have access to all devices. I'll try to get all services online and test the latency after that.

    But another problem has arisen: I've disabled all CRITICALs and given them an OK status with the help of "Submit passive check result", but after some period of time I can see that many of those services are in CRITICAL status again, with the real plugin output. Why are the disabled checks being checked again? Is there some kind of "freshness" check for disabled checks, like in Nagios for passive checks?
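    For comparison, the Nagios-style freshness parameters (which Shinken also understands) look like this in a service definition. A sketch; the threshold value is illustrative:

```
define service {
    ; ... usual service directives ...
    active_checks_enabled   0
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     3600   ; seconds without a passive result
                                   ; before an active check is forced
}
```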


  6. #6
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: Problem with performance and check latency

    Hi,

    I'll look at the profile, thanks. The 0.6.5 does not have the smooth algorithm; for now only the git version has it.

    For the freshness, yes, there is such a parameter, but it's not active by default. I'll give it a test.


    Jean

  7. #7
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: Problem with performance and check latency

    Hi,

    It's really the file opening that takes the 2 minutes. Do you think it's possible to send me your configuration (obfuscate your IPs or names, that's not a problem) so I can try to reproduce it and see which part of this function is so slow?

    Thanks a lot,


    Jean

  8. #8
    Junior Member
    Join Date
    Sep 2011
    Posts
    5

    Re: Problem with performance and check latency

    Hi Jean!

    thank you for the tip about the "git" version; I just didn't get at first what exactly you meant by "git version". I've installed it now (and of course forgot to back up the old config) and the start process is really quicker; it takes maybe a tick longer than Nagios, but nothing really noticeable. Let's wait and see what the latency says.

    How can I "reload" the daemon with a new configuration without restarting it? Because after a restart it takes again some minutes until the frontend shows the data.


    For the freshness, yes, there is such a parameter, but it's not active by default. I'll give it a test.
    I'm afraid you've understood me wrong. What I meant was: will the "disabled" checks be checked and updated again after some period of time?

    And one maybe useful notice: the default Thruk frontend was not very performant, I think because it runs as a standalone Perl webserver. We've migrated it to a normal Apache with the perl-fcgi module, and the frontend performance has really improved.

  9. #9
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: Problem with performance and check latency

    Hi,

    That's great for the loading time!

    For the reload, there is no such mode, but I think you just need to enable the retention file for your scheduler: that way, as soon as you start, you get the old values back and don't need to wait 5 or 10 minutes for data. You can enable it in the shinken-specific.cfg file: in your scheduler object, uncomment the modules line.
    Look at the PickleRetention module definition in the same file to be sure that the path really exists. By default it's /tmp, but you can set a true directory that will not be erased when the server restarts.
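    Assuming the 0.x-era layout of shinken-specific.cfg, the two blocks involved look roughly like this (a sketch; the port and retention path are illustrative):

```
define scheduler {
    scheduler_name   scheduler-1
    address          localhost
    port             7768
    modules          PickleRetention    ; uncomment/set this line
}

define module {
    module_name      PickleRetention
    module_type      pickle_retention_file_generic
    path             /var/lib/shinken/retention.dat   ; survives reboots, unlike /tmp
}
```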

    For the disabled checks you are right, there was a flaw: the element was not re-scheduled, but the current check in progress was still running, so we always got "one more check" after disabling it. I've just updated the git to fix this behavior.

    Thanks for the Thruk tip; I think it will help a lot of people.

  10. #10
    Junior Member
    Join Date
    Sep 2011
    Posts
    5

    Re: Problem with performance and check latency

    Hi Jean,

    thank you for the explanation about "disabled checks".

    after the update I got another small problem that I don't really understand: all hosts now have the "checks disabled" sign on them (see the picture). But all the hosts are active and are being checked. I can manually set, in the host commands, "Disable active checks of this host" or "Enable active checks of this host"; I see these commands arriving in the log files as EXTERNAL COMMANDs, but the sign doesn't disappear. Do you have any idea?
