
Thread: Scaling after install

  1. #1

    Scaling after install

    Hey there,

    I'm planning to move Shinken to our production environment, so some hardware procurement needs to happen.

    I imagine everyone plans to start small and then grow as demand requires. So, my plan here is:

    Datacenter 1:
    * one physical server (HP rackmount, dual proc, eight cores, 24GB RAM, 1.8TB of disk in RAID 1+0 of 10k RPM disks) to run the Shinken core (arbiter, scheduler, broker), plus the MongoDB primary node, plus Graphite for performance metrics
    * one virtual machine to run a MongoDB secondary node (backup)
    * one virtual machine to run a Shinken poller (two procs, 2GB RAM?)

    Datacenter 2:
    * one virtual machine to run a Shinken poller (two procs, 2GB RAM?)
    * one virtual machine to run a MongoDB secondary node (also a backup, priority 0 so it can never be elected; see the pymongo sketch after these lists)

    Datacenters 3 to N:
    * one virtual machine to run a Shinken poller (two procs, 2GB RAM?)
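
    Just to be explicit about the replica set part of the plan, this is roughly how I'd initiate it with pymongo, with the Datacenter 2 member at priority 0 so it can never become primary. The hostnames and the replica set name ("shinken") are placeholders, not from an actual setup:

    Code:
    # Sketch only: initiate a 3-member replica set where the DC2 member has
    # priority 0 and can never be elected primary. Hostnames and the replica
    # set name ("shinken") are hypothetical; the name must match the --replSet
    # option the mongod daemons were started with.
    from pymongo import MongoClient

    # directConnection lets us talk to the node before the set is initiated
    client = MongoClient("dc1-mongo-primary:27017", directConnection=True)

    config = {
        "_id": "shinken",
        "members": [
            {"_id": 0, "host": "dc1-mongo-primary:27017"},                    # DC1 physical box
            {"_id": 1, "host": "dc1-mongo-secondary:27017"},                  # DC1 backup VM
            {"_id": 2, "host": "dc2-mongo-secondary:27017", "priority": 0},   # DC2, never elected
        ],
    }

    client.admin.command("replSetInitiate", config)
    print(client.admin.command("replSetGetStatus"))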



    How do you think that would scale? I'll be monitoring a few hundred hosts per datacenter at first, then moving to a few thousand within weeks. Sensors per host are mostly SNMP and can vary from 5 to 500 per host (the average is 15).



    My main concern here is how to scale MongoDB and Graphite once I start hitting problems on the central host.
    Is it easy to move MongoDB around? Copy the files and start the daemon in the new place, binding it to the Mongo shard?
    Is it easy to move Graphite around? I have no idea how that would work.

    From my tests, this setup works pretty neatly for a hundred hosts with a thousand sensors, all on VMs.
    I'm not using Graphite but PNP4Nagios. How would this impact me?


    Cheers,
    openglx

  2. #2
    Administrator

    Re: Scaling after install

    Hello openglx,

    Read the "scaling Shinken" page in the wiki; just search for "scaling" in the search box (upper right corner).

    - Pollers for small installs can be virtual; large installs need physical servers (no VMs), as they are all about network IO and CPU.
    - Schedulers and brokers should be on physical servers. They are the big CPU consumers and also do a bit of network IO.
    - Arbiter and Reactionner should be virtual.
    - Graphite is all about local disk IO and distributing the Graphite processes across different cores. For example, an IBM X3650 or X3550 with six 10K disks can scale to over 20K metrics per second with basic load balancing using carbon-relay and carbon-cache, and up to 80K metrics per second with more tuning. Those are pretty crazy numbers, so you will not need to scale it with hardware, just a bit of config tuning.

    If your devices support SNMP getbulk, use snmp_booster; I can provide support for adding new OIDs if you need, either to Shinken directly or to genDevConfig.

    You will have to test your backup servers, because if Shinken ever runs out of memory (scheduler, poller, broker, or arbiter), the daemon implodes and needs to be restarted.

    Swap = death; there is no backoff process in Shinken to recover and run at a degraded rate. Hmm, that is an issue to open ;-).

    Graphite is easy to move, as the databases are simple files; the same goes for the SQLite DB.

    Graphite is much different from PNP! You can actually have Graphite running on a different server than Shinken, and it can receive data from sources other than Shinken. It can also have different UIs based on your needs. Graphite can actually run on commodity hardware, but it needs a good RAID array. Not a SAN, mind you, that is just a waste of good money for what Graphite does. You can also replicate Graphite data across different servers if you need to.
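
    To illustrate the "receive data from sources other than Shinken" point: carbon's plaintext listener just takes "metric.path value timestamp" lines over TCP (port 2003 by default), so anything that can open a socket can feed Graphite. A quick sketch, with a placeholder hostname:

    Code:
    # Minimal sketch: push one datapoint to carbon-cache's plaintext listener.
    # "graphite.example.com" is a placeholder; 2003 is carbon's default plaintext port.
    import socket
    import time

    metric = "datacenter1.router1.ifInOctets"   # hypothetical metric path
    value = 123456
    timestamp = int(time.time())

    line = f"{metric} {value} {timestamp}\n"

    with socket.create_connection(("graphite.example.com", 2003), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))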

    Anyway, hope this helps. It is all a matter of knowing the number of hosts/services and the types of protocols used to acquire the data.
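
    To make that concrete, here is a rough back-of-envelope using the numbers from your first post (the 60-second check interval is my assumption, not something you stated):

    Code:
    # Back-of-envelope: how many metrics per second will Graphite see?
    # The 60-second check interval is an assumption, not a number from this thread.
    hosts_per_datacenter = 2000   # "a few thousand" hosts per datacenter
    datacenters = 4
    sensors_per_host = 15         # average from the first post
    check_interval_s = 60         # assumed polling interval

    total_metrics = hosts_per_datacenter * datacenters * sensors_per_host
    metrics_per_second = total_metrics / check_interval_s

    print(f"{total_metrics} metrics, about {metrics_per_second:.0f} metrics/s")
    # => 120000 metrics, about 2000 metrics/s -- well under the 20K/s a basic
    #    carbon-relay + carbon-cache setup is quoted to handle above.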

    xkilian

  3. #3

    Re: Scaling after install

    Thanks once more, xkilian!

    What about disk space utilization? Which one is going to be bigger?

    I saw someone saying that his MongoDB for Shinken was growing by about 500MB a day. How many devices/sensors would one need to reach such a rate?
    And how about Graphite? A several-year period with minute precision is likely to be around 5MB per .wsp; how much space are you guys using currently?

    I plan to go for multiple years with minute precision, but I am worried about how much space it would eat for tens of thousands of sensors.
    Any figures on disk space are welcome.

    Planning disk space is as important as planning for disk speed... buying the right hardware kit is an art.

  4. #4
    Administrator

    Re: Scaling after install

    Hello,

    For Graphite, disk space is fixed based on the number of metrics, the consolidation methods used, and your retention period. The disk space is fully reserved at database creation time.

    For Mongo, that is incorrect. MongoDB will only store state logs, so this should not be very big unless you have a lot of state changes. The explanation is that it pre-allocates some space, but the on-disk size should be commensurate with the actual data stored and the retention period of your state-change logs. The retention period is set in the Livestatus logstore module.

    We have 2TB on RAID 6 arrays of 10K RPM disks reserved for our metrics: 1 to 5 minute precision for 2 months, with consolidation to 30-minute AVG and/or MAX for N months as appropriate.

    As I said before, this can all be pre-calculated based on the number of data points (hosts + services), frequency, retention period, and consolidation functions. If unsure, create one data point with the config you want and multiply the size of the Graphite .wsp file by the number of similar data points. ;-)
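
    If you want to skip creating the test file, the whisper on-disk format is predictable enough to compute directly: 16 bytes of header, 12 bytes per archive definition, and 12 bytes per stored datapoint. Something like this, where the retention figures are only an illustration (the "22 months" just stands in for the "N months" above):

    Code:
    # Estimate a whisper (.wsp) file size: 16-byte header + 12 bytes per archive
    # definition + 12 bytes per stored datapoint.
    def whisper_size(archives):
        """archives: list of (seconds_per_point, retention_in_seconds) tuples."""
        header = 16 + 12 * len(archives)
        points = sum(retention // sec_per_point for sec_per_point, retention in archives)
        return header + 12 * points

    DAY = 86400
    # Illustrative retention: 1-minute precision for 2 months, then 30-minute
    # consolidation for 22 more months (the "N months" part is just an example).
    archives = [(60, 60 * DAY), (1800, 660 * DAY)]

    per_metric = whisper_size(archives)
    metrics = 50000   # "tens of thousands of sensors"
    total_gib = per_metric * metrics / 1024.0 ** 3

    print(f"{per_metric / 1024.0:.0f} KiB per metric, ~{total_gib:.0f} GiB for {metrics} metrics")
    # => roughly 1384 KiB per metric, ~66 GiB in total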

    Cheers,

    xkilian
