
Thread: nrpe_poller crash

  1. #1
    Junior Member
    Join Date
    Dec 2013
    Posts
    1

    nrpe_poller crash

    We have a distributed Shinken setup with a dedicated poller. Almost 400 hosts with 3500 services sit behind this poller.

    Here is the poller config:

    Code:
    define poller {
      poller_name pop-eu-east-1
      address 10.0.0.10
      poller_tags pop-eu-east-1
      manage_sub_realms 0
      spare 0
      realm All
      port 7771
      min_workers 2
      modules NrpeBooster
    }
    Basically, at startup the poller works as expected.

    Code:
    2013-12-10 14:25:15,116 [1386685515] Info :  [pop-eu-west-1] Connection OK with scheduler scheduler-shinken-01
    2013-12-10 14:25:15,117 [1386685515] Info :  [pop-eu-west-1] Using max workers: 30
    2013-12-10 14:25:15,117 [1386685515] Info :  [pop-eu-west-1] Using min workers: 2
    2013-12-10 14:25:15,117 [1386685515] Info :  We have our schedulers: {0: {'wait_homerun': {}, 'name': u'scheduler-shinken-01', 'uri': u'PYROLOC://10.0.0.2:7768/Checks', 'actions': {}, 'instance_id': 0, 'running_id': '1386683580.0', 'address': u'10.0.0.2', 'active': True, 'push_flavor': 56000, 'port': 7768, 'con': <DynamicProxy for PYRO://10.0.0.2:7768/7f000001645d20494739a523d16154405b>}}
    2013-12-10 14:25:15,117 [1386685515] Debug :  Add module object {'configuration_errors': [], 'use': '', 'hash': '', 'name': '', 'tags': set([]), 'modules': [], 'customs': {}, 'configuration_warnings': [], 'module_name': u'NrpeBooster', 'plus': {}, 'module_type': u'nrpe_poller', 'id': 3, 'imported_from': u'/etc/shinken/architecture.conf'}
    2013-12-10 14:25:15,118 [1386685515] Info :  [pop-eu-west-1] Got module: nrpe_poller
    2013-12-10 14:25:15,325 [1386685515] Warning : Importing module logstore_mongodb: No module named pymongo
    2013-12-10 14:25:15,471 [1386685515] Warning : Importing module openldap_ui: No module named ldap
    2013-12-10 14:25:15,955 [1386685515] Info :  [NRPEPoller] Get a nrpe poller module for plugin NrpeBooster
    2013-12-10 14:25:15,955 [1386685515] Info :  Trying to init module: NrpeBooster
    2013-12-10 14:25:15,956 [1386685515] Info :  [NRPEPoller] Initialization of the nrpe poller module
    2013-12-10 14:25:15,956 [1386685515] Info :  I correctly loaded the modules: [NrpeBooster]
    2013-12-10 14:25:15,962 [1386685515] Info :  [pop-eu-west-1] Allocating new fork Worker: 0
    2013-12-10 14:25:16,012 [1386685516] Info :  [pop-eu-west-1] Allocating new nrpe_poller Worker: 1
    2013-12-10 14:25:16,055 [1386685516] Debug :  Loop turn
    2013-12-10 14:25:16,089 [1386685516] Info :  [NRPEPoller] Module started!
    2013-12-10 14:25:17,255 [1386685517] Debug :  ========================
    2013-12-10 14:25:17,257 [1386685517] Debug :  [0][scheduler-shinken-01][fork] Stats: Workers:0 (Queued:0 TotalReturnWait:0)
    2013-12-10 14:25:17,258 [1386685517] Debug :  [0][scheduler-shinken-01][nrpe_poller] Stats: Workers:1 (Queued:0 TotalReturnWait:0)
    2013-12-10 14:25:17,259 [1386685517] Debug :  Wait ratio: 1.000000
    2013-12-10 14:25:20,887 [1386685520] Debug :  Ask actions to 0, got 3985
    2013-12-10 14:25:29,630 [1386685529] Debug :  Loop turn
    
    2013-12-10 14:25:30,634 [1386685530] Debug :  ========================
    2013-12-10 14:25:30,635 [1386685530] Debug :  [0][scheduler-shinken-01][fork] Stats: Workers:0 (Queued:0 TotalReturnWait:0)
    2013-12-10 14:25:30,635 [1386685530] Debug :  [0][scheduler-shinken-01][nrpe_poller] Stats: Workers:1 (Queued:504 TotalReturnWait:0)
    2013-12-10 14:25:30,636 [1386685530] Debug :  I decide to up wait ratio
    2013-12-10 14:25:30,636 [1386685530] Debug :  Wait ratio: 1.199851
    2013-12-10 14:25:30,742 [1386685530] Debug :  Ask actions to 0, got 29
    2013-12-10 14:25:30,983 [1386685530] Debug :  Loop turn
    2013-12-10 14:25:32,192 [1386685532] Debug :  ========================

    After a few minutes, the log reports a worker going down unexpectedly, and a new worker is allocated.
    But when the second worker also goes down, no new worker is allocated. Worse, the NRPE booster simply disappears and the queue becomes empty...

    Code:
    2013-12-10 14:25:36,631 [1386685536] Debug :  ========================
    2013-12-10 14:25:36,633 [1386685536] Warning : [pop-eu-west-1] The worker 1 goes down unexpectedly!
    2013-12-10 14:25:36,639 [1386685536] Debug :  [0][scheduler-shinken-01][fork] Stats: Workers:0 (Queued:0 TotalReturnWait:118)
    2013-12-10 14:25:36,639 [1386685536] Debug :  Wait ratio: 1.283489
    2013-12-10 14:25:36,683 [1386685536] Info :  [pop-eu-west-1] Allocating new fork Worker: 2
    2013-12-10 14:25:37,115 [1386685537] Info :  [pop-eu-west-1] Allocating new nrpe_poller Worker: 3
    2013-12-10 14:25:37,516 [1386685537] Info :  [NRPEPoller] Module started!
    2013-12-10 14:25:37,890 [1386685537] Debug :  Ask actions to 0, got 8
    2013-12-10 14:25:38,985 [1386685538] Debug :  Loop turn
    2013-12-10 14:25:40,270 [1386685540] Debug :  ========================
    2013-12-10 14:25:40,271 [1386685540] Debug :  [0][scheduler-shinken-01][fork] Stats: Workers:0 (Queued:0 TotalReturnWait:49)
    2013-12-10 14:25:40,272 [1386685540] Debug :  [0][scheduler-shinken-01][fork] Stats: Workers:2 (Queued:0 TotalReturnWait:49)
    2013-12-10 14:25:40,273 [1386685540] Debug :  [0][scheduler-shinken-01][nrpe_poller] Stats: Workers:3 (Queued:0 TotalReturnWait:49)
    2013-12-10 14:25:40,274 [1386685540] Debug :  Wait ratio: 1.266827
    2013-12-10 14:25:40,401 [1386685540] Debug :  Ask actions to 0, got 11
    2013-12-10 14:25:41,229 [1386685541] Debug :  Loop turn
    2013-12-10 14:25:42,498 [1386685542] Debug :  ========================
    [...]
    2013-12-10 14:26:36,819 [1386685596] Debug :  ========================
    2013-12-10 14:26:36,820 [1386685596] Warning : [pop-eu-west-1] The worker 3 goes down unexpectedly!
    2013-12-10 14:26:36,827 [1386685596] Debug :  [0][scheduler-shinken-01][fork] Stats: Workers:0 (Queued:0 TotalReturnWait:8)
    2013-12-10 14:26:36,828 [1386685596] Debug :  [0][scheduler-shinken-01][fork] Stats: Workers:2 (Queued:0 TotalReturnWait:8)
    2013-12-10 14:26:36,829 [1386685596] Debug :  Wait ratio: 1.431813
    2013-12-10 14:26:36,959 [1386685596] Debug :  Ask actions to 0, got 2
    2013-12-10 14:26:37,086 [1386685597] Debug :  Loop turn
    2013-12-10 14:26:38,520 [1386685598] Debug :  ========================
    2013-12-10 14:26:38,521 [1386685598] Debug :  [0][scheduler-shinken-01][fork] Stats: Workers:0 (Queued:0 TotalReturnWait:10)
    2013-12-10 14:26:38,522 [1386685598] Debug :  [0][scheduler-shinken-01][fork] Stats: Workers:2 (Queued:0 TotalReturnWait:10)
    2013-12-10 14:26:38,523 [1386685598] Debug :  Wait ratio: 1.419792
    2013-12-10 14:26:38,628 [1386685598] Debug :  Ask actions to 0, got 3
    2013-12-10 14:26:38,798 [1386685598] Debug :  Loop turn
    2013-12-10 14:26:40,219 [1386685600] Debug :  ========================
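
    To make the pattern easier to spot in a longer log, here is a minimal Python sketch that replays the allocation and crash messages and reports which worker types were never replaced. It assumes the poller log keeps the exact message wording shown above; the names `worker_timeline`, `ALLOC_RE` and `DOWN_RE` are made up for this example and are not part of Shinken.

```python
import re

# Message formats copied from the poller log excerpt above (assumed stable).
ALLOC_RE = re.compile(r"Allocating new (\w+) Worker: (\d+)")
DOWN_RE = re.compile(r"The worker (\d+) goes down unexpectedly")


def worker_timeline(lines):
    """Replay allocation/crash messages from a poller log.

    Returns (alive, unreplaced):
      alive      -- dict mapping worker id -> module type still running
      unreplaced -- list of module types whose crashed worker was
                    never followed by a new allocation of the same type
    """
    alive = {}
    unreplaced = []
    for line in lines:
        match = ALLOC_RE.search(line)
        if match:
            worker_type, worker_id = match.group(1), int(match.group(2))
            alive[worker_id] = worker_type
            # A new worker of this type covers one earlier crash, if any.
            if worker_type in unreplaced:
                unreplaced.remove(worker_type)
            continue
        match = DOWN_RE.search(line)
        if match:
            worker_id = int(match.group(1))
            # Workers allocated before the log excerpt starts are "unknown".
            unreplaced.append(alive.pop(worker_id, "unknown"))
    return alive, unreplaced


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python worker_timeline.py /path/to/pollerd.log
    with open(sys.argv[1]) as log_file:
        alive, unreplaced = worker_timeline(log_file)
    print("alive workers:", alive)
    print("crashed and never replaced:", unreplaced)
```

    Run against the excerpts above, this should leave only fork workers 0 and 2 alive and one nrpe_poller crash unreplaced, which matches the Stats lines logged after worker 3 went down.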

    Here are the logs from the scheduler and the arbiter, but both sets of entries were logged before worker 3 went down.

    scheduler.log
    Code:
    2013-12-10 14:25:13,128 [Tue Dec 10 14:25:13 2013] Warning : 3858 actions never came back for the satellite 'pop-eu-west-1'. I'm reenable them for polling
    2013-12-10 14:26:20,465 [Tue Dec 10 14:26:20 2013] Warning : 3880 actions never came back for the satellite 'pop-eu-west-1'. I'm reenable them for polling
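
    For comparing these warnings with the poller timeline, a small Python sketch like the following could pull the timestamp, satellite name and action count out of each line. It assumes the scheduler keeps the message format shown above; `lost_actions` and `LOST_RE` are hypothetical names for this example only.

```python
import re

# Format copied from the scheduler warnings above (assumed stable).
LOST_RE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ "
    r".*?(\d+) actions never came back for the satellite '([^']+)'"
)


def lost_actions(lines):
    """Return a list of (timestamp, satellite, count) tuples, one per
    'actions never came back' warning found in the given log lines."""
    events = []
    for line in lines:
        match = LOST_RE.match(line)
        if match:
            timestamp, count, satellite = match.groups()
            events.append((timestamp, satellite, int(count)))
    return events
```

    Sorting these tuples together with the crash times from the poller log shows whether the scheduler gave up on the actions before or after each worker went down.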
    arbiter.log
    Code:
    2013-12-10 14:25:14,568 [Tue Dec 10 14:25:14 2013] Warning : [All] The poller pop-eu-west-1 seems to be down, I must re-dispatch its role to someone else.

  2. #2
    Shinken project leader
    Join Date
    May 2011
    Location
    Bordeaux (France)
    Posts
    2,131

    Re: nrpe_poller crash

    Sorry, I never saw your post.

    Please try again without the nrpe_booster module so we can see which part of the code is faulty (the poller or the module).
    No direct support by personal message. Please open a thread so everyone can see the solution
