Saturday, December 31, 2016

Notes: nagios with openSUSE Leap 42.2

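Goal: monitor the machines of my current projects as well as my own equipment,
and check whether public services and hosts are alive.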


OS: openSUSE Leap 42.2

Install the Nagios packages. Note that the plugins package has been renamed; it is now called monitoring-plugins.

# zypper  install   nagios  monitoring-plugins

Set the nagiosadmin password

The traditional (interactive) way
# htpasswd2   -c   /etc/nagios/htpasswd.users   nagiosadmin
New password:
Re-type new password:
Adding password for user nagiosadmin

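Because I plan to automate this with Ansible later, I also tried the -b and -i options. Interestingly, in my tests -b or -i had to be bundled with -c into a single option: -b -c did not work, only -bc did.

-b: batch mode; the password goes on the command line right after the user name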
# htpasswd2  -bc   /etc/nagios/htpasswd.users    nagiosadmin    test
Adding password for user nagiosadmin

-i: read the password from STDIN
# echo  test  |  htpasswd2  -ic   /etc/nagios/htpasswd.users    nagiosadmin
Adding password for user nagiosadmin

Check whether nagios is enabled at boot

# systemctl   is-enabled   nagios
nagios.service is not a native service, redirecting to systemd-sysv-install
Executing /usr/lib/systemd/systemd-sysv-install is-enabled nagios
disabled

Enable nagios at boot
# systemctl   enable  nagios
nagios.service is not a native service, redirecting to systemd-sysv-install
Executing /usr/lib/systemd/systemd-sysv-install enable nagios


# systemctl   is-enabled   nagios
nagios.service is not a native service, redirecting to systemd-sysv-install
Executing /usr/lib/systemd/systemd-sysv-install is-enabled nagios
enabled
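Since the plan is to drive this from Ansible later, a script has to read this state without tripping over the redirect notices that systemd prints for SysV-style services such as nagios. A minimal sketch, assuming (as in the transcripts above) that the state is always the last non-empty line of the output; `service_enabled_state` and `ensure_enabled` are hypothetical helper names.

```shell
# Print the enablement state ("enabled", "disabled", ...) from the captured
# output of `systemctl is-enabled`, skipping the systemd-sysv-install notices.
# Assumption: the state is the last non-empty line of that output.
service_enabled_state() {
    awk 'NF { last = $0 } END { print last }'
}

# Enable a service only if it is not enabled yet, so reruns change nothing.
ensure_enabled() {
    state=$(systemctl is-enabled "$1" 2>&1 | service_enabled_state)
    if [ "$state" != "enabled" ]; then
        systemctl enable "$1"
    fi
}
```

For example, `ensure_enabled nagios` only calls `systemctl enable` when needed, so it is safe to run repeatedly.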

Try to restart apache2; at this point it fails with an error
# systemctl  restart  apache2.service
Job for apache2.service failed because the control process exited with error code. See "systemctl status apache2.service" and "journalctl -xe" for details.


Check with status: the cause is that Apache 2.4 uses a different access-control syntax from Apache 2.2
# systemctl  status  apache2.service
Dec 31 11:11:50 template start_apache2[7143]: AH00526: Syntax error on line 15 of /etc/apache2/conf.d/nagios.conf:
Dec 31 11:11:50 template start_apache2[7143]: Invalid command 'Order', perhaps misspelled or defined by a module not included in the server configuration
Dec 31 11:11:51 template systemd[1]: apache2.service: Main process exited, code=exited, status=1/FAILURE

For reference:

Fix: enable the access_compat module (on openSUSE/SUSE, authz_host is already enabled by default)
# a2enmod   mod_access_compat

List the enabled apache2 modules
# apache2ctl   -M
  • The module is now recorded in /etc/apache2/sysconfig.d/loadmodule.conf as: LoadModule access_compat_module /usr/lib64/apache2-prefork/mod_access_compat.so
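Before turning on the compatibility module, a script can first confirm that the Nagios Apache config really uses the 2.2 directives. A minimal sketch, assuming plain line matching of the directive names is good enough (it does not understand `<IfVersion>` blocks); the function names are hypothetical.

```shell
# Succeed (exit 0) if an Apache config file still uses Apache 2.2-style access
# control (Order / Allow from / Deny from), which Apache 2.4 only accepts when
# mod_access_compat is loaded. Plain line matching; <IfVersion> is not parsed.
uses_22_access_syntax() {
    grep -Eiq '^[[:space:]]*(Order[[:space:]]|Allow[[:space:]]+from|Deny[[:space:]]+from)' "$1"
}

# Enable mod_access_compat (same a2enmod call as above) only when needed.
enable_compat_if_needed() {
    if uses_22_access_syntax "${1:-/etc/apache2/conf.d/nagios.conf}"; then
        a2enmod mod_access_compat
    fi
}
```

The other route, not taken in this post, is to rewrite those lines in native 2.4 form, for example replacing `Order allow,deny` plus `Allow from all` with `Require all granted`.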

Restart apache2
# systemctl  restart  apache2.service

Check the status
# systemctl   status  apache2.service


Start nagios
# systemctl  start  nagios

Check the status
# systemctl  status nagios

Open the HTTP service in the firewall
# yast2   firewall
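yast2 firewall is interactive, which does not fit the automation goal. On Leap 42.2 the firewall is SuSEfirewall2, configured in /etc/sysconfig/SuSEfirewall2. A minimal sketch, assuming that adding ports to the `FW_SERVICES_EXT_TCP` variable there is acceptable in your setup; `fw_open_tcp` is a hypothetical helper.

```shell
# Add a TCP port (or service name) to FW_SERVICES_EXT_TCP in a SuSEfirewall2
# sysconfig file, unless it is already listed. Only an existing
# FW_SERVICES_EXT_TCP="..." line is edited; nothing is added if it is missing.
fw_open_tcp() {
    fw_port=$1
    fw_file=${2:-/etc/sysconfig/SuSEfirewall2}
    # Current list, i.e. the text between the quotes.
    fw_current=$(sed -n 's/^FW_SERVICES_EXT_TCP="\(.*\)"$/\1/p' "$fw_file")
    for p in $fw_current; do
        if [ "$p" = "$fw_port" ]; then
            return 0    # already open
        fi
    done
    fw_new=$(echo "$fw_current $fw_port" | sed 's/^ *//')
    sed -i "s/^FW_SERVICES_EXT_TCP=\".*\"\$/FW_SERVICES_EXT_TCP=\"$fw_new\"/" "$fw_file"
}
```

Usage: `fw_open_tcp 80` on the Nagios server (and `fw_open_tcp 5666` on NRPE clients), then restart the firewall, for example with `systemctl restart SuSEfirewall2`.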

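By default, once nagios is running it checks the local HTTP service and raises a warning if there is no default web page. Also, when many things are running, the Total Processes check goes over its threshold. So I adjusted /etc/nagios/objects/localhost.cfg accordingly.

# vi   /etc/nagios/objects/localhost.cfg
Comment out the HTTP service and the linux-servers hostgroup, and adjust Total Processes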
# 2014/1/8 edit by sakana, temp disable HTTP monitor
#define service{
#        use         local-service         ; Name of service template to use
#        host_name                       localhost
#        service_description             HTTP
#       check_command                   check_http
#       notifications_enabled           0
#        }

# Define an optional hostgroup for Linux machines
#
#define hostgroup{
#        hostgroup_name  linux-servers ; The name of the hostgroup
#        alias           Linux Servers ; Long name of the group
#        members         localhost     ; Comma separated list of hosts that belong to this group
#        }


# 2014/1/8 edit by sakana change check_local_procs from 250 to 400, 400 to 800
define service{
       use                             local-service         ; Name of service template to use
       host_name                       localhost
       service_description             Total Processes
       check_command                   check_local_procs!400!800!RSZDT
       }
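In `check_local_procs!400!800!RSZDT` the three arguments are the warning limit, the critical limit and the process states to count. Assuming the stock commands.cfg definition (`check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$`), the check behaves roughly like the sketch below. It only illustrates the threshold logic, it is not the plugin's code, and the function names are hypothetical.

```shell
# Approximation of the RSZDT filter: count processes whose primary state
# letter is R, S, Z, D or T.
count_procs_in_states() {
    ps -e -o stat= | grep -c '^[RSZDT]'
}

# Classify a count against warning/critical limits. With a plain threshold N,
# Nagios plugins alert when the value is above N.
classify_procs() {
    count=$1 warn=$2 crit=$3
    if [ "$count" -gt "$crit" ]; then
        echo CRITICAL
    elif [ "$count" -gt "$warn" ]; then
        echo WARNING
    else
        echo OK
    fi
}
```

Running `classify_procs "$(count_procs_in_states)" 400 800` gives a quick idea of whether the new limits fit a particular host.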

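The above is just an explanation. If, like me, you would rather not edit it by hand, you can download the copy I have already modified
(which is really for my own automation, too).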

# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/localhost.cfg
--2016-12-31 12:21:29--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/localhost.cfg
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5546 (5.4K) [text/plain]
Saving to: ‘localhost.cfg’

100%[=====================================================================================================================>] 5,546       --.-K/s   in 0s      

2016-12-31 12:21:30 (28.9 MB/s) - ‘localhost.cfg’ saved [5546/5546]

localhost.cfg is now in the current directory
# ls
bin  Desktop  Documents  Downloads  inst-sys  localhost.cfg  Music  Pictures  Public  Templates  Videos

Replace /etc/nagios/objects/localhost.cfg with localhost.cfg (a little voice asks: shouldn't you back it up first??)
# mv    localhost.cfg   /etc/nagios/objects/localhost.cfg
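The "back it up first" question can be answered inside the install step itself. A minimal sketch; the function name and the `.bak-TIMESTAMP` suffix are my own choices, not part of the original setup.

```shell
# Move a freshly downloaded config file into place. If the destination file
# already exists, keep a timestamped copy first (e.g. localhost.cfg.bak-20161231122130).
# $2 must be the full destination file path, not just a directory.
install_with_backup() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        cp -p "$dest" "$dest.bak-$(date +%Y%m%d%H%M%S)" || return 1
    fi
    mv "$src" "$dest"
}
```

Usage: `install_with_backup localhost.cfg /etc/nagios/objects/localhost.cfg`.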


Change the notification e-mail address
# vi   /etc/nagios/objects/contacts.cfg
Change the default e-mail address
define contact{
       contact_name    nagiosadmin    ; Short name of user
       use    generic-contact         ; Inherit default values from generic-contact template (defined above)
       alias       Nagios Admin            ; Full name of user
       email  your_account@your_mail_domain ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
       }
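When this step is scripted, the address can be set with sed instead of vi. A minimal sketch, assuming one `email` directive per line as in the stock contacts.cfg; it rewrites every `email` line in the file, and an address containing `/` or `&` would need escaping first. `set_contact_email` is a hypothetical name.

```shell
# Replace the value of every `email` directive in a Nagios contacts file,
# keeping the indentation and any trailing "; comment".
set_contact_email() {
    sed -i "s/^\([[:space:]]*email[[:space:]]\{1,\}\)[^[:space:];]*/\1$2/" "$1"
}
```

Usage: `set_contact_email /etc/nagios/objects/contacts.cfg you@example.com`.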



Check the Nagios configuration for problems
# nagios  -v   /etc/nagios/nagios.cfg

Restart Nagios
# systemctl  restart  nagios.service

The result is as follows:

[Screenshot from 2016-12-31 12-37-21]




Install the nagios-nrpe packages

# zypper  install  nrpe  monitoring-plugins-nrpe

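Download the templates.cfg file I created earlier
# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/templates.cfg
--2016-12-31 16:13:07--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/templates.cfg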
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19833 (19K) [text/plain]
Saving to: ‘templates.cfg’

100%[=====================================================================================================================>] 19,833      --.-K/s   in 0.06s   

2016-12-31 16:13:07 (324 KB/s) - ‘templates.cfg’ saved [19833/19833]

Make sure templates.cfg is in the current directory
# ls
Desktop  Documents  Downloads  Music  Pictures  Public  Templates  Videos  bin  inst-sys  templates.cfg


Move it into place, overwriting the original config file
# mv    templates.cfg   /etc/nagios/objects/templates.cfg

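Download the commands.cfg template I created earlier
# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/commands.cfg
--2016-12-31 16:18:12--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/commands.cfg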
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7876 (7.7K) [text/plain]
Saving to: ‘commands.cfg’

100%[=====================================================================================================================>] 7,876       --.-K/s   in 0s      

2016-12-31 16:18:13 (60.2 MB/s) - ‘commands.cfg’ saved [7876/7876]


Confirm that commands.cfg has been downloaded
# ls
Desktop  Documents  Downloads  Music  Pictures  Public  Templates  Videos  bin  commands.cfg  inst-sys

Move it into place, overwriting the original config file
# mv    commands.cfg    /etc/nagios/objects/commands.cfg

Create directories to hold the config files for servers, ordinary workstations and the other device groups
#mkdir   /etc/nagios/servers
#mkdir   /etc/nagios/pcs
#mkdir   /etc/nagios/racks
#mkdir  /etc/nagios/switches
#mkdir   /etc/nagios/projects
#mkdir   /etc/nagios/labs
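For an automated (rerunnable) setup, `mkdir -p` is handier than plain `mkdir`, which fails if the directory already exists. A minimal sketch; `make_object_dirs` is a hypothetical name and the base directory can be overridden for testing.

```shell
# Create the per-category object directories used in this post.
# mkdir -p does not fail when a directory already exists, so reruns are safe.
make_object_dirs() {
    base=${1:-/etc/nagios}
    for d in servers pcs racks switches projects labs; do
        mkdir -p "$base/$d" || return 1
    done
}
```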

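Fetch the prewritten linuxPublic.cfg and windowsPublic.cfg (used for public services) and move them to /etc/nagios/objects

# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/linuxPublic.cfg
--2016-12-31 16:37:12--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/linuxPublic.cfg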
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2652 (2.6K) [text/plain]
Saving to: ‘linuxPublic.cfg’

100%[=====================================================================================================================>] 2,652       --.-K/s   in 0s      

2016-12-31 16:37:12 (28.7 MB/s) - ‘linuxPublic.cfg’ saved [2652/2652]

# mv   linuxPublic.cfg   /etc/nagios/objects/

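# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/windowsPublic.cfg
--2016-12-31 16:38:37--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/windowsPublic.cfg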
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2345 (2.3K) [text/plain]
Saving to: ‘windowsPublic.cfg’

100%[=====================================================================================================================>] 2,345       --.-K/s   in 0s      

2016-12-31 16:38:37 (23.4 MB/s) - ‘windowsPublic.cfg’ saved [2345/2345]

# mv   windowsPublic.cfg   /etc/nagios/objects/

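Fetch the prewritten switchSimple.cfg and rackHost.cfg (used for switches and rack machines) and move them to /etc/nagios/objects

# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/switchSimple.cfg
--2016-12-31 16:41:58--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/switchSimple.cfg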
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3365 (3.3K) [text/plain]
Saving to: ‘switchSimple.cfg’

100%[=====================================================================================================================>] 3,365       --.-K/s   in 0s      

2016-12-31 16:41:58 (29.6 MB/s) - ‘switchSimple.cfg’ saved [3365/3365]

# mv   switchSimple.cfg   /etc/nagios/objects/

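# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/rackHost.cfg
--2016-12-31 16:43:38--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/rackHost.cfg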
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2716 (2.7K) [text/plain]
Saving to: ‘rackHost.cfg’

100%[=====================================================================================================================>] 2,716       --.-K/s   in 0s      

2016-12-31 16:43:38 (18.2 MB/s) - ‘rackHost.cfg’ saved [2716/2716]

# mv   rackHost.cfg   /etc/nagios/objects/


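Fetch the prewritten windows.cfg (used for Windows services) and move it to /etc/nagios/objects

# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/windows.cfg
--2016-12-31 16:53:25--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/windows.cfg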
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4023 (3.9K) [text/plain]
Saving to: ‘windows.cfg’

100%[=====================================================================================================================>] 4,023       --.-K/s   in 0s      

2016-12-31 16:53:25 (45.9 MB/s) - ‘windows.cfg’ saved [4023/4023]

# mv   windows.cfg    /etc/nagios/objects/


Fetch the prewritten nagios.cfg, whose main change is the use of cfg_dir=, and move it to /etc/nagios

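# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/nagios.cfg
--2016-12-31 16:48:01--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/nagios.cfg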
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44650 (44K) [text/plain]
Saving to: ‘nagios.cfg’

100%[=====================================================================================================================>] 44,650      --.-K/s   in 0.1s    

2016-12-31 16:48:01 (397 KB/s) - ‘nagios.cfg’ saved [44650/44650]

# mv    nagios.cfg   /etc/nagios/
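All of the downloads above follow the same pattern: fetch from one base URL, then move into one of three directories. A minimal sketch of that loop; the base URL comes from the links above (the files there may have changed since this was written), and `dest_dir_for` / `fetch_server_files` are hypothetical names.

```shell
BASE_URL=https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files

# Destination directory for each file name, following the layout in this post.
dest_dir_for() {
    case $1 in
        nagios.cfg) echo /etc/nagios ;;
        nrpe.cfg)   echo /etc ;;
        *)          echo /etc/nagios/objects ;;
    esac
}

# Download every server-side file and move it into place; stop on the first failure.
fetch_server_files() {
    for f in localhost.cfg templates.cfg commands.cfg linuxPublic.cfg \
             windowsPublic.cfg switchSimple.cfg rackHost.cfg windows.cfg nagios.cfg; do
        wget -q -O "$f" "$BASE_URL/$f" || return 1
        mv "$f" "$(dest_dir_for "$f")/" || return 1
    done
}
```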

Check the Nagios configuration for problems
# nagios  -v   /etc/nagios/nagios.cfg
There may be warnings here, because for the rack hosts we only check the host itself, not any services
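Warnings like these do not stop Nagios from starting; errors do. When the restart is scripted, it can be gated on the error count alone. A minimal sketch, assuming the `nagios -v` summary contains a line such as `Total Errors:   0`; `nagios_config_ok` is a hypothetical helper.

```shell
# Read `nagios -v` output on stdin; succeed only if the summary reports 0 errors.
# Warnings are tolerated. Fails if no "Total Errors:" line is found at all.
nagios_config_ok() {
    errors=$(sed -n 's/^Total Errors:[[:space:]]*\([0-9]\{1,\}\).*/\1/p' | tail -n 1)
    [ -n "$errors" ] && [ "$errors" -eq 0 ]
}
```

Usage: `nagios -v /etc/nagios/nagios.cfg | nagios_config_ok && systemctl restart nagios.service`.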

Restart Nagios
# systemctl  restart  nagios.service

The result is as follows:
[Screenshot from 2016-12-31 17-02-14]

Part II: Nagios client -- setting up a Linux server as a client

On the client:

1. Install the nagios-nrpe packages

# zypper  install  nrpe  monitoring-plugins-nrpe monitoring-plugins

2. Configure nagios-nrpe
(It is also a good idea to check that /etc/services contains the entry nrpe 5666/tcp # nagios nrpe)
# grep  5666  /etc/services
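`grep 5666` also matches unrelated lines that merely contain those digits. A minimal sketch that checks for the service name and port together, and appends the entry quoted above if it is missing; the function names are hypothetical.

```shell
# Succeed if the services file maps the name nrpe to 5666/tcp.
has_nrpe_service() {
    grep -Eq '^nrpe[[:space:]]+5666/tcp' "${1:-/etc/services}"
}

# Append the nrpe entry if it is missing (safe to rerun).
ensure_nrpe_service() {
    svc_file=${1:-/etc/services}
    if ! has_nrpe_service "$svc_file"; then
        printf 'nrpe            5666/tcp        # nagios nrpe\n' >> "$svc_file"
    fi
}
```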

Edit the config file to allow connections from the Nagios server (the config file has moved; it is now /etc/nrpe.cfg)
# vi  /etc/nrpe.cfg

allowed_hosts=127.0.0.1,192.168.100.199

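(Adjust the IPs to your environment; it may be something like 10.x.x.x.)
Note that 127.0.0.1 must be followed by a comma, and the list of host IPs must not contain spaces;
otherwise you get errors saying the SSL handshake could not be established.
(This can be explained: when nrpe runs as a standalone SysV-style daemon, a malformed list here always leads to SSL handshake errors, but under xinetd it does not, because NRPE ignores allowed_hosts when it runs under inetd/xinetd.)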
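To avoid the comma/space mistake when this is automated, the list can be joined by a script rather than typed by hand. A minimal sketch; `set_allowed_hosts` is hypothetical, and it only rewrites an existing, uncommented `allowed_hosts=` line.

```shell
# Rewrite the allowed_hosts= line of an nrpe.cfg file from a list of addresses,
# joined with commas and no spaces.
set_allowed_hosts() {
    cfg=$1
    shift
    hosts=$(echo "$*" | tr -s ' ' ',')
    sed -i "s|^allowed_hosts=.*|allowed_hosts=$hosts|" "$cfg"
}
```

Usage: `set_allowed_hosts /etc/nrpe.cfg 127.0.0.1 192.168.100.199`, then restart nrpe.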


Start NRPE
# systemctl   start   nrpe

Check the status (it reports that /run/nrpe/nrpe.pid cannot be written; so far this has not affected queries in my tests, but if you are worried, create /run/nrpe and give the nagios user write permission on it)
# systemctl  status nrpe
● nrpe.service - Daemon to remotely execute Nagios plugins
  Loaded: loaded (/usr/lib/systemd/system/nrpe.service; enabled; vendor preset: disabled)
  Active: active (running) since Sat 2016-12-31 19:17:50 CST; 3min 50s ago
 Process: 3457 ExecStart=/usr/sbin/nrpe -c /etc/nrpe.cfg -d (code=exited, status=0/SUCCESS)
Main PID: 3461 (nrpe)
   Tasks: 1 (limit: 512)
  CGroup: /system.slice/nrpe.service
          └─3461 /usr/sbin/nrpe -c /etc/nrpe.cfg -d

Dec 31 19:17:50 template systemd[1]: Starting Daemon to remotely execute Nagios plugins...
Dec 31 19:17:50 template systemd[1]: Started Daemon to remotely execute Nagios plugins.
Dec 31 19:17:50 template nrpe[3461]: Starting up daemon
Dec 31 19:17:50 template nrpe[3461]: Cannot write to pidfile '/run/nrpe/nrpe.pid' - check your privileges.
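/run is a tmpfs, so a directory created there by hand disappears at the next reboot. One way to make the workaround from the note above persistent is a systemd-tmpfiles rule; this is my own swap-in, not something the post tested. A minimal sketch, assuming NRPE runs as the nagios user and group (check `nrpe_user` / `nrpe_group` in /etc/nrpe.cfg); the rule file name and the helper name are my own choices.

```shell
# Write a systemd-tmpfiles rule that creates /run/nrpe (owned by nagios) at boot.
write_nrpe_tmpfiles_rule() {
    conf=${1:-/etc/tmpfiles.d/nrpe.conf}
    printf 'd /run/nrpe 0755 nagios nagios -\n' > "$conf"
}
```

After writing the rule, `systemd-tmpfiles --create /etc/tmpfiles.d/nrpe.conf` creates the directory right away instead of waiting for a reboot.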

Enable NRPE at boot
# systemctl   enable  nrpe

Confirm it is enabled at boot
# systemctl  is-enabled    nrpe
enabled

On the client:
Run check_nrpe as a test; if it works, it prints the NRPE version
# /usr/lib/nagios/plugins/check_nrpe   -H   127.0.0.1
NRPE v2.15

*************************************************************
On the server:

Test nagios-nrpe against the nagios client; if it works, it prints the NRPE version
# /usr/lib/nagios/plugins/check_nrpe  -H  192.168.100.100
NRPE v2.15

(Make sure the firewall is either turned off or lets nrpe through. For a quick test you can turn the firewall off with # yast2  firewall or with the command # rcSuSEfirewall2  stop )


*************************************************************


On the client:
Add the relevant nrpe commands (there is no hda1 any more, so that line stays commented out)
# vi   /etc/nrpe.cfg
Add:
#command[check_hda1]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/hda1
command[check_sda1]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/sda1
command[check_sda2]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/sda2
command[check_ssh]=/usr/lib/nagios/plugins/check_ssh   127.0.0.1
command[check_smtp]=/usr/lib/nagios/plugins/check_smtp  127.0.0.1
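`check_disk -w 20% -c 10%` means WARNING when less than 20% of the partition is free and CRITICAL below 10%. With more disks, the per-partition lines can be generated instead of typed. A minimal sketch; `nrpe_disk_commands` is a hypothetical name and the plugin directory is the one used above.

```shell
PLUGIN_DIR=/usr/lib/nagios/plugins

# Print one NRPE check_disk command line per partition name, with the same
# thresholds as above: WARNING below 20% free, CRITICAL below 10% free.
nrpe_disk_commands() {
    for part in "$@"; do
        echo "command[check_$part]=$PLUGIN_DIR/check_disk -w 20% -c 10% -p /dev/$part"
    done
}
```

For example `nrpe_disk_commands sda1 sda2 >> /etc/nrpe.cfg`, followed by `systemctl restart nrpe`.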

Alternatively, use the nrpe.cfg I prepared earlier

# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/nrpe.cfg
--2016-12-31 19:49:29--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/nrpe.cfg
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8474 (8.3K) [text/plain]
Saving to: ‘nrpe.cfg’

100%[===========================================================================>] 8,474       --.-K/s   in 0s      

2016-12-31 19:49:29 (53.1 MB/s) - ‘nrpe.cfg’ saved [8474/8474]

# vi   nrpe.cfg
(If your IPs are different, change the IPs after allowed_hosts= !)

Overwrite the original file
# mv   nrpe.cfg   /etc/

Restart
# systemctl   restart  nrpe

Test the commands
# /usr/lib/nagios/plugins/check_nrpe   -H 127.0.0.1   -c   check_sda2
DISK OK - free space: / 12753 MB (69% inode=97%);| /=5661MB;14732;16573;0;18415

# /usr/lib/nagios/plugins/check_nrpe  -H 127.0.0.1  -c  check_users
USERS OK - 2 users currently logged in |users=2;5;10;0

# /usr/lib/nagios/plugins/check_nrpe   -H 127.0.0.1   -c   check_load
OK - load average: 0.00, 0.00, 0.00|load1=0.000;15.000;30.000;0; load5=0.000;10.000;25.000;0; load15=0.000;5.000;20.000;0;

# /usr/lib/nagios/plugins/check_nrpe   -H 127.0.0.1   -c   check_total_procs
PROCS OK: 208 processes | procs=208;250;300;0;

Also run the same tests from the server side
# /usr/lib/nagios/plugins/check_nrpe  -H 192.168.100.100   -c check_sda2
DISK OK - free space: / 12753 MB (69% inode=97%);| /=5661MB;14732;16573;0;18415
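Everything after the `|` in these outputs is performance data in the form `label=value[unit];warn;crit;min;max`. In the disk line, 14732 and 16573 are roughly 80% and 90% of the 18415 MB total, which matches the `-w 20% -c 10%` free-space thresholds. A minimal sketch for separating the two parts in a script; the helper names are hypothetical.

```shell
# Status text: everything before the first "|", trailing spaces removed.
plugin_status_text() {
    printf '%s\n' "$1" | cut -d'|' -f1 | sed 's/[[:space:]]*$//'
}

# Performance data: everything after the first "|" (empty if there is none).
plugin_perfdata() {
    printf '%s\n' "$1" | cut -s -d'|' -f2- | sed 's/^[[:space:]]*//'
}
```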




Part III: Adding the Linux server (Nagios client) to Nagios monitoring

On the server:

Download the prepared linux.cfg template
# wget  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/linux.cfg
--2016-12-31 20:14:49--  https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/linux.cfg
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.100.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.100.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3664 (3.6K) [text/plain]
Saving to: ‘linux.cfg’

100%[===================================================================>] 3,664       --.-K/s   in 0s      

2016-12-31 20:14:49 (19.2 MB/s) - ‘linux.cfg’ saved [3664/3664]

Move it to /etc/nagios/objects
# mv    linux.cfg   /etc/nagios/objects/

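**Configure Nagios to load the Linux client configuration**

Depending on the type of machine, the template configs (such as linux.cfg) are copied into different directories. We created:
  • /etc/nagios/labs - lab machines
  • /etc/nagios/servers - service machines (servers)
  • /etc/nagios/pcs - PCs
  • /etc/nagios/projects - project machines
  • /etc/nagios/racks - racks (IPMI)
  • /etc/nagios/switches - switches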

Suppose there is a Linux server to monitor.
Copy linux.cfg for this Linux server into the /etc/nagios/servers directory
# cp  /etc/nagios/objects/linux.cfg   /etc/nagios/servers/linux100.cfg

Make sure the IP settings are correct
# vi   /etc/nagios/servers/linux100.cfg
  • address         192.168.3.129: change this to the actual IP
  • host_name     suseserver129: change this to the actual host name
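The template's host name also appears in every service block (as the reply in the comments below shows), so each occurrence has to change, not just the host definition. A minimal sketch, assuming the template uses suseserver129 and 192.168.3.129 as its placeholders, as listed above; `new_linux_host` is a hypothetical helper.

```shell
# Placeholders used in the linux.cfg template (assumption, per the notes above).
TEMPLATE_NAME=suseserver129
TEMPLATE_ADDR_RE='192\.168\.3\.129'   # dots escaped for sed

# Create a per-host config from the template, replacing the placeholder host
# name and address everywhere in the file.
new_linux_host() {
    template=$1 dest=$2 name=$3 addr=$4
    sed -e "s/$TEMPLATE_NAME/$name/g" -e "s/$TEMPLATE_ADDR_RE/$addr/g" \
        "$template" > "$dest"
}
```

Usage: `new_linux_host /etc/nagios/objects/linux.cfg /etc/nagios/servers/linux100.cfg linux100 192.168.100.100`, where linux100 is a made-up host name and 192.168.100.100 is the client used in Part II.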

Check the configuration for errors
# nagios   -v   /etc/nagios/nagios.cfg

Restart nagios to apply the changes
# systemctl   restart    nagios

The result is as follows:


[Screenshot from 2016-12-31 21-01-02]

Done!

~ enjoy it





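2 comments:

Unknown said...

Hello, may I ask how to change things from the Nagios monitoring web page, for example how to use the page to silence an alert or acknowledge an alert?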

Max said...

You can refer to
https://raw.githubusercontent.com/sakanamax/LearnAnsible/master/playbook/general/nagios/files/linux.cfg

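Because monitoring works by copying the template for each machine, e.g. linux101.cfg,
if you don't want to monitor something, just remove that item. For example, to stop monitoring the load, comment out the following part: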
define service{
use nrpe-check_load
host_name suseserver129
}

To monitor something else, add the check module you want after use

Hope this helps