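Prometheus Alerting and Monitoring in Practice on Private, Offline Servers

Background

On April 20, 2022, a disk failed on the server hosting Zabbix, our monitoring platform, and Zabbix finally reached the end of its life.

The server has 12 data disks. For historical reasons (laziness, mostly), each data disk was set up as a single-disk RAID 0, and the Zabbix data lived on one of them.

We always had too much else to do, so rebuilding the monitoring platform, a task with no visible output, kept getting postponed.

It dragged on until August (impressive procrastination), when we happened to notice that the data disks on some servers were already full. With Zabbix down, no alert had fired, and the data on those servers was fairly important. We got burned a second time.

Rebuilding the monitoring system could not wait any longer, so I started on it.

Monitoring selection: Zabbix vs Prometheus

Should we keep using the old Zabbix system, or deploy Prometheus, the currently popular choice?

I read up on both. The comparison is below: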

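Zabbix

Configuration is done entirely in the web UI.
Uses an external database for storage.
Alerting and auto-discovery are all in one.
Integrations are available for ClickHouse and Hadoop.
Reading disk I/O requires dedicated configuration. Prometheus does not need this: a single node_exporter collects the information you want.

Prometheus

Configured through configuration files. Downside: writing config files is tedious. Upside: bulk changes and edits are fast, and the files are easy to back up and migrate.
Built-in TSDB; lightweight.
Many exporters are available, which makes exposing metrics easy (for example kafka_exporter for Kafka, which Zabbix does not seem able to monitor, or clickhouse_exporter; more exporters are listed at https://prometheus.io/docs/instrumenting/exporters/ ).
PromQL can be used in Grafana to compute over metrics.
Metric monitoring can be configured in Grafana or in configuration files.
Alerting is handled by the Alertmanager component. Service discovery can be file-based or use Consul.
Custom metrics can be configured.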

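Comparison result

The existing Zabbix no longer meets our monitoring needs.
Our business is complex and our databases are varied, so we need a modern monitoring stack.
We need to monitor not only the Linux systems but also business workflows, databases, and middleware.
Prometheus's node_exporter can collect our custom metrics through its textfile collector.
Alert rules can be written directly on Grafana panels, which is intuitive.
Drawback: multi-dimensional data is not supported.
That is, once one series in a graph has triggered the alert, Grafana does not treat another series reaching the threshold as a new alert.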

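Monitoring topology

10.110.38.1 is the former Zabbix server. Servers are scarce on this project, so only this one machine was allocated for monitoring.

Configuration: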

cpu: 32 核
ram: 128 G
disk:
- / ssd 986G
- /data01 raid5 28T
- /data02 raid5 28T

10.101.235.6 is a server with internet access. It runs the webhook program, which receives alerts and calls a script to send SMS messages.

Components

node_exporter

Startup scripts

CentOS 7 - /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
After=network.target


[Service]
User=nodeusr
Group=nodeusr
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector --collector.systemd --collector.systemd.unit-include=(sshd|supervisor).service --collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run|boot|host|etc)($|/) --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs|rootfs)$ --no-collector.hwmon --no-collector.nfsd

[Install]
WantedBy=multi-user.target

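Notes:

--collector.textfile.directory=/var/lib/node_exporter/textfile_collector sets the directory from which custom metrics are collected. The textfile collector is enabled by default.
--collector.systemd enables systemd monitoring. Unlike textfile, the systemd collector is disabled by default, so it must be enabled explicitly.
--collector.systemd.unit-include=(sshd|supervisor).service selects which systemd services to collect. Without it there are 700+ series, which wastes storage.
--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run|boot|host|etc)($|/) skips mount points that are not worth collecting.
--collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs|rootfs)$ excludes certain filesystem types.
--no-collector.hwmon disables the hwmon collector. hwmon covers server hardware monitoring: power, chips, sensors, temperature. Hardware problems are handled by a dedicated vendor, so we do not need to manage them. On cloud servers you may also consider disabling it.
--no-collector.nfsd disables the nfsd collector.

To see which collectors are enabled by default, or to find others you may want to disable, see: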
https://github.com/prometheus/node_exporter/blob/master/README.md#collectors
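
To confirm which collectors are actually active on a given host, node_exporter's own output can be inspected: each active collector reports a node_scrape_collector_success series. The following is a minimal sketch, assuming node_exporter listens on its default port 9100 on localhost; the helper name list_collectors is made up for illustration.

```shell
#!/bin/bash
# list_collectors (hypothetical helper): read node_exporter metrics text on
# stdin and print the name of every collector that reported a scrape result.
list_collectors() {
    sed -n 's/^node_scrape_collector_success{collector="\([^"]*\)"}.*/\1/p' | sort
}

# Usage (assumes node_exporter is listening on localhost:9100):
#   curl -s http://localhost:9100/metrics | list_collectors
```

After changing the flags in the unit file and restarting the service, hwmon and nfsd should no longer appear in the list, and systemd should.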

CentOS 6 - node_exporter.rhel6.service

Only a few machines run CentOS 6, so the collectors have not been tuned for them yet.

#!/bin/bash
#
# /etc/rc.d/init.d/node_exporter
#
#  Prometheus node exporter
#
#  description: Prometheus node exporter
#  processname: node_exporter

# Source function library.
. /etc/rc.d/init.d/functions

PROGNAME=node_exporter
PROG=/usr/local/bin/$PROGNAME
USER=nodeusr
LOGFILE=/var/log/node_exporter.log
LOCKFILE=/var/run/$PROGNAME.pid

start() {
    echo -n "Starting $PROGNAME: "
    cd /usr/local/bin/
    daemon --user $USER --pidfile="$LOCKFILE" "$PROG &>$LOGFILE &"
    echo $(pidofproc $PROGNAME) >$LOCKFILE
    echo
}

stop() {
    echo -n "Shutting down $PROGNAME: "
    killproc $PROGNAME
    rm -f $LOCKFILE
    echo
}


case "$1" in
    start)
    start
    ;;
    stop)
    stop
    ;;
    status)
    status $PROGNAME
    ;;
    restart)
    stop
    start
    ;;
    reload)
    echo "Sending SIGHUP to $PROGNAME"
    kill -SIGHUP $(pidofproc $PROGNAME)
    ;;
    *)
        echo "Usage: service node_exporter {start|stop|status|reload|restart}"
        exit 1
    ;;
esac

Install script

One-line install

bash <(curl -s http://10.110.38.1/files/install_node_exporter.sh)

The script also supports CentOS 6, since some of our older machines still run it.

#!/bin/bash

if [ "$USER" != "root" ];then
        echo "You must use <root> user to run..."
        exit 1
else
        echo "USER: ROOT"
fi


echo "INFO - check node_exporter if installed already"

service node_exporter status >> /dev/null

if [ $? -eq 0 ];then
  echo "Already installed, exit ..."
  exit 1
else
  echo "NOT install .. install now"
fi

command_exists() {
	command -v "$@" > /dev/null 2>&1
}

# check os version
# learn from https://get.docker.com
get_distribution() {

  if command_exists lsb_release; then
          dist_version="$(lsb_release --release | cut -f2)"
  fi
  if [ -z "$dist_version" ] && [ -r /etc/os-release ]; then
          dist_version="$(. /etc/os-release && echo "$VERSION_ID")"
  fi
  echo "$dist_version"
}



do_install() {
  dist_version=$( get_distribution )

  useradd -rs /bin/false nodeusr
  mkdir -p /var/lib/node_exporter/textfile_collector
  chmod -R 777 /var/lib/node_exporter/
  yum install -y -q wget

  echo "install node_exporter ..."
  wget -q http://10.110.38.1/files/node_exporter
  mv node_exporter /usr/local/bin/
  chmod +x /usr/local/bin/node_exporter

  echo "install node_exporter directory-size.sh"
  wget -q http://10.110.38.1/files/directory-size.sh
  chmod +x directory-size.sh
  mv directory-size.sh /usr/local/bin/

  echo "System dist version: $dist_version"

  dist_version=$( echo "$dist_version" | cut -d'.' -f1)

  echo "install node_exporter service ..."
  if [ "$dist_version" == "7" ]; then

    wget -q http://10.110.38.1/files/node_exporter.service
    mv node_exporter.service /etc/systemd/system/

    systemctl daemon-reload
    systemctl start node_exporter
    systemctl enable node_exporter

  elif [ "$dist_version" == "6" ]; then

    wget -q http://10.110.38.1/files/node_exporter.rhel6.service
    mv node_exporter.rhel6.service /etc/init.d/node_exporter
    chmod +x /etc/init.d/node_exporter

    touch /var/log/node_exporter.log
    chmod 777 /var/log/node_exporter.log

    /etc/init.d/node_exporter start
    # auto start
    # chkconfig --add node_exporter
  
  else

    echo "Unsupported dist version"
    exit 1
  
  fi

}


do_install

kafka_exporter

Our Kafka version is fairly old: 0.10.1.

We use https://github.com/danielqsj/kafka_exporter

After trying several versions, I found that 1.1.0 works with our Kafka.

It is mainly used to monitor Kafka consumer lag.

/etc/systemd/system/kafka_exporter.service

[Unit]
Description=kafka_exporter
After=local-fs.target network-online.target network.target
Wants=local-fs.target network-online.target network.target

[Service]
ExecStart=/opt/kafka_exporter/kafka_exporter --kafka.server=x.x.x.x:6667
Restart=on-failure

[Install]
WantedBy=multi-user.target

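Grafana dashboard: https://grafana.com/grafana/dashboards/7589-kafka-exporter-overview/

Prometheus (no authentication configured)

10.110.38.1:9090

Scrapes the metrics exposed by the exporters.

Alert rules

For reference, see https://awesome-prometheus-alerts.grep.to/rules

Alert priority

severity: critical (high) -> send an SMS
severity: warning (low) -> shown on the karma dashboard

Some of the alert rules we currently use

HostOutOfDiskSpace: the server's disk is almost full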

- alert: HostOutOfDiskSpace
  expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Host out of disk space (instance {{ $labels.instance }})
    description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

PrometheusTargetMissing: a host has disappeared (it may have crashed or lost network connectivity)

- alert: PrometheusTargetMissing
  expr: up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing (instance {{ $labels.instance }})
    description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Here is one more example, a business-specific check:

The /data/input/ directory is larger than 500G

- alert: OutputErrorSizeAbove500G
  expr: node_directory_size_bytes{directory="/data/input/",job="canal_xxxx"}/1024/1024/1024 > 500
  for: 10m
  labels:
    team: node
    severity: critical
  annotations:
    summary: "Canal xxxx  Output Error Size High"
    description: '{{$labels.instance}}: Error Size: {{ $value | printf "%.2f" }}'

printf "%.2f" formats $value with two decimal places.

Caveats

Check that the configuration files are valid before restarting.

config
promtool check config /etc/prometheus/prometheus.yml
rules
promtool check rules /etc/prometheus/rules/*.yml
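
The two checks can be chained so that a restart only happens when both pass. This is a minimal sketch rather than our actual procedure: the function name check_and_restart is made up, and it assumes Prometheus runs as a systemd unit named prometheus.

```shell
#!/bin/bash
# check_and_restart (hypothetical helper): validate the config and the rule
# files, and restart Prometheus only if both checks succeed.
# Assumption: Prometheus is managed by a systemd unit called "prometheus".
check_and_restart() {
    promtool check config /etc/prometheus/prometheus.yml || return 1
    promtool check rules /etc/prometheus/rules/*.yml || return 1
    systemctl restart prometheus
}

# Usage: check_and_restart || echo "validation failed, Prometheus left untouched"
```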

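Consul Agent (no authentication configured)

10.110.38.1:8500

A service registry that provides service discovery for Prometheus, so that monitoring a new host does not require editing the Prometheus configuration file and restarting the service every time.

This component is not in use yet.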
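
Since this part is not in use yet, the following is only a sketch of what registering a host could look like through the Consul agent HTTP API (PUT /v1/agent/service/register). The function names, the service name node_exporter, and the ID scheme are assumptions for illustration; Prometheus would additionally need a consul_sd_configs scrape job pointing at the agent.

```shell
#!/bin/bash
# consul_service_json (hypothetical helper): print a service-registration
# payload for one host. Arguments: <host-address> [port, default 9100].
# The service name "node_exporter" and the ID scheme are our own choice.
consul_service_json() {
    local addr=$1 port=${2:-9100}
    printf '{"ID": "node_exporter-%s", "Name": "node_exporter", "Address": "%s", "Port": %s}\n' \
        "$addr" "$addr" "$port"
}

# register_node (hypothetical helper): send the payload to the Consul agent.
register_node() {
    consul_service_json "$@" |
        curl -s -X PUT --data @- http://10.110.38.1:8500/v1/agent/service/register
}

# Usage: register_node <host-ip>
```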

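Alertmanager (no authentication configured)

http://10.110.38.1:9093/

The alerting component of the Prometheus ecosystem
Mainly receives the alerts sent by Prometheus
Supports a wide range of notification channels
Deduplicates alerts, reduces noise, and groups them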
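Karma - Alertmanager UI (no authentication)

GitHub: https://github.com/prymitive/karma

Alertmanager ships with its own UI for viewing alerts and managing silences, but it lacks some features a dashboard needs, such as alert history. karma helps fill in these gaps in Alertmanager's visualization.

Its predecessor was cloudflare/unsee.

Configuration file:
karma.yaml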

alertmanager:
  interval: 60s
  servers:
    - name: local
      uri: http://<alertmanager-ip>:9093
      timeout: 10s
      proxy: true
      readonly: false
annotations:
  default:
    hidden: false
  hidden:
    - help
  visible: []
debug: false
karma:
  name: karma-prod
listen:
  address: "0.0.0.0"
  port: 8080
  prefix: /
log:
  config: false
  level: info
ui:
  refresh: 30s
  hideFiltersWhenIdle: true
  colorTitlebar: true
  minimalGroupWidth: 420
  alertsPerGroup: 5
  collapseGroups: collapsedOnMobile

Start

nohup karma --config.file /etc/prometheus/karma.yaml &

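Grafana (authentication enabled)

10.110.38.1:3000

Visualizes the metrics collected by Prometheus, and can also host some alert rules.

Version: v8.2.7

Grafana 9 reworked alerting and renamed most of the concepts. I had no time to learn the new model, so I rolled back to 8.

Not every 8.x release uses legacy alerting, though: the last few 8.x releases, such as 8.5.x, also use the rewritten alerting. I ended up downgrading to 8.2.7.

Limitations

Alert queries often return multiple series. Grafana's aggregation functions and threshold checks evaluate every series, but Grafana currently does not track the alert rule state per series, which affects the alerting result. For example:

The alert query returns 2 series: server1 and server2
The server1 series triggers the alert rule, which switches to the alerting state
A notification is sent, for example: load has peaked (server1)
On a later evaluation of the same rule, the server2 series also meets the alert condition
No new notification is sent, because the rule is already in the alerting state
As this scenario shows, once a rule is in the alerting state, Grafana sends no notification when other series also meet the alert condition. Grafana plans to support multi-series queries and to track the state of each series in a future release, so for now this remains a limitation of Grafana alerting.

In practice I also found that the alerts Grafana sends to Alertmanager mix firing and resolved states, which makes them awkward for the webhook script to extract and combine.

Caveats

A Grafana dashboard can be cloned with "Save as".
A dashboard's JSON model can be imported directly, so backing up the JSON backs up the dashboard.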

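Webhook and handler script

Used as the notification channel for Alertmanager.

Alertmanager's built-in receivers cannot invoke a script (there is only webhook), and the maintainers do not want to add this ( https://github.com/prometheus/alertmanager/issues/2046 ), so the only option is to call a script through a webhook.

You can write one yourself with Python Flask, or use an existing tool:
https://github.com/adnanh/webhook
How to configure it:
https://github.com/prometheus/alertmanager/issues/2046#issuecomment-535072123

Or https://github.com/imgix/prometheus-am-executor

I ended up using https://github.com/adnanh/webhook

Configuration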

hooks.yaml

- id: alertmanager
  execute-command: "/home/<user>/alertmanager/sms.sh"
  command-working-directory: "/home/<user>/alertmanager/"

  pass-arguments-to-command:
  - source: payload
    name: status
  - source: payload
    name: alerts

sms.sh

#!/bin/bash

status=$1
alerts=$2

if [ "$status" == "firing" ];then
	status="!!警告!!"
else
	status="xx恢复xx"
fi

alertname=`echo $alerts | jq '.[0]["labels"]["alertname"]'`

title=`echo "[$status] $alertname"`
content=`echo $alerts | jq '.[]["annotations"]["description"]'`

sms=`echo -e "${title}\n${content}"`

phone='136xxxxxxxx'

curl -v -X POST http://ip:8768/sms/cmppSender -F phoneNumbers="$phone" -F "smContent=$sms"

Start webhook

./webhook -hooks hooks.yaml -verbose

What the received SMS looks like ([!!警告!!] marks a firing alert)

[!!警告!!] "PrometheusTargetMissing"
"A Prometheus target has disappeared. An exporter might be crashed.
 VALUE = 0
 LABELS = map[__name__:up instance:sbi193:9100 job:canal_xxxxx]"

The recovery SMS ([xx恢复xx] marks a resolved alert)

[xx恢复xx] "PrometheusTargetMissing"
"A Prometheus target has disappeared. An exporter might be crashed.
 VALUE = 0
 LABELS = map[__name__:up instance:sbi193:9100 job:canal_xxxx]"

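Beyond alerting, you can also write recovery scripts that run when an alert fires (self-healing).
Example: https://www.modb.pro/db/194943

The body of the HTTP POST request that Alertmanager sends to the webhook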

{
  "version": "4",
  "groupKey": <string>,    // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,  // backlink to the Alertmanager.
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    }
  ]
}

Example

{
  "version": "4",
  "groupKey": <string>,    // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": "web.hook",
  "groupLabels": {"alertname":"OutputErrorSize"}, // corresponds to Alertmanager's group_by
  "commonLabels": <object>,
  "commonAnnotations":  {"description":"sbi163:9100: Output Error is above 500G (current value is: 1129.2498016357422)","summary":"sbi163:9100: Canal Worker Output Error High"},
  "externalURL": <string>,  // backlink to the Alertmanager.
  "alerts": [
{
  "annotations": {
    "summary": "/data/input/error/ size alert"
  },
  "endsAt": "0001-01-01T00:00:00Z",
  "fingerprint": "ac7d1a1af47b8b92",
  "generatorURL": "http://10.110.38.1:3000/d/WKkYjbiVk/canal_xxxxx?tab=alert&viewPanel=3&orgId=1",
  "labels": {
    "__name__": "node_directory_size_bytes",
    "alertname": "/data/input/error/ size alert",
    "directory": "/data/input/error/",
    "instance": "sbi139:9100",
    "job": "canal_xxxxx"
  },
  "startsAt": "2022-08-20T18:20:30Z",
  "status": "firing"
},
{
  "annotations": {
    "summary": "/data/input/error/ size alert"
  },
  "endsAt": "0001-01-01T00:00:00Z",
  "fingerprint": "2b0fb28362ff0526",
  "generatorURL": "http://10.110.38.1:3000/d/WKkYjbiVk/canal_xxxxx?tab=alert&viewPanel=3&orgId=1",
  "labels": {
    "__name__": "node_directory_size_bytes",
    "alertname": "/data/input/error/ size alert",
    "directory": "/data/input/error/",
    "instance": "sbi140:9100",
    "job": "canal_xxxxx"
  },
  "startsAt": "2022-08-20T18:20:30Z",
  "status": "firing"
}]}
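
Knowing the payload shape makes it possible to test the hook and sms.sh without involving Alertmanager at all. A minimal sketch, assuming the webhook program on 10.101.235.6 listens on its default port 9000 and serves the hook under /hooks/alertmanager (the id from hooks.yaml); the function names and the sample alert are made up. Note that this really runs sms.sh, so an SMS will be sent.

```shell
#!/bin/bash
# sample_payload (hypothetical helper): a trimmed-down Alertmanager webhook
# body with a single firing alert. Only the fields sms.sh reads are included.
sample_payload() {
    cat <<'EOF'
{"version": "4", "status": "firing", "alerts": [{"status": "firing", "labels": {"alertname": "ManualTest", "instance": "test:9100"}, "annotations": {"description": "manual webhook test"}}]}
EOF
}

# post_sample (hypothetical helper): send the sample to the hook.
# Assumption: webhook runs on 10.101.235.6 with its default port 9000.
post_sample() {
    sample_payload |
        curl -s -X POST -H 'Content-Type: application/json' --data @- \
            http://10.101.235.6:9000/hooks/alertmanager
}

# Usage: post_sample
```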

Monitoring configuration

Directory size monitoring

exporter.sh

#!/bin/bash

LockFile="/var/tmp/exporter.lock"

if [ -f "$LockFile" ]; then
    echo "Compare time:"
    Time=$(date +%s)
    LogTime=$(stat -c %Y "$LockFile")
    if [ $((Time - LogTime)) -lt 900 ]; then
        # The lock is fresh: a previous run is still in progress.
        echo "Another process is running."
        exit 1
    else
        # The lock is older than 15 minutes: assume the previous run is stuck.
        # Remove the lock and kill the other instances, but not this one.
        rm -f "$LockFile"
        for pid in $(pgrep exporter.sh); do
            [ "$pid" != "$$" ] && kill -15 "$pid"
        done
        exit 1
    fi
fi

touch "$LockFile"

# make sure sub-dir put in front
/usr/local/bin/directory-size.sh /data/input/ /data/output/ > /tmp/metrics.prom.$$ && mv /tmp/metrics.prom.$$ /var/lib/node_exporter/textfile_collector/metrics.prom

rm -f "$LockFile"
exit 0

Then schedule exporter.sh as a cron job that runs every 10 minutes.

directory-size.sh comes from the prometheus-community repository https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/blob/master/directory-size.sh

#!/bin/sh
#
# Expose directory usage metrics, passed as an argument.
#
# Usage: add this to crontab:
#
# */5 * * * * prometheus directory-size.sh /var/lib/prometheus | sponge /var/lib/node_exporter/directory_size.prom
#
# sed pattern taken from https://www.robustperception.io/monitoring-directory-sizes-with-the-textfile-collector/
#
# Author: Antoine Beaupré <[email protected]>
echo "# HELP node_directory_size_bytes Disk space used by some directories"
echo "# TYPE node_directory_size_bytes gauge"
du --block-size=1 --summarize "$@" \
  | sed -ne 's/\\/\\\\/;s/"/\\"/g;s/^\([0-9]\+\)\t\(.*\)$/node_directory_size_bytes{directory="\2"} \1/p'

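The upstream usage writes the output file atomically with sponge, a command from the moreutils package.

Our intranet systems do not have this package and I did not bother installing it, so I write to a temporary file and rename it instead. Keep in mind that mv is only atomic when the temporary file and the target are on the same filesystem.

/usr/local/bin/directory-size.sh /data/input/ /data/output/ > /tmp/metrics.prom.$$ && mv /tmp/metrics.prom.$$ /var/lib/node_exporter/textfile_collector/metrics.prom

For details, see https://www.modb.pro/db/150234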

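Custom monitoring

Custom metrics must follow the format that exporters use to expose metrics, and must include comment lines. The comments declare the metric type, which Prometheus needs to know when it scrapes the metric.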

# HELP metric_name help text
# TYPE metric_name metric_type
metric_name{label1="1", label2="2"} value
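
To avoid repeating the HELP and TYPE lines by hand in every script, the format above can be wrapped in a small function. A minimal sketch for gauges only; the name emit_gauge and the sample metric are made up for illustration.

```shell
#!/bin/bash
# emit_gauge (hypothetical helper): print one gauge in the text format.
# Arguments: <name> <help text> <value> [labels, e.g. 'directory="/data"']
emit_gauge() {
    local name=$1 help=$2 value=$3 labels=${4:-}
    echo "# HELP $name $help"
    echo "# TYPE $name gauge"
    if [ -n "$labels" ]; then
        echo "$name{$labels} $value"
    else
        echo "$name $value"
    fi
}

# Usage:
#   emit_gauge demo_size_bytes "Demo size" 42 'directory="/data"'
```

The output can then be redirected to a temporary file and moved into the textfile collector directory, in the same way as the directory size metrics.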

I wrote one myself as an example:
canal-size.sh

#!/bin/sh
#
# Author: Meow-bot  <[email protected]>
LOG_PATH=/app/canal_xxx/log/
#DATA_MODEL=xxxx
day_id=`date +%Y%m%d`
#
#echo "# HELP canal_output_current_day_size_bytes canal output file size only current day"
#echo "# TYPE canal_output_current_day_size_bytes gauge"
#
#cat ${LOG_PATH}/info.log  | grep "${DATA_MODEL}_${day_id}" | awk -F '[:,]' 'BEGIN{sum=0}{sum+=$12}END{print "canal_output_current_day_size_bytes " sum}'
#
#echo "# HELP canal_output_size_bytes canal output file size"
#echo "# TYPE canal_output_size_bytes gauge"
#
#cat ${LOG_PATH}/info.log  | grep "${DATA_MODEL}_" | awk -F '[:,]' 'BEGIN{sum=0}{sum+=$12}END{print "canal_output_size_bytes " sum}'

echo "# HELP canal_error_log_size_bytes canal error log file size"
echo "# TYPE canal_error_log_size_bytes gauge"

ls -l ${LOG_PATH}/error.log  | awk  '{print "canal_error_log_size_bytes " $5}'

echo "# HELP canal_input_size_bytes canal input file size"
echo "# TYPE canal_input_size_bytes gauge"

du -s --block-size=1 /data/input/bak/${day_id} | awk '{print "canal_input_size_bytes " $1}'

exporter.sh

/usr/local/bin/canal-size.sh > /tmp/canal.prom.$$ && mv /tmp/canal.prom.$$ /var/lib/node_exporter/textfile_collector/canal.prom

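Process monitoring

If the process is a systemd service, see https://medium.com/kartbites/process-level-monitoring-and-alerting-in-prometheus-915ed7508058

Process monitoring by reading /proc:
https://github.com/ncabatoff/process-exporter

A hand-written script that pushes to Pushgateway:
https://devconnected.com/monitoring-linux-processes-using-prometheus-and-grafana/

I use the first approach: enable the systemd collector in the node_exporter.service startup flags. It is simple to configure, provided the process you monitor is a systemd service. If it is not, turning it into a service is easy too.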
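
With the systemd collector enabled, each unit is exposed as node_systemd_unit_state series, one per state, where the series for the current state has the value 1. A minimal sketch for checking this by hand on a host, assuming node_exporter listens on localhost:9100; the function name inactive_units is made up. The equivalent alert condition in PromQL would be along the lines of node_systemd_unit_state{state="active"} == 0.

```shell
#!/bin/bash
# inactive_units (hypothetical helper): read node_exporter metrics on stdin
# and print every systemd unit whose state="active" series has the value 0.
inactive_units() {
    grep '^node_systemd_unit_state{' |
        grep 'state="active"' |
        awk '$NF == 0' |
        sed 's/.*name="\([^"]*\)".*/\1/'
}

# Usage (assumes node_exporter is listening on localhost:9100):
#   curl -s http://localhost:9100/metrics | inactive_units
```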

fsimage monitoring

Monitoring the HDFS fsimage:
https://github.com/marcelmay/hadoop-hdfs-fsimage-exporter

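TODO: future plans

1. Prometheus storage configuration needs tuning.
Currently 200+ servers are connected. From August 25 to October 10, storage has grown to 229G. We have 1500+ servers in total; the disk can hold all of them, but queries may get slower, which we need to watch.
2. Hadoop YARN queue resource monitoring.
Via JMX, to be decided.
https://github.com/prometheus/jmx_exporter

[x] 3. Integrate Ambari metrics.
Version requirement: Grafana 4.5.x - 5.x.x, which we do not meet.
GitHub: https://github.com/prajwalrao/ambari-metrics-grafana
We could consider adapting it ourselves.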
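
The storage claim in item 1 can be checked with a rough back-of-the-envelope calculation. This sketch assumes storage grows linearly with the number of servers (same exporters, same scrape interval, same time window), which is a simplification; the function name estimate_storage_gb is made up.

```shell
#!/bin/bash
# estimate_storage_gb (hypothetical helper): scale observed TSDB usage
# linearly to a larger fleet.
# Arguments: <observed GB> <current servers> <target servers>
estimate_storage_gb() {
    awk -v gb="$1" -v cur="$2" -v target="$3" \
        'BEGIN { printf "%.1f\n", gb * target / cur }'
}

# Figures from the text: 229G for roughly 200 servers, scaled to 1500 servers
# over the same time window (August 25 to October 10).
estimate_storage_gb 229 200 1500
```

Under these assumptions the result is on the order of 1.7 TB for the same period, which is well within a 28T data volume, so capacity is not the concern; query speed is.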

Other references

Validating alert rules: https://blog.cloudflare.com/monitoring-our-monitoring/

ClickHouse capacity estimation framework
https://translation.meow.page/post/clickhouse-capacity-estimation-framework/
(Original: https://blog.cloudflare.com/clickhouse-capacity-estimation-framework/ )