1. Requirement

Alert on the business logs ingested into ClickHouse and, once a threshold is reached, send an alert to WeCom (企业微信).

Method 1: fluent-bit -> ClickHouse (HTTP) -> a shell script pulls the aggregated result every minute -> writes the result under /dev/shm/ -> node_exporter's textfile collector exposes the metrics to Prometheus -> alert rules fire when the thresholds are breached -> Alertmanager -> webhook -> WeCom.

Method 2: fluent-bit -> ClickHouse (HTTP) -> a Python script pulls the aggregated result every minute -> Pushgateway -> metrics stored in Prometheus -> alert rules fire when the thresholds are breached -> Alertmanager -> webhook -> WeCom.
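As an illustration of Method 2 only, the minimal Python sketch below queries ClickHouse over its HTTP interface and pushes a gauge to Pushgateway with the prometheus_client library. It is not the script used in this article; the addresses, credentials, database/table names and the site_rate metric shape are assumptions modelled on the shell script in section 5.

# Minimal Method 2 sketch (assumed hosts, credentials and table; metric mirrors section 5).
import requests
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

CH_URL = "http://xx.xx.xx.xx:8123/"   # ClickHouse HTTP interface (assumed address)
PUSHGW = "xx.xx.xx.xx:9091"           # Pushgateway address (assumed)

sql = """
SELECT splitByChar('/', req_path)[2] AS paasid,
       round(sum(if(toInt64(res_statuscode) >= 200 AND toInt64(res_statuscode) < 400, 1, 0))
             / count(1) * 100, 5) AS val
FROM business_logs.access_log          -- database/table names are placeholders
WHERE create_time >= now() - 120 AND create_time < now() - 60
GROUP BY paasid
FORMAT TSV
"""

# Run the aggregation; user/password as URL parameters of the HTTP interface.
resp = requests.post(CH_URL, data=sql,
                     params={"user": "xxxx", "password": "xxxx"}, timeout=10)
resp.raise_for_status()

registry = CollectorRegistry()
gauge = Gauge("site_rate", "Per-site success rate (%)", ["site_path"], registry=registry)
for line in resp.text.strip().splitlines():
    paasid, val = line.split("\t")
    gauge.labels(site_path=f"/{paasid}/").set(float(val))

# One push per run; cron (or a loop with sleep) provides the once-a-minute cadence.
push_to_gateway(PUSHGW, job="site_rate_from_clickhouse", registry=registry)

The rest of this article implements Method 1.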
2. Alerting components

ClickHouse, Prometheus, Alertmanager, node_exporter (with the query shell script) or a Python script (with Pushgateway), and the webhook.
3. ClickHouse setup and table creation

The business log database and table that the queries below run against.
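The article does not show the DDL, so the following is only a sketch of a log table containing the three columns the later queries rely on (req_path, res_statuscode, create_time); the database name, table name and remaining columns are assumptions.

-- Hypothetical log table; only req_path, res_statuscode and create_time
-- are required by the aggregation queries in section 5.
CREATE TABLE business_logs.access_log
(
    create_time     DateTime,
    req_path        String,
    res_statuscode  String,   -- kept as String, hence the toInt64() casts later
    req_method      String,   -- placeholder for the remaining log fields
    client_ip       String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(create_time)
ORDER BY (create_time, req_path);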
4. node_exporter

Add the startup flag --collector.textfile.directory=/dev/shm/ so that node_exporter picks up the .prom files written by the script. The systemd unit:
[Unit]
Description=node_exporter Service
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
WorkingDirectory=/data/node_exporter
ExecStart=/data/node_exporter/node_exporter \
  --web.config.file=/data/node_exporter/etc/config.yml \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc|var/lib/docker/.+|var/lib/kubelet/.+)($|/) \
  --collector.systemd \
  --collector.systemd.unit-include=(docker|sshd|isg|sgadmin).service \
  --web.listen-address=:19100 \
  --collector.textfile.directory=/dev/shm/ \
  --web.telemetry-path=/metrics
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

5. Shell script

Run once a minute via crontab.
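For example, assuming the script is saved as /data/scripts/ch_rate.sh (the path is an assumption), the crontab entry could be:

# assumed script path; runs the aggregation once a minute
* * * * * /data/scripts/ch_rate.sh >/dev/null 2>&1

The script itself: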
#!/usr/bin/env bash
#
# Generate the site_rate / api_rate metrics,
# which are not handled by node_exporter's own collectors.

set -e

# ClickHouse IP
ch_host=xx.xx.xx.xx
# ClickHouse port
ch_port=9000
# ClickHouse user
ch_user=xxxx
# ClickHouse password
ch_password=xxxxxxxxxxxxxxxxxxxx
# ClickHouse database
ch_database=xxxxxxxxxxxxxx
# ClickHouse table
ch_table=xxxxxxxxxxxxx
# Shift the query window back: ingestion is slow, so query the previous minute
query_delay=60

# Per-site aggregation
site_sql="SELECT splitByChar('/', req_path)[2] AS paasid,
    round(sum(if((toInt64(res_statuscode) >= 200) AND (toInt64(res_statuscode) < 400), 1, 0))) AS suc,
    count(1) AS total,
    round(sum(if((toInt64(res_statuscode) >= 200) AND (toInt64(res_statuscode) < 400), 1, 0)) / count(1) * 100, 5) AS val
  FROM ${ch_database}.${ch_table}
  PREWHERE (create_time >= toDateTime(now() - 60 - ${query_delay})) AND (create_time < toDateTime(now() - ${query_delay}))
  GROUP BY paasid
  HAVING total > 5
  ORDER BY val DESC"

SITE_ARRAY=($(docker exec -i ch clickhouse-client --user=${ch_user} --password=${ch_password} --host ${ch_host} --port ${ch_port} -n -m -q "${site_sql}" | tr -d '\r'))
site_num=${#SITE_ARRAY[@]}

cat <<EOS > /dev/shm/site_rate.prom.tmp
# HELP site_rate
# TYPE site_rate gauge
EOS

for ((i=0; i<site_num; i=i+4)); do
  REQ_PATH=${SITE_ARRAY[i]}
  SUC=${SITE_ARRAY[i+1]}
  TOL=${SITE_ARRAY[i+2]}
  VAL=${SITE_ARRAY[i+3]}
  cat <<EOS >> /dev/shm/site_rate.prom.tmp
site_rate{site_path="${REQ_PATH}",suc="${SUC}",total="${TOL}"} ${VAL}
EOS
done
\mv /dev/shm/site_rate.prom.tmp /dev/shm/site_rate.prom

#------------------------------------
# Per-API aggregation
api_sql="SELECT req_path,
    round(sum(if((toInt64(res_statuscode) >= 200) AND (toInt64(res_statuscode) < 400), 1, 0))) AS suc,
    count(1) AS total,
    round(sum(if((toInt64(res_statuscode) >= 200) AND (toInt64(res_statuscode) < 400), 1, 0)) / count(1) * 100, 5) AS val
  FROM ${ch_database}.${ch_table}
  PREWHERE req_path LIKE '/ebus/%' AND (create_time >= toDateTime(now() - 60 - ${query_delay})) AND (create_time < toDateTime(now() - ${query_delay}))
  GROUP BY req_path
  HAVING total > 3
  ORDER BY val DESC"

API_ARRAY=($(docker exec -i ch clickhouse-client --user=${ch_user} --password=${ch_password} --host ${ch_host} --port ${ch_port} -n -m -q "${api_sql}" | tr -d '\r'))
api_num=${#API_ARRAY[@]}

cat <<EOS > /dev/shm/api_rate.prom.tmp
# HELP api_rate
# TYPE api_rate gauge
EOS

for ((i=0; i<api_num; i=i+4)); do
  REQ_PATH=${API_ARRAY[i]}
  SUC=${API_ARRAY[i+1]}
  TOL=${API_ARRAY[i+2]}
  VAL=${API_ARRAY[i+3]}
  cat <<EOS >> /dev/shm/api_rate.prom.tmp
api_rate{api_path="${REQ_PATH}",suc="${SUC}",total="${TOL}"} ${VAL}
EOS
done
\mv /dev/shm/api_rate.prom.tmp /dev/shm/api_rate.prom

Sample script output:
cat /dev/shm/site_rate.prom
# HELP site_rate
# TYPE site_rate gauge
site_rate{site_path="/metrics/",suc="49",total="49"} 100
site_rate{site_path="/grafana/",suc="9",total="9"} 100
site_rate{site_path="/dail_healthcheck/",suc="16",total="16"} 100
site_rate{site_path="/abcyhzx5/",suc="64",total="64"} 100
site_rate{site_path="/abcapm/",suc="30",total="32"} 93.75
site_rate{site_path="/abc/",suc="333",total="370"} 90
site_rate{site_path="/ebus/",suc="2",total="14"} 14.28571

6. Prometheus alert rules
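These rules assume Prometheus already scrapes the node_exporter instance above on port 19100; a minimal scrape_configs sketch (the target address is a placeholder, and any TLS/auth settings depend on the unit's --web.config.file):

scrape_configs:
  - job_name: node_exporter              # exposes site_rate / api_rate via the textfile collector
    static_configs:
      - targets: ['xx.xx.xx.xx:19100']   # placeholder address

The alert rules: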
groups:
  - name: 接口成功率-监控告警
    rules:
      - alert: 接口成功率低于85%
        expr: avg by (api_path, suc, total) (api_rate) < 85
        for: 0m
        labels:
          severity: 一般
          alert: api
        annotations:
          description: "接口成功率低于85%\n(suc:{{$labels.suc}} total:{{$labels.total}})\n成功率:{{printf \"%.0f\" $value}}%"
      - alert: 站点成功率低于85%
        expr: avg by (site_path, suc, total) (site_rate) < 85
        for: 0m
        labels:
          severity: 一般
          alert: api
        annotations:
          description: "站点成功率低于85%\n(suc:{{$labels.suc}} total:{{$labels.total}})\n成功率:{{printf \"%.0f\" $value}}%"
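Before reloading Prometheus, the rule file can be validated with promtool (the file path here is an assumption):

promtool check rules /etc/prometheus/rules/clickhouse_rate.yml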
7. Alertmanager

global:
  resolve_timeout: 1m
  smtp_from: xxxxxxxx@qq.com
  smtp_smarthost: smtp.qq.com:465
  smtp_auth_username: xxxxxxq@qq.com
  smtp_auth_password: XXXXXX
  smtp_require_tls: false
  smtp_hello: qq.com
templates:
  - /etc/alertmanager/email.tmpl   # path of the email template inside the container
route:
  receiver: ding2wechat
  # group alerts by alertname and similar labels
  group_by: [alertname]
  # alerts of the same group arriving within this window are sent together
  group_wait: 1m
  # notification interval for a group
  group_interval: 10m
  # delay before re-sending an unchanged alert; in practice roughly (10+30)m
  repeat_interval: 30m
  routes:
    # match_re can be used for regex matching
    - match:
        severity: 严重
      # if matched, send to the receiver named ding2wechat below
      receiver: ding2wechat
    - match:
        alert: api
      # if matched, send to the receiver named api_ding2wechat below
      receiver: api_ding2wechat
      repeat_interval: 24h
      group_interval: 1m
receivers:
  ## WeCom bot 2: routed through prometheus-webhook-dingtalk, then ding2wechat
  - name: ding2wechat
    webhook_configs:
      - url: http://172.xxx.xxx.xxx:8060/dingtalk/ding2wechat/send
        send_resolved: true
  - name: api_ding2wechat
    webhook_configs:
      # resolved notifications are not needed here
      - url: http://172.xxx.xxx.xxx:8060/dingtalk/ding2wechat/send
        send_resolved: false
  - name: email
    email_configs:
      - to: xxxxxxxx@qq.com
        html: '{{ template "email.jwolf.html" . }}'
        send_resolved: true
# inhibit rule: critical alerts suppress matching warning alerts
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
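The ding2wechat URL above points at a prometheus-webhook-dingtalk instance on port 8060. Its target configuration is not shown in this article; the sketch below is only a guess at what the corresponding config.yml entry might look like, where the target URL (an assumed WeCom group-bot webhook, or whatever ding2wechat endpoint actually sits behind it) and the key are placeholders.

# config.yml of prometheus-webhook-dingtalk (sketch; target name must match the /dingtalk/ding2wechat/send path)
targets:
  ding2wechat:
    url: https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxx   # assumed WeCom bot webhook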