Linux修行之路 - Tech Blog

Using PromQL Queries

May 11, 2022 · 7 min read

https://prometheus.io/docs/prometheus/latest/querying/basics/

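Prometheus provides PromQL (Prometheus Query Language), a functional expression language that lets users select and aggregate time series data in real time. The result of an expression can be shown as a graph, viewed as a table in the Prometheus expression browser, or consumed by external systems through the HTTP API.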

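PromQL data basics

Expression data types

Instant vector: a set of time series, each containing a single sample. For example, node_memory_MemTotal_bytes (total memory) returns only the most recent sample of each matching series. An expression like this is called an instant vector selector, for example: prometheus_http_requests_total

Range vector: a set of time series, each containing all the samples scraped within a given time range, such as the data behind a one-day network traffic trend chart. For example: prometheus_http_requests_total[5m]

Scalar: a single floating-point value. node_load1 returns an instant vector, but the built-in scalar() function converts a single-element instant vector into a scalar, for example: scalar(sum(node_load1))

String: a string value, currently rarely used.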

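Metric types

Counter

Counter: a cumulative metric that only increases unless it is reset (for example when the process restarts). Typical examples are total disk I/O operations, total nginx requests, and total packets through a network interface.

Gauge

Gauge: a metric whose value can go up or down at any time, such as bandwidth rate, CPU load, memory utilization, or nginx active connections.

For example, querying node_load1 in the Graph tab shows the corresponding chart.

Histogram

Histogram: a cumulative histogram. It samples observations (usually request durations or response sizes) and counts them in buckets. Each bucket carries an upper bound in its le label and counts every observation less than or equal to that bound, so the buckets are cumulative. As an analogy, suppose the number of active connections is recorded once a minute, giving 1440 samples a day, and the buckets are drawn 2 hours apart: the 2 o'clock bucket holds the data from 0 to 2 o'clock, the 4 o'clock bucket holds the data from 0 to 4 o'clock, and the 6 o'clock bucket holds the data from 0 to 6 o'clock. In a real histogram the bucket boundaries are observed values (le), not clock times.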
  prometheus_tsdb_compaction_chunk_range_seconds_bucket
    # TYPE go_gc_heap_frees_by_size_bytes_total histogram
    go_gc_heap_frees_by_size_bytes_total_bucket{le="8.999999999999998"} 1.489766e+06
    go_gc_heap_frees_by_size_bytes_total_bucket{le="24.999999999999996"} 3.1621269e+07
    go_gc_heap_frees_by_size_bytes_total_bucket{le="64.99999999999999"} 3.9805887e+07
    go_gc_heap_frees_by_size_bytes_total_bucket{le="144.99999999999997"} 4.5192143e+07
    go_gc_heap_frees_by_size_bytes_total_bucket{le="320.99999999999994"} 4.5981675e+07
    go_gc_heap_frees_by_size_bytes_total_bucket{le="704.9999999999999"} 4.6231281e+07
    go_gc_heap_frees_by_size_bytes_total_bucket{le="1536.9999999999998"} 4.6293065e+07
    go_gc_heap_frees_by_size_bytes_total_bucket{le="3200.9999999999995"} 4.6357758e+07
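
Because the le buckets are cumulative, the number of observations that fall inside a single bucket is the difference between that bucket and the one before it. A minimal Python sketch of this, using the first four sample values from the output above (the function and variable names are illustrative and not part of Prometheus):

```python
# Cumulative bucket counts copied from the sample output above: (upper bound "le", count).
cumulative = [
    (8.999999999999998, 1489766),
    (24.999999999999996, 31621269),
    (64.99999999999999, 39805887),
    (144.99999999999997, 45192143),
]


def per_bucket(buckets):
    """Turn cumulative (le, count) pairs into the count that falls inside each bucket only."""
    result = []
    previous = 0
    for le, count in buckets:
        result.append((le, count - previous))
        previous = count
    return result


counts = per_bucket(cumulative)
for le, n in counts:
    print(f"le<={le}: {n}")
```

Adding the per-bucket counts back together gives the last cumulative value again, which is a quick way to check that the buckets were read correctly.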

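Summary

Summary: also a group of series, but instead of counting observations per bucket it reports quantiles. The quantile label ranges from 0 to 1, meaning 0% to 100%. The sample below shows the observed value at the 0, 0.25, 0.5, 0.75, and 1 quantiles, followed by the sum and count of all observations.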
  go_gc_duration_seconds
    # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 1.8479e-05
    go_gc_duration_seconds{quantile="0.25"} 6.5059e-05
    go_gc_duration_seconds{quantile="0.5"} 9.3605e-05
    go_gc_duration_seconds{quantile="0.75"} 0.000133103 # 75% of GC pauses took no longer than this
    go_gc_duration_seconds{quantile="1"} 0.004022673
    go_gc_duration_seconds_sum 1.446781088
    go_gc_duration_seconds_count 7830
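
The _sum and _count series of a summary can be combined into an average, which the quantiles alone do not give. A small Python sketch using the two values from the output above (the helper name is made up for illustration):

```python
# Values copied from the sample output above.
gc_duration_seconds_sum = 1.446781088
gc_duration_seconds_count = 7830


def mean_duration(total_seconds, count):
    """Average observation; returns 0.0 when nothing has been observed yet."""
    return total_seconds / count if count else 0.0


mean = mean_duration(gc_duration_seconds_sum, gc_duration_seconds_count)
print(f"mean GC pause: {mean * 1e6:.1f} microseconds")
```

In PromQL the same idea is usually written as the ratio of the two series, for example go_gc_duration_seconds_sum / go_gc_duration_seconds_count.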

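PromQL: metric data

node_memory_MemTotal_bytes # total memory of each node
node_memory_MemFree_bytes # free memory of each node
node_memory_MemTotal_bytes{instance="192.168.15.100:9100"} # total memory of one node, selected by label
node_memory_MemFree_bytes{instance="192.168.15.100:9100"}  # free memory of one node, selected by label

node_disk_io_time_seconds_total{device="sda"} # cumulative seconds the given disk has spent on I/O (a counter; wrap it in rate() for a per-second value)
node_filesystem_free_bytes{device="/dev/sda1",fstype="xfs", mountpoint="/"} # free space on the given filesystem
# A single label is enough to narrow the query, for example node_filesystem_free_bytes{device="/dev/sda1"}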

# HELP node_load1 1m load average. # CPU load
# TYPE node_load1 gauge
node_load1 0.1
# HELP node_load15 15m load average.
# TYPE node_load15 gauge
node_load15 0.17
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5 0.13

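PromQL: matchers

=  : selects labels exactly equal to the provided string (exact match).
!= : selects labels not equal to the provided string (negation).
=~ : selects labels that match the provided regular expression.
!~ : selects labels that do not match the provided regular expression.

Regular expressions are fully anchored: they must match the whole label value, so add .* where a partial match is intended.

# Query format: <metric name>{<label name>=<label value>, ...}
node_load1{instance="192.168.15.100:9100"}
node_load1{job="promethues-node"}

node_load1{job="promethues-node",instance="192.168.15.100:9100"} # exact match
node_load1{job="promethues-node",instance!="192.168.15.100:9100"} # negation
node_load1{instance=~"192.168.15.100.*:9100$"}  # regex match
node_load1{instance!~"192.168.15.100:9100"}  # negated regex match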
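
The four matchers can be mimicked in a few lines of Python. This is only a sketch of the semantics: Prometheus uses RE2 syntax while Python's re module differs in some details, and the matches() helper is hypothetical. It does show the full anchoring of regex matchers.

```python
import re


def matches(labels, name, op, value):
    """Evaluate one PromQL-style label matcher against a dict of labels."""
    actual = labels.get(name, "")  # a missing label behaves like an empty string
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # the whole value must match
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")


series = {"job": "promethues-node", "instance": "192.168.15.100:9100"}

print(matches(series, "instance", "=~", "192.168.15.100"))    # no port in the pattern
print(matches(series, "instance", "=~", "192.168.15.100.*"))  # .* covers the port
```

The first call does not match because the pattern leaves out the port, while the second one does.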

PromQL: time ranges

s - seconds
m - minutes
h - hours
d - days
w - weeks
y - years

# Instant vector selector: returns the most recent sample
node_memory_MemTotal_bytes{}

# Range vector selector: node_memory_MemTotal_bytes samples of all nodes over the last 5 minutes, counted back from the evaluation time
node_memory_MemTotal_bytes{}[5m]

# Range vector selector: the same 5-minute window for one node only
node_memory_MemTotal_bytes{instance="192.168.15.100:9100"}[5m]
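
To make the units concrete, the sketch below converts a single-unit duration such as 5m into seconds. It assumes the fixed lengths PromQL uses (a day is 24h, a week is 7d, a year is 365d) and handles only one number followed by one unit; the function name is illustrative.

```python
import re

# Seconds per unit; PromQL treats d as 24h, w as 7d and y as 365d.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 7 * 86400, "y": 365 * 86400}


def duration_seconds(text):
    """Convert a simple single-unit duration such as '5m' into seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


print(duration_seconds("5m"))
```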

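PromQL: operators

+ addition
- subtraction
* multiplication
/ division
% modulo
^ power (exponentiation)

node_memory_MemFree_bytes/1024/1024 # convert free memory from bytes to MiB
node_disk_read_bytes_total{device="sda"} + node_disk_written_bytes_total{device="sda"}
# total bytes read from and written to the disk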

PromQL: aggregation

max, min, avg

max() # maximum
min() # minimum
avg() # average

# Largest receive counter value per node:
max(node_network_receive_bytes_total) by (instance)

# Highest receive rate over the last five minutes, per device:
max(rate(node_network_receive_bytes_total[5m])) by (device)

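sum, count

sum() # adds up the sample values (total)
sum(prometheus_http_requests_total)
{}  2495 # 2495 requests in total; use sum() to total the returned values (such as HTTP request counts)

count() # counts the number of returned series
count(node_os_version)
{}  2  # two series returned; useful for counting nodes, pods, and so on

count_values()  # counts how many series share each distinct sample value
count_values("node_version",node_os_version)  # how many nodes run each OS version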

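abs, absent

abs()  # returns the absolute value of each sample
abs(sum(prometheus_http_requests_total{handler="/metrics"}))

absent()  # returns nothing if the metric has data and 1 if it has none; useful for alerting on missing metrics
absent(sum(prometheus_http_requests_total{handler="/metrics"}))

# When a metric returns no data, absent() returns 1, shown as: {}   1
# For example, absent(sum(prometheus_http_requests_total{handler="/metricsAAA"}))
# matches no series, so it returns {}    1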

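stddev, stdvar

stddev() # standard deviation
stddev(prometheus_http_requests_total)  # 5+5=10 and 1+9=10 give the same total, but 1 and 9 are spread much wider; a large deviation means the data fluctuates and the system is unstable

stdvar()  # variance
stdvar(prometheus_http_requests_total)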
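
The 5+5 versus 1+9 comparison can be checked with Python's statistics module. The sketch uses the population formulas, which is how PromQL defines stddev and stdvar; the two lists are the made-up values from the comment above.

```python
import statistics

steady = [5, 5]  # same total as bursty, no spread
bursty = [1, 9]  # same total, wide spread

for name, values in (("steady", steady), ("bursty", bursty)):
    deviation = statistics.pstdev(values)    # population standard deviation
    variance = statistics.pvariance(values)  # population variance
    print(f"{name}: sum={sum(values)} stddev={deviation} stdvar={variance}")
```

Both groups add up to 10, but only the second one has a non-zero deviation.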

topk, bottomk

topk() # the N series with the largest sample values
# the 6 largest
topk(6, prometheus_http_requests_total)

bottomk()  # the N series with the smallest sample values
# the 6 smallest
bottomk(6, prometheus_http_requests_total)

rate, irate

rate()  # meant for counters: the average per-second rate of increase over the given time range
rate(prometheus_http_requests_total[5m])
rate(node_network_receive_bytes_total[5m])

irate() # meant for counters: the instantaneous per-second rate, computed from the last two samples in the range, so it reacts quickly to spikes
irate(prometheus_http_requests_total[5m])
irate(node_network_receive_bytes_total[5m])
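
The difference between the two functions is easiest to see on a small series. The sketch below is a simplified model, not Prometheus's implementation: the real rate() also extrapolates to the edges of the window, which is left out here. The sample data and function names are hypothetical.

```python
def simple_rate(samples):
    """Average per-second increase over the whole window, tolerating counter resets."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole new value counts.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    delta = v2 - v1 if v2 >= v1 else v2
    return delta / (t2 - t1)


# Hypothetical counter scraped every 15s: steady traffic, then a burst in the last interval.
samples = [(0, 100), (15, 115), (30, 130), (45, 145), (60, 460)]

print(simple_rate(samples), simple_irate(samples))
```

The burst in the last interval dominates the irate-style result, while the rate-style result averages it over the full minute. That is why rate() tends to suit alerting and slow-moving graphs, and irate() suits zoomed-in graphs of volatile counters.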

by, without

# by: keep only the labels listed in by and drop all others from the result
sum(rate(node_network_receive_packets_total{instance=~".*"}[10m])) by (instance)
sum(node_memory_MemFree_bytes) by (instance) # a gauge is summed directly; rate() is for counters

# without: drop the listed instance and job labels from the result and keep the rest
sum(prometheus_http_requests_total) without (instance,job)
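
Both modifiers only decide which labels form the grouping key. A minimal Python sketch of that grouping, with made-up series and helper names (it ignores details such as the metric name being dropped):

```python
from collections import defaultdict


def sum_by(series, keep):
    """sum(...) by (keep): group on the listed labels only."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in keep)
        groups[key] += value
    return dict(groups)


def sum_without(series, drop):
    """sum(...) without (drop): group on every label except the listed ones."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((n, v) for n, v in labels.items() if n not in drop))
        groups[key] += value
    return dict(groups)


# Hypothetical samples; the label values are made up for illustration.
series = [
    ({"instance": "a:9090", "job": "prometheus", "handler": "/metrics"}, 10.0),
    ({"instance": "a:9090", "job": "prometheus", "handler": "/api"}, 5.0),
    ({"instance": "b:9090", "job": "prometheus", "handler": "/metrics"}, 7.0),
]

print(sum_by(series, ["instance"]))
print(sum_without(series, {"instance", "job"}))
```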

Prometheus storage with VictoriaMetrics

May 11, 2022 · 9 min read

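Prometheus local storage

By default, Prometheus stores scraped data in a local TSDB. The default path is the data directory under the Prometheus installation directory. Incoming samples are first appended to the WAL (write-ahead log) and kept in memory. After 2 hours the in-memory data is persisted as a new block, newly scraped data keeps going into memory, and 2 hours later it is persisted as another block, and so on.

Each block is a directory under data whose name starts with 01, for example: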

[root@yuan ~]# ls -l /apps/prometheus/data/
total 20
drwxr-xr-x 3 root root    68 May  8 13:00 01G2H0KYPSE8MZATVBBG9KPJME
drwxr-xr-x 3 root root    68 May 10 16:08 01G2PG73ZNQ0G1ZGD0EGDV2VMD
drwxr-xr-x 3 root root    68 May 10 16:08 01G2PG740MHR9JC7V772CYARAR
drwxr-xr-x 3 root root    68 May 10 16:08 01G2PG742VRGMXC2Z784YNKW29

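Block characteristics

Prometheus compacts and merges historical blocks and deletes expired ones, so the number of blocks shrinks over time. Three things happen during compaction:

Compaction runs periodically

Small blocks are merged into larger blocks

Expired blocks are cleaned up

Each block consists of 4 parts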

[root@yuan ~]# tree /apps/prometheus/data/01G2H0KYPSE8MZATVBBG9KPJME/
/apps/prometheus/data/01G2H0KYPSE8MZATVBBG9KPJME/
├── chunks
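│   └── 000001 # chunk data; each segment file is at most 512MB, larger data is split across several files
├── index  # index file; records index information for the stored data and locates series through several tables inside the file
├── meta.json  # block metadata: sample count, start and end time of the data, compaction history
└── tombstones  # deletion markers; records samples marked as deleted so that queries on the block can exclude them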

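Local storage flags

--config.file="prometheus.yml"  # configuration file
--web.listen-address="0.0.0.0:9090"  # listen address
--storage.tsdb.path="data/"  # data directory

--storage.tsdb.retention.size=  # maximum total size of stored blocks; units B, KB, MB, GB, TB, PB, EB; default 0 (disabled)
--storage.tsdb.retention.time=  # how long to keep data, default 15 days

--query.timeout=2m  # maximum time a query may run before it is aborted
--query.max-concurrency=20  # maximum number of concurrent queries

--web.read-timeout=5m  # maximum time before a request read times out and idle connections are closed
--web.max-connections=512  # maximum number of simultaneous connections
--web.enable-lifecycle  # enable shutdown and configuration reload over HTTP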

Remote storage: VictoriaMetrics

https://github.com/VictoriaMetrics/VictoriaMetrics

https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html

Single-node deployment

# Download the release archive (choose the build that matches your CPU architecture)
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.71.0/victoria-metrics-arm-v1.71.0.tar.gz

tar xvf victoria-metrics-arm-v1.71.0.tar.gz

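Flags:

-httpListenAddr=0.0.0.0:8428 # listen address and port

-storageDataPath # directory where VictoriaMetrics stores all data; defaults to victoria-metrics-data under the directory VictoriaMetrics is started from

-retentionPeriod # data retention; older data is deleted automatically. The default is 1 month and the default unit is m (month); h (hour), d (day), w (week), and y (year) are also supported

Create the systemd service file

mv victoria-metrics-prod /usr/local/bin

cat /etc/systemd/system/victoria-metrics-prod.service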

[Unit]
Description=For Victoria-metrics-prod Service
After=network.target

[Service]
ExecStart=/usr/local/bin/victoria-metrics-prod -httpListenAddr=0.0.0.0:8428 -storageDataPath=/data/victoria -retentionPeriod=3

[Install]
WantedBy=multi-user.target

Start the service and enable it at boot

systemctl daemon-reload
systemctl restart victoria-metrics-prod.service
systemctl enable victoria-metrics-prod.service

Verify: open 192.168.15.100:8428 and check the data

Prometheus configuration

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

remote_write:
  - url: http://192.168.15.100:8428/api/v1/write

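Restart Prometheus, then open 192.168.15.100:8428 again to confirm that data is arriving

Grafana configuration

Add a data source:

Type Prometheus, with the VictoriaMetrics address and port as the URL: http://192.168.15.100:8428

Import the dashboard template by ID:

8919

Official docker-compose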

https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster/deployment/docker

git clone https://github.com/VictoriaMetrics/VictoriaMetrics.git
cd VictoriaMetrics/deployment/docker

[root@yuan docker]# ls -l
total 44
-rw-r--r-- 1 root root    61 May 10 19:32 alertmanager.yml
-rw-r--r-- 1 root root 17025 May 10 19:32 alerts.yml
drwxr-xr-x 2 root root    24 May 10 19:32 base
drwxr-xr-x 2 root root    24 May 10 19:32 builder
-rw-r--r-- 1 root root  2843 May 10 19:32 docker-compose.yml
-rw-r--r-- 1 root root  6280 May 10 19:32 Makefile
-rw-r--r-- 1 root root   298 May 10 19:32 prometheus.yml
drwxr-xr-x 4 root root    43 May 10 19:32 provisioning
-rw-r--r-- 1 root root  1495 May 10 19:32 README.md

docker-compose up -d

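# Verify the web UI
192.168.15.100:8428

Cluster deployment

Components

vminsert # write component. It accepts incoming data and spreads it across the backend vmstorage nodes using a consistent hash of the metric name and all of its labels. Default port 8480

vmstorage # stores the raw data and returns the data matching the given label filters over the given time range. Default port 8482

vmselect # query (read) component; connects to vmstorage. Default port 8481

Other optional components:

vmagent # a lightweight but capable agent. It collects metrics from sources such as node_exporter and writes them to VictoriaMetrics or any other Prometheus-compatible storage that supports the remote write protocol. It is positioned as a replacement for the scraping role of Prometheus server

vmalert # takes over the alerting role of Prometheus server: it uses VictoriaMetrics as the data source, evaluates Prometheus-compatible alerting rules, and sends the resulting notifications to Alertmanager

vmgateway # proxy gateway for reading and writing VictoriaMetrics data; provides rate limiting and access control. Currently an enterprise component

vmctl # the VictoriaMetrics command-line tool, currently used mainly to migrate data from sources such as Prometheus and OpenTSDB into VictoriaMetrics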

Deploy the cluster

Install and configure the following on every VictoriaMetrics server

wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.71.0/victoria-metrics-amd64-v1.71.0-cluster.tar.gz

tar xvf victoria-metrics-amd64-v1.71.0-cluster.tar.gz

[root@yuan victoria]# ls -l
total 35016
-rwxr-xr-x 1 yuan yuan 11312016 Dec 21 01:33 vminsert-prod
-rwxr-xr-x 1 yuan yuan 13026872 Dec 21 01:33 vmselect-prod
-rwxr-xr-x 1 yuan yuan 11512464 Dec 21 01:33 vmstorage-prod

mv vminsert-prod vmselect-prod vmstorage-prod /usr/local/bin

# Main flags (vmstorage)
-httpListenAddr string
    Address to listen for http connections (default ":8482")
-vminsertAddr string
    TCP address to accept connections from vminsert services (default ":8400")
-vmselectAddr string
    TCP address to accept connections from vmselect services (default ":8401")

Deploy the vmstorage-prod component

vmstorage persists the data. Ports: 8482 for the HTTP API, 8400 for writes from vminsert, 8401 for reads from vmselect

vim /etc/systemd/system/vmstorage.service

[Unit]
Description=Vmstorage Server
After=network.target

[Service]
Restart=on-failure
WorkingDirectory=/tmp
ExecStart=/usr/local/bin/vmstorage-prod -loggerTimezone Asia/Shanghai -storageDataPath /data/vmstorage-data -httpListenAddr :8482 -vminsertAddr :8400 -vmselectAddr :8401

[Install]
WantedBy=multi-user.target

Start the service and enable it at boot

systemctl daemon-reload
systemctl restart vmstorage.service
systemctl enable vmstorage.service
systemctl status vmstorage.service

Configure the other two servers

# Copy the unit file and binaries to the other two servers
scp /etc/systemd/system/vmstorage.service 192.168.15.101:/etc/systemd/system/vmstorage.service

scp /etc/systemd/system/vmstorage.service 192.168.15.102:/etc/systemd/system/vmstorage.service

scp /usr/local/bin/vm* 192.168.15.101:/usr/local/bin/
scp /usr/local/bin/vm* 192.168.15.102:/usr/local/bin/

# 101
systemctl daemon-reload
systemctl restart vmstorage.service && systemctl enable vmstorage.service
systemctl status vmstorage.service

# 102
systemctl daemon-reload
systemctl restart vmstorage.service && systemctl enable vmstorage.service
systemctl status vmstorage.service

Deploy the vminsert-prod component

vminsert accepts external write requests. Default port 8480

vim /etc/systemd/system/vminsert.service

[Unit]
Description=Vminsert Server
After=network.target

[Service]
Restart=on-failure
WorkingDirectory=/tmp
ExecStart=/usr/local/bin/vminsert-prod -httpListenAddr :8480 -storageNode=192.168.15.100:8400,192.168.15.101:8400,192.168.15.102:8400

[Install]
WantedBy=multi-user.target

Start the service and enable it at boot

systemctl daemon-reload
systemctl restart vminsert.service && systemctl enable vminsert.service
systemctl status vminsert.service

Configure the other two servers

scp /etc/systemd/system/vminsert.service 192.168.15.101:/etc/systemd/system/vminsert.service
scp /etc/systemd/system/vminsert.service 192.168.15.102:/etc/systemd/system/vminsert.service

systemctl daemon-reload
systemctl restart vminsert.service && systemctl enable vminsert.service
systemctl status vminsert.service

Deploy the vmselect-prod component

vmselect accepts external read requests. Default port 8481

vim /etc/systemd/system/vmselect.service

[Unit]
Description=Vmselect Server
After=network.target

[Service]
Restart=on-failure
WorkingDirectory=/tmp
ExecStart=/usr/local/bin/vmselect-prod -httpListenAddr :8481 -storageNode=192.168.15.100:8401,192.168.15.101:8401,192.168.15.102:8401

[Install]
WantedBy=multi-user.target

Start the service and enable it at boot

systemctl daemon-reload
systemctl restart vmselect.service && systemctl enable vmselect.service
systemctl status vmselect.service

Configure the other two servers

scp /etc/systemd/system/vmselect.service 192.168.15.101:/etc/systemd/system/vmselect.service
scp /etc/systemd/system/vmselect.service 192.168.15.102:/etc/systemd/system/vmselect.service

systemctl daemon-reload
systemctl restart vmselect.service && systemctl enable vmselect.service
systemctl status vmselect.service

Verify the service ports

#192.168.15.100
curl http://192.168.15.100:8480/metrics
curl http://192.168.15.100:8481/metrics
curl http://192.168.15.100:8482/metrics

#192.168.15.101
curl http://192.168.15.101:8480/metrics
curl http://192.168.15.101:8481/metrics
curl http://192.168.15.101:8482/metrics

#192.168.15.102
curl http://192.168.15.102:8480/metrics
curl http://192.168.15.102:8481/metrics
curl http://192.168.15.102:8482/metrics

Configure Prometheus remote write

vim /apps/prometheus/prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Single-node write
#remote_write:
#  - url: http://192.168.15.100:8428/api/v1/write

# Cluster write
# Note: every remote_write entry receives the full data stream, so listing all three vminsert nodes sends each sample three times
remote_write:
  - url: http://192.168.15.100:8480/insert/0/prometheus
  - url: http://192.168.15.101:8480/insert/0/prometheus
  - url: http://192.168.15.102:8480/insert/0/prometheus

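Grafana data source configuration

https://github.com/VictoriaMetrics/VictoriaMetrics#grafana-setup

Add a data source

In Grafana settings, add a data source under Data Sources:

Name: vmselect, URL: http://192.168.15.102:8481/select/0/prometheus

Import the dashboard template by ID:

13824

After the import, open the dashboard to check the data

Enable data replication

https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#replication-and-data-safety

By default, vminsert uses a hash to spread data across the vmstorage nodes, so each sample is stored on only one node. Passing -replicationFactor=N to vminsert makes it store N copies of every sample on distinct vmstorage nodes, which keeps the data available when a node fails.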
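
The idea of hashing a series to one node and then copying it to further nodes can be sketched as follows. This is a generic illustration under simple assumptions (SHA-256, modulo placement, neighbouring nodes as replicas) and is not VictoriaMetrics's actual placement algorithm; pick_nodes() is a hypothetical helper.

```python
import hashlib


def pick_nodes(series_key, nodes, replication_factor=1):
    """Choose `replication_factor` distinct nodes for a series, starting at a hashed position."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication factor must be between 1 and the node count")
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["192.168.15.100:8400", "192.168.15.101:8400", "192.168.15.102:8400"]
key = 'node_load1{instance="192.168.15.100:9100"}'

print(pick_nodes(key, nodes))                        # one copy, as with the default
print(pick_nodes(key, nodes, replication_factor=2))  # two copies on distinct nodes
```

With a factor of 2, losing any single storage node still leaves one copy of every series.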

Alertmanager alerting configuration

May 10, 2022 · 5 min read

Deployment is covered in the binary installation post

Email notifications

Configure and start Alertmanager

global:
  resolve_timeout: 5m
  smtp_from: 'xxxx@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: 'xxxx@qq.com'
  smtp_auth_password: 'uukxxxxdvnxzbiaf'
  smtp_require_tls: false
  smtp_hello: '@qq.com'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 5m
  receiver: 'web.hook'
#  #receiver: 'default-receiver' # all other alerts go to default-receiver
#  routes: # send critical alerts to myalertname
#  - receiver: myalertname
#  group_wait: 10s
receivers:
  - name: 'web.hook'
#    webhook_configs:
#      - url: 'http://127.0.0.1:5001/'
    email_configs:
    - to: 'xxxx@qq.com'
inhibit_rules:
  - source_match: # source alert: while an alert matching this is firing, warning alerts with the same 'alertname', 'dev', and 'instance' values are inhibited
      severity: 'critical' # severity of the source alert
    target_match:
      severity: 'warning' # target alerts with severity 'warning' are the ones inhibited
    equal: ['alertname', 'dev', 'instance']
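
The inhibit rule above can be read as a small predicate: a warning alert is muted when a critical alert is firing and both carry the same values for the labels listed in equal. The Python sketch below models that logic only; the alert dictionaries and the is_inhibited() helper are made up for illustration and are not Alertmanager code.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` should be muted by some firing source alert."""
    def has(labels, wanted):
        return all(labels.get(k) == v for k, v in wanted.items())

    if not has(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not has(source, source_match):
            continue
        # Labels missing on both sides compare as equal (both are treated as empty).
        if all(source.get(k, "") == target.get(k, "") for k in equal):
            return True
    return False


critical = {"alertname": "DiskUsage", "instance": "192.168.15.100:9100", "severity": "critical"}
warning = {"alertname": "DiskUsage", "instance": "192.168.15.100:9100", "severity": "warning"}
other = {"alertname": "DiskUsage", "instance": "192.168.15.101:9100", "severity": "warning"}
firing = [critical, warning, other]

rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname", "dev", "instance"],
)

print(is_inhibited(warning, firing, **rule), is_inhibited(other, firing, **rule))
```

Neither alert has a dev label, so that label compares as equal on both sides. The warning on a different instance is not inhibited.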

Configure Prometheus alerting rules

# Create the rules directory
mkdir /apps/prometheus/rules && cd /apps/prometheus/rules

# Write the rules file
vim server_rules.yaml
#---------------------------------

groups:
  - name: alertmanager_pod.rules
    rules:
    - alert: Pod_all_cpu_usage
      expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
      for: 2m
      labels:
        severity: critical
        service: pods
      annotations:
        description: Container {{ $labels.name }} CPU usage is above 10% (current value is {{ $value }})
        summary: Dev CPU load alert

    - alert: Pod_all_memory_usage
      #expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 10  # memory above 10%
      expr: sort_desc(avg by(name)(container_memory_usage_bytes{name!=""})) > 2147483648   # memory above 2 GiB
      for: 2m
      labels:
        severity: critical
      annotations:
        description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
        summary: Dev memory load alert

    - alert: Pod_all_network_receive_usage
      expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 50*1024*1024
      for: 2m
      labels:
        severity: critical
      annotations:
        description: Container {{ $labels.name }} network receive rate is above 50M (current value is {{ $value }})

    - alert: node内存可用大小 # node free memory
      expr: node_memory_MemFree_bytes > 512*1024*1024 # deliberately inverted so that the alert fires for testing
      #expr: node_memory_MemFree_bytes > 1 # deliberately wrong (container free memory below 100k)
      for: 15s
      labels:
        severity: critical
      annotations:
        description: Node free memory is below 4G

  - name: alertmanager_node.rules
    rules:
    - alert: 磁盘容量 # disk capacity
      expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80  # filesystem usage above 80%
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.mountpoint}} disk partition usage is too high!"
        description: "{{$labels.mountpoint }} disk partition usage is above 80% (current: {{$value}}%)"

    - alert: 磁盘容量 # same alert name as above, so the critical alert can inhibit this warning
      expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 60 # filesystem usage above 60%
      for: 2s
      labels:
        severity: warning
      annotations:
        summary: "{{$labels.mountpoint}} disk partition usage is too high!"
        description: "{{$labels.mountpoint }} disk partition usage is above 60% (current: {{$value}}%)"

Load the alerting rules in Prometheus

vim /apps/prometheus/prometheus.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.15.100:9093

# Rule file paths:
rule_files:
  - /apps/prometheus/rules/server_rules.yaml

Validate the rules

./promtool check rules rules/server_rules.yaml
Checking rules/server_rules.yaml
  SUCCESS: 4 rules found

Restart Prometheus

systemctl restart prometheus.service

List the current alerts with amtool

./amtool alert --alertmanager.url=http://192.168.15.100:9093

![Prometheus verification](/img/AlertManager 告警配置/prome.png)

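Prometheus alert states

inactive: nothing is wrong

pending: the threshold has been crossed, but not yet for the required duration (the for field of the rule)

firing: the threshold has been crossed for the required duration and the alert is sent to Alertmanager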
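
The three states follow directly from how long the rule expression has been returning a result. A simplified Python sketch of that decision (real evaluation happens in steps of evaluation_interval; the function and its arguments are hypothetical):

```python
def alert_state(condition_true_since, now, for_seconds):
    """Classify an alert at time `now`; all times are in seconds.

    condition_true_since is None while the rule expression returns no result.
    """
    if condition_true_since is None:
        return "inactive"
    if now - condition_true_since < for_seconds:
        return "pending"
    return "firing"


# A rule with "for: 2m" whose condition became true at t=0.
for now in (60, 120, 300):
    print(now, alert_state(0, now, for_seconds=120))
```

If the condition stops being true while the alert is still pending, it drops back to inactive and the for timer starts over the next time the threshold is crossed.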

Check the notification email:

![Email verification](/img/AlertManager 告警配置/email.png)

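DingTalk alert notifications

Create a robot in a DingTalk group, using keyword verification

Copy the Webhook URL

* Security settings (☑️ check Custom Keywords)
 alertname

DingTalk verification: keyword

# Create the scripts directory
mkdir /data/scripts -p

vim /data/scripts/dingding-keywords.sh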
#!/bin/bash
source /etc/profile
#PHONE=$1
#SUBJECT=$2
MESSAGE=$1

/usr/bin/curl -X "POST" 'https://oapi.dingtalk.com/robot/send?access_token=ba76276cd923xxe5dcd653fxxxx4b71c4a23e8c4eb8e91446840d527c8d9cd4e' \
-H 'Content-Type: application/json' \
-d '{"msgtype": "text",
  "text": {
    "content": "'"${MESSAGE}"'"
  }
}'
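
Splicing the message into a JSON string by hand breaks as soon as it contains a double quote or a newline. As an alternative sketch, the same request can be built in Python, where json.dumps takes care of the escaping. The function names are illustrative and the webhook URL is whatever the robot settings page shows.

```python
import json
import urllib.request


def build_payload(message):
    """Build the DingTalk text-message body; json.dumps escapes quotes and newlines."""
    body = {"msgtype": "text", "text": {"content": message}}
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


def send(webhook_url, message):
    """POST the message to the robot webhook and return the raw response body."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(message),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```

The message still has to contain the configured keyword (alertname here), otherwise the robot rejects it.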

Send a test message

/usr/bin/curl -v -XPOST  \
-H 'Content-Type: application/json' \
-d '{"msgtype": "text","text": {"content": "namespace=default\npod=pod1\ncpu=87%\n 持续时间=4.5m\nalertname=pod"}}'  'https://oapi.dingtalk.com/robot/send?access_token=766379d2ee757779c06ea6ff531d2d52640571293c3e1eedd42d71c19e60af07'

-------------------
# Alternatively, run the script above: bash /data/scripts/dingding-keywords.sh followed by the message

The alert message arrives in DingTalk

Deploy webhook-dingtalk

# Download and extract
cd /apps
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
tar xf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

# Run
cd prometheus-webhook-dingtalk-1.4.0.linux-amd64

nohup ./prometheus-webhook-dingtalk  --web.listen-address="192.168.15.100:8060" --ding.profile="alertname=https://oapi.dingtalk.com/robot/send?access_token=766379d2ee757779c06ea6ff531d2d52640571293c3e1eedd42d71c19e60af07" &

Update the Alertmanager configuration

vi /apps/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_from: 'xxxx@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: 'xxxx@qq.com'
  smtp_auth_password: 'gtiuxxxxngxybhdi'
  smtp_require_tls: false
  smtp_hello: '@qq.com'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 5m
  receiver: 'dingding'
  #receiver: 'web.hook'
#  #receiver: 'default-receiver' # all other alerts go to default-receiver
#  routes: # send critical alerts to myalertname
#  - receiver: myalertname
#  group_wait: 10s
receivers:
  - name: dingding
    webhook_configs:
    - url: 'http://192.168.15.100:8060/dingding/alertname/send'
      send_resolved: true
  - name: 'web.hook'
#    webhook_configs:
#      - url: 'http://127.0.0.1:5001/'
    email_configs:
    - to: 'xxxx@qq.com'
      send_resolved: true
inhibit_rules:
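  - source_match: # source alert: while an alert matching this is firing, warning alerts with the same 'alertname', 'dev', and 'instance' values are inhibited
      severity: 'critical' # severity of the source alert
    target_match:
      severity: 'warning' # target alerts with severity 'warning' are the ones inhibited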
    equal: ['alertname', 'dev', 'instance']
