14 posts tagged "Prometheus"

Prometheus + Alertmanager: Sending Alerts to Telegram

· 5 min read

Installing and starting Prometheus

1. Install Prometheus

mkdir -p /apps && cd /apps

wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
tar xf prometheus-2.35.0.linux-amd64.tar.gz

# Create a symlink
ln -sv prometheus-2.35.0.linux-amd64 /apps/prometheus

2. Manage Prometheus with systemd

vim /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/apps/prometheus/prometheus \
--config.file=/apps/prometheus/prometheus.yml \
--storage.tsdb.path=/apps/prometheus/data \
--web.console.templates=/apps/prometheus/consoles \
--web.console.libraries=/apps/prometheus/console_libraries \
--web.listen-address=:9090 \
--storage.tsdb.retention.time=15d \
--web.enable-lifecycle \
--web.enable-admin-api
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

3. Create the user and set permissions

# Create the prometheus user
useradd --no-create-home --shell /bin/false prometheus

# Give the prometheus user ownership of the installation directory
chown -R prometheus:prometheus /apps/prometheus/

4. Start the service

systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
systemctl status prometheus

Installing and configuring Alertmanager

1. Install Alertmanager

cd /apps
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar xf alertmanager-0.24.0.linux-amd64.tar.gz
ln -sv alertmanager-0.24.0.linux-amd64 /apps/alertmanager

2. Manage Alertmanager with systemd

vim /etc/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/apps/alertmanager/alertmanager \
--config.file=/apps/alertmanager/alertmanager.yml \
--storage.path=/apps/alertmanager/data \
--web.listen-address=:9093 \
--web.external-url=http://localhost:9093
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

3. Configure Alertmanager

vim /apps/alertmanager/alertmanager.yml

global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'your-email@163.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram-notifications'

receivers:
  - name: 'telegram-notifications'
    telegram_configs:
      - api_url: 'https://api.telegram.org'
        bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        message: |
          {{ range .Alerts -}}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels:
          {{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value }}
          {{ end }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

4. Set permissions and start the service

chown -R prometheus:prometheus /apps/alertmanager/
systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager
systemctl status alertmanager

Configuring Prometheus rules

1. Create the alerting rules file

mkdir -p /apps/prometheus/rules
vim /apps/prometheus/rules/alert-rules.yml

groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
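
The arithmetic behind the HighCPUUsage and HighMemoryUsage expressions can be checked by hand. The sketch below is only an illustration in plain Python: the function names and the sample numbers are made up for this example, and the idle fractions stand in for what irate() would return per CPU core.

```python
def memory_usage_percent(mem_total_bytes, mem_available_bytes):
    """Mirror of (MemTotal - MemAvailable) / MemTotal * 100."""
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes * 100


def cpu_usage_percent(idle_fractions):
    """Mirror of 100 - avg(idle rate) * 100, where each item is the
    fraction of time (0..1) one core spent in idle mode."""
    average_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 - average_idle * 100


GIB = 2 ** 30

# Hypothetical host: 16 GiB total, 2 GiB available
memory = memory_usage_percent(16 * GIB, 2 * GIB)
# Hypothetical host: four cores, mostly busy
cpu = cpu_usage_percent([0.125, 0.25, 0.125, 0.25])

print(memory, memory > 80)
print(cpu, cpu > 80)
```

Both values sit above the 80 threshold, so both rules would enter the pending state and fire once the condition has held for the 5m "for" window.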

2. Update the Prometheus configuration

vim /apps/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

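Creating a Telegram bot

1. Create the bot

  1. Search for @BotFather in Telegram
  2. Send the /newbot command
  3. Follow the prompts to set the bot's display name and username
  4. Copy the bot token that BotFather returns

2. Get the chat ID

  1. Add the bot to a group, or open a private chat with it
  2. Send the bot a message
  3. Visit https://api.telegram.org/bot<TOKEN>/getUpdates
  4. Find the chat ID in the returned JSON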
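
Picking the chat ID out of the getUpdates JSON by eye is easy to get wrong, especially for groups, whose IDs are negative. Here is a minimal Python sketch of that last step. The helper name extract_chat_ids and the sample payload are illustrative assumptions; the payload only mimics the general shape of a getUpdates response.

```python
import json


def extract_chat_ids(get_updates_response):
    """Return the distinct chat IDs in a getUpdates response, in order of appearance."""
    chat_ids = []
    for update in get_updates_response.get("result", []):
        # Private and group messages arrive under "message",
        # channel posts under "channel_post".
        for key in ("message", "channel_post"):
            chat_id = update.get(key, {}).get("chat", {}).get("id")
            if chat_id is not None and chat_id not in chat_ids:
                chat_ids.append(chat_id)
    return chat_ids


# Illustrative sample, not real data
sample = json.loads("""
{
  "ok": true,
  "result": [
    {"update_id": 1, "message": {"message_id": 10, "chat": {"id": 123456789, "type": "private"}, "text": "hi"}},
    {"update_id": 2, "message": {"message_id": 11, "chat": {"id": -1001234567890, "type": "supergroup"}, "text": "hi"}},
    {"update_id": 3, "message": {"message_id": 12, "chat": {"id": 123456789, "type": "private"}, "text": "again"}}
  ]
}
""")

print(extract_chat_ids(sample))
```

If the result list comes back empty, send the bot another message and call getUpdates again.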

3. Test the bot

# Send a test message
curl -X POST "https://api.telegram.org/bot<YOUR_BOT_TOKEN>/sendMessage" \
-H "Content-Type: application/json" \
-d '{"chat_id": "<YOUR_CHAT_ID>", "text": "Test message from Prometheus"}'
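
The same test can be run from Python when curl is not available. This is a sketch that uses only the standard library; the function names are made up for this example, and the token and chat ID passed in below are placeholders. Building the request is kept separate from sending it so the request can be inspected without touching the network.

```python
import json
import urllib.request


def build_send_message_request(bot_token, chat_id, text):
    """Build (but do not send) the POST request for the sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_message(bot_token, chat_id, text, timeout=10):
    """Send the message and return the decoded JSON reply."""
    request = build_send_message_request(bot_token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder credentials: replace with your own before calling send_message()
request = build_send_message_request("YOUR_BOT_TOKEN", 123456789, "Test message from Prometheus")
print(request.get_method(), request.full_url)
```

Call send_message() with the real token and chat ID to actually deliver the message.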

Verifying the configuration

1. Restart the services

systemctl restart prometheus
systemctl restart alertmanager

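2. Check the configuration

# Show the configuration Prometheus has loaded
curl http://localhost:9090/api/v1/status/config

# Show the configuration Alertmanager has loaded (the v2 status endpoint includes it)
curl http://localhost:9093/api/v2/status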

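3. Test alerting

# Stop node_exporter to trigger the InstanceDown alert
# (assumes node_exporter is installed as a systemd service and listens on :9100)
systemctl stop node_exporter

# Check the alert state; the alert moves from pending to firing after the 1m "for" window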
curl http://localhost:9090/api/v1/alerts
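
The /api/v1/alerts output is verbose, so it helps to filter it down to the alerts that are actually firing. Below is a minimal Python sketch of that filter. The helper name firing_alerts and the sample response are illustrative; the sample only imitates the general shape of the API reply.

```python
def firing_alerts(alerts_response):
    """Return (alertname, instance) for every alert in the firing state."""
    alerts = alerts_response.get("data", {}).get("alerts", [])
    return [
        (alert["labels"].get("alertname"), alert["labels"].get("instance"))
        for alert in alerts
        if alert.get("state") == "firing"
    ]


# Illustrative sample, not real output
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "InstanceDown", "instance": "localhost:9100",
                           "job": "node", "severity": "critical"},
                "annotations": {"summary": "Instance localhost:9100 down"},
                "state": "firing",
            },
            {
                "labels": {"alertname": "HighCPUUsage", "instance": "localhost:9100",
                           "severity": "warning"},
                "annotations": {},
                "state": "pending",
            },
        ]
    },
}

print(firing_alerts(sample))
```

An alert that is still inside its "for" window shows up as pending and is skipped by this filter.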

Advanced configuration

1. Custom notification template

receivers:
  - name: 'telegram-notifications'
    telegram_configs:
      - api_url: 'https://api.telegram.org'
        bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'
        message: |
          <b>🚨 {{ .Status | toUpper }}</b>

          <b>Alert:</b> {{ .GroupLabels.alertname }}
          <b>Severity:</b> {{ .CommonLabels.severity }}
          <b>Instance:</b> {{ .CommonLabels.instance }}

          <b>Summary:</b> {{ .CommonAnnotations.summary }}
          <b>Description:</b> {{ .CommonAnnotations.description }}

          {{ range .Alerts -}}
          <b>Started:</b> {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          {{ end }}

2. Inhibition rules

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
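
To see what this rule does and does not suppress, here is a simplified Python model of the inhibition check. It is a sketch of the idea only, not Alertmanager's implementation: the function name is made up, alerts are plain label dictionaries, and matchers are limited to exact equality.

```python
def is_inhibited(target, firing, source_match, target_match, equal):
    """Return True if some firing source alert suppresses the target alert."""

    def matches(labels, matcher):
        return all(labels.get(name) == value for name, value in matcher.items())

    if not matches(target, target_match):
        return False
    for source in firing:
        if source is target or not matches(source, source_match):
            continue
        # A label missing on both sides counts as equal (both read as "").
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}

critical = {"alertname": "InstanceDown", "instance": "host1", "severity": "critical"}
same_name_warning = {"alertname": "InstanceDown", "instance": "host1", "severity": "warning"}
cpu_warning = {"alertname": "HighCPUUsage", "instance": "host1", "severity": "warning"}
firing = [critical, same_name_warning, cpu_warning]

print(is_inhibited(same_name_warning, firing, **rule))
print(is_inhibited(cpu_warning, firing, **rule))
```

Because alertname is listed in equal, a critical InstanceDown does not suppress a warning HighCPUUsage on the same instance. With the rules from earlier in this article, the critical and warning alerts always have different names, so you may want to drop alertname from equal if the goal is to quiet warnings for a host that is already down.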

3. Routing rules

# 'default-receiver', 'critical-receiver' and 'warning-receiver' must each be defined under receivers:
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-receiver'
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: 'warning-receiver'
      repeat_interval: 30m
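
The routing tree is walked top down: the first child route whose matchers fit the alert's labels wins, and anything that matches no child falls back to the parent's receiver. The Python sketch below models just that behaviour. It is a simplification under stated assumptions: the function name is made up, only exact-equality match blocks are handled, and the continue option is ignored.

```python
def pick_receiver(labels, route):
    """Return the receiver name a set of alert labels would be routed to."""
    for child in route.get("routes", []):
        matchers = child.get("match", {})
        if all(labels.get(name) == value for name, value in matchers.items()):
            # A child without its own receiver inherits the parent's.
            inherited = {"receiver": route["receiver"], **child}
            return pick_receiver(labels, inherited)
    return route["receiver"]


route = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-receiver"},
        {"match": {"severity": "warning"}, "receiver": "warning-receiver"},
    ],
}

for severity in ("critical", "warning", "info"):
    print(severity, "->", pick_receiver({"alertname": "Demo", "severity": severity}, route))
```

An alert with severity info matches neither child route, so it ends up at default-receiver.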

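Troubleshooting

1. Common problems

  • Wrong bot token: make sure the token matches the one issued by BotFather
  • Wrong chat ID: fetch the correct chat ID through the getUpdates API
  • Network problems: check that the server can reach the Telegram API
  • Permission problems: make sure the prometheus user owns the files it needs to read and write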

2. Viewing logs

# Follow the Prometheus logs
journalctl -u prometheus -f

# Follow the Alertmanager logs
journalctl -u alertmanager -f

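3. Debugging commands

# Validate the alerting rules file
/apps/prometheus/promtool check rules /apps/prometheus/rules/alert-rules.yml

# Run an alert expression against the running server
/apps/prometheus/promtool query instant http://localhost:9090 'up == 0'

# Validate the Alertmanager configuration file
/apps/alertmanager/amtool check-config /apps/alertmanager/alertmanager.yml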

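Summary

With the configuration above, you have a working Prometheus + Alertmanager + Telegram alerting pipeline. It can:

  1. Monitor system metrics
  2. Fire alerts based on predefined rules
  3. Push alert notifications to Telegram
  4. Route and inhibit alerts flexibly

Review and update the alerting rules regularly so that the alerts stay meaningful.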

Python alert proxy server

· 2 min read

Installing the Python environment

# Install
yum install -y epel-release
yum install -y python3
yum install -y python3-pip

# Verify
python3 --version
pip3 --version

# Install the requests and flask libraries with pip3
pip3 install requests flask

Writing the proxy server

main.py

import requests
from flask import Flask, request

app = Flask(__name__)

# Set this to the webhook key of your own WeChat Work group bot
WECHAT_API_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=your_key"


def convert_to_wechat_markdown(data):
    """Convert an Alertmanager webhook payload into a WeChat Work markdown message."""
    status = data.get('status', '')
    group_labels = data.get('groupLabels', {})
    common_labels = data.get('commonLabels', {})
    alerts = data.get('alerts', [])

    # Choose the title based on the notification status
    alert_title = f"{status.upper()} alert" if status == 'firing' else "Alert resolved"

    content = f"# {alert_title}\n"
    content += f"> Alert name: {group_labels.get('alertname')}\n"
    content += f"> Severity: {common_labels.get('severity')}\n\n"

    for alert in alerts:
        annotations = alert.get('annotations', {})
        starts_at = alert.get('startsAt', '')  # time the alert started firing
        content += f"#### Description\n> {annotations.get('description')}\n"
        content += f"> Started at: {starts_at}\n"

    # Link to the dashboard
    content += "\n[Dashboard](http://dashboard.grafana.com)\n"

    return {
        'msgtype': 'markdown',
        'markdown': {'content': content}
    }


@app.route('/proxy', methods=['POST'])
def proxy():
    # silent=True yields None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}
    wechat_data = convert_to_wechat_markdown(data)
    # json= serializes the body and sets the Content-Type header
    response = requests.post(WECHAT_API_URL, json=wechat_data, timeout=10)
    return response.text, response.status_code


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
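
The proxy passes startsAt through exactly as Alertmanager sends it, which is an RFC 3339 string such as 2024-05-01T08:30:15.123456789Z. If you would rather show a local, human-readable time in the chat message, something like the sketch below could be called inside the loop. The helper name format_starts_at and the default UTC+8 offset are assumptions made for this example; it sticks to the standard library and avoids datetime.fromisoformat so that it also runs on the older python3 that yum tends to install.

```python
import re
from datetime import datetime, timedelta, timezone

# Date and time, optional fractional seconds, then "Z" or a numeric offset
_RFC3339 = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def format_starts_at(value, tz_offset_hours=8):
    """Render an RFC 3339 timestamp as 'YYYY-MM-DD HH:MM:SS' at a fixed UTC offset.

    The input is returned unchanged if it cannot be parsed.
    """
    match = _RFC3339.match(value or "")
    if not match:
        return value
    base, zone = match.groups()
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    if zone == "Z":
        offset = timedelta(0)
    else:
        sign = 1 if zone[0] == "+" else -1
        offset = sign * timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
    # Normalize to UTC first, then shift to the display offset
    utc_time = (parsed - offset).replace(tzinfo=timezone.utc)
    local_time = utc_time.astimezone(timezone(timedelta(hours=tz_offset_hours)))
    return local_time.strftime("%Y-%m-%d %H:%M:%S")


print(format_starts_at("2024-05-01T08:30:15.123456789Z"))
```

Fractional seconds are dropped on purpose, since second precision is enough for a chat notification.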

Starting the server

nohup python3 main.py &

Manual test

curl -X POST http://localhost:8000/proxy -H "Content-Type: application/json" -d '{
  "status": "firing",
  "groupLabels": {"alertname": "TestAlert"},
  "commonLabels": {
    "cluster": "TestCluster",
    "service": "TestService",
    "severity": "critical"
  },
  "alerts": [
    {
      "annotations": {
        "description": "This is a test alert."
      }
    }
  ]
}'

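Configuring Alertmanager

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s       # how long to wait before sending the first notification for a new group
  group_interval: 5m    # how long to wait before notifying about new alerts added to an already-notified group
  repeat_interval: 24h  # how long to wait before re-sending a notification for alerts that are still firing
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'weixin-receiver'

receivers:
  - name: 'default-receiver'
  - name: 'weixin-receiver'
    webhook_configs:
      - url: 'http://127.0.0.1:8000/proxy'
        send_resolved: true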

Monitoring ingress-nginx-controller with Prometheus

· 2 min read

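Official documentation

Check whether ingress exposes the metrics port

# Check whether the endpoint returns any metrics
http://xx.xx.xx.xx:10254/metrics

# If it returns nothing, add the following to the controller manifest
vim mandatory.yaml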
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "10254"
..
spec:
  ports:
    - name: prometheus
      containerPort: 10254
..
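
Once the port is exposed, the check for "does /metrics return anything useful" can be scripted instead of eyeballed in a browser. The Python sketch below parses the text exposition format well enough to list metric names. The helper names and the sample text are illustrative, and the nginx_ingress_controller_ prefix is taken from the metric used in the alert rules of this post.

```python
def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="value"} 1  or  name 1
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def has_ingress_metrics(exposition_text, prefix="nginx_ingress_controller_"):
    """Return True if at least one metric name starts with the given prefix."""
    return any(name.startswith(prefix) for name in metric_names(exposition_text))


# Illustrative sample, not real controller output
sample = """\
# HELP nginx_ingress_controller_requests Total client requests (sample help text)
# TYPE nginx_ingress_controller_requests counter
nginx_ingress_controller_requests{host="example.com",status="200"} 1027
go_goroutines 42
"""

print(sorted(metric_names(sample)))
print(has_ingress_metrics(sample))
```

To run it against a live controller, fetch http://xx.xx.xx.xx:10254/metrics with any HTTP client and pass the response body to has_ingress_metrics().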

Updating the Prometheus configuration file

vim prometheus.yml
# Add this job under scrape_configs:
  - job_name: 'ingress-nginx-controller exporter'
    static_configs:
      - targets: ['xx.xx.xx.xx:10254']

Configuring Grafana

wget https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/grafana/dashboards/nginx.json

# Import the JSON file into Grafana

![Grafana dashboard](/img/Prometheus 监控 ingress-nginx-controller/ingress-exporter.png)

Writing the alerting rules

cd rules && vim ingress_rules.yaml

groups:
  - name: Ingress_monitor
    rules:
      - alert: "Too many HTTP 4xx requests (> 5%)"
        expr: sum(rate(nginx_ingress_controller_requests{status=~"^4.."}[1m])) / sum(rate(nginx_ingress_controller_requests[1m])) * 100 >= 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }})
          description: "Too many HTTP requests with status 4xx (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: "Too many HTTP 5xx requests (> 5%)"
        expr: sum(rate(nginx_ingress_controller_requests{status=~"^5.."}[1m])) / sum(rate(nginx_ingress_controller_requests[1m])) * 100 >= 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
          description: "Too many HTTP requests with status 5xx (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      # The by() clause must keep the le label, otherwise histogram_quantile has no buckets to work with.
      # ingress-nginx itself exports this histogram as nginx_ingress_controller_request_duration_seconds_bucket;
      # adjust the metric name and grouping labels to whatever your /metrics endpoint exposes.
      - alert: "ingress-nginx latency above 3 seconds"
        expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node, le)) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Nginx latency high (instance {{ $labels.instance }})
          description: "Nginx p99 latency is higher than 3 seconds\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
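
The latency rule relies on histogram_quantile, which estimates a quantile from cumulative bucket counts, each tagged with its upper bound in the le label. The Python sketch below shows the idea with linear interpolation inside the bucket that contains the target rank. It is a simplified model, not the Prometheus implementation, and it assumes a made-up bucket layout with at least one finite bucket, sorted by upper bound and ending in +Inf.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper_bound equal to +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Linear interpolation between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets: (le in seconds, cumulative count)
buckets = [(0.5, 50), (1.0, 80), (3.0, 95), (math.inf, 100)]

for q in (0.5, 0.9, 0.99):
    print(q, histogram_quantile(q, buckets))
```

Without the upper bound of each bucket, there is nothing to interpolate between, which is why the le label has to survive any aggregation that feeds histogram_quantile.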