14 posts tagged "Prometheus"

Integrating Prometheus + Alertmanager with Telegram alerts

January 1, 2024 · 5 min read

Setting up and starting Prometheus

1. Install Prometheus

mkdir -p /apps && cd /apps

wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
tar xf prometheus-2.35.0.linux-amd64.tar.gz

# Create a symlink so the install path does not depend on the version
ln -sv prometheus-2.35.0.linux-amd64 /apps/prometheus

2. Manage the service with systemd

vim /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/apps/prometheus/prometheus \
    --config.file=/apps/prometheus/prometheus.yml \
    --storage.tsdb.path=/apps/prometheus/data \
    --web.console.templates=/apps/prometheus/consoles \
    --web.console.libraries=/apps/prometheus/console_libraries \
    --web.listen-address=:9090 \
    --storage.tsdb.retention.time=15d \
    --web.enable-lifecycle \
    --web.enable-admin-api
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

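3. Create the user and set permissions

# Create the prometheus user (no home directory, no login shell)
useradd --no-create-home --shell /bin/false prometheus

# Give the prometheus user ownership of the install directory
chown -R prometheus:prometheus /apps/prometheus/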

4. Start the service

systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
systemctl status prometheus

Setting up and configuring Alertmanager

1. Install Alertmanager

cd /apps
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar xf alertmanager-0.24.0.linux-amd64.tar.gz
ln -sv alertmanager-0.24.0.linux-amd64 /apps/alertmanager

2. Manage the service with systemd

vim /etc/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/apps/alertmanager/alertmanager \
    --config.file=/apps/alertmanager/alertmanager.yml \
    --storage.path=/apps/alertmanager/data \
    --web.listen-address=:9093 \
    --web.external-url=http://localhost:9093
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

3. Configure Alertmanager

vim /apps/alertmanager/alertmanager.yml

global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'your-email@163.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram-notifications'

receivers:
- name: 'telegram-notifications'
  telegram_configs:
  - api_url: 'https://api.telegram.org'
    bot_token: 'YOUR_BOT_TOKEN'
    chat_id: YOUR_CHAT_ID
    message: |
      {{ range .Alerts -}}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Labels:
      {{ range .Labels.SortedPairs }}  - {{ .Name }}: {{ .Value }}
      {{ end }}
      {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

4. Set permissions and start the service

chown -R prometheus:prometheus /apps/alertmanager/
systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager
systemctl status alertmanager

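Configuring Prometheus rules

1. Create the alert rules file

# The rules directory does not exist in a fresh install, so create it first
mkdir -p /apps/prometheus/rules && vim /apps/prometheus/rules/alert-rules.yml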

groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
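
Each rule above pairs an expression with a `for` duration: the alert is first pending and only fires once the expression has stayed true for that long. The sketch below is a simplified model of that behavior, not Prometheus's own implementation; the function name `alert_state` and the list-of-tuples input are assumptions made for illustration.

```python
def alert_state(evaluations, for_seconds):
    """Return 'inactive', 'pending' or 'firing' after the last evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true) tuples in
    chronological order, one per rule evaluation.
    """
    active_since = None  # time at which the condition first became true
    state = "inactive"
    for ts, is_true in evaluations:
        if not is_true:
            # The condition stopped holding, so the alert resets.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = ts
        # The alert fires once the condition has held for the whole `for` window.
        state = "firing" if ts - active_since >= for_seconds else "pending"
    return state


# `up == 0` evaluated every 15s with `for: 1m`
samples = [(0, True), (15, True), (30, True), (45, True), (60, True)]
print(alert_state(samples[:3], 60))  # pending
print(alert_state(samples, 60))      # firing
```

A single evaluation where the expression is false resets the timer, which is why a flapping target may never reach the firing state with a long `for` value.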

2. Update the Prometheus configuration

vim /apps/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
    - targets: ['localhost:9100']

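Creating a Telegram bot

1. Create the bot

Search for @BotFather in Telegram
Send the /newbot command
Follow the prompts to set the bot's name and username
Copy the bot token

2. Get the chat ID

Add the bot to a group, or open a private chat with it
Send a message to the bot
Visit https://api.telegram.org/bot<TOKEN>/getUpdates
Find the chat ID in the returned JSON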
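
Picking the chat ID out of the getUpdates reply by eye is error-prone when several chats have written to the bot. As a rough sketch, the helper below (`extract_chat_ids` is a hypothetical name) walks a decoded getUpdates reply and collects every chat ID it sees; the sample payload and its IDs are made up for illustration.

```python
import json


def extract_chat_ids(get_updates_response):
    """Collect distinct chat IDs (mapped to chat type) from a getUpdates reply."""
    chats = {}
    for update in get_updates_response.get("result", []):
        # Private and group messages arrive under "message"; channel posts
        # and membership changes use other keys but also carry a "chat".
        for key in ("message", "edited_message", "channel_post", "my_chat_member"):
            chat = update.get(key, {}).get("chat")
            if chat:
                chats[chat["id"]] = chat.get("type", "unknown")
    return chats


# Made-up sample in the shape of a getUpdates reply
sample = json.loads("""
{"ok": true, "result": [
  {"update_id": 1, "message": {"message_id": 7, "chat": {"id": 123456789, "type": "private"}, "text": "hi"}},
  {"update_id": 2, "message": {"message_id": 8, "chat": {"id": -1001234567890, "type": "supergroup"}, "text": "hi"}}
]}
""")
print(extract_chat_ids(sample))
```

Group and supergroup chat IDs are negative numbers, so keep the minus sign when copying the value into `chat_id`.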

3. Test the bot

# Send a test message
curl -X POST "https://api.telegram.org/bot<YOUR_BOT_TOKEN>/sendMessage" \
     -H "Content-Type: application/json" \
     -d '{"chat_id": "<YOUR_CHAT_ID>", "text": "Test message from Prometheus"}'
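
The same sendMessage call can be made from Python using only the standard library. This is a minimal sketch, assuming the token and chat ID are passed in by the caller; `build_send_message_request` and `send_message` are illustrative names, not part of any library.

```python
import json
import urllib.request


def build_send_message_request(bot_token, chat_id, text):
    """Build (but do not send) a Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_message(bot_token, chat_id, text, timeout=10):
    """Send the message and return the decoded JSON reply from Telegram."""
    req = build_send_message_request(bot_token, chat_id, text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (substitute real values before running):
# send_message("<YOUR_BOT_TOKEN>", "<YOUR_CHAT_ID>", "Test message from Prometheus")
```

Splitting request construction from sending keeps the URL and body easy to inspect without touching the network.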

Verifying the configuration

1. Restart the services

systemctl restart prometheus
systemctl restart alertmanager

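2. Check the configuration

# Check the Prometheus configuration
curl http://localhost:9090/api/v1/status/config

# Check the Alertmanager configuration (the v2 status endpoint includes the loaded config)
curl http://localhost:9093/api/v2/status

3. Test alerting

# Stop node_exporter to trigger the InstanceDown alert
systemctl stop node_exporter

# Check the alert state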
curl http://localhost:9090/api/v1/alerts
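
The reply from /api/v1/alerts lists pending and firing alerts together, so it helps to filter on the `state` field. The sketch below does that for an already-decoded reply; `firing_alerts` is a hypothetical helper and the sample payload is hand-written to match the shape of the endpoint's output.

```python
def firing_alerts(api_response):
    """Return (alertname, instance) pairs for alerts in the 'firing' state."""
    alerts = api_response.get("data", {}).get("alerts", [])
    return [
        (a["labels"].get("alertname"), a["labels"].get("instance"))
        for a in alerts
        if a.get("state") == "firing"
    ]


# Hand-written sample in the shape returned by /api/v1/alerts
sample = {
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "InstanceDown", "instance": "localhost:9100",
                    "job": "node", "severity": "critical"},
         "annotations": {"summary": "Instance localhost:9100 down"},
         "state": "firing", "activeAt": "2024-01-01T00:00:00Z", "value": "0e+00"},
        {"labels": {"alertname": "HighCPUUsage", "instance": "localhost:9100",
                    "severity": "warning"},
         "annotations": {}, "state": "pending",
         "activeAt": "2024-01-01T00:00:30Z", "value": "9.1e+01"},
    ]},
}
print(firing_alerts(sample))  # [('InstanceDown', 'localhost:9100')]
```

With the `for: 1m` setting on InstanceDown, expect the alert to show as pending for about a minute after node_exporter stops before it appears as firing.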

Advanced configuration

1. Custom alert template

receivers:
- name: 'telegram-notifications'
  telegram_configs:
  - api_url: 'https://api.telegram.org'
    bot_token: 'YOUR_BOT_TOKEN'
    chat_id: YOUR_CHAT_ID
    parse_mode: 'HTML'
    message: |
      <b>🚨 {{ .Status | toUpper }}</b>
      
      <b>Alert:</b> {{ .GroupLabels.alertname }}
      <b>Severity:</b> {{ .CommonLabels.severity }}
      <b>Instance:</b> {{ .CommonLabels.instance }}
      
      <b>Summary:</b> {{ .CommonAnnotations.summary }}
      <b>Description:</b> {{ .CommonAnnotations.description }}
      
      <b>Started:</b> {{ (index .Alerts 0).StartsAt.Format "2006-01-02 15:04:05" }}

2. Alert inhibition rules

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
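
To make the effect of `source_match`, `target_match` and `equal` concrete, here is a simplified model of the inhibition check. It is a sketch of the idea rather than Alertmanager's code: `is_inhibited` is a hypothetical function, alerts are plain label dicts, and only exact-value matchers are handled.

```python
def is_inhibited(target, active_alerts, rule):
    """Return True if `target` is muted by any alert in `active_alerts`.

    Alerts are plain dicts of labels. `rule` mirrors the YAML keys:
    source_match, target_match and equal.
    """
    def matches(labels, matchers):
        return all(labels.get(k) == v for k, v in matchers.items())

    if not matches(target, rule["target_match"]):
        return False
    for source in active_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Every label listed in `equal` must have the same value on both
        # alerts (a label missing on both sides counts as equal here).
        if all(source.get(name) == target.get(name) for name in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}
critical = {"alertname": "DiskFull", "instance": "db1", "severity": "critical"}
same_host = {"alertname": "DiskFull", "instance": "db1", "severity": "warning"}
other_host = {"alertname": "DiskFull", "instance": "db2", "severity": "warning"}
print(is_inhibited(same_host, [critical], rule))   # True
print(is_inhibited(other_host, [critical], rule))  # False
```

Because `alertname` is in the `equal` list, a critical alert only mutes a warning with the same name. The example alert names here (`DiskFull`) are invented; with the rules defined earlier, InstanceDown would not mute HighCPUUsage unless `alertname` is dropped from `equal`.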

3. Routing rules

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default-receiver'
  routes:
  - match:
      severity: critical
    receiver: 'critical-receiver'
    repeat_interval: 5m
  - match:
      severity: warning
    receiver: 'warning-receiver'
    repeat_interval: 30m
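
The routing tree is evaluated top to bottom: the first child route whose matchers fit the alert's labels handles it, and anything unmatched falls back to the root receiver. The sketch below models only that first-match behavior; `pick_receiver` is a hypothetical helper and it ignores the `continue` option and regex matchers.

```python
def pick_receiver(labels, route):
    """Return the receiver name for an alert, using first-match-wins."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            # A child without its own receiver inherits the parent's.
            merged = {"receiver": route["receiver"], **child}
            return pick_receiver(labels, merged)
    return route["receiver"]


route = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-receiver"},
        {"match": {"severity": "warning"}, "receiver": "warning-receiver"},
    ],
}
print(pick_receiver({"alertname": "InstanceDown", "severity": "critical"}, route))  # critical-receiver
print(pick_receiver({"alertname": "Other", "severity": "info"}, route))             # default-receiver
```

Every receiver named in the tree (`default-receiver`, `critical-receiver`, `warning-receiver`) must also be defined under `receivers`, otherwise Alertmanager rejects the configuration.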

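Troubleshooting

1. Common problems

Wrong bot token: make sure the token obtained from BotFather is correct
Wrong chat ID: get the correct chat ID through the getUpdates API
Network problems: check whether the server can reach the Telegram API
Permission problems: make sure the prometheus user has the correct file permissions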
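
The first two problems can usually be told apart from the JSON error that the Telegram Bot API returns. The mapping below is a rough sketch based on commonly seen replies (`error_code` 401 or 404 for a bad token, 400 with "chat not found" for a bad chat ID); treat the exact codes and wording as assumptions and check the actual reply. `diagnose` is a hypothetical helper.

```python
def diagnose(reply):
    """Map a decoded Bot API reply to a likely cause from the checklist."""
    if reply.get("ok"):
        return "ok"
    code = reply.get("error_code")
    description = (reply.get("description") or "").lower()
    if code in (401, 404):
        return "bot token is probably wrong"
    if code == 400 and "chat not found" in description:
        return "chat ID is probably wrong"
    if code == 403:
        return "bot cannot write to this chat (blocked or removed)"
    return f"unexpected error: {code} {description}".strip()


print(diagnose({"ok": False, "error_code": 401, "description": "Unauthorized"}))
print(diagnose({"ok": False, "error_code": 400, "description": "Bad Request: chat not found"}))
```

If the request never returns a JSON reply at all (timeout or connection refused), the cause is more likely the network problem from the list than the token or chat ID.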

2. View logs

# Follow the Prometheus logs
journalctl -u prometheus -f

# Follow the Alertmanager logs
journalctl -u alertmanager -f

3. Debugging commands

# Run the alert expression against the running server
/apps/prometheus/promtool query instant http://localhost:9090 'up == 0'

# Validate the Alertmanager configuration
/apps/alertmanager/amtool check-config /apps/alertmanager/alertmanager.yml

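Summary

With the configuration above, you have a working Prometheus + Alertmanager + Telegram alerting system. It can:

Monitor system metrics
Trigger alerts based on predefined rules
Push alert notifications to Telegram
Route and inhibit alerts flexibly

Remember to review and update the alert rules regularly so that the alerting system stays effective.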

Python alert proxy server

July 20, 2023 · 2 min read

Install the Python environment

# Install
yum install -y epel-release
yum install -y python3
yum install -y python3-pip

# Verify
python3 --version
pip3 --version

# Install the requests and flask libraries with pip3:
pip3 install requests flask

Write the proxy server

main.py

import requests
from flask import Flask, request

app = Flask(__name__)

# Replace your_key with the key of your own group bot
WECHAT_API_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=your_key"

def convert_to_wechat_markdown(data):
    """Convert an Alertmanager webhook payload into a WeChat Work markdown message."""
    status = data.get('status', '')
    group_labels = data.get('groupLabels', {})
    common_labels = data.get('commonLabels', {})
    alerts = data.get('alerts', [])

    # Pick the alert title based on the status
    alert_title = f"{status.upper()} alert" if status == 'firing' else "Alert resolved"

    content = f"# {alert_title}\n"
    content += f"> Alert name: {group_labels.get('alertname')}\n"
    content += f"> Severity: {common_labels.get('severity')}\n\n"

    for alert in alerts:
        annotations = alert.get('annotations', {})
        starts_at = alert.get('startsAt', '')  # time the alert started firing
        content += f"#### Description\n> {annotations.get('description')}\n"
        content += f"> Started at: {starts_at}\n"

    content += "\n[Dashboard](http://dashboard.grafana.com)\n"  # link to the dashboard

    return {
        'msgtype': 'markdown',
        'markdown': {'content': content}
    }

@app.route('/proxy', methods=['POST'])
def proxy():
    # Alertmanager posts JSON; fall back to an empty payload if the body is not JSON
    data = request.get_json(silent=True) or {}
    wechat_data = convert_to_wechat_markdown(data)
    # json= serializes the body and sets the Content-Type header
    response = requests.post(WECHAT_API_URL, json=wechat_data, timeout=10)
    return response.text, response.status_code

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
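
The proxy passes `startsAt` through verbatim, which in an Alertmanager webhook is an RFC 3339 timestamp, often with nanosecond precision. As an optional refinement, the sketch below normalizes such a value to a shorter UTC string before it goes into the message. `format_starts_at` is a hypothetical helper, not part of the original script, and it returns the input unchanged when it cannot parse it.

```python
import re
from datetime import datetime, timedelta, timezone

# Date, time, optional fractional seconds, then "Z" or a numeric UTC offset
_RFC3339 = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def format_starts_at(value, fmt="%Y-%m-%d %H:%M:%S UTC"):
    """Render an RFC 3339 timestamp as a short UTC string."""
    match = _RFC3339.match(value or "")
    if not match:
        return value or ""
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    offset = match.group(7)
    if offset == "Z":
        tz = timezone.utc
    else:
        sign = 1 if offset[0] == "+" else -1
        tz = timezone(sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[4:6])))
    moment = datetime(year, month, day, hour, minute, second, tzinfo=tz)
    return moment.astimezone(timezone.utc).strftime(fmt)


print(format_starts_at("2023-07-20T08:00:00.123456789Z"))  # 2023-07-20 08:00:00 UTC
print(format_starts_at("2023-07-20T10:00:00+02:00"))       # 2023-07-20 08:00:00 UTC
```

Inside `convert_to_wechat_markdown`, this would be applied as `format_starts_at(alert.get('startsAt', ''))`. The fractional seconds are dropped by hand because older Python versions cannot parse nanosecond precision with `datetime.fromisoformat`.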

Start the server

nohup python3 main.py &

Manual test

curl -X POST http://localhost:8000/proxy -H "Content-Type: application/json" -d '{
  "status": "firing",
  "groupLabels": {"alertname": "TestAlert"},
  "commonLabels": {
    "cluster": "TestCluster",
    "service": "TestService",
    "severity": "critical"
  },
  "alerts": [
    {
      "annotations": {
        "description": "This is a test alert."
      }
    }
  ]
}'

Configure Alertmanager

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
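  group_wait: 30s          # how long to wait before sending the first notification for a new group
  group_interval: 5m       # how long to wait before notifying about new alerts added to an already-notified group
  repeat_interval: 24h     # how long to wait before re-sending a notification for alerts that are still firing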
  receiver: 'default-receiver'
  routes:
  - match:
      severity: critical
    receiver: 'weixin-receiver'

receivers:
- name: 'default-receiver'
- name: 'weixin-receiver'
  webhook_configs:
  - url: 'http://127.0.0.1:8000/proxy'
    send_resolved: true

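Monitoring ingress-nginx-controller with Prometheus

October 17, 2022 · 2 min read

Official documentation

Check whether ingress exposes the metrics port

# Check whether the endpoint returns any metrics
curl http://xx.xx.xx.xx:10254/metrics

# If nothing is returned, add the following to the manifest
vim mandatory.yaml
apiVersion: apps/v1
kind: Deployment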
metadata:
 annotations:
   prometheus.io/scrape: "true"
   prometheus.io/port: "10254"
..
spec:
  ports:
    - name: prometheus
      containerPort: 10254
      ..

Configure the Prometheus configuration file

vim prometheus.yml 
  - job_name: 'ingress-nginx-controller exporter'
    static_configs:
    - targets: ['xx.xx.xx.xx:10254']

Configure Grafana

wget https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/grafana/dashboards/nginx.json

# Import the JSON file into Grafana

![Grafana dashboard](/img/Prometheus 监控 ingress-nginx-controller/ingress-exporter.png)

Writing alert rules

cd rules && vim ingress_rules.yaml 

groups:
- name: Ingress_monitor
  rules:
  - alert: Too many HTTP 4xx requests (> 5%)
    expr: sum(rate(nginx_ingress_controller_requests{status=~"^4.."}[1m])) /  sum(rate(nginx_ingress_controller_requests[1m])) * 100 >= 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 4xx (> 5%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: Too many HTTP 5xx requests (> 5%)
    expr: sum(rate(nginx_ingress_controller_requests{status=~"^5.."}[1m])) /  sum(rate(nginx_ingress_controller_requests[1m])) * 100 >= 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 5xx (> 5%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: ingress-nginx latency above 3 seconds
    expr: histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[2m])) by (le, host, node)) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Nginx latency high (instance {{ $labels.instance }})
      description: "Nginx p99 latency is higher than 3 seconds\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
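
The 4xx and 5xx rules both reduce to the same arithmetic: the per-second rate of error responses divided by the per-second rate of all responses, as a percentage, compared against 5. A small worked example of that calculation follows; the function names and the sample rates are invented for illustration.

```python
def error_rate_percent(error_rps, total_rps):
    """Percentage of requests that are errors; 0 when there is no traffic."""
    if total_rps == 0:
        return 0.0
    return error_rps * 100 / total_rps


def should_alert(error_rps, total_rps, threshold_percent=5):
    """Mirror the `>= 5` comparison used by the rules."""
    return error_rate_percent(error_rps, total_rps) >= threshold_percent


print(error_rate_percent(5, 80))  # 6.25
print(should_alert(5, 80))        # True
print(should_alert(2, 80))        # False
```

Because both rules aggregate with `sum()` across all series, the ratio is cluster-wide and the result carries no `instance` label; adding `by (ingress)` or `by (host)` to both sides of the division would give a per-ingress or per-host error rate instead.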
