Monitoring and alerting in Kubernetes with Prometheus and Alertmanager
Published: 2019-06-25


Monitoring and alerting prototype diagram

[Figure: prototype diagram of the monitoring and alerting setup]

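About the prototype

Prometheus and Alertmanager run as containers in the same pod, which is managed by a Deployment. Alertmanager listens on port 9093 by default, and because both containers share the pod's network namespace, Prometheus can reach Alertmanager at localhost:9093 to send alert notifications. The alert rules file, rules.yml, is mounted into the Prometheus container from a ConfigMap. The notification settings are likewise mounted into the Alertmanager container from a ConfigMap. Here alerts are delivered by email, and the details are configured in alertmanager.yml.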
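The layout described above could look roughly like the following fragment of the Deployment's pod template (the part under spec.template.spec). This is only a sketch: the image tags, container names, volume names, and the /etc/alertmanager mount path are assumptions, while the ConfigMap names and the /etc/prometheus path come from the manifests in this article.

```yaml
# Sketch of spec.template.spec: two containers in one pod, each reading
# its configuration from a ConfigMap volume.
containers:
- name: prometheus
  image: prom/prometheus:v2.10.0          # assumed tag
  args:
  - --config.file=/etc/prometheus/prometheus.yml
  ports:
  - containerPort: 9090
  volumeMounts:
  - name: prometheus-config               # provides prometheus.yml and rules.yml
    mountPath: /etc/prometheus
- name: alertmanager
  image: prom/alertmanager:v0.17.0        # assumed tag
  args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  ports:
  - containerPort: 9093                   # reached by Prometheus via localhost:9093
  volumeMounts:
  - name: alertmanager-config             # provides alertmanager.yml
    mountPath: /etc/alertmanager          # assumed path
volumes:
- name: prometheus-config
  configMap:
    name: prometheus-config-v0.1.0
- name: alertmanager-config
  configMap:
    name: alertmanager
```

Because each key of a ConfigMap becomes a file in the mounted directory, both prometheus.yml and rules.yml end up under /etc/prometheus from a single volume.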

Test environment

OS: Linux 3.10.0-693.el7.x86_64 x86_64 GNU/Linux

Platform: Kubernetes v1.10.5
Tips: the complete Prometheus and Alertmanager configuration is at the end of this article.

Create the alert rules

Tell Prometheus where to find the alert rules with rule_files. The rules live in rules.yml, which is mounted from a ConfigMap into the /etc/prometheus directory:

rule_files:
- /etc/prometheus/rules.yml

The following defines an InstanceDown alert: once a target has been down (up == 0) for 1 minute, Prometheus fires the alert.

rules.yml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
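One way to check the timing of this rule without shutting down a node is a rule unit test, which Prometheus 2.5 and later can run with `promtool test rules`. The sketch below assumes the rule group has been saved on its own as a plain file named rules.yml (the content under the ConfigMap key, not the ConfigMap itself); the test file name and the instance value 10.0.0.1 are made up for illustration.

```yaml
# rules_test.yml (hypothetical file name)
# Run with: promtool test rules rules_test.yml
rule_files:
  - rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The target is up at 0m and down from 1m onwards.
      - series: 'up{job="node", instance="10.0.0.1"}'
        values: '1 0 0 0 0'
    alert_rule_test:
      # At 1m the alert is only pending, so nothing is firing yet.
      - eval_time: 1m
        alertname: InstanceDown
        exp_alerts: []
      # By 3m the condition has held for longer than `for: 1m`.
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: node
              instance: 10.0.0.1
            exp_annotations:
              summary: "Instance 10.0.0.1 down"
              description: "10.0.0.1 of job node has been down for more than 1 minutes."
```

The expected annotations have to match the rendered templates exactly, so they repeat the wording from the rule.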

Configure Prometheus to talk to Alertmanager (so that Prometheus can send alerts to Alertmanager)

Alertmanager listens on port 9093 by default. Because Prometheus and Alertmanager run in the same pod, Prometheus can reach Alertmanager directly at localhost:9093:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
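For contrast, localhost only works because the two containers share a pod. If Alertmanager ran in its own pod behind a Service, the target would be the Service's DNS name instead. A minimal sketch, assuming a hypothetical Service named alertmanager in the kube-system namespace:

```yaml
# prometheus.yml fragment; the Service name "alertmanager" is an assumption
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager.kube-system.svc:9093"]
```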

Configure the Alertmanager notification receiver

This example uses email. When Alertmanager receives an alert from Prometheus, it sends an alert email to the configured address. This configuration is also mounted from a ConfigMap into the Alertmanager container:

alertmanager.yml: |-
    global:
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'xin.liu@woqutech.com'
      smtp_auth_username: 'xin.liu@woqutech.com'
      smtp_auth_password: 'xxxxxxxxxxxx'
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '1148576125@qq.com'
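The configuration above sends every alert to one receiver. Since the InstanceDown rule sets the label severity: page, a child route could send such alerts to a different mailbox. The following is only a sketch: the receiver name oncall-receiver and both example.com addresses are placeholders, and the global SMTP settings are assumed to stay as shown above.

```yaml
# alertmanager.yml fragment: route by the severity label set in rules.yml
route:
  group_by: [alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver        # fallback for alerts no child route matches
  routes:
  - match:
      severity: page
    receiver: oncall-receiver       # hypothetical receiver name
receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'team@example.com'          # placeholder address
- name: 'oncall-receiver'
  email_configs:
  - to: 'oncall@example.com'        # placeholder address
```

The match key is the older equality matcher syntax; newer Alertmanager releases also accept a matchers list.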

The prototype in action

The Prometheus web UI shows the configured alert rule:

[Screenshot: Prometheus web UI showing the configured alert rule]

To test the setup, shut down one host node.

The Prometheus web UI then shows that an InstanceDown alert has been triggered:

[Screenshot: Prometheus web UI showing the triggered InstanceDown alert]

The Alertmanager web UI shows the alert that Alertmanager received from Prometheus:

[Screenshot: Alertmanager web UI showing the received alert]

The configured mailbox receives the alert email sent by Alertmanager:

[Screenshot: alert email sent by Alertmanager]

Complete configuration

node_exporter_daemonset.yaml

# extensions/v1beta1 is served on Kubernetes v1.10; clusters on v1.16 or later need apps/v1
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    app: node_exporter
spec:
  selector:
    matchLabels:
      name: node_exporter
  template:
    metadata:
      labels:
        name: node_exporter
    spec:
      # also schedule onto master nodes
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: node-exporter
        image: alery/node-exporter:1.0
        ports:
        - name: node-exporter
          containerPort: 9100
          hostPort: 9100
        volumeMounts:
        - name: localtime
          mountPath: /etc/localtime
        - name: host
          mountPath: /host
          readOnly: true
      volumes:
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
      - name: host
        hostPath:
          path: /

alertmanager-cm.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-system
data:
  alertmanager.yml: |-
    global:
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'xin.liu@woqutech.com'
      smtp_auth_username: 'xin.liu@woqutech.com'
      smtp_auth_password: 'xxxxxxxxxxxx'
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '1148576125@qq.com'

prometheus-rbac.yaml

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: kube-system
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system

prometheus-cm.yaml

kind: ConfigMap
apiVersion: v1
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    scrape_configs:
    - job_name: 'node'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      # scrape each pod on port 9100 via its pod IP
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        replacement: $1:9100
      # use the host IP as the instance label
      - source_labels: [__meta_kubernetes_pod_host_ip]
        action: replace
        target_label: instance
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node_name
      # copy the pod label "name" onto the target
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(name)
      # keep only pods labelled name=node_exporter
      - source_labels: [__meta_kubernetes_pod_label_name]
        regex: node_exporter
        action: keep
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
metadata:
  name: prometheus-config-v0.1.0
  namespace: kube-system

prometheus.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: kube-system
  name: prometheus
  labels:
    name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master

This article is reposted from SegmentFault.

Repost address: http://gvyto.baihongyu.com/
