Writing a custom Prometheus exporter in Python to monitor Slurm

weixin_39809793

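This article walks through writing a Prometheus exporter that monitors Slurm. The environment is Ubuntu 16.04.

1. Download Prometheus

Download the release tarball from the official website, then extract it: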

tar -zxvf prometheus-2.4.3.linux-amd64.tar.gz

cd prometheus-2.4.3.linux-amd64

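2. The prometheus.yml configuration file

The top of the file is all default configuration. What you need to edit are the job_name entries at the bottom: set the address of each target you want to monitor. Here the target I monitor is my_slurm at localhost:8000 (it is better to write the real IP address than localhost; I am being lazy here :D).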

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'my_demo'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9100']

  - job_name: 'my_slurm'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:8000']

3. No other monitoring is configured, so start the service directly

./prometheus --config.file=prometheus.yml

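4. Download Slurm

5. What Slurm does here

Slurm's role here is to run a job. We then use Slurm commands to find out which resources that job is using. A small example first:

vim job.sh # create a job script; anything that just sleeps for a while will do

sbatch job.sh # submit the job; this prints the job ID

cat slurm-1.out # the 1 in the file name is the job ID

scontrol show nodes # show the state of every node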

oc: NodeName=localhost Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUErr=0 CPUTot=1 CPULoad=0.14 Features=(null)
   Gres=(null)
   NodeAddr=localhost NodeHostName=localhost Version=15.08
   OS=Linux RealMemory=7965 AllocMem=1024 FreeMem=4005 Sockets=1 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=41197 Weight=1 Owner=N/A
   BootTime=2018-12-03T14:23:17 SlurmdStartTime=2018-12-03T14:24:58
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

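This prints a lot of information. We only need to pick out the attributes we care about with a regular expression, for example CPUTot, the total number of CPUs on the node.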
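As a minimal sketch of that regex selection, the snippet below pulls a few attributes out of a shortened copy of the output above. The helper name `node_attr_total` and the `SAMPLE` string are illustrative and not part of Slurm. Summing the matches assumes each node reports the attribute exactly once.

```python
import re

# Shortened copy of the `scontrol show nodes` output shown above.
SAMPLE = """NodeName=localhost Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUErr=0 CPUTot=1 CPULoad=0.14 Features=(null)
   OS=Linux RealMemory=7965 AllocMem=1024 FreeMem=4005 Sockets=1 Boards=1
"""


def node_attr_total(text, attr):
    """Sum the integer values of `attr` over every node in the text."""
    # \b stops a short name from matching inside a longer key name.
    values = re.findall(r'\b' + re.escape(attr) + r'=(\d+)', text)
    return sum(int(v) for v in values)


if __name__ == '__main__':
    for name in ('CPUTot', 'CPUAlloc', 'RealMemory', 'AllocMem'):
        print(name, node_attr_total(SAMPLE, name))
```

With several nodes in the output, the same call returns the cluster-wide total, which is what the collector later in this article reports.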

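While the job is running, squeue shows the resources it uses. See the squeue documentation for more details.

squeue --format "%A:%c:%t" # %A is the job ID, %c is the minimum number of CPUs requested, %t is the job state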

JOBID:MIN_CPUS:ST

35:1:R
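If you want these per-job values in Python as well, a rough sketch of parsing that colon-separated output could look like this. `Job` and `parse_squeue` are hypothetical names chosen for illustration, and the parser assumes exactly the three-field format requested above.

```python
from collections import namedtuple

Job = namedtuple('Job', ['job_id', 'min_cpus', 'state'])


def parse_squeue(text):
    """Parse `squeue --format "%A:%c:%t"` output into Job records."""
    jobs = []
    for line in text.strip().splitlines():
        fields = line.strip().split(':')
        # Skip the header row and anything that is not three fields.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        jobs.append(Job(int(fields[0]), int(fields[1]), fields[2]))
    return jobs


if __name__ == '__main__':
    sample = "JOBID:MIN_CPUS:ST\n35:1:R\n"
    print(parse_squeue(sample))
```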

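6. Everything is ready, so it is time to write a collector.

Prometheus has four metric types: Counter, Gauge, Summary, and Histogram.

Counter: starts at 0 and can only increase, never decrease.

Gauge: similar to a Counter, but the value can go both up and down.

The other two, Histogram and Summary, are used less often.

Everything below uses the Gauge type.

For the subprocess module used in the example, see my earlier article on using subprocess in Python.

Note: the value passed to GaugeMetricFamily must be a number; it is exposed as a float.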

import re
import subprocess
from wsgiref.simple_server import make_server

from prometheus_client import make_wsgi_app
from prometheus_client.core import GaugeMetricFamily, REGISTRY


class CustomCollector(object):

    def add(self, params):
        # Sum the numeric strings captured by the regex (one per node).
        return sum(int(i) for i in params)

    def collect(self):
        # Called on every scrape: run scontrol and read its stdout.
        proc = subprocess.Popen("scontrol show nodes",
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
        # communicate() returns bytes on Python 3, so decode before matching.
        out_put = proc.communicate()[0].decode('utf-8')
        if out_put:
            count = re.findall(r'CPUTot=(\d+)', out_put)
            total_c = self.add(count)
            yield GaugeMetricFamily('slurm_cpu_total', 'total_count', value=total_c)

            used = re.findall(r'CPUAlloc=(\d+)', out_put)
            used_cpu = self.add(used)
            yield GaugeMetricFamily('slurm_cpu_used', 'used_count', value=used_cpu)

            real_memory = re.findall(r'RealMemory=(\d+)', out_put)
            total_memory = self.add(real_memory)
            yield GaugeMetricFamily('slurm_memory_total', 'total_memory', value=total_memory)

            alloc_memory = re.findall(r'AllocMem=(\d+)', out_put)
            used_memory = self.add(alloc_memory)
            yield GaugeMetricFamily('slurm_memory_used', 'used_memory', value=used_memory)


REGISTRY.register(CustomCollector())

if __name__ == '__main__':
    # Print the metrics once for a quick check, then serve them on port 8000.
    coll = CustomCollector()
    for i in coll.collect():
        print(i)
    app = make_wsgi_app()
    httpd = make_server('', 8000, app)
    httpd.serve_forever()
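To make it clearer what Prometheus receives when it scrapes port 8000, here is a rough stdlib-only sketch of the text exposition format for a gauge. `render_gauge` is a hypothetical helper for illustration only. The real rendering is done by `prometheus_client`, and this sketch ignores labels and escaping.

```python
def render_gauge(name, documentation, value):
    """Render one unlabeled gauge sample in Prometheus text format."""
    return (
        '# HELP {0} {1}\n'
        '# TYPE {0} gauge\n'
        '{0} {2}\n'
    ).format(name, documentation, float(value))


if __name__ == '__main__':
    print(render_gauge('slurm_cpu_total', 'total_count', 1), end='')
```

The first argument is the metric name you will later search for in Prometheus, and the second is the help text shown next to it.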

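7. Create job.sh in the current directory and submit it. Then go to the Prometheus directory and start the service (if you skipped step 3), open a browser, and visit localhost:9090.

The keyword to search for here is the metric name we gave when defining each metric, for example slurm_cpu_total.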
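The same lookup can be done without the browser through the Prometheus HTTP API. The sketch below builds an instant-query URL and extracts the values from a response. The `SAMPLE_RESPONSE` body is hand-written for illustration (not captured from a real server), and `build_query_url` and `extract_values` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, metric):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return '{}/api/v1/query?{}'.format(base_url.rstrip('/'),
                                       urlencode({'query': metric}))


def extract_values(body):
    """Return the sample values from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get('status') != 'success':
        return []
    # Each result carries "value": [timestamp, "value-as-string"].
    return [float(r['value'][1]) for r in payload['data']['result']]


# Hand-written example of a response body, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    'status': 'success',
    'data': {'resultType': 'vector', 'result': [
        {'metric': {'__name__': 'slurm_cpu_used', 'job': 'my_slurm'},
         'value': [1543818298.0, '1']},
    ]},
})

if __name__ == '__main__':
    print(build_query_url('http://localhost:9090', 'slurm_cpu_used'))
    print(extract_values(SAMPLE_RESPONSE))
```

Against a live server, the body could be fetched with `urllib.request.urlopen(url).read()` and passed to `extract_values` unchanged.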

That's all!
