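Step by Step: Building a Service Monitoring Component
Preface
As business expands rapidly, more and more backend teams are adopting microservice architectures. Microservices lower the barrier to business development, but they place higher demands on system infrastructure, monitoring included. In a microservice environment the number of backend services grows quickly, each service may use a different technology stack, and many people are involved, all of which makes system monitoring considerably harder.
This article walks through a lightweight, pluggable microservice monitoring plugin, from design to implementation.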
Monitored metrics
- CPU usage
- JVM memory usage
- Dubbo thread pool size
- Database connection pool usage
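How a monitoring system works
A monitoring system generally works in the following steps:
- Data collection
- Data transport and storage
- Data processing and analysis
- Data presentation
This plugin uses a scheduled task to collect each metric and compare it against a threshold. If a value exceeds its threshold, the system sends an alert to the service maintainer.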
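The collect, compare and alert loop described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the class ThresholdMonitor, its method names and the "print the alert" behaviour are all hypothetical, and a real plugin would send the alert to the maintainer instead of printing it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleSupplier;

// Hypothetical sketch: names are illustrative, not the plugin's real classes.
class ThresholdMonitor {

    // One monitored metric: how to sample it and the limit above which we alert.
    static final class Metric {
        final DoubleSupplier sampler;
        final double threshold;

        Metric(DoubleSupplier sampler, double threshold) {
            this.sampler = sampler;
            this.threshold = threshold;
        }
    }

    private final Map<String, Metric> metrics = new LinkedHashMap<>();

    void register(String name, DoubleSupplier sampler, double threshold) {
        metrics.put(name, new Metric(sampler, threshold));
    }

    // Sample every metric once and return one message per threshold breach.
    List<String> checkOnce() {
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, Metric> e : metrics.entrySet()) {
            try {
                double value = e.getValue().sampler.getAsDouble();
                if (value > e.getValue().threshold) {
                    alerts.add(e.getKey() + "=" + value + " exceeds " + e.getValue().threshold);
                }
            } catch (RuntimeException ex) {
                // A failing collector must never break the host service.
                alerts.add(e.getKey() + " collection failed: " + ex.getMessage());
            }
        }
        return alerts;
    }

    // Run checkOnce() at a fixed rate; here the alerts are only printed.
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> checkOnce().forEach(System.out::println),
                0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Each collector is registered as a DoubleSupplier, so adding a new metric does not require touching the scheduling code, which is one way to keep such a plugin pluggable.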
The sections below describe the implementation of each module in turn.
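CPU monitoring
At first, I monitored CPU with Hyperic Sigar.
It is a toolkit for collecting low-level system information. It is comprehensive: it can collect usage data for CPU, memory, network, processes, I/O statistics and more, and it is cross-platform.
More importantly, it wraps each kind of resource information thoroughly, which makes it very convenient to use.
Before using it you need to set it up. Choose the native library file that matches your operating system and its version.
pom dependency: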
<dependency>
<groupId>org.hyperic</groupId>
<artifactId>sigar</artifactId>
<version>1.6.5.132</version>
</dependency>
Configuration class:
@Slf4j
@Configuration
public class SigarConfig {
static {
initSigar();
}
/**
* Initialize Sigar: copy its native library out of the classpath and tell Sigar where to find it
*/
public static void initSigar() {
log.info("==initSigar==");
SigarLoader loader = new SigarLoader(Sigar.class);
String lib = null;
try {
lib = loader.getLibraryName();
log.info("sigar lib:{}", lib);
} catch (ArchNotSupportedException e) {
log.error("error:", e);
}
ResourceLoader resourceLoader = new DefaultResourceLoader();
Resource resource = resourceLoader.getResource("classpath:/sigar.so/" + lib);
if (resource.exists()) {
log.info("==exists==");
InputStream is = null;
BufferedOutputStream os = null;
try {
is = resource.getInputStream();
File tempDir = new File("./log");
if (!tempDir.exists()) {
tempDir.mkdirs();
}
os = new BufferedOutputStream(new FileOutputStream(new File(tempDir, lib), false));
int b;
// Copy the native library byte by byte into the temp directory
while ((b = is.read()) != -1) {
os.write(b);
}
System.setProperty("org.hyperic.sigar.path", tempDir.getCanonicalPath());
} catch (IOException e) {
log.error("init sigar fail:", e);
} finally {
try {
if (is != null) {
is.close();
}
if (os != null) {
os.close();
}
} catch (IOException e) {
log.error("failed to close stream:", e);
}
}
}
}
}
public static CpuPerc[] cpu() {
Sigar sigar = new Sigar();
CpuPerc[] cpuList = null;
try {
cpuList = sigar.getCpuPercList();
} catch (SigarException e) {
log.error("error:", e);
}
return cpuList;
}
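After monitoring this way for a while, I found that it was of little use.
Our services all run in containers, while Sigar reports the CPU usage of the host machine. That data is meaningless for our purposes.
What we actually want to monitor is each container's CPU usage.
To see the resource usage of each Docker container, we can use the docker stats command.
So how do we do the same from code? With docker-java.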
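docker-java
The Java client for the Docker API.
To access the Docker API from Java (or from any other client), the Docker daemon must listen on a port.
Run the following command to open docker.service: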
vi /lib/systemd/system/docker.service
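- Find the line that starts with ExecStart=/usr/bin/dockerd;
- Add the following listening options to it:
-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock
- Save and exit.
Port 2375 is unauthenticated and unencrypted, so expose it only on a trusted network.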
Then run the following commands:
systemctl daemon-reload
service docker restart   # restart Docker
systemctl status docker  # check the output to confirm that port 2375 is configured
Docker can now be accessed from Java, or from a browser's address bar.
Visiting http://ip:2375/info in a browser returns the data as JSON.
To access Docker from Java, add this dependency to the project:
<dependency>
<groupId>com.github.docker-java</groupId>
<artifactId>docker-java</artifactId>
<version>3.1.3</version>
</dependency>
DockerClientUtil:
// Connect to Docker
public DockerClient connectDocker() {
DockerClient dockerClient = DockerClientBuilder.getInstance("tcp://localhost:2375").build();
Info info = dockerClient.infoCmd().exec();
log.info("docker environment info:{}", JSONObject.toJSONString(info));
return dockerClient;
}
We use a main method to fetch detailed resource usage for a given container:
public static void main(String[] args) {
DockerClientUtil dockerClientService = new DockerClientUtil();
// Connect to the Docker daemon
DockerClient client = dockerClientService.connectDocker();
ContainerStatisticsInfo statistics = ContainerStatisticsInfo.builder().build();
ResultCallback<Statistics> resultCallback = new ResultCallback<Statistics>() {
@Override
public void close() throws IOException {
}
@Override
public void onStart(Closeable closeable) {
}
@Override
public void onNext(Statistics object) {
statistics.setStatistics(object);
}
@Override
public void onError(Throwable throwable) {
}
@Override
public void onComplete() {
}
};
client.statsCmd("CONTAINER_ID").exec(resultCallback);
try {
Thread.sleep(2000);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println(JSONObject.toJSONString(statistics));
}
Response:
{"ok":false,"statistics":{"blkioStats":{"ioMergedRecursive":[],"ioQueueRecursive":[],"ioServiceBytesRecursive":[{"major":8,"minor":0,"op":"Read","value":83476480},{"major":8,"minor":0,"op":"Write","value":13570048},{"major":8,"minor":0,"op":"Sync","value":13565952},{"major":8,"minor":0,"op":"Async","value":83480576},{"major":8,"minor":0,"op":"Total","value":97046528},{"major":253,"minor":0,"op":"Read","value":83476480},{"major":253,"minor":0,"op":"Write","value":13704704},{"major":253,"minor":0,"op":"Sync","value":13700608},{"major":253,"minor":0,"op":"Async","value":83480576},{"major":253,"minor":0,"op":"Total","value":97181184}],"ioServiceTimeRecursive":[],"ioServicedRecursive":[{"major":8,"minor":0,"op":"Read","value":988},{"major":8,"minor":0,"op":"Write","value":128},{"major":8,"minor":0,"op":"Sync","value":127},{"major":8,"minor":0,"op":"Async","value":989},{"major":8,"minor":0,"op":"Total","value":1116},{"major":253,"minor":0,"op":"Read","value":988},{"major":253,"minor":0,"op":"Write","value":128},{"major":253,"minor":0,"op":"Sync","value":127},{"major":253,"minor":0,"op":"Async","value":989},{"major":253,"minor":0,"op":"Total","value":1116}],"ioTimeRecursive":[],"ioWaitTimeRecursive":[],"sectorsRecursive":[]},"cpuStats":{"cpuUsage":{"percpuUsage":[1740906492,1576282396,2071067412,1788793435],"totalUsage":7177049735,"usageInKernelmode":3990000000,"usageInUsermode":1850000000},"onlineCpus":4,"systemCpuUsage":10680530000000,"throttlingData":{"periods":0,"throttledPeriods":0,"throttledTime":0}},"memoryStats":{"limit":3902095360,"maxUsage":474267648,"stats":{"activeAnon":376696832,"activeFile":19701760,"cache":95473664,"dirty":0,"hierarchicalMemoryLimit":9223372036854771712,"hierarchicalMemswLimit":9223372036854771712,"inactiveAnon":0,"inactiveFile":75771904,"mappedFile":21716992,"pgfault":93119,"pgmajfault":390,"pgpgin":78547,"pgpgout":51163,"rss":376696832,"rssHuge":358612992,"totalActiveAnon":376696832,"totalActiveFile":19701760,"totalCache":95473664,"totalDirty":0,"totalInactiveAnon":0,"totalInactiveFile":75771904,"totalMappedFile":21716992,"totalPgfault":0,"totalPgmajfault":0,"totalPgpgin":0,"totalPgpgout":0,"totalRss":376696832,"totalRssHuge":358612992,"totalUnevictable":0,"totalWriteback":0,"unevictable":0,"writeback":0},"usage":472170496},"networks":{"eth0":{"rxBytes":698,"rxDropped":0,"rxErrors":0,"rxPackets":9,"txBytes":0,"txDropped":0,"txErrors":0,"txPackets":0}},"pidsStats":{"current":37},"preCpuStats":{"cpuUsage":{"percpuUsage":[1738966229,1574918654,2069521792,1787287623],"totalUsage":7170694298,"usageInKernelmode":3990000000,"usageInUsermode":1850000000},"onlineCpus":4,"systemCpuUsage":10676500000000,"throttlingData":{"periods":0,"throttledPeriods":0,"throttledTime":0}},"read":"2021-08-10T14:39:39.345562286Z"}}
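The response carries raw counters rather than a percentage. One common way to turn them into a CPU percentage, and the approach the docker stats command line is generally understood to take, is to divide the container's CPU time delta by the system's CPU time delta between the cpuStats and preCpuStats samples, then scale by the number of online CPUs. The sketch below assumes that formula; the class ContainerCpu and its method name are hypothetical, and the inputs are the plain values from the JSON above rather than docker-java getters.

```java
// Sketch: derive a CPU percentage from two consecutive stats samples.
// ContainerCpu and cpuPercent are illustrative names, not part of docker-java.
class ContainerCpu {

    static double cpuPercent(long totalUsage, long preTotalUsage,
                             long systemUsage, long preSystemUsage,
                             int onlineCpus) {
        // CPU time consumed by the container between the two samples
        long cpuDelta = totalUsage - preTotalUsage;
        // CPU time consumed by the whole host between the two samples
        long systemDelta = systemUsage - preSystemUsage;
        if (cpuDelta <= 0 || systemDelta <= 0) {
            return 0.0; // first sample or counter reset: nothing to compare
        }
        return (double) cpuDelta / systemDelta * onlineCpus * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from cpuStats / preCpuStats in the response above
        double pct = cpuPercent(7177049735L, 7170694298L,
                10680530000000L, 10676500000000L, 4);
        System.out.println("container CPU %: " + pct);
    }
}
```

For the sample response this works out to roughly 0.63%, which is the value to compare against the CPU threshold.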
JVM memory monitoring
The JDK class java.lang.Runtime provides freeMemory(), totalMemory() and maxMemory(). With these three methods we can calculate the JVM's memory usage ratio.
public static JvmMemoryVO jvmMemory() {
JvmMemoryVO jvmMemoryVO = new JvmMemoryVO();
// Divide by 1024.0 so the MB values keep their fractional part
double totalMemory = Runtime.getRuntime().totalMemory() / 1024.0 / 1024.0;
double freeMemory = Runtime.getRuntime().freeMemory() / 1024.0 / 1024.0;
double maxMemory = Runtime.getRuntime().maxMemory() / 1024.0 / 1024.0;
// Memory currently allocated to the JVM, in MB
jvmMemoryVO.setTotalMemory(totalMemory);
// Free memory within the allocated amount, in MB
jvmMemoryVO.setFreeMemory(freeMemory);
// Memory in use by the JVM, in MB
jvmMemoryVO.setUsedMemory(totalMemory - freeMemory);
// Maximum memory the JVM will attempt to use, in MB
jvmMemoryVO.setMaxMemory(maxMemory);
// Usage rate: used / total
jvmMemoryVO.setMemoryUsageRate(jvmMemoryVO.getUsedMemory() / jvmMemoryVO.getTotalMemory());
return jvmMemoryVO;
}
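The method above measures usage against totalMemory(), the amount currently allocated to the JVM, which grows as the heap expands. For alerting it can be more informative to measure against the configured maximum. A minimal sketch of that variant using the standard java.lang.management API follows; the class HeapUsage and its method name are hypothetical and not part of the plugin.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Sketch: heap usage as a fraction of the maximum heap size.
class HeapUsage {

    // Returns used heap divided by the heap limit, a value between 0 and 1.
    // When no maximum is defined (getMax() returns -1), the committed size
    // is used as the limit instead.
    static double heapUsageRate() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return (double) heap.getUsed() / limit;
    }

    public static void main(String[] args) {
        System.out.println("heap usage rate: " + heapUsageRate());
    }
}
```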
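Dubbo thread pool monitoring
Dubbo provides a checker class, ThreadPoolStatusChecker, which exposes information about Dubbo's thread pools.
I call its check method via reflection and then parse the returned data to monitor the Dubbo thread pool.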
public static void checkDubbo() {
// Monitor the Dubbo thread pool. All reflection failures share one handler,
// so a missing class or method can no longer cause a NullPointerException later.
try {
Class<?> clazz = Class.forName("com.alibaba.dubbo.rpc.protocol.dubbo.status.ThreadPoolStatusChecker");
Method check = clazz.getMethod("check");
Object result = check.invoke(clazz.newInstance());
// result holds the checker's status; parse it as needed
log.info("dubbo thread pool status:{}", result);
} catch (ReflectiveOperationException e) {
log.error("dubbo thread pool check failed:", e);
}
}
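The parsing step is not shown in the article. As a rough sketch, assuming the checker's status message is a comma-separated list of key:value pairs along the lines of "Pool status:OK, max:200, core:200, largest:15, active:3, task:120, service port: 20880" (an assumed format that should be verified against the Dubbo version in use), the numeric fields could be extracted like this. PoolStatusParser is a hypothetical helper, not a Dubbo class.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts numeric "key:value" pairs from a status message.
class PoolStatusParser {

    // Matches pairs such as "max:200" or "port: 20880"; non-numeric values
    // such as "status:OK" are skipped.
    private static final Pattern PAIR = Pattern.compile("(\\w+):\\s*(\\d+)");

    static Map<String, Long> parse(String message) {
        Map<String, Long> values = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(message);
        while (m.find()) {
            values.put(m.group(1), Long.parseLong(m.group(2)));
        }
        return values;
    }
}
```

With the parsed values, the ratio of active to max can be compared against a threshold in the same way as the other metrics.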
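Database connection pool monitoring
Our project uses Alibaba's DruidDataSource as its connection pool. It comes with a rich set of built-in features, including reporting on the state of the connection pool.
Once the project is running, open http://127.0.0.1:8080/druid/datasource.html in a browser to view it.
In the plugin, I fetch the connection pool information with an HTTP call to http://127.0.0.1:8080/druid/datasource.json.
Note that if the project has authentication enabled for Druid: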
spring.datasource.druid.stat-view-servlet.login-username=admin
spring.datasource.druid.stat-view-servlet.login-password=admin
you must call the login endpoint first, obtain the cookie, and send that cookie with subsequent requests.
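A minimal sketch of that login-then-fetch flow with the JDK 11 HttpClient is shown below. The login path /druid/submitLogin and the form fields loginUsername and loginPassword are assumptions about Druid's built-in stat servlet and should be checked against the Druid version in use; DruidStatClient is a hypothetical class name.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: log in to the Druid stat page, then fetch datasource.json.
class DruidStatClient {

    private final String baseUrl;

    // The CookieManager stores the session cookie set by the login response
    // and replays it on later requests made through the same client.
    private final HttpClient client = HttpClient.newBuilder()
            .cookieHandler(new CookieManager())
            .build();

    DruidStatClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // Builds the URL-encoded login form body (field names are assumed).
    static String loginForm(String username, String password) {
        return "loginUsername=" + URLEncoder.encode(username, StandardCharsets.UTF_8)
                + "&loginPassword=" + URLEncoder.encode(password, StandardCharsets.UTF_8);
    }

    String fetchDataSourceJson(String username, String password) throws Exception {
        HttpRequest login = HttpRequest.newBuilder(URI.create(baseUrl + "/druid/submitLogin"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(loginForm(username, password)))
                .build();
        client.send(login, HttpResponse.BodyHandlers.ofString());

        HttpRequest stats = HttpRequest.newBuilder(URI.create(baseUrl + "/druid/datasource.json"))
                .GET()
                .build();
        return client.send(stats, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

Usage would look like new DruidStatClient("http://127.0.0.1:8080").fetchDataSourceJson("admin", "admin"), with the credentials taken from the properties shown above.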
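That's it for this component. If you'd like the source code, feel free to message me.
There are also many mature monitoring tools available, such as spring-boot-actuator and ELK. Interested readers can explore them on their own.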
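Summary
- When developing an infrastructure component, the first priority is not to affect the running service. Wrap anything that can throw in try/catch.
- Do not let the component introduce performance problems.
- Still to be improved: data storage, log collection with Kafka, and profile analysis.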