1. Introduction
In recent years, Hadoop has drawn wide attention for its flexible and scalable architecture for storing and processing big data on commodity machines. One of its common use cases is analyzing application log files, since the size of the log files generated by applications keeps increasing (volume) and log files are often unstructured (variety).
In this project, we built a data pipeline to analyze application performance based on application performance data (appperfdata) extracted from log files and database performance data (db2perfdata) extracted from the DBAU database. XXX is used as the sample application to analyze, but the pipeline can be tailored to other applications as well.
In this sample use case, appperfdata is the duration of RESTful API calls. For example, from the appperfdata below we can see how many milliseconds a RESTful API took to execute: in this case, it took 283 milliseconds to complete the API "/XXX/webapp/maskingservice/needMasking/4964". By ordering these records by API duration, we can identify which RESTful APIs perform poorly and optimize them accordingly.
2012-12-14-06-01 06:01:24.743 283 /XXX/webapp/maskingservice/needMasking/4964
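As a rough, illustrative sketch only (not part of the actual pipeline), the appperfdata line above could be parsed and ranked by duration in Python as follows, assuming a whitespace-delimited layout of "minute time duration_ms api_path":

    from typing import List, NamedTuple

    class AppPerfRecord(NamedTuple):
        minute: str       # e.g. "2012-12-14-06-01"; later used as the correlation key
        duration_ms: int  # API execution time in milliseconds
        api: str          # RESTful API path

    def parse_appperf_line(line: str) -> AppPerfRecord:
        # Assumed format: "<minute> <time> <duration_ms> <api_path>"
        minute, _time, duration, api = line.split()
        return AppPerfRecord(minute, int(duration), api)

    def slowest_apis(lines: List[str], top_n: int = 10) -> List[AppPerfRecord]:
        # Sort by duration (descending) to surface the poorly performing APIs first.
        records = [parse_appperf_line(l) for l in lines if l.strip()]
        return sorted(records, key=lambda r: r.duration_ms, reverse=True)[:top_n]

Feeding it the sample line above yields one record with duration_ms = 283 and api = "/XXX/webapp/maskingservice/needMasking/4964".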
On the other hand, from the db2perfdata below we can see how many select, read, update, insert, and delete operations were performed at a given time (currently collected at minute-level granularity).
2012-12-14-06-01,3038,281910,383,365,0
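As another illustrative sketch, assuming the comma-separated fields are minute, selects, reads, updates, inserts, deletes (in the order listed above), the db2perfdata line could be parsed like this:

    from typing import NamedTuple

    class Db2PerfRecord(NamedTuple):
        minute: str   # e.g. "2012-12-14-06-01"
        selects: int
        reads: int
        updates: int
        inserts: int
        deletes: int

    def parse_db2perf_line(line: str) -> Db2PerfRecord:
        # Assumed CSV layout: minute,selects,reads,updates,inserts,deletes
        minute, *counts = line.strip().split(",")
        return Db2PerfRecord(minute, *map(int, counts))

    print(parse_db2perf_line("2012-12-14-06-01,3038,281910,383,365,0"))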
By correlating appperfdata and db2perfdata on timestamp ('2012-12-14-06-01' in the examples above), we can relate slow API calls to the database activity occurring in the same minute.
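For illustration only (the real pipeline performs this join in Hadoop), a minimal in-memory sketch of the timestamp-based correlation, reusing the hypothetical parsers above, might look like this:

    from collections import defaultdict
    from typing import Dict, List, Optional, Tuple

    def correlate(app_records: List[AppPerfRecord],
                  db2_records: List[Db2PerfRecord]
                  ) -> Dict[str, Tuple[List[AppPerfRecord], Optional[Db2PerfRecord]]]:
        # Group appperfdata records by their minute-level timestamp,
        # then attach the db2perfdata record for the same minute.
        by_minute: Dict[str, List[AppPerfRecord]] = defaultdict(list)
        for rec in app_records:
            by_minute[rec.minute].append(rec)
        db2_by_minute = {rec.minute: rec for rec in db2_records}
        return {minute: (apis, db2_by_minute.get(minute))
                for minute, apis in by_minute.items()}

This joins each minute's API calls with the database operation counts for that same minute, which is the correlation described above.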