1. Introduction
In recent years, Hadoop has drawn wide attention for its flexible and scalable architecture for storing and processing big data on commodity machines. One of its common use cases is analyzing application log files, since the size of the log files generated by applications keeps increasing (volume) and log files are often unstructured (variety).
In this project, we built a data pipeline to analyze application performance based on application performance data (appperfdata) extracted from log files and database performance data (db2perfdata) extracted from the DBAU database. XXX is used as the sample application to analyze, but the pipeline can be tailored to other applications as well.
In this sample use case, appperfdata is the duration of RESTful API calls. For example, from the appperfdata below we can see how many milliseconds a RESTful API took to execute: in this case, it took 283 milliseconds to complete the API "/XXX/webapp/maskingservice/needMasking/4964". By ordering these records by API duration, we can identify which RESTful APIs perform poorly and optimize them accordingly.
2012-12-14-06-01 06:01:24.743 283 /XXX/webapp/maskingservice/needMasking/4964
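As a rough, illustrative sketch only (not part of the actual pipeline), the appperfdata line above could be parsed and ranked by duration in Python as follows, assuming a whitespace-delimited layout of "minute time duration_ms api_path":

    from typing import List, NamedTuple

    class AppPerfRecord(NamedTuple):
        minute: str       # e.g. "2012-12-14-06-01"; later used as the correlation key
        duration_ms: int  # API execution time in milliseconds
        api: str          # RESTful API path

    def parse_appperf_line(line: str) -> AppPerfRecord:
        # Assumed format: "<minute> <time> <duration_ms> <api_path>"
        minute, _time, duration, api = line.split()
        return AppPerfRecord(minute, int(duration), api)

    def slowest_apis(lines: List[str], top_n: int = 10) -> List[AppPerfRecord]:
        # Sort by duration (descending) to surface the poorly performing APIs first.
        records = [parse_appperf_line(l) for l in lines if l.strip()]
        return sorted(records, key=lambda r: r.duration_ms, reverse=True)[:top_n]

Feeding it the sample line above yields one record with duration_ms = 283 and api = "/XXX/webapp/maskingservice/needMasking/4964".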
On the other hand, from the db2perfdata below we can see how many select, read, update, insert, and delete operations were performed at a given time (currently collected at minute-level granularity).
2012-12-14-06-01,3038,281910,383,365,0
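As another illustrative sketch, assuming the comma-separated fields are minute, selects, reads, updates, inserts, deletes (in the order listed above), the db2perfdata line could be parsed like this:

    from typing import NamedTuple

    class Db2PerfRecord(NamedTuple):
        minute: str   # e.g. "2012-12-14-06-01"
        selects: int
        reads: int
        updates: int
        inserts: int
        deletes: int

    def parse_db2perf_line(line: str) -> Db2PerfRecord:
        # Assumed CSV layout: minute,selects,reads,updates,inserts,deletes
        minute, *counts = line.strip().split(",")
        return Db2PerfRecord(minute, *map(int, counts))

    print(parse_db2perf_line("2012-12-14-06-01,3038,281910,383,365,0"))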
By correlating appperfdata and db2perfdata on timestamp ('2012-12-14-06-01' in the examples above), we can relate slow API calls to the database activity occurring in the same minute.
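For illustration only (the real pipeline performs this join in Hadoop), a minimal in-memory sketch of the timestamp-based correlation, reusing the hypothetical parsers above, might look like this:

    from collections import defaultdict
    from typing import Dict, List, Optional, Tuple

    def correlate(app_records: List[AppPerfRecord],
                  db2_records: List[Db2PerfRecord]
                  ) -> Dict[str, Tuple[List[AppPerfRecord], Optional[Db2PerfRecord]]]:
        # Group appperfdata records by their minute-level timestamp,
        # then attach the db2perfdata record for the same minute.
        by_minute: Dict[str, List[AppPerfRecord]] = defaultdict(list)
        for rec in app_records:
            by_minute[rec.minute].append(rec)
        db2_by_minute = {rec.minute: rec for rec in db2_records}
        return {minute: (apis, db2_by_minute.get(minute))
                for minute, apis in by_minute.items()}

This joins each minute's API calls with the database operation counts for that same minute, which is the correlation described above.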