How to know that new data has been added to HDFS?

江南老画船

Original link: https://stackoverflow.com/questions/14934079/how-to-know-that-a-new-data-is-been-added-to-hdfs

I am implementing a notification system based on the publish-subscribe model to announce that data is available as it arrives in, or is loaded into, HDFS. I have not found where to look for this. Is there an HDFS API that can do this, or what method should I use to learn about new data written to HDFS? I am using Hadoop v2.0.2. I don't want to use HCatalog; I want to implement my own tool.

What you are looking for is Oozie Coordinator.

HDFS is a file system, so something must be built on top of it to check for file availability. HBase has coprocessors, which act like triggered procedures, but they are only available for HBase tables, so they cannot be used to detect data availability in HDFS.
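To make "something built on top of HDFS" concrete, here is a minimal polling sketch. It is not Oozie and not an HDFS API: it only compares successive directory listings and reports paths that were not seen before. The names `watch_for_new_paths`, `list_paths`, `interval_seconds`, and `max_polls` are hypothetical, and `list_paths` stands in for whatever listing mechanism you have available.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Set


def watch_for_new_paths(
    list_paths: Callable[[], Iterable[str]],
    interval_seconds: float = 60.0,
    max_polls: Optional[int] = None,
) -> Iterator[Set[str]]:
    """Yield the set of newly observed paths after each poll that finds any.

    list_paths is a hypothetical callable returning the current listing of
    the watched directory. It is an assumption of this sketch, not an HDFS API.
    """
    # The first listing is the baseline: paths already present are not reported.
    seen: Set[str] = set(list_paths())
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval_seconds)
        current = set(list_paths())
        new_paths = current - seen
        if new_paths:
            yield new_paths
        # Remember every path ever observed, so a path that is deleted and
        # later re-created is not reported a second time.
        seen |= current
        polls += 1
```

In a real deployment, `list_paths` might wrap the output of `hdfs dfs -ls` or a client library call; that wiring is left out here. Note that a listing can include a file that is still being written, so a subscriber may want to wait for a completion marker or a rename into the watched directory before treating the data as available.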

Oozie is a workflow scheduler system for managing Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. You can also execute other programs from it: