The previous two posts reviewed Hive, Hadoop, and related topics.
Next, let's continue learning about Hadoop's other services! (Sorry, to keep my English from getting rusty, the rest of this post is in English.)
In this post, we continue our tour of the Hadoop ecosystem.
Beyond the major components of Hadoop, namely HDFS, MapReduce, and YARN, there are many tools and solutions that supplement or support these core components.
- HDFS: the Hadoop Distributed File System, which can store different types of large data sets (structured, semi-structured, and unstructured data).
- YARN: Yet Another Resource Negotiator; it manages all processing activities by allocating resources and scheduling tasks.
- MapReduce: a programming model for batch data processing.
- Pig: has two parts, the Pig Latin language (which has an SQL-like command structure) and the Pig runtime, analogous to Java and the JVM.
- Spark: performs in-memory computation to speed up data processing compared with MapReduce.
- Hive: query-based data processing using an SQL-like language (HiveQL).
- HBase: a NoSQL database, accessed via a Java API or shell commands; reference
- Mahout: renowned for machine learning; it provides an environment for creating machine learning applications.
- Solr & Lucene: searching and indexing in the Hadoop ecosystem.
- ZooKeeper: the coordinator of Hadoop jobs; it coordinates the various services in a distributed environment.
- Oozie: works as a clock-and-alarm service inside the Hadoop ecosystem, i.e. a job scheduler.
- Sqoop: a tool designed to transfer data between HDFS and relational database servers, importing data from relational databases into HDFS and vice versa. It is written in Java; reference
- Flume: ingests unstructured and semi-structured data into HDFS, such as log data from a web server.
- Kafka: a distributed streaming platform and a distributed commit-log service; I recommend this article (reference) since it clearly explains how Kafka works!
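
The MapReduce model listed above can be illustrated with a small, self-contained sketch that simulates the map, shuffle, and reduce phases in plain Python (no Hadoop required; the function names and sample documents are my own, for illustration only):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["hadoop stores data", "spark and hadoop process data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts["hadoop"])  # 2
print(counts["data"])    # 2
```

In real Hadoop the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network, but the data flow is the same as in this toy version.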
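
Kafka's core abstraction, the distributed commit log, can also be sketched in a few lines of Python: producers append records to an ordered, append-only log, and each consumer tracks its own read offset independently. (The class and variable names below are illustrative and are not Kafka's actual API; a real Kafka log is partitioned and replicated across brokers.)

```python
class CommitLog:
    """A toy, single-partition commit log: an append-only list of records."""

    def __init__(self):
        self.records = []

    def append(self, record):
        """Producers append; the record's position in the log is its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset):
        """Consumers read everything from their own offset onward."""
        return self.records[offset:]

log = CommitLog()
log.append("user signed up")
log.append("user clicked buy")

# Two independent consumers: each tracks its own offset, so one consumer
# reading a record does not remove it for the other.
consumer_a = log.read(0)   # sees both records
consumer_b = log.read(1)   # sees only the second record
print(len(consumer_a))     # 2
print(consumer_b)          # ['user clicked buy']
```

Because records are never removed on read, many consumers can replay the same stream at their own pace, which is what makes Kafka useful both as a messaging system and as a durable log.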