HADOOP - QUICK GUIDE [1] - Architecture and Installation

Original: https://www.tutorialspoint.com/hadoop/hadoop_quick_guide.htm

Hadoop Architecture

Hadoop has two major layers, namely:

  • Processing/computation layer (MapReduce)
  • Storage layer (Hadoop Distributed File System, HDFS)

[Figure: Hadoop architecture diagram]

MapReduce

MapReduce is a parallel programming model for processing large amounts of data across a cluster.
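
Before installing anything, the map, shuffle/sort, reduce flow behind this model can be imitated with ordinary shell pipes to build intuition. This is only an analogy (the input file names are placeholders), not Hadoop itself:

$ cat input/*.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn
# "map":     split the text into one word per line (tr)
# "shuffle": group identical words together (sort)
# "reduce":  count each group (uniq -c), then rank by frequency (sort -rn)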

Hadoop Distributed File System

HDFS provides a distributed file system that is designed to run on commodity hardware.

  • Highly fault-tolerant.
  • Designed to be deployed on low-cost hardware.
  • Provides high-throughput access to application data and is suitable for applications with large datasets.

Hadoop Common

These are Java libraries and utilities required by other Hadoop modules.

Hadoop YARN

This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?

Hadoop runs across clusters of low-cost machines. It does not rely on hardware to provide fault tolerance and high availability; instead, the Hadoop library itself is designed to detect and handle failures at the application layer.

This means:
you can tie together many commodity single-CPU computers into a single functional distributed system, and in practice the clustered machines can read the dataset in parallel and provide much higher throughput.

Work process:

  • Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 128 MB (the Hadoop 2.x default) or 64 MB (the Hadoop 1.x default).
  • These files are then distributed across various cluster nodes for further processing. HDFS, sitting on top of the local file system, supervises the processing.
  • Blocks are replicated to handle hardware failure (see the sketch after this list).
  • Hadoop checks that the code was executed successfully.
  • It performs the sort that takes place between the map and reduce stages, and sends the sorted data to a particular machine.
  • It writes the debugging logs for each job.
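
Once the pseudo-distributed cluster described later in this guide is running, the block splitting and replication mentioned above can be inspected directly; the HDFS path used here is only a placeholder:

$ hdfs dfs -put LICENSE.txt /user/hadoop/LICENSE.txt   # copy a local file into HDFS
$ hdfs fsck /user/hadoop/LICENSE.txt -files -blocks    # show its blocks and replication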

HADOOP - ENVIRONMENT SETUP

This guide covers installation in a Linux environment.
It is recommended to create a dedicated user, so that the Hadoop file system is kept separate from the Unix file system.

$ su
   password:
# useradd hadoop
# passwd hadoop
   New passwd:
   Retype new passwd
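
The remaining user-level steps (SSH keys and ~/.bashrc changes) are intended to be run as this new user, so switch to it before continuing:

# su - hadoop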

SSH Setup and Key Generation

Set up SSH so that starting, stopping, and other distributed daemon operations can be performed across the cluster. To authenticate the different Hadoop users, generate a key pair with SSH using the commands below,
copy the public key from id_rsa.pub into authorized_keys, and grant the owner read and write permissions on that file.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
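
If the key setup worked, an SSH login to the local machine should no longer ask for a password; this is what the start/stop scripts used later rely on:

$ ssh localhost
$ exit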

Installing Java

Verify the Java environment:

 $ java -version
 java version "1.7.0_71"
 Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
 Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed, follow the steps below to install it:

Step 1

Download the latest JDK archive, jdk-{latest version}-X64.tar.gz, from
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads1880260.html.

Step 2

Extract the archive:

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71   jdk-7u71-linux-x64.gz

Step 3

To make Java available to all users, move it to /usr/local/.
Switch to root and run the following commands:

$ su
password:
# mv jdk1.7.0_71 /usr/local/ 
# exit

Step 4

Set the environment variables by adding the following commands to the ~/.bashrc file:

 export JAVA_HOME=/usr/local/jdk1.7.0_71
 export PATH=$PATH:$JAVA_HOME/bin

Then apply the changes and verify the Java installation again:

 $ source ~/.bashrc
 $ java -version

Downloading Hadoop

Download and extract Hadoop:

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
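
The archive was extracted as root, so the hadoop user created earlier cannot yet write to it. Assuming that user name, it is worth handing the directory over before continuing:

# chown -R hadoop:hadoop /usr/local/hadoop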

Hadoop Operation Modes

Hadoop supports three operation modes:

  • Local/Standalone Mode: the default configuration after download; Hadoop runs as a single Java process.
  • Pseudo Distributed Mode: a distributed simulation on a single machine. Each Hadoop daemon such as HDFS, YARN and MapReduce runs as a separate Java process. This mode is useful for development.
  • Fully Distributed Mode: a cluster of at least two machines.

Installing Hadoop in Standalone Mode

The following describes installation in standalone mode. There are no daemons running and everything runs in a single JVM. Standalone mode is well suited to running MapReduce programs during development, which makes it convenient for testing and debugging.

Setting Up Hadoop

Configure the Hadoop environment variables by adding the following to the ~/.bashrc file:

export HADOOP_HOME=/usr/local/hadoop
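
The hadoop command used below resolves only if Hadoop's bin directory is also on the PATH, so add a line like the following to the same ~/.bashrc and re-run source ~/.bashrc:

export PATH=$PATH:$HADOOP_HOME/bin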

Verify from the command line:

 $ hadoop version

If everything is working, you should see output like the following:

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

Example

Hadoop ships with a jar of simple example programs, such as computing the value of Pi and counting words, which can be used to verify the installation:

$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar

Next we need an input directory in which to store a few files for the word count. Other examples can be tested with similar .jar files in the same way. Running the following command lists the MapReduce programs provided by hadoop-mapreduce-examples-2.2.0.jar:

 $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar

Step 1

Create an input directory anywhere you like and copy a few text files into it:

$ mkdir input
$ cp $HADOOP_HOME/*.txt input
$ ls -l input
total 24
-rw-r--r-- 1 root root 15164 Feb 21 10:14 LICENSE.txt
-rw-r--r-- 1 root root   101 Feb 21 10:14 NOTICE.txt
-rw-r--r-- 1 root root  1366 Feb 21 10:14 README.txt

Step 2

Start the Hadoop process to count the words in the input files:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output

Step 3

Step 2 stores its result in the output directory as output/part-r-00000. View it with:

 $ cat output/*
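
One note for re-running the example: Hadoop refuses to write into an output directory that already exists, so remove it (or choose a new name) before starting the job again:

$ rm -r output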

Installing Hadoop in Pseudo Distributed Mode

Step 1: Setting Up Hadoop

Configure the Hadoop environment variables by adding the following to the ~/.bashrc file:

export HADOOP_HOME=/usr/local/hadoop 
export HADOOP_MAPRED_HOME=$HADOOP_HOME 
export HADOOP_COMMON_HOME=$HADOOP_HOME 
export HADOOP_HDFS_HOME=$HADOOP_HOME 
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native 
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Apply the changes:

 $ source ~/.bashrc

Step 2: Hadoop Configuration

The configuration files are located in $HADOOP_HOME/etc/hadoop and need to be modified according to your Hadoop infrastructure:

$ cd $HADOOP_HOME/etc/hadoop

hadoop-env.sh

Replace the JAVA_HOME value with the location of Java on your system:

export JAVA_HOME=/usr/local/jdk1.7.0_71

core-site.xml

core-site.xml contains the port number used by the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers. A minimal configuration is:

<configuration>
  <property>
     <name>fs.default.name</name> 
     <value>hdfs://localhost:9000</value>
   </property>
</configuration>

Note:
Pay attention to the XML format; errors when running after configuration are sometimes caused by malformed XML.

hdfs-site.xml

hdfs-site.xml contains the replication factor and the namenode and datanode paths on the local file system. A simple configuration is:

dfs.replication (data replication value) = 1
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode

(In the paths above, hadoop is the user name, and hadoopinfra/hdfs/namenode and hadoopinfra/hdfs/datanode are the directories used by the HDFS file system.)

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
    </property>
</configuration>
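
The namenode and datanode directories referenced above are not necessarily created for you, so it does no harm to create them up front as the hadoop user (use whatever paths you put in the configuration):

$ mkdir -p ~/hadoopinfra/hdfs/namenode ~/hadoopinfra/hdfs/datanode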

mapred-site.xml

This file specifies which MapReduce framework is in use. By default Hadoop only ships a template, so first copy mapred-site.xml.template to mapred-site.xml and open it for editing:

 $ cp mapred-site.xml.template mapred-site.xml
 $ vi mapred-site.xml

Add the following inside it:

<configuration>
    <property>
         <name>mapreduce.framework.name</name> 
         <value>yarn</value>
   </property>
</configuration>
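
This walkthrough does not show yarn-site.xml, but with mapreduce.framework.name set to yarn, the NodeManager must also provide the MapReduce shuffle auxiliary service. A minimal yarn-site.xml (in the same directory) would look like:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>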

Verifying Hadoop Installation

Step 1: Name Node Setup

Set up the namenode by formatting it from the home directory:

$ cd ~
$ hdfs namenode -format

Expected output:

 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************ 
STARTUP_MSG: Starting NameNode 
STARTUP_MSG:   host = localhost/192.168.1.11 
STARTUP_MSG:   args = [-format] 
STARTUP_MSG:   version = 2.4.1 
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory 
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted. 
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to 
retain 1 images with txid >= 0 
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0 
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11 
************************************************************/

Step 2: Verifying Hadoop dfs

Start the distributed file system with the following command:

$ start-dfs.sh

Expected output:

Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Check the running processes:

$ jps

The NameNode, DataNode, and SecondaryNameNode processes should now be running.

Step 3: Verifying Yarn Script

Start the YARN daemons with the following command:

$ start-yarn.sh

Expected output:

starting yarn daemons.
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Check the processes again:

$ jps

The ResourceManager and NodeManager processes should also be running now.
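
When you are finished, the daemons started above can be shut down again with the matching stop scripts, which live next to the start scripts in $HADOOP_HOME/sbin:

$ stop-yarn.sh
$ stop-dfs.sh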

Step 4: Accessing Hadoop on Browser

The default port for accessing Hadoop in a browser is 50070. Use the following URL to open the Hadoop web interface:

http://localhost:50070/

[Screenshot: Hadoop NameNode web interface]

From here you can browse the file system.

Step 5: Verify All Applications for Cluster

The default port for accessing all applications of the cluster is 8088. Use the following URL to visit this service:

http://localhost:8088/

[Screenshot: YARN cluster applications page]
