Hadoop Single-Node Installation

ciaos

Article link: https://blog.csdn.net/ciaos/article/details/8395736

Set up Java and SSH first, then start the SSH daemon:

/etc/init.d/sshd start

Configure passwordless SSH login:

cd
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

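Run ssh localhost to check that you can log in without a password (the first login may still ask for a password or for confirmation of the host key).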

Then download Hadoop (I chose version 1.1.1).

Edit conf/hadoop-env.sh and set the JAVA_HOME environment variable:

export JAVA_HOME=/root/jdk1.6.0_38

Edit Hadoop's core configuration files.

conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/root/opt/hadoop/var/namedir</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/root/opt/hadoop/var/datadir</value>
  </property>
</configuration>

conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

Format the namenode and start all daemons, then run jps to check that everything is running:

bin/hadoop namenode -format
bin/start-all.sh

[fedora@fedora hadoop]$ jps
16449 Jps
13388 NameNode
13492 DataNode
13599 SecondaryNameNode
13686 JobTracker
23248 ElasticSearch
13800 TaskTracker

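If the TaskTracker or another component fails to start, inspect the error log with vi hs* (JVM crash logs are usually named hs_err_pid<pid>.log). The usual cause is a shared library whose dependencies cannot be found; setting LD_LIBRARY_PATH fixes it.

Set the CLASSPATH environment variable: export CLASSPATH=$CLASSPATH:/home/fedora/hadoop/

If the datanode or namenode fails to start, try clearing the storage directories with rm -rf /root/opt/hadoop/var/*, then format the namenode again and restart. Note that this deletes all data stored in HDFS.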

Now we can write a simple statistics job to test the single-node Hadoop setup.

Create test.log with the following content:

20121221  04567 user s00001
20121221  75531 user s00003
20121222  52369 user s00002
20121222  01297 user s00001
20121223  61223 user s00002
20121223  33121 user s00003

Put it into HDFS:

bin/hadoop fs -mkdir inputTest
bin/hadoop fs -put test.log inputTest

Write the job in test/MaxDownBytes.java:

package test;

import java.io.IOException;

import java.util.Iterator;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MaxDownBytes {

	public static class MaxDownBytesMapper extends MapReduceBase
		implements Mapper<LongWritable, Text, Text, IntWritable> {
		@Override
		public void map(LongWritable key, Text value,
				OutputCollector<Text, IntWritable> output, Reporter reporter)
				throws IOException {
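			// Each record is "<date> <bytes> user <id>"; split on any run of
			// whitespace so that repeated spaces do not produce empty fields
			String line = value.toString();
			String[] ss = line.trim().split("\\s+");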
			String date = ss[0];
			int downBytes = Integer.parseInt(ss[1]);
			output.collect(new Text(date), new IntWritable(downBytes));
		}
	}

	public static class MaxDownBytesReducer extends MapReduceBase
		implements Reducer<Text, IntWritable, Text, IntWritable>{

		@Override
		public void reduce(Text key, Iterator<IntWritable> values,
				OutputCollector<Text, IntWritable> output, Reporter reporter)
				throws IOException {
			// Keep the largest download size seen for this date
			
			int maxValue = Integer.MIN_VALUE;
			while(values.hasNext()){
				maxValue = Math.max(maxValue, values.next().get());
			}
			output.collect(key, new IntWritable(maxValue));
		}
	}
	
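	/**
	 * Configures and submits the job.
	 *
	 * @param args input path and output path in HDFS
	 * @throws IOException if the job fails
	 */
	public static void main(String[] args) throws IOException {
		// Expect exactly two arguments: the input path and the output path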
		if(args.length != 2){
			System.err.println("Usage: MaxDownBytes <input path> <output path>");
			System.exit(-1);
		}
		
		JobConf conf = new JobConf(MaxDownBytes.class);
		conf.setJobName("Max download bytes");
		
		FileInputFormat.addInputPath(conf, new Path(args[0]));
		FileOutputFormat.setOutputPath(conf, new Path(args[1]));
		
		conf.setMapperClass(MaxDownBytesMapper.class);
		conf.setReducerClass(MaxDownBytesReducer.class);
		
		conf.setOutputKeyClass(Text.class);
		conf.setOutputValueClass(IntWritable.class);
		
		JobClient.runJob(conf);
	}
}

Compile, package, and run:

javac -cp hadoop-core-1.1.1.jar test/MaxDownBytes.java
jar cvf MaxDownBytes.jar test
bin/hadoop jar MaxDownBytes.jar test.MaxDownBytes inputTest outputRes

The result looks like this:

[fedora@fedora hadoop]$ bin/hadoop fs -cat outputRes/part-00000
20121221	75531
20121222	52369
20121223	61223
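Before submitting to Hadoop, the parsing and per-date maximum can be checked locally. The following is a minimal sketch in plain Java, without any Hadoop classes; the class name LocalMaxDownBytes and the method maxPerDate are hypothetical names chosen for illustration, and the sketch splits on runs of whitespace so that it tolerates repeated spaces in the records.

```java
import java.util.Map;
import java.util.TreeMap;

// Local, Hadoop-free sketch of the MaxDownBytes map and reduce logic.
// The class and method names are illustrative, not part of the original job.
final class LocalMaxDownBytes {

    // Parses each record and keeps the maximum byte count per date.
    // A TreeMap keeps dates sorted, similar to the sorted keys a reducer sees.
    static Map<String, Integer> maxPerDate(String[] lines) {
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            String date = fields[0];
            int downBytes = Integer.parseInt(fields[1]);
            Integer current = result.get(date);
            if (current == null || downBytes > current) {
                result.put(date, downBytes);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "20121221  04567 user s00001",
            "20121221  75531 user s00003",
            "20121222  52369 user s00002",
            "20121222  01297 user s00001",
            "20121223  61223 user s00002",
            "20121223  33121 user s00003",
        };
        // Prints one "date<TAB>max" line per date
        for (Map.Entry<String, Integer> e : maxPerDate(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Run on the six sample records, this prints the same three lines as the Hadoop job output above.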

To compute the total bytes downloaded per day instead, just change the reduce function as follows:

                @Override
                public void reduce(Text key, Iterator<IntWritable> values,
                                OutputCollector<Text, IntWritable> output, Reporter reporter)
                                throws IOException {
                        // Add up all download sizes for this date

                        int sumValue = 0;
                        while(values.hasNext()){
                                sumValue += values.next().get();
                        }
                        output.collect(key, new IntWritable(sumValue));
                }
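One caveat with the summing reducer: an int accumulator wraps around silently once a day's total exceeds Integer.MAX_VALUE (2147483647, about 2 GB). A minimal sketch of a safer accumulation in plain Java follows; the class name DailyTotal is hypothetical, and in the actual job the equivalent change would presumably be to emit a LongWritable instead of an IntWritable.

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative sketch: sum int values into a long accumulator.
// DailyTotal is a hypothetical name, not part of the original job.
final class DailyTotal {

    // Math.addExact throws ArithmeticException on long overflow
    // instead of wrapping around silently.
    static long sum(Iterator<Integer> values) {
        long total = 0L;
        while (values.hasNext()) {
            total = Math.addExact(total, values.next().longValue());
        }
        return total;
    }

    public static void main(String[] args) {
        // The two sample values recorded for 20121221
        System.out.println(sum(Arrays.asList(4567, 75531).iterator())); // prints 80098
    }
}
```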

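MapReduce: a distributed data processing model that runs on a cluster

A MapReduce job consists of the input data, the MapReduce program, and
configuration information. Hadoop divides the job into map tasks and reduce tasks.

Two kinds of nodes control job execution: one jobtracker and a number of
tasktrackers. The jobtracker coordinates all tasktrackers and reschedules a
failed task on another tasktracker. Tasktrackers run the tasks and report
progress back to the jobtracker.

Hadoop divides the input into fixed-size pieces (input splits) and creates one
map task per split. A split should be neither too large nor too small; one HDFS
block (64 MB by default) is usually a good size because it favors data-local
computation. The intermediate output of a map task is written to local disk,
not to HDFS.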
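As a rough illustration of the splitting described above, the sketch below cuts a file into split-sized ranges and so gives the number of map tasks. It is a simplification under stated assumptions: SplitSketch and splits are hypothetical names, and Hadoop's real split computation applies additional rules that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of fixed-size input splits.
// This is not Hadoop's actual algorithm; names are hypothetical.
final class SplitSketch {

    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB

    // Returns {offset, length} pairs that cover fileLength in chunks of splitSize.
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> result = new ArrayList<long[]>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            result.add(new long[] { offset, length });
        }
        return result;
    }

    public static void main(String[] args) {
        // A 200 MB file yields three full 64 MB splits and one 8 MB split
        for (long[] s : splits(200L * 1024 * 1024, BLOCK_SIZE)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With one map task per split, a 200 MB file would be processed by four map tasks in this simplified model.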

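To reduce the amount of data transferred between nodes, you can also specify a combiner function, which pre-aggregates the map output on the map side: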

	conf.setMapperClass(MaxDownBytesMapper.class);
	conf.setCombinerClass(MaxDownBytesReducer.class);
	conf.setReducerClass(MaxDownBytesReducer.class);
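Reusing the reducer as the combiner is only safe when the function gives the same result whether it is applied once to all values or first to partial groups and then to the partial results. The sketch below, in plain Java with hypothetical names and made-up values, shows that max has this property while a mean does not.

```java
// Illustrative check of which reduce functions can double as a combiner.
// CombinerSketch and the sample values are hypothetical.
final class CombinerSketch {

    static int max(int... values) {
        int m = Integer.MIN_VALUE;
        for (int v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    static double mean(double... values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Values for one key, emitted by two separate map tasks: {10, 30, 20} and {50}.
        // Max of the partial maxima equals the global max, so max works as a combiner.
        System.out.println(max(max(10, 30, 20), max(50)) == max(10, 30, 20, 50)); // prints true

        // Mean of the partial means differs from the global mean,
        // so a mean reducer must not be reused as a combiner.
        System.out.println(mean(mean(10.0, 30.0, 20.0), mean(50.0))); // prints 35.0
        System.out.println(mean(10.0, 30.0, 20.0, 50.0));             // prints 27.5
    }
}
```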

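HDFS: a distributed filesystem that runs on a cluster

HDFS is a filesystem designed for storing large files, with a write-once,
read-many-times access pattern.

The namenode keeps the filesystem metadata in memory, so the number of files the
filesystem can hold is limited by the namenode's memory. As a rule of thumb, each
file, directory, and block takes about 150 bytes. One million files of one block
each therefore means two million objects, or about 300 MB of memory.

The following command lists the files and the blocks that make them up: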
bin/hadoop fsck / -files -blocks
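The 150-byte rule of thumb can be turned into a back-of-the-envelope estimate. This is a rough sketch, not a measurement: NameNodeMemory and estimateBytes are hypothetical names, and directories are ignored.

```java
// Back-of-the-envelope estimate of namenode memory use.
// Names are hypothetical; the 150-byte figure is the rule of thumb from the text.
final class NameNodeMemory {

    static final long BYTES_PER_OBJECT = 150L;

    // Each file contributes one file object plus one object per block.
    static long estimateBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // One million files with one block each
        System.out.println(estimateBytes(1000000L, 1L)); // prints 300000000
    }
}
```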

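The namenode maintains the directory tree and metadata of the whole filesystem. Hadoop provides two mechanisms to protect it:
1. Write the namenode's persistent state to other filesystems as well; these writes are synchronous and atomic.
2. Run a secondary namenode, which periodically merges the namespace image with the edit log. Its state lags behind the primary, so it is not a hot standby.

HDFS provides command-line operations through hadoop fs [command], as well as a Java API.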

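Pig: a platform and language for distributed data analysis.

Hive: a distributed data warehouse tool. It provides SQL queries over data stored in HDFS and translates the SQL into MapReduce jobs.

HBase: a distributed column-oriented database. It uses HDFS for its underlying storage and supports both batch computation with MapReduce and random-read point queries.

ZooKeeper: a distributed coordination service that provides primitives such as distributed locks.

Sqoop: a tool for efficiently moving data from relational databases into HDFS.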