Hadoop study notes

After startup, the following web UIs are available (a quick shell check follows the list):
* NameNode - http://localhost:50070/
* JobTracker - http://localhost:50030/
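
A minimal sketch for checking both endpoints from the shell (assumes curl is installed; the ports are the 0.20.x defaults listed above):

curl -sI http://localhost:50070/ | head -n 1   # NameNode web UI, expect an HTTP 200 status line
curl -sI http://localhost:50030/ | head -n 1   # JobTracker web UI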
-------- Steps (quick command list) -------------

$./bin/hadoop namenode -format
$./bin/start-all.sh
$jps -l
$./bin/hadoop dfsadmin -report
$echo "hello hadoopworld." > /tmp/test_file1.txt
$echo "hello world hadoop,I'm test." > /tmp/test_file2.txt
$./bin/hadoop dfs -mkdir test-in
$./bin/hadoop dfs -copyFromLocal /tmp/test*.txt test-in
$./bin/hadoop dfs -ls test-in
$./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount test-in test-out
$./bin/hadoop dfs -ls test-out
$./bin/hadoop dfs -cat test-out/part-r-00000

--------- Steps (detailed, with output) --------------


$ ./bin/hadoop namenode -format
10/12/29 14:25:57 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = haoning/10.4.125.111
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/12/29 14:25:57 INFO namenode.FSNamesystem: fsOwner=Administrator,None,root,Administrators,Users,Debugger,Users,ora_dba
10/12/29 14:25:57 INFO namenode.FSNamesystem: supergroup=supergroup
10/12/29 14:25:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/12/29 14:25:57 INFO common.Storage: Image file of size 103 saved in 0 seconds.
10/12/29 14:25:57 INFO common.Storage: Storage directory \home\Administrator\tmp\dfs\name has been successfully formatted.
10/12/29 14:25:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at haoning/10.4.125.111
************************************************************/


$ ./bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-Administrator-namenode-haoning.out
localhost: datanode running as process 352. Stop it first.
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-Administrator-secondarynamenode-haoning.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-Administrator-jobtracker-haoning.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-Administrator-tasktracker-haoning.out


A PDF found online, http://blogimg.chinaunix.net/blog/upfile2/100317223114.pdf, explains things simply and clearly.
Hadoop does not seem to get along with OpenJDK.
Read it together with http://hadoop.apache.org/common/docs/r0.18.2/cn/quickstart.html.
[b]JDK[/b]
Use hadoop-0.20.2.
sudo apt-get install sun-java6-jdk
On Ubuntu it installs under /usr/lib/jvm.
On Red Hat 5, download jdk-6u23-linux-x64-rpm.bin instead;
after installation it lives in /usr/java/jdk1.6.0_23.
If ssh on Ubuntu is slow, edit /etc/ssh/ssh_config and uncomment these two lines (disabling GSSAPI authentication speeds up login):
#GSSAPIAuthentication no
#GSSAPIDelegateCredentials no
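
A one-liner sketch for making that change (assumes the stock commented-out lines are present; back the file up first):

sudo cp /etc/ssh/ssh_config /etc/ssh/ssh_config.bak
sudo sed -i 's/^# *GSSAPIAuthentication no/GSSAPIAuthentication no/; s/^# *GSSAPIDelegateCredentials no/GSSAPIDelegateCredentials no/' /etc/ssh/ssh_config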


[b]Users[/b]
redhat5:
groupadd hadoop
useradd hadoop -g hadoop
vim /etc/sudoers
Add the following below the existing root entry:
root ALL=(ALL) ALL
hadoop ALL=(ALL) ALL

Since /etc/sudoers is read-only, force the write with :wq!.
[hadoop@122226 .ssh]$ ssh-keygen -t rsa -P ""
cat id_rsa.pub >authorized_keys
The same works on Ubuntu.
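
A compact version of the passwordless-SSH setup plus a quick check (a sketch; run it as the hadoop user):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa               # key with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys         # authorize the key for this user
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys    # sshd refuses overly open permissions
ssh localhost date                                      # should print the date with no password prompt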

[b]Configuration files[/b]
The 0.18 releases seem to use a single hadoop-site.xml;
0.20 splits it into several *-site.xml files.
hadoop@ubuntu:/usr/local/hadoop/hadoop-0.20.2$ vim conf/hadoop-env.sh
Set export JAVA_HOME=/usr/lib/jvm/java-6-sun
hadoop@ubuntu:/usr/local/hadoop/hadoop-0.20.2$ vim conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
</configuration>

hadoop@ubuntu:/usr/local/hadoop/hadoop-0.20.2$ vim conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>



You can copy test-out back out of HDFS:
./bin/hadoop dfs -copyToLocal /user/hadoop/test-out test-out
Have a look; -copyToLocal and -get appear to mean the same thing.
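
A quick way to convince yourself the two behave identically (a sketch; the local target directories are arbitrary names):

./bin/hadoop dfs -copyToLocal /user/hadoop/test-out out-copy   # copy the HDFS directory to a local one
./bin/hadoop dfs -get /user/hadoop/test-out out-get            # same operation via -get
diff -r out-copy out-get && echo identical                     # expect no differences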


Run ./bin/hadoop namenode -format:


hadoop@ubuntu:~/tmp$ tree
.
└── dfs
└── name
├── current
│   ├── edits
│   ├── fsimage
│   ├── fstime
│   └── VERSION
└── image
└── fsimage

4 directories, 5 files


After ./bin/start-all.sh,
run ./bin/hadoop dfs -mkdir test-in:
.
|-- dfs
| |-- data
| | |-- current
| | | |-- blk_-1605603437240955017
| | | |-- blk_-1605603437240955017_1019.meta
| | | |-- dncp_block_verification.log.curr
| | | `-- VERSION
| | |-- detach
| | |-- in_use.lock
| | |-- storage
| | `-- tmp
| |-- name
| | |-- current
| | | |-- edits
| | | |-- fsimage
| | | |-- fstime
| | | `-- VERSION
| | |-- image
| | | `-- fsimage
| | `-- in_use.lock
| `-- namesecondary
| |-- current
| | |-- edits
| | |-- fsimage
| | |-- fstime
| | `-- VERSION
| |-- image
| | `-- fsimage
| `-- in_use.lock
`-- mapred
`-- local
13 directories, 18 files


After startup:

hadoop@ubuntu:/usr/local/hadoop/hadoop-0.20.2$ ./bin/hadoop dfsadmin -report
Configured Capacity: 25538187264 (23.78 GB)
Present Capacity: 8219365391 (7.65 GB)
DFS Remaining: 8219340800 (7.65 GB)
DFS Used: 24591 (24.01 KB)
DFS Used%: 0%
Under replicated blocks: 1
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 25538187264 (23.78 GB)
DFS Used: 24591 (24.01 KB)
Non DFS Used: 17318821873 (16.13 GB)
DFS Remaining: 8219340800(7.65 GB)
DFS Used%: 0%
DFS Remaining%: 32.18%
Last contact: Tue Dec 21 11:18:48 CST 2010


Or simply run jps.
Then run:
hadoop@zhengxq-desktop:/usr/local/hadoop/hadoop-0.20.1$ echo "hello hadoop world." > /tmp/test_file1.txt
hadoop@zhengxq-desktop:/usr/local/hadoop/hadoop-0.20.1$ echo "hello world hadoop,I'm haha." > /tmp/test_file2.txt
hadoop@zhengxq-desktop:/usr/local/hadoop/hadoop-0.20.1$ bin/hadoop dfs -copyFromLocal /tmp/test*.txt test-in

hadoop@test-linux:~/tmp$ tree
.
|-- dfs
| |-- data
| | |-- current
| | | |-- blk_-1605603437240955017
| | | |-- blk_-1605603437240955017_1019.meta
| | | |-- blk_-2047199693110071270
| | | |-- blk_-2047199693110071270_1020.meta
| | | |-- blk_-7264292243816045059
| | | |-- blk_-7264292243816045059_1021.meta
| | | |-- dncp_block_verification.log.curr
| | | `-- VERSION
| | |-- detach
| | |-- in_use.lock
| | |-- storage
| | `-- tmp
| |-- name
| | |-- current
| | | |-- edits
| | | |-- fsimage
| | | |-- fstime
| | | `-- VERSION
| | |-- image
| | | `-- fsimage
| | `-- in_use.lock
| `-- namesecondary
| |-- current
| | |-- edits
| | |-- fsimage
| | |-- fstime
| | `-- VERSION
| |-- image
| | `-- fsimage
| `-- in_use.lock
`-- mapred
`-- local

13 directories, 22 files


hadoop@test-linux:/usr/local/hadoop/hadoop-0.20.2$ ./bin/hadoop dfs -ls test-in
Found 2 items
-rw-r--r-- 3 hadoop supergroup 21 2010-12-21 23:28 /user/hadoop/test-in/test_file1.txt
-rw-r--r-- 3 hadoop supergroup 22 2010-12-21 23:28 /user/hadoop/test-in/test_file2.txt
hadoop@test-linux:/usr/local/hadoop/hadoop-0.20.2$ ./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount test-in test-out
10/12/21 23:36:12 INFO input.FileInputFormat: Total input paths to process : 2
10/12/21 23:36:13 INFO mapred.JobClient: Running job: job_201012212251_0001
10/12/21 23:36:14 INFO mapred.JobClient: map 0% reduce 0%
10/12/21 23:36:55 INFO mapred.JobClient: map 100% reduce 0%
10/12/21 23:37:14 INFO mapred.JobClient: map 100% reduce 100%
10/12/21 23:37:16 INFO mapred.JobClient: Job complete: job_201012212251_0001
10/12/21 23:37:16 INFO mapred.JobClient: Counters: 17
10/12/21 23:37:16 INFO mapred.JobClient: Job Counters
10/12/21 23:37:16 INFO mapred.JobClient: Launched reduce tasks=1
10/12/21 23:37:16 INFO mapred.JobClient: Launched map tasks=2
10/12/21 23:37:16 INFO mapred.JobClient: Data-local map tasks=2
10/12/21 23:37:16 INFO mapred.JobClient: FileSystemCounters
10/12/21 23:37:16 INFO mapred.JobClient: FILE_BYTES_READ=85
10/12/21 23:37:16 INFO mapred.JobClient: HDFS_BYTES_READ=43
10/12/21 23:37:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=240
10/12/21 23:37:16 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=38
10/12/21 23:37:16 INFO mapred.JobClient: Map-Reduce Framework
10/12/21 23:37:16 INFO mapred.JobClient: Reduce input groups=4
10/12/21 23:37:16 INFO mapred.JobClient: Combine output records=6
10/12/21 23:37:16 INFO mapred.JobClient: Map input records=2
10/12/21 23:37:16 INFO mapred.JobClient: Reduce shuffle bytes=91
10/12/21 23:37:16 INFO mapred.JobClient: Reduce output records=4
10/12/21 23:37:16 INFO mapred.JobClient: Spilled Records=12
10/12/21 23:37:16 INFO mapred.JobClient: Map output bytes=67
10/12/21 23:37:16 INFO mapred.JobClient: Combine input records=6
10/12/21 23:37:16 INFO mapred.JobClient: Map output records=6
10/12/21 23:37:16 INFO mapred.JobClient: Reduce input records=6


hadoop@test-linux:~/tmp$ tree
.
|-- dfs
| |-- data
| | |-- current
| | | |-- blk_-1605603437240955017
| | | |-- blk_-1605603437240955017_1019.meta
| | | |-- blk_-1792462247745372986
| | | |-- blk_-1792462247745372986_1027.meta
| | | |-- blk_-2047199693110071270
| | | |-- blk_-2047199693110071270_1020.meta
| | | |-- blk_-27635221429411767
| | | |-- blk_-27635221429411767_1027.meta
| | | |-- blk_-7264292243816045059
| | | |-- blk_-7264292243816045059_1021.meta
| | | |-- blk_-8634524858846751168
| | | |-- blk_-8634524858846751168_1026.meta
| | | |-- dncp_block_verification.log.curr
| | | `-- VERSION
| | |-- detach
| | |-- in_use.lock
| | |-- storage
| | `-- tmp
| |-- name
| | |-- current
| | | |-- edits
| | | |-- fsimage
| | | |-- fstime
| | | `-- VERSION
| | |-- image
| | | `-- fsimage
| | `-- in_use.lock
| `-- namesecondary
| |-- current
| | |-- edits
| | |-- fsimage
| | |-- fstime
| | `-- VERSION
| |-- image
| | `-- fsimage
| `-- in_use.lock
`-- mapred
`-- local
|-- jobTracker
`-- taskTracker
`-- jobcache

16 directories, 28 files

hadoop@test-linux:/usr/local/hadoop/hadoop-0.20.2$ ./bin/hadoop dfs -lsr
drwxr-xr-x - hadoop supergroup 0 2010-12-21 23:28 /user/hadoop/test-in
-rw-r--r-- 3 hadoop supergroup 21 2010-12-21 23:28 /user/hadoop/test-in/haoning1.txt
-rw-r--r-- 3 hadoop supergroup 22 2010-12-21 23:28 /user/hadoop/test-in/haoning2.txt
drwxr-xr-x - hadoop supergroup 0 2010-12-21 23:37 /user/hadoop/test-out
drwxr-xr-x - hadoop supergroup 0 2010-12-21 23:36 /user/hadoop/test-out/_logs
drwxr-xr-x - hadoop supergroup 0 2010-12-21 23:36 /user/hadoop/test-out/_logs/history
-rw-r--r-- 3 hadoop supergroup 16751 2010-12-21 23:36 /user/hadoop/test-out/_logs/history/localhost_1292943083664_job_201012212251_0001_conf.xml
-rw-r--r-- 3 hadoop supergroup 8774 2010-12-21 23:36 /user/hadoop/test-out/_logs/history/localhost_1292943083664_job_201012212251_0001_hadoop_word+count
-rw-r--r-- 3 hadoop supergroup 38 2010-12-21 23:37 /user/hadoop/test-out/part-r-00000

hadoop@test-linux:/usr/local/hadoop/hadoop-0.20.2$ ./bin/hadoop dfs -cat test-out/part-r-00000
This prints the word-count results.

As you can see,
each file visible in DFS corresponds to two files on the DataNode's disk:
blk_-*_*.meta
blk_-*
(the block-verification log does not count).
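
To see which blocks belong to which HDFS file, fsck can print the mapping (paths follow the example above):

./bin/hadoop fsck /user/hadoop/test-in -files -blocks -locations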


To install on Windows, install Cygwin first (just install everything; you need ssh). Following the Baidu Wenku tutorial "在Windows上安装Hadoop教程", create a symlink to work around the space in the JDK path:
ln -s /cygdrive/c/Program\ Files/Java/jdk1.6.0_17 \
/usr/local/jdk1.6.0_17
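
With the link in place, hadoop-env.sh can point at the space-free path (a sketch, assuming the link target above):

# in conf/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.6.0_17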


Paths get a bit messy on Windows, but as long as Cygwin has ssh installed, the ssh-keygen trust is set up, and directory permissions are consistent, it should be fine. On 2010-12-29 the single-node setup ran successfully on Ubuntu 10.10, Red Hat 5, and Windows XP + Cygwin; a cluster is next. On Windows, use tree /F to inspect the results under tmp.
One odd thing:
with Hadoop unpacked under D:/cygwin/usr/local/hadoop,
running the following from that directory
echo "aa" >/tmp/test_file1.txt
$ ./bin/hadoop fs -copyFromLocal /tmp/test_file1.txt /user/Administrator/test-in
copyFromLocal: File /tmp/test_file1.txt does not exist.
Copying test_file1.txt to D:/tmp makes it work, presumably because the JVM interprets /tmp as D:/tmp rather than Cygwin's /tmp.
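
A workaround sketch (assumes Cygwin lives under D:/cygwin as above): make sure the file exists at the path the JVM will actually look at, then copy it in.

mkdir -p /cygdrive/d/tmp                     # D:/tmp as seen from Cygwin
cp /tmp/test_file1.txt /cygdrive/d/tmp/      # mirror the file to where the JVM expects /tmp
./bin/hadoop fs -copyFromLocal /tmp/test_file1.txt /user/Administrator/test-in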


[b]Cluster[/b]
usermod -G group user
vi /etc/group
/etc/passwd

Following http://malixxx.iteye.com/blog/459277,
adapted to hadoop-0.20.2,
core-site.xml is:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.200.12:8888</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/Administrator/tmp</value>
  </property>
</configuration>


masters contains the master's own IP, 192.168.200.12;
slaves contains the node IP, 192.168.200.16.
mapred-site.xml is:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.200.12:9999</value>
  </property>
</configuration>

If changing mapred-site.xml does not help and you get "FATAL org.apache.hadoop.mapred.JobTracker: java.net.BindException: Problem binding to
/192.168.200.16:9999 : Cannot assign requested address",
the JobTracker never came up. The BindException means the process is trying to bind an address that is not its own, so mapred.job.tracker should point at the master's own IP; presumably a configuration error (left unsolved for now).
The recipe is really just: take the single-node setup that already works and change only the slaves file (the slaves appear to be the DataNodes).
Note that hadoop-env.sh sets JAVA_HOME; if you copy the NameNode's hadoop directory over to a DataNode, check that the JAVA_HOME path is still valid there.
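
A sketch of pushing the working setup to a slave and checking JAVA_HOME there (hosts and paths are the ones used above; adjust as needed):

# from the master, copy the whole hadoop directory to the slave
scp -r /usr/local/hadoop/hadoop-0.20.2 hadoop@192.168.200.16:/usr/local/hadoop/
# confirm hadoop-env.sh on the slave points at a JDK that actually exists there
ssh hadoop@192.168.200.16 'grep JAVA_HOME /usr/local/hadoop/hadoop-0.20.2/conf/hadoop-env.sh'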
---------★-----------
The firewall cost me a whole day.
I found a script online; after applying it the errors stopped.
accept-all.sh

#!/bin/sh
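# Flush every table (nat, mangle, filter, raw) and reset all chain policies to ACCEPT.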

IPT='/sbin/iptables'
$IPT -t nat -F
$IPT -t nat -X
$IPT -t nat -P PREROUTING ACCEPT
$IPT -t nat -P POSTROUTING ACCEPT
$IPT -t nat -P OUTPUT ACCEPT
$IPT -t mangle -F
$IPT -t mangle -X
$IPT -t mangle -P PREROUTING ACCEPT
$IPT -t mangle -P INPUT ACCEPT
$IPT -t mangle -P FORWARD ACCEPT
$IPT -t mangle -P OUTPUT ACCEPT
$IPT -t mangle -P POSTROUTING ACCEPT
$IPT -F
$IPT -X
$IPT -P FORWARD ACCEPT
$IPT -P INPUT ACCEPT
$IPT -P OUTPUT ACCEPT
$IPT -t raw -F
$IPT -t raw -X
$IPT -t raw -P PREROUTING ACCEPT
$IPT -t raw -P OUTPUT ACCEPT
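
To apply it (a minimal usage sketch; run as root or via sudo):

chmod +x accept-all.sh
sudo ./accept-all.sh
sudo iptables -L -n     # every chain should now show policy ACCEPT with no rules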

The culprit behind all those network-connection / NIO errors was the iptables firewall.
Once it ran successfully:
on the master (.12), the NameNode, SecondaryNameNode, and JobTracker are up:

$ jps -l
32416 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
32483 org.apache.hadoop.mapred.JobTracker
1398 sun.tools.jps.Jps
32252 org.apache.hadoop.hdfs.server.namenode.NameNode

After copying a file from the local filesystem into a newly created directory in HDFS:

.
`-- tmp
`-- dfs
|-- name
| |-- current
| | |-- VERSION
| | |-- edits
| | |-- edits.new
| | |-- fsimage
| | `-- fstime
| |-- image
| | `-- fsimage
| `-- in_use.lock
`-- namesecondary
|-- current
|-- in_use.lock
`-- lastcheckpoint.tmp

On the slave (.16), the DataNode and TaskTracker are up:

# jps -l
32316 sun.tools.jps.Jps
31068 org.apache.hadoop.mapred.TaskTracker
30949 org.apache.hadoop.hdfs.server.datanode.DataNode
#

.
`-- tmp
|-- dfs
| `-- data
| |-- current
| | |-- VERSION
| | |-- blk_-4054376904853997355
| | |-- blk_-4054376904853997355_1002.meta
| | |-- blk_-8185269915321998969
| | |-- blk_-8185269915321998969_1001.meta
| | `-- dncp_block_verification.log.curr
| |-- detach
| |-- in_use.lock
| |-- storage
| `-- tmp
`-- mapred
`-- local

Five daemons in total.
Errors like "FileSystem is not ready yet", "retry", and "could only be replicated to 0 nodes" all went away.
I glanced at the IPC code; it looks a lot like IBM's introductory NIO example MultiPortEcho.java.
http://bbs.hadoopor.com/thread-329-1-2.html
If you run into connection failures, look for network-level causes first.
On my home single-node Ubuntu box, localhost works, but switching to the machine's IP does not.
Removing virbr0 and bridge-utils did not help either.
What finally worked was configuring a static address in /etc/network/interfaces:

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address 192.168.1.118
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
gateway 192.168.1.1

$ /etc/init.d/networking restart
Cleared the /tmp directory and /home/hadoop/tmp; with several machines, clear tmp on all of them. Keep an eye on the logs under logs/. Sometimes after stop-all.sh the daemons no longer show up in jps -l but Java is still running: check with ps -ef|grep java, and use netstat -nltp|grep 9999 to see whether the JobTracker is still listening; if it is, kill it.
After rebooting and putting 192.168.1.118 in core-site.xml, mapred-site.xml, slaves, and masters, everything worked. DHCP seems to cause trouble: one whole evening it either kept erroring out or dfsadmin -report showed all zeros, probably an IP-mapping problem, plus some 127.0.1.1 weirdness. In short, a fixed IP solved it.
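
The 127.0.1.1 issue is usually the Ubuntu /etc/hosts entry that maps the hostname to 127.0.1.1, which makes the daemons advertise a loopback address. A minimal check/fix sketch (hostname and IP are the ones used above):

grep -n "$(hostname)" /etc/hosts      # look for a "127.0.1.1  <hostname>" line
# if it is there, map the hostname to the static IP instead, e.g.:
# 192.168.1.118   zhengxq-desktop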


I never quite knew what hdfs-site.xml is for; here is an example found online at http://www.javawhat.com/showWebsiteContent.do?id=527440:
<configuration>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>conf/excludes</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>namenodeip:50070</value>
  </property>
  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>12582912</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/hadoop1/data/,/hadoop2/data/</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>1073741824</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/hadoop/name/</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>True</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
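
A usage note in the form of a sketch (assumes the dfs.name.dir / dfs.data.dir values above): the directories should exist and be writable by the hadoop user before the daemons start, and a brand-new dfs.name.dir needs a namenode -format.

# on the namenode
sudo mkdir -p /hadoop/name && sudo chown -R hadoop:hadoop /hadoop
# on each datanode
sudo mkdir -p /hadoop1/data /hadoop2/data && sudo chown -R hadoop:hadoop /hadoop1 /hadoop2
./bin/hadoop namenode -format    # only for a fresh cluster; this wipes existing HDFS metadata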