1) Find the root cause of the problem
When reading the SequenceFile the job kept failing with connection errors, and for a long time I assumed the cluster had not been set up correctly; but other programs ran fine, so the problem had to be in the program itself.
2) hadoop streaming with babel could not process two consecutive files generated from a SequenceFile, and for a long time I had no idea why. After dumping the SequenceFile with cat and searching the web, I found that babel cannot handle a file that has an extra blank line before the start of a molecule.
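A minimal workaround sketch for that blank-line problem (my own helper, not part of Hadoop or babel): drop blank lines until the first non-blank line arrives, then pass everything through unchanged, so babel receives a well-formed SDF.

```python
#!/usr/bin/env python
# Hypothetical pre-filter sketch: strip the blank line(s) that precede the
# first molecule when an SDF is reconstructed from a SequenceFile.
# Intended use in a pipe (names are assumptions):
#   hadoop fs -cat dump.sdf | python fix_sdf.py | babel -isdf -ofpt -xh -xf FP4
# where fix_sdf.py does: sys.stdout.writelines(strip_leading_blanks(sys.stdin))

def strip_leading_blanks(lines):
    """Skip blank lines until the first non-blank line, then yield the rest as-is."""
    seen_content = False
    for line in lines:
        if not seen_content and line.strip() == "":
            continue  # still in the leading blank region: drop it
        seen_content = True
        yield line
```

Blank lines *inside* the file are preserved; only the leading ones are removed.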
3) Parts of hadoop streaming can be written yourself (here, a custom input format), which shows how extensible Hadoop is; other problems can likewise be solved by customizing Hadoop.
4)
Single-variable testing: change one thing per run, stepping the mapper from cat up to babel -isdf -ofpt -xh -xf FP4:
hadoop jar ../SdfInputFormat-old/jar-streaming/hadoop-streaming-1.0.0.jar -input "/Com*_1*" -mapper 'cat' -inputformat TextInputFormat -numReduceTasks 0 -output 'o19'
hadoop jar jar/hadoop-streaming-1.0.0.jar -input "/Com*_1*" -mapper 'cat' -inputformat TextInputFormat -numReduceTasks 0 -output 'o20'
hadoop jar jar/hadoop-streaming-1.0.0.jar -input "/Com*_1*" -mapper 'cat' -inputformat SdfTextInputFormat -numReduceTasks 0 -output 'o21'
hadoop jar jar/hadoop-streaming-1.0.0.jar -input "/Com*_1*" -mapper 'babel -isdf -ofpt -xh -xf FP4' -inputformat SdfTextInputFormat -numReduceTasks 0 -output 'o22'
mjiang@venus ~/java/eclipse/target-hadoop/Streaming-jar $ hadoop fs -cat o22/part-00000
>1 15 bits set
00000000 00000000 00000000 00000000 00000000 00000000
00002fc0 00000000 00000000 00000000 00000000 00000000
00000000 00980000 00000004 04400003
hadoop jar jar/hadoop-streaming-1.0.0.jar -input "/Compound_000000001_000025000.sdf" -mapper 'babel -isdf -ofpt -xh -xf FP4' -inputformat SdfTextInputFormat -numReduceTasks 0 -output 'o23'
5)
hadoop jar jar-streaming/hadoop-streaming-1.0.0.jar -D mapred.map.tasks=2 -input "/Com*_1*" -mapper 'cat' -inputformat SdfTextInputFormat -numReduceTasks 0 -output 'o13'
hadoop fs -cat /user/mjiang/o13/part-00001 returns nothing, which shows that the number of splits is not controlled by mapred.map.tasks=2.
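My reading of why the hint is ignored: in Hadoop 1.x, mapred.map.tasks only feeds into a goal size, and FileInputFormat then clamps the split size against the block size, which usually dominates. A simplified sketch of that sizing logic (from memory, so treat the details as an assumption; the function name is mine):

```python
def compute_split_size(total_size, num_splits_hint, block_size, min_size=1):
    # Sketch of FileInputFormat-style split sizing (Hadoop 1.x, simplified):
    # the hint only sets a goal size; the block size caps it, so asking for
    # 2 maps on a file smaller than one block still yields a single split.
    goal_size = total_size // max(num_splits_hint, 1)
    return max(min_size, min(goal_size, block_size))
```

With a 128 MB input, a hint of 2, and a 64 MB block size this yields 64 MB splits (2 maps); but with a 10 MB input the block size never binds and a single small split results, regardless of the hint.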
6)
In MapReduce proper, the map function is invoked once per input record, whereas in streaming the mapper process itself loops over the data... or else babel has some way of keeping state.
Test:
#!/usr/bin/env python
import sys

i = 0
for line in sys.stdin:
    if line.strip() == "> <PUBCHEM_OPENEYE_CAN_SMILES>":
        i += 1
        print i
Running it under streaming, the output was:
1
2
3
...
rather than 1, 1, 1.
So a single map process clearly runs over the whole input, rather than being invoked once per line.
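The observation above can be reproduced without a cluster. The sketch below (my own helper names; TAG is the PubChem property line from the test script) contrasts the two contracts: one mapper process over the whole split, versus a hypothetical per-line invocation where state would reset each call.

```python
TAG = "> <PUBCHEM_OPENEYE_CAN_SMILES>"

def run_mapper_once(lines):
    """One streaming mapper process over the whole split (what Hadoop does)."""
    out, i = [], 0
    for line in lines:
        if line.strip() == TAG:
            i += 1
            out.append(i)  # counter survives across lines
    return out

def run_mapper_per_line(lines):
    """Hypothetical per-line invocation: a fresh process per line, state resets."""
    out = []
    for line in lines:
        out.extend(run_mapper_once([line]))
    return out

records = [TAG, "CC(=O)O", TAG, "c1ccccc1", TAG]
# run_mapper_once(records)     -> [1, 2, 3]  (matches the observed streaming output)
# run_mapper_per_line(records) -> [1, 1, 1]
```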
7)
Make every step purposeful instead of poking around at random. Then again, sometimes trial and error is exactly what is needed, especially with unfamiliar material.
20120808
A pitfall of wildcards in hadoop fs -ls
mjiang@syvenus:~/program/eclipse/ccms_stat$ hadoop fs -ls /user/mjiang/
Found 4 items
drwxrwxrwx - mjiang supergroup 0 2012-08-06 15:00 /user/mjiang/.Trash
drwxrwxrwx - mjiang supergroup 0 2012-08-06 14:55 /user/mjiang/hive
drwxrwxrwx - mjiang supergroup 0 2012-08-03 17:40 /user/mjiang/mjiang
drwxrwxrwx - mjiang supergroup 0 2012-08-06 13:36 /user/mjiang/test
mjiang@syvenus:~/program/eclipse/ccms_stat$ hadoop fs -ls /user/mjiang/te*
mjiang@syvenus:~/program/eclipse/ccms_stat$
mjiang@syvenus:~/program/eclipse/ccms_stat$ hls /user/mjiang/mjiang
Found 1 items
drwxrwxrwx - mjiang supergroup 0 2012-08-03 17:40 /user/mjiang/mjiang/test4
mjiang@syvenus:~/program/eclipse/ccms_stat$ hls /user/mjiang/mjiang/te*
Found 8 items
-rw-rw-rw- 3 mjiang supergroup 0 2012-08-03 17:40 /user/mjiang/mjiang/test4/_SUCCESS
drwxrwxrwx - mjiang supergroup 0 2012-08-03 17:36 /user/mjiang/mjiang/test4/_logs
-rw-rw-rw- 3 mjiang supergroup 14 2012-08-03 17:36 /user/mjiang/mjiang/test4/part-m-00000.bz2
-rw-rw-rw- 3 mjiang supergroup 75007289 2012-08-03 17:36 /user/mjiang/mjiang/test4/part-m-00001.bz2
-rw-rw-rw- 3 mjiang supergroup 19696013 2012-08-03 17:36 /user/mjiang/mjiang/test4/part-m-00002.bz2
-rw-rw-rw- 3 mjiang supergroup 28549784 2012-08-03 17:36 /user/mjiang/mjiang/test4/part-m-00003.bz2
-rw-rw-rw- 3 mjiang supergroup 16578129 2012-08-03 17:36 /user/mjiang/mjiang/test4/part-m-00004.bz2
-rw-rw-rw- 3 mjiang supergroup 292 2012-08-03 17:36 /user/mjiang/mjiang/test4/part-m-00005.bz2
It seems the wildcard is only usable on the last level of the path.
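When the built-in glob behaves unexpectedly like this, one workaround is to list the parent directory yourself and filter client-side. A minimal sketch (fnmatch-based; the helper name and the hard-coded listing are mine, not HDFS's own globber):

```python
import fnmatch
import posixpath

def glob_last_component(entries, pattern):
    """Match only the last path component of each listed entry against a
    shell-style wildcard, e.g. to filter `hadoop fs -ls <parent>` output."""
    return [p for p in entries if fnmatch.fnmatch(posixpath.basename(p), pattern)]

# Paths taken from the hadoop fs -ls output above:
listing = ["/user/mjiang/.Trash", "/user/mjiang/hive",
           "/user/mjiang/mjiang", "/user/mjiang/test"]
# glob_last_component(listing, "te*") -> ["/user/mjiang/test"]
```

This makes the matching behavior explicit and independent of whatever level of the path HDFS is willing to expand.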
8)
With only one map task, no detailed progress information is printed; once there are several map tasks, detailed progress percentages appear:
12/07/31 11:05:25 INFO mapred.JobClient:  map 50% reduce 0%
12/07/31 11:05:37 INFO mapred.JobClient:  map 66% reduce 0%
12/07/31 11:06:10 INFO mapred.JobClient:  map 83% reduce 0%