Lecture 18
1. Prepare the data to compress
Download the URL collection from DMOZ:
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
Prepare Nutch 1.6:
svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/
cp release-1.6/conf/nutch-site.xml.template release-1.6/conf/nutch-site.xml
vi release-1.6/conf/nutch-site.xml
Add the following property:
<property>
<name>http.agent.name</name>
<value>nutch</value>
</property>
cd release-1.6
ant
cd ..
Use DmozParser to parse the DMOZ URL collection into plain text:
release-1.6/runtime/local/bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > urls &
Put the URL text file on HDFS:
hadoop fs -put urls urls
2. Inject the URLs with different compression methods
Enter the Nutch home directory:
cd release-1.6
Inject the URLs without compression:
runtime/deploy/bin/nutch inject data_no_compress/crawldb urls
Inject the URLs with the default codec:
vi conf/nutch-site.xml
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
ant
runtime/deploy/bin/nutch inject data_default_compress/crawldb urls
Inject the URLs with Gzip compression:
vi conf/nutch-site.xml
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
ant
runtime/deploy/bin/nutch inject data_gzip_compress/crawldb urls
Inject the URLs with BZip2 compression:
vi conf/nutch-site.xml
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
ant
runtime/deploy/bin/nutch inject data_bzip2_compress/crawldb urls
Inject the URLs with Snappy compression:
vi conf/nutch-site.xml
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
ant
runtime/deploy/bin/nutch inject data_snappy_compress/crawldb urls
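To get a feel for the trade-offs among the codecs configured above, the relative behavior of zlib (what DefaultCodec wraps), gzip (GzipCodec), and bzip2 (BZip2Codec) can be sketched with Python's standard library. This is only an illustration on synthetic URL-like data, not a measurement of the Hadoop codecs themselves, and Snappy is omitted because it has no stdlib binding:

```python
import bz2
import gzip
import time
import zlib

# Synthetic stand-in for a URL list: repetitive text compresses well.
raw = "".join(
    "http://www.example.com/page/%d\n" % i for i in range(20000)
).encode("utf-8")

for name, compress in [
    ("zlib (DefaultCodec)", zlib.compress),   # DefaultCodec wraps zlib
    ("gzip (GzipCodec)", gzip.compress),      # gzip = zlib + header/CRC
    ("bzip2 (BZip2Codec)", bz2.compress),     # slowest, usually smallest
]:
    start = time.perf_counter()
    out = compress(raw)
    elapsed = time.perf_counter() - start
    print("%-20s ratio=%.3f time=%.4fs" % (name, len(out) / len(raw), elapsed))
```

The usual pattern is that bzip2 achieves the best ratio at the highest CPU cost, while Snappy (not shown) goes the other way, trading ratio for speed; the inject runs above let you observe the same trade-off on real crawldb data.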
Effect of the compression type
Effect of the block size
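Both questions come down to how much data the codec sees at once: with mapred.output.compression.type set to RECORD each value is compressed on its own, while BLOCK batches many records before compressing, which usually gives a much better ratio. A minimal illustration using zlib on synthetic records (not taken from a real SequenceFile):

```python
import zlib

# Synthetic "records", similar in spirit to crawldb entries.
records = [("http://www.example.com/page/%d" % i).encode("utf-8")
           for i in range(10000)]

# RECORD-style: compress each record on its own.
record_total = sum(len(zlib.compress(r)) for r in records)

# BLOCK-style: batch the records together, then compress the batch.
block_total = len(zlib.compress(b"\n".join(records)))

print("per-record:", record_total, "bytes")
print("per-block :", block_total, "bytes")
# The batched variant is far smaller, because the codec can
# exploit redundancy across records, not just within one.
```

By the same logic, a larger block (batch) gives the codec more redundancy to exploit, at the cost of more memory and coarser-grained random access.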
3. Configure Snappy compression for Hadoop
Download and unpack:
wget https://snappy.googlecode.com/files/snappy-1.1.0.tar.gz
tar -xzvf snappy-1.1.0.tar.gz
cd snappy-1.1.0
Build and install:
./configure
make
make install
Copy the library files to the other nodes:
scp /usr/local/lib/libsnappy* host2:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* host6:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* host8:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
On every machine in the cluster, update the environment variables:
vi /home/hadoop/.bashrc
Append:
export LD_LIBRARY_PATH=/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64
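Installing the native library may not be enough on its own: on Hadoop 1.x, SnappyCodec is typically not in the default codec list, so it usually also has to be registered in the cluster configuration. A sketch of the relevant property (an assumption to verify against your own core-site.xml and Hadoop version, not a confirmed step from this setup):

```xml
<!-- core-site.xml: register the Snappy codec alongside the defaults -->
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```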