Configuration 1:
<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
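This setting limits the size of a downloaded page; the default is 65536 bytes. If page content turns out to be incomplete after the Fetch phase, this limit is the likely cause. Raising the value above the page's size in bytes lets the whole page be downloaded; per the description, a negative value disables truncation entirely.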
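As a rough sketch of raising the limit, the override below sets it to 1 MiB. The value 1048576 is only an illustrative choice, and placing the override in a site-specific file such as nutch-site.xml (rather than editing the defaults) is an assumption about the deployment.

```xml
<!-- Illustrative override: the 1 MiB value is an assumption, pick one above your largest page -->
<property>
  <name>http.content.limit</name>
  <!-- 1048576 bytes = 1 MiB; use -1 to disable truncation, as the description states -->
  <value>1048576</value>
</property>
```

Disabling truncation with a negative value is simpler, but a finite limit guards against unexpectedly large downloads.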
Configuration 2:
<property>
<name>http.proxy.host</name>
<value></value>
<description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
<name>http.proxy.port</name>
<value></value>
<description>The proxy port.</description>
</property>
These two properties set the proxy host and port. If the host is left empty, no proxy is used.
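A minimal sketch of a filled-in proxy configuration follows. The hostname proxy.example.com and port 8080 are hypothetical placeholders, not values from the original notes.

```xml
<!-- Hypothetical proxy: replace host and port with your own -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
```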
Configuration 3:
<property>
<name>partition.url.mode</name>
<value>byHost</value>
<description>Determines how to partition URLs. Default value is 'byHost',
also takes 'byDomain' or 'byIP'.
</description>
</property>
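This setting controls how URLs are partitioned after the map step. With byHost, the partitioner hashes on the host, so URLs that share a host are all sent to the same reduce node.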
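For comparison, here is a sketch that switches to one of the alternative values named in the description. By analogy with byHost, it would presumably group URLs by domain rather than by host; that reading is an inference from the value name, not something the original notes state.

```xml
<!-- Sketch: partition by domain instead of by host (byDomain is listed in the description) -->
<property>
  <name>partition.url.mode</name>
  <value>byDomain</value>
</property>
```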
Configuration 4:
<!-- plugin properties ./src/plugin plugins-->
<property>
<name>plugin.folders</name>
<value>plugins</value>
<description>eclipse env:./src/plugin deploy:plugins</description>
</property>
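This property specifies the plugin directory. When running inside Eclipse, change it to:
./src/plugin
When running the packaged JOB file on a distributed cluster, set it back to plugins.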
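For the Eclipse case, the property might look like the sketch below. It uses only the path given above; the comment wording is illustrative.

```xml
<!-- Eclipse/development setting; revert the value to "plugins" before building the JOB package -->
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
</property>
```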