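Most users on the education network cannot reach foreign websites directly (mainly because the school blocks them); if you really need to, the only way is through a proxy. Today I needed to crawl some foreign websites, but every fetch failed. After checking, I found that a proxy had to be configured. The settings are as follows:
Add the following to /conf/nutch-site.xml: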
<property>
<name>http.proxy.host</name>
<value>***.***.***.***</value>
<description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
<name>http.proxy.port</name>
<value>8080</value>
<description>The proxy port.</description>
</property>
<property>
<name>http.proxy.username</name>
<value></value>
<description>Username for proxy. This will be used by
'protocol-httpclient', if the proxy server requests basic, digest
and/or NTLM authentication. To use this, 'protocol-httpclient' must
be present in the value of 'plugin.includes' property.
NOTE: For NTLM authentication, do not prefix the username with the
domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
</description>
</property>
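If the proxy asks for authentication, the username above is not enough on its own: a password is needed too, and as the description says, 'protocol-httpclient' has to appear in 'plugin.includes'. The fragment below is only a sketch of what that might look like. The password is a placeholder, and the 'plugin.includes' value is an assumed example, not the default of any particular Nutch version; it is safer to copy the value from your own nutch-default.xml and just swap 'protocol-http' for 'protocol-httpclient'.
<property>
<name>http.proxy.password</name>
<!-- Placeholder (assumption): replace with your real proxy password. -->
<value>your-proxy-password</value>
<description>Password for proxy. Used together with http.proxy.username
when the proxy server requests authentication.</description>
</property>
<property>
<name>plugin.includes</name>
<!-- Assumed example value: start from the plugin.includes in your own
nutch-default.xml and change only the protocol plugin. -->
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming the plugins to load. It must
include 'protocol-httpclient' for proxy authentication to work.</description>
</property>
If the proxy does not require a login, these two properties can be left out and only the host and port settings are needed.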