1. Edit regex-urlfilter.txt and remove js|JS from the suffix exclusion rule
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP)$
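As a quick sanity check, the rule above can be exercised outside Nutch. This is a minimal sketch, assuming the filter applies the pattern as a find-style search over the URL; the class name SuffixRuleSketch and the example.com URLs are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks which URLs the edited rule excludes.
public class SuffixRuleSketch {
    // The pattern from the rule above, without the leading "-" that marks it as an exclusion.
    static final Pattern EXCLUDED = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP"
        + "|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP)$");

    // Returns true when the URL ends in one of the excluded suffixes.
    static boolean isExcluded(String url) {
        return EXCLUDED.matcher(url).find();
    }

    public static void main(String[] args) {
        // With js|JS removed from the list, .js URLs are no longer excluded.
        System.out.println(isExcluded("http://example.com/static/app.js"));   // false
        System.out.println(isExcluded("http://example.com/static/logo.png")); // true
    }
}
```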
2. Edit nutch-site.xml and add parse-js to plugin.includes
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|js|tika)|index-(basic|anchor|self)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
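The plugin.includes value is itself a regular expression over plugin directory names, so it is easy to check which names it selects. The following is a rough sketch, assuming each plugin name must match the whole expression; PluginIncludesSketch is a hypothetical class name, not a Nutch API.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: tests plugin names against plugin.includes.
public class PluginIncludesSketch {
    // The <value> from the property above, copied verbatim.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|urlfilter-regex|parse-(html|js|tika)"
        + "|index-(basic|anchor|self)|urlnormalizer-(pass|regex|basic)|scoring-opic");

    // Assumption: a plugin is included only when its full name matches the expression.
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIncluded("parse-js"));   // true
        System.out.println(isIncluded("parse-html")); // true
        System.out.println(isIncluded("parse-swf"));  // false
    }
}
```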
3. Modify the source code of the parse-js plugin
3-1. Class to change: org.apache.nutch.parse.js.JSParseFilter
Method to change: public Parse getParse(String url, WebPage page) {
Line to change: if (type != null && !type.trim().equals("") && !type.toLowerCase().startsWith("application/x-javascript"))
After the change: if (type != null && !type.trim().equals("") && !type.toLowerCase().startsWith("application/javascript"))
3-2. parse-js/plugin.xml
"application/x-javascript" -> "application/javascript"
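(*) A note on JavaScript MIME types: text/javascript, application/javascript, and application/x-javascript
The traditional MIME type for JavaScript programs is "text/javascript". "application/x-javascript" has also been used (the x- prefix marks it as experimental rather than a standard type). RFC 4329 registered "text/javascript" because it was in widespread use. However, JavaScript programs are not really text documents, so that type is marked obsolete and "application/javascript" (without the x- prefix) is recommended instead. In practice, though, "application/javascript" has not been well supported, which is probably why "application/x-javascript" is still used on many web pages.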
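Since all three types appear in the wild, the content-type check could accept any of them instead of swapping one string for another. This is only a sketch of that idea, not the plugin's actual code; JsMimeTypeSketch and isJavaScriptType are hypothetical names introduced here.

```java
import java.util.Locale;

// Hypothetical helper, not part of parse-js: recognizes the three JavaScript MIME types.
public class JsMimeTypeSketch {
    private static final String[] JS_TYPES = {
        "text/javascript", "application/javascript", "application/x-javascript"
    };

    // Returns true when the content type starts with one of the known JavaScript types.
    // A prefix check is used so that parameters such as "; charset=utf-8" are tolerated.
    static boolean isJavaScriptType(String type) {
        if (type == null || type.trim().isEmpty()) {
            return false;
        }
        String normalized = type.trim().toLowerCase(Locale.ROOT);
        for (String known : JS_TYPES) {
            if (normalized.startsWith(known)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isJavaScriptType("application/x-javascript"));              // true
        System.out.println(isJavaScriptType("Application/JavaScript; charset=utf-8")); // true
        System.out.println(isJavaScriptType("text/html"));                             // false
    }
}
```

Under this assumption, the guard in getParse would become `type != null && !type.trim().equals("") && !isJavaScriptType(type)`, which keeps the original behavior for a null or empty type.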
4. Edit nutch/conf/parse-plugins.xml
Add:
<mimeType name="application/javascript">
<plugin id="parse-js" />