Crawling JavaScript Files with Nutch

weixin_34059951

Original article: https://my.oschina.net/junfrank/blog/291511

1. Edit regex-urlfilter.txt and remove js|JS from the suffix exclusion rule, so that it reads:
   # skip image and other suffixes we can't yet parse
   # for a more extensive coverage use the urlfilter-suffix plugin
   -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP)$
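The effect of the edited rule can be checked outside Nutch. The following is a minimal sketch, assuming the exclude rule is applied as a "find anywhere in the URL" match; the class name `SuffixRuleSketch`, the method `passes`, and the example.com URLs are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: mimics how a "-" (exclude) rule
// from regex-urlfilter.txt would be applied to a URL.
class SuffixRuleSketch {
    // The suffix rule from step 1, with js|JS already removed.
    static final Pattern SKIP = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP"
        + "|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP)$");

    // Returns true when the exclude rule does NOT match, i.e. the URL survives this rule.
    static boolean passes(String url) {
        return !SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        // Assumed example URLs, for illustration only.
        System.out.println(passes("http://example.com/static/app.js"));
        System.out.println(passes("http://example.com/static/logo.png"));
    }
}
```

With js|JS removed, a URL ending in .js is no longer rejected by this rule, while image and archive suffixes still are. Other rules in the same file may still reject the URL.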

2. Edit nutch-site.xml and add parse-js to plugin.includes
   <property>
       <name>plugin.includes</name>
       <value>protocol-httpclient|urlfilter-regex|parse-(html|js|tika)|index-(basic|anchor|self)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
       <description>Regular expression naming plugin directory names to
       include. Any plugin not matching this expression is excluded.
       In any case you need at least include the nutch-extensionpoints plugin. By
       default Nutch includes crawling just HTML and plain text via HTTP,
       and basic indexing and search plugins. In order to use HTTPS please enable
       protocol-httpclient, but be aware of possible intermittent problems with the
       underlying commons-httpclient library.
       </description>
   </property>
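A quick way to sanity-check the plugin.includes value is to test plugin directory names against it. This is only a sketch, assuming the expression must match the whole plugin name; `PluginIncludesSketch` and `included` are hypothetical names, and parse-swf is used merely as an example of a plugin name that the expression does not cover.

```java
import java.util.regex.Pattern;

// Hypothetical check, not Nutch code: tests plugin directory names against
// the plugin.includes value from step 2.
class PluginIncludesSketch {
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|urlfilter-regex|parse-(html|js|tika)|index-(basic|anchor|self)"
        + "|urlnormalizer-(pass|regex|basic)|scoring-opic");

    // Assumption: a plugin is included only if the whole name matches the expression.
    static boolean included(String pluginName) {
        return INCLUDES.matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        System.out.println("parse-js included:  " + included("parse-js"));
        System.out.println("parse-swf included: " + included("parse-swf"));
    }
}
```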

3. Modify the source code of the parse-js plugin
   3-1. Class to change: org.apache.nutch.parse.js.JSParseFilter
   Method to change: public Parse getParse(String url, WebPage page) {
   Original line: if (type != null && !type.trim().equals("") && !type.toLowerCase().startsWith("application/x-javascript"))
   Changed line: if (type != null && !type.trim().equals("") && !type.toLowerCase().startsWith("application/javascript"))

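   3-2. parse-js/plugin.xml
   "application/x-javascript" -> "application/javascript"
   (*) A note on JavaScript MIME types: text/javascript, application/javascript, and application/x-javascript.
    The traditional MIME type for JavaScript programs is "text/javascript"; "application/x-javascript" is also in use (the x- prefix marks it as experimental rather than a standard type). RFC 4329 registered "text/javascript" because it was so widely used. However, JavaScript programs are not really text documents, so that RFC marked the type as obsolete and recommended "application/javascript" (without the x- prefix) instead. In practice, "application/javascript" is not well supported when writing programs, which is probably why "application/x-javascript" still appears on so many web pages.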
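Note that the changed condition in 3-1 accepts only content types starting with application/javascript, so responses served as application/x-javascript or text/javascript would then be rejected. One alternative is to accept all three types. The following is a minimal sketch, not the plugin's actual code: `JsMimeSketch` and `acceptsType` are hypothetical names, and it keeps the original behavior of not rejecting a missing or empty type.

```java
import java.util.Locale;

// Hypothetical helper, not part of the parse-js plugin: a content-type check
// that accepts all three JavaScript MIME types discussed above.
class JsMimeSketch {
    static final String[] JS_TYPES = {
        "application/javascript", "application/x-javascript", "text/javascript"
    };

    // Returns true if the content type should be handed to the JS parser.
    // As in the original condition, a null or empty type is not rejected.
    static boolean acceptsType(String type) {
        if (type == null || type.trim().isEmpty()) {
            return true;
        }
        String normalized = type.trim().toLowerCase(Locale.ROOT);
        for (String jsType : JS_TYPES) {
            // startsWith tolerates parameters such as "; charset=UTF-8"
            if (normalized.startsWith(jsType)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(acceptsType("application/x-javascript; charset=UTF-8"));
        System.out.println(acceptsType("text/html"));
    }
}
```

If this approach were used, the condition in getParse would become a negated call to such a helper, and plugin.xml would need to list the same content types.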