I spent two focused days studying Nutch's crawl efficiency. For my use case I only care about a site's HTML pages; everything else gets filtered out. Quite a few configuration files are involved, so I'm recording them here for future reference:
1 nutch-default.xml
a. http.content.limit: set to -1 so the entire HTML page content is fetched without truncation.
b. fetcher.threads.per.host: 5, fetcher.threads.fetch: 100. Note that if fetcher.threads.per.host is 1, the overall fetch thread count will not take effect.
c. plugin.includes: add urlfilter-(regex|prefix|suffix) to enable URL filtering.
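The usual practice is to override these properties in conf/nutch-site.xml rather than editing nutch-default.xml directly. A sketch of such an override file, using the property names above (the plugin.includes value shown is an assumed baseline and may need to match your Nutch version's defaults):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- -1 = no size limit; fetch the full page body -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- concurrent fetch threads per host; must be > 1 for the global pool to matter -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
  </property>
  <!-- total fetcher threads -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <!-- enable the regex, prefix, and suffix URL filter plugins -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|prefix|suffix)|parse-html|index-basic</value>
  </property>
</configuration>
```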
2 regex-urlfilter.txt
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
-[\(\)\{\}\<\>\\]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
-[A-Za-z0-9]{20,}
# accept anything else
+.
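The rules above are evaluated top to bottom, and the first rule whose pattern matches decides the URL's fate (`-` rejects, `+` accepts). A minimal Python sketch of that semantics, not Nutch itself, using a shortened subset of the rules for illustration:

```python
import re

# First-match-wins rule list: ('-' = reject, '+' = accept).
# Patterns are a subset of the regex-urlfilter rules above.
RULES = [
    ('-', re.compile(r'^(file|ftp|mailto):')),          # skip non-http schemes
    ('-', re.compile(r'\.(gif|jpg|png|css|zip|exe)$')), # skip unparsable suffixes
    ('-', re.compile(r'[?*!@=]')),                      # skip probable queries
    ('+', re.compile(r'.')),                            # accept anything else
]

def url_filter(url):
    """Return True if the URL is accepted, False if rejected."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == '+'
    return False  # no rule matched: reject by default

print(url_filter('http://example.com/index.html'))  # True
print(url_filter('http://example.com/logo.png'))    # False
print(url_filter('http://example.com/page?q=1'))    # False
```

Like Nutch's RegexURLFilter, the sketch tests whether each pattern occurs anywhere in the URL (a find, not a full match), which is why bare `.` at the end accepts everything that survived the earlier reject rules.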
3 suffix-urlfilter.txt
Add here the link suffixes you want to filter out, e.g. .zip, .rar.
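Assuming the common one-suffix-per-line layout of this file, a hypothetical suffix-urlfilter.txt might look like this (check your Nutch version's shipped file for the exact header directives it supports):

```text
# hypothetical suffix filter list: one suffix per line
.zip
.rar
.exe
```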
4 prefix-urlfilter.txt
http://
https://