I spent two focused days studying Nutch's crawl efficiency. For my use case I only care about a site's HTML pages; everything else gets filtered out. Quite a few configuration files are involved, so I'm recording them here for future reference:
1 nutch-default.xml
a. http.content.limit: set to -1 so the entire HTML page content is fetched without truncation.
b. fetcher.threads.per.host: 5, fetcher.threads.fetch: 100. Note that if fetcher.threads.per.host is 1, the overall fetch thread count will not take effect.
c. plugin.includes: add urlfilter-(regex|prefix|suffix) to enable URL filtering.
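The usual practice is to override these properties in conf/nutch-site.xml rather than editing nutch-default.xml directly. A sketch of such an override file, using the property names above (the plugin.includes value shown is an assumed baseline and may need to match your Nutch version's defaults):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- -1 = no size limit; fetch the full page body -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- concurrent fetch threads per host; must be > 1 for the global pool to matter -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
  </property>
  <!-- total fetcher threads -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <!-- enable the regex, prefix, and suffix URL filter plugins -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|prefix|suffix)|parse-html|index-basic</value>
  </property>
</configuration>
```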
2 regex-urlfilter.txt
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
-[\(\)\{\}\<\>\\]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
-[A-Za-z0-9]{20,}
# accept anything else
+.
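The rules above are evaluated top to bottom, and the first rule whose pattern matches decides the URL's fate (`-` rejects, `+` accepts). A minimal Python sketch of that semantics, not Nutch itself, using a shortened subset of the rules for illustration:

```python
import re

# First-match-wins rule list: ('-' = reject, '+' = accept).
# Patterns are a subset of the regex-urlfilter rules above.
RULES = [
    ('-', re.compile(r'^(file|ftp|mailto):')),          # skip non-http schemes
    ('-', re.compile(r'\.(gif|jpg|png|css|zip|exe)$')), # skip unparsable suffixes
    ('-', re.compile(r'[?*!@=]')),                      # skip probable queries
    ('+', re.compile(r'.')),                            # accept anything else
]

def url_filter(url):
    """Return True if the URL is accepted, False if rejected."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == '+'
    return False  # no rule matched: reject by default

print(url_filter('http://example.com/index.html'))  # True
print(url_filter('http://example.com/logo.png'))    # False
print(url_filter('http://example.com/page?q=1'))    # False
```

Like Nutch's RegexURLFilter, the sketch tests whether each pattern occurs anywhere in the URL (a find, not a full match), which is why bare `.` at the end accepts everything that survived the earlier reject rules.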
3 suffix-urlfilter.txt
Add here the link suffixes you want to filter out, e.g. .zip, .rar.
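Assuming the common one-suffix-per-line layout of this file, a hypothetical suffix-urlfilter.txt might look like this (check your Nutch version's shipped file for the exact header directives it supports):

```text
# hypothetical suffix filter list: one suffix per line
.zip
.rar
.exe
```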
4 prefix-urlfilter.txt
http://
https://