Nutch URL filter rules

Latest recommended article published on 2024-07-20 17:13:40

lady_ga

Tags: url permissions file express apache product

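There are plenty of source-code walkthroughs of Nutch online, but the crawling side is still not easy to understand. Today I finally worked out how it is done. I am posting my crawl-urlfilter.txt file here so that everyone can discuss it, and as a note to myself.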

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

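# accept URLs containing certain characters (the stock file skips them as probable queries)
# This is important when crawling dynamic sites and must be set to '+'. Otherwise pages
# whose URL contains a question mark, such as a.jsp?a=001, cannot be crawled.
# Because the first matching pattern wins, this rule accepts any URL that contains
# one of these characters, and the rules below are never consulted for it.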
+[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
###########################7shop24########################################
#+^http://([a-z0-9]*\.)*7shop24.com/
#+^http://www.7shop24.com/indexdtl06.asp\?classid=([0-9]*)&productid=([0-9]*)+$

###############################http://www.redbaby.com.cn/##############################

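# Crawling proceeds in order, so the rules cannot be written arbitrarily. For example, to
# crawl product pages you first have to admit the home page. Products are listed on the
# category pages, so the category pages have to be included as well. Only then do you add
# the regular expressions for the specific product pages. For example: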
+^http://www.redbaby.com.cn/$
+^http://www.redbaby.com.cn/([a-zA-Z]*\.)*index.html$
+^http://www.redbaby.com.cn/([a-zA-Z]*)/$
+^http://www.redbaby.com.cn/([a-zA-Z]*)/index\.html+$
+^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BranchID=\d&DepartmentID=\d+$
+^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BrandID=\d&BranchID=\d+$
+^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w([0-9]*\.)*html$
+^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BranchID=\d&DepartmentID=\d&SortID=\d+$
+^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w\d\.htm$
# skip everything else
-.
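
The file's own header says the first matching pattern decides and an unmatched URL is ignored. Here is a minimal Java sketch of that behaviour, not Nutch's actual implementation. The class `UrlFilterSketch`, its methods and the shortened rule list are all hypothetical, and the sketch assumes a rule matches when its regex is found anywhere in the URL. Regex escapes are written with backslashes, doubled inside Java string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilterSketch {
    // One rule: '+' (accept) or '-' (reject) plus a compiled regex.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // Parse lines written in the crawl-urlfilter.txt style.
    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comments carry no rule
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Invalid first character: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    // The first rule whose regex is found in the URL decides; no match means ignore.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    // A shortened, illustrative rule set modelled on the file above.
    static List<Rule> demoRules() {
        return parse(
            "# skip file:, ftp:, & mailto: urls",
            "-^(file|ftp|mailto):",
            "-\\.(gif|GIF|jpg|JPG|png|PNG|css|zip)$",
            "+[?*!@=]",
            "+^http://www.redbaby.com.cn/$",
            "+^http://www.redbaby.com.cn/([a-zA-Z]*)/$",
            "-.");
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.redbaby.com.cn/",            // accepted: home page rule
            "http://www.redbaby.com.cn/toys/",       // accepted: category rule
            "http://www.redbaby.com.cn/logo.gif",    // rejected: image suffix
            "http://www.redbaby.com.cn/a.jsp?a=001", // accepted: contains '?'
            "http://other.example.com/page.html"     // rejected: falls through to "-."
        };
        List<Rule> rules = demoRules();
        for (String url : urls) {
            System.out.println((accepts(rules, url) ? "+ " : "- ") + url);
        }
    }
}
```

Under these assumptions, a query URL on a different host, such as http://other.example.com/list.jsp?id=7, is also accepted. The `+[?*!@=]` rule comes before the host-specific rules, so they never get a say for that URL.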

Java regex escapes you may need when matching URLs:

? is written \?

_ (underscore) is matched by \w

. (dot) is written \.
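
To see why these escapes matter, here is a small sketch that checks each one with `java.util.regex.Pattern`. The class name `EscapeSketch`, the helper `found` and the sample URLs are made up for illustration. In a Java string literal every backslash is doubled, while in the filter file it is written once.

```java
import java.util.regex.Pattern;

class EscapeSketch {
    // True when the regex is found anywhere in the URL.
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String listUrl = "http://www.redbaby.com.cn/Product/Product_List.aspx?Site=1";

        // Unescaped '?' makes the preceding 'x' optional, so the literal '?' is never matched.
        System.out.println(found("Product_List.aspx?Site=\\d$", listUrl));   // false
        // Escaped '\?' matches the literal question mark.
        System.out.println(found("Product_List.aspx\\?Site=\\d$", listUrl)); // true

        // '\w' matches an underscore (and also any letter or digit).
        String infoUrl = "http://www.redbaby.com.cn/Product/ProductInfo_1_2.htm";
        System.out.println(found("ProductInfo\\w\\d\\w\\d\\.htm$", infoUrl)); // true

        // Unescaped '.' matches any character; '\.' matches only a dot.
        System.out.println(found("index.html$", "http://example.com/indexXhtml"));   // true
        System.out.println(found("index\\.html$", "http://example.com/indexXhtml")); // false
    }
}
```

Because `\w` also matches letters and digits, a rule that uses it to match an underscore is looser than it looks. Writing `_` literally would be stricter.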

Source: http://nhy520.javaeye.com/blog/489832