Crawling Web Pages with WebLech: The Configuration File

Below is Spider.properties, the configuration file located in the config folder under the weblech directory.

# Spider configuration file
#
# All of these settings default to sensible values if not specified.

# Directory in which to save downloaded files, defaults to "."
saveRootDirectory = c:/weblech/sites

# Filename in which to save mailto links
mailtoLogFile = mailto.txt

# Tell the spider to reload HTML pages each time, but not images
# or other files
refreshHTMLs = true
refreshImages = false
refreshOthers = false

# Set the extensions the Spider should use to determine which
# pages are of MIME type text/html. The Spider also learns new
# types as it downloads them.
htmlExtensions = htm,html,shtm,shtml

# Similarly for MIME type image/*
imageExtensions = gif,jpg,jpeg,png,bmp

# URL at which we should start the spider
startLocation = http://www.slashdot.org/

# Whether to do depth first search, or the default breadth
# first search when finding URLs to download
depthFirst = false

# Maximum depth of pages to retrieve (the first page is depth
# 0, links from there depth 1, etc). Setting to 0 is "unlimited"
maxDepth = 2

# Basic URL filtering. URLs must contain this string in order
# to be downloaded by WebLech
urlMatch = slashdot.org

# Basic URL prioritisation. URLs which are "interesting" are
# downloaded first, URLs which are "boring" last.
interestingURLs=pollBooth.pl,faq
boringURLs=article.pl

# User Agent header
userAgent = Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

# Username and password for basic HTTP authentication, if required.
# The same username and password will be used for all authentication
# challenges during a download session.
basicAuthUser = myUser
basicAuthPassword = 1234

# Number of download threads to start
spiderThreads = 1

# How often to checkpoint the Spider. A checkpoint file is named
# "spider.checkpoint" and can be used to start the spider in the
# middle of a run. Setting this value to 0 disables checkpoints.
# Here we checkpoint every 30 seconds
checkpointInterval = 30000
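
To make the file format concrete, here is a minimal Python sketch that reads a few of the settings above and converts them to typed values. It is only an illustration: WebLech itself is written in Java, and the names `parse_properties`, `as_bool`, `as_list`, and `SAMPLE` are hypothetical helpers, not part of WebLech. The sketch handles only `#` comments and `key = value` lines; it ignores other features of the Java properties format, such as `:` separators and line continuations.

```python
# Hypothetical helpers for illustration only; not WebLech's own loader.

SAMPLE = """\
# Spider configuration file
saveRootDirectory = c:/weblech/sites
refreshHTMLs = true
refreshImages = false
htmlExtensions = htm,html,shtm,shtml
maxDepth = 2
interestingURLs=pollBooth.pl,faq
checkpointInterval = 30000
"""


def parse_properties(text):
    """Parse simple 'key = value' lines into a dict of strings."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '#' comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line
        props[key.strip()] = value.strip()
    return props


def as_bool(props, key, default=False):
    """Interpret a setting as a boolean ('true' / 'false')."""
    if key not in props:
        return default
    return props[key].lower() == "true"


def as_list(props, key):
    """Split a comma-separated setting into a list of trimmed strings."""
    return [part.strip() for part in props.get(key, "").split(",") if part.strip()]


settings = parse_properties(SAMPLE)
# The comment in the file says 30000 means "every 30 seconds",
# so the value is in milliseconds.
checkpoint_seconds = int(settings["checkpointInterval"]) / 1000
```

Splitting on the first `=` only means a value that itself contains `=` would be kept intact.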

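The filtering and prioritisation settings (`urlMatch`, `maxDepth`, `interestingURLs`, `boringURLs`) can be pictured with the following Python sketch. It is not WebLech's implementation. It assumes that the "interesting" and "boring" entries are matched as substrings, in the same way the comments describe `urlMatch`, and that pages at exactly `maxDepth` are still downloaded. The function names `should_download`, `priority`, and `order_queue` are hypothetical.

```python
# Illustrative sketch only; the matching rules below are assumptions.

URL_MATCH = "slashdot.org"
INTERESTING = ["pollBooth.pl", "faq"]
BORING = ["article.pl"]


def should_download(url, depth, max_depth=2, url_match=URL_MATCH):
    """Apply the maxDepth and urlMatch rules to one candidate URL."""
    # maxDepth = 0 means "unlimited" according to the config comments.
    # Assumption: a page at exactly max_depth is still allowed.
    if max_depth != 0 and depth > max_depth:
        return False
    # urlMatch: the URL must contain this string.
    return url_match in url


def priority(url):
    """Return 0 for interesting URLs, 2 for boring ones, 1 otherwise."""
    if any(marker in url for marker in INTERESTING):
        return 0
    if any(marker in url for marker in BORING):
        return 2
    return 1


def order_queue(urls):
    """Order URLs so interesting ones come first and boring ones last."""
    # sorted() is stable, so URLs of equal priority keep their original order.
    return sorted(urls, key=priority)


queue = order_queue([
    "http://www.slashdot.org/article.pl?sid=1",
    "http://www.slashdot.org/index.html",
    "http://www.slashdot.org/faq/",
])
```

With the sample values from the file, the FAQ page is fetched first, the ordinary page second, and the `article.pl` page last.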