Larbin study notes (2): how to configure larbin

Before running larbin, you need to adjust its configuration.

Two files matter most: larbin.conf and options.h.

Most changes go in larbin.conf (Appendix 1 at the end describes this file in detail):

startUrl http://www.hfut.edu.cn                # URL where the crawl starts
limitToDomain .edu .com .cn .net .com.cn end   # only crawl URLs whose domain ends with one of these suffixes
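
To make the effect of limitToDomain concrete, here is a minimal Python sketch of suffix filtering. It is not larbin's code (larbin itself is C++). The helper name in_allowed_domain and the matching rule (the host name must end with one of the suffixes) are my assumptions for illustration only.

```python
from urllib.parse import urlparse

# Suffixes copied from the limitToDomain line above.
ALLOWED_SUFFIXES = (".edu", ".com", ".cn", ".net", ".com.cn")

def in_allowed_domain(url, suffixes=ALLOWED_SUFFIXES):
    """Return True if the URL's host name ends with one of the suffixes.

    Hypothetical helper: larbin's real matching logic may differ.
    """
    host = urlparse(url).hostname or ""   # hostname is None for non-URLs
    return host.lower().endswith(tuple(suffixes))
```

With this rule www.hfut.edu.cn passes because it ends in .cn, while a .org host would be skipped.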

In options.h (these are my settings):
//#define DEFAULT_OUTPUT   // do nothing…
#define SIMPLE_SAVE            // save in files named save/dxxxxxx/fyyyyyy: 2000 files per directory, with an index.
                               // This one matters: the default is "do nothing", which produces no output at all.
//#define MIRROR_SAVE       // save in files (respect sites hierarchy), i.e. mirror the site layout
#define STATS_OUTPUT        // do some stats on pages
#define FOLLOW_LINKS      // do you want to follow links in pages
#define LINKS_INFO             // keep the list of links found in each page
#define NO_DUP              // do not accept pages with duplicate content
#define EXIT_AT_END           // exit when there is nothing left to fetch
#define COOKIES
#define CGILEVEL 0              // 0 = also fetch server-side (CGI) URLs, i.e. URLs whose query string contains ?, & or =
#define DEPTHBYSITE
#define THREAD_OUTPUT
#define RELOAD
#define GRAPH
#define STATS
#define BIGSTATS
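
The save/dxxxxxx/fyyyyyy layout used by SIMPLE_SAVE (2000 files per directory) can be pictured with a small Python sketch. This is only an illustration, not larbin's code: I assume a sequential page counter, and the six-digit zero padding is inferred from the placeholder in the comment above. The function name simple_save_path is hypothetical.

```python
FILES_PER_DIR = 2000  # from the SIMPLE_SAVE description: 2000 files per directory

def simple_save_path(page_number, root="save"):
    """Map a page counter to a save/dxxxxxx/fyyyyyy style path.

    Assumption: pages are numbered sequentially from 0 and both parts
    are zero-padded to six digits; the real naming may differ.
    """
    if page_number < 0:
        raise ValueError("page_number must be non-negative")
    directory, index = divmod(page_number, FILES_PER_DIR)
    return f"{root}/d{directory:06d}/f{index:06d}"
```

Under these assumptions, page 1999 is the last file of the first directory and page 2000 opens the second one.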

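Once you have finished editing, note that changing only larbin.conf requires no recompilation, whereas after changing options.h you must run gmake to rebuild.

In addition, some values in types.h can be adjusted as well, such as maxUrlsBySite and maxPageSize.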


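Appendix 1

   Run larbin with ./larbin.

  • By default its configuration file is larbin.conf; pass -c filename to use your own configuration file.
  • Pass -scratch to make larbin start crawling from scratch.

    Overview of the entries in larbin.conf (changing the configuration file does not require recompiling larbin):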

  

###############################################
# Who are you ?
# mail of the one who launched larbin (YOUR mail)

From larbin2.6.3@unspecified.mail    # sent in the HTTP headers so that web server administrators
                                     # can contact whoever is running larbin
# name of the bot (sent with http headers)
UserAgent larbin_2.6.3     # identifies the client

############################################
# What are the inputs and outputs of larbin
# port on which is launched the http statistic webserver
# if unset or set to 0, no webserver is launched
httpPort 8081   # larbin has a simple built-in web server; its status can be monitored at
                # http://localhost:8081/. If this value is 0, the web server is not started.
# port on which you can submit urls to fetch
# no input is possible if you comment this line or use port 0

inputPort 1976

############################################
# parameters to adapt depending on your network
# Number of connexions in parallel (to adapt depending of your network speed)
pagesConnexions 100  # number of pages fetched in parallel; adjust to your network bandwidth
# Number of dns calls in parallel
dnsConnexions 5         # number of parallel DNS lookups
# How deep do you want to go in a site
depthInSite 5             # crawl depth within a site
# do you want to follow external links
noExternalLinks         # when present, only links on the same host are followed
# time between 2 calls on the same server (in sec) : NEVER less than 30
waitDuration 60         # interval between two fetches from the same server
# Make requests through a proxy (use with care)
#proxy www 8080    # proxy host and port

##############################################
# now, let's customize the search

# first page to fetch (you can specify several urls)
startUrl http://hi.csdn.net/wphnudt     # start URL of the crawl; several can be given

# Do you want to limit your search to a specific domain ?
# if yes, uncomment the following line
#limitToDomain .fr .dk .uk end

# What are the extensions you surely don't want
# never forbid .html, .htm and so on : larbin needs them

# Extensions of objects that must not be downloaded; comment out or add extensions to control what is fetched
forbiddenExtensions
.tar .gz .tgz .zip .Z .rpm .deb
.ps .dvi .pdf
.png .jpg .jpeg .bmp .smi .tiff .gif
.mov .avi .mpeg .mpg .mp3 .qt .wav .ram .rm
.jar .java .class .diff
.doc .xls .ppt .mdb .rtf .exe .pps .so .psd
end
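
The listing above follows a simple shape: # starts a comment, most keys take one value, noExternalLinks takes none, proxy takes a host and a port, and limitToDomain / forbiddenExtensions take a list closed by end. As a way to check your own file, here is a rough Python sketch of a reader for that shape. It is my own illustration, not larbin's parser. Treating keys case-insensitively and treating # as a comment only at line start or after whitespace are assumptions.

```python
import re

LIST_KEYS = {"limittodomain", "forbiddenextensions"}  # values run until "end"
FLAG_KEYS = {"noexternallinks"}                        # present or absent, no value
ARITY = {"proxy": 2}                                   # proxy takes host and port

def parse_larbin_conf(text):
    """Parse larbin.conf-style text into {lower-cased key: list of values}."""
    tokens = []
    for line in text.splitlines():
        # Drop comments: '#' at line start or after whitespace, to end of line.
        line = re.sub(r"(^|\s)#.*$", "", line)
        tokens.extend(line.split())

    conf = {}
    i = 0
    while i < len(tokens):
        key = tokens[i].lower()
        i += 1
        if key in FLAG_KEYS:
            conf[key] = [True]
        elif key in LIST_KEYS:
            end = tokens.index("end", i)  # raises ValueError if "end" is missing
            conf.setdefault(key, []).extend(tokens[i:end])
            i = end + 1
        else:
            n = ARITY.get(key, 1)
            value = tokens[i:i + n]
            if len(value) < n:
                raise ValueError(f"missing value for {key}")
            i += n
            conf.setdefault(key, []).append(value[0] if n == 1 else tuple(value))
    return conf
```

Every key maps to a list, so repeated startUrl lines simply accumulate.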


Appendix 2: the official documentation for these settings:

Simple customizations

larbin.conf

The basic configurations are made in larbin.conf. Here are the different fields of this file :

  • From : YOUR mail : sent with http headers : very useful when someone wants to complain about the robot :-(
  • UserAgent : name of the robot (sent with each request)
  • httpPort : port on which is launched the http statistic webserver (see http://localhost:8081/ when larbin is launched). If you set port to 0, no webserver will be launched. This can allow larbin not to launch a single thread.
  • inputPort : port on which you can submit urls to fetch. If this line does not exist or if the port is 0, no input will be available.
  • pagesConnexions : Number of pages you fetch in parallel (adapt it to your network speed). Decrease this if you have too many timeouts (see stats) : a timeout rate of 10% seems to be a maximum.
  • dnsConnexions : Number of dns calls in parallel. 10 should be ok.
  • depthInSite : How deep do you want to go in a site.
  • noExternalLinks : Only follow links which are related to the same host.
  • waitDuration : time between 2 calls at the same server in seconds. It should never be less than 30 s. However, even with 60 s, it won’t slow the crawler much, and it is a much better behaviour.
  • proxy : if you want to connect through a proxy (host port). Unless you have no other way to connect to the internet, you should not use this because it might slow the crawler a lot, and is probably also not so good for the proxy (especially if it has a cache).
  • StartUrl : Where the search starts. This appears not to be very important, as long as the page contains external urls.
  • limitToDomain : with this option enabled, you will only crawl pages of some specific domain (.fr and .dk for example).
  • forbiddenExtensions : What are the extensions you don’t want ? (write all of them and terminate your list with end)
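
The waitDuration rule (at least 30 s, preferably 60 s, between two calls to the same server) can be sketched as a small per-host gate. This Python sketch only illustrates the idea and is not how larbin implements it. The class name PolitenessGate is hypothetical, and keying by host name rather than by server address is an assumption.

```python
from urllib.parse import urlparse

class PolitenessGate:
    """Track, per host, the earliest time the next fetch is allowed."""

    def __init__(self, wait_duration=60.0):
        self.wait_duration = wait_duration   # seconds, like waitDuration
        self._next_allowed = {}              # host -> earliest next fetch time

    def seconds_until_allowed(self, url, now):
        """Return how long to wait before fetching url at time `now`."""
        host = urlparse(url).hostname
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that url was fetched at time `now`."""
        host = urlparse(url).hostname
        self._next_allowed[host] = now + self.wait_duration
```

Because the delay is tracked per host, a crawler that works on many sites in parallel rarely has to wait, which fits the remark that even 60 s does not slow the crawler much.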

options.h

In this file, you can define options which change what will be done. Here are the different things you can define (you must recompile larbin if you change one of those) :

  • The first thing you can define is the module you want to use for output. This defines what you want to do with the pages larbin gets. Here are the different options :
    • DEFAULT_OUTPUT : This module mainly does nothing, except statistics.
    • SIMPLE_SAVE : This module saves pages on disk. It stores 2000 files per directory (with an index).
    • MIRROR_SAVE : This module saves pages on disk with the hierarchy of the site they come from. It uses one directory per site.
    • STATS_OUTPUT : This module makes some stats on the pages. In order to see the results, see http://localhost:8081/output.html.

    These modules can be customized in src/types.h .
    If you want to define a new module, please have a look at “src/interf/useroutput.cc”, and do not hesitate to send me your work for inclusion.

  • SPECIFICSEARCH : If this option is set, larbin’s goal is to search for specific document. You must then define 2 arrays (NULL terminated) of char *, contentTypes and privilegedExts , which define respectively the content types which are looked for, and the extension of files (this extension is only used for speeding the search, pages are said to be specific only by looking at the content/type in http headers). You should also define another option telling how you want to manage specific pages :
    • DEFAULT_SPECIFIC : Default way of managing specific files : they are treated as html (ie same size limitations…), except that they are not parsed.
    • SAVE_SPECIFIC : Specific pages are saved on disk. This allows in particular specific pages to be much bigger (see src/types.h for customizing this module).
    • DYNAMIC_SPECIFIC : for big pages, larbin uses dynamically allocated buffers.

    If you want to define a new policy, please have a look at “src/fetch/specbuf.cc” and “src/fetch/specbuf.h”, and do not hesitate to send me your work for inclusion.

  • LINKS_INFO : Associate to each page the list of the links it contains. This information can be used in “useroutput.cc” with page->getLinks() .
  • FOLLOW_LINKS : If this option is not set, html pages won’t be parsed and links won’t be followed. This can be useful when you feed larbin through the input system.
  • NO_DUP : if this option is set, larbin does not return success when a page with the same content as an earlier one is encountered.
  • URL_TAGS : if this option is set, an int is associated to every url (by default 0). If you use the input system, you’ll have to give an int and the url instead of just the url. When the page is fetched, you’ll get it with the int (redirections are followed).
  • EXIT_AT_END : If this option is set, larbin exits when there are not any more urls to get.
  • IMAGES : If set, larbin gets the images contained in pages (ie follow img src links). Make sure to update forbiddenExtensions in larbin.conf according to your needs.
  • ANYTYPE : If set, larbin gets every pages without caring about content type. Make sure to update forbiddenExtensions in larbin.conf according to your needs.
  • COOKIES : If set, larbin manages cookies. Up to now, it is a very simple implementation, but it should be suitable in more than 90% of the situations.
  • CGILEVEL : This option is followed by an integer which specifies how reluctant to cgi you are. 0 means you want all cgis, 1 means you refuse urls with ‘?’ or ‘=’ inside, 2 means you also want to ban urls with ‘cgi’ inside.
  • MAXBANDWIDTH : This option is followed by an integer which indicates the maximum bandwidth larbin should use. Because of the way bandwidth is limited, larbin might use 10 to 20 per cent more bandwidth than expected. If this option is not set, there is no bandwidth limitation.
  • DEPTHBYSITE : If this option is set, when a link points to another site, the depth of the new url is reinitialized; otherwise the depth is never reset.
  • THREAD_OUTPUT : This option must be set if the code in “useroutput.cc” (the code you add) can use blocking instructions (read/write on network file descriptor…). If it is not set, there is only one thread in the program (except the webserver if any), so no locking is needed.
  • RELOAD : If this option is enabled, larbin restarts from where it last stopped when you launch it. This lets you stop and restart larbin as needed (or restart after a crash). If you want to restart from scratch, use the -scratch option.
  • NOWEBSERVER : Do not launch the webserver. This can be useful if you don’t want to launch any thread.
  • GRAPH : Include nice histograms in the real time stat page.
  • NDEBUG : Disable debugging information in the webserver.
  • NOSTATS : Disable stats information in the webserver.
  • STATS : Display stats on stdout every 8 seconds.
  • BIGSTATS : Display the name of every page that is fetched on stdout. This might slow larbin down considerably.
  • CRASH : Should only be used for reporting terrible bugs (with make debug).
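
The three CGILEVEL values described in the list above can be restated as a short Python sketch. It only mirrors the wording of the documentation (level 1 refuses urls containing ‘?’ or ‘=’, level 2 also refuses urls containing ‘cgi’). The function name cgi_allowed is hypothetical and this is not larbin's implementation.

```python
def cgi_allowed(url, cgi_level):
    """Return True if url would be accepted at the given CGILEVEL.

    Level 0 accepts everything, level 1 refuses urls containing '?' or '=',
    level 2 additionally refuses urls containing 'cgi'.
    """
    if cgi_level >= 1 and ("?" in url or "=" in url):
        return False
    if cgi_level >= 2 and "cgi" in url:
        return False
    return True
```

With CGILEVEL 0, as in my options.h, every url passes this check.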

src/types.h

If you want to tune larbin a little more, go and see this file (it is supposed to be commented enough). Of course, for those changes to have effects, you have to recompile larbin.
