java crawler4j github_GitHub - zhuoran/crawler4j: Open Source Simple Web Crawler for Java. Simple F...

#Crawler4j is an open-source web crawler written in Java

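###Crawler4j is a web crawler that reads its crawl tasks from a configuration file and runs them on multiple threads. Each crawl task runs in its own thread context, and several tasks can be configured in the same file. Complex crawl tasks can be implemented by extending the base classes the framework provides, which makes it easy to integrate the crawler with other parsing and storage programs.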
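The one-thread-per-task model described above can be pictured with the following minimal sketch. It is not crawler4j code: `CrawlTask`, `MultiTaskSketch`, and the example URLs are hypothetical names chosen for illustration, and the actual fetching is replaced by a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: CrawlTask is NOT a crawler4j class.
class CrawlTask implements Runnable {
    private final String name;        // plays the role of a task name
    private final long delayMillis;   // pause between two fetches, in ms
    private final List<String> urls;  // pages this task should visit
    final List<String> fetched = new ArrayList<>(); // per-task state, not shared

    CrawlTask(String name, long delayMillis, List<String> urls) {
        this.name = name;
        this.delayMillis = delayMillis;
        this.urls = urls;
    }

    String name() {
        return name;
    }

    @Override
    public void run() {
        for (String url : urls) {
            // Placeholder: a real crawler would download and parse the page here.
            fetched.add(Thread.currentThread().getName() + " -> " + url);
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}

class MultiTaskSketch {
    // Starts one named thread per task and waits until all of them finish.
    static void runAll(List<CrawlTask> tasks) {
        List<Thread> threads = new ArrayList<>();
        for (CrawlTask task : tasks) {
            Thread thread = new Thread(task, task.name());
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) {
        List<CrawlTask> tasks = Arrays.asList(
                new CrawlTask("Infoq-News", 10,
                        Arrays.asList("http://example.com/a", "http://example.com/b")),
                new CrawlTask("Other-Task", 10,
                        Arrays.asList("http://example.com/c")));
        runAll(tasks);
        for (CrawlTask task : tasks) {
            task.fetched.forEach(System.out::println);
        }
    }
}
```

Each task keeps its own `fetched` list, so the threads share no mutable state; this mirrors the idea of an independent thread context per task, under the assumptions stated above.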

####For usage, see the crawler4j-simple module

##It's composed of two parts:

crawler4j-core: crawler4j core module.

crawler4j-simple: a simple web crawler implementation based on crawler4j-core.

================================================================

Quick Start

0.Install the git and maven command-line tools:

yum install git

or: apt-get install git

cd ~

wget http://www.apache.org/dist/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz

tar zxvf apache-maven-3.0.4-bin.tar.gz

vi .bash_profile

- edit: export PATH=$PATH:~/apache-maven-3.0.4/bin

source .bash_profile

1.Checkout the crawler4j source code:

cd ~

git clone https://github.com/zhuoran/crawler4j.git

cd crawler4j

git checkout -b crawler4j-0.1

git checkout master

2.Import the source code into Eclipse as a project:

cd ~/crawler4j/crawler4j-simple

mvn eclipse:eclipse

Eclipse -> Menu -> File -> Import -> Existing Projects into Workspace -> Browse -> Finish

Edit Config:

crawler4j-simple/src/main/resources/crawler4j.xml

3.Build the binary package:

cd ~/crawler4j

mvn clean install -Dmaven.test.skip

ll

4.Install the crawler4j simple demo:

cd ~/crawler4j/crawler4j-simple/target

tar zxvf crawler4j-simple-0.1-assembly.tar.gz

cd crawler4j-simple-0.1/bin

./start.sh

================================================================

Configuration

Crawler config example:

See crawler4j-simple/src/main/resources/crawler4j.xml

You can configure multiple crawling tasks in crawler4j.xml

charset: UTF-8

name: Infoq-News

delay: 1000

extract_links_elementId: div.box-content-3

parser: me.zhuoran.crawler4j.simple.ParserDemo

crawler: me.zhuoran.crawler4j.simple.InfoqCrawler

[charset] Charset of the target HTML pages, such as UTF-8 or GBK.

[name] Task name or thread name.

[delay] Delay between fetches of HTML pages, in milliseconds.

[url] URL can be a list page URL or a target page URL.

Multiple list page URLs must be separated with ",".

Multiple target page URLs must be separated with "|".

[extract_links_elementId] This tag can be a regex string or a jsoup query.

[parser] Fully qualified class name of the parser.

[crawler] Fully qualified class name of the crawler.
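To make the separator rules and the class-name settings concrete, here is a minimal sketch of how such values could be interpreted. It is an assumption about usage, not crawler4j's own code: `TaskSettingsSketch`, `splitUrls`, and `instantiate` are hypothetical names, and `java.util.ArrayList` merely stands in for a class such as `me.zhuoran.crawler4j.simple.ParserDemo`, which is not on the classpath here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of crawler4j.
class TaskSettingsSketch {

    // Splits a [url] value on the given separator: ',' for list page URLs,
    // '|' for target page URLs. Whitespace is trimmed, empty entries dropped.
    static List<String> splitUrls(String value, char separator) {
        List<String> urls = new ArrayList<>();
        // Pattern.quote is required because '|' is a regex metacharacter.
        String regex = Pattern.quote(String.valueOf(separator));
        for (String part : value.split(regex)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    // Creates an object from a fully qualified class name, the kind of value
    // the [parser] and [crawler] settings hold. Requires a no-arg constructor.
    static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(splitUrls(
                "http://example.com/list1, http://example.com/list2", ','));
        System.out.println(splitUrls(
                "http://example.com/a.html|http://example.com/b.html", '|'));
        // Stand-in class name; a real task would name its parser class here.
        Object parser = instantiate("java.util.ArrayList");
        System.out.println(parser.getClass().getName());
    }
}
```

Loading classes by name is what lets a configuration file choose the parser and crawler without recompiling; whether crawler4j does it exactly this way is not confirmed by this document.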

================================================================

Feature

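Support automatic recognition of more data types

Improve the HttpClient code, support more crawl modes, and support parsing of gzip and other formats

Refactor and optimize the architecture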

For more, please refer to:

[Wiki](http://www.zhuoran.me/crawler4j/wiki) (in preparation)

================================================================

Support

Twitter: @lopor

Sina Weibo: @王小然
