java crawler4j github_GitHub - zhuoran/crawler4j: Open Source Simple Web Crawler for Java. Simple F...

weixin_39849762

Published 2021-03-02 07:44:57

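Tags: java crawler4j github

Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_39849762/article/details/114956412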

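#Crawler4j is an open-source web crawler written in Java

###Crawler4j reads its crawl tasks from a configuration file and fetches pages with multiple threads. Each crawl task runs in its own thread context, and several tasks can be configured in the same file. Complex crawl tasks can be implemented by extending the base classes the framework provides, which makes it easy to integrate the crawler with other parsing and storage programs.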
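The one-thread-per-task idea described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: `TaskThreadsSketch`, `CrawlTask`, and `runAll` are hypothetical names, not classes from crawler4j-core, and the task body only records which thread ran it instead of fetching pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TaskThreadsSketch {

    /** Hypothetical stand-in for one configured crawl task; not a crawler4j class. */
    public static final class CrawlTask implements Runnable {
        private final String name;
        private final Map<String, String> results;

        public CrawlTask(String name, Map<String, String> results) {
            this.name = name;
            this.results = results;
        }

        @Override
        public void run() {
            // A real task would fetch and parse pages here.
            // This sketch only records the name of the thread that ran the task.
            results.put(name, Thread.currentThread().getName());
        }
    }

    /** Starts one thread per task, names it after the task, and waits for all of them. */
    public static Map<String, String> runAll(List<String> taskNames) {
        Map<String, String> results = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (String name : taskNames) {
            Thread thread = new Thread(new CrawlTask(name, results), name);
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for crawl tasks", e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // "Infoq-News" is the task name from the sample config; "Another-Task" is made up.
        System.out.println(runAll(java.util.Arrays.asList("Infoq-News", "Another-Task")));
    }
}
```

Naming each thread after its task mirrors the `[name]` field, which the configuration describes as the task name or thread name. How crawler4j actually schedules its threads may differ from this sketch.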

####For usage, see the crawler4j-simple module

##It's composed of two parts:

crawler4j-core: crawler4j core module.

crawler4j-simple: a simple web crawler implementation based on crawler4j-core.

================================================================

Quick Start

0.Install the git and maven command-line tools:

yum install git

or: apt-get install git

cd ~

wget http://www.apache.org/dist/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz

tar zxvf apache-maven-3.0.4-bin.tar.gz

vi .bash_profile

- edit: export PATH=$PATH:~/apache-maven-3.0.4/bin

source .bash_profile

1.Checkout the crawler4j source code:

cd ~

git clone https://github.com/zhuoran/crawler4j.git

cd crawler4j

git checkout -b crawler4j-0.1

git checkout master

2.Import the source code as an Eclipse project:

cd ~/crawler4j/crawler4j-simple

mvn eclipse:eclipse

Eclipse -> Menu -> File -> Import -> Existing Projects into Workspace -> Browse -> Finish

Edit Config:

crawler4j-simple/src/main/resources/crawler4j.xml

3.Build the binary package:

cd ~/crawler4j

mvn clean install -Dmaven.test.skip

ll

4.Install the crawler4j simple demo:

cd ~/crawler4j/crawler4j-simple/target

tar zxvf crawler4j-simple-0.1-assembly.tar.gz

cd crawler4j-simple-0.1/bin

./start.sh

================================================================

Configuration

Crawler Config Example :

See crawler4j-simple/src/main/resources/crawler4j.xml

You can configure multiple crawling tasks in crawler4j.xml

charset: UTF-8

name: Infoq-News

delay: 1000

extract_links_elementId: div.box-content-3

parser: me.zhuoran.crawler4j.simple.ParserDemo

crawler: me.zhuoran.crawler4j.simple.InfoqCrawler

[charset] Charset of the target HTML page, such as UTF-8 or GBK.

[name] Task name or thread name.

[delay] Delay between fetches of HTML pages, in milliseconds.

[url] A list page URL or a target page URL.

Multiple list page URLs must be separated by ",".

Multiple target page URLs must be separated by "|".

[extract_links_elementId] A regex string or a jsoup query.

[parser] Fully qualified class name of the parser.

[crawler] Fully qualified class name of the crawler.
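The separator rules for the `[url]` field can be illustrated with a small sketch. It is not crawler4j's own parsing code: `UrlFieldSketch`, `listPageUrls`, and `targetPageUrls` are hypothetical names, the example.com URLs are placeholders, and trimming whitespace and dropping empty entries are assumptions about reasonable behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFieldSketch {

    /** Splits a field on a literal separator, trimming whitespace and dropping empty entries. */
    public static List<String> split(String field, String separator) {
        List<String> urls = new ArrayList<>();
        // Pattern.quote matters for "|", which is the alternation operator in a regex.
        for (String part : field.split(Pattern.quote(separator))) {
            String url = part.trim();
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    /** List page URLs are separated by ",". */
    public static List<String> listPageUrls(String field) {
        return split(field, ",");
    }

    /** Target page URLs are separated by "|". */
    public static List<String> targetPageUrls(String field) {
        return split(field, "|");
    }

    public static void main(String[] args) {
        System.out.println(listPageUrls("http://example.com/list/1, http://example.com/list/2"));
        System.out.println(targetPageUrls("http://example.com/a.html|http://example.com/b.html"));
    }
}
```

A plain `field.split("|")` would treat the bar as a regex and split between every character, which is why the separator is quoted. A URL that itself contains "," or "|" would be cut apart by this approach.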

================================================================

Feature

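Support automatic detection of more data types

Improve the HttpClient code, support more fetch modes, and support parsing gzip and other formats

Refactor and optimize the architecture

For more, please refer to:

[Wiki](http://www.zhuoran.me/crawler4j/wiki) (in preparation)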

================================================================

Support

Twitter: @lopor

Sina Weibo: @王小然
