A web crawler, also called a web spider (some projects call it a "walker"), is defined by Wikipedia as a program that systematically scans the Internet for the purpose of indexing. There are many open-source crawler projects on the web, the best known being Heritrix and Apache Nutch.
Sometimes you need to collect information from the web. When that information can be fetched in a single, uniform way but is tedious to gather by hand — counting how many posts a site publishes each month and which tags it uses, collecting corpora for a natural language processing project, or collecting images for a pattern recognition project — a crawler is the right tool for the job. A web crawler is also an essential component of any search engine.
Many web crawlers are written in Python, Java, or C#. The program given here is a Java crawler. To save time and space, it is restricted to pages under this blog's address (that is, http://johnhany.net/, excluding everything under http://johnhany.net/wp-content/), and it collects all the tags used across those pages. With the restriction removed, the same code can scan the wider web; with small changes to the output format, it can serve as a tool for generating the blog's sitemap.
The code can also be downloaded here: johnhany/WPCrawler.
Requirements
My development environment is Windows 7 + Eclipse.
XAMPP is needed to provide the port through which the MySQL database is reached by URL.
Three open-source Java libraries are also used:
Apache HttpComponents 4.3 provides the HTTP interface, used to send HTTP requests to a target URL and fetch the page content;
HTML Parser 2.0 parses the pages and extracts links from the DOM nodes;
MySQL Connector/J 5.1.27 connects the Java program to MySQL, so the database can be driven from Java code.
Code
The code lives in three files: crawler.java, httpGet.java and parsePage.java, all in the package net.johnhany.wpcrawler.
crawler.java
```java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class crawler {

    public static void main(String args[]) throws Exception {
        String frontpage = "http://johnhany.net/";
        Connection conn = null;

        //connect the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
            conn = DriverManager.getConnection(dburl, "root", "");
            System.out.println("connection built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }

        String sql = null;
        String url = frontpage;
        Statement stmt = null;
        ResultSet rs = null;
        int count = 0;

        if (conn != null) {
            //create database and table that will be needed
            try {
                sql = "CREATE DATABASE IF NOT EXISTS crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "USE crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
            } catch (SQLException e) {
                e.printStackTrace();
            }

            //crawl every link in the database
            while (true) {
                //get page content of link "url"
                httpGet.getByString(url, conn);
                count++;

                //set boolean value "crawled" to true after crawling this page
                sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
                stmt = conn.createStatement();

                if (stmt.executeUpdate(sql) > 0) {
                    //get the next page that has not been crawled yet
                    sql = "SELECT * FROM record WHERE crawled = 0";
                    stmt = conn.createStatement();
                    rs = stmt.executeQuery(sql);
                    if (rs.next()) {
                        url = rs.getString(2);
                    } else {
                        //stop crawling if we reach the bottom of the list
                        break;
                    }

                    //set a limit on the crawling count
                    if (count > 1000 || url == null) {
                        break;
                    }
                }
            }

            conn.close();
            conn = null;
            System.out.println("Done.");
            System.out.println(count);
        }
    }
}
```
httpGet.java
```java
package net.johnhany.wpcrawler;

import java.io.IOException;
import java.sql.Connection;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class httpGet {

    public final static void getByString(String url, Connection conn) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            System.out.println("executing request " + httpget.getURI());

            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
                public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);

            /*
            //print the content of the page
            System.out.println("----------------------------------------");
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
            */

            parsePage.parseFromString(responseBody, conn);

        } finally {
            httpclient.close();
        }
    }
}
```
parsePage.java
```java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.net.URLDecoder;

public class parsePage {

    public static void parseFromString(String content, Connection conn) throws Exception {
        Parser parser = new Parser(content);
        HasAttributeFilter filter = new HasAttributeFilter("href");

        try {
            NodeList list = parser.parse(filter);
            int count = list.size();

            //process every link on this page
            for (int i = 0; i < count; i++) {
                Node node = list.elementAt(i);

                if (node instanceof LinkTag) {
                    LinkTag link = (LinkTag) node;
                    String nextlink = link.extractLink();
                    String mainurl = "http://johnhany.net/";
                    String wpurl = mainurl + "wp-content/";

                    //only save pages from "http://johnhany.net"
                    if (nextlink.startsWith(mainurl)) {
                        String sql = null;
                        ResultSet rs = null;
                        PreparedStatement pstmt = null;
                        Statement stmt = null;
                        String tag = null;

                        //do not save any page from "wp-content"
                        if (nextlink.startsWith(wpurl)) {
                            continue;
                        }

                        try {
                            //check if the link already exists in the database
                            sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
                            stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                            rs = stmt.executeQuery(sql);

                            if (rs.next()) {

                            } else {
                                //if the link does not exist in the database, insert it
                                sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
                                pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                pstmt.execute();
                                System.out.println(nextlink);

                                //use substring for better comparison performance
                                nextlink = nextlink.substring(mainurl.length());
                                //System.out.println(nextlink);

                                if (nextlink.startsWith("tag/")) {
                                    tag = nextlink.substring(4, nextlink.length() - 1);
                                    //decode in UTF-8 for Chinese characters
                                    tag = URLDecoder.decode(tag, "UTF-8");
                                    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
                                    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                    //if the links are different from each other, the tags must be different
                                    //so there is no need to check if the tag already exists
                                    pstmt.execute();
                                }
                            }
                        } catch (SQLException e) {
                            //handle the exceptions
                            System.out.println("SQLException: " + e.getMessage());
                            System.out.println("SQLState: " + e.getSQLState());
                            System.out.println("VendorError: " + e.getErrorCode());
                        } finally {
                            //close and release the resources of PreparedStatement, ResultSet and Statement
                            if (pstmt != null) {
                                try {
                                    pstmt.close();
                                } catch (SQLException e2) {
                                }
                            }
                            pstmt = null;

                            if (rs != null) {
                                try {
                                    rs.close();
                                } catch (SQLException e1) {
                                }
                            }
                            rs = null;

                            if (stmt != null) {
                                try {
                                    stmt.close();
                                } catch (SQLException e3) {
                                }
                            }
                            stmt = null;
                        }
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
```
How the Program Works
The "Internet" is, as the name suggests, a net: between any two nodes there may be a path. From a graph-theory point of view, a crawler's scan of the web is a traversal of a directed graph (a link points from one page to another, so the edges are directed). The two common traversal strategies are depth-first and breadth-first; for background, see the tree traversal article at https://en.wikibooks.org/wiki/A-level_Computing/AQA/Paper_1/Fundamentals_of_algorithms/Tree_traversal. My program uses breadth-first traversal.
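To make the breadth-first idea concrete, here is a minimal sketch of the same strategy using an in-memory queue and visited set. It is illustrative only: the actual program below keeps its frontier in the record table instead, with the crawled flag playing the role of the visited set, and fetchLinks() is a hypothetical stand-in for the fetch-and-parse step.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    //hypothetical stand-in: fetch a page and return the links found on it
    static List<String> fetchLinks(String url) { return Collections.emptyList(); }

    public static void crawl(String seed) {
        Queue<String> frontier = new ArrayDeque<String>(); //pages waiting to be crawled
        Set<String> visited = new HashSet<String>();       //pages already seen
        frontier.add(seed);
        visited.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();            //oldest link first: breadth-first order
            for (String next : fetchLinks(url)) {
                if (visited.add(next)) {             //add() returns false for duplicates
                    frontier.add(next);
                }
            }
        }
    }
}
```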
Execution starts from main() in crawler.java.
```java
Class.forName("com.mysql.jdbc.Driver");
String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
conn = DriverManager.getConnection(dburl, "root", "");
System.out.println("connection built");
```
First, DriverManager connects to the MySQL service. The program uses XAMPP's default MySQL port, 3306; the port value is shown on the XAMPP main panel.
Once Apache and MySQL are both running, entering "http://localhost/phpmyadmin/" in the browser's address bar opens the database. After the program finishes, you can check there whether it ran correctly.
```java
sql = "CREATE DATABASE IF NOT EXISTS crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "USE crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);
```
With the connection established, the program creates a database named "crawler" holding two tables. The first, "record", has the fields "recordID", "URL" and "crawled", which store each address's ID, the link itself, and whether the address has been scanned yet. The second, "tags", has the fields "tagnum" and "tagname", which store each tag's ID and name.
```java
while (true) {
    httpGet.getByString(url, conn);
    count++;

    sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
    stmt = conn.createStatement();

    if (stmt.executeUpdate(sql) > 0) {
        sql = "SELECT * FROM record WHERE crawled = 0";
        stmt = conn.createStatement();
        rs = stmt.executeQuery(sql);
        if (rs.next()) {
            url = rs.getString(2);
        } else {
            break;
        }
    }
}
```
A while loop then works through the addresses in the record table one by one. On each pass, the current address url is handed to httpGet.getByString(), after which the record's crawled flag is set to 1 to mark the page as processed. The program then looks for the next address whose crawled flag is 0 and continues until it reaches the end of the table.
One detail worth noting: executeQuery() returns a ResultSet rs that holds every row the SQL query matched, together with a cursor that initially sits before the first row. You must call rs.next() once to move the cursor onto the first result (the call returns true); each further rs.next() advances the cursor one row and returns true, until no results remain, at which point rs.next() returns false.
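Because of this cursor behavior, the standard JDBC idiom for walking an entire result set is a while loop on rs.next(). A minimal sketch, reusing this program's record table:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class ResultSetDemo {
    static void printAll(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT recordID, URL, crawled FROM record");
        while (rs.next()) {                  //false once the cursor passes the last row
            int id = rs.getInt("recordID");  //columns can be read by name...
            String link = rs.getString(2);   //...or by 1-based index, as the crawler does
            System.out.println(id + "\t" + link);
        }
        rs.close();
        stmt.close();
    }
}
```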
Another detail: creating databases and tables, INSERT and UPDATE all go through executeUpdate(), while SELECT goes through executeQuery(). executeQuery() always returns a ResultSet; executeUpdate() returns the number of rows affected by the statement.
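That return value is exactly what the crawler's main loop tests with if (stmt.executeUpdate(sql) > 0): the update "succeeded" only if some row actually matched the URL. A small sketch of the same pattern:

```java
import java.sql.Connection;
import java.sql.Statement;

public class UpdateDemo {
    //returns true only if a row with this URL existed and was updated
    static boolean markCrawled(Connection conn, String url) throws Exception {
        Statement stmt = conn.createStatement();
        //executeUpdate() returns the number of rows affected, not a ResultSet
        int affected = stmt.executeUpdate(
                "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'");
        stmt.close();
        return affected > 0;
    }
}
```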
The getByString() method in httpGet.java sends a request to the given URL and downloads the page content.
```java
HttpGet httpget = new HttpGet(url);
System.out.println("executing request " + httpget.getURI());

ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
    public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};
String responseBody = httpclient.execute(httpget, responseHandler);
```
This code comes from the sample shipped with the HttpClient component of Apache HttpComponents and can be used as-is in many situations. It produces a string, responseBody, holding the complete text of the page.
responseBody is then passed to the parseFromString() method in parsePage.java to extract the links.
```java
Parser parser = new Parser(content);
HasAttributeFilter filter = new HasAttributeFilter("href");

try {
    NodeList list = parser.parse(filter);
    int count = list.size();

    //process every link on this page
    for (int i = 0; i < count; i++) {
        Node node = list.elementAt(i);
        if (node instanceof LinkTag) {
```
In an HTML document, links usually live in the href attribute of an a tag, so the code creates an attribute filter. The NodeList holds all DOM nodes of the HTML document; looping over them and keeping only the matching tags extracts every link on the page.
nextlink.startsWith() then narrows the set further: only links beginning with "http://johnhany.net/" are processed, and links beginning with "http://johnhany.net/wp-content/" are skipped.
```java
sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
rs = stmt.executeQuery(sql);

if (rs.next()) {

} else {
    //if the link does not exist in the database, insert it
    sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
```
The code looks the link up in the record table. If it is already there (rs.next() returns true), nothing more is done; if it is not (rs.next() returns false), the address is inserted with crawled set to 0. Because recordID was declared AUTO_INCREMENT, Statement.RETURN_GENERATED_KEYS is passed so the database assigns the proper ID.
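The program requests the generated keys but never reads them back. If you did want the new recordID (to log it, say), the standard JDBC route is getGeneratedKeys(). A sketch under that assumption, which also uses a ? placeholder rather than string concatenation, sidestepping quoting problems with unusual URLs:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class InsertDemo {
    //insert a link and return the recordID the database assigned to it
    static int insertLink(Connection conn, String nextlink) throws Exception {
        PreparedStatement pstmt = conn.prepareStatement(
                "INSERT INTO record (URL, crawled) VALUES (?, 0)",
                Statement.RETURN_GENERATED_KEYS);
        pstmt.setString(1, nextlink);
        pstmt.execute();
        ResultSet keys = pstmt.getGeneratedKeys(); //holds the auto_increment value
        int recordID = keys.next() ? keys.getInt(1) : -1;
        pstmt.close();
        return recordID;
    }
}
```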
```java
nextlink = nextlink.substring(mainurl.length());
if (nextlink.startsWith("tag/")) {
    tag = nextlink.substring(4, nextlink.length() - 1);
    tag = URLDecoder.decode(tag, "UTF-8");
    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
}
```
Stripping the leading "http://johnhany.net/" from the link speeds up the string comparisons. If the remainder starts with "tag/", the characters after it are a tag's name; the name is extracted, URL-decoded as UTF-8 so that Chinese characters come through intact, and stored in the tags table. In the same way you could test for "article/", "author/", or "2013/11/" to classify the other links.
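A sketch of that classification idea (the category labels here are invented for illustration):

```java
public class ClassifyDemo {
    //classify a link that has already had "http://johnhany.net/" stripped off
    static String classify(String path) {
        if (path.startsWith("tag/")) {
            return "tag";
        } else if (path.startsWith("article/")) {
            return "article";
        } else if (path.startsWith("author/")) {
            return "author";
        } else if (path.matches("\\d{4}/\\d{2}/.*")) { //e.g. "2013/11/..."
            return "archive";
        }
        return "other";
    }
}
```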
Results
Two database screenshots show part of the program's results.
The complete output can be obtained here. Compare it with this blog's sitemap to see what changes would still be needed to turn the crawler into a sitemap generator.
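As a starting point, here is a minimal sketch of that sitemap idea: read each crawled URL back out of the record table and print it in sitemap XML. The element layout follows the sitemaps.org protocol; optional fields such as lastmod and changefreq are omitted:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class SitemapDemo {
    //print a bare-bones sitemap built from every page the crawler visited
    static void printSitemap(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 1");
        System.out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        System.out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
        while (rs.next()) {
            System.out.println("  <url><loc>" + rs.getString(1) + "</loc></url>");
        }
        System.out.println("</urlset>");
        rs.close();
        stmt.close();
    }
}
```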