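There is not much Chinese-language material on Scrapy, so I am writing this post as a record. The example crawls cnbeta news
and scrapes each article's title and link.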
1. Create a new Scrapy project
scrapy startproject cnbeta
Directory structure:
cnbeta/
├── cnbeta
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
2. Define the data structure
Edit cnbeta/items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class CnbetaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    url = Field()
This defines two fields, which store the title and the link respectively.
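To illustrate what declaring fields means in practice, here is a rough stand-in built on a plain dict subclass. It is not Scrapy's implementation, and SketchItem and CnbetaSketch are made-up names used only for this sketch. The behaviour it imitates is that an item accepts only the keys declared as fields and rejects any other key with a KeyError.

```python
class SketchItem(dict):
    """Hypothetical stand-in for a Scrapy Item: a dict limited to declared fields."""

    # Subclasses list the keys they allow (assumption: a simple tuple of names).
    fields = ()

    def __setitem__(self, key, value):
        # Refuse keys that were not declared, similar in spirit to a real Item.
        if key not in self.fields:
            raise KeyError('%s does not support field: %s'
                           % (type(self).__name__, key))
        super().__setitem__(key, value)


class CnbetaSketch(SketchItem):
    # Mirrors the two fields declared in CnbetaItem.
    fields = ('title', 'url')


item = CnbetaSketch()
item['title'] = 'some headline'
item['url'] = 'http://www.cnbeta.com/articles/268661.htm'

try:
    item['author'] = 'nobody'   # not a declared field
except KeyError as exc:
    print('rejected:', exc)
```

The benefit of this design is that a typo in a field name fails loudly in the spider instead of silently producing a column nobody asked for.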
3. Write the spider
Edit cnbeta/spiders/cb.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem


class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.cnbeta.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm', )),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
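The rules attribute specifies that every link containing /articles/.*\.htm is matched; each matched page is passed to parse_page, and follow=True keeps the crawl going from those pages.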
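To get a feel for which URLs the rule's pattern accepts, it can be tried out with the standard re module. This is a minimal sketch that assumes the link extractor applies the pattern with search semantics (a match anywhere in the URL). The helper is_article_link and the second and third sample URLs are made up for illustration.

```python
import re

# The same pattern that is passed to allow= in the spider's rule.
ARTICLE_PATTERN = re.compile(r'/articles/.*\.htm')


def is_article_link(url):
    """Return True if the pattern occurs anywhere in the URL."""
    return ARTICLE_PATTERN.search(url) is not None


samples = [
    'http://www.cnbeta.com/articles/268661.htm',   # an article page
    'http://www.cnbeta.com/topics/9.htm',          # hypothetical non-article page
    'http://www.cnbeta.com/articles/268661.html',  # hypothetical .html variant
]

for url in samples:
    print(url, is_article_link(url))
```

Because the pattern is not anchored at the end, the .html variant is accepted as well; appending $ to the pattern would restrict it to URLs ending in .htm.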
4. Run the spider
scrapy crawl cnbeta -o result.json -t json
This writes the results to result.json; -t json sets the output format to JSON.
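The exported file is ordinary JSON, so it can be read back with the standard json module. The sketch below works under two assumptions: each record has the url and title keys set in parse_page, and title is a list because extract() returns a list. The helper load_items is a made-up name, and the embedded SAMPLE record stands in for the contents of result.json so that the snippet runs on its own.

```python
import json

# One record in the exporter's format; in practice the text would come from
# open('result.json', encoding='utf-8').read().
SAMPLE = r'''[
{"url": "http://www.cnbeta.com/articles/268866.htm", "title": ["\u6bd4\u7279\u5e01\u4ea4\u6613\u5e73\u53f0\u906d\u9047\u751f\u5b58\u5371\u673a_cnBeta \u89c6\u70b9\u89c2\u5bdf_cnBeta.COM"]}
]'''


def load_items(text):
    """Parse exported JSON and flatten each title list to a single string."""
    items = []
    for record in json.loads(text):
        # Fall back to an empty title if the page had no <title> text.
        titles = record.get('title') or ['']
        items.append({'url': record['url'], 'title': titles[0].strip()})
    return items


for entry in load_items(SAMPLE):
    # json.loads has already turned the \uXXXX escapes into readable characters.
    print(entry['url'], entry['title'])
```

The \uXXXX sequences in the file are just JSON escapes for the Chinese characters of the page title; nothing is lost, and decoding restores the original text.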
5. Results
[
{"url": "http://www.cnbeta.com/articles/268661.htm", "title": ["\u53ea\u97001\u5143\u4e0d\u62a2\u624d\u75af\uff01\u5c0f\u5ea6Wi-Fi\u5957\u88c5\u9707\u64bc\u4ef716\u65e5\u5f00\u62a2_Baidu \u767e\u5ea6_cnBeta.COM"]},
{"url": "http://www.cnbeta.com/articles/268872.htm", "title": ["\u8c37\u6b4c\u667a\u80fd\u5bb6\u5c45\u8ba1\u5212\u6216\u72af\u7684\u9519\uff1a\u5c01\u95edNest_Google / \u8c37\u6b4c_cnBeta.COM"]},
{"url": "http://www.cnbeta.com/articles/268865.htm", "title": ["\u4e2d\u56fd\u624b\u673a\u7f51\u6c11\u89c4\u6a21\u8fbe5\u4ebf \u5e74\u589e\u957f8009\u4e07\u4eba_cnBeta \u89c6\u70b9\u89c2\u5bdf_cnBeta.COM"]},
{"url": "http://www.cnbeta.com/articles/268869.htm", "title": ["\u524d\u82f9\u679c\u9ad8\u7ea7\u526f\u603b\u88c1\u4e3a\u5927\u5b66\u4ee3\u8a00\uff1f\u539f\u662f\u56fe\u7247\u88ab\u76d7\u7528_cnBeta \u4eba\u7269_cnBeta.COM"]},
{"url": "http://www.cnbeta.com/articles/268866.htm", "title": ["\u6bd4\u7279\u5e01\u4ea4\u6613\u5e73\u53f0\u906d\u9047\u751f\u5b58\u5371\u673a_cnBeta \u89c6\u70b9\u89c2\u5bdf_cnBeta.COM"]},
{"url": "http://www.cnbeta.com/articles/268870.htm", "title": ["\u76db\u5927\u6e38\u620f\u4f20\u5947\u88ab\u4fb5\u6743\u6848\u65b0\u8fdb\u5c55\uff1a\u8ffd\u52a0\u56db\u540d\u88ab\u544a_cnBeta \u89c6\u70b9\u89c2\u5bdf_cnBeta.COM"]},
{"url": "http://www.cnbeta.com/articles/268867.htm", "title": ["\u91d1\u878d\u65f6\u62a5\uff1a\u4e2d\u56fd\u8d2b\u56f0\u5730\u533a\u7f51\u8d2d\u589e\u901f\u8d85\u53d1\u8fbe\u5730\u533a_\u7535\u5b50\u5546\u52a1 - B2C / B2B_cnBeta.COM"]},
{"url": "http://www.cnbeta.com/articles/268868.htm", "title": ["\u79fb\u52a84G\u7248iPhone 5s/5c\u4e0a\u624b\uff1a\u901f\u5ea6\u6539\u53d8\u4f53\u9a8c_Apple iPhone_cnBeta.COM"]},
{"url": "http://www.cnbeta.com/articles/268871.htm", "title": ["[\u7ec4\u56fe]\u7247\u573a\u63a2\u79d8\u544a\u8bc9\u4f60\u7d22\u5c3c4K\u5f71\u7247\u662f\u600e\u4e48\u70bc\u6210\u7684_SONY \u7d22\u5c3c_cnBeta.COM"]},
{"url": "http://www.cnbeta.com/articles/268875.htm", "title": ["USB 3.0\u548c\u534