Goal: use the Scrapy framework to crawl each post's number, title, content, and URL, and store the results in a MongoDB database.
1. Define the fields the project will scrape (items.py)
import scrapy

# Fields the project scrapes
class ComplaintspiderItem(scrapy.Item):
    # post number
    number = scrapy.Field()
    # post title
    title = scrapy.Field()
    # post content
    content = scrapy.Field()
    # post URL
    url = scrapy.Field()
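A scrapy.Item is filled with dict-style key access inside the spider's parse callback. As a rough illustration that runs without Scrapy installed, the snippet below uses a plain dict as a stand-in (the values are hypothetical placeholders, not real data from the site):

```python
# Illustration only: a plain dict stands in for ComplaintspiderItem,
# which supports the same key-based assignment and lookup.
item = {}
item['number'] = '191166'              # hypothetical post number
item['title'] = 'Example post title'   # hypothetical post title
item['content'] = 'Example post body'  # hypothetical post content
item['url'] = 'http://wz.sun0769.com/example'  # hypothetical post URL
print(item['number'], item['title'])
```

In the real spider, `item = ComplaintspiderItem()` replaces the dict, and yielding the item hands it to the pipeline for storage.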
2. Crawl the pages and extract structured item data (spiders/complaint.py)
import scrapy
from ComplaintSpider.items import ComplaintspiderItem

class ComplaintSpider(scrapy.Spider):
    name = 'complaint'
    # Restrict the crawl to this domain. This attribute is optional, but if
    # it is omitted the crawl is not limited to any domain, which can let
    # the spider run out of control.
    allowed_domains = ['wz.sun0769.com']
    url = 'http://wz.sun0769.com/index.php/question/questionType?type=4&page='
    offset = 0
    start_urls = [url + str(offset)]
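The spider pages through listings by appending an increasing offset to the base URL. A minimal sketch of that pagination logic, assuming (not confirmed by this section) that the site's page parameter advances in steps of 30 posts per listing page:

```python
# Sketch of the spider's pagination scheme: base URL + numeric offset.
# The step size of 30 is an assumption about the site's page size.
url = 'http://wz.sun0769.com/index.php/question/questionType?type=4&page='
step = 30  # assumed number of posts per listing page
# Build the first three listing-page URLs (page=0, page=30, page=60).
start_urls = [url + str(offset) for offset in range(0, 3 * step, step)]
print(start_urls)
```

Inside the real spider, the parse callback would yield a new scrapy.Request for `url + str(self.offset)` after bumping `self.offset` by the step, until no further pages remain.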