Principle
Following a set of rules, extract the "next page" (or other pagination) links from each page, so the spider can walk through every page of a listing on its own.
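As a rough illustration of how this works, here is the same pagination pattern used in the spider below, applied with Python's re module; LinkExtractor applies its allow patterns to candidate URLs in much the same way (the sample URLs are made up):

    import re

    # Pagination pattern from the spider's first Rule below.
    pattern = r'/t/html5\?type=newest&page=\d'

    # A listing page matches, a question detail page does not.
    print(bool(re.search(pattern, 'http://segmentfault.com/t/html5?type=newest&page=2')))  # True
    print(bool(re.search(pattern, 'http://segmentfault.com/q/1234567')))                   # False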
Modules used
from pyquery import PyQuery as pq
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
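A minimal sketch of how PyQuery is used here to pull text out of HTML (the sample markup is made up for illustration; the spider's parse_item() follows the same call pattern):

    from pyquery import PyQuery as pq

    html = '<html><head><title>html5 - SegmentFault</title></head><body></body></html>'
    doc = pq(html)
    print(doc('title').text())  # -> html5 - SegmentFault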
Complete code
    # -*- coding: utf-8 -*-
    from pyquery import PyQuery as pq
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class SegSpider(CrawlSpider):
        name = "seg"
        allowed_domains = ["segmentfault.com"]
        start_urls = (
            'http://segmentfault.com/t/html5?type=newest&page=1',
        )

        rules = (
            # Follow pagination links (?type=newest&page=2, page=3, ...).
            # No callback: these pages are only crawled to discover more links.
            Rule(SgmlLinkExtractor(allow=(r'/t/html5\?type=newest&page=\d', ))),
            # Follow question detail links like /q/<digits> and parse each one.
            Rule(SgmlLinkExtractor(allow=(r'/q/\d+', )), callback='parse_item'),
        )

        def parse_item(self, response):
            # Parse the raw HTML with PyQuery; keep the URL and <title> text.
            v = pq(response.body)
            item = dict()
            item['url'] = response.url
            item['title'] = v('title').text()
            yield item
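The spider can be run with the standard CLI, for example scrapy crawl seg -o items.json to dump the scraped items to a JSON file.

One caveat: the scrapy.contrib package (including SgmlLinkExtractor) was deprecated in Scrapy 1.0 and removed in later releases. On a current Scrapy install, a roughly equivalent spider would look like the sketch below (same logic, modern import paths; untested against today's segmentfault.com):

    # -*- coding: utf-8 -*-
    from pyquery import PyQuery as pq
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class SegSpider(CrawlSpider):
        name = 'seg'
        allowed_domains = ['segmentfault.com']
        start_urls = ['http://segmentfault.com/t/html5?type=newest&page=1']

        rules = (
            # Follow pagination links to discover more pages.
            Rule(LinkExtractor(allow=r'/t/html5\?type=newest&page=\d')),
            # Follow question detail links and parse them.
            Rule(LinkExtractor(allow=r'/q/\d+'), callback='parse_item'),
        )

        def parse_item(self, response):
            doc = pq(response.body)
            yield {'url': response.url, 'title': doc('title').text()}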
To recap: this post presented a paginated crawler built on the Scrapy framework. The spider targets SegmentFault, follows pagination links automatically, and scrapes the detail page of each question under a given tag. The complete Python source above covers the key steps: parsing pages with PyQuery, defining crawl rules, and extracting links.
