Learning Scrapy

呆傻程序员

Category: Crawlers. Tags: scrapy

Article link: https://blog.csdn.net/fanshuquan/article/details/50781178

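Common commands

scrapy --help
scrapy version -v
startproject
genspider a a.com (older Scrapy versions require this to be run inside a project; it can be run repeatedly to generate multiple spiders)
list
view + URL to crawl
parse (fetches the page, parses it with the spider's parse logic, and prints the result)
shell + URL to crawl (opens Scrapy's interactive environment for testing; no project is required, and the response object is the usual starting point for experiments)
response.css()
response.xpath()
response.selector.re()
runspider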

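XPath

For details on XPath usage, see the official documentation, which teaches by example and is very intuitive. An even easier approach is to copy the XPath of an element directly from Firebug.

Initial test code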

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # Follow the link of every question on the listing page.
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Build one item per question page.
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

To run this code, save it as stackoverflow_spider.py and execute

scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json
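
The spider's parse method relies on response.urljoin to turn each href into an absolute URL. As a rough sketch of that resolution step, the standard library's urljoin can be applied to the page URL directly; the href values below are hypothetical, and Scrapy's own method additionally honors a base tag in the page, which this sketch ignores.

```python
from urllib.parse import urljoin

# URL of the listing page, taken from start_urls in the spider.
page_url = "http://stackoverflow.com/questions?sort=votes"

# Hypothetical href values of the kind found in the listing's <a> tags.
hrefs = [
    "/questions/11227809/example",  # root-relative
    "tagged/python",                # relative to the current path
    "http://example.com/x",         # already absolute
]

full_urls = [urljoin(page_url, href) for href in hrefs]
for url in full_urls:
    print(url)
```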

Saving whole pages

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the file after the last path segment of the URL.
        filename = response.url.split("/")[-2] + '.html'
        # response.body is bytes, so the file is opened in binary mode.
        with open(filename, 'wb') as f:
            f.write(response.body)

Run it from inside the project with scrapy crawl dmoz
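
The file name in parse comes from splitting the URL on slashes. A small sketch of just that step, using the two start URLs from the spider (filename_for is a helper name introduced here for illustration):

```python
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]


def filename_for(url):
    # The URL ends with "/", so the last piece after splitting is empty
    # and the second-to-last piece is the final path segment.
    return url.split("/")[-2] + ".html"


filenames = [filename_for(url) for url in start_urls]
print(filenames)
```

This only works as intended when the URL ends with a slash; without the trailing slash, index -2 would pick the parent segment instead.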

Basic steps for using Scrapy

Create a project
Define the items
Write the spider
Configure the pipeline
Run the crawler
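
Of the steps above, the pipeline is the one the earlier examples do not show. The following is a minimal sketch of what one might look like, assuming items are plain dicts; the class name JsonLinesPipeline, the file name items.jl, and the module path in the final comment are hypothetical. The last lines drive the pipeline by hand so the sketch runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields; returning the item
        # lets any later pipeline process it as well.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Quick check outside Scrapy: call the hooks in the order Scrapy would.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Example", "votes": "42"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)

# In a project, the pipeline would be enabled in settings.py, for example:
# ITEM_PIPELINES = {"myproject.pipelines.JsonLinesPipeline": 300}
```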

Spider execution flow

Send a request
Receive the response
Extract data with selectors
Store the items

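The scrapy.Spider class

Attributes
name
allowed_domains
start_urls
custom_settings (class-level dict of settings that override the project-wide configuration for this spider)
crawler
settings (the settings this spider instance runs with)
logger
Methods
from_crawler(): class method that Scrapy uses to create spiders
start_requests(): generates the initial requests
make_requests_from_url(url): builds a request from a URL (deprecated in later Scrapy versions)
parse(response): the default callback that parses a response
log(): the older logging wrapper; the current style is
self.logger.info("success")
closed(reason): called when the spider closes
Subclasses
CrawlSpider (adds a rules member that controls, for example, which pages are followed or skipped and which callback parses them, plus a parse_start_url(response) method)
XMLFeedSpider
CSVFeedSpider
SitemapSpider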