Python Web Scraping: Learning the Scrapy Framework

learn_is_happy

Tags: python, crawler framework

Article link: https://blog.csdn.net/learn_is_happy/article/details/78836145

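I. Steps:
Create a project (Project): set up a new crawler project
Define the target (Items): decide which fields you want to scrape
Write the spider (Spider): write the spider that crawls the pages
Store the content (Pipeline): design a pipeline that stores the scraped content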

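1. Create the project
scrapy startproject quotes
The argument to startproject is the project name (quotes here, matching the imports used in the spider). A domain such as baidu.com is not passed to startproject; it goes to genspider, which creates a spider skeleton inside the project:
scrapy genspider books quotes.toscrape.com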

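2. Define the target
In Scrapy, an Item is the container that holds the scraped data. It behaves much like a Python dict, but adds some protection against mistakes: assigning to a field that was never declared raises an error.
Typically an item is created by subclassing scrapy.item.Item, and its attributes are declared with scrapy.item.Field objects (comparable to the field mapping in an ORM).
Next, we build the item model.
The fields we want are:
Author (author)
Content (text)
Tags (tags)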

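3. Write the spider (the most important step)

# -*- coding: utf-8 -*-
import sys

import scrapy

# Make the project package importable when the spider is run from
# outside the project directory (path from the author's machine).
sys.path.append("D:\\pycodes\\quotes")
from quotes.items import quotesItem


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for sel in response.xpath('//div[@class="quote"]'):
            item = quotesItem()
            # These XPaths are relative to the current quote block.
            # extract_first() returns a single string instead of a list.
            item['text'] = sel.xpath('span[@class="text"]/text()').extract_first()
            item['author'] = sel.xpath('span/small/text()').extract_first()
            # A quote can carry several tags, so keep the whole list.
            item['tags'] = sel.xpath('div/a/text()').extract()
            yield item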

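4. Design the pipeline

Item data is processed by pipeline classes.

class DoubanPipeline(object):
    # Default pipeline: passes every item through unchanged.
    def process_item(self, item, spider):
        return item


class DoubanInfoPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.f = open("result.txt", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file.
        self.f.close()

    def process_item(self, item, spider):
        try:
            # One item per line, written as the repr of a plain dict.
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except (OSError, ValueError) as exc:
            # Log the failure instead of silently swallowing it.
            spider.logger.warning("Could not write item: %s", exc)
        return item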
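
As a variation on the file-writing pipeline, the sketch below writes one JSON object per line, which is easier to read back than the repr of a dict. The class name JsonLinesPipeline, the default file name result.jl and the sample item are assumptions for illustration; the pipeline is driven by hand here so the example runs without starting a crawl.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: stores each item as one JSON line."""

    def __init__(self, path="result.jl"):
        self.path = path

    def open_spider(self, spider):
        self.f = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# Call the hooks in the order Scrapy would, using a temporary file.
path = os.path.join(tempfile.mkdtemp(), "result.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"author": "Someone", "tags": ["demo"]}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
```

In a real project a pipeline only runs once it is listed in the ITEM_PIPELINES setting in settings.py, keyed by its import path with an integer priority.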

II. Notes:

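1. Using XPath selectors
response.xpath('//div/@href').extract()
response.xpath('//div[@href]/text()').extract()
response.xpath('//div[contains(@href, "image")]/@href').extract()

To select p elements below a div that are not its direct children, use
div.xpath('.//p')
Note the leading dot: without it, //p searches the whole document instead of just that div.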

2. Using .re() on selectors
Selector also has a .re() method for extracting data with regular expressions. Unlike .xpath() or .css(), however, .re() returns a list of unicode strings rather than selectors, so .re() calls cannot be nested.

The following example extracts image names from the link text of an HTML page:

response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']

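3. The re:test() function
When XPath's starts-with() or contains() are not enough, the test() function can be very useful.

For example, to select the links in list items whose "class" attribute ends with a digit:

from scrapy import Selector

doc = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
sel = Selector(text=doc, type="html")
sel.xpath('//li//@href').extract()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
['link1.html', 'link2.html', 'link4.html', 'link5.html']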

3. Iterating with an index using enumerate:
for index, link in enumerate(links):
    print(index, link)
Output:
0 link1
1 link2
…

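4. You do not have to follow all four steps
Sometimes items.py can be left unchanged, and the spider can yield plain dicts directly, for example:
yield {...}
and so on.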

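5. Following links recursively to crawl page by page

In the parse(self, response) method, add:

# Put the XPath of the next-page link's href inside the quotes.
next_page = response.xpath("").extract_first()
if next_page:
    # Turn a relative href into an absolute URL before requesting it.
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)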

6. How to avoid 403 errors:
Edit the project's settings.py and set USER_AGENT so that requests look like they come from a browser:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
