Python 3 Web Crawler Study Handbook

Latest recommended article published on 2024-07-24 16:36:32

Hogwarts扫地老太太

Category: Python. Tags: big data

Article link: https://blog.csdn.net/weixin_45549370/article/details/108688406

This handbook is an original compilation.

1. Basic concepts

1.1 API (application programming interface)

2. Scrapy project in practice

2.1 Prerequisites

黑马程序员

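Components

item: stores the data crawled by the spider

spider: crawls the data

middleware: the middleware layer

pipelines: process the data held in items

Configuration info

request_count: number of requests

request_method_count: number of requests per method

response_count: number of responses
response_status_count: number of responses per status code

spider crawls data -> stores it in an item -> hands the item to the pipeline -> control returns to the spider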
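
The flow above can be imitated in plain Python. This is only a sketch: `spider`, `pipeline`, and `store` are made-up stand-ins rather than Scrapy APIs, and the real engine schedules these steps itself. It shows how a generator pauses at each `yield` until the item has been processed.

```python
# A plain-Python stand-in for the flow above; none of these names are Scrapy APIs.

def spider(pages):
    """Yield one item (a dict here) per crawled page."""
    for page in pages:
        item = {"title": page.strip()}
        yield item  # execution pauses here until the next item is requested


def pipeline(item, store):
    """Process one item, then return it so the spider can resume."""
    store.append(item["title"].upper())
    return item


store = []
for item in spider([" first post ", " second post "]):
    pipeline(item, store)

print(store)  # ['FIRST POST', 'SECOND POST']
```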

2.2 Install Python 3 and update pip

In the folder that contains the Python IDE, open cmd as administrator

Enter pip install scrapy

A warning appears:

WARNING: You are using pip version 19.3.1; however, version 20.2.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Update pip:

python -m pip install --upgrade pip

E:\python_3.7.4\Scripts>python -m pip install --upgrade pip
Collecting pip
  Using cached https://files.pythonhosted.org/packages/4e/5f/528232275f6509b1fff703c9280e58951a81abe24640905de621c9f81839/pip-20.2.3-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 19.3.1
    Uninstalling pip-19.3.1:
      Successfully uninstalled pip-19.3.1
Successfully installed pip-20.2.3

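Installation is complete for now.

However, entering scrapy in cmd raises an error:

TypeError: attrs() got an unexpected keyword argument 'eq'

The cause is an attrs version that is too old (found via Google).

Enter:

pip3 install attrs==19.2.0 -i http://mirrors.aliyun.com/pypi/simple  --trusted-host mirrors.aliyun.com

Enter scrapy in cmd again to check the installation.

It now succeeds: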

Usage:
  scrapy <command> [options] [args]

2.3 Performance test

Usage

Every command starts with scrapy.
To run Scrapy's built-in benchmark:
```
scrapy bench
```

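This reports the crawl speed.
You may run into a port-in-use problem; for the fix, see: 如何查找接口占用并结束进程 (how to find what is occupying a port and end the process)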

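3. Create a project (default skeleton) and inspect it

Create the project

In cmd, enter:

scrapy startproject MyScrapy20200919

MyScrapy20200919: project name

dir

Configuration files

settings.py

USER_AGENT: put the URL here
request headers: the HTTP message headers
spider middlewares
download middlewares: {key: value}; the value sets the order, and a middleware with a smaller value sits closer to the engine, so its process_request runs earlier
item pipelines
ROBOTSTXT_OBEY: usually turned off
COOKIES_ENABLED: usually turned off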
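
As a rough illustration of the settings listed above, a settings.py fragment might look like the following. The setting names are Scrapy's; every value is a placeholder, and the middleware class path is hypothetical rather than something generated for this project.

```python
# settings.py (fragment): illustrative values only

# Identifies the crawler to the sites it visits; the string is a placeholder.
USER_AGENT = "MyScrapy20200919 (+http://www.yourdomain.com)"

# Headers sent with every request unless a request overrides them.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# {class path: order}. The integer fixes the middleware's position in the chain.
# The class path below is hypothetical.
DOWNLOADER_MIDDLEWARES = {
    "MyScrapy20200919.middlewares.ExampleDownloaderMiddleware": 543,
}

# Both switched off, as the notes above suggest.
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
```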

4. Create a spider

4.1 Worked example

In cmd, enter:

scrapy genspider itcast "https://www.csdn.net"

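Output:

Created spider 'itcast' using template 'basic' in module:
  MyScrapy20200919.spiders.itcast

Explanation: a spider named "itcast" is created from the basic template, with crawl scope "https://www.csdn.net".

scrapy check itcast

Checks the spider named itcast (this runs its contract checks).

Note: open the URL in a browser first to confirm it works; otherwise an error like this is raised: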

 "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: http.
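
One way a lookup error of this kind can arise is a URL whose scheme was written twice, so that a scheme name lands where the hostname should be. That cause is an assumption here; the notes do not state it. A quick check with the standard library shows which hostname a client would try to resolve. `hostname_of` is a helper invented for this sketch.

```python
from urllib.parse import urlsplit


def hostname_of(url):
    """Return the hostname a client would try to resolve for this URL."""
    return urlsplit(url).hostname


print(hostname_of("https://www.csdn.net"))          # www.csdn.net
print(hostname_of("http://https://www.csdn.net/"))  # https
```

If the printed hostname is not the site you meant, fix the URL before running the spider.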

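4.2 Locate your spider file

Spider files all live in the spiders folder of their project:

Path: project_name/project_name/spiders/spider_name.py

5. Define the crawl target

5.1 W3C standards

5.1.1 HTML markup language

See 《Python爬虫开发与项目实战》, chapter 2.1.1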

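<br>: forced line break tag (used on its own, with no closing tag).
<p>: paragraph tag.
A new <p> inside an open paragraph closes the previous one, so paragraphs do not truly nest.
<center>: centered alignment tag
- <ul>: unordered list tag
- <li>: list item tag
1. <ol>: ordered list tag
2. type="1" means Arabic numerals
3. type="I" means uppercase Roman numerals
4. type="i" means lowercase Roman numerals
5. Note: lists can be nested.
<dl>: definition list
<dt>: bold list
<dd>: indented list
<div>: section (division) tag
<hn>: heading tag.
There are 6 levels;
n ranges from 1 to 6, and each level renders at a different size:
h1 is the largest, h6 the smallest
<img>: image tag
<a>: hyperlink tag

Class	Name	Age	Hometown
1500001	Class (1)	张三	16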

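5.1.2 CSS (refines HTML)

5.1.3 JavaScript

A weakly typed scripting language that can be embedded directly in HTML pages

See 《JavaScript DOM 编程艺术》

5.1.4 XPath

Used to locate information in XML documents; in crawlers it is mainly used to extract information from web pages

5.2 HTTP protocol

5.2.1 Cookie state management

See 《Python爬虫开发与项目实战》, chapter 2.2.4

A cookie is a unique identifier (for example a JSESSIONID) that the server sends to the User-Agent (the client) together with the response to its request. When the same User-Agent makes another request, it sends the cookie back so the server can recognize it.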
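
The round trip can be sketched with Python's standard http.cookies module. The session id "abc123" is made up, and no network traffic happens here: the code only builds the header a server would send and the header a client would send back.

```python
from http.cookies import SimpleCookie

# Server side: build the Set-Cookie header for a first response.
server_cookie = SimpleCookie()
server_cookie["JSESSIONID"] = "abc123"  # made-up session id
server_cookie["JSESSIONID"]["path"] = "/"
set_cookie_header = server_cookie.output(header="Set-Cookie:")
print(set_cookie_header)  # Set-Cookie: JSESSIONID=abc123; Path=/

# Client side: parse the header value, then send the cookie back.
client_cookie = SimpleCookie()
client_cookie.load(set_cookie_header.split(":", 1)[1].strip())
cookie_header = "; ".join(f"{k}={m.value}" for k, m in client_cookie.items())
print("Cookie:", cookie_header)  # Cookie: JSESSIONID=abc123
```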

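5.3 Define the crawl target: edit items.py

In the project folder MyScrapy20200919, find items.py and define the items to crawl in it

For example, to crawl the titles and release times of my own blog posts, I define them as attributes of the class Myscrapy20200919Item, which inherits from scrapy.Item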

# blog title
title = scrapy.Field()
    
# blog release time
time = scrapy.Field()

The complete items.py follows:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

"""
File name: 
    items.py

Target:
    crawl the title and release time of blogs in CSDN
"""

import scrapy


class Myscrapy20200919Item(scrapy.Item):
    # define the fields for your item here like:
    # blog title
    title = scrapy.Field()
    
    # blog release time
    time = scrapy.Field()

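5.4 Crawl page content: edit xxx.py (xxx is the spider name)

5.4.1 Import the item class into the spider

In 4.1 I defined my own spider, itcast

In the spiders folder, find the spider file itcast.py

In a Python program, every .py file can be treated as a module. Importing A.py into the current .py file makes everything defined in A.py available, such as classes, variables, and functions.

If the project folder contains an __init__.py file, the .py files in it are packaged as modules and can be imported from other .py files

For example:

# import the class Myscrapy20200919Item from the items.py module of the MyScrapy20200919 package
from MyScrapy20200919.items import Myscrapy20200919Item

Make sure the folder name is spelled exactly right, otherwise the import fails!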
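
The package mechanism can be tried in isolation. The sketch below builds a throwaway package in a temporary folder and imports a class from it; `demo_project` and `DemoItem` are made-up names that only mirror the layout of the real project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package: demo_project/__init__.py and demo_project/items.py.
# "demo_project" and "DemoItem" are made-up names for this sketch.
root = Path(tempfile.mkdtemp())
package = root / "demo_project"
package.mkdir()
(package / "__init__.py").write_text("")
(package / "items.py").write_text("class DemoItem(dict):\n    pass\n")

# Make the parent folder importable, then import the class by its dotted path.
sys.path.insert(0, str(root))
items = importlib.import_module("demo_project.items")
DemoItem = items.DemoItem
print(DemoItem.__module__)  # demo_project.items

# A misspelled package name fails at import time.
try:
    importlib.import_module("demo_projct.items")
except ModuleNotFoundError as error:
    print(error)  # No module named 'demo_projct'
```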

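5.4.2 Locate page information with XPath

Add XPath Helper

Download the XPath Helper extension from the Chrome Web Store, then restart Chrome

Open "Developer tools" in the browser

Chrome developer mode

Locate the content to crawl with XPath

For an XPath primer, see 《Python爬虫开发与项目实战》, chapter 2.1.4

In the Elements panel, use the highlighting to find the relevant part of the HTML document, then enter the node path in the QUERY box of XPath Helper

The RESULTS box shows the matching content

Inspecting the HTML document via XPath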
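
To get a feel for such queries without a browser, the standard library's ElementTree can evaluate a limited subset of XPath on well-formed markup. This is only a sketch: the fragment below is made up, real pages are rarely well-formed XML, and Scrapy uses its own selectors rather than ElementTree.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment shaped like a blog list page.
page = """
<div>
  <div class="post"><h4><a href="/p/1">First post</a></h4><span class="date">2020-09-19</span></div>
  <div class="post"><h4><a href="/p/2">Second post</a></h4><span class="date">2020-09-20</span></div>
</div>
"""
root = ET.fromstring(page)

# ".//" searches all descendants; [@class='date'] filters on an attribute value.
titles = [a.text for a in root.findall(".//h4/a")]
dates = [s.text for s in root.findall(".//span[@class='date']")]

print(titles)  # ['First post', 'Second post']
print(dates)   # ['2020-09-19', '2020-09-20']
```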

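5.4.3 Crawl the XPath nodes: edit the spider file

For example, the XPath for the blog titles I want to crawl is //h4/a, and the XPath for the release times is //div[@class = 'info-box d-flex align-content-center']/p

So I modify the parse method in the corresponding spider file, itcast.py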

# -*- coding: utf-8 -*-

import scrapy
# import the item class from the items.py module of the MyScrapy20200919 package
from MyScrapy20200919.items import Myscrapy20200919Item

# one spider named itcast
class ItcastSpider(scrapy.Spider):
    # name of the spider
    name = 'itcast'
    
    # allowed_domains = ['csdn.net']  # optional; domain names only, no scheme
    # 1. crawling starts from the URLs in this list
    start_urls = ['https://blog.csdn.net/weixin_45549370']

    # 2. URLs are queued -> dequeued -> de-duplicated -> handed to the downloader
    
    # 3. each response the downloader fetches is passed to parse() for extraction
    def parse(self, response):
#        # write the page to a file named "MyBlogs.html"
#        with open("MyBlogs.html", "w", encoding="utf-8") as file:
#            file.write(response.text)
        
        # //h4/a/text()[2] and //span[@class = 'date'] are the XPath
        # expressions for the blog titles and the release times;
        # extract() turns the SelectorList returned by xpath() into a list of strings
        node_TitleList = response.xpath("//h4/a/text()[2]").extract()
        node_TimeList = response.xpath("//span[@class = 'date']").extract()
        
        # pair each blog title with its release time
        for node in zip(node_TitleList, node_TimeList):
            # create an object to store the info
            item = Myscrapy20200919Item()
         
#            # to print directly, first convert what xpath() returns into strings
#            blog_title = node.xpath("./a/text()").extract()
#            
#            print (blog_title)
             
            # treat the item like a dict: store values under the keys defined in items.py
            item['title'] = node[0]
            item['time'] = node[1]
            
            # each iteration hands one item to the pipeline, which writes it to the data store
            yield item  # generator: the next iteration resumes from here
            # after the pipeline finishes an item, the loop continues until it ends

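Note:

xpath() returns a list of Selector objects; you must call extract() to turn them into strings, otherwise writing the itcast.json output file fails with

TypeError: Object of type Selector is not JSON serializable

yield item hands the items to the pipelines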
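
The error is not specific to Scrapy: json can only serialize basic types such as str, int, list, and dict. The sketch below reproduces it with `FakeSelector`, a stand-in class invented for this example (it is not Scrapy's Selector), and shows that extracting a string first avoids the problem.

```python
import json


class FakeSelector:
    """Stand-in for a selector object; not Scrapy's Selector class."""

    def __init__(self, text):
        self.text = text

    def extract(self):
        return self.text


node = FakeSelector("2020-09-19")

# Passing the object itself fails ...
try:
    json.dumps({"time": node})
except TypeError as error:
    print(error)  # Object of type FakeSelector is not JSON serializable

# ... while passing the extracted string works.
serialized = json.dumps({"time": node.extract()})
print(serialized)  # {"time": "2020-09-19"}
```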

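5.5 Enable and configure pipelines

Enable pipelines: edit settings.py

Find the ITEM_PIPELINES dict

The smaller the value in ITEM_PIPELINES, the higher the priority; an item passes through higher-priority pipelines first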
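
A small sketch of that ordering rule: the first entry uses this project's pipeline class, with 300 as an assumed value, and the second entry is a hypothetical pipeline added only to show which one runs first.

```python
ITEM_PIPELINES = {
    # Class from this project's pipelines.py; 300 is an assumed value.
    "MyScrapy20200919.pipelines.Myscrapy20200919Pipeline": 300,
    # Hypothetical second pipeline, added only to show ordering.
    "MyScrapy20200919.pipelines.ExampleCleanupPipeline": 100,
}

# Items visit pipelines in ascending order of value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order[0])  # MyScrapy20200919.pipelines.ExampleCleanupPipeline
```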

Process the data taken from the item: edit pipelines.py

In the process_item method, serialize the item and write it to a .json file

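    # called for every item the spider yields
    def process_item(self, item, spider):
        # convert the item to a dict and serialize it as JSON;
        # ensure_ascii=False keeps Chinese characters as-is instead of \u escapes
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        
        # encode the JSON text as UTF-8 and write it to the file
        self.file.write(content.encode('utf-8'))
        
        # returning the item tells the engine this pipeline is done with it;
        # if other pipelines are enabled, the item passes through each in turn,
        # and once all have finished, the spider moves on to its next item
        return item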

Functions for opening and closing the file (optional):

    # open the output JSON file
    def __init__(self):
        # open itcast_pipeline.json in binary mode, for both reading and writing
        self.file = open("itcast_pipeline.json", "wb+")

    # close the file once writing is done
    def close_spider(self, spider):
        self.file.close()

The complete pipelines.py follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
# from itemadapter import ItemAdapter
import json

class Myscrapy20200919Pipeline:
    # open the output JSON file
    def __init__(self):
        # open itcast_pipeline.json in binary mode, for both reading and writing
        self.file = open("itcast_pipeline.json", "wb+")
    
    # called for every item the spider yields
    def process_item(self, item, spider):
        # convert the item to a dict and serialize it as JSON;
        # ensure_ascii=False keeps Chinese characters as-is instead of \u escapes
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        
        # encode the JSON text as UTF-8 and write it to the file
        self.file.write(content.encode('utf-8'))
        
        # returning the item tells the engine this pipeline is done with it;
        # if other pipelines are enabled, the item passes through each in turn,
        # and once all have finished, the spider moves on to its next item
        return item
    
    # close the file once writing is done
    def close_spider(self, spider):
        self.file.close()
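
Note that a file made of objects each followed by ",\n" is not itself one valid JSON document, so it cannot be read back with a single json.load call. One possible alternative, sketched here under assumptions, writes one JSON object per line in text mode. `JsonLinesPipeline` and the file name itcast_pipeline.jl are made up for this example; the hook names open_spider, process_item, and close_spider are the ones Scrapy calls on a pipeline.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant: one JSON object per line, written in text mode."""

    def __init__(self, path="itcast_pipeline.jl"):
        self.path = path  # assumed file name
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Exercise the hooks by hand, without a running crawler.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jl")
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "爬虫笔记", "time": "2020-09-19"}, spider=None)
pipeline.close_spider(spider=None)

# Each line parses on its own.
with open(demo_path, encoding="utf-8") as handle:
    rows = [json.loads(line) for line in handle]
print(rows)
```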