Web Crawler Fundamentals and Practice, Part 1: A Hands-On Overview

Daphar

Article link: https://blog.csdn.net/qd_ltf/article/details/79705426

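This article walks through the basics of the Python Scrapy framework: preparing the development environment, the Scrapy architecture, crawling data, saving data to several storage backends, countering anti-crawling measures, and advanced Scrapy features. It covers virtual environment setup, database connections, shell debugging, saving data to MySQL, JSON, and Elasticsearch, and countermeasures such as rotating the User-Agent and simulating login.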

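I. Preparation Before Development

1. Preparing the development environment

Basic setup (Windows 10)

See: installing Python 2 and Python 3 side by side

See: installing and activating PyCharm

See: installing and using MySQL and Navicat

See: installing cmder as a replacement for cmd (recommended)

Setting up a virtual environment

Go to the directory where you want to keep your virtual environments and install virtualenvwrapper:
pip install virtualenvwrapper-win

Create a virtual environment:
mkvirtualenv -p C:\Python35\python.exe myenv # myenv is the name of the virtual environment
To have virtual environments created under a specific directory, set the WORKON_HOME environment variable.

Common virtual environment commands: workon, deactivate, rmvirtualenv

Packages you will usually need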

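pip install scrapy # the Scrapy package. If this fails, you may need to download wheels from https://www.lfd.uci.edu/~gohlke/pythonlibs/ and install Twisted and Scrapy separately (the cpXX tag in the wheel name must match your Python version), for example:
pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl
pip install Scrapy-1.5.0-py2.py3-none-any.whl

pip install pypiwin32
# If a win32api error appears, try: python pywin32_postinstall.py -install

pip install Pillow # image processing

pip install mysqlclient # provides the MySQLdb module

pip install fake-useragent # random User-Agent strings

pip install requests # install requests

pip install selenium # install selenium

pip install pyvirtualdisplay # run Chrome without a display (Linux only)

If a package fails to install, look for a prebuilt wheel at https://www.lfd.uci.edu/~gohlke/pythonlibs/
The Douban mirror can speed up installation:
pip install -i https://pypi.douban.com/simple mywrap # mywrap is the package name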
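
After installing, it can help to confirm that the packages are actually importable from the active virtual environment. The following is a minimal sketch, not part of the original setup: the helper name missing_modules and the list REQUIRED_MODULES are made up for illustration, and the list uses import names, which for some packages differ from the pip names.

```python
import importlib.util

# Import names (not pip names) of the packages installed above.
# Some differ from the pip name: Pillow -> PIL, mysqlclient -> MySQLdb,
# fake-useragent -> fake_useragent.
REQUIRED_MODULES = ["scrapy", "PIL", "MySQLdb", "fake_useragent", "requests", "selenium"]


def missing_modules(module_names):
    """Return the names in module_names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the virtual environment activated (workon myenv); anything reported as missing was most likely installed into a different interpreter.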

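2. Creating a new crawler

Create the crawler project:
Open cmd and change to the directory where the project should live;
Activate the virtual environment intended for the project;
Run: scrapy startproject <project_name> [project_dir]

Create a spider from a template:
cd project_name # project_name is the project name; you must be inside the project directory
scrapy genspider [options] <name> <domain> # generate a spider from a template

Common commands: run scrapy with no arguments to see the help

Open PyCharm and set the project interpreter to the project's virtual environment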

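3. Writing main.py for easier debugging

To debug directly in PyCharm, set up main.py as shown here. Running main.py then lets you stop at breakpoints.

# main.py  # create this file in the project root if it does not exist, then copy in the following
from scrapy.cmdline import execute
import sys
import os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # add the project directory to the module search path

execute(["scrapy", "crawl", "jobbole"])  # change to the name of your spider
# execute(["scrapy", "crawl", "zhihu"])
# execute(["scrapy", "crawl", "lagou"])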

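4. Writing a helper that generates boilerplate code

Analyze the web page and the requirements, then decide which fields are needed and what type each one has.

Write a function that generates the CREATE TABLE statement, the insert/update statement, the item assignment statements, and similar code.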

# gen_item_sql.py  # generates part of the code to reduce manual typing; adjust the fields to your case
def gen_item_sql(item_list, table_name, class_item):
    '''
    Helper that prints boilerplate code for a list of field names.
    '''

    # CREATE TABLE statement for MySQL; the column type is inferred from the field-name suffix
    print('-' * 50, 'SQL: create the table', '-' * 30)
    item_p = []
    for item in item_list:
        if item.endswith('nums'):
            item_p.append("%s int(11) DEFAULT 0 NOT NULL" % (item,))
        elif item.endswith('date'):
            item_p.append("%s date DEFAULT NULL" % (item,))
        elif item.endswith('datetime'):
            item_p.append("%s datetime DEFAULT NULL" % (item,))
        elif item.endswith('content'):
            item_p.append("%s longtext DEFAULT NULL" % (item,))
        elif item.endswith('id'):
            item_p.append("%s varchar(50) NOT NULL" % (item,))
        else:
            item_p.append("%s varchar(300) DEFAULT NULL" % (item,))

    print("CREATE TABLE %s (" % (table_name,))
    print(",".join(item_p))
    print(") ENGINE=InnoDB DEFAULT CHARSET=utf8;")

    # parse_item method for the spider file
    print('-' * 50, 'In the spider, define def parse_item(self, response):', '-' * 30)
    print('def parse_item(self, response):')
    print("\titem_loader=MyItemLoader(item={0}(),response=response)".format(class_item))
    for item in item_list:
        print('\titem_loader.add_xpath(\'%s\',\'\')' % (item,))
    print("\titem = item_loader.load_item()")
    print("\tyield item")

    # Item class for items.py
    print('-' * 50, 'In items.py, define the Item class:', '-' * 30)
    print("class {0}(scrapy.Item):".format(class_item))

    for item in item_list:
        print("\t%s = scrapy.Field()" % (item,))
    print()
    print("\tdef get_insert_sql(self):")
    print("\t\t# Build the insert statement and the parameters to pass with it")
    # Note: ON DUPLICATE KEY UPDATE only takes effect if the table has a PRIMARY KEY or UNIQUE index
    sql_str = '\t\tinsert_sql = \'\'\''
    sql_str += '\ninsert into %s ( %s )' % (table_name, ','.join(item_list))
    sql_str += '\nVALUES ( %s )' % (','.join(['%s' for _ in item_list]),)
    sql_str += '\nON DUPLICATE KEY UPDATE %s' % (','.join(['%s= VALUES(%s)' % (x, x) for x in item_list]),)
    sql_str += '\n\'\'\''
    print(sql_str)
    print('\t\tparams = ( %s )' % (','.join(['self["%s"]' % (x,) for x in item_list]),))
    print("\t\treturn insert_sql,params")

    # Item assignment statements
    print('-' * 50, 'Assign values to the item', '-' * 30)
    for item in item_list:
        print('\titem["%s"] =' % (item,))

if __name__ == "__main__":
    jobbole_fields = ['title', 'create_datetime', 'url', 'url_object_id', 'tags', 'content', 'front_image_url',
                      'front_image_path', 'praise_nums', 'comment_nums', 'fav_nums']
    gen_item_sql(jobbole_fields, 'jobbole_article', class_item="JobboleArticleItem")
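
The suffix rules that decide each column type are easier to check when pulled out into a small pure function. The sketch below assumes nothing beyond the rules already used in gen_item_sql; the name column_definition is hypothetical and only serves the illustration.

```python
def column_definition(field_name):
    """Map a field name to a MySQL column definition using the suffix rules of gen_item_sql."""
    # (suffix, column type) pairs, checked in the same order as in gen_item_sql
    rules = [
        ("nums", "int(11) DEFAULT 0 NOT NULL"),
        ("date", "date DEFAULT NULL"),
        ("datetime", "datetime DEFAULT NULL"),
        ("content", "longtext DEFAULT NULL"),
        ("id", "varchar(50) NOT NULL"),
    ]
    for suffix, column_type in rules:
        if field_name.endswith(suffix):
            return "%s %s" % (field_name, column_type)
    # Anything else falls back to a generic string column
    return "%s varchar(300) DEFAULT NULL" % (field_name,)


if __name__ == "__main__":
    for name in ["title", "create_datetime", "url_object_id", "content", "praise_nums"]:
        print(column_definition(name))
```

Because the type is inferred purely from the name, a field has to be named with the right suffix (for example praise_nums rather than praise_count) to get the intended column type.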

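5. Creating the database (for example, MySQL)

Copy the CREATE TABLE statement generated above into Navicat and run it to create the table.
If field content may contain emoji, set the character set of the corresponding columns to utf8mb4.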

II. Know the Scrapy Architecture by Heart

[Figure: Scrapy architecture diagram (scrapy_frame.jpg)]

1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
5. Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.
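
To make this data flow concrete, here is a toy, single-threaded simulation of the loop. It is a simplified sketch and not how Scrapy is implemented: there are no middlewares, no asynchronous I/O, and the "site" is an in-memory dictionary. All function names (toy_download, toy_spider_parse, run_toy_engine) are invented for this illustration.

```python
from collections import deque


def toy_download(url, site):
    """Stand-in for the Downloader: look the URL up in an in-memory 'site'."""
    return {"url": url, "links": site.get(url, [])}


def toy_spider_parse(response):
    """Stand-in for a Spider callback: yield one item, then follow-up requests."""
    yield {"item_url": response["url"]}
    for link in response["links"]:
        yield link  # a plain string plays the role of a new Request


def run_toy_engine(start_urls, site):
    """Stand-in for the Engine: move requests, responses, and items between components."""
    scheduler = deque(start_urls)  # the Engine hands the initial requests to the Scheduler
    seen = set(start_urls)         # simple duplicate filter
    items = []                     # stands in for the Item Pipelines
    while scheduler:               # repeat until the Scheduler has no more requests
        url = scheduler.popleft()              # Scheduler returns the next request
        response = toy_download(url, site)     # Engine -> Downloader -> Engine
        for result in toy_spider_parse(response):  # Engine -> Spider -> Engine
            if isinstance(result, dict):
                items.append(result)           # items go to the pipelines
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)       # new requests go back to the Scheduler
    return items


if __name__ == "__main__":
    fake_site = {"a": ["b", "c"], "b": ["c"], "c": []}
    print(run_toy_engine(["a"], fake_site))
```

The point of the sketch is the division of labor: the spider only turns responses into items and requests, while the engine decides where each result goes next.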

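III. Crawling Data

1. Spiders generated from the basic template

That is, a spider generated with: scrapy genspider -t basic <name> <domain>
The key is to implement two methods:

Define def parse(self, response): so that it:

starts from the start page and collects the URLs on the page that match your criteria (for example, a regular expression);

yields a scrapy.Request() for each such URL, with parse_item as the callback, to parse the fields;

if there is a next-page URL, extracts it and yields a scrapy.Request() with parse as the callback to keep traversing.

Define def parse_item(self, response): so that it:

parses the page, extracts the required information, and stores it in an item;

if parsing turns up new URLs worth crawling, yields a scrapy.Request() with parse as the callback to keep traversing.

For a walkthrough of the Spider source code, see: https://blog.csdn.net/qd_ltf/article/details/79792957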

2. Spiders generated from the crawl template

That is, a spider generated with: scrapy genspider -t crawl <name> <domain>
The key is to implement def parse_item(self, response): to parse the pages.
You can also override the following methods:

def parse_start_url(self, response):
    return []
def process_results(self, response, results):
    return results

For a walkthrough of the CrawlSpider source code, see: https://blog.csdn.net/qd_ltf/article/details/79782005

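3. Debugging field parsing in the shell

Fields are usually extracted with CSS or XPath selectors, and the expressions can be tried out interactively in the Scrapy shell, for example:

scrapy shell http://blog.jobbole.com/112744/  # open the Scrapy shell for this URL
re2 = response.xpath('//*[@id="post-112744"]/div[1]/h1/text()').extract()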

IV. Saving Data (Pipelines)

Copy the relevant parts of the code below into the corresponding files as needed.

1. Saving to MySQL

# settings.py  # add the following to settings.py
MYSQL_HOST = '127.0.0.1'
MYSQL_DB_NAME = 'spider'    # database name
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # datetime format
SQL_DATE_FORMAT = "%Y-%m-%d"  # date format

ITEM_PIPELINES = {
    'ltfspider.pipelines.MysqlTwistedPipline': 300,  # adjust to your project
}

# ------------------------------------
# pipelines.py  # add the following to pipelines.py

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipline(object):
    '''
    Uses Twisted to make MySQL inserts asynchronous
    '''

    def __init__(self, db_pool):
        self.db_pool = db_pool

    @classmethod
    def from_settings(cls, settings):
        # Read the database settings and return a pipeline instance
        return cls(adbapi.ConnectionPool(
            "MySQLdb",
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DB_NAME'],
            user=settings['MYSQL_USER']