Python Distributed Crawler in Practice - Douban Books

Latest recommended article published on 2024-02-02 18:28:15

I'm_Jenson

Article link: https://blog.csdn.net/weixin_44345359/article/details/98621989

This example builds, from scratch, a distributed crawler that covers every tag on Douban Books.

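Tools used in this example:

IDE: PyCharm
Tools: Python, Scrapy, Linux, MySQL, Redis
Required modules: scrapy pymysql scrapy_redis selenium
Data collected: title, author, publication date, price, rating, number of ratings, number of comments, book type (tag)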

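First, the overall plan:

step1. Crawl the links of all tag pages and save them to the database
step2. Crawl the links of all content pages under each tag
step3. Crawl every content page with a distributed crawler (the key part)
step4. Run the Scrapy crawler on Linux

Enough talk, let's get started.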

step1. Crawl the links of all tag pages and save them to the database

For convenience, this step uses the requests library.

import requests
from lxml import etree

# Browser-like User-Agent header
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
}
def crawl_tag_links(url):
    # Crawl the tag index page, i.e. "https://book.douban.com/tag/?view=cloud"
    response = requests.get(url, headers=header)
    e = etree.HTML(response.text)
    # Extract the links of all tags (120 tag URLs when I ran it)
    tag_links = e.xpath("//table[@class='tagCol']//a/@href")
    # The extracted links are only the path part, e.g. /tag/小说, /tag/历史, so prepend the site root
    tag_links = [f"https://book.douban.com{i}" for i in tag_links]
    # Return the absolute links so the caller can save them
    return tag_links
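
The link-completion step can also be done with the standard library instead of string concatenation. The following is a minimal sketch, assuming the extracted hrefs are site-relative paths; the sample list and the `complete_links` helper are made up for illustration and are not part of the original crawler.

```python
from urllib.parse import quote, urljoin

BASE_URL = "https://book.douban.com"

# Hypothetical sample of relative hrefs, shaped like what the XPath returns
sample_hrefs = ["/tag/小说", "/tag/历史"]


def complete_links(hrefs, base=BASE_URL):
    """Turn site-relative hrefs into absolute URLs."""
    return [urljoin(base, href) for href in hrefs]


full_links = complete_links(sample_hrefs)
# Percent-encoded form of the same URLs (non-ASCII characters escaped as UTF-8 bytes)
encoded_links = [quote(link, safe=":/") for link in full_links]

print(full_links)
print(encoded_links)
```

urljoin also copes with hrefs that are already absolute, which plain concatenation does not.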

Save to the MySQL database

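import pymysql

def save_tag_links(links):
    # Open the database connection; change the IP address, user name and password to your own
    # (keyword arguments are used because newer PyMySQL versions reject positional ones)
    conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
    # Cursor object
    cursor = conn.cursor()
    # Check whether the table already exists
    # execute() returns 1 if it exists and 0 if it does not
    if not cursor.execute("show tables like 'tag_links'"):
        # Create the table, named tag_links here
        cursor.execute(
            """
            create table tag_links(
            id int primary key auto_increment,
            url varchar(100),
            status int
            )
            """
        )
    # Prepare the SQL statement
    sql = "insert into tag_links values (%s,%s,%s)"
    # Prepare the rows to insert
    # The first 0 is the id column; id auto-increments on insert, so passing 0 is enough
    # The second value, link, is the URL of each tag page
    # The third 0 means "not crawled yet"; once the tag page has been crawled successfully
    # it is changed to 1, so an interrupted run does not have to start over
    insert_links = [(0, link, 0) for link in links]
    try:
        # Insert all rows in one batch
        cursor.executemany(sql, insert_links)
        # Inserts are transactional, so commit them
        conn.commit()
    except Exception as err:
        # Roll back on error
        conn.rollback()
        print(err)
    finally:
        cursor.close()
        conn.close()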
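
The status column is what makes the crawl resumable: rows start at 0 and are flipped to 1 after a successful crawl, so a restart only picks up what is left. Here is a minimal sketch of that pattern. Note the assumptions: it uses the standard library's sqlite3 with an in-memory database as a stand-in for MySQL (so placeholders are `?` rather than `%s`), and the URLs and the `pending` / `mark_done` helpers are invented for illustration.

```python
import sqlite3

# sqlite3 (in-memory) stands in for MySQL purely so the sketch runs anywhere
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tag_links (id integer primary key autoincrement, url text, status int)"
)

# Made-up URLs for illustration
links = ["https://book.douban.com/tag/a", "https://book.douban.com/tag/b"]
# Naming the columns lets the auto-increment id be left out entirely
conn.executemany("insert into tag_links (url, status) values (?, 0)", [(u,) for u in links])
conn.commit()


def pending(conn):
    """Return the URLs that have not been crawled yet (status = 0)."""
    rows = conn.execute("select url from tag_links where status = 0 order by id")
    return [row[0] for row in rows]


def mark_done(conn, url):
    """Flip status to 1 once a page has been crawled successfully."""
    conn.execute("update tag_links set status = 1 where url = ?", (url,))
    conn.commit()


mark_done(conn, links[0])
remaining = pending(conn)
print(remaining)
```

After a crash, calling `pending` again returns only the rows still at status 0, which is the behaviour the crawler relies on.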

Complete code (wrapped in a class so it is easy to reuse later)

import requests, pymysql
from lxml import etree


class TagSpider():
    def __init__(self):
        # Browser-like User-Agent header
        self.header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
        }

    def crawl_tag_links(self, url):
        # Crawl the page listing all popular tags
        response = requests.get(url, headers=self.header)
        e = etree.HTML(response.text)
        # Extract the links of all tags
        tag_links = e.xpath("//table[@class='tagCol']//a/@href")
        # The extracted links are only the path part, e.g. /tag/小说, so prepend the site root
        tag_links = [f"https://book.douban.com{i}" for i in tag_links]
        # Save the links to the MySQL database
        self.save_tag_links(tag_links)

    def save_tag_links(self, links):
        # Open the database connection (keyword arguments work across PyMySQL versions)
        conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
        # Cursor object
        cursor = conn.cursor()
        # Check whether the table already exists
        # execute() returns 1 if it exists and 0 if it does not
        if not cursor.execute("show tables like 'tag_links'"):
            # Create the table, named tag_links here
            cursor.execute(
                """
                create table tag_links(
                id int primary key auto_increment,
                url varchar(100),
                status int
                )
                """
            )
        # Prepare the SQL statement
        sql = "insert into tag_links values (%s,%s,%s)"
        # Prepare the rows to insert
        # The first 0 is the id column; id auto-increments on insert, so passing 0 is enough
        # The second value, link, is the URL of each tag page
        # The third 0 means "not crawled yet"; once the tag page has been crawled successfully
        # it is changed to 1, so an interrupted run does not have to start over
        insert_links = [(0, link, 0) for link in links]
        try:
            # Insert all rows in one batch
            cursor.executemany(sql, insert_links)
            # Inserts are transactional, so commit them
            conn.commit()
        except Exception as err:
            # Roll back on error
            conn.rollback()
            print(err)
        finally:
            cursor.close()
            conn.close()


if __name__ == '__main__':
    # URL of the page listing all popular tags
    url = "https://book.douban.com/tag/?view=cloud"
    # Create the spider instance
    get_tag_links = TagSpider()
    # Start crawling all tags
    get_tag_links.crawl_tag_links(url)

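step2. Crawl the links of all content pages under each tag

PS: Douban appears to impose a limit: only the first 50 pages of each tag can be viewed.
A rough estimate: 120 (tags) x 50 (pages) x 20 (content pages per page) = 120,000 records.
To cut down the crawling time, Scrapy is used from here on.

settings.py configuration file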

BOT_NAME = 'doubandushulinks'
SPIDER_MODULES = ['doubandushulinks.spiders']
NEWSPIDER_MODULE = 'doubandushulinks.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
   'doubandushulinks.pipelines.DoubandushulinksPipeline': 300,
}

Spider file

Approach: crawl the first 50 pages of each tag, and skip any page that shows "没有找到符合条件的图书" (no matching books found)…

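# -*- coding: utf-8 -*-
import scrapy, pymysql, re
from urllib.parse import unquote


class DoubanlinksSpider(scrapy.Spider):
    name = 'doubanlinks'
    allowed_domains = ['douban.com']
    # Database connection
    conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
    # Cursor
    cursor = conn.cursor()
    # Fetch the URLs whose status is 0 (not crawled yet) from the database
    cursor.execute("select url from tag_links where status = 0")
    urls = cursor.fetchall()
    # Good habit: close the database objects once you are done with them
    cursor.close()
    conn.close()
    # Append the page offset to each tag URL; consecutive pages are 20 apart
    start_urls = [f"{url[0]}?start={j}" for url in urls for j in range(0, 1000, 20)]

    def parse(self, response):
        # If the page shows "没有找到符合条件的图书" (no matching books found), we are past the last available page
        # (extract_first must be called; comparing the bare method would always be True)
        if response.xpath("//p[@class='pl2']/text()").extract_first() != "没有找到符合条件的图书":
            # Every link belongs to a tag; keep the tag name for later data analysis
            tag = unquote(re.findall(r"tag/(.+)\?.+", response.url)[0])
            # Extract all content-page links on this page
            content_links = response.xpath("//h2/a/@href").extract()
            # Prepare the rows for the database
            # The leading 0 is the id and the trailing 0 means "not crawled yet", as explained earlier
            item = {"data": [(0, url, tag, 0) for url in content_links]}
            # A simple yield hands the scraped data over to the pipeline
            yield item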
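
To make the pagination and tag-name handling concrete, here is a small standalone sketch using only the standard library. It reuses the same `?start=` step of 20 and the same regular expression as the spider; the `page_urls` and `tag_from_url` helper names and the sample tag URL are assumptions made for illustration.

```python
import re
from urllib.parse import quote, unquote


def page_urls(tag_url, pages=50, page_size=20):
    """Build the ?start=N URLs for the first `pages` pages of one tag."""
    return [f"{tag_url}?start={offset}" for offset in range(0, pages * page_size, page_size)]


def tag_from_url(url):
    """Recover the readable tag name from a (possibly percent-encoded) URL."""
    match = re.search(r"tag/(.+)\?.+", url)
    return unquote(match.group(1)) if match else None


# Hypothetical tag URL, percent-encoded the way it would appear in response.url
tag_url = "https://book.douban.com/tag/" + quote("小说")
urls = page_urls(tag_url)

print(len(urls), urls[0], urls[-1])
print(tag_from_url(urls[3]))
```

With 50 pages of 20 entries each, the offsets run from 0 to 980, which matches `range(0, 1000, 20)` in the spider.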

pipelines.py (the pipeline)

It works like the large intestine, small intestine and duodenum that food passes through after we eat it…

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql

class DoubandushulinksPipeline(object):
    # Called when the spider starts
    def open_spider(self,spider):
        # Open the database connection
        self.conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
        self.cursor = self.conn.cursor()
        # Create the content_links table if it does not exist yet
        if not self.cursor.execute("show tables like 'content_links'"):
            self.cursor.execute(
                """
                create table content_links(
                id int primary key auto_increment,
                url varchar(100),
                type varchar(10),
                status int
                )
                """
            )

    def process_item(self, item, spider):
        # Store the links in the MySQL database
        sql = "insert into content_links values (%s,%s,%s,%s)"
        try:
            # Insert all rows in one batch
            self.cursor.executemany(sql, item["data"])
            self.conn.commit()
        except Exception as err:
            self.conn.rollback()
            print(err)
        # Return the item so any later pipeline still receives it
        return item

    # Called when the spider closes
    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()

In total, 117,078 content-page links were crawled in about 5 minutes…

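step3. Crawl every content page with a distributed crawler

Ahem, here comes the key part. Create a brand-new crawler project for this step so the code written earlier does not get mixed up!

I had just finished the program and was about to test it, and the result was:
…

So I wrote a separate login program.
Log in once before crawling and the data can then be crawled without trouble.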

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def login(login_url, user_url):
    chrome = webdriver.Chrome()
    # Open the login page
    chrome.get(login_url)
    try:
        # Wait up to 60 seconds
        w = wait(chrome, 60)
        # Wait for the login (log in manually here, or write automatic login code)
        # This element is used as the sign that we are logged in
        w.until(EC.presence_of_element_located((By.CLASS_NAME, "bn-more")), message="login failed!")
        # Open the user's profile page
        chrome.get(user_url)
        w.until(EC.presence_of_element_located((By.ID, "usr-profile-nav-doulists")), message="failed to access the user page!")
        # Get the cookies (a list of dicts -> [{....},{.....},......] )
        json_cookies = chrome.get_cookies()
        cookies = {}
        for cookie in json_cookies:
            # Build a new cookies dict from each cookie's name/value pair
            cookies[cookie["name"]] = cookie["value"]
        # Save to a file
        with open("chrome_cookie.txt", "w") as f:
            # Must be converted to a string before writing
            f.write(str(cookies))
    except Exception as error:
        print(error)
        return False
    finally:
        # Shut the browser down whether or not the login succeeded
        chrome.quit()
    return True


if __name__ == '__main__':
    login_url = "https://accounts.douban.com/passport/login"
    user_url = "https://www.douban.com/people/215290729/"
    result = login(login_url, user_url)
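
The cookie file is written with `str()` and later read back with `ast.literal_eval`. The sketch below shows that round trip on made-up data, next to a JSON round trip as one possible alternative. The sample cookies are hypothetical and only mimic the list-of-dicts shape that selenium's `get_cookies()` returns (real entries carry more keys).

```python
import ast
import json

# Hypothetical sample in the shape selenium's get_cookies() returns
json_cookies = [
    {"name": "bid", "value": "abc123", "path": "/"},
    {"name": "dbcl2", "value": "xyz:789", "path": "/"},
]

# Collapse to the {name: value} mapping that scrapy.Request(cookies=...) accepts
cookies = {c["name"]: c["value"] for c in json_cookies}

# Round trip 1: str() on write, ast.literal_eval on read (the approach in this article)
restored_literal = ast.literal_eval(str(cookies))

# Round trip 2: JSON, which tools outside Python can read as well
restored_json = json.loads(json.dumps(cookies))

print(restored_literal == cookies, restored_json == cookies)
```

Both round trips give back the same dict here; `ast.literal_eval` is used rather than `eval` because it only accepts literals and will not execute arbitrary code from the file.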

Add the following to settings.py
Before writing the spider file, configure Scrapy so that it can run as a distributed crawler.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Deduplicate URLs through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy_redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue in Redis so the crawl can continue after a pause
SCHEDULER_PERSIST = True
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
# Enable the Redis pipeline
ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400,}
# Log level
LOG_LEVEL = 'DEBUG'
# Redis server IP
REDIS_HOST = "192.168.2.208"
# Redis port
REDIS_PORT = 6379
# Redis database number
REDIS_DB = 1
# Redis connection parameters
REDIS_PARAMS = {
    'socket_timeout': 30,
    'socket_connect_timeout': 30,
    'retry_on_timeout': True,
    'encoding': 'utf-8',
    'db': REDIS_DB
}
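
The deduplication setting is what keeps several workers from fetching the same page twice: each request is reduced to a fingerprint, and a request whose fingerprint has already been recorded is dropped. The toy sketch below illustrates only that idea. It is not scrapy_redis's actual code: a Python set stands in for the shared Redis set, the fingerprint is simplified to method plus URL, and the `SetDupeFilter` class is invented for illustration.

```python
import hashlib


class SetDupeFilter:
    """Toy duplicate filter: a Python set plays the role of the shared Redis set."""

    def __init__(self):
        self.seen = set()

    @staticmethod
    def fingerprint(method, url):
        # Simplified fingerprint (assumption): hash of the method and URL only
        return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()

    def request_seen(self, method, url):
        """Return True if this request was already recorded, otherwise record it."""
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


dupefilter = SetDupeFilter()
first = dupefilter.request_seen("GET", "https://book.douban.com/subject/1/")
second = dupefilter.request_seen("get", "https://book.douban.com/subject/1/")
third = dupefilter.request_seen("GET", "https://book.douban.com/subject/2/")
print(first, second, third)
```

Because the set lives in Redis in the real setup, every machine running the spider consults the same record of seen requests.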

Spider file

# -*- coding: utf-8 -*-
import scrapy, ast, re, pymysql, redis, sys

sys.path.append("..")
import settings


class DoubandushuSpider(scrapy.Spider):
    name = 'doubandushu'
    allowed_domains = ['douban.com']
    # Connect to Redis
    redis_cli = redis.Redis(host=settings.REDIS_HOST, port=settings.REDIS_PORT)
    # Connect to the database
    conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
    # Cursor
    cursor = conn.cursor()

    def start_requests(self):
        # Read the cookies saved by the login program
        with open("chrome_cookie.txt") as f:
            cookies = ast.literal_eval(f.read())
        # Read 200 content-page links at a time
        sql = "select id,url,type from content_links where status = 0 limit 200"
        self.cursor.execute(sql)
        urls = self.cursor.fetchall()
        # An empty result means the database has no URLs left to crawl, so the loop ends
        while len(urls) != 0:
            for id, url, type in urls:
                yield scrapy.Request(url, callback=self.parse, cookies=cookies, meta={"id": id, "type": type})
            # After each batch of 200 content-page URLs, fetch the next 200
            self.cursor.execute(sql)
            urls = self.cursor.fetchall()

    def parse(self, response):
        # Extract the data
        try:
            name = response.xpath("//h1/span/text()").extract_first()
            info = response.xpath("string(//div[@id='info'])").extract_first()
            info = re.sub(r"[\n\s]", "", info)
            author = re.findall(r"作者:\s*(.+)出版社:", info)
            author = author[0] if author else None
            date = re.findall(r"出版年:\s*(\w{4})", info, re.A)
            date = date[0] if date else None
            price = re.findall(r"定价:\s*(.*[0-9])[\u4e00-\u9fa5]+:I*", info, re.A)
            try:
                price = price[0] if price else re.findall(r"定价:\s*(.*[0-9])[\u4e00-\u9fa5]*ISB*", info, re.A)[0]
            except Exception as error:
                price = 0
            score = response.xpath("//strong/text()").extract_first()
            score = score.replace(" ", "")
            if score == "":
                score = None
            rating_count = response.xpath("//a[@class='rating_people']/span/text()").extract_first()
            comment_count = response.xpath("//header//span[@class='pl']/a/text()").extract_first()
            comment_count = re.sub(r"[全部条\s]", "", comment_count)
            datas = {
                "id": response.meta["id"],
                "name": name,
                "author": author,
                "date": date,
                "price": price,
                "score": score,
                "rating_count": rating_count,
                "comment_count": comment_count,
                "type": response.meta["type"],
                "url": response.url
            }
            # Once the data is extracted and before it is saved, set the content page's status to 1
            # so the URL is not fetched again; a restarted crawl simply continues with status = 0 URLs
            # (the trailing comma makes the argument a real one-element tuple)
            self.cursor.execute("update content_links set status = 1 where id = %s", (datas['id'],))
            self.conn.commit()
            yield datas
        except Exception as error:
            self.conn.rollback()
            print(error)
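
The regular expressions in `parse` are easier to follow on a concrete input. The sketch below applies the same three patterns to a made-up info string shaped like the text of the `#info` block; the sample data and the `parse_info` helper are assumptions for illustration, and real pages may be formatted differently.

```python
import re

# Made-up sample shaped like the text of the #info block on a book page
raw_info = """
    作者: 余华
    出版社: 作家出版社
    出版年: 2012-8-1
    页数: 191
    定价: 20.00元
    装帧: 平装
    ISBN: 9787506365437
"""


def parse_info(raw):
    """Apply the same regexes as parse() to one info string."""
    # Strip all whitespace, newlines included, so the fields run together
    info = re.sub(r"\s", "", raw)
    author = re.findall(r"作者:\s*(.+)出版社:", info)
    date = re.findall(r"出版年:\s*(\w{4})", info, re.A)
    price = re.findall(r"定价:\s*(.*[0-9])[\u4e00-\u9fa5]+:I*", info, re.A)
    return {
        "author": author[0] if author else None,
        "date": date[0] if date else None,
        "price": price[0] if price else None,
    }


parsed = parse_info(raw_info)
print(parsed)
```

The price pattern works by capturing everything up to the last digit that is followed by Chinese characters and a colon, which is why the currency unit is left out of the captured value.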

Happily crawling away.
The red text is log output, not an error message.

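step4. Run the Scrapy crawler on Linux

Create a Scrapy project on Linux
Replace settings.py

Upload the spider file into the spiders directory

Run the command

Then the crawler can happily crawl away.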

Finally: how do you get the data onto your local machine?

So easy~~~

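import json
import redis

# Use the same Redis database number the crawler writes to (db 1 in settings.py)
redis_cli = redis.Redis(db=1)

while True:
    # blpop blocks until an item is available and returns a (key, value) pair
    data = redis_cli.blpop("doubandushu:items")
    # RedisPipeline serializes items as JSON by default, so decode them with json
    # (ast.literal_eval would fail on null, which is how None fields are stored)
    print(json.loads(data[1].decode()))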
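
One detail matters when decoding the stored items: as far as I know, scrapy_redis's RedisPipeline serializes items as JSON by default, so a field that was None in the spider (a missing author, for example) is stored as `null`. The sketch below, which uses a made-up payload and needs no Redis server, shows how `json.loads` and `ast.literal_eval` differ on such an item; the `decode_item` helper is hypothetical.

```python
import ast
import json

# Made-up payload shaped like one entry popped from the items list;
# a missing author (None in Python) becomes null in JSON
payload = b'{"id": 1, "name": "Sample Book", "author": null, "price": "20.00"}'


def decode_item(raw):
    """Decode one JSON-serialized item into a dict."""
    return json.loads(raw.decode("utf-8"))


item = decode_item(payload)
print(item)

# ast.literal_eval only understands Python literals, so null trips it up
try:
    ast.literal_eval(payload.decode("utf-8"))
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False
print(literal_eval_ok)
```

Items without any empty fields decode the same way with either function, which is why the difference is easy to miss until a book with a missing field comes through.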