Python manga crawler: I reject my humanity, Bilibili! Scraping Kaguya-sama and other manga


木大木打木大


Category: Python crawlers

Article link: https://blog.csdn.net/weixin_43476533/article/details/107504865


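Today we are going to scrape the manga Kaguya-sama: Love Is War (《辉夜大小姐想让我告白》) from this site (the poor rely on technology, the rich rely on coins; you get it, enough said).
There are two main steps: 1. find the links to all chapters on the manga's main page; 2. find all the images on each chapter's page.

If you just want the source code, skip to the end.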

First, we locate the link for each chapter.

# Get the chapter links and chapter names
hrefs = re.findall(r'<li>\n.*?<a href="(.*?\.html)"\n.*?class="">\n.*?<span>(.*?)</span>', r.text)
for href in hrefs:
    # Build the full chapter URL
    chapter_url = 'http://www.90mh.com' + href[0]
    name = href[1]
    chapter_path = root_path + '\\' + name
    print(chapter_path)
    # e.g. 辉夜大小姐想让我告白\周刊13话
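
To check the pattern without hitting the site, here is a minimal sketch that runs the same regular expression against a hand-written HTML fragment. The fragment in `SAMPLE_HTML`, its chapter paths and names, and the helper name `parse_chapters` are all made up for illustration; the real page markup may differ.

```python
import re

# The same pattern as above, written as a raw string and compiled once.
CHAPTER_RE = re.compile(
    r'<li>\n.*?<a href="(.*?\.html)"\n.*?class="">\n.*?<span>(.*?)</span>'
)

# Hypothetical fragment imitating the chapter list markup (an assumption).
SAMPLE_HTML = (
    '<li>\n'
    '  <a href="/manhua/example/1.html"\n'
    '     class="">\n'
    '    <span>Chapter 1</span>\n'
    '  </a>\n'
    '</li>\n'
    '<li>\n'
    '  <a href="/manhua/example/2.html"\n'
    '     class="">\n'
    '    <span>Chapter 2</span>\n'
    '  </a>\n'
    '</li>\n'
)


def parse_chapters(html, base_url='http://www.90mh.com'):
    """Return a list of (chapter_url, chapter_name) tuples found in html."""
    return [(base_url + href, name) for href, name in CHAPTER_RE.findall(html)]


chapters = parse_chapters(SAMPLE_HTML)
for chapter_url, name in chapters:
    print(name, chapter_url)
```

Because the pattern hard-codes the line breaks between the tag attributes, it will silently return an empty list if the site changes its whitespace, so checking `len(chapters)` before downloading is a cheap sanity test.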

Next, open one of the chapters and find all of its images.

# Get the chapter images
chapter_imges = re.search(r'chapterImages = (\[.*?\])', chapter_page.text, re.S)
chapter_src = re.search(r'chapterPath = "(.*?)"', chapter_page.text).group(1)
''' ...... '''
pic_url = 'https://js1.zzszs.com.cn/' + chapter_src + chapter_imges[i]
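
The chapter page embeds the image list as a JavaScript array literal, so once the two values are matched, `ast.literal_eval` can turn the array text into a Python list. Here is a minimal sketch of that step on a made-up script fragment. `SAMPLE_PAGE`, its file names and path, and the helper `build_image_urls` are assumptions for illustration; the image host is the one used above.

```python
import re
from ast import literal_eval

# Hypothetical script fragment imitating what the chapter page embeds (an assumption).
SAMPLE_PAGE = (
    'var chapterImages = ["0001.jpg","0002.jpg","0003.jpg"];'
    'var chapterPath = "images/comic/1/2/";'
)


def build_image_urls(page_text, image_host='https://js1.zzszs.com.cn/'):
    """Return (local_file_name, image_url) pairs for one chapter page."""
    images_match = re.search(r'chapterImages = (\[.*?\])', page_text, re.S)
    path_match = re.search(r'chapterPath = "(.*?)"', page_text)
    if images_match is None or path_match is None:
        # The page did not contain the expected variables.
        return []
    # Turn the array text into a real Python list of file names.
    image_names = literal_eval(images_match.group(1))
    chapter_src = path_match.group(1)
    # Zero-pad the index so the saved files sort in reading order.
    return [
        (f'{i:02d}.jpg', image_host + chapter_src + image_name)
        for i, image_name in enumerate(image_names)
    ]


urls = build_image_urls(SAMPLE_PAGE)
for file_name, pic_url in urls:
    print(file_name, pic_url)
```

Returning an empty list when either pattern is missing avoids the `AttributeError` that calling `.group(1)` on `None` would raise if a page fails to load properly.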

Final result:

Success!

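Of course, different sites are structured differently, so the scraping approach varies a little from one to the next. 动漫之家 is one example (see here for a reference).
Still, there are only a handful of approaches, and you can work them out by experimenting. So far I have scraped four or five sites, all successfully. Try it yourself.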

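Source code:
This version uses multiple coroutines, which is dozens of times faster than the ordinary sequential approach but more tedious to write. Some URLs time out, so you need to run the script several times. I use a proxy here; you will need to set up your own and change the proxy IP address.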
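
One way to tame the timeouts without rerunning the whole script is to cap the number of in-flight requests with `asyncio.Semaphore` and retry each failed download a few times. The sketch below illustrates the idea with a fake download function so that it runs offline. The names `fetch_with_retry` and `flaky_download`, the retry count, and the delay are assumptions for illustration, not part of the script that follows.

```python
import asyncio


async def fetch_with_retry(fetch, url, semaphore, retries=3, delay=0.01):
    """Await fetch(url) under a concurrency limit, retrying on failure."""
    for attempt in range(1, retries + 1):
        try:
            # At most `semaphore` requests run at the same time.
            async with semaphore:
                return await fetch(url)
        except Exception:
            if attempt == retries:
                raise
            # Wait a little longer after each failed attempt.
            await asyncio.sleep(delay * attempt)


attempts = {}


async def flaky_download(url):
    """Hypothetical stand-in for a network call: fails on the first try per URL."""
    attempts[url] = attempts.get(url, 0) + 1
    if attempts[url] == 1:
        raise asyncio.TimeoutError(url)
    return f'data for {url}'


async def demo():
    semaphore = asyncio.Semaphore(5)  # create it inside the running event loop
    urls = [f'page-{i}' for i in range(8)]
    return await asyncio.gather(
        *(fetch_with_retry(flaky_download, u, semaphore) for u in urls)
    )


results = asyncio.run(demo())
print(results)
```

In a real crawler, `fetch` would wrap the `session.get(...)` call, and the semaphore limit plays the same role as `group_size` in the script.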

import re
import os
from ast import literal_eval
import asyncio
import aiohttp
import aiofiles


async def get_image(session, href_url, name):
    # Build the chapter URL and the local folder for this chapter
    chapter_url = 'http://www.90mh.com' + href_url
    chapter_path = os.path.join(root_path, name)
    print(chapter_path)

    # Create the chapter folder
    os.makedirs(chapter_path, exist_ok=True)
    try:
        async with session.get(chapter_url, headers=headers, proxy=proxy, timeout=30) as response:
            r = await response.text()
    except Exception:
        # Retry once if the first request fails or times out
        async with session.get(chapter_url, headers=headers, proxy=proxy, timeout=30) as response:
            r = await response.text()
    # Extract the image list and the image path prefix for this chapter
    chapter_imges = re.search(r'chapterImages = (\[.*?\])', r, re.S)
    chapter_src = re.search(r'chapterPath = "(.*?)"', r).group(1)

    # Convert the string form of the list into a real list
    chapter_imges = literal_eval(chapter_imges.group(1))

    tasks = []
    for i in range(len(chapter_imges)):
        # Zero-pad the page number so files sort correctly: 00.jpg, 01.jpg, ...
        pic_path = os.path.join(chapter_path, f'{i:02d}.jpg')
        print(pic_path)
        # Skip pages that an earlier run already downloaded
        if not os.path.exists(pic_path):
            pic_url = 'https://js1.zzszs.com.cn/' + chapter_src + chapter_imges[i]
            # asyncio.wait() needs Task objects; bare coroutines are rejected on Python 3.11+
            tasks.append(asyncio.create_task(get_photo(session, pic_url, pic_path)))
    if tasks:
        await asyncio.wait(tasks)
    # When this chapter is done, take the next pending chapter from the shared list
    if hrefs:
        href = hrefs.pop()
        await asyncio.create_task(get_image(session, href[0], href[1]))


async def get_photo(session, pic_url, pic_path):
    # Download one image, retrying once with a longer timeout on failure
    try:
        async with session.get(pic_url, headers=pic_headers, timeout=30) as p:
            pic = await p.content.read()
    except Exception:
        async with session.get(pic_url, headers=pic_headers, timeout=50) as p:
            pic = await p.content.read()
    # Write the image bytes to disk asynchronously
    async with aiofiles.open(pic_path, 'wb') as f:
        await f.write(pic)



# Number of chapters downloaded concurrently
group_size = 5
# Local proxy address; change this to match your own proxy
ip = '127.0.0.1:7890'
proxy = 'http://' + ip
# Manga home page
url = 'http://www.90mh.com/manhua/zongzhijiushifeichangkeai/'
headers = {
    'Host': 'www.90mh.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}
pic_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}
root_path = '总之就是非常可爱'

async def main():
    # Create the root folder
    os.makedirs(root_path, exist_ok=True)
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(url, headers=headers, proxy=proxy, timeout=30) as response:
                r = await response.text()
        except Exception:
            # Retry once with a longer timeout
            async with session.get(url, headers=headers, proxy=proxy, timeout=50) as response:
                r = await response.text()

        # Get the chapter links and chapter names
        global hrefs
        hrefs = re.findall(r'<li>\n.*?<a href="(.*?\.html)"\n.*?class="">\n.*?<span>(.*?)</span>', r)

        # Start up to group_size workers; each one pulls further chapters from hrefs
        tasks = []
        for i in range(min(group_size, len(hrefs))):
            href = hrefs.pop()
            tasks.append(asyncio.create_task(get_image(session, href[0], href[1])))
        # asyncio.wait() raises ValueError on an empty list, so guard against no matches
        if tasks:
            await asyncio.wait(tasks)

if __name__ == '__main__':
    asyncio.run(main())
