Python 3 Web Scraping: Downloading Comics


Galaxy__42


Tags: web crawler

Article link: https://blog.csdn.net/Galaxy__42/article/details/81111552


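Note: the comic site used below can no longer be scraped. I will switch to another site when I have time.

1. Introduction

This article uses the requests, bs4, and os libraries together with the browser automation tool Selenium.

For Selenium installation details, see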

https://germey.gitbooks.io/python3webspider/content/1.2.2-Selenium%E7%9A%84%E5%AE%89%E8%A3%85.html

2. Problem Analysis

URL: http://www.gugumh.com/manhua/200/

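First, let's look at the URL of each chapter.

Then look at the source of the comic's main page. Each chapter is listed in an li tag whose link holds a site-relative path, so joining the site's root URL (http://www.gugumh.com) with that path gives the URL of each chapter.

Below is the code that collects the path from every li tag.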


import requests
from bs4 import BeautifulSoup


def Each_chapter():
    """Collect the URL of every chapter listed on the comic's main page."""
    url = 'http://www.gugumh.com/manhua/200/'

    response = requests.get(url, timeout=10)
    response.encoding = 'utf-8'
    sel = BeautifulSoup(response.text, 'lxml')
    # The chapter list sits inside <div id="play_0">
    total_html = sel.find('div', id='play_0')

    total_chapter = []
    for i in total_html.find_all('li'):
        # Each <li> wraps a link whose href is a site-relative path
        href = i.a['href']
        every_url = 'http://www.gugumh.com' + href
        total_chapter.append(every_url)

    print(total_chapter)
    return total_chapter


if __name__ == '__main__':
    Each_chapter()
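
As a side note on the URL-joining step, here is a minimal sketch that uses the standard library's urljoin instead of string concatenation, so that root-relative and page-relative paths both resolve correctly. The helper name build_chapter_urls and the second sample path are my own assumptions for illustration; only the first path corresponds to a chapter URL mentioned in this article.

```python
from urllib.parse import urljoin


def build_chapter_urls(base_url, hrefs):
    """Resolve each href against base_url and return absolute URLs."""
    return [urljoin(base_url, href) for href in hrefs]


# The first path matches the chapter URL used in this article;
# the second one is made up to show a page-relative link.
base = 'http://www.gugumh.com/manhua/200/'
hrefs = ['/manhua/200/697280.html', '697281.html']
for chapter_url in build_chapter_urls(base, hrefs):
    print(chapter_url)
```

With urljoin, a path that starts with a slash is resolved against the site root, while a bare file name is resolved against the directory of the main page, so the same helper covers both link styles.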

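Next, get the number of pages in each chapter.

The HTML returned by requests does not contain the page count.

I spent ages searching for the page count with bs4 and never found it, until I realized that each chapter page is rendered with JavaScript.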
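
To illustrate why the page count never shows up in the static HTML, here is a small sketch that compares what a plain HTTP fetch would see with what the browser sees after scripts have run. Both HTML snippets, the page_select id, and the count_options helper are hypothetical, since I don't reproduce the site's real markup here. The sketch uses only the standard library so it runs without extra installs.

```python
from html.parser import HTMLParser


class OptionCounter(HTMLParser):
    """Count <option> tags, a stand-in for the entries of a page selector."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'option':
            self.count += 1


def count_options(html):
    """Return how many <option> tags appear in the given HTML."""
    parser = OptionCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical markup: what requests would receive before any script runs
STATIC_HTML = '''
<select id="page_select"></select>
<script>/* fills page_select with one option per page */</script>
'''

# Hypothetical markup: the same element after the browser ran the script
RENDERED_HTML = '''
<select id="page_select">
  <option>1</option><option>2</option><option>3</option>
</select>
'''

print(count_options(STATIC_HTML))
print(count_options(RENDERED_HTML))
```

The first call prints 0 and the second prints 3. If an element is empty in the fetched HTML but populated in the browser's developer tools, the content is most likely being filled in by JavaScript.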

Next, we can use Selenium to get the page source after the JavaScript has run.

For details on using Selenium, see the official documentation: http://selenium-python.readthedocs.io/index.html

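from selenium import webdriver


url = 'http://www.gugumh.com/manhua/200/697280.html'
# Launch Chrome and load the chapter page so that its JavaScript runs
browser = webdriver.Chrome()
browser.get(url)
# The listing is cut off after "html"; reading page_source is the presumed next step
html = browser.page_source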
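
For reference, here is a slightly more defensive sketch of fetching the rendered source with Selenium. It is not the article's original code. The function name fetch_rendered_html, the headless flag, and the 10-second timeout are my own assumptions, and it presumes a local Chrome and matching driver are available.

```python
def fetch_rendered_html(url, timeout=10):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the function can be defined even where Selenium is absent
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # no visible browser window
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # Wait until the document reports that loading has finished
        WebDriverWait(browser, timeout).until(
            lambda d: d.execute_script('return document.readyState') == 'complete'
        )
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even on errors
```

A call such as fetch_rendered_html('http://www.gugumh.com/manhua/200/697280.html') would return the rendered HTML as a string, which can then be passed to BeautifulSoup. Note that a readyState of 'complete' does not guarantee that content loaded asynchronously has arrived. In that case, waiting for a specific element to appear is the safer choice.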
