Scraping comics from 酷漫网 (kuman.com) and 漫客栈 (mkzhan.com)

The BeautifulSoup approach


# -*- coding: utf-8 -*-
# @Time    : 2019/6/11 9:47
# @Author  : wujf
# @Email   : 1028540310@qq.com
# @File    : 斗罗大陆2.py
# @Software: PyCharm

'''The BeautifulSoup approach'''
import re
import urllib.request

import requests
from bs4 import BeautifulSoup

urls = ['http://www.kuman.com/mh-1003692/{}/'.format(str(i)) for i in range(1, 22)]
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

for url in urls:
    r = requests.get(url, headers=headers, timeout=5)
    # r.raise_for_status()
    r.encoding = r.apparent_encoding  # let requests guess the page's real encoding
    content = r.text
    # replace &nbsp; entities with plain spaces before parsing
    beaobj = BeautifulSoup(content.replace('&nbsp;', ' '), 'html5lib')
    lis = beaobj.findAll('li', style="margin-top: -3.6px")
    for li in lis:
        # re.findall() works on strings, so the Tag object must be wrapped in str(), otherwise it raises an error
        image = re.findall(r'src="(.*?)"', str(li))
        name = image[0].split('/')[-1]
        image_name = 'E:\\Python\\python_image\\%s' % name
        try:
            s = urllib.request.urlretrieve(image[0], image_name)
            print("Downloading %s" % image[0])
        except Exception as e:
            print(e)
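The key extraction step above is pulling the `src` attribute out of each `<li>` with a regular expression; since `re.findall` expects a string, the BeautifulSoup Tag has to be converted with `str()` first. A minimal stdlib-only sketch of that step (the HTML fragment below is made up for illustration):

```python
import re

# A made-up <li> fragment standing in for str(li) in the scraper above
li_html = '<li style="margin-top: -3.6px"><img src="http://example.com/img/001.jpg"></li>'

# Non-greedy capture of everything between src=" and the next double quote
image = re.findall(r'src="(.*?)"', li_html)
print(image)  # ['http://example.com/img/001.jpg']

# The file name is the last path segment, same as in the scraper
name = image[0].split('/')[-1]
print(name)  # 001.jpg
```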

How to scrape the paid chapters will be covered in a later update. The download stops at page 23, where VIP is required; unfortunately that check can't be bypassed on this site, so let's try another one.

Below is the XPath approach, which skips the VIP check and scrapes the paid content (漫客栈's VIP chapters) directly.

# -*- coding: utf-8 -*-
# @Time    : 2019/6/11 11:20
# @Author  : wujf
# @Email   : 1028540310@qq.com
# @File    : 漫客栈-斗罗大陆2.py
# @Software: PyCharm

import requests
import urllib.request
from lxml import etree

headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
url = 'https://www.mkzhan.com/211692/'
r = requests.get(url,headers=headers,timeout=5)
r.encoding = r.apparent_encoding
r.raise_for_status()
html = r.text
html = html.encode('gbk', 'ignore').decode('gbk')    # round-trip through gbk, dropping characters gbk cannot represent

ret = etree.HTML(html)
a = ret.xpath('//a[@class="j-chapter-link"]/@data-hreflink')
print(a)        # chapter links, listed newest first
a.reverse()     # reverse so chapters download in reading order

x = 1  # running counter used to name the downloaded images
for link in a:
    link = 'https://www.mkzhan.com'+link
    try:
        t = requests.get(link)
        parse = t.text
        parse = parse.encode('gbk', 'ignore').decode('gbk')  # drop characters gbk cannot represent
        #print(parse)
        treee = etree.HTML(parse)
        image = treee.xpath('//div[@class="rd-article__pic hide"]/img[@class="lazy-read"]/@data-src')
        for img in image:
            s = urllib.request.urlretrieve(img, 'E:\\Python\\python_image\\漫客栈\\%s.jpg' % x)
            x = x+1
            print("Downloading %s" % img)

    except Exception as e:
        print(e)
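Both scripts lean on the `encode('gbk', 'ignore').decode('gbk')` round-trip: encoding with `errors='ignore'` silently drops any character that gbk cannot represent, which keeps stray characters in the page source from breaking later processing. A small illustration (the sample string is made up; `\u200b` is a zero-width space, which has no gbk encoding):

```python
# Chinese text plus a zero-width space that gbk cannot encode
s = '斗罗大陆\u200b'

# Encoding with errors='ignore' drops the unencodable character;
# decoding back yields a plain str without it
cleaned = s.encode('gbk', 'ignore').decode('gbk')
print(cleaned)               # 斗罗大陆
print(len(s), len(cleaned))  # 5 4
```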

Reposted from: https://www.cnblogs.com/wujf-myblog/p/11002313.html
