Python web crawling (1. Comics)

1. Fetching a single image from a web page:
2. The code to fetch it:

import requests

# Download a single image and write it to disk.
r = requests.get('https://img.wallpapersafari.com/desktop/1536/864/5/53/uyvkzZ.jpeg')
with open('图片.jpeg', 'wb') as f:
    f.write(r.content)   # the with block closes the file automatically, no f.close() needed
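
If the saved file turns out to be empty or unreadable, it is worth first checking that the request actually succeeded. A minimal sketch (same URL as above; raise_for_status() raises an exception on any 4xx/5xx response):

import requests

r = requests.get('https://img.wallpapersafari.com/desktop/1536/864/5/53/uyvkzZ.jpeg')
r.raise_for_status()   # stops here with requests.HTTPError if the server refused the request
with open('图片.jpeg', 'wb') as f:
    f.write(r.content)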

3. Note: sometimes the downloaded image cannot be opened.
This happens because the site has anti-crawler checks; the request needs a User-Agent header added to it:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'}  # get this string by typing window.navigator.userAgent into the browser console (F12 → Console)

import requests

# Send a browser User-Agent so the site does not reject the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'}
r = requests.get('https://get.wallhere.com/photo/space-galaxy-planet-fantasy-'
                 'wallpaper-desktop-landscape-surreal-1579551.png',
                 headers=headers)
with open('图片.png', 'wb') as f:
    f.write(r.content)
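
Every request from here on needs these same headers, so a small convenience (not in the original code) is to set them once on a requests.Session and reuse it:

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                      'Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'})
# Every request made through this session now carries the User-Agent automatically.
r = session.get('https://get.wallhere.com/photo/space-galaxy-planet-fantasy-'
                'wallpaper-desktop-landscape-surreal-1579551.png')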

The downloaded image: (screenshot)

4. Downloading every image on a page:

Parse the page, then fetch each image. The code:

import requests, re, os
from bs4 import BeautifulSoup


def get_content(target):
    # Pretend to be a normal browser (counters the site's anti-crawler check):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'}

    # Fetch the wallpaper's own page:
    r = requests.get(url=target, headers=headers)

    # Parse the HTML and locate the link to the full-size image:
    textwrap = BeautifulSoup(r.content, 'lxml')
    pictures = textwrap.find('div', class_='hub-photomodal')
    pictures = pictures.find_all('a')
    for picture in pictures:
        # Return the address where the full-size image is stored:
        return picture.get('href')


if __name__ == '__main__':
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'}
    server = 'https://wallhere.com'
    r = requests.get(server, headers=headers)
    r.encoding = 'utf-8'
    text = BeautifulSoup(r.text, 'lxml')
    picture_urls = text.find('div', class_='hub-mediagrid hub-fleximages hub-loadphoto')
    picture_urls = picture_urls.find_all('a')
    os.makedirs('图片', exist_ok=True)   # make sure the output folder exists
    for url in picture_urls:
        urls = url.get('href')
        # Keep only the links that lead to wallpaper pages:
        url_img = re.findall('wallpaper', urls)
        try:
            if url_img[0] == 'wallpaper':
                url = server + urls
            else:
                continue
        except IndexError:
            continue
        img_url = get_content(url)                       # address of the full-size image
        picture_r = requests.get(img_url, headers=headers)
        with open(os.path.join('图片', img_url.strip().split('-')[-1]), 'wb') as file:
            file.write(picture_r.content)
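
A loop like this issues one request per thumbnail plus one per full-size image, so it is worth being gentle with the server. One option (not part of the original script; polite_get and delay are names introduced here) is a small wrapper that pauses after each request:

import time
import requests

def polite_get(url, headers, delay=1.0):
    # Fetch a URL, then pause briefly so consecutive calls are spaced out.
    r = requests.get(url, headers=headers)
    time.sleep(delay)   # wait before the caller issues the next request
    return r

Using polite_get(img_url, headers) in place of requests.get(img_url, headers=headers) inside the loop spaces the downloads out by roughly a second each.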

5. Downloading a comic:
Inspect the page elements:

Check how the site protects its images from scraping: open view-source:https://www.dmzj.com/view/yaoshenji/41917.html (F12, or right-click → View Source). The images are protected against hotlinking, so the download request in the code below carries a Referer header pointing back to the chapter page.
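
A quick way to confirm that the Referer header is what matters is to request one image with and without it (the image URL below is a hypothetical one copied from the chapter page; the exact status codes depend on the site):

import requests

img_url = 'http://www.txydd.com/example.jpg'   # hypothetical image address taken from the chapter page
referer = {'Referer': 'http://www.txydd.com/chapter/10967/358966.html'}

print(requests.get(img_url).status_code)                    # typically rejected (e.g. 403) without the Referer
print(requests.get(img_url, headers=referer).status_code)   # accepted once the Referer is sent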

import requests, os
from bs4 import BeautifulSoup

# Address of the chapter page:
url = 'http://www.txydd.com/chapter/10967/358966.html'
download_header = {'Referer': 'http://www.txydd.com/chapter/10967/358966.html'}  # counters the site's anti-hotlinking check
r = requests.get(url, headers=download_header)
r.encoding = 'utf-8'
bs = BeautifulSoup(r.text, 'lxml')
chapters = bs.find('ol', id='j_chapter_list')
chapters = chapters.find_all('img')

os.makedirs('虫师', exist_ok=True)   # folder the comic is saved into
for chapter in chapters:
    img_url = chapter.get('src')     # image address; some sites put it in a data-* attribute instead
    chapter_content = requests.get(img_url, headers=download_header).content
    with open(os.path.join('虫师', img_url.strip().split('_')[-1]), 'wb') as f:
        f.write(chapter_content)
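
The script above downloads a single chapter. To grab the whole comic, the same steps can be looped over every chapter URL; collecting those URLs from the comic's index page is not shown here, so chapter_urls below is a hypothetical placeholder list:

import requests, os
from bs4 import BeautifulSoup

chapter_urls = [
    'http://www.txydd.com/chapter/10967/358966.html',   # hypothetical: fill this from the comic's index page
]
os.makedirs('虫师', exist_ok=True)
for chapter_url in chapter_urls:
    header = {'Referer': chapter_url}                    # each image request refers back to its own chapter page
    r = requests.get(chapter_url, headers=header)
    r.encoding = 'utf-8'
    bs = BeautifulSoup(r.text, 'lxml')
    images = bs.find('ol', id='j_chapter_list').find_all('img')
    for img in images:
        img_url = img.get('src')                         # may need a data-* attribute instead if images are lazy-loaded
        with open(os.path.join('虫师', img_url.strip().split('_')[-1]), 'wb') as f:
            f.write(requests.get(img_url, headers=header).content)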



         