Python: Scraping Bing Desktop Wallpaper Images

星海千寻


Article link: https://blog.csdn.net/qq_29367075/article/details/111714545

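The desktop wallpapers come from Bing, by way of the wallpaper site https://bing.ioliu.cn/.
Each page shows 12 photos, each photo has a download link for its high-resolution image, and the gallery spans multiple pages.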

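The annoying part is that F12 does nothing once the page is open, so I fetched the page source directly with Python and found out why.

123 is the key code for F12: the site blocks F12, Ctrl+Shift+I, and Ctrl+S.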

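That does not stop us, though: an HTTP client such as urllib.request (the full code below uses requests) can fetch the entire page.
The text description of each image is in an <h3> element.
The download link of each image is in an <a class="ctrl download"> element.
The total page count is in the <span> inside <div class="page">.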
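
The three lookups above can be tried without touching the network. The following is a minimal sketch that uses only the standard library (html.parser in place of BeautifulSoup, urllib.request for fetching). The class name WallpaperPageParser, the helper fetch_html, and the SAMPLE_HTML markup are all hypothetical; the sample only mimics the element and class names described above, and the "2 / 150" pager text is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class WallpaperPageParser(HTMLParser):
    """Collects <h3> titles, download hrefs, and the pager text from one page."""

    def __init__(self):
        super().__init__()
        self.titles = []          # text of every <h3>
        self.download_links = []  # href of every <a class="ctrl download">
        self.page_text = ""       # text of the first <span> inside <div class="page">
        self._in_h3 = False
        self._in_page_div = False
        self._in_page_span = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3":
            self._in_h3 = True
            self.titles.append("")
        elif tag == "a" and "ctrl" in classes and "download" in classes:
            self.download_links.append(attrs.get("href"))
        elif tag == "div" and "page" in classes:
            self._in_page_div = True
        elif tag == "span" and self._in_page_div and not self.page_text:
            self._in_page_span = True

    def handle_endtag(self, tag):
        # Simplification: any closing </div> ends the pager region,
        # which is fine as long as the pager contains no nested <div>.
        if tag == "h3":
            self._in_h3 = False
        elif tag == "span":
            self._in_page_span = False
        elif tag == "div":
            self._in_page_div = False

    def handle_data(self, data):
        if self._in_h3:
            self.titles[-1] += data
        elif self._in_page_span:
            self.page_text += data


def fetch_html(url):
    """Download a page with urllib.request, sending a browser-like User-Agent."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical markup that imitates the structure described in the article.
SAMPLE_HTML = """
<div class="item"><h3>Sample title A</h3>
  <a class="ctrl download" href="/photo/A?force=download">download</a></div>
<div class="item"><h3>Sample title B</h3>
  <a class="ctrl download" href="/photo/B?force=download">download</a></div>
<div class="page"><a href="/?p=1">prev</a><span>2 / 150</span><a href="/?p=3">next</a></div>
"""

parser = WallpaperPageParser()
parser.feed(SAMPLE_HTML)
total_pages = int(parser.page_text.split("/")[1])
print(parser.titles, parser.download_links, total_pages)
```

To run it against the real site, parser.feed(fetch_html(url)) would replace the sample string, provided the live markup still matches the description.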

The URL of each page looks like this:
https://bing.ioliu.cn/?p=1
https://bing.ioliu.cn/?p=2
https://bing.ioliu.cn/?p=3
https://bing.ioliu.cn/?p=4
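
The pattern above is just the root URL plus a p query parameter, so the page URLs can be generated rather than typed out. A small sketch, where the names ROOT_URL and page_url are illustrative only:

```python
from urllib.parse import urlencode

ROOT_URL = "https://bing.ioliu.cn/?"


def page_url(page):
    """Build the URL of one gallery page from its 1-based page number."""
    return ROOT_URL + urlencode({"p": page})


# The four URLs listed in the article.
urls = [page_url(p) for p in range(1, 5)]
for url in urls:
    print(url)
```

urlencode is more than this single integer parameter needs, but it keeps the query string correctly escaped if further parameters are ever added.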

The complete code is as follows:

import os
import re
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup

rootrurl = 'https://bing.ioliu.cn/?'
save_dir = 'D:/estimages/'
headers = {
    "Referer": rootrurl,
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive'
}  # Request headers that make the script look like a browser

def saveOneImg(dir_path, img_url, title):
    # Build fresh headers for every image, with the image URL as the Referer,
    # to avoid HTTP 403 responses caused by hotlink protection.
    new_headers = {
        "Referer": img_url,
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
        'Accept-Language': 'en-US,en;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive'
    }

    try:
        img = requests.get(img_url, headers=new_headers)  # Request the actual image URL
        if img.status_code == 200:
            # Write the image bytes to a local file; the with block closes the file
            with open(os.path.join(dir_path, '{}.jpg'.format(title)), 'wb') as jpg:
                jpg.write(img.content)
            print(img_url)
            return True
        else:
            return False
    except Exception as e:
        print('exception occurs: ' + img_url)
        print(e)
        return False


def getSubTitleName(text):
    # Alternative that also keeps ASCII letters and digits:
    # cop = re.compile("[^\u4e00-\u9fa5a-zA-Z0-9]")
    cop = re.compile("[^\u4e00-\u9fa5]")  # Matches every character that is not a Chinese character
    string1 = cop.sub('', text)  # Remove the matched characters, keeping only the Chinese text
    return string1


def getOnePage(page):
    params = {
        'p': page,
    }
    url = rootrurl + urlencode(params)
    print(url)
    html = BeautifulSoup(requests.get(url, headers=headers).text, features="html.parser")
    titles = html.find_all('h3')
    lis = html.find_all('a', {'class': 'ctrl download'})

    # Pair each download link with its title; rootrurl[:-2] drops the trailing "/?"
    for a, title in zip(lis, titles):
        saveOneImg(save_dir, rootrurl[:-2] + a.get('href'), getSubTitleName(title.get_text()))


def getNumOfPages():
    html = BeautifulSoup(requests.get(rootrurl, headers=headers).text, features="html.parser")
    # The pager <span> text is split on '/'; the second part is the total page count
    return int(html.find('div', {'class': 'page'}).find('span').get_text().split('/')[1])


if __name__ == '__main__':
    os.makedirs(save_dir, exist_ok=True)  # Make sure the output folder exists
    total_pages = getNumOfPages()

    for i in range(1, total_pages + 1):
        getOnePage(i)
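
The script fetches the pages one after another, yet the pages are independent of each other, so the loop could also run on a small thread pool. The following is only a sketch of that idea, not part of the original script: download_page is a hypothetical stand-in for getOnePage so that the example runs without network access, and download_all and max_workers=4 are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor


def download_page(page):
    """Hypothetical stand-in for getOnePage(page); returns the page number when done."""
    return page


def download_all(total_pages, worker=download_page, max_workers=4):
    """Run worker(page) for pages 1..total_pages on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, even if pages finish out of order.
        return list(pool.map(worker, range(1, total_pages + 1)))


finished = download_all(5)
print(finished)
```

In the real script, worker=getOnePage would be passed instead. A small max_workers keeps the load on the site modest.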

The result: after the script runs, the downloaded images are in the save_dir folder, each named after its Chinese title.
