Scraping pixiv's Daily Recommendations

Latest recommended article published on 2024-08-14 19:17:19

Eloik


Category: Python爬虫实战 (Python Crawlers in Practice). Tags: crawler, python

Article link: https://blog.csdn.net/weixin_45826022/article/details/109406389


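Reconnaissance

Open the site.

Scrolling down shows that the page loads its content dynamically.
Keep scrolling and it stops after 500 images, so the daily recommendations come to 500 images per day. ~~Wow.~~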

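Scraping a single image

First, click on an image to open its page.

The image can be enlarged.

Naturally we want the high-resolution version.
Press F12 to inspect the page.

The URL found there is the original image!

Fetch the page source and save it to a txt file.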

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}

url = 'https://www.pixiv.net/artworks/85300112'
response = requests.get(url, headers=headers)
# Save the page source as UTF-8 so it can be searched in a text editor
with open('M:/a.txt', 'wb') as f:
    f.write(response.text.encode('utf8'))

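Copy the original-image URL found a moment ago and check whether it appears in the txt file.

Look at the context: the URL is the value of the key original. Searching the file for original gives only three matches, and of the other two, one is capitalized and the other has no quotes. So the quoted key "original" is unique, and a regular expression anchored on it can pull the original-image URL straight out.

Extract the original-image URL with a regular expression: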

import re

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}

url = 'https://www.pixiv.net/artworks/85300112'
response = requests.get(url, headers=headers)

# Capture the value of the quoted "original" key, which sits just before "tags"
picture = re.search(r'"original":"(.+?)"},"tags"', response.text)
print(picture.group(1))

The result: the original-image URL was extracted successfully!
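To see what the pattern captures without touching the network, here is a minimal sketch that runs the same regular expression on a made-up fragment. The sample string, the URLs inside it, and the helper name extract_original are hypothetical; only the pattern itself comes from the code above.

```python
import re

# Hypothetical fragment, shaped like the text around the "original" key.
SAMPLE = (
    '"urls":{"small":"https://example.invalid/s/85300112_p0.jpg",'
    '"original":"https://example.invalid/img-original/85300112_p0.png"},'
    '"tags":{"authorId":"0"}'
)

ORIGINAL_RE = re.compile(r'"original":"(.+?)"},"tags"')


def extract_original(page_text):
    """Return the captured original-image URL, or None if the key is missing."""
    match = ORIGINAL_RE.search(page_text)
    return match.group(1) if match else None


print(extract_original(SAMPLE))
print(extract_original('no such key here'))
```

Returning None instead of calling .group directly makes it possible to skip a page whose layout does not match the pattern.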

With the original-image URL in hand, download the image and save it:

import re

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}

url = 'https://www.pixiv.net/artworks/85300112'
response = requests.get(url, headers=headers)

picture = re.search(r'"original":"(.+?)"},"tags"', response.text)
print(picture.group(1))

# Request the image itself with the same headers (user-agent and referer)
pic = requests.get(picture.group(1), headers=headers)
# Use the last three characters of the URL as the file extension
with open('M:/1.%s' % (picture.group(1)[-3:]), 'wb') as f:
    f.write(pic.content)

Download succeeded!
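The slice [-3:] assumes a three-letter extension, so a URL ending in a longer one such as .jpeg would be cut short. A minimal sketch of an alternative using only the standard library follows; the helper name extension_of and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse


def extension_of(url, default='jpg'):
    """Return the file extension of the URL's path without the leading dot."""
    path = urlparse(url).path          # drops any ?query or #fragment
    ext = os.path.splitext(path)[1]    # e.g. '.png', or '' if there is none
    return ext[1:] if ext else default


print(extension_of('https://example.invalid/img-original/85300112_p0.png'))
print(extension_of('https://example.invalid/img/1_p0.jpeg?x=1'))
```

The default value is only a fallback for URLs without any extension; whether that case ever occurs on the site is not something the article establishes.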

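Naming files by hand is tedious, so let's extract the image title as well. A little above the spot where the original-image URL was found there is an "illustTitle" key. A search shows that this key is unique too, which is convenient, so we write another regular expression to grab the title.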

import re

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}

url = 'https://www.pixiv.net/artworks/85300112'
response = requests.get(url, headers=headers)

# Extract the image title
name = re.search(r'"illustTitle":"(.+?)"', response.text)
print(name.group(1))
# Extract the original-image URL
picture = re.search(r'"original":"(.+?)"},"tags"', response.text)
print(picture.group(1))

pic = requests.get(picture.group(1), headers=headers)
# Name the file <title>.<extension>
with open('M:/%s.%s' % (name.group(1), picture.group(1)[-3:]), 'wb') as f:
    f.write(pic.content)

It runs successfully.

Wrap the single-image code in a function:

import re

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
# Save path
path = 'M:/'


def getSinglePic(url):
    response = requests.get(url, headers=headers)
    # Extract the image title
    name = re.search(r'"illustTitle":"(.+?)"', response.text)
    # Extract the original-image URL
    picture = re.search(r'"original":"(.+?)"},"tags"', response.text)
    pic = requests.get(picture.group(1), headers=headers)
    with open(path + '%s.%s' % (name.group(1), picture.group(1)[-3:]), 'wb') as f:
        f.write(pic.content)

Test it with a different image:

url = 'https://www.pixiv.net/artworks/85317626'
getSinglePic(url)

No problems.

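Getting the URLs of all daily-recommendation images

During reconnaissance we saw that the page loads dynamically, so scroll to the very bottom and look at the Network tab.

Comparing the requests shows that every one goes to https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=?&format=json and only the value of p changes. There are 10 responses in total with 50 images each, so requesting this URL with p running from 1 to 10 yields the information for all 500 images!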
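The enumeration of p can be sketched as below. The query parameters are the ones seen in the Network tab; the function name ranking_page_urls and the use of urlencode are illustrative choices, not part of the original script.

```python
from urllib.parse import urlencode

BASE = 'https://www.pixiv.net/ranking.php'


def ranking_page_urls(pages=10):
    """Build the JSON ranking URL for every page number from 1 to `pages`."""
    urls = []
    for p in range(1, pages + 1):
        query = urlencode({'mode': 'daily', 'content': 'illust', 'p': p, 'format': 'json'})
        urls.append(BASE + '?' + query)
    return urls


for u in ranking_page_urls():
    print(u)
```

Building the query from a dict keeps the parameter names in one place if another mode or content type is tried later.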

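But what exactly should we extract?
Expanding a single entry reveals two image URLs, but opening them shows they are not the originals; the resolution is too low.
We can already scrape a single image, so all we need is the artwork page URL for each of these 500 images, and the originals follow directly.
The response does not contain those page URLs, though, so open a few images and look at their addresses: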

https://www.pixiv.net/artworks/85311013
https://www.pixiv.net/artworks/85318602
https://www.pixiv.net/artworks/85316875

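Only the trailing number changes, and that number is exactly the illust_id, so all we need to extract is every illust_id!

Check where illust_id appears in the response.

A search finds exactly 50 matches!

Extract illust_id with a regular expression: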

import re

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}

url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=10&format=json'
res = requests.get(url, headers=headers)
# Each entry carries a numeric "illust_id" followed by a comma
illust_id = re.findall(r'"illust_id":(\d+),', res.text)
print(len(illust_id), illust_id)

The output shows the extraction succeeded.
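Because the response is requested with format=json, the ids could also be read by parsing it rather than pattern matching. This is a sketch under assumptions: the sample below is made up, and the top-level key "contents" holding the list of entries is an assumed layout that should be checked against a real response. Only the "illust_id" key is taken from what the article shows.

```python
import json

# Made-up sample; the "contents" wrapper is an assumed layout.
SAMPLE = '{"contents": [{"illust_id": 85311013, "title": "a"}, {"illust_id": 85318602, "title": "b"}]}'


def illust_ids(json_text, list_key='contents'):
    """Return every illust_id found in the list stored under `list_key`."""
    data = json.loads(json_text)
    return [str(item['illust_id']) for item in data.get(list_key, []) if 'illust_id' in item]


ids = illust_ids(SAMPLE)
print(ids)
print(['https://www.pixiv.net/artworks/' + i for i in ids])
```

Parsing avoids depending on the exact spacing and key order that a regular expression has to match, at the cost of depending on the assumed nesting.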

Prepend the artwork base URL to each id to get the image page URLs:

# Reuses the imports and headers defined above
url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=10&format=json'
res = requests.get(url, headers=headers)
illust_id = re.findall(r'"illust_id":(\d+),', res.text)
# Build each artwork page URL from its id
picUrl = ['https://www.pixiv.net/artworks/' + i for i in illust_id]

for i in picUrl:
    print(i)

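The output is correct.

Downloading all images

Wrap the code that collects all the image URLs in a function and call the single-image download function directly inside it: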

def getAllPicUrl():
    count = 1
    # The ranking is served as 10 JSON pages of 50 entries each
    for n in range(1, 10 + 1):
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%d&format=json' % n
        response = requests.get(url, headers=headers)
        illust_id = re.findall(r'"illust_id":(\d+),', response.text)
        picUrl = ['https://www.pixiv.net/artworks/' + i for i in illust_id]
        for artwork_url in picUrl:
            print('Downloading image %d' % count, end='   ')
            getSinglePic(artwork_url)
            print('done')
            count += 1

Run it as a test:

getAllPicUrl()

It runs perfectly.

Full code (not quite)

import re

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
# Download directory
path = 'M:/'


def getSinglePic(url):
    response = requests.get(url, headers=headers)
    # Extract the image title
    name = re.search(r'"illustTitle":"(.+?)"', response.text)
    # Extract the original-image URL
    picture = re.search(r'"original":"(.+?)"},"tags"', response.text)
    pic = requests.get(picture.group(1), headers=headers)
    with open(path + '%s.%s' % (name.group(1), picture.group(1)[-3:]), 'wb') as f:
        f.write(pic.content)


def getAllPicUrl():
    count = 1
    # The ranking is served as 10 JSON pages of 50 entries each
    for n in range(1, 10 + 1):
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%d&format=json' % n
        response = requests.get(url, headers=headers)
        illust_id = re.findall(r'"illust_id":(\d+),', response.text)
        picUrl = ['https://www.pixiv.net/artworks/' + i for i in illust_id]
        for artwork_url in picUrl:
            print('Downloading image %d' % count, end='   ')
            getSinglePic(artwork_url)
            print('done')
            count += 1


getAllPicUrl()

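The download was going along happily, and then...

Look at the error message: the path name is invalid. A file cannot be named ****.jpg? The title consists of * characters, which Windows does not allow in file names.
Check the download folder.
~~What the twii’j’w!@1~~

In that case, let's replace the characters that are not allowed in file names.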

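# Global counter; its value is substituted for illegal characters
repeat = 1


# Inside getSinglePic, which must declare `global repeat`:
    name = re.search(r'"illustTitle":"(.+?)"', response.text)
    name = name.group(1)
    # Characters Windows forbids in file names: \ / * ? " : < > |
    if re.search(r'[\\/*?":<>|]', name) is not None:
        name = re.sub(r'[\\/*?":<>|]', str(repeat), name)
        repeat += 1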

Run it again and the problem is gone.
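The replacement step can be pulled out into a small function so that it can be tested on its own. The sketch below is a variant rather than the code above: it substitutes a fixed underscore instead of the running counter, and the name safe_filename is hypothetical. One trade-off is that two different titles can then map to the same file name.

```python
import re

# Characters that Windows does not allow in file names.
ILLEGAL = re.compile(r'[\\/*?":<>|]')


def safe_filename(title, replacement='_'):
    """Replace every illegal character in `title` with `replacement`."""
    cleaned = ILLEGAL.sub(replacement, title)
    # Fall back to a placeholder if nothing usable is left.
    return cleaned.strip() or 'untitled'


print(safe_filename('****'))
print(safe_filename('a/b:c?'))
```

Appending the illust_id to the cleaned title would be one way to keep names unique without a global counter.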

Full code (hooray)

import re

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}

# Download directory
path = 'M:/'
# Global counter; its value is substituted for illegal characters
repeat = 1


def getSinglePic(url):
    global repeat
    response = requests.get(url, headers=headers)
    # Extract the image title
    name = re.search(r'"illustTitle":"(.+?)"', response.text)
    name = name.group(1)
    # Replace characters Windows forbids in file names: \ / * ? " : < > |
    if re.search(r'[\\/*?":<>|]', name) is not None:
        name = re.sub(r'[\\/*?":<>|]', str(repeat), name)
        repeat += 1
    # Extract the original-image URL
    picture = re.search(r'"original":"(.+?)"},"tags"', response.text)
    pic = requests.get(picture.group(1), headers=headers)
    with open(path + '%s.%s' % (name, picture.group(1)[-3:]), 'wb') as f:
        f.write(pic.content)


def getAllPicUrl():
    count = 1
    # The ranking is served as 10 JSON pages of 50 entries each
    for n in range(1, 10 + 1):
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%d&format=json' % n
        response = requests.get(url, headers=headers)
        illust_id = re.findall(r'"illust_id":(\d+),', response.text)
        picUrl = ['https://www.pixiv.net/artworks/' + i for i in illust_id]
        for artwork_url in picUrl:
            print('Downloading image %d' % count, end='   ')
            getSinglePic(artwork_url)
            print('done')
            count += 1


getAllPicUrl()
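A single failed page stops the whole run: when re.search finds nothing it returns None, and calling .group on None raises AttributeError. A minimal sketch of a loop that records failures and carries on is shown below. The names download_all and fake_download, the injected download callable, and the delay parameter are all hypothetical, and pausing between requests is a courtesy assumption rather than anything the site documents.

```python
import time


def download_all(urls, download, delay=0.0):
    """Call download(url) for each URL; return the list of (url, error) failures."""
    failures = []
    for count, url in enumerate(urls, start=1):
        print('Downloading image %d' % count, end='   ')
        try:
            download(url)
            print('done')
        except Exception as exc:  # e.g. AttributeError from a failed re.search
            print('failed: %s' % exc)
            failures.append((url, exc))
        time.sleep(delay)  # pause between requests
    return failures


def fake_download(url):
    """Stand-in for getSinglePic that fails on one URL."""
    if url.endswith('2'):
        raise ValueError('no match')


failed = download_all(['u1', 'u2', 'u3'], fake_download)
print([u for u, _ in failed])
```

In the script above, getSinglePic could be passed as download, and the returned list shows which artwork pages to retry or inspect by hand.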