A Detailed Tutorial on Scraping Tens of Thousands of Images from P站 in One Night with Python

Latest recommended article published on 2024-04-22 10:11:51

Pymili

Category: Python web scraping. Tags: python, web scraping, data mining

Article link: https://blog.csdn.net/qq_53280175/article/details/121894718


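This tutorial uses Python's requests and re libraries to scrape and download images from P站. I crawled 47,129 pages and collected more than 20,000 images, which took about 20 hours. The site has several hundred thousand pages in total, but that is far too many, so I stopped at roughly 40,000 pages. Adjust the page range to suit your own needs.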

Parsing the page

This tutorial uses the requests and re libraries. For requests, see my tutorial: 我们来一起学Python爬虫吧-第一章requests库_PYmili的博客-CSDN博客

For re, see my tutorial: Python超详细的正则表达式_PYmili的博客-CSDN博客_python正则表达式

Install requests first. The site to scrape is: P站-Pixiv-原创画师分享平台-触站（原画师通）

pip install requests

Open any artwork page and view its source.

The source shows that the image links sit inside a script block like this one:

<script type="application/ld+json">
                {
  "@context": "https://ziyuan.baidu.com/contexts/cambrian.jsonld",
  "@id": "https://www.huashi6.com/draw/174366",
  "title": "水滴石穿的好女人",
  "images": [
    "//img2.huashi6.com/images/resource/2019/06/06/7507844h7p0.png?imageMogr2/quality/100/interlace/1/thumbnail/800x/format/jpeg",
    "//img2.huashi6.com/images/resource/2019/06/06/7507844h7p1.png?imageMogr2/quality/100/interlace/1/thumbnail/800x/format/jpeg"
  ],
  "description": "“滴水系列”让·奥鲁塔和表情差。泳装快乐…。夏天受欢迎了!FGO和游戏部出本!布景也加油! !◎你的社团“necomicle”，星期一被安排在西地区“A”块- 38a。网页目录→twitter: twitter/necomi_info",
  "pubDate": "2021-12-08T16:59:54",
  "upDate": "2021-12-08T16:59:54"
}
        </script>

So once we fetch the page source and pull out this block, we have the links:

r = requests.get(url=_url)
# re.S lets "." match newlines, so the whole multi-line script body is captured
scr = re.search(r'<script type=.*?>(.*?)</script>', r.text, re.S)
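
Since the block shown above is JSON, the links can also be read with the standard json module instead of searching the text by hand. The following is only a minimal sketch, assuming every artwork page carries a block of the same shape as the sample; the function name extract_image_urls and the shortened sample string are made up for illustration and are not part of the original script.

```python
import json
import re


def extract_image_urls(html):
    """Return full image URLs (query string removed) from the JSON-LD block."""
    m = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
    if not m:
        return []
    try:
        data = json.loads(m.group(1))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    urls = []
    for link in data.get("images", []):
        link = link.split("?", 1)[0]      # drop the thumbnail parameters after "?"
        if link.startswith("//"):
            link = "https:" + link        # the page uses protocol-relative URLs
        urls.append(link)
    return urls


# Shortened, hypothetical sample modelled on the block shown above
sample = '''<script type="application/ld+json">
{"@id": "https://www.huashi6.com/draw/174366",
 "images": ["//img2.huashi6.com/images/resource/2019/06/06/7507844h7p0.png?imageMogr2/quality/100"]}
</script>'''
print(extract_image_urls(sample))
```

If the site ever emits a block that is not valid JSON, this sketch returns an empty list, whereas the regular-expression approach would still find links.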

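Fetching a single page is not enough, but every artwork page is built from the same template, so the same extraction works on all of them. The full code: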

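import os
import re
import time

import requests

SAVE_DIR = "Index_Image"  # folder the images are saved to


def main(_url):  # _url: address of the artwork page to request
    """Fetch one artwork page, parse it, and save its images."""
    print(_url)
    try:
        r = requests.get(url=_url, timeout=10)
    except requests.RequestException:
        print("Error")
        return
    # Grab the contents of the first <script type=...> block
    scr = re.search(r'<script type=.*?>(.*?)</script>', r.text, re.S)
    if not scr:
        print("Error")
        return
    for i in scr.group(1).split(','):
        # Extract the image URL, leaving out the query string after "?"
        image = re.search(r'(//img2\.huashi6\.com/images/resource/.*?)\?(.*?)"', i, re.S)
        if not image:
            continue
        print(image.group(1))
        # Save the image
        try:
            req = requests.get(url="https:" + image.group(1), timeout=10)
            f = os.path.basename(image.group(1))
            with open(os.path.join(SAVE_DIR, f), "wb") as w:
                w.write(req.content)
            print("True")
        except (requests.RequestException, OSError):
            print("False")


if __name__ == "__main__":
    os.makedirs(SAVE_DIR, exist_ok=True)  # create the output folder if it is missing
    # 6900 is the first page to crawl; the loop stops just before page 70000
    for i in range(6900, 70000):
        main(f"https://www.huashi6.com/draw/{i}")
        time.sleep(4)  # throttle requests so the site does not block the IP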
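
With a 4-second pause per page, range(6900, 70000) covers 63,100 pages, so the pauses alone add up to 63,100 × 4 s = 252,400 s, or about 70 hours. A run that long is likely to be interrupted, so it may help to skip images that are already on disk when restarting. The sketch below is one possible way to do that, not part of the original script; the helper names local_path and needs_download are hypothetical, and the demo runs in a temporary folder without touching the network.

```python
import os
import tempfile


def local_path(image_url, save_dir="Index_Image"):
    """Map an image URL to the file path it would be saved under."""
    name = os.path.basename(image_url.split("?", 1)[0])
    return os.path.join(save_dir, name)


def needs_download(image_url, save_dir="Index_Image"):
    """Return True when the image is not on disk yet, or the file is empty."""
    path = local_path(image_url, save_dir)
    return not (os.path.isfile(path) and os.path.getsize(path) > 0)


# Demo in a throwaway folder
with tempfile.TemporaryDirectory() as demo_dir:
    url = "//img2.huashi6.com/images/resource/2019/06/06/7507844h7p0.png"
    print(needs_download(url, demo_dir))  # nothing has been saved yet
    with open(local_path(url, demo_dir), "wb") as fh:
        fh.write(b"fake image bytes")
    print(needs_download(url, demo_dir))  # the file now exists
```

In the crawler, calling needs_download before the image request would let a restarted run pass over files it has already saved.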

If you have questions, you can reach me in the QQ group: 706128290
