Scraping the funny pictures on 百思不得姐 (Budejie)

Latest recommended article published on 2021-02-07 21:17:53

Juno的学习日记 (Juno's Study Diary)

Category: web crawlers. Tags: multithreading, web crawler

Article link: https://blog.csdn.net/weixin_45075241/article/details/90545740


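A couple of days ago I watched a video about scraping jokes from 百思不得姐 (Budejie), so I went to have a look at the site myself. It turned out to have an audio section as well, and I decided to scrape that, using the multithreading I had just learned O(∩_∩)O. I did not expect to fall straight into a pit. The audio section has ten pages, but all ten pages have exactly the same content. Watching my own progress messages while the crawler ran made me question reality: "xxxxxx has finished downloading" would show up five times in a row, for example. It took me a long time to work out that the site itself was the problem. Oh well, since I was already there, I figured I might as well scrape the pictures instead.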
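Repeated downloads like these can be caught early by remembering which media URLs have already been queued. The snippet below is only a minimal sketch of that idea and is not part of my crawler: the `SeenUrls` class, its `add_if_new` method, and the `example.com` URLs are all made up for illustration.

```python
import threading


class SeenUrls:
    """Thread-safe record of URLs that have already been queued (hypothetical helper)."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_new(self, url):
        """Return True the first time a URL is offered, False on every repeat."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


seen = SeenUrls()
# Made-up URLs standing in for two listing pages that return identical content.
page_1 = ["http://example.com/a.mp3", "http://example.com/b.mp3"]
page_2 = ["http://example.com/a.mp3", "http://example.com/b.mp3"]
# Only URLs that have not been seen before survive the filter.
fresh = [url for url in page_1 + page_2 if seen.add_if_new(url)]
print(fresh)
```

A producer thread could call `add_if_new` before putting a URL on the download queue, so a site that serves the same page ten times would only trigger one download per file.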
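URL: http://www.budejie.com/pic/
Appending a number to the URL gives you the page with that number, and there are no tricks involved: a plain requests call is enough to fetch it. The code is much the same as the one I wrote for scraping meme stickers (表情包). The differences are that I used pyquery as the parsing library and that I also handle characters that cannot be used in file names: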

import requests
from urllib import request
import os
import time
import threading
from queue import Queue, Empty
from pyquery import PyQuery as pq
import re


class Productor(threading.Thread):
    """Producer: fetches listing pages and queues (pic_url, filename) pairs."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
    }

    def __init__(self, page_queue, pic_queue, *args, **kwargs):
        super(Productor, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.pic_queue = pic_queue

    def run(self):
        while True:
            try:
                # get_nowait() avoids the race between empty() and a blocking get():
                # another thread may take the last URL in between the two calls.
                url = self.page_queue.get_nowait()
            except Empty:
                break
            self.get_page_parse(url)

    def get_page_parse(self, url):
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
        except requests.RequestException as e:
            print("failed to fetch " + url + ": " + str(e))
            return
        doc = pq(response.text)
        pics = doc(".j-r-list-tool-ct-fx div")
        for pic in pics.items():
            pic_url = pic.attr("data-pic")
            pic_name = pic.attr("data-text")
            # Skip elements that carry no picture URL or no title.
            if not pic_url or not pic_name:
                continue
            suffix = os.path.splitext(pic_url)[1]
            # Strip whitespace and characters that are not allowed in file names.
            name = re.sub(r'[\s\\/:*?"<>|《》#？-]', "", pic_name)
            self.pic_queue.put((pic_url, name + suffix))
            time.sleep(0.5)
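

class Consumer(threading.Thread):
    """Consumer: downloads queued pictures until it receives a None sentinel."""

    def __init__(self, page_queue, pic_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.pic_queue = pic_queue

    def run(self):
        while True:
            item = self.pic_queue.get()
            if item is None:  # sentinel put by main(): no more pictures will arrive
                break
            pic_url, filename = item
            try:
                request.urlretrieve(pic_url, os.path.join("baisipic", filename))
                print(filename + " downloaded")
            except Exception as e:
                # Report the failure instead of silently swallowing it.
                print(filename + " failed: " + str(e))
            time.sleep(0.5)


def main():
    # urlretrieve fails if the target directory does not exist.
    os.makedirs("baisipic", exist_ok=True)
    page_queue = Queue(50)
    pic_queue = Queue(800)
    for i in range(1, 51):
        url_s = "http://www.budejie.com/pic/" + str(i)
        page_queue.put(url_s)

    producers = []
    for i in range(3):
        p = Productor(page_queue, pic_queue)
        p.start()
        producers.append(p)
    consumers = []
    for i in range(3):
        t = Consumer(page_queue, pic_queue)
        t.start()
        consumers.append(t)

    # Wait until every page has been parsed, then tell each consumer to stop.
    for p in producers:
        p.join()
    for _ in consumers:
        pic_queue.put(None)


if __name__ == '__main__':
    main()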
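
The page URLs and the file-name cleanup in `get_page_parse` can also be pulled out into small helpers that are easy to try without touching the network. This is only a sketch under my own assumptions: `page_urls`, `make_filename`, the `max_len` limit, and the `"untitled"` fallback are names and choices invented for illustration. The character class covers the characters stripped above plus the ones Windows rejects in file names, and the extension is taken from the URL path so that a query string does not end up in the file name.

```python
import os
import re
from urllib.parse import urlparse

BASE_URL = "http://www.budejie.com/pic/"

# Whitespace, characters Windows forbids in file names, and the
# punctuation the crawler strips from titles.
_ILLEGAL = re.compile(r'[\s\\/:*?"<>|《》#？-]')


def page_urls(first, last):
    """Build listing-page URLs by appending the page number to BASE_URL."""
    return [BASE_URL + str(n) for n in range(first, last + 1)]


def make_filename(title, pic_url, max_len=80):
    """Join a cleaned-up title with the extension found in the URL path."""
    suffix = os.path.splitext(urlparse(pic_url).path)[1]
    # Fall back to a placeholder if nothing is left after cleaning.
    name = _ILLEGAL.sub("", title)[:max_len] or "untitled"
    return name + suffix


print(page_urls(1, 3))
# The example.com URL is a made-up stand-in for a real data-pic value.
print(make_filename("《搞笑》 图片?#1", "http://example.com/x/abc.jpg?v=2"))
```

Capping the name length is a guess at what is safe on common file systems, not something the site requires.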

The result of the crawl:
[image]
