Python crawler: multithreaded concurrent scraping of all search results from 图图岛 (solving the href association problem)

Latest recommended article published on 2024-05-04 16:08:54

乎你



Category: Crawlers
Tags: garbled text, xpath, streaming, sdl, url

Article link: https://blog.csdn.net/m0_50944918/article/details/111391965


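This article shows how to use a Python crawler with XPath parsing to scrape the 图图岛 site concurrently with multiple threads. Along the way it resolves a garbled-URL problem and discusses how streaming mode applies to crawlers.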

Summary generated automatically by CSDN.
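The summary mentions garbled URLs, and the title mentions linking hrefs. A minimal sketch of how both are commonly handled with `urllib.parse` (which the script imports) might look like the following. The base URL, the `search/` path, the keyword, and the relative href are hypothetical placeholders, not values from the original script.

```python
from urllib import parse

# Hypothetical values for illustration only; the article does not show the real URLs.
BASE_URL = "https://example.com/"
keyword = "风景"

# Percent-encode the non-ASCII keyword so it is safe to place inside a URL.
encoded = parse.quote(keyword)
search_url = parse.urljoin(BASE_URL, "search/" + encoded + "/")

# Resolve a relative href (as it might be scraped from a result page)
# against the URL of the page it was found on.
detail_url = parse.urljoin(search_url, "/pic/123.html")

# Decoding reverses the percent-encoding.
original_keyword = parse.unquote(encoded)
```

Building absolute URLs with `urljoin` rather than string concatenation avoids doubled or missing slashes when the scraped href is relative.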

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2020/12/18 19:30
# @Author  : huni
# @File    : 图图岛多线程.py
# @Software: PyCharm
import requests
from lxml import etree
import os
from threading import Thread
from queue import Queue
from urllib import parse

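class CrawlInfo(Thread):
    """Worker thread: takes URLs from url_queue and fetches each page."""
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        # Keep fetching until the URL queue is drained
        while not self.url_queue.empty():
            url = self.url_queue.get()
            # headers is a global that is not defined in the visible part of the script
            resp1 = requests.get(url=url, headers=headers)
            # Fix garbled Chinese text: decode the raw bytes as UTF-8 directly
            # (equivalent to resp1.text.encode('ISO-8859-1').decode('utf-8')
            # when requests falls back to ISO-8859-1)
            resp1_text = resp1.content.decode('utf-8')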
            if resp1.status_co
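The listing is cut off at this point, so the rest of the worker is not visible. As a hedged sketch of the same queue-based pattern (not the author's actual code), the following shows one way a worker could check the status code and pass decoded HTML to a second queue. The names `FetchWorker`, `fake_fetch`, and `crawl` are hypothetical, and the fetch function is injected so the example runs without network access.

```python
from queue import Queue, Empty
from threading import Thread


class FetchWorker(Thread):
    """Hypothetical worker: drains url_queue and puts decoded pages on html_queue."""

    def __init__(self, url_queue, html_queue, fetch):
        super().__init__()
        self.url_queue = url_queue
        self.html_queue = html_queue
        # fetch(url) -> (status_code, raw_bytes); injected so the sketch runs offline
        self.fetch = fetch

    def run(self):
        while True:
            try:
                # get_nowait avoids blocking forever if another thread
                # took the last URL between an empty() check and get()
                url = self.url_queue.get_nowait()
            except Empty:
                break
            status, raw = self.fetch(url)
            if status == 200:
                # Decode bytes explicitly to avoid garbled Chinese text
                self.html_queue.put(raw.decode("utf-8"))


def fake_fetch(url):
    # Stand-in for an HTTP request; returns UTF-8 bytes like a response body
    return 200, ("<html>" + url + " 图图岛</html>").encode("utf-8")


def crawl(urls, fetch, n_threads=3):
    url_queue, html_queue = Queue(), Queue()
    for u in urls:
        url_queue.put(u)
    workers = [FetchWorker(url_queue, html_queue, fetch) for _ in range(n_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # All workers have finished, so draining with empty() is safe here
    pages = []
    while not html_queue.empty():
        pages.append(html_queue.get())
    return pages


pages = crawl(["page/1", "page/2", "page/3", "page/4"], fake_fetch)
```

In a real run, `fetch` could wrap `requests.get` and return `(resp.status_code, resp.content)`. Using `get_nowait` instead of checking `empty()` first avoids a race where two threads both see one remaining URL and one of them blocks indefinitely.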
