Day 15 - Hands-on Web Crawler Project

This post shows how to write a multi-threaded crawler in Python. It is built from PageSpider, DetailSpider and DataParse modules that fetch listing pages from a target site, download and parse the detail pages, and save the results to an Excel file, covering request handling, HTML parsing with XPath, and concurrency control along the way.

Contents

1. Introduction

2. Code

1.main.py

2.PageSpider.py

3.DetailSpider.py

4.DataParse.py

5.Constant.py

6.HanderRequest.py


1. Introduction

1. Crawl the site with multiple threads

2. Save the crawled data to an Excel file

3. Target site (used for testing only), a site about online side projects: https://www.maomp.com/

4. Result: one Excel row per crawled article (URL, title, publish date, content). A sketch of the project layout follows this list.
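The project is split into six small modules, connected by three queues (listing-page URLs → detail-page URLs → raw detail HTML). A possible layout on disk (the directory name is just an example) and the role of each file:

maomp_spider/
    main.py           # entry point: builds the queues and runs the three thread groups
    PageSpider.py     # listing-page threads: extract the detail-page URLs
    DetailSpider.py   # detail-page threads: download the article HTML
    DataParse.py      # parser threads: XPath-parse the HTML and write Excel rows
    Constant.py       # shared row counter for the Excel sheet
    HanderRequest.py  # thin wrapper around requests.get

The only third-party packages used are requests, lxml and xlsxwriter.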

2. Code

1.main.py

# coding:utf-8
import threading

from queue import Queue
from PageSpider import PageSpider
from DetailSpider import DetailSpider
from DataParse import DataParse
import xlsxwriter
import time
"""
爬取网站:https://www.maomp.com/wzjc/
爬取信息,保存至Excel
"""

def start_page(threadsize, page_queue, detail_queue):
    # Start the worker threads that fetch the listing pages
    page_spider_list = []
    for i in range(1, threadsize + 1):
        pageSpiderThread = PageSpider(thread_name="PageSpider-" + str(i), page_queue=page_queue, detail_queue=detail_queue)
        # Start the thread
        pageSpiderThread.start()
        page_spider_list.append(pageSpiderThread)
    # Wait for every page thread to finish; join() blocks until the thread exits
    for page_spider in page_spider_list:
        if page_spider.is_alive():
            page_spider.join()


def start_detail(threadsize, detail_queue, data_queue):
    # Start the worker threads that fetch the detail pages
    detail_spider_list = []
    for i in range(1, threadsize + 1):
        detailSpiderThread = DetailSpider(thread_name="DetailSpider-" + str(i), detail_queue=detail_queue,
                                          data_queue=data_queue)
        # Start the thread
        detailSpiderThread.start()
        detail_spider_list.append(detailSpiderThread)
    # Wait for every detail thread to finish
    for detail_spider in detail_spider_list:
        if detail_spider.is_alive():
            detail_spider.join()

def start_data_parse(threadsize, data_queue, book):
    # Start the worker threads that parse detail pages and write rows to Excel
    lock = threading.Lock()
    sheet1 = book.add_worksheet("sheet1")
    title_data = ("URL", "Title", "Publish date", "Content")
    # Write the header row
    for index, title_datum in enumerate(title_data):
        sheet1.write(0, index, title_datum)

    spider_list = []
    for i in range(1, threadsize + 1):
        thread = DataParse(thread_name="DataParse-" + str(i), data_queue=data_queue, lock=lock, sheet=sheet1)
        # Start the thread
        thread.start()
        spider_list.append(thread)
    # Wait for every parser thread to finish
    for parse in spider_list:
        if parse.is_alive():
            parse.join()

def main():
    # Queue of listing-page URLs
    page_queue = Queue()
    # Queue of detail-page URLs
    detail_queue = Queue()
    # Queue of raw detail-page HTML
    data_queue = Queue()
    page_start = 1
    page_end = 1
    for i in range(page_start, page_end + 1):
        page_url = "https://www.maomp.com/wzjc/page/{}/".format(i)
        page_queue.put(page_url)
    print("Listing-page queue size:", page_queue.qsize())

    # Stage 1: collect the listing pages
    start_page(threadsize=3, page_queue=page_queue, detail_queue=detail_queue)
    # Stage 2: collect the detail pages
    start_detail(threadsize=3, detail_queue=detail_queue, data_queue=data_queue)
    # Stage 3: parse the detail pages and write to Excel
    # Create the Excel workbook, named with the current timestamp
    book = xlsxwriter.Workbook(time.strftime("%Y%m%d%H%M%S", time.gmtime()) + "_result.xlsx")
    start_data_parse(threadsize=5, data_queue=data_queue, book=book)
    book.close()
    print("Remaining listing-page queue size:", page_queue.qsize())
    print("Remaining detail-page queue size:", detail_queue.qsize())
    print("Remaining data queue size:", data_queue.qsize())

if __name__ == '__main__':
    main()
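To run it, install the third-party dependencies with pip (requests, lxml and xlsxwriter are the only ones used) and start the entry script with python main.py. With page_start = page_end = 1 only the first listing page is crawled; widen that range to fetch more pages. Note that the three stages run strictly one after another: the listing-page threads are joined before the detail threads start, so each queue is fully filled before the next stage begins to drain it. The workbook lands in the working directory under a timestamp-based file name.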

2.PageSpider.py

# coding:utf-8
import threading
from lxml import etree
import HanderRequest


class PageSpider(threading.Thread):
    """
    页面url,请求多线程类
    """

    def __init__(self,thread_name,page_queue,detail_queue):
        super(PageSpider,self).__init__()
        self.thread_name=thread_name
        self.page_queue=page_queue
        self.detail_queue=detail_queue

    def parse_detail_url(self, content):
        """
        Parse a listing page and extract the detail-page URLs.
        :param content: HTML text of the listing page
        :return: None; each detail-page URL is pushed onto detail_queue
        """
        # Build an lxml HTML tree from the listing-page response
        item_html = etree.HTML(content)
        # Extract the detail-page links from the article titles
        detail_urls = item_html.xpath("//h2[@class='entry-title']/a/@href")
        for url in detail_urls:
            # Push each detail-page URL onto the queue
            self.detail_queue.put(url)

    def run(self):
        # Fetch listing pages until the queue is drained
        print("{} started".format(self.thread_name))
        try:
            while not self.page_queue.empty():
                # Non-blocking get; raises queue.Empty if another thread drained the queue first
                page_url = self.page_queue.get(block=False)
                # Request the listing page
                response_text = HanderRequest.send_reqeust(page_url)
                if response_text:
                    # Extract the detail-page URLs
                    self.parse_detail_url(response_text)
        except Exception as e:
            print("{} raised an exception: {}".format(self.thread_name, e))

        print("{} finished".format(self.thread_name))

3.DetailSpider.py

# coding:utf-8
import threading
import HanderRequest


class DetailSpider(threading.Thread):
    """
    详情页url,请求详情页
    """

    def __init__(self,thread_name,detail_queue,data_queue):
        super(DetailSpider,self).__init__()
        self.thread_name=thread_name
        self.data_queue=data_queue
        self.detail_queue=detail_queue


    def run(self):
        # Fetch detail pages until the queue is drained
        print("{} started".format(self.thread_name))
        try:
            while not self.detail_queue.empty():
                # Non-blocking get; raises queue.Empty if another thread drained the queue first
                detail_url = self.detail_queue.get(block=False)
                # Request the detail page
                response_text = HanderRequest.send_reqeust(detail_url)
                if response_text:
                    data = {
                        "url": detail_url,
                        "html_content": response_text
                    }
                    # Push the raw page onto data_queue for the parser threads
                    self.data_queue.put(data)

        except Exception as e:
            print("{} raised an exception: {}".format(self.thread_name, e))

        print("{} finished".format(self.thread_name))

4.DataParse.py

# coding:utf-8
import threading
from lxml import etree
import Constant



class DataParse(threading.Thread):
    """
    详情页数据处理
    """

    def __init__(self,thread_name,data_queue,lock,sheet):
        super(DataParse,self).__init__()
        self.thread_name=thread_name
        self.data_queue=data_queue
        self.lock=lock
        self.sheet=sheet


    def __list_join(self, items):
        # XPath returns a list of text fragments; join them into one string
        return "".join(items)

    def __parse(self, data):
        """
        Parse one entry from data_queue and write it to the Excel sheet.
        :return: None
        """
        html = etree.HTML(data.get("html_content"))
        row_data = {
            "url": data.get("url"),
            "title": self.__list_join(html.xpath("//h1[@class='entry-title']/text()")),
            "put_date": self.__list_join(html.xpath("//span[@class='my-date']/text()")),
            "content_html": self.__list_join(html.xpath("//div[@class='single-content']//p/text()"))
        }
        # Several parser threads share the sheet and the row counter, so serialize access with the lock
        with self.lock:
            # Write one row to the Excel sheet
            for index, key in enumerate(row_data):
                self.sheet.write(Constant.CURR_EXCEL_COL, index, row_data.get(key))
            Constant.CURR_EXCEL_COL += 1

    def run(self):
        # Parse entries from data_queue until it is drained
        print("{} started".format(self.thread_name))
        try:
            while not self.data_queue.empty():
                # Non-blocking get; raises queue.Empty if another thread drained the queue first
                data_content = self.data_queue.get(block=False)
                # Parse the HTML and write the row
                self.__parse(data_content)

        except Exception as e:
            print("{} raised an exception: {}".format(self.thread_name, e))

        print("{} finished".format(self.thread_name))

5.Constant.py

# coding:utf-8

# Current Excel row to write to (row 0 holds the header); shared by every DataParse thread
CURR_EXCEL_COL = 1

6.HanderRequest.py

Note: replace the Cookie value ("xxx") with your own before running.

# coding:utf-8

import requests

def send_reqeust(url):
    # Send a GET request with a browser-like User-Agent and the site cookie
    headers = {
        "Cookie": "xxx",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
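send_reqeust passes no timeout, so a stalled connection can hang a worker thread indefinitely, and any non-200 response silently yields None. A slightly more defensive variant (a sketch; the function name, retry count and timeout are illustrative choices, not part of the original code):

# coding:utf-8
import time
import requests

def send_request_safe(url, retries=2, timeout=10):
    # Same headers as send_reqeust; remember to fill in a real Cookie value
    headers = {
        "Cookie": "xxx",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
    }
    for attempt in range(retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            if response.status_code == 200:
                return response.text
            return None  # non-200 response: give up instead of retrying
        except requests.RequestException:
            # Network error or timeout: wait briefly, then retry
            time.sleep(1)
    return None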
