Multithreading + multi-level page scraping + manual job-keyword input == Tencent Careers crawler

Latest recommended article published 2024-08-03 19:27:22

WOHHH234

Category column: 爬虫小白 (Crawler Beginner) | Tags: crawler, python, multithreading

Article link: https://blog.csdn.net/qq_52491868/article/details/120027114

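I'm a crawler beginner. These are skills I recently picked up by following an uploader on Bilibili; if any experts come across this post, pointers are very welcome.

Run result: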

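1. Multithreading: threads suit I/O-bound programs. In CPython the global interpreter lock (GIL) lets only one thread run Python bytecode at a time, so threads do not spread CPU-bound work across cores; that is a job for multiple processes. Scraping with a single thread is slow because every request waits on the network. With multithreading, several threads inside one process work through the same task concurrently, so one thread can send a request while another is still waiting for its response.

The threading module is Python's standard thread module.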

Basic usage:

t = Thread(target=function_name)

t.start()

t.join()  # block until the thread has finished
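
The start/join flow above could look like the following minimal sketch. The `work` function and the `results` list are made-up names used only for illustration.

```python
from threading import Thread

results = []

def work(n):
    # Each thread appends one value; list.append is thread-safe in CPython
    results.append(n * n)

# One thread per input value
threads = [Thread(target=work, args=(n,)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every worker has finished

# Threads may finish in any order, so sort before printing
print(sorted(results))
```

Joining only after all threads have been started matters: calling `join()` right after each `start()` would run the threads one at a time.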

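Where multithreading helps:

I/O-heavy programs, covering both network I/O and local disk I/O

A crawler sending requests and receiving responses: network I/O

A crawler saving the data it scraped: local disk I/O

So a crawler written with multiple threads can scrape data more efficiently.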
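
To see why threads help with I/O, here is a rough sketch that replaces the network call with `time.sleep`. `fake_request`, `run_serial` and `run_threaded` are hypothetical helpers, and the 0.2 second delay is an arbitrary stand-in for network latency.

```python
import time
from threading import Thread

def fake_request(delay=0.2):
    # Stand-in for a network call: the thread just waits, as it would on I/O
    time.sleep(delay)

def run_serial(n):
    # Issue the "requests" one after another
    start = time.perf_counter()
    for _ in range(n):
        fake_request()
    return time.perf_counter() - start

def run_threaded(n):
    # Issue the same "requests" from n threads at once
    start = time.perf_counter()
    threads = [Thread(target=fake_request) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

serial = run_serial(4)
threaded = run_threaded(4)
print(f"serial: {serial:.2f}s, threaded: {threaded:.2f}s")
```

The serial run has to wait for the four delays back to back, while the threaded run waits for them at the same time, so it should finish in roughly the time of a single delay.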

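2. Queues: when several threads work on one task they can get in each other's way, and there is no telling which thread will handle which piece of data first. A queue solves this: put the data into the queue, then take it out when it is needed. queue.Queue does its own locking, so put() and get() are safe to call from several threads.

from queue import Queue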

Common methods:

Create a queue: q = Queue()

Enqueue: q.put(item)

Dequeue: q.get()

Check whether the queue is empty: q.empty()

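By default q.get() blocks forever on an empty queue. Ways around that:

· q.get(block=True, timeout=2): wait up to 2 seconds, then raise queue.Empty

· q.get(block=False): raise queue.Empty immediately if nothing is waiting

· while not q.empty(): q.get() (safe with a single consumer; with several consumers another thread can take the last item between the check and the get)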
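
The sketch below walks through these `get()` variants on a small queue. The variable names and the 0.1 second timeout are chosen only for the example.

```python
from queue import Queue, Empty

q = Queue()
q.put("a")
q.put("b")

first = q.get()               # Queue is first in, first out
second = q.get(block=False)   # returns at once because an item is waiting

try:
    q.get(block=False)        # the queue is empty now, so this raises at once
    got_empty = False
except Empty:
    got_empty = True

try:
    q.get(block=True, timeout=0.1)  # waits up to 0.1 s, then raises Empty
    timed_out = False
except Empty:
    timed_out = True

print(first, second, got_empty, timed_out)
```

Catching `Empty` tends to be more robust than checking `q.empty()` first, because the check and the `get()` are two separate steps.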

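3. Thread locks: when several threads operate on the same shared resource, guard it with a lock so that a second thread cannot start working on the resource before the first has finished with it.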
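
A minimal sketch of a lock protecting a shared counter, similar in spirit to the `number` counter used by the crawler later on. `counter` and `add_many` are illustrative names.

```python
from threading import Thread, Lock

counter = 0
lock = Lock()

def add_many(times):
    global counter
    for _ in range(times):
        # "with lock" acquires on entry and releases on exit, even on error
        with lock:
            counter += 1

threads = [Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

Using `with lock:` instead of separate `acquire()` and `release()` calls avoids forgetting the release on one of the code paths.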

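The crawler:

This time the target is Tencent Careers. We scrape the details of each job posting, which involves two levels of pages.

Approach

Start with the URLs of the two pages. In the first URL two parameters change: keyword, the job we are searching for, and pageIndex, the page number. In the second URL, the page we jump to, the only variable part is postId, so we just extract the PostId values from the first page and plug them into the second URL. That gives us both pages.

Parse the first page to get each PostId and build the detail link, then read the job information from the second page. This takes two queues and two thread locks, one queue and one lock per page level.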
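
The two-queue design can be tried offline before touching the real site. The sketch below is not the crawler itself: `fake_list_page` is a hypothetical stand-in for the first-level request, and the second stage stops on a `None` sentinel rather than on the timeout used in the real code further down.

```python
from queue import Queue, Empty
from threading import Thread

page_q = Queue()    # stage 1 input: list-page numbers
detail_q = Queue()  # stage 2 input: post ids found on the list pages
results = []

def fake_list_page(page):
    # Hypothetical stand-in for the first-level request: 3 post ids per page
    return [page * 10 + i for i in range(3)]

def stage_one():
    while True:
        try:
            page = page_q.get(block=False)
        except Empty:
            break  # no list pages left
        for post_id in fake_list_page(page):
            detail_q.put(post_id)

def stage_two():
    while True:
        post_id = detail_q.get()
        if post_id is None:  # sentinel: no more work is coming
            break
        results.append({"post_id": post_id})

for page in range(1, 4):
    page_q.put(page)

producers = [Thread(target=stage_one) for _ in range(2)]
consumers = [Thread(target=stage_two) for _ in range(2)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()
for _ in consumers:
    detail_q.put(None)  # one sentinel per consumer, sent after stage 1 is done
for t in consumers:
    t.join()

print(len(results))
```

A sentinel lets the second stage stop as soon as the first stage is done, where a timeout always costs the full waiting period at the end.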

Code:

Define the two page URLs, the queues, and the thread locks

    def __init__(self):
        self.one_q=Queue()  # queue 1: first-level (search result) URLs
        self.two_q=Queue()  # queue 2: second-level (job detail) URLs

        self.one_lock=Lock()  # lock 1: guards taking URLs from queue 1
        self.two_lock=Lock()  # lock 2: guards the shared counter

        self.number=0  # number of job postings scraped

        self.one_url='https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1630301002746&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
        self.two_url='https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1630388009179&postId={}&language=zh-cn'
        self.headers={
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
        }

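First, put the first-level URLs into the queue. kw is the job title typed in by hand, and urllib.parse.quote(kw) percent-encodes it so that Chinese characters can go into a URL. total=self.get_total(word) calls a helper that looks up how many pages of results there are, since we don't know that in advance. Each complete URL is then added to the queue with self.one_q.put(one_url).

    def url_in(self):
        kw=input('Enter the job title to search for: ')
        word=urllib.parse.quote(kw)  # percent-encode the keyword
        total=self.get_total(word)  # total number of result pages
        for pageindex in range(1,total+1):
            one_url=self.one_url.format(word,pageindex)
            self.one_q.put(one_url)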
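
As a quick illustration of what `urllib.parse.quote` does to the keyword, here is a small sketch. The keyword and the `example.com` template are placeholders, not the real Tencent URL.

```python
import urllib.parse

kw = "python 开发"
word = urllib.parse.quote(kw)  # non-ASCII bytes and the space become %XX
print(word)

# Hypothetical URL template with the same two placeholders as one_url
template = "https://example.com/search?keyword={}&pageIndex={}"
urls = [template.format(word, page) for page in range(1, 4)]
for url in urls:
    print(url)
```

`urllib.parse.unquote(word)` reverses the encoding, which is handy when checking a URL copied from the browser's developer tools.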

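This is the get_total helper called above; it works out the number of pages.

    def get_total(self,word):
        # Request page 1 only to read the total number of matching jobs
        one_url=self.one_url.format(word,1)
        html=requests.get(url=one_url,headers=self.headers).json()
        count=html['Data']['Count']
        # 10 jobs per page (pageSize=10), rounded up
        total=count//10 if count%10==0 else count//10+1
        print(total)
        return total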
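
The page-count arithmetic is a ceiling division. A compact way to write it, shown here as a sketch with a hypothetical `page_count` helper, avoids the conditional:

```python
import math

def page_count(count, page_size=10):
    # Ceiling division on integers: negate, floor-divide, negate again
    return -(-count // page_size)

# Cross-check against math.ceil for a few job counts
for count in (0, 1, 10, 11, 95):
    assert page_count(count) == math.ceil(count / 10)

print(page_count(95))
```

With 95 matching jobs and 10 per page, the last page holds the remaining 5, so 10 pages are needed.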

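Next, parse the first-level URL to get each PostId and build the link to the detail page. A thread lock is used here, and whatever is locked must be released again so the next thread can proceed. While the queue is not empty, a thread takes a URL and parses it; once the queue is empty, the thread exits.

    def one_parse(self):
        while True:
            self.one_lock.acquire()  # keep the empty() check and the get() together
            if not self.one_q.empty():
                one_url=self.one_q.get()
                self.one_lock.release()  # unlock before the slow network request

                html=requests.get(url=one_url,headers=self.headers).json()
                rep=html['Data']['Posts']
                for i in rep:
                    post_id=i['PostId']
                    URL=self.two_url.format(post_id)  # build the second-level link
                    # hand it to the second-level queue
                    self.two_q.put(URL)
                    #print(URL)
            else:
                self.one_lock.release()
                break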

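One thing worth stressing: a function would normally take the object it works on as a parameter, but here the URLs travel through the queue, so nothing extra has to be written inside the parentheses of def; self alone is enough.

Next, fetch the second-level page, which holds the job information we are really after.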

    def two_parse(self):
        while True:
            try:
                # Queue.get() is thread-safe on its own, so no lock is needed here.
                # Both levels run at the same time, so wait up to 3 s for the
                # first-level threads to supply a URL before giving up.
                URL=self.two_q.get(timeout=3)
            except Empty:  # needs: from queue import Queue,Empty
                break
            html=requests.get(url=URL,headers=self.headers).json()
            item={}
            item['name']=html['Data']['RecruitPostName']
            item['typ']=html['Data']['CategoryName']
            item['add']=html['Data']['LocationName']
            item['req']=html['Data']['Requirement']
            item['duty']=html['Data']['Responsibility']

            with self.two_lock:  # the counter is shared between threads
                self.number+=1
            print(item)
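
The field extraction can also be pulled into a small helper that tolerates missing keys. This is only a sketch: `build_item` and `FIELDS` are hypothetical names, and the sample payload is made up to mirror the keys used above rather than copied from a real response.

```python
# Output key -> key in the response's "Data" object (as used in two_parse)
FIELDS = {
    "name": "RecruitPostName",
    "typ": "CategoryName",
    "add": "LocationName",
    "req": "Requirement",
    "duty": "Responsibility",
}

def build_item(payload):
    # Fall back to an empty dict if "Data" is missing or None
    data = payload.get("Data") or {}
    # Missing fields become empty strings instead of raising KeyError
    return {key: data.get(field, "") for key, field in FIELDS.items()}

# Made-up payload for illustration only
sample = {"Data": {"RecruitPostName": "Example engineer",
                   "CategoryName": "Tech",
                   "LocationName": "Shenzhen"}}
item = build_item(sample)
print(item)
```

With this shape, one odd response yields an item with blank fields instead of an exception that ends the worker thread.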

Finally, put the threads to work: each of the two parse stages gets its own pair of threads, one pair per queue.

    def run(self):
        self.url_in()
        # create the threads
        t1_list=[]
        t2_list=[]
        for i in range(2):
            t1=Thread(target=self.one_parse)
            t1_list.append(t1)
            t1.start()

        for i in range(2):
            t2=Thread(target=self.two_parse)
            t2_list.append(t2)
            t2.start()
        # wait for both stages to finish
        for t1 in t1_list:
            t1.join()

        for t2 in t2_list:
            t2.join()

Full code

'''Multi-level page scraping with multiple threads: Tencent Careers'''
import time
import urllib.parse
from queue import Queue,Empty
from threading import Thread,Lock

import requests

class Tenxun():
    def __init__(self):
        self.one_q=Queue()  # queue 1: first-level (search result) URLs
        self.two_q=Queue()  # queue 2: second-level (job detail) URLs

        self.one_lock=Lock()  # lock 1: guards taking URLs from queue 1
        self.two_lock=Lock()  # lock 2: guards the shared counter

        self.number=0  # number of job postings scraped

        self.one_url='https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1630301002746&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
        self.two_url='https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1630388009179&postId={}&language=zh-cn'
        self.headers={
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
        }

    def url_in(self):
        kw=input('Enter the job title to search for: ')
        word=urllib.parse.quote(kw)  # percent-encode the keyword
        total=self.get_total(word)  # total number of result pages
        for pageindex in range(1,total+1):
            one_url=self.one_url.format(word,pageindex)
            self.one_q.put(one_url)

    def get_total(self,word):
        # Request page 1 only to read the total number of matching jobs
        one_url=self.one_url.format(word,1)
        html=requests.get(url=one_url,headers=self.headers).json()
        count=html['Data']['Count']
        # 10 jobs per page (pageSize=10), rounded up
        total=count//10 if count%10==0 else count//10+1
        print(total)
        return total

    def one_parse(self):
        while True:
            self.one_lock.acquire()  # keep the empty() check and the get() together
            if not self.one_q.empty():
                one_url=self.one_q.get()
                self.one_lock.release()  # unlock before the slow network request
                html=requests.get(url=one_url,headers=self.headers).json()
                rep=html['Data']['Posts']
                for i in rep:
                    post_id=i['PostId']
                    URL=self.two_url.format(post_id)  # build the second-level link
                    # hand it to the second-level queue
                    self.two_q.put(URL)
                    #print(URL)
            else:
                self.one_lock.release()
                break

    def two_parse(self):
        while True:
            try:
                # Queue.get() is thread-safe on its own, so no lock is needed here.
                # Both levels run at the same time, so wait up to 3 s for the
                # first-level threads to supply a URL before giving up.
                URL=self.two_q.get(timeout=3)
            except Empty:
                break
            html=requests.get(url=URL,headers=self.headers).json()
            item={}
            item['name']=html['Data']['RecruitPostName']
            item['typ']=html['Data']['CategoryName']
            item['add']=html['Data']['LocationName']
            item['req']=html['Data']['Requirement']
            item['duty']=html['Data']['Responsibility']

            with self.two_lock:  # the counter is shared between threads
                self.number+=1
            print(item)

    def run(self):
        self.url_in()
        # create the threads
        t1_list=[]
        t2_list=[]
        for i in range(2):
            t1=Thread(target=self.one_parse)
            t1_list.append(t1)
            t1.start()

        for i in range(2):
            t2=Thread(target=self.two_parse)
            t2_list.append(t2)
            t2.start()
        # wait for both stages to finish
        for t1 in t1_list:
            t1.join()

        for t2 in t2_list:
            t2.join()

if __name__=="__main__":
    start_time=time.time()
    spider=Tenxun()
    spider.run()
    end_time=time.time()
    print('time:%.2f'%(end_time-start_time))
