[Crawler Column 20] Lagou Crawler (Single-Threaded and Multi-Threaded)

Latest recommended article published 2022-04-11 18:16:45

夏友

Category column: Crawlers and Data Analysis. Tags: queue, xpath, python

Article link: https://blog.csdn.net/summer_bird/article/details/104922242

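Lagou Crawler

Crawling approach

Start from the listing page and work out how the page number appears in the URL.

The pattern is easy to spot: only the page number at the end of the URL changes from page to page.
[Image: Lagou home page]
I won't go into how to inspect an element and copy its XPath here.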
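
The page-number pattern described above can be sketched as follows. The URL template is the one used in the scripts later in this post; the helper name `build_page_urls` is made up for illustration, and whether the site still serves these URLs is an assumption.

```python
# URL template taken from the scripts in this post: keyword first, page number second
BASE_URL = 'https://www.lagou.com/zhaopin/{}/{}/'


def build_page_urls(keyword, pages):
    """Return the listing URLs for pages 1..pages of one keyword (hypothetical helper)."""
    return [BASE_URL.format(keyword, page) for page in range(1, pages + 1)]


urls = build_page_urls('dianlusheji', 3)
for url in urls:
    print(url)
```

Only the last path segment changes, so crawling more pages is just a matter of widening the range.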

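Notes

1. Lagou has anti-crawling measures, and its cookies change between requests.
See https://www.cnblogs.com/kuba8/p/10808023.html for a way to handle the changing cookies.

2. The scraped data contains spaces and newline characters, so it needs to be cleaned with strip or replace.
Both of the following methods work: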

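# 1. Deduplicate while keeping the original order, then drop the empty string
#    (do not name the variable "set": that would shadow the built-in)
unique = list(set(lists))
unique.sort(key=lists.index)
if '' in unique:
    unique.remove('')
# 2. Strip every item and skip the ones that end up empty
s = [x.strip() for x in list1 if x.strip() != '']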
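
The two cleaning methods can be combined into one small helper. This is only a sketch: the name `clean_texts` and the sample strings are invented for illustration, and `dict.fromkeys` is swapped in for the `set` plus `sort` approach because it keeps the first occurrence of each value in order.

```python
def clean_texts(items, dedupe=False):
    """Strip whitespace from each string and drop the ones that end up empty."""
    cleaned = [item.strip() for item in items if item.strip()]
    if dedupe:
        # dict.fromkeys keeps the first occurrence of each value, in order
        cleaned = list(dict.fromkeys(cleaned))
    return cleaned


# Made-up sample resembling raw text() results from XPath
raw = ['\n  3-5 years / Bachelor  \n', '', '   ', 'Fresh graduate\n', 'Fresh graduate\n']
print(clean_texts(raw))
print(clean_texts(raw, dedupe=True))
```

Deduplication is off by default because two job postings can legitimately share the same value, and removing one of them would shift the columns against each other.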

Key examples

Cleaning the welfare field

# Company welfare: drop whitespace-only entries and remove the curly quotes
welfare = x.xpath('//*[@id="s_position_list"]/ul/li/div[2]/div[2]/text()')
welfare = [exp.replace('“', '').replace('”', '') for exp in welfare if exp.strip() != '']

Using pandas

data = {'names': names, 'direction': dire, 'money': money, 'experience': experience, 'condition': condition,
        'company': company, 'welfare': welfare}
basic_data = pd.DataFrame.from_dict(data=data)
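
One thing to watch for: building a DataFrame from a dict of lists raises a ValueError when the lists have different lengths, which can happen if a posting lacks one of the fields. The sketch below pads the columns before they are handed to pandas. The helper `pad_columns` and the sample values are hypothetical, and the helper itself uses only plain Python.

```python
def pad_columns(columns, fill=''):
    """Pad every list in the dict to the length of the longest one (hypothetical helper)."""
    longest = max((len(values) for values in columns.values()), default=0)
    return {name: list(values) + [fill] * (longest - len(values))
            for name, values in columns.items()}


# Made-up sample in which the columns have different lengths
sample = {'names': ['Engineer A', 'Engineer B'], 'money': ['10k-15k'], 'welfare': []}
padded = pad_columns(sample)
print(padded)
```

Padding keeps the DataFrame construction from failing, but it does not guarantee that values line up with the right posting. Extracting the fields one `li` element at a time would likely be more robust.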

Single-threaded example

Using XPath

To make the data easier to clean up and save, the script also uses a pandas DataFrame.

import requests
import re
from requests.exceptions import RequestException
from lxml import etree
from queue import Queue
import threading
import pandas as pd
import time

def get_one_page(url):
    try:
        time.sleep(0.5)
        headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

        # Work around Lagou's anti-crawling measures
        s = requests.Session()  # create a session object
        s.get(url, headers=headers, timeout=3)  # first request, only to obtain fresh cookies
        cookie = s.cookies  # cookies obtained by this request
        response = s.post(url, headers=headers, cookies=cookie, timeout=3)  # fetch the page text

        #response = requests.get(url, headers = headers)
        #response.encoding = response.apparent_encoding
        if response.status_code == 200:
            #print(response)
            return response.text
            #return response.content.decode("utf8", "ignore")
        return None
    except RequestException:
        return None

def parse_one_page(html):

    x = etree.HTML(html)

    # Job title
    names = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[1]/a/h3/text()')
    #//*[@id="s_position_list"]/ul/li[3]/div[1]/div[1]/div[1]/a/h3
    #print(names)

    # Location
    dire = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[1]/a/span/em/text()')
    #print(dire)

    # Salary
    money = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[2]/div/span/text()')
    #print(len(money))

    # Experience
    experience = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[2]/div/text()')
    # Clean the scraped data: strip whitespace and drop empty entries
    experience = [exp.strip() for exp in experience if exp.strip() != '']
    #print(experience)

    # Company details
    condition = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[2]/div[2]/text()')
    condition = [exp.strip() for exp in condition if exp.strip() != '']

    # Company name
    company = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[2]/div[1]/a/text()')

    # Company welfare
    welfare = x.xpath('//*[@id="s_position_list"]/ul/li/div[2]/div[2]/text()')
    welfare = [exp.replace('“', '').replace('”', '') for exp in welfare if exp.strip() != '']
    #print(welfare)

    # Store the columns in a dict; this avoids looping over tuples and reading the fields one by one
    # Note: all lists must have the same length, otherwise pandas raises a ValueError
    data = {'names': names, 'direction': dire, 'money': money, 'experience': experience, 'condition': condition,
            'company': company, 'welfare': welfare}
    basic_data = pd.DataFrame.from_dict(data=data)
    # Append to the CSV file without writing a header row each time
    basic_data.to_csv(r'xxx.csv', index=False, mode='a', header=False)
    #print(basic_data)



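def main(url, j):
    html = get_one_page(url)
    #print(html)
    if html is None:
        # The request failed or did not return status 200, so there is nothing to parse
        print('Failed to fetch page', j + 1)
        return
    print('Processing page', j + 1)
    parse_one_page(html)


# Kept as a variable to prepare for the multi-threaded version
i = 'dianlusheji'

if __name__ == '__main__':
    for j in range(5):
        url = 'https://www.lagou.com/zhaopin/{}/{}/'.format(i, j + 1)
        main(url, j)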

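Multi-threaded example

This is similar to the single-threaded version, but it uses a queue: each thread takes whichever task is next in the queue, which speeds up the crawl. With three threads, about 3,000 records took 80 seconds; it could have been faster.

Only the code that differs is shown below.
The rest is essentially the same as the single-threaded version.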

# Keywords to crawl; each one is put into the queue as a task
crawl_list = ['danpianji', 'dianlusheji', 'zidonghua', 'qianrushi', 'yingjian', 'Python']

Definitions inside the thread class (only the run method is shown)

def run(self):
    # # Task start time
    # start_time = time.time()
    while True:
        if self.page_queue.empty():
            # # Task end time
            # end_time = time.time()
            # # Elapsed time
            # print(end_time - start_time)
            break
        else:
            print(self.name, 'is about to take a task from the queue')
            # get() removes the item from the queue, so no two threads receive the same keyword
            # Caveat: another thread may take the last item between empty() and get(),
            # in which case this get() blocks
            page = self.page_queue.get()
            print(self.name, 'took the task:', page)

            for j in range(30):
                url = 'https://www.lagou.com/zhaopin/{}/{}/'.format(page, j+1)
                main(url, j)
            print(self.name, 'finished the task:', page)
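
Since the class definition itself is not shown, here is a minimal sketch of how the pieces might fit together. The class name `CrawlThread`, the function `crawl_all`, and the `handler` parameter are assumptions, not the original code. The handler stands in for `main(url, j)` and here only records URLs, so the sketch runs without sending any requests. It uses `get_nowait` so a thread cannot block if another thread takes the last task first.

```python
import threading
from queue import Queue, Empty


class CrawlThread(threading.Thread):
    """Hypothetical worker: takes keywords from the queue until it is empty."""

    def __init__(self, name, page_queue, handler, pages=30):
        super().__init__(name=name)
        self.page_queue = page_queue
        self.handler = handler  # called as handler(url, j), like main(url, j)
        self.pages = pages

    def run(self):
        while True:
            try:
                # get_nowait raises Empty instead of blocking on an empty queue
                keyword = self.page_queue.get_nowait()
            except Empty:
                break
            for j in range(self.pages):
                url = 'https://www.lagou.com/zhaopin/{}/{}/'.format(keyword, j + 1)
                self.handler(url, j)


def crawl_all(keywords, handler, thread_count=3, pages=30):
    """Fill the queue with keywords, start the workers, and wait for them to finish."""
    page_queue = Queue()
    for keyword in keywords:
        page_queue.put(keyword)
    threads = [CrawlThread('thread-{}'.format(n + 1), page_queue, handler, pages)
               for n in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Stand-in handler that records URLs instead of sending requests
seen = []
seen_lock = threading.Lock()


def record(url, j):
    with seen_lock:
        seen.append(url)


crawl_all(['danpianji', 'Python'], record, thread_count=3, pages=2)
print(sorted(seen))
```

In a real run the handler would be `main`. Because `parse_one_page` appends to a single CSV file, guarding that write with a lock, as done for `seen` here, is probably a good idea.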