Collaborative Project: Data Analysis of AI-Related Job Postings (Web Crawler + Data Processing)

Latest recommended article published on 2024-05-27 17:50:54

餐霞散人

Categories: Crawlers, Data Science, python, pandas, The Road to AI

Article link: https://blog.csdn.net/qq_27171347/article/details/81746589

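This article describes using a Scrapy crawler to collect AI-related job postings from 51job, covering machine learning, deep learning, and several other sub-fields. The scraped data is then cleaned in detail (processing tags, extracting the useful information, unifying salary units, and so on) and organized into a normalized dataset.

Summary generated automatically by CSDN.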

1 Project background
2 Scraping job details from 51job with Scrapy
3 Data cleaning

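1 Project background

Artificial intelligence is developing ever faster and market demand is large. To assess whether offering an AI degree program is warranted, this project carries out a concrete data analysis. Job postings are scraped from recruitment platforms where companies maintain a regular presence, such as Zhaopin (智联), 51job, ChinaHR (中华英才网), and Lagou (拉勾), and then analyzed. The results largely reflect the market's demand for a given position and the skills it requires.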

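2 Scraping job details from 51job with Scrapy

The crawl targets artificial intelligence jobs and the sub-fields under it: machine learning, deep learning, image recognition, face recognition, NLP, autonomous driving, speech recognition, and algorithm researcher.
The main fields scraped are job title, company name, salary, education and experience requirements, and the job description and requirements. Everything is saved to a CSV file. The spider code is as follows: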

import scrapy
from ..items import FiveoneItem


class JobSpider(scrapy.Spider):
    name = 'job'
    # allowed_domains = ['example.com']
    # Search-results page where the crawl starts
    start_urls = ['https://search.51job.com/list/000000,000000,0000,00,9,99,%25E7%25AE%2597%25E6%25B3%2595%25E7%25A0%2594%25E7%25A9%25B6%25E5%2591%2598,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

    def parse(self, response):
        # Each posting on the results page sits in a <div class="el">
        for obj in response.xpath('//div[@class="el"]'):
            detail_url = obj.xpath('.//a/@href').extract_first()
            # Skip rows without a link: scrapy.Request raises on a None URL
            if detail_url:
                yield scrapy.Request(detail_url, callback=self.detailparse)
        # On 51job the "next page" link uses the same tag as "previous page".
        # Taking the first match yields duplicate URLs, so the spider closes
        # before crawling every page; take the last matching tag instead.
        next_link = response.xpath('//li[@class="bk"][last()]/a/@href').extract_first()
        if next_link is not None:
            self.logger.info('Next page link: %s', next_link)
            yield scrapy.Request(url=next_link, callback=self.parse)

    # Scrape the contents of a job-detail page
    def detailparse(self, response):
        item = FiveoneItem()
        for obj in response.xpath('//div[@class="tHeader tHjob"]'):
            item['title'] = obj.xpath('.//h1/@title').extract_first()
            item['salary'] = obj.xpath('.//strong/text()').extract_first()
            item['company'] = obj.xpath('.//p[@class="cname"]/a/@title').extract_first()
            item['tags'] = obj.xpath('.//p[@class="msg ltype"]/@title').extract_first()
        # Collect the text of every paragraph in the job description
        jobrequest = []
        for obj in response.xpath('//div[@class="bmsg job_msg inbox"]'):
            jobrequest.extend(obj.xpath('./p/text()').extract())
        item['jobrequest'] = jobrequest
        self.logger.debug('Scraped item: %r', dict(item))
        yield item

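There is no need to write a pipeline or change settings here: Scrapy's built-in feed export can persist the results directly. (The fields the spider assigns, namely title, salary, company, tags, and jobrequest, still have to be declared on FiveoneItem in items.py, otherwise assigning them raises a KeyError.)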

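# spider_name is the spider's `name` attribute ('job' in the code above)
# Export as JSON:
scrapy crawl spider_name -o spider_name.json -s FEED_EXPORT_ENCODING=utf-8
# Export as CSV (tabular):
scrapy crawl spider_name -o spider_name.csv -s FEED_EXPORT_ENCODING=utf-8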
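As an alternative to passing the encoding with `-s` on every run, the same Scrapy setting can be placed in the project's `settings.py`. The fragment below is a minimal sketch of that option and is not taken from the original project:

```python
# settings.py (fragment)
# Write exported feeds as UTF-8 so that Chinese text is stored as-is
# instead of as \uXXXX escape sequences in JSON output.
FEED_EXPORT_ENCODING = "utf-8"
```

With this in place, `scrapy crawl spider_name -o spider_name.json` should produce the same output as the longer command.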

The export produces a file like the following:
[Figure: screenshot of the exported file]

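3 Data cleaning

import re
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
# pyecharts 0.5.x import path; from pyecharts 1.0 on, these classes live in pyecharts.charts
from pyecharts import Geo
from pyecharts import WordCloud
# Use a font with Chinese glyphs so that Chinese labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
# Render the minus sign correctly
plt.rcParams['axes.unicode_minus'] = False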

3.1 Importing the CSV

a=pd.read_csv('aiall.csv')
len(a)

    14383  # 14,383 records in total

# Drop invalid data (rows with any missing value)
b = a.dropna()

b.head()

	company	jobrequest	salary	tags	title
0	北京捷通华声科技股份有限公司	职位描述：, 1.负责智能机器人知识库的建设，保证机器人智能问答的准确率；…	0.7-1万/月	北京 | 无工作经验 | 本科 | 招3人 | 08-14发布	人工智能训练工程师
2	寰宇优才教育科技（北京）有限公司	【岗位方向】：,1、 Java+大数据软件开发工程师实习生,2、人工智能+Python开发…	6-8千/月	北京-朝阳区 | 无工作经验 | 招16人 | 08-14发布	Java+人工智能实习工程师
3	广州市润东信息科技有限公司	1. 研究机器学习、深度学习等领域的前沿技术并结合业务场景解决实际问题；,2. 通过对数据的…	1.5-2万/月	广州-番禺区 | 3-4年经验 | 本科 | 招1人 | 08-14发布	人工智能工程师
4	江苏厚学网信息技术股份有限公司	岗位职责:,1. 开发软电话模块，完成与终端呼叫设备的对接；,2. 负责模块的需求分析，代码…	10-20万/年	南京-秦淮区 | 3-4年经验 | 大专 | 招4人 | 08-14发布	C++高级开发工程师(人工智能方向)
5	郑州汇之众网络科技有限公司	1、热爱编程事业，热衷IT行业；, 2、本科以上学历，计算机及理工类专业优先；, 3、认真遵…	6-8千/月	郑州 | 无工作经验 | 本科 | 招若干人 | 08-14发布	零基础人工智能开发实习生

len(b)
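
The salary column in the preview above mixes units: 千/月 (thousand yuan per month), 万/月 (ten thousand yuan per month), and 万/年 (ten thousand yuan per year). The summary mentions unifying salary units, but the author's own code for that step is not shown here. The following is only a sketch of one possible approach; the function name `salary_to_k_per_month`, the choice of thousand yuan per month as the target unit, and the decision to return `None` for unrecognised formats are all assumptions.

```python
import re

# Matches ranges such as "0.7-1万/月", "6-8千/月" or "10-20万/年".
_SALARY_RE = re.compile(r"^(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(千|万)/(月|年)$")


def salary_to_k_per_month(salary):
    """Return (low, high) in thousand yuan per month, or None if unparsed."""
    match = _SALARY_RE.match(salary.strip())
    if match is None:
        return None  # unrecognised format; left for manual handling
    low, high = float(match.group(1)), float(match.group(2))
    factor = 10.0 if match.group(3) == "万" else 1.0  # 1万 = 10千
    if match.group(4) == "年":
        factor /= 12.0  # spread a yearly figure over 12 months
    return round(low * factor, 2), round(high * factor, 2)
```

Applied to the DataFrame, this could be used as `b['salary'].apply(salary_to_k_per_month)`, with the `None` results inspected separately.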

3.2 Processing tags and extracting the useful information

new_tags = b['tags'].str.split('|')
new_tags.head()

    0        [北京  ,   无工作经验  ,   本科  ,   招3人  ,   08-14发布]
    2           [北京
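
Splitting on `|` leaves surrounding whitespace on every piece, and the number of pieces varies between rows (the second row of the preview has no education field), so the pieces cannot be assigned by position alone. The sketch below classifies each piece by its pattern instead. It is an illustration under assumptions, not the author's code: the function name `parse_tags`, the output keys, and the list of education levels are all hypothetical.

```python
import re

# Assumed set of education labels; extend it if the data contains others.
EDUCATION_LEVELS = ("初中及以下", "高中", "中技", "中专", "大专", "本科", "硕士", "博士")


def parse_tags(tags):
    """Split a tag string on '|' and classify each piece by its pattern."""
    info = {"location": None, "experience": None, "education": None,
            "headcount": None, "posted": None}
    parts = [p.strip() for p in tags.split("|") if p.strip()]
    for i, part in enumerate(parts):
        if i == 0:
            info["location"] = part          # the first piece is the location
        elif "经验" in part:
            info["experience"] = part        # e.g. 无工作经验, 3-4年经验
        elif part in EDUCATION_LEVELS:
            info["education"] = part
        elif re.fullmatch(r"招(\d+|若干)人", part):
            info["headcount"] = part         # e.g. 招3人, 招若干人
        elif part.endswith("发布"):
            info["posted"] = part[:-2]       # keep only the MM-DD date
    return info
```

On the DataFrame this could be applied as `b['tags'].apply(parse_tags).apply(pd.Series)` to get one column per key.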
