Scraping 51job with Python

Latest recommended article published on 2024-05-01 14:35:23

Dream____Fly


Column: Python crawlers. Tags: Python crawler, python, bs4 methods, scraping 51job with Python

Article link: https://blog.csdn.net/dream____fly/article/details/99681086


No more preamble; here is the code.

import urllib.request
from bs4 import BeautifulSoup
import re
import time

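'''
Project goal: scrape job title, company, location, salary, and posting date from 51job.
First, fetch the whole page for a given URL.
Next, extract the required fields from the fetched page.
Finally, store each row in a dictionary and append it to a text file.
'''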

class python_job():
    # Dictionary that holds the fields of the job row being processed
    def __init__(self):
        self.date = {}

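    # Fetch one page of 51job search results for the given page number
    def get_content(self, number):
        url = 'https://search.51job.com/list/200200,000000,0000,00,9,99,python,2,'
        # Build the page URL by appending the page number
        new_url = url + str(number) + '.html'
        # Request headers; a browser-like User-Agent is sent with the request
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
        }
        # Build the request object
        request = urllib.request.Request(url=new_url, headers=headers)
        # Fetch the page and close the connection once the body is read
        with urllib.request.urlopen(request) as response:
            content = response.read()
        return self.get_response(content)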

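    # Parse the page and locate the job table
    def get_response(self, content):
        # Build the soup object (the 'lxml' parser requires the lxml package)
        soup = BeautifulSoup(content, 'lxml')
        # Split the table into the header row (head_ret) and the data rows (body_ret)
        head_ret = soup.find_all('div', class_='el title')
        body_ret = soup.select('.dw_table > .el')
        # The first match is the header row itself, so drop it
        body_ret.pop(0)
        return self.get_head_body(head_ret, body_ret)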

    # Combine the header labels with each row's values
    def get_head_body(self, head_ret, body_ret):
        # Extract the column labels from the header row
        head = head_ret[0].find_all('span')
        # Append to the output file; the with block closes it when done
        with open('python.txt', 'a', encoding='utf8') as fp:
            for i in body_ret:
                body_head = i.find('a')
                # Collect the value cells, whose class names start with t plus a digit
                body_body = i.find_all('span', class_=re.compile(r'^t\d'))
                # Job title
                self.date[head[0].text] = body_head.text.strip()
                # Company name
                self.date[head[1].text] = body_body[0].text.strip()
                # Work location
                self.date[head[2].text] = body_body[1].text.strip()
                # Salary
                self.date[head[3].text] = body_body[2].text.strip()
                # Posting date
                self.date[head[4].text] = body_body[3].text.strip()
                # Write the row to the file
                fp.write(str(self.date) + '\n')
        # Pause between pages to avoid hammering the site
        time.sleep(5)
        print('Finished downloading one page...')

    # Entry point for each page: report progress, then fetch and parse
    def first(self, i):
        print('Starting download of page %s...' % i)
        return self.get_content(i)

if __name__ == '__main__':
    start_num = int(input('Enter the first page number: '))
    end_num = int(input('Enter the last page number: '))
    a = python_job()
    for i in range(start_num, end_num + 1):
        a.first(i)
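
The script above writes each row with str() on a dictionary, which produces a Python repr rather than a format that other tools can load directly. As a possible alternative (not part of the original script), the sketch below writes one JSON object per line and reads the file back. The helper names save_rows and load_rows, the file name python.jsonl, and the sample rows are all hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def save_rows(rows, path):
    """Append each row (a dict) to path as one JSON object per line."""
    with open(path, 'a', encoding='utf8') as fp:
        for row in rows:
            # ensure_ascii=False keeps Chinese text readable in the file
            fp.write(json.dumps(row, ensure_ascii=False) + '\n')


def load_rows(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf8') as fp:
        return [json.loads(line) for line in fp if line.strip()]


# Hypothetical sample rows, shaped like the dictionary the scraper fills per job row
sample_rows = [
    {'职位名': 'Python开发工程师', '公司名': '示例公司A', '工作地点': '示例城市',
     '薪资': '1-1.5万/月', '发布时间': '08-15'},
    {'职位名': 'Python爬虫工程师', '公司名': '示例公司B', '工作地点': '示例城市',
     '薪资': '0.8-1万/月', '发布时间': '08-16'},
]

# Write to a temporary directory so the demo does not touch real output files
demo_path = os.path.join(tempfile.mkdtemp(), 'python.jsonl')
save_rows(sample_rows, demo_path)
restored = load_rows(demo_path)
print(restored == sample_rows)
```

Because every line is a complete JSON object, the file can be appended to page by page, as the scraper does, and still be parsed line by line later.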
