[Python] A web crawler for software engineering job postings on 51job (前程无忧) https://www.51job.com

路白#

Article link: https://blog.csdn.net/weixin_43847567/article/details/104854129

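First, open the site's robots.txt page at https://www.51job.com/robots.txt

It returns the following message:

Page not found       File not found

The page you are looking for has been deleted, renamed, or is temporarily unavailable.

Please try the following:
If you typed the page address into the address bar, make sure it is spelled correctly.
Open the www.51job.com home page, then look for links to the information you want.
Click the Back button and try another link.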

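Notes:

Web crawlers: check robots.txt, automatically or by hand, before crawling any content.
Binding force: the robots protocol is a recommendation rather than a binding rule, but ignoring it may carry legal risk.

If a site publishes no robots.txt, the protocol places no restrictions on crawling, so by this convention the content here is treated as crawlable.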
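The check described above can also be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`; the helper name `can_fetch` and the sample rules are made up for illustration and are not part of the original program. It parses rules supplied in memory, so it runs without network access.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that are already in memory
    return parser.can_fetch(user_agent, url)


# No rules at all: nothing is disallowed.
print(can_fetch([], "*", "https://www.51job.com/"))

# Hypothetical rules, only to show how a Disallow line is applied.
sample_rules = ["User-agent: *", "Disallow: /private/"]
print(can_fetch(sample_rules, "*", "https://example.com/private/a.html"))
print(can_fetch(sample_rules, "*", "https://example.com/jobs/1.html"))
```

Against a live site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`. In CPython's implementation, a 404 response for robots.txt is treated as "everything allowed", which matches the reasoning in the note above.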

The source code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @File  : HtmlParser.py
# @Author: 赵路仓
# @Date  : 2020/2/28
# @Desc  : Crawler for the 51job (前程无忧) job search site
# @Contact : 398333404@qq.com

from bs4 import BeautifulSoup
import requests
import csv

# Request headers
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}
# Search URL (first results page)
url = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E8%25BD%25AF%25E4%25BB%25B6,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="


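# Write the header row to the CSV file
def headcsv():
    with open('/position.csv', 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Position", "Company", "Location", "Salary", "Date", "URL"])


# Write the header row to the text file
def headtxt():
    with open('E:/data/position.txt', 'w', encoding='utf-8') as ftxt:
        # End the header with a newline so the first record starts on its own line
        ftxt.write("Position Company Location Salary Date URL\n")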


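def position(url, head):
    try:
        r = requests.get(url, headers=head, timeout=3)
        # Print the status code, then stop on HTTP errors instead of parsing an error page
        print(r.status_code)
        r.raise_for_status()
        # Decode the body with the encoding detected from its content
        r.encoding = r.apparent_encoding
        print(r.apparent_encoding)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Each element with class "el" is one job posting
        items = soup.find_all(class_='el', recursive=True)
        # The with block closes the file even if an error occurs mid-page
        with open('E:/data/position.txt', 'a', encoding='utf-8') as ftxt:
            # Skip the first 16 matches, which are not job postings
            for i in items[16:]:
                link = i.find("a")
                if link is None or not link.get('href'):
                    continue
                # Drop spaces inside each field, then join the fields with single spaces
                fields = [line.replace(" ", "") for line in i.text.splitlines()]
                itemdetail = " ".join(f for f in fields if f) + " " + link['href']
                print(itemdetail)
                ftxt.write(itemdetail + '\n')
                print("Written successfully")
    except (requests.RequestException, OSError) as e:
        print("Error while crawling job postings:", e)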


def write(url, head):
    # The page number sits just before ".html"; the url argument is rebuilt for every page
    for i in range(1, 2000):
        url = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E8%25BD%25AF%25E4%25BB%25B6,2,"+str(i)+".html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
        print(url)
        position(url, head)


if __name__ == "__main__":
    # headtxt()  # uncomment to recreate the output file with a header row
    write(url, head)
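
The search keyword in the URL above, `%25E8%25BD%25AF%25E4%25BB%25B6`, is the word 软件 ("software") percent-encoded twice. The sketch below shows one way to build the same URL for a given keyword and page number. The function name `build_search_url` and the constants `BASE` and `QUERY` are hypothetical; every other path segment and query parameter is copied verbatim from the script, and their meaning is not documented here.

```python
from urllib.parse import quote

# Assumption: everything except the keyword and page number is copied
# unchanged from the URL used in the script.
BASE = "https://search.51job.com/list/000000,000000,0000,00,9,99,{keyword},2,{page}.html"
QUERY = ("lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99"
         "&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=")


def build_search_url(keyword, page):
    """Return the search URL for a keyword and a 1-based page number."""
    # Encode twice: the first pass produces %E8..., the second turns each % into %25.
    encoded = quote(quote(keyword, safe=""), safe="")
    return BASE.format(keyword=encoded, page=page) + "?" + QUERY


print(build_search_url("软件", 1))
```

With a helper like this, searching for a different keyword only means changing the argument rather than editing the encoded string by hand.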

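Each crawled record contains, in order: position, company, location, salary, date, and URL. The output is saved to E:/data/position.txt; change the path or the file format as needed.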
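
As one possible way to change the file format, the following sketch converts saved lines into CSV rows. It assumes each line holds exactly six whitespace-separated fields in the order given above, which may not hold for every posting, so lines that do not fit are skipped. The names `line_to_row` and `lines_to_csv` and the sample data are hypothetical.

```python
import csv
import io

HEADER = ["Position", "Company", "Location", "Salary", "Date", "URL"]


def line_to_row(line):
    """Split one saved line into six fields, or return None if it does not fit."""
    parts = line.split()
    return parts if len(parts) == len(HEADER) else None


def lines_to_csv(lines, out):
    """Write the header and every well-formed line to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    written = 0
    for line in lines:
        row = line_to_row(line)
        if row is not None:
            writer.writerow(row)
            written += 1
    return written


# Made-up sample lines; the second is malformed and gets skipped.
sample = [
    "Engineer ExampleCo Shanghai 10-15k 03-14 https://example.com/job/1.html",
    "incomplete line",
]
buffer = io.StringIO()
count = lines_to_csv(sample, buffer)
print(count)
print(buffer.getvalue())
```

When writing to a real file rather than a `StringIO`, open it with `newline=''`, as `headcsv()` does, so that the `csv` module controls the line endings.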