Python Scraper Mini-Demo (Boss Zhipin)


QQ: 1755497577 (note: CSDN)

Bilibili: code_ant (Java-related training videos)

WeChat official account: CodeAnt


Overview

I've long known how powerful Python web scrapers are, but I had never tried one myself. Today I'll take a shot at writing a small Python scraper.

Preparation

Python environment

First, the Python installation itself. I'm used to developing on Windows, so I'll cover setup on both Windows and Linux; for macOS, please search for instructions yourself.

Windows 10

Go to the official downloads page, https://www.python.org/downloads/windows/; choose a stable "Windows x86-64 executable installer"; run it and tick the option that configures the environment variables automatically. To confirm it worked, open a new terminal and run python --version.

Done, it's that simple (this is the Python equivalent of installing the JDK for Java).

CentOS 7.x

# Install Python 3 from source
# (run scripts afterwards with: python3 xxx.py)

# Install the build dependencies first (without them, the source build can
# fail because of missing underlying libraries).
yum -y install wget gcc zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel


wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tgz
tar -xvf Python-3.7.4.tgz

cd Python-3.7.4
./configure --prefix=/usr/local/python37 --enable-optimizations
make && make install

# Environment variables: append (>>) rather than overwrite .bash_profile,
# and quote 'EOF' so $PATH is expanded at login instead of at write time
cd ~
cat >> .bash_profile <<'EOF'
export PATH=$PATH:/usr/local/python37/bin
EOF

source .bash_profile

echo "---------------test------------------"
python3 --version
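
As an optional sanity check, the standard-library modules below only get compiled when the matching -devel packages from the yum line above were present, so a short script can verify the build (check_build.py is just an illustrative name, not part of the original steps):

# check_build.py -- hypothetical helper, run with: python3 check_build.py
# Each import fails if its -devel package (openssl, sqlite, bzip2, xz,
# readline) was missing when Python was compiled.
import bz2
import lzma
import readline
import sqlite3
import ssl

print('optional modules OK')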

Python editor

PyCharm is Python's counterpart to IDEA in the Java world (IDEA itself can also be used for Python development). I use PyCharm here; for installation, see: https://www.runoob.com/w3cnote/pycharm-windows-install.html


The scraper

Scraping approach for Boss Zhipin:

  • Obtain a cookie (see the session sketch after this list)
  • Simulate a browser request (the requests library)
  • Extract the data (XPath)
  • Parse the data
  • Save the data
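
The demo below needs two third-party libraries, requests and lxml (pip3 install requests lxml). In main.py the cookie is simply copied from the browser's developer tools and pasted into the headers; as a minimal alternative sketch, a requests.Session can also collect cookies automatically (the site URL and user-agent here just mirror the ones used in main.py):

import requests

# A Session stores any Set-Cookie values from responses and replays them on
# later requests, so the cookie string need not be pasted in by hand.
session = requests.Session()
session.headers.update({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'
})
resp = session.get('https://www.zhipin.com/')
print(session.cookies.get_dict())  # cookies that later requests will reuse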

main.py

# Boss Zhipin job-listings scraper demo

from lxml import etree
import requests
import urllib3

import Info

# Silence the InsecureRequestWarning triggered by verify=False below
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# url = "https://www.zhipin.com/c101040100-p100101/?ka=sel-city-101040100"  # listing URL
url = "https://www.zhipin.com/c101040100-p100101/"  # listing URL

# Request headers

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cookie': 'lastCity=101040100; _uab_collina=1559894257053587868292; _bl_uid=ejj1ezbR49w72poUUr06qa8iy55n; __c=1567474454; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1565361209,1566378390,1566981114,1567474455; __l=l=%2Fwww.zhipin.com%2F&r=https%3A%2F%2Fwww.baidu.com%2Fs%3Fie%3DUTF-8%26wd%3Dboss%25E7%259B%25B4%25E8%2581%2598&friend_source=0&friend_source=0; __zp_stoken__=c688fL0GhqXlN%2FYY%2F2ydR1HFd8NS%2B8oaaNAjTZSdiGKLVMq%2BPk1q%2FaMCVkpzfOn1kk38E6u8nCHUaLXH2leUN3NrhA%3D%3D; __a=50395184.1559894257.1566981114.1567474454.68.6.4.68; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1567475125',
    # 'referer': 'https://www.zhipin.com/c101040100-p100101/',
    'referer': 'https://www.zhipin.com/c101040100-p100101/?page=2&ka=page-2',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'
}

payload = {
}

# Fetch the raw HTML for a URL
def getData(data_url, data_headers, data_payload):
    r = requests.get(data_url, json=data_payload, headers=data_headers, verify=False)
    s = str(r.content, 'utf8')
    # print(s)
    return s


# Extract data with XPath
# Tutorial: https://www.cnblogs.com/lei0213/p/7506130.html
def analysisData(data='', xpath_str_f='', xpath_str_b=''):
    html = etree.HTML(data)
    values = []
    try:
        # Each posting is an li[n] node; walk them until one stops matching
        for n in range(1, 20):
            x = xpath_str_f + str(n) + xpath_str_b
            result = html.xpath(x)
            if len(result) == 0:
                break
            values.append(result)
    except Exception:
        pass
    # for i in values:
    #     print(i[0].text)
    return values


# Scrape a single page and print its postings
def getOnePage(page_url):
    r = getData(page_url, headers, payload)
    names = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[1]/h3/a/div[1]')
    moneys = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[1]/h3/a/span')
    companyNames = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[2]/div/h3/a')
    addrs = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[1]/p')
    companyStatuss = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[2]/div/p')

    infos = ''
    print(len(names))
    for i in range(len(names)):  # start at 0 so the first posting is not skipped
        info = Info.Info(names[i][0].text,
                         moneys[i][0].text,
                         companyNames[i][0].text,
                         addrs[i][0].text,
                         companyStatuss[i][0].text,)

        if info is None:
            continue
        # infos = infos + info.tostring()
        print(info.tostring())
    return infos


# Scrape pages 2 and 3
for i in range(2, 4):
    temp = url + '?page=' + str(i) + '&ka=page-' + str(i)
    print(temp)
    getOnePage(temp)
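
The XPath strings above are just a fixed prefix and suffix with the li index spliced in between. As a standalone sketch of that pattern, run against a tiny hand-written fragment rather than the live site, the indexing logic can be tested offline:

from lxml import etree

# Minimal HTML standing in for one page of listings
html = etree.HTML('''
<ul>
  <li><h3><a><div>Java Developer</div><span>15-25K</span></a></h3></li>
  <li><h3><a><div>Python Developer</div><span>12-20K</span></a></h3></li>
</ul>
''')

# The same splice-the-index trick as analysisData()
for n in range(1, 10):
    name = html.xpath('//ul/li[' + str(n) + ']/h3/a/div')
    money = html.xpath('//ul/li[' + str(n) + ']/h3/a/span')
    if not name:  # stop once li[n] no longer exists
        break
    print(name[0].text, money[0].text)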

Info.py

class Info:
    # Simple value object for one job posting

    def __init__(self, name, money, companyName, addr, companyStatus):
        self.name = name
        self.money = money
        self.companyName = companyName
        self.companyStatus = companyStatus
        self.addr = addr

    def tostring(self):
        # Return (rather than print) the formatted row, so that
        # print(info.tostring()) in main.py works as intended
        return ("name:" + self.name
                + "\tmoney:" + self.money
                + "\tcompanyName:" + self.companyName
                + "\taddr:" + self.addr
                + "\tcompanyStatus:" + self.companyStatus + "\n"
                )
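
The "save the data" step from the outline is never implemented above; getOnePage only prints each row. A minimal sketch of that step using only the standard library (save_infos is a hypothetical helper, not part of the original demo) could look like this:

import csv

# Hypothetical helper: persist a list of Info objects to a CSV file
def save_infos(infos, path='jobs.csv'):
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'money', 'companyName', 'addr', 'companyStatus'])
        for info in infos:
            writer.writerow([info.name, info.money, info.companyName,
                             info.addr, info.companyStatus])

For this to be useful, getOnePage would need to collect the Info objects into a list instead of only printing them.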
