About me
QQ: 1755497577 (note: CSDN)
Bilibili: code_ant (Java-related training videos)
Overview
I've long known how powerful Python crawlers can be, but I had never tried writing one myself, so today let's give it a shot and build a small Python crawler.
Setup
Python environment
First, a quick walkthrough of installing Python. I'm used to developing on Windows, so I'll cover the Windows and Linux setups; for macOS, please search for instructions yourself.
Windows 10
Go to the official site, https://www.python.org/downloads/windows/; pick a stable "Windows x86-64 executable installer"; run the download and check the option to configure the environment variables automatically.
Done — it's that simple (roughly the equivalent of installing the JDK for Java).
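Once the installer finishes, you can confirm the interpreter works — either by running `python --version` at a command prompt, or from Python itself:

```python
# quick sanity check that a Python 3 interpreter is installed and on PATH
import sys

print(sys.version_info.major)  # 3 on any Python 3 install
```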
CentOS 7.x
# Install a Python environment; run scripts with: python3 xxx.py
# Install the build dependencies first (without them, building from source can fail due to missing underlying libraries).
yum -y install wget gcc zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel
wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tgz
tar -xvf Python-3.7.4.tgz
cd Python-3.7.4
./configure --prefix=/usr/local/python37 --enable-optimizations
make && make install
# Environment variable (append with >> rather than > so the existing .bash_profile
# is not overwritten, and quote EOF so $PATH is expanded at login, not at write time)
cd ~
cat >> .bash_profile <<'EOF'
export PATH=$PATH:/usr/local/python37/bin
EOF
source .bash_profile
echo "---------------测试------------------"
python3 --version
Python editor
The counterpart of IDEA for Java development; IDEA itself can also be used for Python, but here I use PyCharm. For installation, see: https://www.runoob.com/w3cnote/pycharm-windows-install.html
Crawler
Approach for crawling the Boss Zhipin job site:
- Get a cookie
- Simulate a browser request (the requests library)
- Extract data (XPath)
- Parse the data
- Save the data
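The fetch-then-extract steps above can be sketched in isolation before looking at the full demo. This is a minimal sketch assuming the lxml library is installed; the HTML snippet is invented stand-in markup (not the real Zhipin page) so the XPath part runs offline:

```python
# Sketch of the "extract data with XPath" step on a stand-in page.
from lxml import etree

# Hypothetical markup standing in for a real job-listing page.
page = """
<ul id="jobs">
  <li><div class="job-title">Java Engineer</div><span class="salary">15-25K</span></li>
  <li><div class="job-title">Python Engineer</div><span class="salary">18-30K</span></li>
</ul>
"""

html = etree.HTML(page)
titles = [n.text for n in html.xpath('//ul[@id="jobs"]/li/div[@class="job-title"]')]
salaries = [n.text for n in html.xpath('//ul[@id="jobs"]/li/span[@class="salary"]')]
print(titles)    # ['Java Engineer', 'Python Engineer']
print(salaries)  # ['15-25K', '18-30K']
```

Against the real site you would swap the stand-in `page` for the response body returned by requests, and the class-based XPath for the positional `li[n]` expressions used below.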
main.py
# Boss Zhipin job-listing crawler demo
from lxml import etree
import requests
import Info

# url = "https://www.zhipin.com/c101040100-p100101/?ka=sel-city-101040100"  # endpoint
url = "https://www.zhipin.com/c101040100-p100101/"  # endpoint
# request headers
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    # note: 'br' (brotli) dropped — requests cannot decode brotli responses without an extra package
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cookie': 'lastCity=101040100; _uab_collina=1559894257053587868292; _bl_uid=ejj1ezbR49w72poUUr06qa8iy55n; __c=1567474454; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1565361209,1566378390,1566981114,1567474455; __l=l=%2Fwww.zhipin.com%2F&r=https%3A%2F%2Fwww.baidu.com%2Fs%3Fie%3DUTF-8%26wd%3Dboss%25E7%259B%25B4%25E8%2581%2598&friend_source=0&friend_source=0; __zp_stoken__=c688fL0GhqXlN%2FYY%2F2ydR1HFd8NS%2B8oaaNAjTZSdiGKLVMq%2BPk1q%2FaMCVkpzfOn1kk38E6u8nCHUaLXH2leUN3NrhA%3D%3D; __a=50395184.1559894257.1566981114.1567474454.68.6.4.68; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1567475125',
    # 'referer': 'https://www.zhipin.com/c101040100-p100101/',
    'referer': 'https://www.zhipin.com/c101040100-p100101/?page=2&ka=page-2',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'
}
payload = {}
# fetch the page HTML
def getData(data_url, data_headers, data_payload):
    # the payload is sent as query-string parameters; verify=False skips TLS
    # certificate checks (avoids certificate errors, but emits a warning)
    r = requests.get(data_url, params=data_payload, headers=data_headers, verify=False)
    return r.content.decode('utf-8')
# parse the data with XPath
# tutorial: https://www.cnblogs.com/lei0213/p/7506130.html
def analysisData(data='', xpath_str_f='', xpath_str_b=''):
    html = etree.HTML(data)
    values = []
    # try successive list items li[1], li[2], ...; stop at the first index with no match
    for n in range(1, 20):
        nodes = html.xpath(xpath_str_f + str(n) + xpath_str_b)
        if not nodes:
            break
        values.append(nodes)
    return values
# crawl one page of listings and print them
def getOnePage(page_url):
    r = getData(page_url, headers, payload)
    names = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[1]/h3/a/div[1]')
    moneys = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[1]/h3/a/span')
    companyNames = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[2]/div/h3/a')
    addrs = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[1]/p')
    companyStatuss = analysisData(r, '//*[@id="main"]/div/div[2]/ul/li[', ']/div/div[2]/div/p')
    infos = ''
    print(len(names))
    # lists are 0-based, so iterate from 0 (range(1, ...) would skip the first job)
    for i in range(len(names)):
        info = Info.Info(names[i][0].text,
                         moneys[i][0].text,
                         companyNames[i][0].text,
                         addrs[i][0].text,
                         companyStatuss[i][0].text)
        # infos = infos + info.tostring()
        print(info.tostring())
    return infos
# crawl pages 2 and 3
for i in range(2, 4):
    temp = url + '?page=' + str(i) + '&ka=page-' + str(i)
    print(temp)
    getOnePage(temp)
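The outline above ends with "save the data", but the demo only prints each row. A minimal, hypothetical way to persist the scraped fields is the standard-library csv module; the column names mirror the Info class, and the rows here are invented sample data standing in for real crawl results:

```python
# Hypothetical persistence step: write scraped rows to jobs.csv
# with the stdlib csv module (sample data, not real crawl output).
import csv

rows = [
    ('Java Engineer', '15-25K', 'SomeCo', 'Chongqing', 'Series A'),
    ('Python Engineer', '18-30K', 'OtherCo', 'Chongqing', 'Listed'),
]

with open('jobs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'money', 'companyName', 'addr', 'companyStatus'])
    writer.writerows(rows)

# read it back to confirm: header + 2 data rows
with open('jobs.csv', newline='', encoding='utf-8') as f:
    print(sum(1 for _ in csv.reader(f)))  # 3
```

In the real crawler, `rows` would be built up inside getOnePage instead of (or alongside) the print calls.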
Info.py
class Info:
    def __init__(self, name, money, companyName, addr, companyStatus):
        self.name = name
        self.money = money
        self.companyName = companyName
        self.addr = addr
        self.companyStatus = companyStatus

    # return the formatted string instead of printing it,
    # so callers can both print it and accumulate it
    def tostring(self):
        return ("name:" + self.name
                + "\tmoney:" + self.money
                + "\tcompanyName:" + self.companyName
                + "\taddr:" + self.addr
                + "\tcompanyStatus:" + self.companyStatus + "\n")
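As an aside, on Python 3.7+ the same record can be written with far less boilerplate using the standard-library dataclasses module. This is just a sketch of the alternative (the class name JobInfo and the sample values are invented), not part of the original demo:

```python
from dataclasses import dataclass

@dataclass
class JobInfo:
    # same fields as the Info class above
    name: str
    money: str
    companyName: str
    addr: str
    companyStatus: str

# sample values for illustration only
job = JobInfo('Java Engineer', '15-25K', 'SomeCo', 'Chongqing', 'Series A')
print(job.name)   # Java Engineer
print(job.money)  # 15-25K
```

@dataclass generates __init__ and a readable __repr__ automatically, which replaces both the hand-written constructor and most of tostring.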