I. Crawler steps
1. Import the json, requests, and lxml modules
2. Set the target URL
3. Build the request headers
4. Send the GET request; status code 200 means normal access
5. Decode the response
6. Locate the target nodes
7. Create a list to hold all the records
8. Loop over the li nodes
9. Create a dict to hold each li's content
10. Find the fields to scrape and store them in the dict
11. Append each dict to the list
12. Write the list to a JSON file
13. Print a message when scraping is finished
II. Crawler source code
import json
import requests as r
from lxml import etree

# Target URL: the list of majors on jobui.com
url = 'https://edu.jobui.com/major/'
# Request headers: a browser User-Agent so the site does not reject the request
hea = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                     'Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.42'}
# Send the GET request; a 200 status code means normal access
rel = r.get(url, headers=hea)
# print(rel)
# Decode the response body into text
cenet = rel.content.decode()
# print(cenet)
# Parse the HTML and select every li node in the major list
html = etree.HTML(cenet)
list_ol = html.xpath(".//ol[@class='tblist-list']/li")
# List that will hold one dict per major
list_xx = []
for ol in list_ol:
    # Dict holding the fields scraped from this li
    d = {}
    # The original selector is cut off after a[@class=' in the source,
    # so a generic text() selector is used here as a stand-in
    zhuany = ol.xpath("./a/text()")
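The source cuts off inside the loop, so steps 10–13 of the outline (store the fields in the dict, append each dict to the list, write the list to a JSON file, print a completion message) are not shown. A minimal sketch of those remaining steps follows; since the original is truncated, the field name `major`, the sample records, and the output filename `majors.json` are assumptions, not the author's actual values:

```python
import json

# Sample records standing in for the dicts built inside the scraping loop;
# the field name "major" and the values below are assumed for illustration.
list_xx = [{"major": "Computer Science"}, {"major": "Mathematics"}]

# Step 12: write the list of dicts to a JSON file.
# ensure_ascii=False keeps Chinese characters readable in the output file.
with open("majors.json", "w", encoding="utf-8") as f:
    json.dump(list_xx, f, ensure_ascii=False, indent=2)

# Step 13: completion message.
print("Scraping finished, %d records saved." % len(list_xx))
```

The `ensure_ascii=False` argument matters for a site like this: without it, `json.dump` escapes every non-ASCII character, which makes Chinese major names unreadable in the output file.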