I'll use the academic affairs site of the school next door (a QiangZhi Technology system) as the example; the approach transfers to similar sites. Straight to the practical part!
Required Python libraries:
- requests
- BeautifulSoup (or lxml)
- re
- aip (the Baidu OCR SDK)

Install:

pip install requests beautifulsoup4 baidu-aip
Now analyze the login page of the academic affairs site:
Enter a deliberately wrong account and password, locate the login request in the browser's network panel, and inspect the form data it submits, namely:
- userAccount:
- userPassword:
- RANDOMCODE: lq4e
- encoded: 11y52B616I65YmZ0760wt1w1G37%6%80%11Y72W436M104435a453x4dEi
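To make the endpoint analysis concrete, here is a minimal sketch of what the eventual POST looks like with requests. The domain `jwxt.example.edu.cn` is a made-up placeholder, and `RANDOMCODE`/`encoded` are filled in later in this article:

```python
# Hypothetical base URL standing in for the real site
BASE = 'https://jwxt.example.edu.cn'

# The form fields captured from the failed login attempt
payload = {
    'userAccount': '',       # left empty by the page's JavaScript
    'userPassword': '',      # left empty by the page's JavaScript
    'RANDOMCODE': 'lq4e',    # the captcha text
    'encoded': '<ciphertext built from username/password>',  # placeholder
}

# The real request would be (requires the requests library):
# import requests
# session = requests.session()
# resp = session.post(f'{BASE}/Logon.do?method=logon', data=payload)
print(sorted(payload))
```

Note that the site ignores `userAccount`/`userPassword` and only checks `encoded`, which is why figuring out the encoding is the whole game.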
Analysis:
RANDOMCODE is the captcha.
encoded: unreadable at first glance, so guess — it has to be the username and password run through some encryption algorithm, right?
Then view the page source:
and the encryption routine is sitting right there in plain sight.
It turns out the "encryption" is trivial: just some string slicing and concatenation, so encoded can be reproduced directly.
Translated to Python, it looks roughly like this:
code_logon = f'{username}%%%{password}'  # username and password joined by %%%
encoded = ""
for i in range(len(code_logon)):
    if i < 20:
        # interleave each plaintext character with sxh[i] characters of scode;
        # scode and sxh come from the server (see the full code further down)
        encoded += code_logon[i] + scode[:int(sxh[i])]
        scode = scode[int(sxh[i]):]
    else:
        encoded += code_logon[i:]
        break
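The loop only makes sense once you have `scode` and `sxh`, which the server hands back from `Logon.do?method=logon&flag=sess` as one `#`-separated string. Here is a runnable sketch with made-up values (the real ones change every session):

```python
# Made-up stand-ins for the server-issued values; the real ones come from
# session.get('.../Logon.do?method=logon&flag=sess').text.split('#')
scode = 'aB3dE5fG7hI9jK1mN2oP4qR6sT8uVwXyZ'  # hypothetical key material
sxh = '1212121212121212121212'               # hypothetical digit list

username, password = 'stu001', 'pass123'
code_logon = f'{username}%%%{password}'      # 16 characters here

encoded = ''
for i in range(len(code_logon)):
    if i < 20:
        # one plaintext char, then sxh[i] chars of scode
        encoded += code_logon[i] + scode[:int(sxh[i])]
        scode = scode[int(sxh[i]):]
    else:
        encoded += code_logon[i:]
        break

print(len(encoded))  # → 40: 16 plaintext chars + 24 consumed scode chars
```

So the ciphertext is literally the plaintext with server-provided padding shuffled in — no real cryptography involved.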
Next, the captcha problem:

Method 1: save the captcha to disk via the response's content attribute and read it with your own eyes (100% accuracy):

# 'wb' mode writes the image bytes to disk
with open('code.png', 'wb') as fp:
    fp.write(content_code)
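Put together, method 1 is just: download the bytes, write them out, type what you see. A sketch with stand-in bytes (in the real flow `content_code` is `session.get(src_url).content`):

```python
# Stand-in for the captcha bytes downloaded with the session
content_code = b'\x89PNG\r\n\x1a\nfake-image-bytes'

# 'wb' writes the raw bytes so the file opens as a normal image
with open('code.png', 'wb') as fp:
    fp.write(content_code)

# Open code.png, read the characters off by eye, and type them in:
# code_name = input('captcha: ')
```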
Method 2: use a third-party OCR service to convert the image to text.
I used Baidu Cloud OCR (look up the setup yourself):

APP_ID = 'a53ad9a747b84c24b____........'
API_KEY = 'WHTTZbXOQQHRl............'
SECRET_KEY = 'MBjsPH2dPd8BdXcnd......'
client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
result = client.basicGeneral(content_code)
result = result.get('words_result')[0].get('words')
result = result.replace(' ', '')
code_name = re.findall(r'\w*', result)[0]
Of course, on complex captchas the success rate is only around 70%. You can also preprocess the image in Python first to strip out the noise, then feed it to whatever OCR you like.
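For the preprocessing idea, a common recipe is grayscale → threshold → median filter, which removes most speckle noise before OCR. This sketch uses Pillow (`pip install pillow`; it is a third-party library, not part of the standard library) and a blank stand-in image instead of the real code.png; the threshold value 140 is an assumption you would tune:

```python
from PIL import Image, ImageFilter

# Stand-in for Image.open('code.png')
img = Image.new('RGB', (60, 20), 'white')

gray = img.convert('L')                             # 1) grayscale
bw = gray.point(lambda p: 255 if p > 140 else 0)    # 2) binarize at a tuned threshold
clean = bw.filter(ImageFilter.MedianFilter(3))      # 3) median filter kills isolated pixels
clean.save('code_clean.png')                        # feed this file to the OCR instead
```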
Finally, the complete code:
import requests
from bs4 import BeautifulSoup  # HTML parsing
import re  # regular expressions for text matching
from aip import AipOcr

username = '。。。。。'
password = '。。。。。..'
# matches the course-name <div> content up to the teacher's <font> tag
findDiv = re.compile(r'<div[^>]*>(.*)<br/><font title="老师">')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
}
url = 'https://。。。。。'
response = requests.get(url=url, headers=headers, verify=False)
content = response.text
soup = BeautifulSoup(content, 'html.parser')
src = soup.select('#SafeCodeImg')[0].attrs.get('src')
src_url = 'https://。。。。。' + src
# Key point: requests.session() returns a Session object that persists cookies
# across requests, so the captcha and the login stay in the same session
session = requests.session()
response_code = session.get(src_url, verify=False)
content_code = response_code.content
APP_ID = 'a。。。。。。1'
API_KEY = 'W。。。。。。。。'
SECRET_KEY = 'MBjsP。。。。。。。。。。。'
client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
def get_file_content(filePath):
    with open(filePath, 'rb') as ff:
        return ff.read()

image = get_file_content('code.png')  # unused: the OCR below runs on content_code directly
result = client.basicGeneral(content_code)
result = result.get('words_result')[0].get('words')
result = result.replace(' ', '')
code_name = re.findall(r'\w*', result)[0]
# Build the encoded value
url_logon = 'https://。。。。。。/Logon.do?method=logon&flag=sess'
response_logon = session.get(url_logon, verify=False)
soup_logon = response_logon.text
scode, sxh = soup_logon.split('#')
code_logon = f'{username}%%%{password}'
encoded = ""
for i in range(len(code_logon)):
    if i < 20:
        encoded += code_logon[i] + scode[:int(sxh[i])]
        scode = scode[int(sxh[i]):]
    else:
        encoded += code_logon[i:]
        break
# Log in
url_post = 'https://。。。。。。/Logon.do?method=logon'
data_post = {
    'userAccount': '',
    'userPassword': '',
    'RANDOMCODE': code_name,
    'encoded': encoded,
}
response_post = session.post(url=url_post, headers=headers, data=data_post)
content_post = response_post.status_code
wobj = {}

# Scrape the timetable
def curriculum():
    global wobj
    wlist = []
    url_curr = 'https://。。。。。/jsxsd/xskb/xskb_list.do'
    response_curr = session.get(url=url_curr, headers=headers, verify=False)
    content_curr = response_curr.text
    soup_curr = BeautifulSoup(content_curr, 'html.parser')
    form1 = soup_curr.select('#Form1')[0]
    tr_list = []
    for item in form1.find_all('tr'):
        tr_list.append(item)
    del tr_list[-1]  # drop the trailing remarks row
    del tr_list[0]   # drop the header row
    n = 1
    for tr in tr_list:
        for item in tr.find_all('td'):
            obj = item.find('div', class_='kbcontent')
            if obj.find('font'):
                # day of the week
                week_time = obj.attrs.get('id')
                week_time = str(week_time)
                week_time = week_time[-3]
                # teacher's name
                teacher = obj.find('font', title='老师')
                teacher = teacher.get_text()
                # classroom
                address = obj.find('font', title='教室')
                if address is None:
                    address = ' '
                else:
                    address = address.get_text()
                # week range (periods)
                order = obj.find('font', title='周次(节次)')
                order = order.get_text()
                # course name
                objs = str(obj)
                course_name = re.findall(findDiv, objs)[0]
                course_name = re.sub(r'<br(\s+)?/>(\s+)?', ' ', course_name)  # strip <br/> tags
                course_name = re.sub('<span>.*</span>', ' ', course_name)
                data = {
                    'weekTime': week_time,
                    'jieCi': n,
                    'courseName': course_name,
                    'address': address,
                    'teacher': teacher,
                    'weekOrder': order,
                }
                wlist.append(data)
        n += 1
    wobj = {
        'wlist': wlist,
        'code': 'ok',
    }
    print(wobj)

curriculum()
Finally, wrap this into a login endpoint on a Django backend and return the result as JSON to the WeChat mini program, which then renders the timetable.