Total scraping newbie here. Yesterday my boss handed me a task: grab some data. I glanced at the page, figured it was simple and a bit of parsing would do, no problem at all. Then reality slapped me in the face.
Page: http://query.bjeea.cn/queryService/rest/plan/134
Requirement: scrape the data under both tabs, 按院校查询 (query by institution) and 按专业查询 (query by major), exactly as they appear on the page.
1. The problem
My first attempt used requests + BeautifulSoup. Scraping the query-by-institution data worked fine (that code is at the end of this post), but for the query-by-major data the page I got back never matched what the browser showed: every request returned the query-by-institution data, and changing the request parameters made no difference. Presumably it's a JavaScript issue, i.e. the by-major table is only filled in client-side after the tab is clicked, so it never shows up in the raw response (if anyone knows the details, please enlighten me).
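If you want to sanity-check that, a quick sketch like the following shows what the server actually sends back before any JavaScript runs (I'm assuming here that the result tables carry class="case" in the static HTML, as they do in the rendered page):
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://query.bjeea.cn/queryService/rest/plan/134')
soup = BeautifulSoup(resp.content.decode('utf-8'), 'html.parser')
tables = soup.find_all('table', attrs={'class': 'case'})
print(len(tables), 'result table(s) in the static HTML')
for table in tables:
    first_row = table.find('tr')
    # peek at each table's header row to see which query it belongs to
    print(first_row.get_text(strip=True)[:60] if first_row else '(empty table)')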
I dug around for ages without cracking it, so I had to summon a guru (all hail the guru), who recommended pyppeteer.
2. Installing pyppeteer
Run this in Anaconda Prompt:
pip install pyppeteer -i https://pypi.tuna.tsinghua.edu.cn/simple
The install failed with this error:
Cannot uninstall 'certifi'. It is a distutils installed project and thus we cannot accurately determ
Solution: https://blog.csdn.net/ZLiang_092/article/details/122562386
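A workaround that usually gets past this "distutils installed project" error is to reinstall certifi while telling pip to ignore the existing copy, then install pyppeteer again:
pip install --ignore-installed certifi
pip install pyppeteer -i https://pypi.tuna.tsinghua.edu.cn/simple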
The first time a pyppeteer program runs it automatically downloads chromium; if your network won't cooperate, you can download it offline from https://npm.taobao.org/mirrors/chromium-browser-snapshots/ and install it yourself.
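Alternatively, if you'd rather keep the automatic download but route it through that mirror, the pyppeteer versions I've seen read a PYPPETEER_DOWNLOAD_HOST environment variable; treat the following as a sketch and check it against your installed version:
import os
os.environ['PYPPETEER_DOWNLOAD_HOST'] = 'https://npm.taobao.org/mirrors'  # assumed to be honored; pyppeteer appends chromium-browser-snapshots/... itself
from pyppeteer import launch  # import only after setting the variable, so the downloader picks it up
An offline-installed chromium can also be used by passing its path to launch() via executablePath (there's a commented-out example of that option in the code below).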
3. Solving the problem
With everything installed (there are plenty of pyppeteer tutorials around if you want some background first), it's time to solve the problem. The simulated browsing goes like this:
① Open the page;
② Click 按专业查询 (query by major);
③ Click 查询 (search);
④ Parse the data;
⑤ Click next page;
⑥ Parse the data;
⑦ Repeat steps ⑤ and ⑥ until there is no next page;
⑧ Save the data.
The click-simulation method page.click() takes a CSS selector as its argument. You can grab a selector straight from the page in the browser's DevTools: right-click the element, then Copy → Copy selector.
Here is the implementation (scraping the query-by-major data):
import asyncio
from pyppeteer import launch
from bs4 import BeautifulSoup
import pandas as pd


async def main():
    start_parm = {
        # Path to the browser executable (only needed if you want a specific browser)
        # "executablePath": r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe",
        # Turn headless mode off (pyppeteer launches headless by default)
        "headless": False,
    }
    browser = await launch(**start_parm)
    page = await browser.newPage()
    await page.goto('http://query.bjeea.cn/queryService/rest/plan/134#')  # open the target page
    await page.waitForSelector('#s_specialty', {'timeout': 3000})
    await page.click("#s_specialty")  # click the "query by major" tab
    await page.waitForSelector('#doSearchSpecialty', {'timeout': 3000})
    cl = page.click("#doSearchSpecialty")  # click "search" (not awaited here; awaited together with the navigation below)
    await wait_fornavigation(page, cl)  # the click triggers a page load, so wait for the navigation to finish
    # Grab the first page of results
    page_text = await page.content()  # rendered HTML of the current page
    # Parse the HTML and pull out the rows
    soup = BeautifulSoup(page_text, 'html.parser')
    search_res = soup.find_all(name='table', attrs={"class": "case"})
    search_res = search_res[1]  # the second "case" table holds the query-by-major results
    search_res = search_res.find_all(name='tr')
    results = []
    for row in search_res[1:-1]:  # skip the header row and the trailing pager row
        cells = row.find_all(name='td')
        res = []
        for cell in cells:
            res.append(cell.text.replace("\n", ""))
        results.append(res)
    count = 1
    # Grab page 2 through the last page
    while True:
        # Look for the "next page" link; if it is gone, this was the last page
        selector = await page.querySelector("#nav_menu_con2 > table > tbody > tr:nth-child(4) > td > table > tbody > tr:nth-child(52) > td > div > ul:nth-child(2) > li:nth-child(3) > a")
        if selector is None:
            break
        # Click "next page"
        cl = page.click("#nav_menu_con2 > table > tbody > tr:nth-child(4) > td > table > tbody > tr:nth-child(52) > td > div > ul:nth-child(2) > li:nth-child(3) > a")
        await wait_fornavigation(page, cl)
        page_text = await page.content()
        soup = BeautifulSoup(page_text, 'html.parser')
        search_res = soup.find_all(name='table', attrs={"class": "case"})
        search_res = search_res[1]
        search_res = search_res.find_all(name='tr')
        for row in search_res[1:-1]:
            cells = row.find_all(name='td')
            res = []
            for cell in cells:
                res.append(cell.text.replace("\n", ""))
            results.append(res)
        count += 1
    await browser.close()
    # Save the data
    df = pd.DataFrame(results, columns=["院校代号", "院校名称", "专业(类)", "类中所含专业", "选考科目要求"])
    df.to_csv("ttttt.csv")


async def wait_fornavigation(page, events):  # run an action and wait until the navigation it triggers has finished
    # The action (e.g. a click) and waitForNavigation must run concurrently,
    # otherwise the navigation can fire before waitForNavigation starts listening.
    await asyncio.wait([
        asyncio.ensure_future(events),
        asyncio.ensure_future(page.waitForNavigation({'timeout': 3000})),
    ])


asyncio.get_event_loop().run_until_complete(main())
And here is the requests + BeautifulSoup code that scrapes the query-by-institution data:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

url = 'http://query.bjeea.cn/queryService/rest/plan/134#'
headers = {'User-Agent': ''}  # fill in a User-Agent string from your browser
cookie_str = ''  # paste the Cookie header copied from your browser here
cookies = {}
for line in cookie_str.split(';'):
    if '=' not in line:  # skip empty pieces (e.g. when cookie_str is left blank)
        continue
    key, value = line.split('=', 1)
    cookies[key.strip()] = value.strip()
# Grab page 1
resp = requests.get(url, headers=headers, cookies=cookies)
content = resp.content.decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
search_res = soup.find_all(name='table', attrs={"class": "case"})
search_res = search_res[0]  # the first "case" table holds the query-by-institution results
search_res = search_res.find_all(name='tr')
results = []
for row in search_res[1:-1]:  # skip the header row and the trailing pager row
    cells = row.find_all(name='td')
    res = []
    for cell in cells:
        res.append(cell.text.replace("\n", ""))
    results.append(res)
# Grab pages 2 through 15
for page_no in range(2, 16):
    # pageFlag/token/pageSize/pageNo mirror the request the page itself sends
    # (the token value was presumably captured from the browser's Network panel)
    data = {'pageFlag': True, 'token': 1642553659238, 'pageSize': 50, 'pageNo': page_no}
    resp = requests.post(url, data=json.dumps(data), headers=headers, cookies=cookies)
    content = resp.content.decode('utf-8')
    soup = BeautifulSoup(content, 'html.parser')
    search_res = soup.find_all(name='table', attrs={"class": "case"})
    search_res = search_res[0]
    search_res = search_res.find_all(name='tr')
    for row in search_res[1:-1]:
        cells = row.find_all(name='td')
        res = []
        for cell in cells:
            res.append(cell.text.replace("\n", ""))
        results.append(res)
# Save the data
df = pd.DataFrame(results, columns=['序号', '院校代号', '院校名称', '所在地区'])
df.to_csv("t1.csv")