1. Scraping Analysis
Target site: China Unicom Procurement and Bidding Network
http://www.chinaunicombidding.cn/bidInformation
After opening the site, try analyzing the returned data with the browser's developer tools. The site injects a debugger statement, which gets in the way of analysis and scraping. No matter: keep watching the requests and responses in the Network panel, and you will find the data comes back from this endpoint: http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?Wlfknewu=IC_MPalqEcX1eaTdVovIYxd5IeYYI8oGAwPMsbxZojFvDjcBrdU7Eq_7fDUDfKwO1KDlXU__Klc7VGeAofD6Ff9NQw.M9cEX?
Looking at the request headers, the URL changes on every request: the value carried after Wlfknewu= varies dynamically, and the cookie is also dynamically obfuscated and encrypted. A conventional requests-based approach would be very cumbersome and would require JavaScript reverse engineering. Following the idea from the previous article on scraping the China Mobile site, we will again try Playwright to fetch the data.
2. Trying Playwright
① First, check whether Playwright can fetch the data at all
from playwright.sync_api import sync_playwright

def cuGetUrls(playwright):
    browser = playwright.chromium.launch(headless=False)
    url = 'http://www.chinaunicombidding.cn/bidInformation'
    context = browser.new_context()
    page = context.new_page()
    page.goto(url)
    input('Press Enter to continue: ')
    context.close()
    browser.close()

with sync_playwright() as playwright:
    cuGetUrls(playwright)
No data comes back; the site probably detects the browser's WebDriver property.
The next step is to add the relevant launch arguments to suppress the automation-control notice and mask the WebDriver property, then try again.
Now the page loads and returns data normally, so what remains is to process the data and collect it.
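The launch configuration boils down to two Chromium flags, shown below exactly as they appear in the full code at the end of this post. How each flag maps to the site's detection logic is an assumption; whether they remain sufficient depends on the site's checks at any given time.

```python
# Chromium flags used in this post to reduce automation fingerprints.
# Their intended effect (an assumption, not guaranteed across versions):
# mask navigator.webdriver and suppress the automation-control notice.
LAUNCH_ARGS = [
    '--disable-blink-features=AutomationControlled',
    '--enable-automation',
]

# Usage (requires playwright and its browsers to be installed):
#   browser = playwright.chromium.launch(headless=False, args=LAUNCH_ARGS)
print(LAUNCH_ARGS)
```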
② Use the expect_response method to capture the data returned by http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?. Since expect_response supports glob matching on the URL, the changing Wlfknewu= parameter can simply be ignored.
with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as jsonData:
    page.goto(url)
print(jsonData.value.text())
The data is captured and returned as expected.
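The response body is JSON. The sketch below walks through a hypothetical payload (field names are the ones the parsing step uses; all values are invented, except the id, which is the one used in the detail-page example later) to show how the detail-page URL and date are derived from a record:

```python
import json

# Hypothetical excerpt of a getAnnoList response; only the fields the
# parser consumes are shown, with made-up values.
sample = json.dumps({
    "data": {
        "records": [{
            "id": "1814155843256635392",
            "createDate": "2024-07-19 10:00:00",
            "provinceName": "Beijing",
            "annoType": "Procurement",
            "annoName": "Sample announcement",
        }]
    }
})

record = json.loads(sample)['data']['records'][0]
# The detail page is addressed by the record's id.
detail_url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id={}'.format(record['id'])
# Keep only the date part of the timestamp.
create_date = record['createDate'].split(' ')[0]
print(detail_url)
print(create_date)  # 2024-07-19
```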
③ Next, parse and structure the returned data.
import json

biddingList = []

def getUrlData(text):
    data = json.loads(text)
    records = data['data']['records']
    for i in records:
        id = i['id']
        url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id={}'.format(id)
        createDate = i['createDate'].split(' ')[0]
        provinceName = i['provinceName']
        annoType = i['annoType']
        annoName = i['annoName']
        biddingDict = {
            'provinceName': provinceName,
            'annoType': annoType,
            'annoName': annoName,
            'createDate': createDate,
            'url': url
        }
        print(provinceName, annoType, annoName, createDate)
        biddingList.append(biddingDict)
    return biddingList
The json library parses the response, the fields of interest are extracted, and the results are stored as a list of dicts. For long-term storage, a database such as MongoDB or MySQL would be a better fit.
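To illustrate the storage step without extra dependencies, here is a minimal sketch using sqlite3 from the standard library (the post suggests MongoDB or MySQL; sqlite stands in here only to keep the example self-contained, and the table schema and sample row are invented):

```python
import sqlite3

# Invented sample row, in the same shape getUrlData produces.
biddingList = [
    {'provinceName': 'Beijing', 'annoType': 'Procurement',
     'annoName': 'Sample announcement', 'createDate': '2024-07-19',
     'url': 'http://www.chinaunicombidding.cn/bidInformation/detail?id=1'},
]

conn = sqlite3.connect(':memory:')  # use a file path for real persistence
conn.execute('''CREATE TABLE IF NOT EXISTS bidding
                (provinceName TEXT, annoType TEXT, annoName TEXT,
                 createDate TEXT, url TEXT UNIQUE)''')
# The UNIQUE url plus INSERT OR IGNORE makes re-runs idempotent.
conn.executemany(
    'INSERT OR IGNORE INTO bidding VALUES '
    '(:provinceName, :annoType, :annoName, :createDate, :url)',
    biddingList)
conn.commit()
count = conn.execute('SELECT COUNT(*) FROM bidding').fetchone()[0]
print(count)  # 1
```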
④ So far only the first page has been fetched. Locate the next-page button with Playwright's get_by_role, then capture each page's data with expect_response, in the same way as above.
for _ in range(1, 5):  # number of pages to fetch
    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as npJsonData:
        page.get_by_role("button", name="right").click()
    biddingList = getUrlData(npJsonData.value.text())
3. Fetching Announcement Content
To go further and retrieve the full announcement text behind each title, we can scrape each announcement's detail page.
① The previous section already collected the link for each title, so we can open each detail page with Playwright and again capture the data with expect_response.
from playwright.sync_api import sync_playwright

def cmPage(playwright):
    browser = playwright.chromium.launch(headless=False, args=['--disable-blink-features=AutomationControlled', '--enable-automation'])
    url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id=1814155843256635392'
    context = browser.new_context()
    page = context.new_page()
    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoDetailed/*') as pageJsonData:
        page.goto(url)
    print(pageJsonData.value.text())
    context.close()
    browser.close()

with sync_playwright() as playwright:
    cmPage(playwright)
The full announcement content is retrieved successfully.
② Parse and structure the returned data.
import json
from bs4 import BeautifulSoup

def getPageData(text):
    data = json.loads(text)
    annoText = data['data']['annoText']
    soup = BeautifulSoup(annoText, 'html.parser')
    content = soup.get_text(strip=True)
    print(content)
    return content
The announcement text is extracted cleanly.
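As an aside, if BeautifulSoup is not available, the standard library's html.parser can perform the same text extraction. A sketch with an invented annoText sample (concatenation with no separator mirrors get_text(strip=True) here):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect stripped text nodes from an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

def strip_html(annoText):
    extractor = TextExtractor()
    extractor.feed(annoText)
    return ''.join(extractor.parts)

# Invented sample standing in for the real annoText field.
sample = '<div><p>Announcement</p><p>Deadline: 2024-07-20</p></div>'
print(strip_html(sample))  # AnnouncementDeadline: 2024-07-20
```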
4. Complete Code
The code below combines the scraping of announcement titles, links, and detail-page content; it is provided for reference and study only. It does not use concurrency or other more efficient techniques, which you are welcome to explore and add yourself.
Leave a comment if you have questions.
This article may be reposted; please credit the source. Thanks.
# coding: utf-8
from playwright.sync_api import sync_playwright
import json
from bs4 import BeautifulSoup
import time

biddingList = []

def getUrlData(text):
    data = json.loads(text)
    records = data['data']['records']
    for i in records:
        id = i['id']
        url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id={}'.format(id)
        createDate = i['createDate'].split(' ')[0]
        provinceName = i['provinceName']
        annoType = i['annoType']
        annoName = i['annoName']
        biddingDict = {
            'provinceName': provinceName,
            'annoType': annoType,
            'annoName': annoName,
            'createDate': createDate,
            'url': url
        }
        biddingList.append(biddingDict)
    return biddingList

def getPageData(text):
    data = json.loads(text)
    annoText = data['data']['annoText']
    soup = BeautifulSoup(annoText, 'html.parser')
    content = soup.get_text(strip=True)
    return content

def cuGetUrls(playwright):
    browser = playwright.chromium.launch(headless=False, args=['--disable-blink-features=AutomationControlled', '--enable-automation'])
    url = 'http://www.chinaunicombidding.cn/bidInformation'
    context = browser.new_context()
    page = context.new_page()
    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as jsonData:
        page.goto(url)
    biddingList = getUrlData(jsonData.value.text())
    time.sleep(1)
    for _ in range(1, 3):  # number of pages to fetch
        with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as npJsonData:
            page.get_by_role("button", name="right").click()
        biddingList = getUrlData(npJsonData.value.text())
        time.sleep(1)
    for i in biddingList:
        page1 = context.new_page()
        with page1.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoDetailed/*') as pageJsonData:
            page1.goto(i['url'])
        content = getPageData(pageJsonData.value.text())
        i['content'] = content
        page1.close()
        time.sleep(1)
    context.close()
    browser.close()

with sync_playwright() as playwright:
    cuGetUrls(playwright)