Scraping the China Unicom Procurement and Bidding Website with Python + Playwright

1. Crawl Analysis

Target site: the China Unicom Procurement and Bidding Network

http://www.chinaunicombidding.cn/bidInformation

After opening the site and trying to analyze the returned data with the browser's developer tools, you'll find the site injects a debugger statement that interferes with analysis (in Chrome DevTools this can be sidestepped with "Deactivate breakpoints"). That's not a problem: keep watching the requests and responses in the Network panel, and you'll see the data comes back from this endpoint: http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?Wlfknewu=IC_MPalqEcX1eaTdVovIYxd5IeYYI8oGAwPMsbxZojFvDjcBrdU7Eq_7fDUDfKwO1KDlXU__Klc7VGeAofD6Ff9NQw.M9cEX?

Looking further at the request headers, the request URL changes every time: the value after Wlfknewu= varies dynamically, and the cookies are dynamically obfuscated and encrypted as well. The conventional requests approach would be very cumbersome here and would require JS reverse engineering, so, following the idea from the previous article on scraping the China Mobile website, we'll try Playwright to fetch the data instead.

2. Trying the Crawl with Playwright

① First, check whether Playwright can retrieve the data at all

from playwright.sync_api import sync_playwright

def cuGetUrls(playwright):
    browser = playwright.chromium.launch(headless=False)
    url = 'http://www.chinaunicombidding.cn/bidInformation'
    context = browser.new_context()
    page = context.new_page()
    page.goto(url)
    input('Press Enter to continue: ')  # pause so the page can be inspected manually
    context.close()
    browser.close()

with sync_playwright() as playwright:
    cuGetUrls(playwright)

No data comes back; the site is probably detecting the browser's WebDriver property.

Next, add the relevant launch arguments to suppress the automation-control hint and mask the WebDriver property, then try again (see the snippet below).
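
The adjusted launch call looks like this; these are the same flags used in the complete script at the end of this article:

browser = playwright.chromium.launch(
    headless=False,
    args=[
        '--disable-blink-features=AutomationControlled',  # hide the automation fingerprint (navigator.webdriver)
        '--enable-automation',
    ],
)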

With those flags the page loads and the data is returned, so the next step is simply to process and collect it.

② Use the expect_response method to capture the data returned by http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?. Since expect_response supports glob matching on the URL, the changing Wlfknewu= parameter can simply be ignored.

    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as jsonData:
        page.goto(url)  # the '*' matches the changing Wlfknewu= token
    print(jsonData.value.text())

The data comes back as expected.

③ Next, format the returned data.

import json

biddingList = []

def getUrlData(text):
    """Parse the announcement-list JSON and collect the fields we care about."""
    data = json.loads(text)
    records = data['data']['records']
    for i in records:
        annoId = i['id']
        url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id={}'.format(annoId)
        createDate = i['createDate'].split(' ')[0]  # keep the date, drop the time
        provinceName = i['provinceName']
        annoType = i['annoType']
        annoName = i['annoName']
        biddingDict = {
            'provinceName': provinceName,
            'annoType': annoType,
            'annoName': annoName,
            'createDate': createDate,
            'url': url
        }
        print(provinceName, annoType, annoName, createDate)
        biddingList.append(biddingDict)
    return biddingList

The json library parses the response, the relevant fields are extracted, and the records are stored as a list of dicts. For long-term storage you could use a database such as MongoDB or MySQL.
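
For example, persisting biddingList to MongoDB could look roughly like this. This is a minimal sketch assuming a local MongoDB instance and pymongo installed; the database and collection names are made up:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')  # assumed local MongoDB instance
collection = client['bidding']['unicom_annos']     # hypothetical db/collection names
if biddingList:
    collection.insert_many(biddingList)            # one document per announcement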

④ So far only the first page has been fetched. Locate the next-page button with Playwright's get_by_role, then capture each subsequent page's data with expect_response in the same way.

for _ in range(1, 5):  # number of extra pages to fetch
    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as npJsonData:
        page.get_by_role("button", name="right").click()  # click the next-page arrow
    biddingList = getUrlData(npJsonData.value.text())

3. Fetching the Announcement Content

To go a step further and get the full announcement text behind each listing, crawl each announcement's detail page.

① The previous section already collected the link for each listing. Visit those detail pages with Playwright and, as before, capture the data with expect_response.

from playwright.sync_api import sync_playwright

def cmPage(playwright):
    browser = playwright.chromium.launch(headless=False, args=['--disable-blink-features=AutomationControlled', '--enable-automation'])
    url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id=1814155843256635392'
    context = browser.new_context()
    page = context.new_page()
    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoDetailed/*') as pageJsonData:
        page.goto(url)
    print(pageJsonData.value.text())
    context.close()
    browser.close()

with sync_playwright() as playwright:
    cmPage(playwright)

The detailed announcement content comes back successfully.

② Format the retrieved data.

import json
from bs4 import BeautifulSoup

def getPageData(text):
    data = json.loads(text)
    annoText = data['data']['annoText']  # announcement body is returned as HTML
    soup = BeautifulSoup(annoText, 'html.parser')
    content = soup.get_text(strip=True)  # strip the tags, keep plain text
    print(content)
    return content

The announcement text is extracted cleanly.

4. Complete Code

Combining the code above for fetching announcement titles, links, and detail-page content gives the script below, which is for reference and learning only. It doesn't use concurrency or other more efficient techniques; feel free to adapt it (a rough async sketch follows the script).

Feel free to leave a comment if you have questions.

This article may be reposted; please credit the source. Thanks.

#coding:utf-8
from playwright.sync_api import sync_playwright
import json
from bs4 import BeautifulSoup
import time

biddingList = []

def getUrlData(text):
    """Parse the announcement-list JSON and collect the fields we care about."""
    data = json.loads(text)
    records = data['data']['records']
    for i in records:
        annoId = i['id']
        url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id={}'.format(annoId)
        createDate = i['createDate'].split(' ')[0]  # keep the date, drop the time
        provinceName = i['provinceName']
        annoType = i['annoType']
        annoName = i['annoName']
        biddingDict = {
            'provinceName': provinceName,
            'annoType': annoType,
            'annoName': annoName,
            'createDate': createDate,
            'url': url
        }
        biddingList.append(biddingDict)
    return biddingList

def getPageData(text):
    """Extract the plain-text announcement body from the detail JSON."""
    data = json.loads(text)
    annoText = data['data']['annoText']  # announcement body is returned as HTML
    soup = BeautifulSoup(annoText, 'html.parser')
    content = soup.get_text(strip=True)
    return content

def cuGetUrls(playwright):
    browser = playwright.chromium.launch(headless=False, args=['--disable-blink-features=AutomationControlled', '--enable-automation'])
    url = 'http://www.chinaunicombidding.cn/bidInformation'
    context = browser.new_context()
    page = context.new_page()
    # First page of the announcement list.
    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as jsonData:
        page.goto(url)
    biddingList = getUrlData(jsonData.value.text())
    time.sleep(1)
    for _ in range(1, 3):  # number of extra pages to fetch
        with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as npJsonData:
            page.get_by_role("button", name="right").click()  # click the next-page arrow
        biddingList = getUrlData(npJsonData.value.text())
        time.sleep(1)
    # Visit each detail page and attach the announcement text.
    for i in biddingList:
        page1 = context.new_page()
        with page1.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoDetailed/*') as pageJsonData:
            page1.goto(i['url'])
        content = getPageData(pageJsonData.value.text())
        i['content'] = content
        page1.close()
        time.sleep(1)
    context.close()
    browser.close()

with sync_playwright() as playwright:
    cuGetUrls(playwright)
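
If you want to speed up the detail-page step, a rough sketch using Playwright's async API might look like the following. It is untested against this site and only illustrates the idea; it assumes biddingList has already been filled by the listing crawl and reuses getPageData from above:

import asyncio
from playwright.async_api import async_playwright

async def fetchDetail(context, item):
    # Open one tab per announcement and capture its detail response.
    page = await context.new_page()
    async with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoDetailed/*') as respInfo:
        await page.goto(item['url'])
    response = await respInfo.value
    item['content'] = getPageData(await response.text())
    await page.close()

async def fetchAllDetails(biddingList):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False, args=['--disable-blink-features=AutomationControlled', '--enable-automation'])
        context = await browser.new_context()
        # Small batches so the site isn't hit with too many parallel requests.
        for i in range(0, len(biddingList), 3):
            batch = biddingList[i:i + 3]
            await asyncio.gather(*(fetchDetail(context, item) for item in batch))
        await browser.close()

# asyncio.run(fetchAllDetails(biddingList))

Batching with asyncio.gather keeps only a few pages in flight at once, which is gentler on the site than firing every request at the same time.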


