1. Scraping Analysis
Target site: China Unicom Procurement and Bidding Network
http://www.chinaunicombidding.cn/bidInformation
After opening the site, try analyzing the returned data with the browser's developer tools. The site injects a debugger statement, which gets in the way of analysis and scraping. No matter: keep watching the requests and responses in the Network panel, and you will find the data comes back from this endpoint: http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?Wlfknewu=IC_MPalqEcX1eaTdVovIYxd5IeYYI8oGAwPMsbxZojFvDjcBrdU7Eq_7fDUDfKwO1KDlXU__Klc7VGeAofD6Ff9NQw.M9cEX?
Looking at the request headers, the URL changes on every request: the value carried after Wlfknewu= varies dynamically, and the cookie is also dynamically obfuscated and encrypted. A conventional requests-based approach would be very cumbersome and would require JavaScript reverse engineering. Following the idea from the previous article on scraping the China Mobile site, we will again try Playwright to fetch the data.
2. Trying Playwright
① First, check whether Playwright can fetch the data at all
from playwright.sync_api import sync_playwright

def cuGetUrls(playwright):
    browser = playwright.chromium.launch(headless=False)
    url = 'http://www.chinaunicombidding.cn/bidInformation'
    context = browser.new_context()
    page = context.new_page()
    page.goto(url)
    input('Press Enter to continue: ')
    context.close()
    browser.close()

with sync_playwright() as playwright:
    cuGetUrls(playwright)
No data comes back; the site probably detects the browser's WebDriver property.
The next step is to add the relevant launch arguments to suppress the automation-control notice and mask the WebDriver property, then try again.
Now the page loads and returns data normally, so what remains is to process the data and collect it.
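The launch configuration boils down to two Chromium flags, shown below exactly as they appear in the full code at the end of this post. How each flag maps to the site's detection logic is an assumption; whether they remain sufficient depends on the site's checks at any given time.

```python
# Chromium flags used in this post to reduce automation fingerprints.
# Their intended effect (an assumption, not guaranteed across versions):
# mask navigator.webdriver and suppress the automation-control notice.
LAUNCH_ARGS = [
    '--disable-blink-features=AutomationControlled',
    '--enable-automation',
]

# Usage (requires playwright and its browsers to be installed):
#   browser = playwright.chromium.launch(headless=False, args=LAUNCH_ARGS)
print(LAUNCH_ARGS)
```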
② Use the expect_response method to capture the data returned by http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?. Since expect_response supports glob matching on the URL, the changing Wlfknewu= parameter can simply be ignored.
with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as jsonData:
    page.goto(url)
print(jsonData.value.text())
The data is captured and returned as expected.
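The response body is JSON. The sketch below walks through a hypothetical payload (field names are the ones the parsing step uses; all values are invented, except the id, which is the one used in the detail-page example later) to show how the detail-page URL and date are derived from a record:

```python
import json

# Hypothetical excerpt of a getAnnoList response; only the fields the
# parser consumes are shown, with made-up values.
sample = json.dumps({
    "data": {
        "records": [{
            "id": "1814155843256635392",
            "createDate": "2024-07-19 10:00:00",
            "provinceName": "Beijing",
            "annoType": "Procurement",
            "annoName": "Sample announcement",
        }]
    }
})

record = json.loads(sample)['data']['records'][0]
# The detail page is addressed by the record's id.
detail_url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id={}'.format(record['id'])
# Keep only the date part of the timestamp.
create_date = record['createDate'].split(' ')[0]
print(detail_url)
print(create_date)  # 2024-07-19
```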
③ Next, parse and structure the returned data.
import json

biddingList = []

def getUrlData(text):
    data = json.loads(text)
    records = data['data']['records']
    for i in records:
        id = i['id']
        url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id={}'.format(id)
        createDate = i['createDate'].split(' ')[0]
        provinceName = i['provinceName']
        annoType = i['annoType']
        annoName = i['annoName']
        biddingDict = {
            'provinceName': provinceName,
            'annoType': annoType,
            'annoName': annoName,
            'createDate': createDate,
            'url': url
        }
        print(provinceName, annoType, annoName, createDate)
        biddingList.append(biddingDict)
    return biddingList
The json library parses the response, the fields of interest are extracted, and the results are stored as a list of dicts. For long-term storage, a database such as MongoDB or MySQL would be a better fit.
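To illustrate the storage step without extra dependencies, here is a minimal sketch using sqlite3 from the standard library (the post suggests MongoDB or MySQL; sqlite stands in here only to keep the example self-contained, and the table schema and sample row are invented):

```python
import sqlite3

# Invented sample row, in the same shape getUrlData produces.
biddingList = [
    {'provinceName': 'Beijing', 'annoType': 'Procurement',
     'annoName': 'Sample announcement', 'createDate': '2024-07-19',
     'url': 'http://www.chinaunicombidding.cn/bidInformation/detail?id=1'},
]

conn = sqlite3.connect(':memory:')  # use a file path for real persistence
conn.execute('''CREATE TABLE IF NOT EXISTS bidding
                (provinceName TEXT, annoType TEXT, annoName TEXT,
                 createDate TEXT, url TEXT UNIQUE)''')
# The UNIQUE url plus INSERT OR IGNORE makes re-runs idempotent.
conn.executemany(
    'INSERT OR IGNORE INTO bidding VALUES '
    '(:provinceName, :annoType, :annoName, :createDate, :url)',
    biddingList)
conn.commit()
count = conn.execute('SELECT COUNT(*) FROM bidding').fetchone()[0]
print(count)  # 1
```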
④ So far only the first page has been fetched. Locate the next-page button with Playwright's get_by_role, then capture each page's data with expect_response, in the same way as above.
for _ in range(1, 5):  # number of pages to fetch
    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as npJsonData:
        page.get_by_role("button", name="right").click()
    biddingList = getUrlData(npJsonData.value.text())
3. Fetching Announcement Content
To go further and retrieve the full announcement text behind each title, we can scrape each announcement's detail page.
① The previous section already collected the link for each title, so we can open each detail page with Playwright and again capture the data with expect_response.
from playwright.sync_api import sync_playwright

def cmPage(playwright):
    browser = playwright.chromium.launch(headless=False, args=['--disable-blink-features=AutomationControlled', '--enable-automation'])
    url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id=1814155843256635392'
    context = browser.new_context()
    page = context.new_page()
    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoDetailed/*') as pageJsonData:
        page.goto(url)
    print(pageJsonData.value.text())
    context.close()
    browser.close()

with sync_playwright() as playwright:
    cmPage(playwright)
The full announcement content is retrieved successfully.
② Parse and structure the returned data.
import json
from bs4 import BeautifulSoup

def getPageData(text):
    data = json.loads(text)
    annoText = data['data']['annoText']
    soup = BeautifulSoup(annoText, 'html.parser')
    content = soup.get_text(strip=True)
    print(content)
    return content
The announcement text is extracted cleanly.
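As an aside, if BeautifulSoup is not available, the standard library's html.parser can perform the same text extraction. A sketch with an invented annoText sample (concatenation with no separator mirrors get_text(strip=True) here):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect stripped text nodes from an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

def strip_html(annoText):
    extractor = TextExtractor()
    extractor.feed(annoText)
    return ''.join(extractor.parts)

# Invented sample standing in for the real annoText field.
sample = '<div><p>Announcement</p><p>Deadline: 2024-07-20</p></div>'
print(strip_html(sample))  # AnnouncementDeadline: 2024-07-20
```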
4. Complete Code
The code below combines the scraping of announcement titles, links, and detail-page content; it is provided for reference and study only. It does not use concurrency or other more efficient techniques, which you are welcome to explore and add yourself.
Leave a comment if you have questions.
This article may be reposted; please credit the source. Thanks.
# coding: utf-8
from playwright.sync_api import sync_playwright
import json
from bs4 import BeautifulSoup
import time

biddingList = []

def getUrlData(text):
    data = json.loads(text)
    records = data['data']['records']
    for i in records:
        id = i['id']
        url = 'http://www.chinaunicombidding.cn/bidInformation/detail?id={}'.format(id)
        createDate = i['createDate'].split(' ')[0]
        provinceName = i['provinceName']
        annoType = i['annoType']
        annoName = i['annoName']
        biddingDict = {
            'provinceName': provinceName,
            'annoType': annoType,
            'annoName': annoName,
            'createDate': createDate,
            'url': url
        }
        biddingList.append(biddingDict)
    return biddingList

def getPageData(text):
    data = json.loads(text)
    annoText = data['data']['annoText']
    soup = BeautifulSoup(annoText, 'html.parser')
    content = soup.get_text(strip=True)
    return content

def cuGetUrls(playwright):
    browser = playwright.chromium.launch(headless=False, args=['--disable-blink-features=AutomationControlled', '--enable-automation'])
    url = 'http://www.chinaunicombidding.cn/bidInformation'
    context = browser.new_context()
    page = context.new_page()
    with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as jsonData:
        page.goto(url)
    biddingList = getUrlData(jsonData.value.text())
    time.sleep(1)
    for _ in range(1, 3):  # number of pages to fetch
        with page.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoList?*') as npJsonData:
            page.get_by_role("button", name="right").click()
        biddingList = getUrlData(npJsonData.value.text())
        time.sleep(1)
    for i in biddingList:
        page1 = context.new_page()
        with page1.expect_response('http://www.chinaunicombidding.cn/api/v1/bizAnno/getAnnoDetailed/*') as pageJsonData:
            page1.goto(i['url'])
        content = getPageData(pageJsonData.value.text())
        i['content'] = content
        page1.close()
        time.sleep(1)
    context.close()
    browser.close()

with sync_playwright() as playwright:
    cuGetUrls(playwright)