Crazy Crawler Cases (5): Source Code at the End

Topic: scraping dynamic web pages with Pyppeteer (pyppeteer: a more efficient scraping tool than selenium)

The code is as follows:

import asyncio
from pyppeteer import launch

async def main():
    # Launch Pyppeteer
    browser = await launch()
    page = await browser.newPage()

    # Open a dynamically rendered page
    await page.goto('https://piaofang.maoyan.com/dashboard/movie')

    # Close the browser
    await browser.close()

# Run the crawler
asyncio.get_event_loop().run_until_complete(main())

Running it fails with the following error:

[INFO] Starting Chromium download.
Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\text.py", line 16, in <module>
    asyncio.get_event_loop().run_until_complete(main())
  File "D:\soft\python\Python38\lib\asyncio\base_events.py", line 608, in run_until_complete
    return future.result()
  File "C:\Users\Administrator\Desktop\text.py", line 6, in main
    browser = await launch()
  File "D:\soft\python\Python38\lib\site-packages\pyppeteer\launcher.py", line 307, in launch
    return await Launcher(options, **kwargs).launch()
  File "D:\soft\python\Python38\lib\site-packages\pyppeteer\launcher.py", line 120, in __init__
    download_chromium()
  File "D:\soft\python\Python38\lib\site-packages\pyppeteer\chromium_downloader.py", line 138, in download_chromium
    extract_zip(download_zip(get_url()), DOWNLOADS_FOLDER / REVISION)
  File "D:\soft\python\Python38\lib\site-packages\pyppeteer\chromium_downloader.py", line 82, in download_zip
    raise OSError(f'Chromium downloadable not found at {url}: ' f'Received {r.data.decode()}.\n')
OSError: Chromium downloadable not found at https://storage.googleapis.com/chromium-browser-snapshots/Win_x64/1181205/chrome-win.zip: Received <?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: chromium-browser-snapshots/Win_x64/1181205/chrome-win.zip</Details></Error>.

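The cause: Pyppeteer drives a Chromium browser. If it detects at run time that no Chromium is installed, it tries to download one automatically, which is the "[INFO] Starting Chromium download." line in the log.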

The download clearly fails: the server answers NoSuchKey for https://storage.googleapis.com/chromium-browser-snapshots/Win_x64/1181205/chrome-win.zip, so nothing can be fetched from that URL.
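One option, if a Chrome or Chromium build is already installed on the machine, is to point pyppeteer at it with the executablePath launch option, which should skip the automatic download step. The sketch below is only an illustration: CHROME_PATH is a hypothetical location, and build_launch_options and open_page are helper names made up here. Without a local browser, the manual route described next still applies.

```python
import asyncio
from pathlib import Path

# Hypothetical location of an already-installed browser; adjust to your machine.
CHROME_PATH = Path(r"C:\Program Files\Google\Chrome\Application\chrome.exe")


def build_launch_options(chrome_path, headless=True):
    """Build the options dict passed to pyppeteer.launch()."""
    return {"executablePath": str(chrome_path), "headless": headless}


async def open_page(url, chrome_path=CHROME_PATH):
    """Open url in the given browser and return the page title."""
    # Imported lazily so the helper above works without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(build_launch_options(chrome_path))
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await page.title()
    finally:
        # Always close the browser, even if navigation fails.
        await browser.close()
```

It would be called as asyncio.get_event_loop().run_until_complete(open_page('https://piaofang.maoyan.com/dashboard/movie')).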

So Chromium has to be downloaded manually. A mirror hosted in China is:

CNPM Binaries Mirror

For this machine (64-bit Windows 7), download:

https://registry.npmmirror.com/-/binary/chromium-browser-snapshots/Win/575458/chrome-win32.zip

Then edit your copy of D:/soft/python/Python38/Lib/site-packages/pyppeteer/chromium_downloader.py

Add the print statement right after the following function:

def chromium_executable() -> Path:
    """Get path of the chromium executable."""
    return chromiumExecutable[current_platform()]

print(chromium_executable())

Run the script again; it prints:

C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\1181205\chrome-win\chrome.exe

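This is the path you need to create when unpacking chrome-win32.zip: extract the archive so that chrome.exe ends up at exactly this location.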
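The unpacking step can also be scripted. The following is a minimal sketch using only the standard library, assuming the archive contains a single folder holding chrome.exe; install_chromium_zip is a hypothetical helper name, not part of pyppeteer, and it needs Python 3.8 or later for dirs_exist_ok.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_chromium_zip(zip_path, expected_exe):
    """Extract zip_path so that the browser binary ends up at expected_exe."""
    zip_path = Path(zip_path)
    expected_exe = Path(expected_exe)
    target_dir = expected_exe.parent  # e.g. ...\local-chromium\1181205\chrome-win

    with tempfile.TemporaryDirectory() as tmp:
        # Unpack into a scratch directory first.
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp)

        # Locate the folder inside the archive that holds the executable.
        matches = list(Path(tmp).rglob(expected_exe.name))
        if not matches:
            raise FileNotFoundError(
                f"{expected_exe.name} not found in {zip_path}"
            )
        source_dir = matches[0].parent

        # Copy that folder to the location pyppeteer looks in,
        # creating missing parent directories along the way.
        shutil.copytree(source_dir, target_dir, dirs_exist_ok=True)

    return expected_exe
```

For example, install_chromium_zip(r'C:\Users\Administrator\Downloads\chrome-win32.zip', chromium_executable()) would place the files at the printed path; the download location here is an assumption, and chromium_executable is the function from pyppeteer.chromium_downloader shown above.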

The complete code:

import asyncio
from pyppeteer import launch

async def main():
    # Launch Pyppeteer with a visible browser window
    browser = await launch({'headless': False})
    page = await browser.newPage()

    # Open a dynamically rendered page
    await page.goto('https://piaofang.maoyan.com/dashboard/movie')
    # Give the page time to render its data
    await asyncio.sleep(5)
    # Every child of <tbody> is one movie row
    tbody = await page.xpath("//*[@id='app']/div/div/div[2]/div[1]/div[2]/div/table/tbody/child::*")

    # Print the first 10 rows
    for i, item in enumerate(tbody[:10], start=1):
        # Movie title
        title = await item.xpath('./td[1]/div/div[@class="moviename-desc"]/p[@class="moviename-name"]')
        title = await (await title[0].getProperty('textContent')).jsonValue()

        # Days since release
        days = await item.xpath('./td[1]/div/div[@class="moviename-desc"]/p[@class="moviename-info"]/span[1]')
        days = await (await days[0].getProperty('textContent')).jsonValue()

        # Box office figure
        money = await item.xpath('./td[1]/div/div[@class="moviename-desc"]/p[@class="moviename-info"]/span[2]')
        money = await (await money[0].getProperty('textContent')).jsonValue()

        print(str(i) + '.' + title + ' [' + days + '] [票房' + money + ']')

    # Close the browser
    await browser.close()

# Run the crawler
asyncio.get_event_loop().run_until_complete(main())
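The fixed five-second sleep and the three near-identical extraction steps can be tightened. The sketch below is one possible variant, not the code this post was run with: it waits for the rows with page.waitForXPath instead of sleeping, and factors the text extraction into a helper. The names format_row, text_of and scrape are made up for illustration, the 15-second timeout is an arbitrary assumption, and the text is stripped of surrounding whitespace, which the original does not do.

```python
import asyncio

URL = "https://piaofang.maoyan.com/dashboard/movie"
# Same row XPath as in the code above.
ROW_XPATH = "//*[@id='app']/div/div/div[2]/div[1]/div[2]/div/table/tbody/child::*"
DESC = './td[1]/div/div[@class="moviename-desc"]'


def format_row(rank, title, days, money):
    """Format one result line the same way as the print call above."""
    return f"{rank}.{title} [{days}] [票房{money}]"


async def text_of(element, xpath):
    """Return the stripped textContent of the first match, or '' if none."""
    nodes = await element.xpath(xpath)
    if not nodes:
        return ""
    prop = await nodes[0].getProperty("textContent")
    return (await prop.jsonValue()).strip()


async def scrape(limit=10):
    # Imported lazily so the helpers above work without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch({"headless": False})
    try:
        page = await browser.newPage()
        await page.goto(URL)
        # Wait until at least one row exists (timeout in milliseconds).
        await page.waitForXPath(ROW_XPATH, {"timeout": 15000})
        rows = await page.xpath(ROW_XPATH)

        lines = []
        for rank, row in enumerate(rows[:limit], start=1):
            title = await text_of(row, DESC + '/p[@class="moviename-name"]')
            days = await text_of(row, DESC + '/p[@class="moviename-info"]/span[1]')
            money = await text_of(row, DESC + '/p[@class="moviename-info"]/span[2]')
            lines.append(format_row(rank, title, days, money))
        return lines
    finally:
        # Close the browser even if a lookup fails.
        await browser.close()
```

Running print('\n'.join(asyncio.get_event_loop().run_until_complete(scrape()))) would print the list. Returning '' for a missing node avoids the IndexError that title[0] raises when the page layout changes.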

Output:

1.xxx [上映首日] [票房1.10亿]
2.xxx [上映首日] [票房6621.5万]
3.xxx [上映2天] [票房8749.6万]
4.xxx [上映首日] [票房1.14亿]
5.xxx [上映2天] [票房4765.7万]
6.xxx [上映首日] [票房2921.1万]
7.xxx [上映首日] [票房1536.0万]
8.xxx [上映首日] [票房961.5万]
9.xxx [上映34天] [票房9.17亿]
10.xxx [上映29天] [票房7.77亿]

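Here xxx stands for the actual movie titles. In the output, 上映首日 means opening day, 上映2天 means the second day of release, and 票房 means box office.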

More features

See the following documentation:

https://pyppeteer.github.io/pyppeteer/reference.html

https://www.w3cschool.cn/puppeteer/ 
