Topic: Scraping dynamic web pages with Pyppeteer (pyppeteer: a scraping tool more efficient than selenium)
The code is as follows:
import asyncio
from pyppeteer import launch

async def main():
    # Start Pyppeteer
    browser = await launch()
    page = await browser.newPage()
    # Visit a dynamic web page
    await page.goto('https://piaofang.maoyan.com/dashboard/movie')
    # Close the browser
    await browser.close()

# Run the scraper
asyncio.get_event_loop().run_until_complete(main())
Running it raises this error:
[INFO] Starting Chromium download.
Traceback (most recent call last):
File "C:\Users\Administrator\Desktop\text.py", line 16, in <module>
asyncio.get_event_loop().run_until_complete(main())
File "D:\soft\python\Python38\lib\asyncio\base_events.py", line 608, in run_until_complete
return future.result()
File "C:\Users\Administrator\Desktop\text.py", line 6, in main
browser = await launch()
File "D:\soft\python\Python38\lib\site-packages\pyppeteer\launcher.py", line 307, in launch
return await Launcher(options, **kwargs).launch()
File "D:\soft\python\Python38\lib\site-packages\pyppeteer\launcher.py", line 120, in __init__
download_chromium()
File "D:\soft\python\Python38\lib\site-packages\pyppeteer\chromium_downloader.py", line 138, in download_chromium
extract_zip(download_zip(get_url()), DOWNLOADS_FOLDER / REVISION)
File "D:\soft\python\Python38\lib\site-packages\pyppeteer\chromium_downloader.py", line 82, in download_zip
raise OSError(f'Chromium downloadable not found at {url}: ' f'Received {r.data.decode()}.\n')
OSError: Chromium downloadable not found at https://storage.googleapis.com/chromium-browser-snapshots/Win_x64/1181205/chrome-win.zip: Received <?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: chromium-browser-snapshots/Win_x64/1181205/chrome-win.zip</Details></Error>.
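This happens because Pyppeteer needs a Chromium browser to drive. When it finds no Chromium at startup, it tries to download one automatically, which is what the log line [INFO] Starting Chromium download. indicates.
The download fails, however: the request to https://storage.googleapis.com/chromium-browser-snapshots/Win_x64/1181205/chrome-win.zip returns NoSuchKey, so that build cannot be fetched from there.
Chromium therefore has to be downloaded manually. A mirror that is reachable from mainland China is given below.
Download used on this machine (Windows 7, 64-bit):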
https://registry.npmmirror.com/-/binary/chromium-browser-snapshots/Win/575458/chrome-win32.zip
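Next, edit your copy of D:/soft/python/Python38/Lib/site-packages/pyppeteer/chromium_downloader.py (adjust the path to your own installation).
Add the print call shown on the last line below, right after this function: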
def chromium_executable() -> Path:
    """Get path of the chromium executable."""
    return chromiumExecutable[current_platform()]

print(chromium_executable())
Run the script again; it now prints:
C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\1181205\chrome-win\chrome.exe
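This is the path Pyppeteer expects chrome.exe to be at, so create these directories and extract the contents of chrome-win32.zip into them. Note that the directory is named after revision 1181205, while the archive downloaded above is build 575458.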
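The extraction step can also be scripted. The following is only a minimal sketch using the standard library: the function name extract_chromium is made up for illustration, and it assumes the archive contains a single top-level folder (such as chrome-win32/) whose contents should land next to the expected chrome.exe path.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_chromium(zip_path, chrome_exe_path):
    """Unpack zip_path so that chrome.exe ends up at chrome_exe_path.

    Hypothetical helper: assumes the archive has one top-level folder
    (e.g. chrome-win32/). That folder name is dropped and its contents
    are written into the parent directory of chrome_exe_path.
    """
    target_dir = Path(chrome_exe_path).parent
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            parts = PurePosixPath(member.filename).parts
            # Skip directory entries, files outside the top-level folder,
            # and any entry that tries to climb out with "..".
            if member.is_dir() or len(parts) < 2 or ".." in parts:
                continue
            destination = target_dir.joinpath(*parts[1:])
            destination.parent.mkdir(parents=True, exist_ok=True)
            destination.write_bytes(archive.read(member))
    # True only if the executable is now where Pyppeteer looks for it.
    return Path(chrome_exe_path).exists()
```

On the machine described here, it would be called with the location of the downloaded chrome-win32.zip as the first argument and the path printed above (as a raw string) as the second.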
The complete code is as follows:
import asyncio
from pyppeteer import launch

async def main():
    # Start Pyppeteer with a visible browser window
    browser = await launch({'headless': False})
    page = await browser.newPage()
    # Visit a dynamic web page
    await page.goto('https://piaofang.maoyan.com/dashboard/movie')
    # Wait for the page to finish rendering
    await asyncio.sleep(5)
    # Select every row of the table body
    tbody = await page.xpath("//*[@id='app']/div/div/div[2]/div[1]/div[2]/div/table/tbody/child::*")
    i = 1
    for item in tbody:
        # Get the text of the title, days-in-release, and box-office cells
        title = await item.xpath('./td[1]/div/div[@class="moviename-desc"]/p[@class="moviename-name"]')
        title = await (await title[0].getProperty('textContent')).jsonValue()
        days = await item.xpath('./td[1]/div/div[@class="moviename-desc"]/p[@class="moviename-info"]/span[1]')
        days = await (await days[0].getProperty('textContent')).jsonValue()
        money = await item.xpath('./td[1]/div/div[@class="moviename-desc"]/p[@class="moviename-info"]/span[2]')
        money = await (await money[0].getProperty('textContent')).jsonValue()
        print(str(i) + '.' + title + ' [' + days + '] [票房' + money + ']')
        # Stop after the first 10 rows
        if i == 10:
            break
        i += 1
    # Close the browser
    await browser.close()

# Run the scraper
asyncio.get_event_loop().run_until_complete(main())
Output:
1.xxx [上映首日] [票房1.10亿]
2.xxx [上映首日] [票房6621.5万]
3.xxx [上映2天] [票房8749.6万]
4.xxx [上映首日] [票房1.14亿]
5.xxx [上映2天] [票房4765.7万]
6.xxx [上映首日] [票房2921.1万]
7.xxx [上映首日] [票房1536.0万]
8.xxx [上映首日] [票房961.5万]
9.xxx [上映34天] [票房9.17亿]
10.xxx [上映29天] [票房7.77亿]
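Here xxx stands for the actual movie title, which depends on the live page content. In the output, 上映首日 means "opening day", 上映N天 means "N days in release", and 票房 means "box office".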
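The box-office figures above are strings that use the Chinese units 万 (10^4) and 亿 (10^8). If numeric values are needed, for example to sort the rows, a small conversion helper along these lines could be used. This is a sketch only: the name parse_box_office is hypothetical, and it assumes every value is a decimal number optionally followed by one of those two units.

```python
from decimal import Decimal

# Multipliers for the Chinese numeric units seen in the output above.
UNITS = {"万": 10_000, "亿": 100_000_000}


def parse_box_office(text):
    """Convert a string such as '6621.5万' or '1.10亿' to an integer amount."""
    text = text.strip()
    multiplier = 1
    if text and text[-1] in UNITS:
        multiplier = UNITS[text[-1]]
        text = text[:-1]
    # Decimal avoids binary floating-point rounding before the int conversion.
    return int(Decimal(text) * multiplier)
```

Decimal is used instead of float so that a value like 1.10亿 converts to a whole number without rounding noise.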
More features
See the following documentation: