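I previously used Selenium to automate web pages, but it depends on too much of the runtime environment, and Firefox is painfully slow to start every time. So I went looking for an alternative and found Pyppeteer. This post records what I learned along the way.
Preparation
I used the Ubuntu 18.04 LTS environment in WSL on Windows 10, with Python 3.6.8 installed via apt.
Installing Pyppeteer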
$ pip3 install pyppeteer
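Installing Chromium
Run the command below to download Chromium on its own (otherwise Pyppeteer downloads it the first time launch() is called):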
$ pyppeteer-install
Testing
I copied a simple test script from the web and ran it:
import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://quotes.toscrape.com/js/')
    doc = pq(await page.content())
    print('Quotes:', doc('.quote').length)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
This code also uses pyquery, a Python library for jQuery-style DOM manipulation. Install it manually with pip.
Running the script failed with an error:
pyppeteer.errors.BrowserError: Browser closed unexpectedly:
/home/lpwm/.local/share/pyppeteer/local-chromium/575458/chrome-linux/chrome: error while loading shared libraries: libX11-xcb.so.1: cannot open shared object file: No such file or directory
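Although Pyppeteer does not need to start Chromium with a GUI, Chromium still requires the X11 graphics libraries, and WSL does not ship them by default. So I installed the missing Linux packages: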
$ sudo apt-get update
$ sudo apt install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
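Rather than fixing one "cannot open shared object file" error at a time, it can help to list every unresolved library up front. The following is a minimal sketch, assuming ldd is available in the WSL environment; the helper names parse_missing_libs and missing_libs are hypothetical and not part of Pyppeteer.

```python
import subprocess
from typing import List


def parse_missing_libs(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # An unresolved dependency looks like: "libX11-xcb.so.1 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def missing_libs(binary_path: str) -> List[str]:
    """Run ldd on binary_path and list its unresolved shared libraries."""
    result = subprocess.run(
        ["ldd", binary_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # text mode; this spelling works on Python 3.6
    )
    return parse_missing_libs(result.stdout)
```

Calling missing_libs() with the chrome path shown in the error message above should print the libraries that still need a package; an empty list suggests the dependencies are satisfied.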
Running it again produced a new error:
pyppeteer.errors.BrowserError: Browser closed unexpectedly:
[0926/101159.649519:FATAL:zygote_host_impl_linux.cc(116)] No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/master/docs/linux_suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.
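It looks like Chromium's sandbox, which is enabled by default, does not work in this environment. I changed the launch call in the test code above to disable the sandbox: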
browser = await launch(args=['--no-sandbox', '--disable-setuid-sandbox'])
Running it once more finally produced the expected output:
lpwm@DESKTOP-5RBREN9:~/myPy$ python3 t1.py
/usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.25.6) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Quotes: 10
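Since disabling the sandbox removes a layer of isolation, one option is to pass those flags only where they are needed. Below is a rough sketch that enables them only when the script appears to run under WSL. Checking /proc/version for the word "microsoft" is a common heuristic and an assumption here, not something Pyppeteer provides, and all function names are hypothetical.

```python
from typing import List

# The same flags used in the launch() call above.
NO_SANDBOX_ARGS = ["--no-sandbox", "--disable-setuid-sandbox"]


def looks_like_wsl(proc_version: str) -> bool:
    """Heuristic (assumption): WSL kernels mention 'microsoft' in /proc/version."""
    return "microsoft" in proc_version.lower()


def chromium_args(proc_version: str) -> List[str]:
    """Return extra Chromium flags, disabling the sandbox only under WSL."""
    return list(NO_SANDBOX_ARGS) if looks_like_wsl(proc_version) else []


def read_proc_version(path: str = "/proc/version") -> str:
    """Read the kernel version string; return '' if the file cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""
```

The launch call would then become: browser = await launch(args=chromium_args(read_proc_version()))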
Code examples
Taking a screenshot of a website
import asyncio
from pyppeteer import launch

async def main():
    # The sandbox is disabled because it does not work under WSL (see above)
    browser = await launch(args=['--no-sandbox', '--disable-setuid-sandbox'])
    page = await browser.newPage()
    # Set the page (viewport) size
    await page.setViewport({'width': 1500, 'height': 2000})
    await page.goto('http://www.jd.com')
    # Save the rendered viewport as a PNG in the current directory
    await page.screenshot({'path': 'py_screenshot.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
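The screenshot above only covers the 1500x2000 viewport. As a possible variation, the sketch below captures the whole scrollable page by passing fullPage to page.screenshot() and waits for network activity to settle via the waitUntil option of page.goto(). The helper names screenshot_options and capture, and the output file name used below, are hypothetical.

```python
from typing import Dict


def screenshot_options(path: str, full_page: bool = True) -> Dict[str, object]:
    """Build the options dict passed to page.screenshot()."""
    return {'path': path, 'fullPage': full_page}


async def capture(url: str, path: str) -> None:
    """Open url in headless Chromium and save a screenshot to path."""
    # Imported here so the helper above works without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(args=['--no-sandbox', '--disable-setuid-sandbox'])
    try:
        page = await browser.newPage()
        await page.setViewport({'width': 1500, 'height': 2000})
        # Wait until network activity has mostly settled before capturing.
        await page.goto(url, {'waitUntil': 'networkidle2'})
        await page.screenshot(screenshot_options(path))
    finally:
        # Close Chromium even if navigation or the screenshot fails.
        await browser.close()
```

It can be run the same way as the script above, after importing asyncio: asyncio.get_event_loop().run_until_complete(capture('http://www.jd.com', 'py_fullpage.png'))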