Scraping Websites with Selenium + ChromeDriver on WSL2

津南楚雨荨

Original source: https://www.gregbrisebois.com/posts/chromedriver-in-wsl2/


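My project uses a lot of Docker images. I wanted the Git repo behind each of them but was too lazy to dig through them by hand, so I wrote a simple scraper.

Initial setup: ChromeDriver

Most of this section comes from: https://www.gregbrisebois.com/posts/chromedriver-in-wsl2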

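First, install Chrome

Chrome has to be installed inside WSL even if it is already installed on Windows. My machine uses Edge, but because the browser is displayed through an X server, the browsers on the Windows side play no part here.

Dependencies: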

sudo apt-get update
sudo apt-get install -y curl unzip xvfb libxi6 libgconf-2-4

Chrome itself:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb

Verify the installation:

google-chrome --version

Install ChromeDriver

Find the ChromeDriver release that matches the installed google-chrome version:
https://chromedriver.chromium.org/

Download it, unzip it, and put it on the PATH:

# The version number in the URL must match the installed browser version
wget https://chromedriver.storage.googleapis.com/86.0.4240.22/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver

Verify the installation:

chromedriver --version

If you previously installed ChromeDriver on Windows and added its location to PATH, make sure that chromedriver now resolves to the copy you just unpacked:

which chromedriver
# should be /usr/bin/chromedriver
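
Since the two version checks above have to agree, here is a minimal sketch that compares them from Python. The helper names `major_version` and `installed_major` are my own, and comparing only the major version is an assumption that works as a quick sanity check, not a rule taken from the original post.

```python
import re
import subprocess


def major_version(version_output: str) -> int:
    """Return the major version from text such as 'Google Chrome 86.0.4240.75'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def installed_major(command: str) -> int:
    """Run '<command> --version' and parse the major version from its output."""
    result = subprocess.run(
        [command, "--version"], capture_output=True, text=True, check=True
    )
    return major_version(result.stdout)


# Hypothetical usage inside WSL, once both programs are installed:
#   if installed_major("google-chrome") != installed_major("chromedriver"):
#       print("Chrome and ChromeDriver versions do not match")
```

If the two numbers differ, downloading the ChromeDriver release that matches the browser is usually the fix.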

The X Server

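Download the X server we need, VcXsrv: https://sourceforge.net/projects/vcxsrv/
Once it is downloaded, run xlaunch.exe.
Pay attention to a few settings:

Display settings: there are four options and the default is Multiple windows. I personally prefer One large window because it is easier to see.
Extra settings: Disable access control must be checked!

Also check your firewall: the X server has to be allowed through.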

In Linux the DISPLAY environment variable tells GUI applications at which IP address the X Server is that we want to use. Since in WSL2 the IP address of Windows land is not localhost anymore, we need to set DISPLAY to the correct IP address:
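Unless the export line below is placed in a shell startup file such as ~/.bashrc, it has to be typed again every time WSL is opened; otherwise chromedriver cannot connect to the X server.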

export DISPLAY=$(cat /etc/resolv.conf | grep nameserver | awk '{print $2; exit;}'):0.0

Verify the setting:

echo $DISPLAY
# prints something like 172.17.35.177:0.0
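
The same lookup can be done from inside the Python script, so the scraper does not depend on the shell having exported DISPLAY. This is only a sketch: `display_from_resolv_conf` and `set_display` are hypothetical helper names, and unlike the grep above it only accepts lines whose first field is exactly `nameserver`.

```python
import os
from typing import Optional


def display_from_resolv_conf(text: str) -> Optional[str]:
    """Return '<first nameserver>:0.0', or None if there is no nameserver line."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            return f"{fields[1]}:0.0"
    return None


def set_display(path: str = "/etc/resolv.conf") -> str:
    """Read path, set DISPLAY for this process and its children, and return it."""
    with open(path, encoding="utf-8") as handle:
        display = display_from_resolv_conf(handle.read())
    if display is None:
        raise RuntimeError(f"no nameserver entry found in {path}")
    os.environ["DISPLAY"] = display
    return display
```

Calling `set_display()` before `webdriver.Chrome()` should be enough, because the chromedriver and Chrome processes started afterwards inherit the environment of the Python process.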

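Setup complete

At this point, running google-chrome from the command line should open a Chrome window inside the X server, which means the setup worked. In theory the command line should no longer print errors, but mine still shows a connect error. The page in the X server works normally, though, and the error does not get in the way.

If you see the following error: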

Error: /etc/machine-id contains 0 characters (32 were expected).

make sure "Disable access control" is enabled in the X server and that VcXsrv is allowed through the Windows firewall.
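
To narrow down whether the firewall is the problem, a rough check is to open a TCP connection to the X server from WSL. The sketch below relies on the X11 convention that display number N listens on TCP port 6000 + N; the function names are my own. A successful connection only shows that the port is reachable, not that access control is disabled.

```python
import socket
from typing import Tuple


def x11_endpoint(display: str) -> Tuple[str, int]:
    """Split a DISPLAY value like '172.17.35.177:0.0' into (host, tcp_port)."""
    host, _, rest = display.rpartition(":")
    if not host or not rest:
        raise ValueError(f"expected HOST:DISPLAY[.SCREEN], got {display!r}")
    display_number = int(rest.split(".")[0])
    return host, 6000 + display_number


def x_server_reachable(display: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the X server can be opened."""
    host, port = x11_endpoint(display)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage inside WSL:
#   import os
#   print(x_server_reachable(os.environ["DISPLAY"]))
```

If this returns False while VcXsrv is running, the Windows firewall rule for VcXsrv is the first thing I would look at.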

Writing the scraper with Selenium

I scraped Docker Hub to find the GitHub repos of some images. The code:

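from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get(f"https://hub.docker.com/r/{img}")  # img holds the image name and is set elsewhere
# Get every element that has an href; read the URLs before the browser is closed
compare2 = browser.find_elements(By.XPATH, "//*[@href]")
hrefs = [element.get_attribute("href") for element in compare2]
browser.close()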

Regular expressions for GitHub URLs:

gitReg1 = r'https://www\.github\.com/[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]'
gitReg2 = r'https://github\.com/[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]'
# These can also be merged into one; I kept two because I needed to tell the cases apart
gitReg = r'https://(?:www\.)?github\.com/[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]'
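
One way to apply such a pattern to the scraped links is sketched below. It assumes the href values have already been collected as plain strings (some of which may be None), and `github_links` is a hypothetical helper name, not something from the original script.

```python
import re
from typing import Iterable, List, Optional

# Character classes taken from the article's patterns; the dots are escaped
# and the "www." prefix is optional.
GITHUB_URL = re.compile(
    r"https://(?:www\.)?github\.com/"
    r"[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]"
)


def github_links(hrefs: Iterable[Optional[str]]) -> List[str]:
    """Return the unique GitHub URLs found in hrefs, in first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # elements without a usable href
        match = GITHUB_URL.match(href)
        if match and match.group(0) not in seen:
            seen.add(match.group(0))
            result.append(match.group(0))
    return result
```

Deduplicating here is a design choice on my part: a Docker Hub page tends to link to the same repo several times, and one entry per repo is enough for this task.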

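After that, run the .py file and the scraping job is done! The tedious part is mainly installing Chrome and the X server on WSL2; once that initial setup is finished, the scraper itself is easy to write.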