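googleimagesdownload is an open-source Google Images scraper. For up to 100 images, prefixing the command with proxychains4 is enough; asking for more than 100 images fails with the following error: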
[proxychains] config file found: /etc/proxychains.conf
[proxychains] preloading /usr/local/lib/libproxychains4.so
[proxychains] DLL init: proxychains-ng 4.13-git-3-geb36238
Item no.: 1 --> Item name = dji
Evaluating...
[proxychains] DLL init: proxychains-ng 4.13-git-3-geb36238
[proxychains] Strict chain ... 127.0.0.1:1080 ... 127.0.0.1:58673 ... OK
[proxychains] Strict chain ... 127.0.0.1:1080 ... 127.0.0.1:58673 ... OK
[proxychains] Strict chain ... 127.0.0.1:1080 ... 127.0.0.1:58673 ... OK
[proxychains] Strict chain ... 127.0.0.1:1080 ... 127.0.0.1:58673 ... OK
[proxychains] Strict chain ... 127.0.0.1:1080 ... 127.0.0.1:58673 ... OK
[proxychains] Strict chain ... 127.0.0.1:1080 ... 127.0.0.1:58673 ... OK
[proxychains] Strict chain ... 127.0.0.1:1080 ... 127.0.0.1:58673 ... OK
[proxychains] Strict chain ... 127.0.0.1:1080 ... 127.0.0.1:58673 ... OK
Looks like we cannot locate the path the 'chromedriver' (use the '--chromedriver' argument to specify the path to the executable.) or google chrome browser is not installed on your machine (exception: Remote end closed connection without response)
[proxychains] Strict chain ... 127.0.0.1:1080 ... 127.0.0.1:58673 ... OK
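Above 100 images the tool drives Chrome through chromedriver, and proxychains4 probably conflicts with the chromedriver that google_images_download.py launches. The workaround is to modify the source so that Chrome itself uses the SOCKS5 proxy.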
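Install chromedriver:
Download the chromedriver release that matches your Chrome version from the official site, unpack it, run chmod a+x on it, and place it in /usr/bin.
Test it from Python: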
from selenium import webdriver
browser = webdriver.Chrome()
If a blank browser window opens, the installation works.
Download googleimagesdownload
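If no window opens, a quick check of whether the binary is visible and executable can narrow things down. The following is only a small sketch; the helper name find_chromedriver and its parameters are made up for illustration and are not part of selenium or googleimagesdownload.

```python
import os
import shutil


def find_chromedriver(name="chromedriver", extra_dirs=("/usr/bin",)):
    """Return the path of an executable chromedriver, or None if none is found."""
    # Look on PATH first, which is where webdriver.Chrome() searches by default.
    found = shutil.which(name)
    if found:
        return found
    # Fall back to explicit directories such as /usr/bin.
    for directory in extra_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = find_chromedriver()
    print(path or "chromedriver not found or not executable")
```

A result of None usually means the chmod a+x step was skipped or the file ended up in a different directory.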
git clone https://github.com/hardikvasa/google-images-download.git
cd google-images-download
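Note: installing through pip is not recommended here, because the installed source file is then not convenient to modify.
Modify the source file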
vim google_images_download/google_images_download.py
// insert at line 165:
options.add_argument('--proxy-server=socks5://localhost:1080')
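To see what that one line does outside the library, here is a standalone sketch of the same idea. The function names proxy_argument and make_proxied_browser are hypothetical, and the call webdriver.Chrome(options=...) assumes a Selenium release that accepts the options keyword (older releases used chrome_options instead).

```python
def proxy_argument(host="localhost", port=1080, scheme="socks5"):
    """Build the Chrome switch that sends browser traffic through a proxy."""
    return "--proxy-server={}://{}:{}".format(scheme, host, port)


def make_proxied_browser():
    """Start Chrome behind the proxy (needs selenium, Chrome and chromedriver)."""
    # Imported here so proxy_argument() can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Same switch as the line inserted into google_images_download.py.
    options.add_argument(proxy_argument())
    # Assumption: this Selenium version accepts `options=`.
    return webdriver.Chrome(options=options)
```

Calling make_proxied_browser() and then browser.get() on a Google URL is a simple way to confirm that Chrome reaches the site through the proxy before running the scraper itself.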
After saving, scraping works normally, provided shadowsocks is configured and running on the local machine:
python google_images_download/google_images_download.py -k "apple" --chromedriver '/usr/bin/chromedriver' -l 200
// or, after building and installing the modified package:
googleimagesdownload -k "apple" --chromedriver '/usr/bin/chromedriver' -l 200
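If the download still fails, it is worth confirming that the local proxy is actually listening. The sketch below only checks that something accepts TCP connections on the port (1080 is the port used above); it does not perform a SOCKS5 handshake, and proxy_is_listening is an illustrative name, not an existing tool.

```python
import socket


def proxy_is_listening(host="127.0.0.1", port=1080, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("proxy reachable:", proxy_is_listening())
```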