1. Development Environment Setup
1.1 Installing Python
Windows (set the environment variables)
Linux
Mac
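Whichever platform you install on, it helps to confirm which interpreter is actually being picked up. The following is only a minimal sketch: the helper name describe_python and the (3, 6) minimum version are hypothetical choices made for the example, not requirements stated in these notes.

```python
import shutil
import sys


def describe_python(min_version=(3, 6)):
    """Return basic facts about the running interpreter.

    min_version is a hypothetical threshold used only for this example.
    """
    return {
        "version": "%d.%d.%d" % sys.version_info[:3],
        "executable": sys.executable,
        # shutil.which searches PATH; None means "python" is not on PATH.
        "python_on_path": shutil.which("python"),
        "new_enough": sys.version_info[:2] >= tuple(min_version),
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(key, "=", value)
```

If python_on_path comes back as None on Windows, the environment variable step above has probably not taken effect yet.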
1.2 Installing request libraries
requests
Selenium
Install the older release selenium==2.48.0; newer Selenium versions do not support PhantomJS.
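To show where the version pin matters, here is a minimal sketch of driving PhantomJS through Selenium. It assumes a Selenium release that still provides webdriver.PhantomJS (such as the 2.48.0 pin above) and a phantomjs executable on PATH; the function name fetch_title is hypothetical.

```python
def fetch_title(url):
    """Load url in headless PhantomJS and return the page title.

    Assumes an older Selenium release that still provides
    webdriver.PhantomJS, and a phantomjs executable on PATH.
    """
    # Imported here so this module can be loaded even without Selenium.
    from selenium import webdriver

    if not hasattr(webdriver, "PhantomJS"):
        raise RuntimeError(
            "This Selenium release has no PhantomJS driver; "
            "install selenium==2.48.0 instead."
        )

    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always shut the browser process down, even on errors.
        driver.quit()
```

With a newer Selenium installed, the hasattr check fails early with a readable message instead of an AttributeError deep inside the script.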
ChromeDriver
Taobao mirror: http://npm.taobao.org/mirrors/chromedriver
GeckoDriver
PhantomJS (version 2.1.1)
Install the dependencies:
sudo apt-get install build-essential chrpath libssl-dev libxft-dev
sudo apt-get install libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
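Download (Taobao mirror): https://npmmirror.com/mirrors/phantomjs?spm=a2c6h.24755359.0.0.6d443dc1T0AXPt
Installation option 1: place it in a system directory (recommended)
The extracted folder name is too long, so rename it (phantomjs211 in the commands below) and move it into /usr/local/share: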
sudo mv phantomjs211 /usr/local/share/
Create a symlink so it can be launched from anywhere:
sudo ln -s /usr/local/share/phantomjs211/bin/phantomjs /usr/local/bin/
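Installation option 2: place it in your home directory
After extracting the downloaded archive, move the folder to your home directory, make it hidden (prefix the folder name with a dot), and then edit ~/.profile: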
vim ~/.profile
Append the path to the phantomjs executable at the end of the file, for example:
export PATH="$HOME/.phantomjs<version>/bin:$PATH"
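With either installation option, the result can be checked from Python by looking the binary up on PATH and asking it for its version. This is a rough sketch; the helper name phantomjs_version is hypothetical, and it relies only on the phantomjs --version flag.

```python
import shutil
import subprocess


def phantomjs_version(binary="phantomjs"):
    """Return the version PhantomJS reports, or None if it is not
    on PATH or exits with an error."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print(phantomjs_version())
```

A None result after option 2 often just means ~/.profile has not been re-read yet; logging in again or running source ~/.profile should pick up the new PATH.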
Troubleshooting
Auto configuration failed
140277513316288:error:25066067:DSO support routines:DLFCN_LOAD:could not load the shared library:dso_dlfcn.c:185:filename(libssl_conf.so): libssl_conf.so: cannot open shared object file: No such file or directory
140277513316288:error:25070067:DSO support routines:DSO_load:could not load the shared library:dso_lib.c:244:
140277513316288:error:0E07506E:configuration file routines:MODULE_LOAD_DSO:error loading dso:conf_mod.c:285:module=ssl_conf, path=ssl_conf
140277513316288:error:0E076071:configuration file routines:MODULE_RUN:unknown module name:conf_mod.c:222:module=ssl_conf
Fix:
export OPENSSL_CONF=/etc/ssl/
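The same workaround can also be applied to a single PhantomJS process instead of the whole shell session. The sketch below assumes phantomjs is on PATH; both helper names are hypothetical.

```python
import os
import subprocess


def env_with_openssl_fix(base_env=None):
    """Return a copy of the environment with OPENSSL_CONF set to /etc/ssl/."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENSSL_CONF"] = "/etc/ssl/"
    return env


def run_phantomjs(args):
    """Run phantomjs with the workaround applied to this one process only."""
    return subprocess.run(["phantomjs"] + list(args), env=env_with_openssl_fix())
```

For example, run_phantomjs(["--version"]) starts PhantomJS with the variable set and leaves the parent environment unchanged. Since child processes normally inherit the parent environment, setting os.environ["OPENSSL_CONF"] before Selenium launches PhantomJS should have a similar effect.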
aiohttp
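To see at a glance which of the request libraries from this section are installed, something like the following can be used. It is only a sketch: missing_modules is a hypothetical helper, and it checks whether a module can be found without importing it.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the request libraries listed in this section.
    print(missing_modules(["requests", "selenium", "aiohttp"]))
```

An empty list means all three can be found. The helper takes import names, which sometimes differ from the package name (Beautiful Soup, for instance, is imported as bs4).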
1.3 Installing parsing libraries
lxml
Beautiful Soup
pyquery
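tesserocr (install tesseract first)
(On Windows, tesserocr has compatibility problems, so use pytesseract instead and then set the environment variable for tesseract.)
tesseract language data download: https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata?utm_source=csdn_github_accelerator
Test whether the installation succeeded: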
On Windows:
tesseract image.png result -l eng && type result.txt
On Linux/Mac:
tesseract image.png result -l eng && cat result.txt
# Run OCR on image.png with pytesseract and print the recognized text
import pytesseract
from PIL import Image

im = Image.open('image.png')
print(pytesseract.image_to_string(im))
import pytesseract
from PIL import Image
image = Image