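These notes cover the tools you need to install for web scraping, with only brief remarks on each. Most can be installed with pip; the highlighted ones need to be installed separately.
- Python 3: installing Anaconda is recommended, since it sets up Python 3 and Anaconda together and saves a lot of trouble later.
- Request libraries: requests, selenium, chromedriver, geckodriver, phantomjs, aiohttp
- Parsing libraries: lxml, beautifulsoup4, pyquery, tesserocr
- Databases: mysql, mongodb, redis
- Storage libraries: pymysql, pymongo, redis-py, redisdump
- Web libraries: flask, tornado
- App-scraping tools: Charles, mitmproxy, appium
- Scraping frameworks: pyspider, scrapy, scrapy-splash, scrapy-redis
- Deployment tools: docker, scrapyd, scrapyd-client, scrapyd-api, scrapyrt, gerapy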
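To confirm which of the pip-installable packages above are actually present, a minimal sketch like the following may help. The names in `PACKAGES` are assumed PyPI distribution names (for example `redis` for redis-py), and `importlib.metadata` requires Python 3.8 or later; adjust both to your setup.

```python
from importlib import metadata

# Assumed PyPI distribution names for some of the libraries listed above.
PACKAGES = [
    "requests", "selenium", "aiohttp", "lxml", "beautifulsoup4", "pyquery",
    "pymysql", "pymongo", "redis", "flask", "tornado", "scrapy",
]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

def report(names=PACKAGES):
    """Print one line per package with its version or a 'missing' marker."""
    for name, version in installed_versions(names).items():
        print(f"{name:15} {version or 'NOT INSTALLED'}")
```

Calling `report()` prints one line per package; anything marked `NOT INSTALLED` still needs a `pip install`.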
chromedriver/geckodriver:
Download:
From within mainland China, chromedriver has to be downloaded from this mirror:
http://npm.taobao.org/mirrors/chromedriver/
Firefox
https://github.com/mozilla/geckodriver/releases
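Download the version that matches your browser and put it in Python's Scripts folder.
Verify the installation:
from selenium import webdriver
browser = webdriver.Chrome()   # uses chromedriver
browser = webdriver.Firefox()  # uses geckodriver
If a blank browser window opens, the installation succeeded.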
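If the browser does not open, one common cause is that the driver executable is not on PATH. The following is a small sketch for checking that, using only the standard library; `find_drivers` is a hypothetical helper name, not part of selenium.

```python
import shutil

def find_drivers(names=("chromedriver", "geckodriver")):
    """Return {driver name: full path, or None if not found on PATH}."""
    return {name: shutil.which(name) for name in names}

for name, path in find_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

On Windows, `shutil.which` also tries the usual executable extensions, so `chromedriver.exe` in the Scripts folder is found as `chromedriver`.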
tesserocr:
tesseract must be installed first:
http://digi.bib.uni-mannheim.de/tesseract
Choose a build without "dev" in its name.
Then run pip install tesserocr pillow
MySQL:
https://www.mysql.com/cn/downloads
Then run pip install pymysql
MongoDB:
https://www.mongodb.com
The author also recommends downloading the GUI tool Robo 3T: https://robomongo.org/download
Then run pip install pymongo
Redis:
https://www.redis.cn
The author also recommends downloading the GUI tool Redis Desktop Manager:
https://github.com/uglide/redisdesktopmanager/releases
Then run pip install redis
To import and export Redis data, redis-dump is also needed.
First install Ruby: http://www.ruby-lang.org
Then run gem install redis-dump
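Once MySQL, MongoDB, and Redis are installed, a quick way to see whether the servers are actually running is to try a TCP connection to each one. This is only a sketch: it assumes the servers run on the local machine with their default ports (3306, 27017, 6379), and it checks reachability only, not credentials or data.

```python
import socket

# Default ports; assumes local servers with unchanged configuration.
DEFAULT_PORTS = {"mysql": 3306, "mongodb": 27017, "redis": 6379}

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def check_servers(host="127.0.0.1", ports=DEFAULT_PORTS):
    """Return {server name: True/False} for each configured port."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

For example, `check_servers()` returns a dict such as `{"mysql": ..., "mongodb": ..., "redis": ...}` with a boolean per server; a `False` usually means the service has not been started.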
Charles:
https://www.charlesproxy.com/download
appium:
https://github.com/appium/appium-desktop/releases
pyspider:
pycurl must be installed first. Find the build that matches your setup at the URL below; for 64-bit Windows with Python 3.7, download
pycurl-7.43.1-cp37-cp37m-win_amd64.whl
https://www.lfd.uci.edu/~gohlke/pythonlibs/#pycurl
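Picking the right wheel comes down to reading the tags in its filename: name, version, Python tag, ABI tag, platform tag. The sketch below splits a filename into those fields and compares the Python tag with the running interpreter. The helper names are hypothetical, the interpreter is assumed to be CPython, and compressed tags such as `py2.py3` are not handled.

```python
import sys

def parse_wheel_name(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel filename: " + filename)
    # The last three fields are always python tag, ABI tag, platform tag.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }

def current_python_tag():
    """CPython tag for the running interpreter, e.g. 'cp37' for Python 3.7."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"

def matches_interpreter(filename):
    """True if the wheel's Python tag equals the running interpreter's tag."""
    return parse_wheel_name(filename)["python"] == current_python_tag()

info = parse_wheel_name("pycurl-7.43.1-cp37-cp37m-win_amd64.whl")
print(info["python"], info["platform"])
```

Here `cp37` means CPython 3.7 and `win_amd64` means 64-bit Windows; if `matches_interpreter` returns `False`, look for the file whose Python tag matches `current_python_tag()`.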
Scrapy:
First pip install lxml, pyopenssl, twisted, and pywin32; then pip install scrapy.
scrapy-splash:
Splash must be installed first, via Docker; then run pip install scrapy-splash