Python Scraping Tools

1. Tools Introduction

Scrapy: an application framework for web crawling and scraping

BeautifulSoup: a library for parsing HTML and XML

mechanize: a library for stateful, programmatic web browsing

lxml: a fast HTML/XML parsing library built on libxml2 and libxslt

Selenium / PhantomJS / CasperJS: for driving a (headless) browser and executing JavaScript.
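
As a quick illustration of the parsing tools above, here is a minimal sketch (the HTML snippet and the extracted values are made up for illustration, and BeautifulSoup 4 and lxml are assumed to be installed) showing how BeautifulSoup and lxml pull the same data out of a page:

    # a tiny, made-up HTML snippet used only for illustration
    html = '<html><body><p class="title">Hello</p><a href="/next">next</a></body></html>'

    # BeautifulSoup: navigate by tag names and attributes
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')               # use lxml as the underlying parser
    print(soup.find('p', class_='title').text)       # -> Hello
    print(soup.find('a')['href'])                    # -> /next

    # lxml: extract with XPath expressions
    from lxml import html as lxml_html
    tree = lxml_html.fromstring(html)
    print(tree.xpath('//p[@class="title"]/text()')[0])   # -> Hello
    print(tree.xpath('//a/@href')[0])                     # -> /next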


2. Install Scrapy on Windows 7 (32-bit)

(1) Install Python 2.7

note: installing this software requires administrator privileges

(2) Install Microsoft Visual C++ Compiler for Python 2.7 (http://www.microsoft.com/en-us/download/details.aspx?id=44266)

note one: without this package, you will get an error like: "Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat)"

note two: installing this package does not require administrator privileges.

(3) Install lxml: download the installer from https://pypi.python.org/pypi/lxml/3.5.0 and run it.

note: if lxml is not installed in advance, the Scrapy installation will complain that it cannot find libxml2
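
A quick way to confirm that lxml installed correctly is a one-line import check from the command line (it prints the lxml version tuple):

    python -c "import lxml.etree; print(lxml.etree.LXML_VERSION)"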

(4) From the Python 2.7 Scripts directory, run: pip.exe install Scrapy

(5) Install pywin32 (otherwise you will get the error: no module named win32api)

Per Reference [2], run from the command line: easy_install-2.7.exe e:\software\pywin32-219.win32-py2.7.exe

installer download: http://sourceforge.net/projects/pywin32/files/pywin32/
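
To verify the pywin32 installation, try importing win32api from the same Python 2.7 interpreter; if the import succeeds, the "no module named win32api" error is resolved:

    python -c "import win32api; print(win32api.__file__)"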

(6) Install Selenium: pip.exe install selenium
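
Once Selenium is installed, a minimal sketch like the following can be used to check that browser automation works (it assumes PhantomJS is installed and phantomjs.exe is on the PATH; the URL is just an example):

    # fetch a page with a headless browser via Selenium + PhantomJS
    from selenium import webdriver

    driver = webdriver.PhantomJS()         # assumes phantomjs.exe is on the PATH
    driver.get('http://example.com')       # example URL; replace with the target site
    print(driver.title)                    # page title after JavaScript has run
    print(driver.page_source[:200])        # first 200 characters of the rendered HTML
    driver.quit()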


3. Set up Eclipse PyDev for Scrapy

(1) Download Eclipse Luna (4.4)

(2) Install the PyDev plugin for Eclipse 4.4, then configure the Python interpreter for PyDev in Eclipse preferences.

(3) Configure run and debug for the Scrapy project inside Eclipse, following Reference [1]:

step 1: create a Scrapy project with the scrapy command (scrapy startproject <projectname>)

step 2: create a PyDev project in Eclipse

step 3: copy the Scrapy project files into the PyDev project folder

after this step you will see a 4-level folder hierarchy, because the Scrapy project itself already has 3 levels (see the layout sketch below)
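
For reference, the 3-level layout created by scrapy startproject looks roughly like this (assuming the project is called myproject; the exact set of files can vary slightly between Scrapy versions):

    myproject/              <- top-level project folder (level 1)
        scrapy.cfg          <- project configuration file
        myproject/          <- the project's Python package (level 2)
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/        <- folder holding your spiders (level 3)
                __init__.py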

step 4: in Eclipse, open Run -> Debug Configurations -> Main and set:

    Name: any configuration name

    Project: choose the Scrapy project

    Main Module: do not browse; enter the full path to cmdline.py directly (in my case: D:\Python\Python27\Lib\site-packages\scrapy\cmdline.py)

step 5: in Run -> Debug Configurations -> Arguments, set:

    Program arguments: crawl spidername
    Working directory -> Other: choose the Scrapy project working directory (the folder containing scrapy.cfg)
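
An alternative to pointing the debug configuration at cmdline.py is to add a small runner script next to scrapy.cfg and run/debug that file directly from PyDev. A minimal sketch (runner.py and spidername are placeholders; make sure the working directory is the folder containing scrapy.cfg):

    # runner.py -- place next to scrapy.cfg, then run/debug it from PyDev
    from scrapy.cmdline import execute

    # equivalent to running "scrapy crawl spidername" on the command line;
    # replace spidername with your spider's name attribute
    execute(['scrapy', 'crawl', 'spidername'])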


References

[1] https://www.zhihu.com/question/28565716/answer/53736780

[2] http://stackoverflow.com/questions/26689371/scrapy-no-module-named-win32api-windows
