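Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It crawls websites and extracts structured data from their pages, and it is used for data mining, monitoring, and automated testing, among other things. Official website: http://www.scrapy.org/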
1. Install the following packages
sudo apt-get install build-essential
sudo apt-get install python-dev
sudo apt-get install libxml2-dev
sudo apt-get install libxslt1-dev
sudo apt-get install python-setuptools
2. Install Scrapy
sudo easy_install Scrapy
wang@ubuntu:/usr/local/lib/python2.7/dist-packages$ sudo easy_install Scrapy
Searching for Scrapy
Best match: Scrapy 0.16.1
Processing Scrapy-0.16.1-py2.7.egg
Scrapy 0.16.1 is already the active version in easy-install.pth
Installing scrapy script to /usr/local/bin
Using /usr/local/lib/python2.7/dist-packages/Scrapy-0.16.1-py2.7.egg
Processing dependencies for Scrapy
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 3.0.1
Downloading http://pypi.python.org/packages/source/l/lxml/lxml-3.0.1.tar.gz#md5=0f2b1a063ab3b6b0944cbc4a9a85dcfa
Processing lxml-3.0.1.tar.gz
Running lxml-3.0.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-qibAzL/lxml-3.0.1/egg-dist-tmp-mSvUVN
Building lxml version 3.0.1.
Building without Cython.
Using build configuration of libxslt 1.1.26
Building against libxml2/libxslt in the following directory: /usr/lib/x86_64-linux-gnu
warning: no files found matching '*.txt' under directory 'src/lxml/tests'
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree__getFilenameForFile’:
src/lxml/lxml.etree.c:26310:7: warning: variable ‘__pyx_clineno’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c:26309:15: warning: variable ‘__pyx_filename’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c:26308:7: warning: variable ‘__pyx_lineno’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c: In function ‘__pyx_pf_4lxml_5etree_4XSLT_18__call__’:
src/lxml/lxml.etree.c:132608:81: warning: passing argument 1 of ‘__pyx_f_4lxml_5etree_12_XSLTContext__copy’ from incompatible pointer type [enabled by default]
src/lxml/lxml.etree.c:130569:52: note: expected ‘struct __pyx_obj_4lxml_5etree__XSLTContext *’ but argument is of type ‘struct __pyx_obj_4lxml_5etree__BaseContext *’
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree__copyXSLT’:
src/lxml/lxml.etree.c:133997:79: warning: passing argument 1 of ‘__pyx_f_4lxml_5etree_12_XSLTContext__copy’ from incompatible pointer type [enabled by default]
src/lxml/lxml.etree.c:130569:52: note: expected ‘struct __pyx_obj_4lxml_5etree__XSLTContext *’ but argument is of type ‘struct __pyx_obj_4lxml_5etree__BaseContext *’
src/lxml/lxml.etree.c: At top level:
src/lxml/lxml.etree.c:12128:13: warning: ‘__pyx_f_4lxml_5etree_displayNode’ defined but not used [-Wunused-function]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFile’:
src/lxml/lxml.etree.c:86715:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDoc’:
src/lxml/lxml.etree.c:86403:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseUnicodeDoc’:
src/lxml/lxml.etree.c:86093:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFilelike’:
src/lxml/lxml.etree.c:86925:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
Adding lxml 3.0.1 to easy-install.pth file
Installed /usr/local/lib/python2.7/dist-packages/lxml-3.0.1-py2.7-linux-x86_64.egg
Searching for w3lib>=1.2
Reading http://pypi.python.org/simple/w3lib/
Reading http://github.com/scrapy/w3lib
Best match: w3lib 1.2
Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=f929d5973a9fda59587b09a72f185a9e
Processing w3lib-1.2.tar.gz
Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-ZAXTgy/w3lib-1.2/egg-dist-tmp-aU3vpc
zip_safe flag not set; analyzing archive contents...
Adding w3lib 1.2 to easy-install.pth file
Installed /usr/local/lib/python2.7/dist-packages/w3lib-1.2-py2.7.egg
Searching for Twisted>=8.0
Reading http://pypi.python.org/simple/Twisted/
Reading http://www.twistedmatrix.com
Reading http://twistedmatrix.com/products/download
Reading http://twistedmatrix.com/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/9.0/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/10.0/
Reading http://twistedmatrix.com/projects/core/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/8.2/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/8.1/
Best match: Twisted 12.2.0
Downloading http://pypi.python.org/packages/source/T/Twisted/Twisted-12.2.0.tar.bz2#md5=9a321b904d01efd695079f8484b37861
Processing Twisted-12.2.0.tar.bz2
Running Twisted-12.2.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kw897y/Twisted-12.2.0/egg-dist-tmp-sZWFYb
In file included from /usr/include/python2.7/Python.h:8:0, from twisted/internet/_sigchld.c:9:
/usr/include/python2.7/pyconfig.h:1161:0: warning: "_POSIX_C_SOURCE" redefined [enabled by default]
/usr/include/features.h:215:0: note: this is the location of the previous definition
twisted/internet/_sigchld.c: In function ‘got_signal’:
twisted/internet/_sigchld.c:15:13: warning: variable ‘ignored_result’ set but not used [-Wunused-but-set-variable]
Adding Twisted 12.2.0 to easy-install.pth file
Installing mailmail script to /usr/local/bin
Installing conch script to /usr/local/bin
Installing pyhtmlizer script to /usr/local/bin
Installing twistd script to /usr/local/bin
Installing lore script to /usr/local/bin
Installing tkconch script to /usr/local/bin
Installing tapconvert script to /usr/local/bin
Installing ckeygen script to /usr/local/bin
Installing tap2rpm script to /usr/local/bin
Installing manhole script to /usr/local/bin
Installing trial script to /usr/local/bin
Installing cftp script to /usr/local/bin
Installing tap2deb script to /usr/local/bin
Installed /usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg
Finished processing dependencies for Scrapy
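The final line, "Finished processing dependencies for Scrapy", indicates that the installation succeeded. The compiler warnings emitted while building lxml and Twisted did not stop the install.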
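Besides reading the log, one way to confirm the result is to check that Scrapy and the dependencies named in the output (lxml, w3lib, Twisted) can be found by the interpreter. The following is only a minimal sketch, not part of the original procedure: it assumes a Python 3 interpreter (the log above is from Python 2.7, where `importlib.util` is not available), and the `PACKAGES` list and `check_installed` helper are names made up for this illustration.

```python
import importlib.util

# Import names of Scrapy and the dependencies that appear in the install log.
# (Illustrative list; adjust it to whatever your environment should contain.)
PACKAGES = ["scrapy", "lxml", "w3lib", "twisted"]


def check_installed(names):
    """Return a dict mapping each top-level import name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one status line per package.
    for name, found in check_installed(PACKAGES).items():
        print("%-8s %s" % (name, "OK" if found else "MISSING"))
```

A package reported as MISSING would suggest that the corresponding install step did not complete for the interpreter you ran the check with.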
3. Test
scrapy shell http://ziki.cn
Inside the shell, extract all a tags:
hxs.select('//a').extract()
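To make it clearer what the `//a` XPath expression selects, here is a small stand-alone sketch that does not use Scrapy at all. It swaps in the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and only parses well-formed markup, so it illustrates the selection idea rather than replacing the Scrapy shell. The `SAMPLE` page and the `extract_links` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; ElementTree requires well-formed XML,
# unlike Scrapy, which also copes with real-world HTML.
SAMPLE = """<html><body>
<a href="/one">One</a>
<p>Text <a href="/two">Two</a></p>
</body></html>"""


def extract_links(markup):
    """Return (href, text) pairs for every a element, in document order."""
    root = ET.fromstring(markup)
    # './/a' is how ElementTree spells "all a descendants", i.e. XPath '//a'.
    return [(a.get("href"), a.text) for a in root.findall(".//a")]


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, text)
```

The expression matches a elements at any depth, which is why the link nested inside the p element is returned along with the top-level one.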
References
http://doc.scrapy.org/en/latest/intro/install.html
http://doc.scrapy.org/en/latest/intro/tutorial.html
Original article. When reposting, please credit the source: reposted from 海波无痕