Scrapy Environment Setup and First Run

Background: I have recently been working on a PHP server project. Content is the single most important thing for a product, so acquiring content matters a great deal, and I wanted to gather it with a crawler. Scrapy came to mind as the tool, but, painfully, the environment never cooperates for us developers, and I wasted a lot of time on it. I am writing the process down here for future reference.


Getting started: the official site makes installing Scrapy sound very simple.

Official guide: https://doc.scrapy.org/en/latest/intro/install.html

1. Make sure you have Python installed.

2. Run pip install scrapy

And that's it. The official docs list more prerequisites, but when I tried it, most of them were unnecessary; they get downloaded and installed automatically.
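For reference, the entire happy path is just the following (a minimal sketch; the version check at the end is simply a quick way to confirm the install worked):

$ python --version      # Scrapy 1.2 runs on Python 2.7
$ pip install scrapy
$ scrapy version        # should print something like "Scrapy 1.2.1"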


But I ran into quite a few problems along the way.

The problem phase

1. scrapy kept failing with errors about various symbols missing from etree.so.

Most answers online blamed a mismatch with the Python version that ships with macOS. I was stuck here for a long time, and using virtualenv did not solve my problem either.
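For completeness, the usual virtualenv recipe I tried looks like this (a sketch; the environment directory name venv is arbitrary):

$ pip install virtualenv
$ virtualenv venv                 # create an isolated Python environment in ./venv
$ source venv/bin/activate        # pip/python now resolve inside ./venv
$ pip install scrapy

In theory this sidesteps the system Python entirely, but in my case the same etree.so errors came back.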

I also tried the various environment-variable tweaks suggested online, but none of them helped.

The exploration phase

In desperation, I completely wiped the Python that ships with macOS (it was 2.7.10; in hindsight, that version should have been fine).

I installed 2.7.12, but the problem persisted.

Then I went into the lxml package directory and inspected the compiled extension module:

otool -L lxml/etree.so

The dynamic libraries it links against all existed, so this was not the mis-linked dylib problem described online either.
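(If you are not sure where lxml is installed, Python itself can tell you, and you can then point otool at the absolute path; the path below is from my machine and is only illustrative:)

$ python -c "import lxml; print(lxml.__file__)"
$ otool -L /usr/local/lib/python2.7/site-packages/lxml/etree.so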

So I steeled myself again and deleted lxml entirely; reinstalling it still did not fix anything.

But while things were deleted, running scrapy produced the error below. Useful answers for the missing-symbol etree.so errors are scarce, but this ImportError turns out to be all over the web:

wudideMacBook-Pro:~ xiepengchong$ scrapy
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 7, in <module>
    from scrapy.cmdline import execute
  File "/usr/local/lib/python2.7/site-packages/scrapy/__init__.py", line 34, in <module>
    from scrapy.spiders import Spider
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 10, in <module>
    from scrapy.http import Request
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/__init__.py", line 11, in <module>
    from scrapy.http.request.form import FormRequest
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/form.py", line 9, in <module>
    import lxml.html
ImportError: No module named lxml.html

One suggested fix was to install Scrapy with

easy_install Scrapy

so I gave it a try:

wudideMacBook-Pro:ucarandroid xiepengchong$ easy_install Scrapy
Searching for Scrapy
Best match: Scrapy 1.2.1
Adding Scrapy 1.2.1 to easy-install.pth file
Installing scrapy script to /usr/local/bin
Using /usr/local/lib/python2.7/site-packages
Processing dependencies for Scrapy
Searching for lxml
Reading https://pypi.python.org/simple/lxml/
Best match: lxml 3.6.4
Downloading https://pypi.python.org/packages/4f/3f/cf6daac551fc36cddafa1a71ed48ea5fd642e5feabd3a0d83b8c3dfd0cb4/lxml-3.6.4.tar.gz#md5=6dd7314233029d9dab0156e7b1c7830b
Processing lxml-3.6.4.tar.gz
Writing /var/folders/b7/3fpgyn013wzfrsdmrcy0gp640000gn/T/easy_install-kjY8jp/lxml-3.6.4/setup.cfg
Running lxml-3.6.4/setup.py -q bdist_egg --dist-dir /var/folders/b7/3fpgyn013wzfrsdmrcy0gp640000gn/T/easy_install-kjY8jp/lxml-3.6.4/egg-dist-tmp-0VlZ05
Building lxml version 3.6.4.
Building without Cython.
Using build configuration of libxslt 1.1.28
creating /usr/local/lib/python2.7/site-packages/lxml-3.6.4-py2.7-macosx-10.10-x86_64.egg
Extracting lxml-3.6.4-py2.7-macosx-10.10-x86_64.egg to /usr/local/lib/python2.7/site-packages
Adding lxml 3.6.4 to easy-install.pth file
Installed /usr/local/lib/python2.7/site-packages/lxml-3.6.4-py2.7-macosx-10.10-x86_64.egg


And it worked! So the culprit was presumably the previously installed lxml after all, even though that was also version 3.6.4; note in the log above that easy_install re-downloaded the source tarball and rebuilt lxml from scratch rather than reusing the existing copy. (If anyone knows the exact root cause, please enlighten me in the comments.)
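(For what it's worth, the closest pip equivalent of what easy_install did here, forcing lxml to be re-downloaded and rebuilt instead of reusing a cached copy, would be something like:)

$ pip install --force-reinstall --no-cache-dir lxml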

wudideMacBook-Pro:opensource xiepengchong$ scrapy
Scrapy 1.2.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
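One command from that list worth flagging early is scrapy shell: it opens an interactive console against a live page, which makes experimenting with selectors easy later on. An illustrative session:

$ scrapy shell "http://example.com"
>>> response.css('title::text').extract_first()
u'Example Domain'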


Success at last, which was quietly thrilling. The next step was to create and run a first project.

Official tutorial: http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

wudideMacBook-Pro:myScrapy xiepengchong$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/xiepengchong/opensource/scrapy/myScrapy/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
wudideMacBook-Pro:myScrapy xiepengchong$

From here we hardly need the tutorial; the output already tells us exactly what to do next.



wudideMacBook-Pro:myScrapy xiepengchong$ cd tutorial/
wudideMacBook-Pro:tutorial xiepengchong$ scrapy genspider example example.com
Created spider 'example' using template 'basic' in module:
  tutorial.spiders.example
wudideMacBook-Pro:tutorial xiepengchong$
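For reference, the 'basic' template generates a spider along these lines (this is tutorial/spiders/example.py as I understand the Scrapy 1.2 template, so minor details may differ):

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"                      # the name used by "scrapy crawl example"
    allowed_domains = ["example.com"]     # requests outside this domain are filtered
    start_urls = ['http://example.com/']  # the crawl starts from these URLs

    def parse(self, response):
        # called with the downloaded response for each start URL; empty for now
        pass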


The spider was created successfully, so let's run it:



wudideMacBook-Pro:tutorial xiepengchong$ scrapy crawl example
2016-11-09 20:47:21 [scrapy] INFO: Scrapy 1.2.1 started (bot: tutorial)
2016-11-09 20:47:21 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-11-09 20:47:21 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-09 20:47:21 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-11-09 20:47:21 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-11-09 20:47:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-11-09 20:47:21 [scrapy] INFO: Spider opened
2016-11-09 20:47:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 20:47:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 20:47:22 [scrapy] DEBUG: Crawled (404) <GET http://example.com/robots.txt> (referer: None)
2016-11-09 20:47:23 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-11-09 20:47:23 [scrapy] INFO: Closing spider (finished)
2016-11-09 20:47:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 428,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 1899,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 11, 9, 12, 47, 23, 133884),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 11, 9, 12, 47, 21, 424947)}
2016-11-09 20:47:23 [scrapy] INFO: Spider closed (finished)
wudideMacBook-Pro:tutorial xiepengchong$
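Two details in that log are worth noting. First, the spider requested /robots.txt before the page (a 404 on example.com) because the generated settings set ROBOTSTXT_OBEY to True. Second, it scraped 0 items because the generated parse() does nothing yet. As a minimal next step (a sketch of mine, not part of the generated code), replacing parse() like this would yield one item per page:

    def parse(self, response):
        # pull the page <title> out with a CSS selector and emit it as an item
        yield {'title': response.css('title::text').extract_first()}

Re-running with scrapy crawl example -o titles.json (titles.json being just an illustrative filename) would then write the scraped items to a JSON file.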

I still don't fully understand how this crawler works under the hood, but the first step is finally done.
The first step is always the hardest; the rest will come with steady practice and learning.


