The Right Way to Use Scrapy: Starting from Installation (Part 1)

Scrapy's official introduction

An open source and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

In other words, it's a well-known and very powerful crawler framework: it helps us extract the data we need from large numbers of web pages, works quickly, and is easy to extend.

It's written in Python, so once our code is finished it can run on Linux, Windows, macOS, and BSD.

Official stats

24k stars, 6k forks and 1.6k watchers on GitHub

4.0k followers on Twitter

8.7k questions on StackOverflow

Requirements

Python 2.7 or Python 3.4+

Works on Linux, Windows, Mac OSX, BSD

Quick installation

pip install scrapy
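If pip finishes without errors, the scrapy command-line tool will be on your PATH (installing inside a virtualenv is generally recommended). A quick sanity check is to ask it for its version:

    scrapy version

This should print something like Scrapy 1.4.0, which is the version used in the sessions below.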

Creating a project

root@ubuntu:/# scrapy startproject -h
Usage
=====
  scrapy startproject <project_name> [project_dir]

Create new project

Options
=======
--help, -h              show this help message and exit

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

root@ubuntu:/# scrapy startproject helloDemo
New Scrapy project 'helloDemo', using template directory '/usr/local/lib/python3.5/dist-packages/scrapy/templates/project', created in:
    /helloDemo

You can start your first spider with:
    cd helloDemo
    scrapy genspider example example.com
root@ubuntu:/# cd helloDemo/
root@ubuntu:/helloDemo# ls
helloDemo  scrapy.cfg
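Before scrapy crawl has anything to run, the project needs a spider. I haven't shown the step here, but the baidu spider used below would be created with the genspider command suggested above; for example (the domain matches the generated file shown at the end of this article):

    root@ubuntu:/helloDemo# scrapy genspider baidu wwww.baidu.com

This drops a skeleton BaiduSpider class into helloDemo/spiders/baidu.py. Note that scrapy crawl expects exactly one spider name, so the first attempt below, which mistakenly passes two arguments, fails with a usage error.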

root@ubuntu:/helloDemo# scrapy crawl spider baidu
2018-08-24 17:33:00 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: helloDemo)
2018-08-24 17:33:00 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['helloDemo.spiders'], 'NEWSPIDER_MODULE': 'helloDemo.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'helloDemo'}
Usage
=====
  scrapy crawl [options] <spider>

crawl: error: running 'scrapy crawl' with more than one spider is no longer supported

root@ubuntu:/helloDemo# scrapy crawl baidu
2018-08-24 17:33:06 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: helloDemo)
2018-08-24 17:33:06 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'helloDemo', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'helloDemo.spiders', 'SPIDER_MODULES': ['helloDemo.spiders']}
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-24 17:33:07 [scrapy.core.engine] INFO: Spider opened
2018-08-24 17:33:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-24 17:33:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-24 17:33:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET ...> from <GET ...>
2018-08-24 17:33:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET ...> (referer: None)
2018-08-24 17:33:07 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET ...>
2018-08-24 17:33:08 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-24 17:33:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
 'downloader/request_bytes': 443,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 1125,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 24, 9, 33, 8, 11376),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'memusage/max': 52117504,
 'memusage/startup': 52117504,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 8, 24, 9, 33, 7, 430751)}
2018-08-24 17:33:08 [scrapy.core.engine] INFO: Spider closed (finished)
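Note the "Forbidden by robots.txt" line: new projects are generated with ROBOTSTXT_OBEY = True (you can see it in the "Overridden settings" line above), so Scrapy fetched Baidu's robots.txt first and then dropped our own request because the rules disallow it; that is the scrapy.exceptions.IgnoreRequest counted in the stats. While experimenting you can switch this off in helloDemo/settings.py, though obeying robots.txt on real sites is good manners:

    # helloDemo/settings.py
    # The project template defaults this to True; False tells Scrapy to
    # ignore robots.txt rules entirely.
    ROBOTSTXT_OBEY = False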

Project directory structure

root@ubuntu:/helloDemo# tree
.
├── helloDemo
│   ├── __init__.py
│   ├── items.py          # item definitions: the data structures we scrape into
│   ├── middlewares.py    # spider and downloader middlewares
│   ├── pipelines.py      # item pipelines: where scraped data gets stored
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── settings.py       # global project settings
│   └── spiders           # the spiders themselves live here
│       ├── baidu.py      # the baidu spider created above
│       ├── __init__.py
│       └── __pycache__
│           └── __init__.cpython-35.pyc
└── scrapy.cfg
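items.py, for instance, is where the fields you plan to scrape get declared. A minimal sketch of what could go in the generated file (the title field is just an illustration):

    import scrapy

    class HellodemoItem(scrapy.Item):
        # declare one scrapy.Field() per piece of data you want to collect
        title = scrapy.Field()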

spiders/baidu.py is where we process the scraped data: the response object handed to parse() contains the complete HTML document returned for each crawled page. The generated skeleton looks like this (a minimal parse() sketch follows it):

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['wwww.baidu.com']
    start_urls = ['http://wwww.baidu.com/']

    def parse(self, response):
        pass
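parse() is where the extraction logic will live. As a minimal sketch (not part of the generated skeleton), this is how the page title could be pulled out of the response with a CSS selector and yielded as a scraped item:

    def parse(self, response):
        # Run a CSS selector over the downloaded HTML; extract_first()
        # returns the first match or None (newer Scrapy versions also
        # offer the equivalent .get()).
        title = response.css('title::text').extract_first()
        yield {'title': title}

With that in place, scrapy crawl baidu -o titles.json writes the yielded items to a JSON file.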

In upcoming articles I'll keep digging into how to use Scrapy.
