How to start Scrapy from PHP: using the Scrapy crawler the right way, starting from installation (Part 1)


屁乎小铭


Scrapy: the official introduction

An open source and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

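As is well known, Scrapy is a powerful crawling framework: it helps us extract the data we need from large numbers of web pages, and it is fast and easy to extend.

It is written in Python, so once the code is finished it can run on Linux, Windows, Mac, and BSD.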

Official figures

24k stars, 6k forks and 1.6k watchers on GitHub

4.0k followers on Twitter

8.7k questions on StackOverflow

Requirements

Python 2.7 or Python 3.4+

Works on Linux, Windows, Mac OSX, BSD

Quick install

pip install scrapy
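Once pip finishes, one quick way to confirm the install from Python is to ask the packaging metadata for the installed version. This is a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata), which is newer than the versions listed above; the helper name installed_version is made up for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether Scrapy is available in the current environment.
version = installed_version("scrapy")
if version is None:
    print("Scrapy is not installed; run: pip install scrapy")
else:
    print("Scrapy", version, "is installed")
```

Running scrapy version on the command line gives the same information.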

Creating a project

root@ubuntu:/# scrapy startproject -h
Usage
=====
  scrapy startproject <project_name> [project_dir]

Create new project

Options
=======
--help, -h              show this help message and exit

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

root@ubuntu:/# scrapy startproject helloDemo
New Scrapy project 'helloDemo', using template directory '/usr/local/lib/python3.5/dist-packages/scrapy/templates/project', created in:
    /helloDemo

You can start your first spider with:
    cd helloDemo
    scrapy genspider example example.com
root@ubuntu:/# cd helloDemo/
root@ubuntu:/helloDemo# ls
helloDemo  scrapy.cfg

Before crawling, a spider named baidu has to exist under helloDemo/spiders. The transcript does not show that step; it is normally done with scrapy genspider, as the message above suggests.

root@ubuntu:/helloDemo# scrapy crawl spider baidu
2018-08-24 17:33:00 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: helloDemo)
2018-08-24 17:33:00 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['helloDemo.spiders'], 'NEWSPIDER_MODULE': 'helloDemo.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'helloDemo'}
Usage
=====
  scrapy crawl [options] <spider>

crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
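
As the error shows, scrapy crawl accepts exactly one spider name. The title mentions starting Scrapy from PHP, although this part never gets that far; from any host language it comes down to spawning the same command line inside the project directory. Below is a minimal sketch of that idea in Python, assuming the scrapy executable is on PATH. The helper names build_crawl_command and run_crawl are hypothetical; only the crawl subcommand and the -s NAME=VALUE option come from the help output above.

```python
import subprocess


def build_crawl_command(spider_name, settings=None):
    """Build the argument list for `scrapy crawl <spider>`.

    `settings` is an optional dict turned into repeated `-s NAME=VALUE` options.
    """
    # Reject empty names and names containing whitespace: the CLI would read
    # "spider baidu" as two spiders and fail, as in the transcript above.
    if not spider_name or len(spider_name.split()) != 1:
        raise ValueError("expected exactly one spider name, got %r" % (spider_name,))
    command = ["scrapy", "crawl", spider_name]
    for key, value in (settings or {}).items():
        command += ["-s", "{}={}".format(key, value)]
    return command


def run_crawl(spider_name, project_dir, settings=None):
    """Run the crawl in `project_dir` (the folder holding scrapy.cfg).

    Returns the exit code of the scrapy process.
    """
    result = subprocess.run(build_crawl_command(spider_name, settings), cwd=project_dir)
    return result.returncode


# Example (not executed here): run_crawl("baidu", "/helloDemo", {"LOG_LEVEL": "INFO"})
```

Passing the command as a list rather than a single shell string avoids quoting problems when the spider name or settings come from user input.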

root@ubuntu:/helloDemo# scrapy crawl baidu
2018-08-24 17:33:06 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: helloDemo)
2018-08-24 17:33:06 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'helloDemo', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'helloDemo.spiders', 'SPIDER_MODULES': ['helloDemo.spiders']}
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-24 17:33:07 [scrapy.core.engine] INFO: Spider opened
2018-08-24 17:33:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-24 17:33:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-24 17:33:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to from
2018-08-24 17:33:07 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2018-08-24 17:33:07 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt:
2018-08-24 17:33:08 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-24 17:33:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
 'downloader/request_bytes': 443,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 1125,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 24, 9, 33, 8, 11376),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'memusage/max': 52117504,
 'memusage/startup': 52117504,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 8, 24, 9, 33, 7, 430751)}
2018-08-24 17:33:08 [scrapy.core.engine] INFO: Spider closed (finished)
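
The run above finishes without scraping anything: the log shows ROBOTSTXT_OBEY set to True, and the request is dropped with "Forbidden by robots.txt". The sketch below uses Python's standard urllib.robotparser to show how such a decision can be made. The rules are hypothetical, written for illustration only (they are not the site's actual robots.txt), the helper name is_allowed is made up, and Scrapy's own middleware may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; NOT copied from any real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The bot name from the log falls under the catch-all "*" group.
print(is_allowed(EXAMPLE_ROBOTS_TXT, "helloDemo", "http://www.baidu.com/"))
# A crawler named explicitly in the rules gets its own group.
print(is_allowed(EXAMPLE_ROBOTS_TXT, "Baiduspider", "http://www.baidu.com/"))
```

Under these made-up rules the first request is refused and the second is allowed. In the generated project, ROBOTSTXT_OBEY lives in settings.py.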

Project directory structure

root@ubuntu:/helloDemo# tree
├── helloDemo
│   ├── __init__.py
│   ├── items.py          # items: the data structures being scraped
│   ├── middlewares.py    # the crawler's middlewares
│   ├── pipelines.py      # pipelines: storing the data
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── settings.py       # project-wide settings
│   └── spiders           # the spiders
│       ├── baidu.py      # the baidu spider created above
│       ├── __init__.py
│       └── __pycache__
│           └── __init__.cpython-35.pyc
└── scrapy.cfg
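
To make the role of pipelines.py more concrete: a pipeline is a plain class whose process_item method receives every item the spider yields. Here is a minimal sketch, assuming items are dicts or dict-like; the class name JsonLinesPipeline and the default file name items.jl are hypothetical and not part of the generated project.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write every scraped item to a JSON Lines file."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; returning the item passes it on to any
        # pipeline that runs after this one.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline only runs once it is listed in the ITEM_PIPELINES setting in settings.py, for example {'helloDemo.pipelines.JsonLinesPipeline': 300}; the crawl log above shows an empty list because none is enabled yet.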

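spiders/baidu.py is where we process the data. The response argument passed to parse holds the page returned by the request, that is, the complete HTML document.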

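# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl baidu
    name = 'baidu'
    # Requests to hosts outside these domains are filtered out
    allowed_domains = ['www.baidu.com']
    # The crawl starts from these URLs
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # Called with the downloaded response of each start URL
        pass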

Later articles in this series will continue with how to use Scrapy.
