Using Scrapy


qq3331053

Tags: mysql python database

Article link: https://blog.csdn.net/qq3331053/article/details/124560653


**Using Scrapy and Scrapyd**
- 1. Install Scrapy

`python3 -m pip install scrapy` or `pip3 install scrapy`
- 2. Create a project and a spider

`scrapy startproject mypro`

`cd mypro`

`scrapy genspider baidu baidu.com`

- 3. Install Scrapyd

`python3 -m pip install scrapyd` or `pip3 install scrapyd`

`python3 -m pip install scrapyd-client` or `pip3 install scrapyd-client`

- 4. Start and use Scrapyd: start the service, deploy the project, then schedule the `baidu` spider

`scrapyd`

`scrapyd-deploy default -p mypro`

`curl http://localhost:6800/schedule.json -d project=mypro -d spider=baidu`

- 5. Write scraped data to MySQL

Add the following configuration to the project's `settings.py`:

```
MYSQL_HOST = "127.0.0.1"
MYSQL_PORT = 3306
MYSQL_USER = "root"
MYSQL_PASSWORD = "123456"
MYSQL_DB = "spider"
MYSQL_CHARSET = "utf8"

ITEM_PIPELINES = {
    "mypro.pipelines.MysqlPipeline": 300,
}
```

The pipeline file (`pipelines.py`):

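```
import pymysql
from scrapy.utils.project import get_project_settings


class MysqlPipeline:
    def __init__(self):
        # Read the MySQL connection parameters defined in settings.py.
        db_conf = get_project_settings()
        self.host = db_conf["MYSQL_HOST"]
        self.port = db_conf["MYSQL_PORT"]
        self.user = db_conf["MYSQL_USER"]
        self.password = db_conf["MYSQL_PASSWORD"]
        self.db = db_conf["MYSQL_DB"]
        self.charset = db_conf["MYSQL_CHARSET"]
        self.connect()

    def connect(self):
        # Open one connection and cursor for the lifetime of the pipeline.
        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            password=self.password,
            database=self.db,
            charset=self.charset,
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # "dbname" and "xx" are placeholders kept from the original post:
        # substitute the real table name, column names, and item fields.
        # Values are passed as parameters so that pymysql escapes them.
        sql = "INSERT INTO dbname (xx, xx, xx) VALUES (%s, %s, %s)"
        self.cursor.execute(sql, (item["xx"], item["xx"], item["xx"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Scrapy calls this hook when the spider finishes, which is more
        # predictable than relying on __del__.
        self.cursor.close()
        self.conn.close()
```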
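Passing values as query parameters, rather than formatting them into the SQL string, avoids quoting bugs and SQL injection when scraped text contains quotes. The following is a minimal sketch of building such a statement from an item's fields. The helper `build_insert`, the table name `articles`, and the field names are hypothetical and not part of the original post; table and column names are assumed to come from trusted code.

```python
def build_insert(table, item):
    """Return (sql, params) for a parameterized INSERT built from a dict-like item.

    Table and column names are interpolated directly, so they must come from
    trusted code, never from scraped data. Values go through %s placeholders,
    which pymysql escapes when the statement is executed.
    """
    columns = list(item.keys())
    column_list = ", ".join(f"`{name}`" for name in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders})"
    params = tuple(item[name] for name in columns)
    return sql, params


# Hypothetical item with made-up field names; note the quote in the title.
sql, params = build_insert("articles", {"title": "O'Reilly", "url": "https://example.com"})
print(sql)
print(params)
```

Inside a pipeline, the result would typically be passed on as `self.cursor.execute(sql, params)`.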

To use utf8 on the MySQL side as well, set the following character set variables:

`set character_set_database=utf8;`

`set character_set_server=utf8;`

`set character_set_client=utf8;`

`set character_set_connection=utf8;`

`set character_set_results=utf8;`

daemonstatus.json

Checks the load status of the service. Supported request method: GET.

Example: `curl http://localhost:6800/daemonstatus.json`

Output: `{ "status": "ok", "running": "0", "pending": "0", "finished": "0", "node_name": "node-name" }`
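The status check can also be done from Python. This is a minimal sketch using only the standard library; the function names `parse_daemon_status` and `fetch_daemon_status` are made up for illustration, and the default address assumes Scrapyd is listening on `localhost:6800` as in the commands above.

```python
import json
from urllib.request import urlopen


def parse_daemon_status(body):
    """Parse a daemonstatus.json response body into (is_ok, counts)."""
    data = json.loads(body)
    # int() is applied because the sample response shows the counts as strings.
    counts = {key: int(data.get(key, 0)) for key in ("pending", "running", "finished")}
    return data.get("status") == "ok", counts


def fetch_daemon_status(base_url="http://localhost:6800"):
    """Fetch and parse the status from a running Scrapyd instance."""
    with urlopen(f"{base_url}/daemonstatus.json", timeout=10) as response:
        return parse_daemon_status(response.read().decode("utf-8"))


# Parse the sample response without contacting a server.
sample = '{"status": "ok", "running": "0", "pending": "0", "finished": "0", "node_name": "node-name"}'
print(parse_daemon_status(sample))
```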

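addversion.json

Adds a version to a project, creating the project if it does not exist. Supported request method: POST.

Parameters:
- `project` (string, required): the project name
- `version` (string, required): the project version
- `egg` (file, required): a Python egg containing the project's code

Example: `curl http://localhost:6800/addversion.json -F project=myproject -F version=r23 -F egg=@myproject.egg`

Output: `{"status": "ok", "spiders": 3}`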

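schedule.json

Schedules a spider run (as a job) and returns the job id. Supported request method: POST.

Parameters:
- `project` (string, required): the project name
- `spider` (string, required): the spider name
- `setting` (string, optional): a Scrapy setting to use when running the spider
- `jobid` (string, optional): a job id used to identify the job, overriding the default generated UUID
- `_version` (string, optional): the version of the project to use
- any other parameter is passed as a spider argument

Example: `curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider`

Output: `{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}`

Example passing extra parameters: `curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d setting=DOWNLOAD_DELAY=2 -d arg1=val1`

Spiders scheduled with Scrapyd should accept an arbitrary number of keyword arguments, because Scrapyd sends internally generated spider arguments to the spider being scheduled.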
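The same scheduling call can be issued from Python. The sketch below builds the form-encoded POST body with the standard library; `build_schedule_request` and `schedule` are hypothetical helper names, and the server address is assumed to be `http://localhost:6800` as in the curl examples.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(project, spider, settings=None,
                           base_url="http://localhost:6800", **spider_args):
    """Build the POST request for schedule.json without sending it."""
    fields = [("project", project), ("spider", spider)]
    # Each Scrapy setting is sent as its own setting=NAME=VALUE field.
    for name, value in (settings or {}).items():
        fields.append(("setting", f"{name}={value}"))
    # Remaining keyword arguments are passed through as spider arguments.
    fields.extend(spider_args.items())
    body = urlencode(fields).encode("utf-8")
    return Request(f"{base_url}/schedule.json", data=body, method="POST")


def schedule(project, spider, **kwargs):
    """Send the request and return the job id reported by Scrapyd."""
    request = build_schedule_request(project, spider, **kwargs)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))["jobid"]


# Build (but do not send) the equivalent of the curl example with extra parameters.
request = build_schedule_request(
    "myproject", "somespider", settings={"DOWNLOAD_DELAY": 2}, arg1="val1"
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Separating request construction from sending keeps the encoding logic checkable without a running Scrapyd instance.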

cancel.json

Cancels a spider run (a job). If the job is pending, it is removed. If the job is running, it is terminated. Supported request method: POST.

Parameters:
- `project` (string, required): the project name
- `job` (string, required): the job id

Example: `curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444`

Output: `{"status": "ok", "prevstate": "running"}`

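listprojects.json

Gets the list of projects uploaded to the Scrapyd server. Supported request method: GET. Takes no parameters.

Example: `curl http://localhost:6800/listprojects.json`

Output: `{"status": "ok", "projects": ["myproject", "otherproject"]}`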

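listversions.json

Gets the list of versions available for a project. The versions are returned in order, and the last one is the version currently in use. Supported request method: GET.

Parameters:
- `project` (string, required): the project name

Example: `curl http://localhost:6800/listversions.json?project=myproject`

Output: `{"status": "ok", "versions": ["r99", "r156"]}`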
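Because the last entry is the version in use, picking it out of the response is a one-liner. A small sketch follows; `listversions_url` and `current_version` are hypothetical helper names, and the base URL assumes the default local Scrapyd address.

```python
import json
from urllib.parse import urlencode


def listversions_url(project, base_url="http://localhost:6800"):
    """Build the GET URL for listversions.json with the project name encoded."""
    return f"{base_url}/listversions.json?{urlencode({'project': project})}"


def current_version(body):
    """Return the last version in the response, or None if there are none."""
    versions = json.loads(body).get("versions", [])
    return versions[-1] if versions else None


print(listversions_url("myproject"))
print(current_version('{"status": "ok", "versions": ["r99", "r156"]}'))
```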

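listspiders.json

Gets the list of spiders available in the latest version of a project, unless another version is specified. Supported request method: GET.

Parameters:
- `project` (string, required): the project name
- `_version` (string, optional): the version of the project to examine

Example: `curl http://localhost:6800/listspiders.json?project=myproject`

Output: `{"status": "ok", "spiders": ["spider1", "spider2", "spider3"]}`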

listjobs.json

Gets the pending, running, and finished jobs of a project. Supported request method: GET.

Parameters:
- `project` (string, required): the project name

Example: `curl http://localhost:6800/listjobs.json?project=myproject`

Output: `{"status": "ok", "pending": [{"id": "78391cc0fcaf11e1b0090800272a6d06", "spider": "spider1"}], "running": [{"id": "422e608f9f28cef127b3d5ef93fe9399", "spider": "spider2", "start_time": "2012-09-12 10:14:03.594664"}], "finished": [{"id": "2f16646cfcaf11e1b0090800272a6d06", "spider": "spider3", "start_time": "2012-09-12 10:14:03.594664", "end_time": "2012-09-12 10:24:03.594664"}]}`

All job data is kept in memory and is reset when the Scrapyd service restarts.
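A response of this shape is easy to summarize in Python. The sketch below counts jobs per state and computes how long each finished job ran; `summarize_jobs` is a hypothetical helper, and the timestamp format is an assumption taken from the sample output, so it may need adjusting for other Scrapyd versions.

```python
import json
from datetime import datetime

# Assumed timestamp layout, copied from the sample listjobs.json output.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def summarize_jobs(body):
    """Count jobs per state and compute the run time of finished jobs in seconds."""
    data = json.loads(body)
    counts = {state: len(data.get(state, [])) for state in ("pending", "running", "finished")}
    durations = {}
    for job in data.get("finished", []):
        start = datetime.strptime(job["start_time"], TIME_FORMAT)
        end = datetime.strptime(job["end_time"], TIME_FORMAT)
        durations[job["id"]] = (end - start).total_seconds()
    return counts, durations


sample = json.dumps({
    "status": "ok",
    "pending": [{"id": "78391cc0fcaf11e1b0090800272a6d06", "spider": "spider1"}],
    "running": [{"id": "422e608f9f28cef127b3d5ef93fe9399", "spider": "spider2",
                 "start_time": "2012-09-12 10:14:03.594664"}],
    "finished": [{"id": "2f16646cfcaf11e1b0090800272a6d06", "spider": "spider3",
                  "start_time": "2012-09-12 10:14:03.594664",
                  "end_time": "2012-09-12 10:24:03.594664"}],
})
counts, durations = summarize_jobs(sample)
print(counts)
print(durations)
```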

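delversion.json

Deletes a project version. If the project has no versions left afterwards, the project is deleted as well. Supported request method: POST.

Parameters:
- `project` (string, required): the project name
- `version` (string, required): the project version

Example: `curl http://localhost:6800/delversion.json -d project=myproject -d version=r99`

Output: `{"status": "ok"}`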

delproject.json

Deletes a project and all of its versions. Supported request method: POST.

Parameters:
- `project` (string, required): the project name

Example: `curl http://localhost:6800/delproject.json -d project=myproject`

Output: `{"status": "ok"}`
————————————————
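Copyright notice: this is an original article by CSDN blogger "肥叔菌", released under the CC 4.0 BY-SA license. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/asmartkiller/article/details/111462194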
