Common Scrapy Commands

tboqi1

Tags: python, scrapy, web crawler

Original link: https://www.jianshu.com/p/6087fbcf3e99

Create a project

D:\tmp\scrapy>scrapy startproject testproject
New Scrapy project 'testproject', using template directory 'c:\\users\\tony\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\tmp\scrapy\testproject

You can start your first spider with:
    cd testproject
    scrapy genspider example example.com

D:\tmp\scrapy>dir
 Volume in drive D has no label.
 Volume Serial Number is C5EE-F557

 Directory of D:\tmp\scrapy

2017/10/23  19:53    <DIR>          .
2017/10/23  19:53    <DIR>          ..
2017/10/23  19:53    <DIR>          testproject
               0 File(s)              0 bytes
               3 Dir(s)  149,355,196,416 bytes free

List the available templates

D:\tmp\scrapy\tet>scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

Create a spider file

D:\tmp\scrapy\tet>scrapy genspider -t basic spidername http://www.jsit.edu.cn
Created spider 'spidername' using template 'basic' in module:
  tet.spiders.spidername

D:\tmp\scrapy\tet\tet>dir spiders
 Volume in drive D has no label.
 Volume Serial Number is C5EE-F557

 Directory of D:\tmp\scrapy\tet\tet\spiders

2017/10/23  19:49    <DIR>          .
2017/10/23  19:49    <DIR>          ..
2017/10/23  19:49               249 spidername.py
2017/10/23  19:42               237 test.py
2017/10/23  19:40               588 tset2.py
2017/10/23  19:41               588 tset3.py
2017/10/23  15:33               161 __init__.py
2017/10/23  19:44    <DIR>          __pycache__
               5 File(s)          1,823 bytes
               3 Dir(s)  149,355,225,088 bytes free

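Export to JSON Lines format. The spider's parse function must use yield to return dict data. Scrapy chooses the export format from the extension of the output file (for example, `.jl` for JSON Lines).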

scrapy crawl spiderjob -o ../out/spiderjob.j
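
A JSON Lines file holds one JSON object per line, one for each dict the spider yielded. As a minimal sketch of how such a file might be read back with only the standard library, consider the following; the helper name `read_json_lines` and the sample records are hypothetical and are not part of Scrapy.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical records standing in for a spider's output.
    records = [{"title": "first"}, {"title": "second"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "items.jl")
        with open(path, "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        for item in read_json_lines(path):
            print(item["title"])
```

Reading line by line keeps memory use flat for large crawls, which is the main reason to prefer this format over a single JSON array.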

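Author: tonyemail_st
Link: https://www.jianshu.com/p/6087fbcf3e99
Source: Jianshu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.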
