scrapy shell 实践 | 交互式爬虫

自CSDN学院课实践。

前提:安装scrapy,配置好环境。

主题:交互式爬虫shell命令实践。

C:\Users\32310>scrapy shell https://www.taobao.com/tbhome/page/special-markets
2020-02-19 16:01:34 [scrapy.utils.log] INFO: Scrapy 1.7.4 started (bot: scrapybot)
2020-02-19 16:01:34 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-02-19 16:01:34 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2020-02-19 16:01:34 [scrapy.extensions.telnet] INFO: Telnet Password: 84f2e2deb0d6e559
2020-02-19 16:01:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2020-02-19 16:01:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-19 16:01:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-19 16:01:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-19 16:01:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-02-19 16:01:34 [scrapy.core.engine] INFO: Spider opened
2020-02-19 16:01:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.taobao.com/tbhome/page/special-markets> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000215E31B3A20>
[s]   item       {}
[s]   request    <GET https://www.taobao.com/tbhome/page/special-markets>
[s]   response   <200 https://www.taobao.com/tbhome/page/special-markets>
[s]   settings   <scrapy.settings.Settings object at 0x00000215E31B3B70>
[s]   spider     <DefaultSpider 'default' at 0x215e34edb38>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

以上是可用的scrapy对象和命令。

爬虫所需要的内容是response,这是网站接到请求以后返回的答复。

>>> response.url
'https://www.taobao.com/tbhome/page/special-markets'
>>> response.status
200

然后是爬取数据。

>>> response.selector.xpath('//dt/text()')
[<Selector xpath='//dt/text()' data='时尚爆料王'>, <Selector xpath='//dt/text()' data='品质生活家'>, <Selector xpath='//dt/text()' data='特色玩味控'>, <Selector xpath='//dt/text()' data='实惠专业户'>]
>>> ls=response.selector.xpath('//dt/text()').extract()
>>> ls
['时尚爆料王', '品质生活家', '特色玩味控', '实惠专业户']
>>> ls=response.selector.xpath('//dl[@class="market-list"]')
>>> ls
[<Selector xpath='//dl[@class="market-list"]' data='<dl class="market-list" id="all-speci...'>, <Selector xpath='//dl[@class="market-list"]' data='<dl class="market-list" id="all-speci...'>, <Selector xpath='//dl[@class="market-list"]' data='<dl class="market-list" id="all-speci...'>, <Selector xpath='//dl[@class="market-list"]' data='<dl class="market-list" id="all-speci...'>]
>>> for i in ls:
...     print(i.xpath('./dt/text()').extract_first())
...
时尚爆料王
品质生活家
特色玩味控
实惠专业户
>>> for i in ls:
...     print(i.xpath('./dt/text()').extract_first())
...     print('='*70)
...     a=i.xpath('.//a')
...     for j in a:
...             print(j.xpath('./@href').extract_first(),end=":")
...             print(j.xpath('./span/img/@alt').extract_first())
...
时尚爆料王
======================================================================
https://if.taobao.com/:潮流从这里开始
https://guang.taobao.com/:外貌协会の逛街指南
https://mei.taobao.com/:妆 出你的腔调
https://g.taobao.com/:探索全球美好生活
//star.taobao.com/:全球明星在这里
https://mm.taobao.com/:美女红人集中地
https://www.taobao.com/markets/designer/stylish:全球创意设计师平台
品质生活家
======================================================================
https://chi.taobao.com/chi/:食尚全球 地道中国
//q.taobao.com:懂得好生活
https://www.jiyoujia.com/:过我想要的生活
https://www.taobao.com/markets/sph/sph/sy:尖货奢品品味选择
https://www.taobao.com/markets/qbb/index:享受育儿生活新方式
//car.taobao.com/:买车省钱,用车省心
//sport.taobao.com/:爱上运动每一天
//zj.taobao.com:匠心所在  物有所值
//wt.taobao.com/:畅享优质通信生活
https://www.alitrip.com/:比梦想走更远
特色玩味控
======================================================================
https://china.taobao.com:地道才够味!
https://www.taobao.com/markets/3c/tbdc:为你开启潮流新生活
https://acg.taobao.com/:ACGN  好玩好看
https://izhongchou.taobao.com/index.htm:认真对待每一个梦想。
//enjoy.taobao.com/:园艺宠物爱好者集中营
https://sf.taobao.com/:法院处置资产,0佣金捡漏
https://zc-paimai.taobao.com/:超值资产,投资首选
https://paimai.taobao.com/:想淘宝上拍卖
//xue.taobao.com/:给你未来的学习体验
//2.taobao.com:让你的闲置游起来
https://ny.taobao.com/:价格实惠品类齐全
实惠专业户
======================================================================
//tejia.taobao.com/:优质好货 特价专区
https://qing.taobao.com/:品牌尾货365天最低价
https://ju.taobao.com/jusp/other/mingpin/tp.htm:奢侈品团购第一站
https://ju.taobao.com/jusp/other/juliangfan/tp.htm?spm=608.5847457.102202.5.jO4uZI:重新定义家庭生活方式
https://qiang.taobao.com/:抢到就是赚到!
https://ju.taobao.com/jusp/nv/fcdppc/tp.htm:大牌正品 底价特惠
https://ju.taobao.com/jusp/shh/life/tp.htm?:惠聚身边精选好货
https://ju.taobao.com/jusp/sp/global/tp.htm?spm=0.0.0.0.biIDGB:10点上新 全球底价
https://try.taobao.com/index.htm:总有新奇等你发现
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值