Parsing HTML online (open source): scrapy-2, probing a website and parsing its HTML

We will start with the scrapy shell. It is best to install IPython first, because it lets us complete commands with Tab inside the Python shell.

pip install ipython

Let's start by fetching a website.

Change into the project directory:

root@uliweb:~/spider/boge# pwd

/root/spider/boge

root@uliweb:~/spider/boge# scrapy shell http://blu-raydisc.tv/

2014-06-04 08:22:37+0800 [scrapy] INFO: Scrapy 0.22.2 started (bot: boge)

2014-06-04 08:22:37+0800 [scrapy] INFO: Optional features available: ssl, http11

2014-06-04 08:22:37+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'boge.spiders', 'SPIDER_MODULES': ['boge.spiders'], 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'boge'}

2014-06-04 08:22:37+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState

2014-06-04 08:22:37+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats

2014-06-04 08:22:37+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware

2014-06-04 08:22:37+0800 [scrapy] INFO: Enabled item pipelines: ImagesPipeline

2014-06-04 08:22:37+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023

2014-06-04 08:22:37+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080

2014-06-04 08:22:37+0800 [default] INFO: Spider opened

2014-06-04 08:22:40+0800 [default] DEBUG: Crawled (200) <GET http://blu-raydisc.tv/> (referer: None)

[s] Available Scrapy objects:

[s]   crawler    

[s]   item       {}

[s]   request    

[s]   response   <200 http://blu-raydisc.tv/>

[s]   sel        

[s]   settings   >

[s]   spider     

[s] Useful shortcuts:

[s]   shelp()           Shell help (print this help)

[s]   fetch(req_or_url) Fetch request (or URL) and update local objects

[s]   view(response)    View response in a browser

/usr/local/lib/python2.7/dist-packages/IPython/frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated. All its subpackages have been moved to the top `IPython` level.

warn("The top-level `frontend` package has been deprecated. "

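The page was fetched successfully. Look at the objects listed above: below we will use response and sel. The other objects are not needed for now and will be covered later.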

In [1]: print response.

response.body             response.copy             response.flags            response.meta             response.request          response.url

response.body_as_unicode  response.encoding         response.headers          response.replace          response.status

In [1]: print response.bo

response.body             response.body_as_unicode

In [1]: print response.body    (the home page is fairly large, so the output is not reproduced here)

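Now I want to try grabbing images to see which good movies have come out recently. I took an excerpt of the page's HTML for this:

[HTML excerpt: a list of six items; the markup was lost in extraction]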

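First, work out where the images sit in this markup and in what form they appear. We could match them with a regular expression, or we could use the much more powerful XPath.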
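To make the regex-versus-XPath choice concrete, here is a minimal sketch that runs outside the scrapy shell using only the standard library. The markup in SNIPPET is hypothetical (it is not the site's actual HTML), and the function names are made up for illustration. Note that xml.etree.ElementTree understands only a subset of XPath and needs well-formed input, so the attribute is read in Python instead of with an expression like //img/@src.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for this illustration.
SNIPPET = """
<ul>
  <li><a href="/film/a" title="Film A"><img src="http://example.com/a.jpg" alt="A"/></a></li>
  <li><a href="/film/b" title="Film B"><img src="http://example.com/b.jpg" alt="B"/></a></li>
</ul>
"""


def srcs_by_regex(markup):
    # Quick but brittle: relies on double-quoted attributes and would
    # also match attributes such as data-src.
    return re.findall(r'<img[^>]*\bsrc="([^"]+)"', markup)


def srcs_by_xpath(markup):
    # Select the <img> elements by path, then read the attribute.
    root = ET.fromstring(markup.strip())
    return [img.get("src") for img in root.findall(".//img") if img.get("src")]


if __name__ == "__main__":
    print(srcs_by_regex(SNIPPET))
    print(srcs_by_xpath(SNIPPET))
```

Both functions return the same list for this snippet. The path-based version keeps working when attribute order or spacing changes, which is the main reason to prefer XPath in the shell session that follows.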

In [11]: sel.xpath('//img/@src').extract()

Out[11]:

[u'http://blu-raydisc.tv/images/logo.png',

u'data:image/gif;base64,R0lGODlhAQABAJEAAAAAAP///wAAACH5BAEHAAIALAAAAAABAAEAAAICVAEAOw==',

u'data:image/gif;base64,R0lGODlhAQABAJEAAAAAAP///wAAACH5BAEHAAIALAAAAAABAAEAAAICVAEAOw==',

u'data:image/gif;base64,R0lGODlhAQABAJEAAAAAAP///wAAACH5BAEHAAIALAAAAAABAAEAAAICVAEAOw==',

u'data:image/gif;base64,R0lGODlhAQABAJEAAAAAAP///wAAACH5BAEHAAIALAAAAAABAAEAAAICVAEAOw==',

u'data:image/gif;base64,R0lGODlhAQABAJEAAAAAAP///wAAACH5BAEHAAIALAAAAAABAAEAAAICVAEAOw==',

u'http://i.blu-raydisc.tv/images/photos/A_5.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.04.game-of-thrones-season-4.game-of-thrones-season-4_0nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.02.winter-s-tale.winter-s-tale_1nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.04.that-demon-within.that-demon-within_0nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.04.the-fatal-encounter.the-fatal-encounter_0nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.06.edge-of-tomorrow.edge-of-tomorrow_0nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.05.the-amazing-spider-man-2.the-amazing-spider-man-2_01nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.05.x-men-days-of-future-past.x-men-days-of-future-past_1nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.03.300-rise-of-an-empire.300-rise-of-an-empire_1nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.03.the-grand-budapest-hotel.the-grand-budapest-hotel_0nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2013.09.las-brujas-de-zugarramurdi.las-brujas-de-zugarramurdi_0nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.04.oculus.oculus_0nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.04.captain-america.captain-america_14nsp_275.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2015.07.terminator-genesis.terminator-genesis_001nsp_282.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2016.03.prometheus-2.prometheus-2_001nsp_282.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2016.03.warcraft.warcraft_001nsp_282.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2015.06.the-fantastic-four.the-fantastic-four_001nsp_282.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2015.05.the-avengers-2.the-avengers-2_001nsp_282.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2016.01.pirates-of-the-caribbean.pirates-of-the-caribbean_001nsp_282.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2016.01.avatar-2.avatar-2_001nsp_282.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.08.Guardians-of-the-Galaxy.Guardians-of_1nsp_282.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.06.Transfomers-4.Transfomers-4_0nsp_282.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.03.need-for-speed.need-for-speed_1nsp_282.jpg',

u'http://i.blu-raydisc.tv/images/photos/the-hobbit-2.jpg',

u'http://i.blu-raydisc.tv/images/photos/the-hobbit-2_1.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.04.game-of-thrones-season-4.game-of-thrones-season-4_0newspro1.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.02.winter-s-tale.winter-s-tale_1newspro1.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.04.that-demon-within.that-demon-within_0newspro1.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2011.07.Ice-Age-3.Ice-Age-3_01newspro1.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.04.the-fatal-encounter.the-fatal-encounter_0newspro1.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.1987.07.a-chinese-ghost-story.a-chinese-ghost-story_1newspro1.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.1990.07.sinnui-yauman-2.sinnui-yauman-2_1newspro1.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.1991.09.a-chinese-ghost-story-3.a-chinese-ghost-story-3_1newspro1.jpg',

u'http://blu-raydisc.tv/modules/mod_news_pro_gk4/cache/Film.2014.06.edge-of-tomorrow.edge-of-tomorrow_0newspro1.jpg']

In [12]: sel.xpath('//img/@data-src').extract()

Out[12]:

[u'http://i.blu-raydisc.tv/images/photos/A_4.jpg',

u'http://i.blu-raydisc.tv/images/photos/A_3.jpg',

u'http://i.blu-raydisc.tv/images/photos/A_2.jpg',

u'http://i.blu-raydisc.tv/images/photos/A_1.jpg',

u'http://i.blu-raydisc.tv/images/photos/16.jpg']

Good, these are the image URLs we extracted.
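The two result lists overlap in an awkward way: several src values are data:image/gif placeholders, while the real addresses of those images appear under data-src. As a minimal sketch (the helper name keep_real_urls is hypothetical, not part of Scrapy), the two lists could be merged like this, dropping the placeholders and any duplicates:

```python
def keep_real_urls(src_values, data_src_values):
    """Merge two lists of URL strings, skipping data: URIs and duplicates.

    Order is preserved: values from src_values come first.
    """
    urls = []
    for url in list(src_values) + list(data_src_values):
        if url.startswith("data:"):
            # Inline placeholder image, not a downloadable address.
            continue
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    src = [
        "http://blu-raydisc.tv/images/logo.png",
        "data:image/gif;base64,R0lGODlhAQABAJEAAAAAAP///wAAACH5BAEHAAIALAAAAAABAAEAAAICVAEAOw==",
        "http://i.blu-raydisc.tv/images/photos/A_5.jpg",
    ]
    data_src = ["http://i.blu-raydisc.tv/images/photos/A_4.jpg"]
    for url in keep_real_urls(src, data_src):
        print(url)
```

In the shell, the two arguments would be the lists returned by the two extract() calls above.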

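sel.xpath('//img/@src').extract() has worked for me every time: the image paths in an HTML page can almost always be collected this way. As the output above shows, lazily loaded images may keep their real address in another attribute such as data-src.

sel.xpath('//a/@title').extract() extracts the movie titles. It returns the title attribute of every link on the page, so the result may need filtering.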
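Since //a/@title matches every link that has a title, one way to narrow the result is to keep only links that wrap an image and to pair each title with its picture. The following is a sketch under assumptions: SNIPPET is hypothetical markup rather than the site's real HTML, titled_images is an illustrative name, and the standard library's ElementTree stands in for the scrapy selector so that the example runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: one navigation link, two movie links.
SNIPPET = """<div>
  <a href="/" title="Home">Home</a>
  <a href="/film/a" title="Film A"><img src="http://example.com/a.jpg"/></a>
  <a href="/film/b" title="Film B"><img src="http://example.com/b.jpg"/></a>
</div>"""


def titled_images(markup):
    """Return (title, image URL) pairs for links that directly contain an <img>."""
    root = ET.fromstring(markup)
    pairs = []
    for link in root.findall(".//a[@title]"):
        img = link.find("img")  # direct child <img>, or None
        if img is not None and img.get("src"):
            pairs.append((link.get("title"), img.get("src")))
    return pairs


if __name__ == "__main__":
    for title, url in titled_images(SNIPPET):
        print(title, url)
```

In the scrapy shell, a comparable narrowing could probably be written as the XPath //a[img]/@title, though whether that fits depends on how the real page nests its links and images.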

Now that we have worked out how to get the data we need, the next step is to use scrapy to build a simple, entry-level crawler.
