Note: a 403 error is easy to hit here in the shell; it does not appear during the actual crawl.
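If the 403 does get in the way while testing, one common workaround is to override the user agent when starting the shell; this assumes the 403 is triggered by Scrapy's default user agent, and the URL below is only a placeholder:

scrapy shell -s USER_AGENT='Mozilla/5.0' 'http://example.com/some-page'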
>>> response.xpath('//title/text()').extract()

response.xpath() returns a list of selectors (a SelectorList); .extract() serializes them into a list of unicode strings.
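The links variable used in the loop below is not defined in this snippet; in the official selectors example it is obtained from the anchor tags of the sample page, roughly like this (the XPath is an assumption based on that example):

>>> links = response.xpath('//a[contains(@href, "image")]')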
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args
Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
The enumerate() function is normally used inside a for loop.
A plain for loop:
>>> i = 0
>>> seq = ['one', 'two', 'three']
>>> for element in seq:
...     print i, seq[i]
...     i += 1
...
0 one
1 two
2 three
The same for loop using enumerate:
>>> seq = ['one', 'two', 'three']
>>> for i, element in enumerate(seq):
...     print i, seq[i]
...
0 one
1 two
2 three
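enumerate() also takes an optional start argument, useful when the count should not begin at 0; this small extension is not part of the original notes:

>>> for i, element in enumerate(seq, 1):
...     print i, element
...
1 one
2 two
3 three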
Suppose you want to extract all <p> elements inside <div> elements. First, you would get all the <div> elements:

>>> divs = response.xpath('//div')

Then extract the <p> elements relative to each <div> (note the dot prefixing the .//p XPath):
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside each <div>
...     print p.extract()
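By contrast, an XPath without the leading dot is absolute, so it would pull every <p> in the whole document once per <div>; this counter-example is added here for clarity:

>>> for p in divs.xpath('//p'):  # wrong: selects all <p> in the document, not just those inside each <div>
...     print p.extract()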
Another common case would be to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print p.extract()
Using the shell from inside a spider

from scrapy.shell import inspect_response

inspect_response(response, self)

Press Ctrl-D (or Ctrl-Z on Windows) to exit the shell and resume the crawl.
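A minimal sketch of where the call usually sits inside a spider; the spider name, URL, and condition are placeholders, not part of the original notes:

import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = 'debug_example'               # hypothetical spider name
    start_urls = ['http://example.com']  # hypothetical start URL

    def parse(self, response):
        # open an interactive shell only when the page looks wrong
        if not response.xpath('//title'):
            inspect_response(response, self)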
It is best to wrap the outermost XPath expression in single quotes!
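One reason: string literals inside the XPath often need double quotes, so single quotes on the outside avoid escaping; the class name below is only an illustration:

>>> response.xpath('//a[@class="external"]/@href').extract()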
Running the shell on a local HTML file is handy for debugging (but don't name the file index.html):
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

Even when the file is in the current directory, the ./ prefix is required; a bare scrapy shell file.html does not work, because the shell favors HTTP URLs and would treat the bare name as a domain to resolve.
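A file:// URI also works and sidesteps the relative-path rules entirely; the path is a placeholder:

scrapy shell file:///absolute/path/to/file.html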