Note: a 403 error is easy to hit here in the shell; it does not appear during the actual crawl.
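If the 403 does get in the way while testing, one common workaround is to override the user agent when starting the shell; this assumes the 403 is triggered by Scrapy's default user agent, and the URL below is only a placeholder:

scrapy shell -s USER_AGENT='Mozilla/5.0' 'http://example.com/some-page'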
>>> response.xpath('//title/text()').extract()

response.xpath() returns a list of selectors (a SelectorList); .extract() serializes them into a list of unicode strings.
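The links variable used in the loop below is not defined in this snippet; in the official selectors example it is obtained from the anchor tags of the sample page, roughly like this (the XPath is an assumption based on that example):

>>> links = response.xpath('//a[contains(@href, "image")]')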
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args
Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
The enumerate() function is normally used inside a for loop.
A plain for loop:
>>> i = 0
>>> seq = ['one', 'two', 'three']
>>> for element in seq:
...     print i, seq[i]
...     i += 1
...
0 one
1 two
2 three
The same for loop using enumerate:
>>> seq = ['one', 'two', 'three']
>>> for i, element in enumerate(seq):
...     print i, seq[i]
...
0 one
1 two
2 three
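enumerate() also takes an optional start argument, useful when the count should not begin at 0; this small extension is not part of the original notes:

>>> for i, element in enumerate(seq, 1):
...     print i, element
...
1 one
2 two
3 three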
Suppose you want to extract all <p> elements inside <div> elements. First, you would get all the <div> elements:

>>> divs = response.xpath('//div')

Then extract the <p> elements relative to each <div> (note the dot prefixing the .//p XPath):
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside each <div>
...     print p.extract()
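By contrast, an XPath without the leading dot is absolute, so it would pull every <p> in the whole document once per <div>; this counter-example is added here for clarity:

>>> for p in divs.xpath('//p'):  # wrong: selects all <p> in the document, not just those inside each <div>
...     print p.extract()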
Another common case would be to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print p.extract()
Using the shell from inside a spider

from scrapy.shell import inspect_response

inspect_response(response, self)

Press Ctrl-D (or Ctrl-Z on Windows) to exit the shell and resume the crawl.
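A minimal sketch of where the call usually sits inside a spider; the spider name, URL, and condition are placeholders, not part of the original notes:

import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = 'debug_example'               # hypothetical spider name
    start_urls = ['http://example.com']  # hypothetical start URL

    def parse(self, response):
        # open an interactive shell only when the page looks wrong
        if not response.xpath('//title'):
            inspect_response(response, self)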
It is best to wrap the outermost XPath expression in single quotes!
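One reason: string literals inside the XPath often need double quotes, so single quotes on the outside avoid escaping; the class name below is only an illustration:

>>> response.xpath('//a[@class="external"]/@href').extract()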
Running the shell on a local HTML file is handy for debugging (but don't name the file index.html):
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

Even when the file is in the current directory, the ./ prefix is required; a bare scrapy shell file.html does not work, because the shell favors HTTP URLs and would treat the bare name as a domain to resolve.
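A file:// URI also works and sidesteps the relative-path rules entirely; the path is a placeholder:

scrapy shell file:///absolute/path/to/file.html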