Learning the Scrapy Framework: Extracting Data with Selector

Latest recommended article published on 2023-06-11 19:02:49

ChenTsingZheng

Category: Scrapy Crawlers

Article link: https://blog.csdn.net/a7806006/article/details/90514104

1. The Selector object
2. Creating the object
3. XPath
4. CSS selectors

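1. The Selector object

The core technique for extracting data from a web page is parsing its HTML text. In Python, the following modules are commonly used for this:

BeautifulSoup
lxml
Scrapy combines the strengths of both in its Selector class, which is built on lxml. In my experience it is far more comfortable to use than either of the two modules above, and simpler than extracting data with regular expressions.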

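2. Creating the object

The Selector class lives in the scrapy.selector module. To create a Selector object, you can pass the page's HTML document string to the text parameter of the Selector constructor, or pass a Response object to its response parameter, as the code below does: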

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

body = """<html lang="en">
<body>
<h1>hello,world</h1>
<h1>Hello,Scrapy</h1>
<b>Hello Python</b>
<ul>
    <li>C++</li>
    <li>Java</li>
    <li>python</li>
</ul>
</body>
</html>"""
# Wrap the HTML string in a response, then build the Selector from it
response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')
selector = Selector(response=response)
print(selector)

# xpath() returns a SelectorList, which can be iterated
selector_list = selector.xpath('//h1')
print(selector_list)
for sel in selector_list:
    print(sel.xpath("./text()"))

# xpath() and css() calls can be chained
print(selector.xpath('//ul').css('li').xpath('./text()'))

# re() applies a regular expression to each match and returns a list of strings
sl = selector.xpath('.//li/text()').re(r'\S+')
print(sl)

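Selector also provides an re method for regular expressions, but I rarely use it in my own code, so I will not go into detail here and risk misleading anyone.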

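3. XPath

XPath, the XML Path Language, is a language for locating parts of an XML document.
The nodes of an XML document come in several types, of which the most common are:

Root node: the root of the whole document tree
Element nodes: html, body, div, p, a
Attribute nodes: href, src
Text nodes: "Hello, world" and the like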

For the basic XPath syntax, see w3school.
A few common expressions are shown below.

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

body = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Example website</title>
</head>
<body>
<div id="images">
    <a href="image1.html">Name:Image 1<br/><img src="image1.jpg"></a>
    <a href="image2.html">Name:Image 2<br/><img src="image2.jpg"></a>
    <a href="image3.html">Name:Image 3<br/><img src="image3.jpg"></a>
    <a href="image4.html">Name:Image 4<br/><img src="image4.jpg"></a>
    <a href="image5.html">Name:Image 5<br/><img src="image5.jpg"></a>
</div>
</body>
</html>
"""
response = HtmlResponse(url="http://example.com", body=body, encoding="utf8")
print(response)
response.xpath("/html")

/: describes an absolute path starting from the root

In [52]: response.xpath("/html")
Out[52]: [<Selector xpath='/html' data='<html lang="en">\n<head>\n    <meta charse'>]

E1/E2: selects all E2 among the child nodes of E1.

In [56]: response.xpath("/html/body/div/a")
Out[56]:
[<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name:Image 1<br><i'>,
 <Selector xpath='/html/body/div/a' data='<a href="image2.html">Name:Image 2<br><i'>,
 <Selector xpath='/html/body/div/a' data='<a href="image3.html">Name:Image 3<br><i'>,
 <Selector xpath='/html/body/div/a' data='<a href="image4.html">Name:Image 4<br><i'>,
 <Selector xpath='/html/body/div/a' data='<a href="image5.html">Name:Image 5<br><i'>]

//E: selects every E in the document, wherever it appears.

In [57]: response.xpath("//a")
Out[57]:
[<Selector xpath='//a' data='<a href="image1.html">Name:Image 1<br><i'>,
 <Selector xpath='//a' data='<a href="image2.html">Name:Image 2<br><i'>,
 <Selector xpath='//a' data='<a href="image3.html">Name:Image 3<br><i'>,
 <Selector xpath='//a' data='<a href="image4.html">Name:Image 4<br><i'>,
 <Selector xpath='//a' data='<a href="image5.html">Name:Image 5<br><i'>]

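E1//E2: selects all E2 among the descendants of E1.
E/text(): selects the text child nodes of E.
E/*: selects all element children of E, regardless of their tag name.
*/E: selects all E among the grandchild nodes.
E/@ATTR: selects the ATTR attribute of E. This is handy for extracting links from a page.
//@ATTR: selects every ATTR attribute in the document.
E/@*: selects all attributes of E.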

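4. CSS selectors

CSS selectors are somewhat simpler than XPath, but less powerful. In fact, when you call the css method of a Selector, the Python library cssselect translates the CSS selector into an XPath expression internally, and the xpath method is then called with it.
CSS syntax can be learned at W3school.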
First, create an HTML document as follows:

from scrapy.selector import Selector
from scrapy.http import HtmlResponse
body="""
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <base href='http://example.com/'/>
    <title>Example website</title>
</head>
<body>
<div id="image-1" style="width: 1230px;">
    <a href="image1.html">Name: Image1 <br/><img src="iamge1.jpg"/> </a>
    <a href="image2.html">Name: Image2 <br/><img src="iamge2.jpg"/> </a>
    <a href="image3.html">Name: Image3 <br/><img src="iamge3.jpg"/> </a>

</div>
<div id="image-2" class="small">
    <a href="image4.html">Name: Image4 <br/><img src="iamge4.jpg"/> </a>
    <a href="image5.html">Name: Image5 <br/><img src="iamge5.jpg"/> </a>
</div>
</body>
</html>
"""
text=HtmlResponse(url="http://www.example.com",body=body,encoding="utf8")

E: selects E elements

In [60]: text.css("img")
Out[60]:
[<Selector xpath='descendant-or-self::img' data='<img src="iamge1.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="iamge2.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="iamge3.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="iamge4.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="iamge5.jpg">'>]

E1,E2: selects both E1 and E2 elements

In [66]: text.css("base,title")
Out[66]:
[<Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<base href="http://example.com/">'>,
 <Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<title>Example website</title>'>]

E1 E2: selects E2 elements among the descendants of E1

In [67]: text.css("div img")
Out[67]:
[<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="iamge1.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="iamge2.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="iamge3.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="iamge4.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="iamge5.jpg">'>]

E1>E2: selects E2 elements among the children of E1

In [68]: text.css("body>div")
Out[68]:
[<Selector xpath='descendant-or-self::body/div' data='<div id="image-1" style="width: 1230px;"'>,
 <Selector xpath='descendant-or-self::body/div' data='<div id="image-2" class="small">\n    <a '>]

[ATTR] (with no element name in front): selects elements that have an ATTR attribute.

In [71]: text.css('[style]')
Out[71]: [<Selector xpath='descendant-or-self::*[@style]' data='<div id="image-1" style="width: 1230px;"'>]

[ATTR=VALUE] (with no element name in front): selects elements that have an ATTR attribute whose value is VALUE.

In [72]: text.css('[id="image-1"]')
Out[72]: [<Selector xpath="descendant-or-self::*[@id = 'image-1']" data='<div id="image-1" style="width: 1230px;"'>]

E:nth-child(n): selects E elements that are the n-th child of their parent.

In [75]: text.css('div:nth-child(2)>a:first-child')
Out[75]: [<Selector xpath='descendant-or-self::div[count(preceding-sibling::*) = 1]/a[count(preceding-sibling::*) = 0]' data='<a href="image4.html">Name: Image4 <br><'>]