Using spiders in Python: basic PyQuery usage (with sample code)

1. Installation

pip install pyquery

2. Importing

from pyquery import PyQuery as pq

3. Introduction

pyquery is a jQuery-like HTML parsing library for Python. It is used in much the same way as bs4 (Beautiful Soup 4).

4. Usage

4.1 Initialization

from pyquery import PyQuery as pq

doc = pq(html)  # parse an HTML string
doc = pq(url="http://news.baidu.com/")  # fetch and parse a web page
doc = pq(filename="./a.html")  # parse a local HTML file

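4.2 Basic CSS selectors

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
print(doc("#wrap .s_from link"))

Output:

<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>

# selects by id, . selects by class, and a bare name such as link selects by tag. A space between selectors means "nested inside", so the selector above matches link elements inside .s_from inside #wrap.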

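4.3 Finding child elements

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

# Find child elements
doc = pq(html)
items = doc("#wrap")
print(items)
print("Type: %s" % type(items))
link = items.find('.s_from')
print(link)
link = items.children()
print(link)

Output:

<div id="wrap">
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
Type: <class 'pyquery.pyquery.PyQuery'>
<ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
<ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>

The output shows that the result is a PyQuery object, and that both find and children can reach nested elements. find searches all descendants, while children returns only direct children.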

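4.4 Finding parent elements

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
items = doc(".s_from")
print(items)

# Find the parent element
parent_href = items.parent()
print(parent_href)

Output:

<ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>

parent returns the immediately enclosing element together with its content. The related parents method returns all enclosing (ancestor) elements.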

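4.5 Finding sibling elements

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
        <link class="active2" href="http://asda1.com">asdadasdad12312</link>
        <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
items = doc("link.active1.a123")
print(items)

# Find sibling elements
siblings_href = items.siblings()
print(siblings_href)

Output:

<link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
<link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>

The output shows that siblings returns the other elements at the same level.

Conclusion: the child, parent, and sibling lookups all return PyQuery objects, so you can run further selections on their results.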

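4.6 Iterating over results

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()  # items() yields each match as a PyQuery object
for it in its:
    print(it)

Output:

<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>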

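4.7 Getting attributes

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.attr('href'))  # method-call form
    print(it.attr.href)     # attribute-access form, same result

Output:

http://asda.com
http://asda.com
http://asda1.com
http://asda1.com
http://asda2.com
http://asda2.com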

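4.8 Getting text

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.text())

Output:

asdadasdad12312
asdadasdad12312
asdadasdad12312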

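4.9 Getting HTML

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.html())  # inner HTML of the element

Output:

asdadasdad12312
asdadasdad12312
asdadasdad12312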

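5. Common DOM operations

5.1 addClass and removeClass

These add and remove a class on an element.

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
        <link class="active2" href="http://asda1.com">asdadasdad12312</link>
        <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print("Add: %s" % it.addClass('active1'))
    print("Remove: %s" % it.removeClass('active1'))

Output:

Add: <link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
Remove: <link class="a123" href="http://asda.com">asdadasdad12312</link>
Add: <link class="active2 active1" href="http://asda1.com">asdadasdad12312</link>
Remove: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
Add: <link class="movie1 active1" href="http://asda2.com">asdadasdad12312</link>
Remove: <link class="movie1" href="http://asda2.com">asdadasdad12312</link>

Note that a class that is already present is not added a second time.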

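5.2 attr and css

attr gets or sets an attribute; css adds a property to the style attribute.

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
        <link class="active2" href="http://asda1.com">asdadasdad12312</link>
        <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print("Modify: %s" % it.attr('class', 'active'))
    print("Add: %s" % it.css('font-size', '14px'))

Output:

Modify: <link class="active" href="http://asda.com">asdadasdad12312</link>
Add: <link class="active" href="http://asda.com" style="font-size: 14px">asdadasdad12312</link>
Modify: <link class="active" href="http://asda1.com">asdadasdad12312</link>
Add: <link class="active" href="http://asda1.com" style="font-size: 14px">asdadasdad12312</link>
Modify: <link class="active" href="http://asda2.com">asdadasdad12312</link>
Add: <link class="active" href="http://asda2.com" style="font-size: 14px">asdadasdad12312</link>

attr and css modify the selected elements in place, so the change is also visible through the original document.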

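5.3 remove

remove deletes elements.

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("div")
print('Text before removal:\n%s' % its.text())
it = its.remove('ul')
print('Text after removal:\n%s' % it.text())

Output:

Text before removal:
hello nihao
asdasd
asdadasdad12312
asdadasdad12312
asdadasdad12312
Text after removal:
hello nihao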

For other DOM methods, see the pyquery API documentation.

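6. Pseudo-class selectors

from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">helloasdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link:first-child")
print('First element: %s' % its)
its = doc("link:last-child")
print('Last element: %s' % its)
its = doc("link:nth-child(2)")
print('Second element: %s' % its)
its = doc("link:gt(0)")  # index starts at 0
print("Elements after index 0: %s" % its)
its = doc("link:nth-child(2n-1)")
print("Odd-numbered elements: %s" % its)
its = doc("link:contains('hello')")
print("Elements whose text contains hello: %s" % its)

Output:

First element: <link href="http://asda.com">helloasdadasdad12312</link>
Last element: <link href="http://asda2.com">asdadasdad12312</link>
Second element: <link href="http://asda1.com">asdadasdad12312</link>
Elements after index 0: <link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
Odd-numbered elements: <link href="http://asda.com">helloasdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
Elements whose text contains hello: <link href="http://asda.com">helloasdadasdad12312</link>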
