Python Web Crawler Study Notes (2)

Course video: "Python Web Crawling and Information Extraction", MOOC, Beijing Institute of Technology

  • The BeautifulSoup library

    An HTML document <=> a tag tree <=> a BeautifulSoup object

    Example:

from bs4 import BeautifulSoup                        # import the BeautifulSoup class
soup = BeautifulSoup('<p>data</p>', 'html.parser')   # parse a tiny HTML fragment
  • Beautiful Soup parsers

    Parser                 Usage                               Requirement
    bs4's HTML parser      BeautifulSoup(mk, 'html.parser')    install the bs4 library
    lxml's HTML parser     BeautifulSoup(mk, 'lxml')           pip install lxml
    lxml's XML parser      BeautifulSoup(mk, 'xml')            pip install lxml
    html5lib's parser      BeautifulSoup(mk, 'html5lib')       pip install html5lib
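
    A quick sketch of switching between the parsers listed above (lxml and html5lib are optional extras and must be pip-installed first; the markup string mk is invented here purely for illustration):

    from bs4 import BeautifulSoup

    mk = "<html><body><p class='title'>data</p></body></html>"   # toy markup

    soup1 = BeautifulSoup(mk, "html.parser")   # ships with bs4, no extra install
    soup2 = BeautifulSoup(mk, "lxml")          # needs: pip install lxml
    soup3 = BeautifulSoup(mk, "html5lib")      # needs: pip install html5lib

    print(soup1.p.string)   # data
    print(soup2.p.string)   # data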
  • Basic elements of the BeautifulSoup class

    Element            Description
    Tag                a tag, the basic unit of the tree
    Name               the tag's name, accessed as .name
    Attributes         the tag's attributes, accessed as .attrs
    NavigableString    the non-attribute string inside a tag, accessed as .string
    Comment            the comment portion of a string inside a tag
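
    A small sketch exercising these elements (the HTML string is made up; it mirrors the comment example used later in these notes):

    from bs4 import BeautifulSoup
    from bs4.element import Comment

    html = "<b class='bold'><!--a comment--></b><p>hello</p>"   # toy markup
    soup = BeautifulSoup(html, "html.parser")

    print(soup.b.name)                          # Name            -> 'b'
    print(soup.b.attrs)                         # Attributes      -> {'class': ['bold']}
    print(soup.p.string)                        # NavigableString -> 'hello'
    print(type(soup.b.string))                  # Comment: looks like a string, so check its type
    print(isinstance(soup.b.string, Comment))   # True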

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> print(soup.prettify())
>>> tag = soup.a
>>> tag.attrs                    # the tag's attributes (a dict)
>>> tag.attrs['class']           # value of the class attribute
>>> tag.attrs['href']            # value of the href attribute
>>> type(tag.attrs)              # type of the attributes: dict
>>> type(tag)                    # type of the tag: bs4.element.Tag
>>> soup.a
>>> soup.a.string
>>> type(soup.a.string)
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")   # brew a fresh "soup"
>>> newsoup.b.string
>>> type(newsoup.b.string)
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.head                    # view the head tag
>>> soup.head.contents           # children of head; .contents returns a list
>>> len(soup.body.contents)      # number of child nodes of body
>>> soup.body.contents[1]
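
The type() checks above are how ordinary text and comments are told apart. A minimal sketch that walks the tree and skips comment strings (same two-tag markup as in the session; the string types come from bs4.element):

from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")

for node in newsoup.descendants:        # walk every node in the tree
    if isinstance(node, Comment):       # Comment is a subclass of NavigableString
        continue                        # skip comment text
    if isinstance(node, NavigableString):
        print(node)                     # prints: This is not a comment
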
  • Downward traversal of the tag tree: .contents, .children, .descendants

    • Traverse the child nodes

      >>> for child in soup.body.children:
      	print(child)

    • Traverse all descendant nodes

      >>> for child in soup.body.descendants:
      	print(child)

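    A small comparison of the three downward attributes on a hand-made snippet (the markup is invented for illustration; .contents is a list, while .children and .descendants are generators):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<div><p><b>one</b></p><p>two</p></div>", "html.parser")
    div = soup.div

    print(div.contents)                     # direct children as a list: [<p><b>one</b></p>, <p>two</p>]
    print([c.name for c in div.children])   # the same direct children: ['p', 'p']
    print(len(list(div.descendants)))       # every tag and string below div: 5
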
  • Upward traversal of the tag tree: .parent is a node's parent tag, .parents yields all of a node's ancestor tags

    • Upward traversal code:
    >>> soup = BeautifulSoup(demo, "html.parser")
    >>> for parent in soup.a.parents:
            if parent is None:          # defensive check in case a parent is None
                print(parent)
            else:
                print(parent.name)

  • Parallel (sibling) traversal of the tag tree (among children of the same parent node): .next_sibling, .previous_sibling, .next_siblings, .previous_siblings

     >>> soup = BeautifulSoup(demo, "html.parser")
     >>> soup.a.next_sibling
     >>> soup.a.next_sibling.next_sibling
     >>> soup.a.previous_sibling
     >>> soup.a.previous_sibling.previous_sibling

    • Traverse the following siblings

      >>> for sibling in soup.a.next_siblings:
      	print(sibling)

    • Traverse the preceding siblings

      >>> for sibling in soup.a.previous_siblings:
      	print(sibling)

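    Note that siblings in the tag tree also include plain NavigableString nodes (in the demo page the first link is followed by plain text rather than by another tag), so .next_sibling is not always a Tag. A small sketch that keeps only tag siblings, reusing the soup built from demo above:

    from bs4.element import Tag

    for sibling in soup.a.next_siblings:
        if isinstance(sibling, Tag):          # skip the plain strings between tags
            print(sibling.name, sibling.get("href"))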