Python Web Crawlers (4): The Beautiful Soup Library

1. Installation

Run the following command in a terminal to install the library:

pip install beautifulsoup4

2. Practice

>>> import requests

>>> r = requests.get("http://python123.io/ws/demo.html")

>>> r.text

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(demo, "html.parser")  # parse the HTML

>>> print(soup.prettify())

<html>

 <head>

  <title>

   This is a python demo page

  </title>

 </head>

 <body>

  <p class="title">

   <b>

    The demo python introduces several python courses.

   </b>

  </p>

  <p class="course">

   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">

    Basic Python

   </a>

   and

   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">

    Advanced Python

   </a>

  </p>

 </body>

</html>

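3. Beautiful Soup is a library for parsing, traversing, and maintaining "tags"

The Beautiful Soup library is also known as beautifulsoup4 or bs4. It is imported as follows: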

from bs4 import BeautifulSoup

A local file can also be parsed by passing an open file object:

>>> from bs4 import BeautifulSoup

>>> soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
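The one-liner above leaves the file handle open and relies on the platform's default encoding. A minimal sketch of a more careful variant follows. It assumes the page is saved as UTF-8, and it writes a small temporary file (a hypothetical stand-in for D://demo.html) so that the example runs anywhere.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical stand-in for a saved page such as D://demo.html.
html = "<html><head><title>This is a python demo page</title></head><body></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # The with-block closes the file handle once parsing has finished.
    with open(path, encoding="utf-8") as f:
        soup2 = BeautifulSoup(f, "html.parser")

title_text = soup2.title.string
print(title_text)
```

BeautifulSoup reads the whole file while it builds the tree, so soup2 remains usable after the file is closed.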

Parser                                              Usage                             Requirement
bs4's HTML parser (Python's built-in html.parser)   BeautifulSoup(mk, 'html.parser')  install the bs4 library
lxml's HTML parser                                  BeautifulSoup(mk, 'lxml')         pip install lxml
lxml's XML parser                                   BeautifulSoup(mk, 'xml')          pip install lxml
html5lib's parser                                   BeautifulSoup(mk, 'html5lib')     pip install html5lib
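Since lxml and html5lib are optional installs, one possible way to use the table above is to try the preferred parsers in order and fall back to html.parser. The sketch below is only an illustration: make_soup and its preferred argument are hypothetical names, while FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair. Hypothetical helper, not part of bs4.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p class='title'><b>demo</b></p>")
print(used)
print(soup.p.b.string)
```

The parsers can build slightly different trees for malformed markup, so a project that depends on exact tree shape may prefer to name one parser explicitly.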

Basic elements of the BeautifulSoup class

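Element          Description
Tag              A tag, the most basic unit of information; opened with <> and closed with </>
Name             The tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes       The tag's attributes, organized as a dict. Format: <tag>.attrs
NavigableString  The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
Comment          The comment portion of a tag's string; a special type of NavigableString

The session below walks through these elements on the demo page:
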
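The demo page contains no HTML comments, so the Comment element is easiest to see on a small handcrafted snippet. The sketch below uses made-up markup. It shows that <tag>.string returns the comment text without the <!-- --> markers, so checking the type is the reliable way to tell a comment from ordinary text. is_comment is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Made-up markup: one tag holding a comment, one holding ordinary text.
markup = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string   # a Comment object
text = soup.p.string      # a plain NavigableString

print(repr(comment), type(comment))
print(repr(text), type(text))


def is_comment(node):
    """Return True if node is an HTML comment rather than ordinary text."""
    return isinstance(node, Comment)


print(is_comment(comment), is_comment(text))
```

Because Comment is a subclass of NavigableString, an isinstance check against NavigableString alone does not separate the two.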
>>> import requests

>>> r = requests.get("http://python123.io/ws/demo.html")

>>> demo = r.text

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(demo , "html.parser")

>>> soup.title

<title>This is a python demo page</title>

>>> tag = soup.a

>>> tag

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

>>> tag.name

'a'

>>> tag.parent.name

'p'

>>> tag.parent.parent.name

'body'

>>> tag.attrs

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

>>> tag.attrs['class']

['py1']

>>> type(tag.attrs)

<class 'dict'>

>>> type(tag)

<class 'bs4.element.Tag'>

>>> tag.string

'Basic Python'

>>> soup.p

<p class="title"><b>The demo python introduces several python courses.</b></p>

>>> soup.p.string

'The demo python introduces several python courses.'

>>> type(soup.p.string)

<class 'bs4.element.NavigableString'>
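In the session above, soup.p.string works because the first <p> contains only a single <b>, which in turn contains a single string. For a tag with mixed children, such as the second <p> on the demo page, .string is None. The sketch below uses a shortened copy of the demo markup to show the difference, with get_text() as one way to collect all of the text. find_all and get_text are real bs4 methods, but they are not used in the original session.

```python
from bs4 import BeautifulSoup

# Shortened copy of the demo page's two paragraphs.
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful language: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and more.</p>'
)
soup = BeautifulSoup(html, "html.parser")

title_p, course_p = soup.find_all("p")

single = title_p.string            # <p> -> <b> -> one string, so .string finds it
mixed = course_p.string            # text plus an <a> child, so .string is None
full_text = course_p.get_text()    # concatenates every string under the tag

print(single)
print(mixed)
print(full_text)
```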

 
