Python Web Crawlers (4): The Beautiful Soup Library

Latest recommended article published on 2019-05-01 10:56:03

AI阿聪

Category: Python Crawlers
Tags: Python, web crawler, Beautiful Soup

Article link: https://blog.csdn.net/weixin_40431584/article/details/89066394

1. Installation

Run the following command in a terminal to install the library:

pip install beautifulsoup4
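To confirm that the install worked, a quick check like the one below can help. It is only a minimal sketch and assumes the package was installed into the same interpreter you are running; the variable name installed_version is made up for this example.

```python
# The package is installed as "beautifulsoup4" but imported as "bs4".
import bs4

# __version__ holds the installed release as a string.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

If the import fails, pip most likely installed the package for a different Python interpreter.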

2. Practice

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")  # parse the HTML
>>> print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
  </p>
 </body>
</html>
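If the demo page cannot be reached, the same steps can be followed offline. The sketch below assumes the page content is exactly the r.text string shown earlier and pastes it in as a literal; the names demo_html and pretty are made up for this example.

```python
from bs4 import BeautifulSoup

# The demo page's HTML, copied from the r.text output so no network is needed.
demo_html = (
    '<html><head><title>This is a python demo page</title></head>\r\n'
    '<body>\r\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    'You can learn Python from novice to professional by tracking the following courses:\r\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n'
    '</body></html>'
)

# Parse with the built-in parser, then re-indent one tag per line.
soup = BeautifulSoup(demo_html, "html.parser")
pretty = soup.prettify()
print(pretty)
```

prettify() is meant for reading the structure; the extra line breaks it adds are not part of the original page.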

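3. Beautiful Soup is a library for parsing, traversing, and maintaining "tags"

The Beautiful Soup library is also called beautifulsoup4 or bs4: it is installed as beautifulsoup4 and imported as bs4.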

from bs4 import BeautifulSoup

Parsing a local file:

>>> from bs4 import BeautifulSoup
>>> soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
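Passing a bare open(...) call leaves the file handle open and relies on the platform's default encoding. A slightly safer sketch uses a with block and an explicit encoding. The temporary file below stands in for D://demo.html so that the example runs anywhere; its contents, the utf-8 encoding, and the name demo_path are assumptions made for this example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small stand-in HTML file (hypothetical content for this example).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><head><title>This is a python demo page</title></head></html>")
    demo_path = tmp.name

# The with block closes the file once BeautifulSoup has read it.
with open(demo_path, encoding="utf-8") as f:
    soup2 = BeautifulSoup(f, "html.parser")

# Clean up the stand-in file; the parsed tree no longer needs it.
os.remove(demo_path)
print(soup2.title.string)
```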

Parser	Usage	Requirement
Python's built-in HTML parser (via bs4)	BeautifulSoup(mk, 'html.parser')	install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
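Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. Here is a minimal sketch, relying on bs4 raising FeatureNotFound when the requested parser is not available; make_soup is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>demo</b></p>")
print(soup.p.b.string)
```

Different parsers can build slightly different trees from malformed HTML, so a fallback like this is best kept for well-formed pages.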

Basic elements of the BeautifulSoup class

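Element	Description
Tag	A tag, the most basic unit of information; it opens with <> and closes with </>
Name	The tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
Comment	The comment part of a string inside a tag; a special type of NavigableString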
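The last two rows are easy to mix up, because .string returns a comment's text without the <!-- --> markers. The sketch below uses a made-up HTML snippet to show that the type is what tells the two apart; the variable names are chosen only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: one tag holds a comment, the other holds plain text.
newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

comment_text = newsoup.b.string
plain_text = newsoup.p.string

# Both print as ordinary text, so inspect the type to tell them apart.
print(comment_text, type(comment_text).__name__)
print(plain_text, type(plain_text).__name__)

# Comment is a subclass of NavigableString, so test for Comment specifically.
is_comment = isinstance(comment_text, Comment)
print(is_comment)
```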

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo , "html.parser")
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> tag.name
'a'
>>> tag.parent.name
'p'
>>> tag.parent.parent.name
'body'
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>> tag.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
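One caveat about the session above: soup.p.string works there because the first <p> contains a single <b>, which in turn contains a single string. For a tag with several children, .string is None. The sketch below uses a shortened, made-up version of the demo markup to show this, together with get_text() and Tag.get() as gentler alternatives.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (hypothetical markup for this example).
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">'
    'Basic Python</a> and more.</p>'
)
soup = BeautifulSoup(html, "html.parser")

title_p = soup.p          # first <p>: a single <b> holding a single string
course_p = soup.a.parent  # second <p>: text, an <a> tag, then more text

print(title_p.string)       # the nested text is found
print(course_p.string)      # None, because the tag has several children
print(course_p.get_text())  # joins all text inside the tag

# Tag.get() returns None for a missing attribute instead of raising KeyError.
print(soup.a.get("href"))
print(soup.a.get("title"))
```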
