Beautiful_Soup Self-Study Notes 001 -- Creating a Beautiful Soup Object, Features Argument, TreeBuilder & Parsers

KnightHacker2077

Category: Artificial Intelligence  Tags: python

Article link: https://blog.csdn.net/DOITJT/article/details/111386619

Beautiful_Soup Self-Study Notes 001

1. Creating a Beautiful Soup object

Method 1: create from a string

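# Method 1 -- Create from a string
from bs4 import BeautifulSoup

hello = "<p>Hello</p>"
# No parser named: bs4 picks the best one installed and emits a GuessedAtParserWarning
soup_str = BeautifulSoup(hello)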

Method 2: create from a URL

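# Method 2 -- Create from a URL
import requests
from bs4 import BeautifulSoup

url = 'https://mcc.osu.edu/events.aspx'
page = requests.get(url)  # Fetch the page with a GET request
page.raise_for_status()   # Stop early on an HTTP error status
soup_url = BeautifulSoup(page.text, "html.parser")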

Note: the "html.parser" string here is the features argument. It is covered in detail in the TreeBuilder class section below.

Method 3: create from a file

# Method 3 -- Create from an open file handle
with open("foo.html", "r") as foo_file:
    soup_file = BeautifulSoup(foo_file)
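
The string and file methods can be checked against each other without any network access. The following is a minimal sketch, assuming bs4 is installed; the temporary directory and the variable names are made up for illustration, and "html.parser" is named explicitly so the run does not depend on which optional parsers are present.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<p>Hello</p>"

# Method 1: parse the markup straight from a string
soup_from_str = BeautifulSoup(markup, "html.parser")

# Method 3: write the same markup to a throwaway file, then parse the open handle
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "foo.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(markup)
    with open(path, "r", encoding="utf-8") as foo_file:
        soup_from_file = BeautifulSoup(foo_file, "html.parser")

print(soup_from_str.p.string)
print(str(soup_from_str) == str(soup_from_file))
```

Both objects should hold the same tree, since BeautifulSoup accepts either a string or an open file handle as its first argument.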

2. About the TreeBuilder class

The TreeBuilder class is used for creating the HTML/XML tree from the
input document

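When creating the object, pass a "features argument" (e.g. html, xml, etc.). If it is omitted, Beautiful Soup falls back to the best HTML parser that is installed.
Beautiful Soup picks the most suitable TreeBuilder for the given argument, following the parser priority.
For example:
features argument: html
Parser priority used by Beautiful Soup: lxml > html5lib > html.parser
Following that parser priority, the TreeBuilder priority is:
- LXMLTreeBuilder > HTML5TreeBuilder > HTMLParserTreeBuilder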
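
To see which TreeBuilder was actually chosen, one option is to inspect the `builder` attribute of the soup object. The sketch below assumes bs4 is installed; the second half uses `builder_registry.lookup` from `bs4.builder`, whose result depends on which optional parsers (lxml, html5lib) are present on the machine, so no fixed output is claimed for it.

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry

# Naming "html.parser" pins the builder to the standard-library one
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
builder_name = type(soup.builder).__name__
print(builder_name)  # HTMLParserTreeBuilder

# Ask the registry which builder class it ranks first for the generic "html" feature.
# The answer varies by machine: it depends on whether lxml or html5lib is installed.
best_html_builder = builder_registry.lookup("html")
print(best_html_builder.__name__)
```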

# Example Code -- Features specified as xml (requires the lxml package)
soup_xml = BeautifulSoup(hello, features="xml")
soup_xml = BeautifulSoup(hello, "xml")  # same call, features passed positionally

A Better Practice – specify parser

Different parsers can build different trees from the same document, so naming the parser explicitly makes the result more predictable.

“It is good to specify the parser by giving the features argument because this helps to ensure that the input is processed in the same manner across different machines”
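
The difference between parsers shows up most clearly on invalid markup. The sketch below feeds the same broken snippet to each parser and skips any that are not installed; the snippet and the variable names are illustrative choices, and only the html.parser result is stated because the other two depend on optional packages being available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken_markup = "<a></p>"  # a stray closing tag with no matching opener

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_markup, parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed
        results[parser] = None

for parser, rendered in results.items():
    shown = rendered if rendered is not None else "not installed"
    print(f"{parser:12} -> {shown}")
```

With html.parser the stray `</p>` is dropped and the result is `<a></a>`. The other parsers, when installed, may wrap or repair the fragment differently, which is why pinning one parser keeps behaviour consistent across machines.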