Beautiful Soup Self-Study Notes 001
1. Creating a Beautiful Soup object
Method 1: create from a string
# Method 1 -- Create from String
from bs4 import BeautifulSoup
hello = "<p>Hello</p>"
soup_str = BeautifulSoup(hello, "html.parser")  # name the parser explicitly
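As a quick check that the object built from a string behaves as expected, here is a minimal sketch. It assumes the bs4 package is installed and uses the built-in "html.parser"; the variable names first_p and soup are only illustrative.

```python
from bs4 import BeautifulSoup

hello = "<p>Hello</p>"
soup = BeautifulSoup(hello, "html.parser")

first_p = soup.p              # first <p> tag found in the tree
print(first_p.name)           # the tag's name
print(first_p.get_text())     # the text inside the tag
print(str(soup))              # the tree serialized back to markup
```

With "html.parser" the fragment is kept as written; other parsers may wrap it in extra tags such as html and body.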
Method 2: create from a URL
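# Method 2 -- Create from URL
import requests
from bs4 import BeautifulSoup
url = 'https://mcc.osu.edu/events.aspx'
page = requests.get(url, timeout=10)  # Get the webpage with a GET request
page.raise_for_status()  # Stop early on an HTTP error status
soup_url = BeautifulSoup(page.text, "html.parser")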
Note: the "html.parser" value here is called the features argument; it is explained in detail in the TreeBuilder class section below.
Method 3: create from a file
# Method 3 -- Create from File
from bs4 import BeautifulSoup
with open("foo.html", "r") as foo_file:
    soup_file = BeautifulSoup(foo_file, "html.parser")
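The file-based method can be tried without an existing foo.html. The sketch below first writes a small throwaway file and then parses it through an open file handle. The temporary directory, the file contents, and the name heading are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Write a small throwaway HTML file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
foo_path = Path(tmp_dir) / "foo.html"
foo_path.write_text("<html><body><h1>Title</h1></body></html>", encoding="utf-8")

# BeautifulSoup accepts an open file handle as well as a string.
with open(foo_path, "r", encoding="utf-8") as foo_file:
    soup_file = BeautifulSoup(foo_file, "html.parser")

heading = soup_file.h1.get_text()
print(heading)
```

Note that the BeautifulSoup call sits inside the with block, so the file is still open when it is read.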
2. About the TreeBuilder class
The TreeBuilder class is used for creating the HTML/XML tree from the input document.
- When creating the object, specify the features argument (e.g. html, xml). If it is omitted, BeautifulSoup defaults to an HTML parser.
- BeautifulSoup picks the most suitable TreeBuilder for the given argument, following the parser priority order.
- For example:
features argument: html
Parser priority used by BeautifulSoup: lxml > html5lib > html.parser
So the TreeBuilder priority follows the same order: LXMLTreeBuilder > HTML5TreeBuilder > HTMLParserTreeBuilder
# Example Code -- Features specified as xml (needs the lxml package installed)
soup_xml = BeautifulSoup(hello, features="xml")  # keyword form
soup_xml = BeautifulSoup(hello, "xml")  # equivalent positional form
- A Better Practice – specify parser
Different parsers can produce different results for the same input, so naming the parser makes the output more predictable.
“It is good to specify the parser by giving the features argument because this helps to ensure that the input is processed in the same manner across different machines”
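To see which TreeBuilder is actually in play, the following sketch compares the builder the registry would pick for the generic "html" feature with the builder used when the parser is pinned by name. It assumes the bs4 package is installed; the names best_html_builder and pinned_builder are illustrative, and the first printed value depends on which parsers happen to be installed on the machine.

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry

hello = "<p>Hello</p>"

# Ask the registry which TreeBuilder class it would pick for the generic
# "html" feature; the answer depends on which parsers are installed.
best_html_builder = builder_registry.lookup("html")
print(best_html_builder.__name__)

# Pinning the parser by name removes that dependence on the environment.
soup = BeautifulSoup(hello, "html.parser")
pinned_builder = type(soup.builder).__name__
print(pinned_builder)
```

The second value stays the same on every machine, which is the point of the advice quoted above.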