Beautiful_Soup Self-Study Notes 001 -- Creating a Beautiful Soup Object, Features Argument, TreeBuilder & Parsers

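1. Creating a Beautiful Soup Object

Method 1: create from a string

# Method 1 -- Create from String
from bs4 import BeautifulSoup

hello = "<p>Hello</p>"
soup_str = BeautifulSoup(hello)  # no parser named: the best installed HTML parser is used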

Method 2: create from a URL

# Method 2 -- Create from URL
import requests

url = 'https://mcc.osu.edu/events.aspx'
page = requests.get(url)  # Fetch the webpage with a GET request
soup_url = BeautifulSoup(page.text, "html.parser")

Note: the "html.parser" here is the features argument; it is explained in detail in the TreeBuilder Class section below.

Method 3: create from a file

# Method 3 -- Create from File
with open("foo.html", "r") as foo_file:
    soup_file = BeautifulSoup(foo_file)
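
As a quick check that Method 1 and Method 3 build the same tree, the sketch below writes the markup to a temporary file and parses it both ways. The temporary directory and the explicit "html.parser" argument are assumptions added for this illustration so that it runs anywhere without a pre-existing foo.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<p>Hello</p>"

# Method 1: parse directly from a string, naming the parser explicitly.
soup_str = BeautifulSoup(html, "html.parser")

# Method 3: write the same markup to a temporary file, then pass the open handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "foo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, "r", encoding="utf-8") as foo_file:
        soup_file = BeautifulSoup(foo_file, "html.parser")

print(soup_str.p.string)                # text inside the <p> tag
print(str(soup_str) == str(soup_file))  # both routes give the same document
```

Passing the open file handle (rather than its contents) lets Beautiful Soup read the file itself; the handle only needs to stay open until the constructor returns.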

2. About the TreeBuilder Class

The TreeBuilder class is used for creating the HTML/XML tree from the input document.

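  • When creating the object, pass a "features argument" (e.g. html, xml, etc.). If it is omitted, BeautifulSoup defaults to an HTML parser.
  • BeautifulSoup picks the most suitable TreeBuilder for the given argument, following parser priority among the parsers that are installed.
  • For example:
    features argument: html
    BeautifulSoup's parser priority is: lxml > html5lib > html.parser
    so, following that parser priority, the TreeBuilder priority is:
    • LXMLTreeBuilder > HTML5TreeBuilder > HTMLParserTreeBuilder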
# Example Code -- Features specified as xml (requires lxml to be installed)
soup_xml = BeautifulSoup(hello, features="xml")
soup_xml = BeautifulSoup(hello, "xml")  # same call, features passed positionally
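
To see which TreeBuilder was actually chosen, one option is to inspect the soup's builder attribute, or to ask the registry in bs4.builder directly. This is a minimal sketch for inspection only: builder_name is a hypothetical helper introduced here, and the result for the generic "html" feature depends on which parsers are installed on the machine.

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry


def builder_name(feature):
    """Return the class name of the TreeBuilder registered for a feature, or None."""
    builder_class = builder_registry.lookup(feature)
    return builder_class.__name__ if builder_class is not None else None


# Naming a specific parser pins the TreeBuilder.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
print(type(soup.builder).__name__)  # HTMLParserTreeBuilder

# A generic feature such as "html" resolves to the highest-priority
# builder among the parsers installed on this machine.
print(builder_name("html"))
```

On a machine with only the standard library parser available, both lines print HTMLParserTreeBuilder; installing lxml or html5lib changes the second one.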

  • A Better Practice – specify parser

Different parsers can produce different trees from the same input, so naming the parser explicitly makes the result more predictable.

“It is good to specify the parser by giving the features argument because this helps to ensure that the input is processed in the same manner across different machines”
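
The difference between parsers shows up most clearly on invalid markup. The sketch below feeds the same broken snippet to each parser and skips any that are not installed; the snippet and the results dictionary are illustrative choices, not part of the original notes.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid HTML: a closing tag with no matching opening tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed here

for parser, output in results.items():
    print(f"{parser:12} -> {output}")
```

html.parser drops the stray closing tag and yields <a></a>, while lxml and html5lib, when available, wrap the fragment in a full document, each repairing it in its own way.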
