1. Tag: a Tag is an HTML tag in the document together with the content between its opening and closing tags.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# .string returns the text inside a tag. Because soup.p.b only reaches the
# first matching tag, this gives the first match only, not every match.
print(soup.p.b.string)  # The Dormouse's story
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)        # <title>The Dormouse's story</title>
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)         # <head><title>The Dormouse's story</title></head>
print(soup.p)            # <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
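A minimal sketch of how .string behaves in a few edge cases, since attribute-style access such as soup.p only ever reaches the first matching tag. The shortened HTML and the variable names here are assumptions made for illustration; they are not part of the original notes.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Shortened, hypothetical version of the sample document.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a id="link1"><!-- Elsie --></a> and
<a id="link2">Lacie</a> lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A tag with exactly one child string: .string returns that string.
title_text = soup.p.b.string
print(type(title_text).__name__)  # NavigableString

# A tag whose only child is another tag inherits that child's .string.
inherited = soup.p.string

# A tag with several children has no single string, so .string is None.
second_p = soup.find_all("p")[1]
print(second_p.string)  # None

# When the only child is an HTML comment, .string is a Comment object,
# so check the type before treating it as visible text.
first_link_text = soup.a.string
print(isinstance(first_link_text, Comment))  # True
```

When a tag has mixed children, get_text() is usually the safer way to collect its visible text.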
Getting attribute values
soup = BeautifulSoup(html, 'html.parser')
print(soup.a['href'])        # http://example.com/elsie
# or, equivalently, through the attrs dictionary:
print(soup.a.attrs['href'])  # http://example.com/elsie
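The two lookups above raise KeyError when the attribute is absent. The sketch below, built on a single hypothetical link taken from the sample document, shows the dictionary-style alternatives; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# attrs is a plain dict of every attribute on the tag.
print(link.attrs)

# class is a multi-valued attribute, so it comes back as a list.
classes = link["class"]
print(classes)  # ['sister']

# .get() returns None (or a supplied default) instead of raising KeyError.
missing = link.get("title")
fallback = link.get("title", "no title")

# has_attr() tests for presence without reading the value.
has_id = link.has_attr("id")
```

Using .get() is convenient when scraping pages where some tags lack the attribute you are after.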
2. find_all usage: the standard selector, which searches the document by tag name, attributes, or text content.
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('ul'))
# [<ul class="list" id="list-1">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>, <ul class="list list-small" id="list-2">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# </ul>]
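The example above searches by tag name only. As a rough sketch of the other two search modes mentioned in the heading (attributes and text content), the following reuses the two lists from the sample HTML. The variable names are illustrative, and the string= keyword assumes a reasonably recent Beautiful Soup 4 release (older versions spell it text=).

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Search by attribute: pass a dict through attrs.
by_id = soup.find_all(attrs={"id": "list-1"})

# Search by CSS class: class_ avoids the Python keyword "class" and
# matches any one of a tag's classes.
small_lists = soup.find_all("ul", class_="list-small")
all_lists = soup.find_all(class_="list")

# Search by text content: returns the matching strings, not the tags.
foo_strings = soup.find_all(string="Foo")

# limit caps the number of results.
first_two_items = soup.find_all("li", limit=2)

# Each result is a Tag, so find_all can be called again on it.
item_texts = [li.get_text() for ul in soup.find_all("ul") for li in ul.find_all("li")]
print(item_texts)  # ['Foo', 'Bar', 'Jay', 'Foo', 'Bar']
```

Because find_all always returns a list, an unmatched search yields an empty list rather than None, which keeps loops over the result safe.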