```python
from bs4 import BeautifulSoup
```

# Tag selection summary

Selecting a tag by name (for example `soup.p`) always returns the first match. Navigation attributes that yield a single node return that element directly; attributes that yield several nodes return a list (`.contents`) or an iterator (`.children`, `.descendants`), which can be numbered with `enumerate`. A line break between two tags appears in the tree as its own whitespace text node, such as `"\n"`.

# Tag selectors

### Selecting elements (only the first matching tag is returned)

```python
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.title)        # first <title> tag
print(type(soup.title))  # a bs4 Tag object
print(soup.p)            # first <p> tag only
print(soup.a)            # first <a> tag only
```

    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

## Getting the name

```python
print(soup.title.name)
```

    title

## Getting attributes

```python
print(soup.p["name"])        # index the tag like a dict
print(soup.p.attrs["name"])  # or go through the attrs dict
```

    dromouse
    dromouse

## Getting the content

```python
print(soup.p.string)
print(soup.p.get_text())
```

    The Dormouse's story
    The Dormouse's story

# Nested selection

```python
print(soup.head.title.string)
```

    The Dormouse's story

## Child nodes (returned as a list) and descendant nodes

`.contents` returns the direct children of a tag as a list.

```python
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.p.contents)
print(len(soup.p.contents))
```

    ['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
    7

## children

`.children` returns an iterator over the direct child nodes. Wrapping it in `enumerate` pairs each child with its index.

```python
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
```

    <list_iterator object at 0x00000137FD009908>
    0 Once upon a time there were three little sisters; and their names were
    1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a>
    2
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4 and
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 and they lived at the bottom of a well.

## descendants

`.descendants` returns a generator over all descendant nodes (children, grandchildren, and so on). Again, `enumerate` pairs each node with its index.

```python
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
```

    <generator object descendants at 0x00000137FD0261A8>
    0 Once upon a time there were three little sisters; and their names were
    1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a>
    2
    3 <span>Elsie</span>
    4 Elsie
    5
    6
    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9 and
    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12 and they lived at the bottom of a well.

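The whitespace text nodes that show up among the children and descendants are often unwanted. The following is a minimal sketch of one way to keep only real tags, by checking each node against `bs4.element.Tag`. The `snippet` markup is a shortened, made-up variant of the example document, and the built-in `"html.parser"` is used here instead of `lxml` purely so the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, illustrative markup (not the full document used above).
snippet = """<p class="story">
<a id="link1"><span>Elsie</span></a>
<a id="link2">Lacie</a>
</p>"""

soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

# .children yields whitespace strings as well as tags.
all_children = list(p.children)

# Keep only Tag objects to skip the "\n" text nodes.
tag_children = [child for child in p.children if isinstance(child, Tag)]

# The same filter works on .descendants, which also walks into nested tags.
tag_descendants = [node.name for node in p.descendants if isinstance(node, Tag)]

print(len(all_children), len(tag_children))
print([tag["id"] for tag in tag_children])
print(tag_descendants)
```

Filtering with `isinstance` keeps the traversal order intact, so the remaining tags can still be numbered with `enumerate` if an index is needed.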
## Parent and ancestor nodes

```python
print(soup.a.parent)
```

    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>

```python
print(soup.a.parents)
print(list(enumerate(soup.a.parents)))
```

    <generator object parents at 0x00000137FD026308>
    [(0, <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>), (1, <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body>), (2, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>), (3, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>)]

## Sibling nodes

```python
print(list(enumerate(soup.a.previous_siblings)))
print(list(enumerate(soup.a.next_siblings)))
```

    [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
    [(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]

# Standard selectors

# find_all(name,attrs,recursive,text,**kwargs)

### name (search by tag name)

```python
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, "lxml")
print(soup.find_all("ul"))     # list of every <ul> tag
print(soup.find_all("ul")[0])  # the first <ul> tag
```

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>

### attrs (search by attribute)

```python
print(soup.find_all(attrs={"class": "element"}))
print(soup.find_all(attrs={"class": "list"}))
```

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]

#### Shortcuts for class and id

```python
print(soup.find_all(class_="list"))  # class is a Python keyword, hence class_
print(soup.find_all(id="list-2"))
```

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    [<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]

### text (search by content; returns only the matching strings, not the whole tags)

```python
print(soup.find_all(text="Foo"))
```

    ['Foo', 'Foo']

# find(name,attrs,recursive,text,**kwargs)

`find` takes the same arguments as `find_all` but returns only the first match.

## find_parents(), find_parent()

These search the ancestor nodes and the parent node, respectively.

## find_next_siblings(), find_next_sibling(), find_previous_siblings(), find_previous_sibling()

These return, respectively, all following siblings, the first following sibling, all preceding siblings, and the first preceding sibling.

They are used quite differently from the `.next_siblings` and `.previous_siblings` attributes of a directly selected tag. The attributes are generators that yield every sibling, including whitespace text nodes, whereas these are methods that accept a filter such as a tag name, as the code below shows.

```python
html2 = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element1">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup2 = BeautifulSoup(html2, 'lxml')
link = soup2.find(class_="element1")
print(link)
print(link.find_previous_siblings("li"))  # only <li> siblings before link
print(link.find_next_siblings("li"))      # only <li> siblings after link
```

    <li class="element1">Bar</li>
    [<li class="element">Foo</li>]
    [<li class="element">Jay</li>]

## find_all_next(), find_next(), find_all_previous(), find_previous()

These return, respectively, all matching nodes after the current one, the first matching node after it, all matching nodes before it, and the first matching node before it.

# CSS selectors

In a selector passed to `select()`, a class name starts with `.` and an id starts with `#`; nested selectors are separated by spaces. All matches are returned as a list.

```python
print(soup.select(".panel .panel-heading"))
print(soup.select("#list-1 .element"))
```

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

```python
import requests
import re, json
from bs4 import BeautifulSoup

url = "https://www.toutiao.com/a6467787316680196622/"
html = requests.get(url).text
# print(html)

def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    # select() returns a list, so guard against an empty result
    result = soup.select('title')
    title = result[0].get_text() if result else ''
    # The image data is embedded in a script as: var gallery = {...};
    images_pattern = re.compile('var gallery = (.*?);', re.S)
    result = re.search(images_pattern, html)
    if result:
        data = json.loads(result.group(1))
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            # for image in images: download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }
    # Falls through and implicitly returns None when no gallery data is found

print(parse_page_detail(html, url))
```

    None
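The `None` above is what the function returns whenever the gallery data cannot be found. As a minimal sketch of the regex-plus-JSON step on its own, the example below runs the same pattern against an inline string and returns an empty list instead of `None`. The `sample_html` fragment, its URLs, and the helper name `extract_image_urls` are made up for illustration and are not taken from the real page.

```python
import json
import re

# Hypothetical page fragment, invented for this sketch only.
sample_html = """
<script>
var gallery = {"sub_images": [{"url": "http://example.com/1.jpg"}, {"url": "http://example.com/2.jpg"}]};
var other = 1;
</script>
"""

# Same pattern as in the code above: capture everything up to the first ';'.
GALLERY_PATTERN = re.compile(r"var gallery = (.*?);", re.S)


def extract_image_urls(html):
    """Return the image URLs embedded in the page, or [] if none are found."""
    match = GALLERY_PATTERN.search(html)
    if not match:
        return []
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        # The captured text was not valid JSON (for example, it was cut short).
        return []
    if not isinstance(data, dict):
        return []
    return [item.get("url") for item in data.get("sub_images", [])]


print(extract_image_urls(sample_html))
# → ['http://example.com/1.jpg', 'http://example.com/2.jpg']
print(extract_image_urls("<html></html>"))
# → []
```

Because the pattern is non-greedy, it stops at the first semicolon, so a payload that itself contains `;` would be truncated; the `JSONDecodeError` branch covers that case by returning an empty list.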
Beautiful Soup Study Notes