BeautifulSoup Study Notes

独行特立喵
Article link: https://blog.csdn.net/u014197417/article/details/78175847
```python
from bs4 import BeautifulSoup
```

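# Tag selection summary

Accessing a tag as an attribute (for example `soup.p`) always returns the first matching tag. Properties that resolve to a single node, such as `.parent`, return that element directly; properties that can resolve to several nodes, such as `.children`, `.descendants`, `.parents`, and the sibling iterators, return an iterator that can be walked with `enumerate`. A line break between two tags shows up as a whitespace-only text node such as "\n    ", not as a tag.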

# Tag selectors

### Selecting an element (returns only the first matching tag)


```python
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,"lxml")
print(soup.title)
print(type(soup.title))
print(soup.p)
print(soup.a)
```

    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
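
Attribute-style access such as `soup.a` behaves like `soup.find("a")`: it yields the first match, or `None` when the document has no such tag. The following is a minimal sketch, assuming the built-in `html.parser` backend (swapped in for `lxml` only to avoid the extra dependency) and a made-up two-link snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two <a> tags and no <table>.
snippet = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

first_link = soup.a      # first <a> in document order
missing = soup.table     # there is no <table>, so this is None

print(first_link["id"])
print(first_link == soup.find("a"))
print(missing)
```

Because a missing tag gives `None`, chained access such as `soup.table.tr` would raise `AttributeError`, so it can be worth checking for `None` first.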
    

## Getting the tag name


```python
print(soup.title.name)
```

    title
    

## Getting attributes


```python
print(soup.p["name"])
print(soup.p.attrs["name"])
```

    dromouse
    dromouse
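
Two details of attribute access are easy to miss: `class` is treated as a multi-valued attribute and comes back as a list, and `.get()` avoids a `KeyError` for attributes that are absent. A small sketch, assuming `html.parser` and a hypothetical one-tag snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two classes and a name attribute.
snippet = '<p class="title big" name="dromouse">text</p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

name_attr = p["name"]        # ordinary attribute: a plain string
class_attr = p["class"]      # multi-valued attribute: a list of strings
missing_attr = p.get("id")   # .get() returns None instead of raising KeyError
all_attrs = p.attrs          # dict of every attribute on the tag

print(name_attr, class_attr, missing_attr)
print(all_attrs)
```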
    

## Getting the text content


```python
print(soup.p.string)
print(soup.p.get_text())
```

    The Dormouse's story
    The Dormouse's story
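
`.string` and `.get_text()` agree in the example above only because the `<p>` has a single chain of children. When a tag has more than one child, `.string` is `None`, while `.get_text()` still concatenates all text. A sketch under the assumption of `html.parser` and an invented snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two children: a <b> tag and a text node.
snippet = "<p><b>bold</b> and plain</p>"
soup = BeautifulSoup(snippet, "html.parser")

single = soup.b.string      # <b> has exactly one child, so its string is returned
multi = soup.p.string       # <p> has two children, so .string is None
text = soup.p.get_text()    # concatenates every text node under <p>

print(repr(single), repr(multi), repr(text))
```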
    

# Nested selection


```python
print(soup.head.title.string)
```

    The Dormouse's story
    

## Child nodes (returned as a list) and descendant nodes


```python
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html,"lxml")
print(soup.p.contents)
print(len(soup.p.contents))
```

    ['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
    7
    

## `.children` returns an iterator over the direct child nodes; wrap it in `enumerate` to get an index with each child


```python
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)
```

    <list_iterator object at 0x00000137FD009908>
    0 
                Once upon a time there were three little sisters; and their names were
                
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2 
    
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4  
                and
                
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 
                and they lived at the bottom of a well.
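
The whitespace-only entries in the output above are text nodes, not tags. If only the element children are wanted, one option is to filter by type. A sketch, assuming `html.parser` and a hypothetical snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph whose tags are separated by line breaks.
snippet = """<p>
    <a id="link1">Elsie</a>
    <a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(snippet, "html.parser")

all_children = list(soup.p.children)  # text nodes and tags interleaved
tag_children = [c for c in soup.p.children if isinstance(c, Tag)]
# find_all(True, recursive=False) also keeps only direct child tags.
same_via_find_all = soup.p.find_all(True, recursive=False)

print(len(all_children))
print([t["id"] for t in tag_children])
```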
            
    

## `.descendants` returns a generator over all descendant nodes; wrap it in `enumerate` to get an index with each node


```python
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

```

    <generator object descendants at 0x00000137FD0261A8>
    0 
                Once upon a time there were three little sisters; and their names were
                
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2 
    
    3 <span>Elsie</span>
    4 Elsie
    5 
    
    6 
    
    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9  
                and
                
    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12 
                and they lived at the bottom of a well.
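
The difference between the two properties is depth: `.children` stops at the direct children, while `.descendants` walks the whole subtree in document order, including the text inside nested tags. A small sketch, assuming `html.parser` and an invented snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph: text, a link wrapping a span, then more text.
snippet = "<p>Once <a><span>Elsie</span></a> end</p>"
soup = BeautifulSoup(snippet, "html.parser")

children = list(soup.p.children)        # 'Once ', <a>, ' end'
descendants = list(soup.p.descendants)  # also <span> and 'Elsie'
descendant_tags = [n.name for n in descendants if isinstance(n, Tag)]

print(len(children), len(descendants))
print(descendant_tags)
```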
            
    

## Parent and ancestor nodes


```python
print(soup.a.parent)
```

    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    


```python
print(soup.a.parents)
print(list(enumerate(soup.a.parents)))
```

    <generator object parents at 0x00000137FD026308>
    [(0, <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>), (1, <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    <p class="story">...</p>
    </body>), (2, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    <p class="story">...</p>
    </body></html>), (3, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    <p class="story">...</p>
    </body></html>)]
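
The last two entries in the output above print the same markup because the final ancestor is the `BeautifulSoup` object itself, which renders as the whole document. Printing only the names makes the chain easier to read. A sketch, assuming `html.parser` and a hypothetical minimal page:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page with explicit <html> and <body> tags.
snippet = '<html><body><p><a href="#">x</a></p></body></html>'
soup = BeautifulSoup(snippet, "html.parser")

# Walk from the nearest ancestor up to the document root.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The name `[document]` identifies the `BeautifulSoup` object at the top of the tree.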
    

## Sibling nodes


```python
print(list(enumerate(soup.a.previous_siblings)))
print(list(enumerate(soup.a.next_siblings)))
```

    [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]
    [(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
    

# Standard selectors

# find_all(name,attrs,recursive,text,**kwargs)

### name (search by tag name)


```python
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html,"lxml")
print(soup.find_all("ul"))
print(soup.find_all("ul")[0])
```

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    

### attrs (search by attributes)


```python
print(soup.find_all(attrs = {"class":"element"}))
print(soup.find_all(attrs = {"class":"list"}))
```

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    

#### Shortcuts for class and id


```python
print(soup.find_all(class_ = "list"))
print(soup.find_all(id = "list-2"))
```

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    [<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
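
As the output shows, `class_="list"` also matches the tag whose class attribute is `list list-small`, because each class is tested individually. A sketch of the matching rules, assuming `html.parser` and a trimmed-down, hypothetical version of the markup:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down version of the two lists.
snippet = (
    '<ul class="list" id="list-1"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

# Matches any tag that has "list" among its classes.
by_list = [ul["id"] for ul in soup.find_all(class_="list")]
# Matches only tags that have "list-small" among their classes.
by_small = [ul["id"] for ul in soup.find_all(class_="list-small")]
# The full attribute value can also be matched as an exact string.
by_exact = [ul["id"] for ul in soup.find_all(class_="list list-small")]

print(by_list, by_small, by_exact)
```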
    

### text (search by text content; returns the matching strings, not the enclosing tags)


```python
print(soup.find_all(text = "Foo"))
```

    ['Foo', 'Foo']
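
Since the search returns strings, the enclosing tag has to be reached through `.parent`. The sketch below assumes `html.parser`, an invented list, and the `string` keyword, which is the name newer Beautiful Soup releases document for this argument (`text` is the older spelling used above).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical list with two items whose text is "Foo".
snippet = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="other">Foo</li></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

matches = soup.find_all(string="Foo")                    # matching text nodes
parent_classes = [s.parent["class"] for s in matches]    # tags that hold them
regex_matches = soup.find_all(string=re.compile("^Ba"))  # pattern match on text

print(matches, parent_classes, regex_matches)
```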
    

# find(name, attrs, recursive, text, **kwargs): returns only the first match
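
The two methods also differ when nothing matches: `find` gives `None`, while `find_all` gives an empty list. A short sketch, assuming `html.parser` and a hypothetical two-item list:

```python
from bs4 import BeautifulSoup

# Hypothetical list with two items and no <table>.
snippet = "<ul><li>Foo</li><li>Bar</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

first_li = soup.find("li")                  # a single Tag
all_li = soup.find_all("li")                # a list of Tags
no_table = soup.find("table")               # None when nothing matches
no_tables = soup.find_all("table")          # empty list when nothing matches
limited = soup.find_all("li", limit=1)      # still a list, with one element

print(first_li, all_li, no_table, no_tables, limited)
```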

## find_parents(),find_parent()
Find all matching ancestor nodes, and the nearest matching parent node, respectively.
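
A brief sketch of both calls, assuming `html.parser` and a reduced, hypothetical version of the panel markup used in these notes:

```python
from bs4 import BeautifulSoup

# Hypothetical nesting: div.panel > div.panel-body > ul#list-1 > li
snippet = (
    '<div class="panel"><div class="panel-body">'
    '<ul id="list-1"><li>Foo</li></ul>'
    '</div></div>'
)
soup = BeautifulSoup(snippet, "html.parser")
li = soup.find("li")

nearest_ul = li.find_parent("ul")     # closest enclosing <ul>
nearest_div = li.find_parent("div")   # closest enclosing <div>
all_divs = li.find_parents("div")     # every enclosing <div>, innermost first

print(nearest_ul["id"], nearest_div["class"])
print([d["class"] for d in all_divs])
```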

## find_next_siblings(),find_next_sibling(),find_previous_siblings(),find_previous_sibling()
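These return, respectively, all following siblings, the first following sibling, all preceding siblings, and the first preceding sibling that match the given filter.
They are used quite differently from the `.next_siblings` and `.previous_siblings` properties of direct tag selection, which take no filter and also yield the text nodes between tags; see the code below.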


```python
html2='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element1">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup2 = BeautifulSoup(html2, 'lxml')
link = soup2.find(class_ = "element1")
print(link)
print(link.find_previous_siblings("li"))
print(link.find_next_siblings("li"))
```

    <li class="element1">Bar</li>
    [<li class="element">Foo</li>]
    [<li class="element">Jay</li>]
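
To make the contrast with the plain sibling properties concrete, the sketch below runs both on the same node. It assumes `html.parser` and a hypothetical list whose items are separated by line breaks.

```python
from bs4 import BeautifulSoup

# Hypothetical list; the "\n" between items become text nodes.
snippet = (
    '<ul>\n<li class="a">Foo</li>\n'
    '<li class="b">Bar</li>\n'
    '<li class="c">Jay</li>\n</ul>'
)
soup = BeautifulSoup(snippet, "html.parser")
bar = soup.find(class_="b")

raw_next = bar.next_sibling               # the "\n" text node
tag_next = bar.find_next_sibling("li")    # skips text, returns the next <li>
raw_all = list(bar.next_siblings)         # "\n", <li class="c">, "\n"
tag_all = bar.find_next_siblings("li")    # only the matching <li> tags

print(repr(raw_next), tag_next)
print(len(raw_all), len(tag_all))
```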
    



## find_all_next(),find_next(),find_all_previous(),find_previous()
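These return, respectively, all matching nodes after the current node, the first matching node after it, all matching nodes before it, and the first matching node before it.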
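
Unlike the sibling methods, these search the rest of the document in parse order, so they cross parent boundaries. A sketch, assuming `html.parser` and an invented snippet with two lists:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading followed by two separate lists.
snippet = (
    "<h4>Hello</h4>"
    '<ul id="one"><li>Foo</li><li>Bar</li></ul>'
    '<ul id="two"><li>Baz</li></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")
foo = soup.find("li")
baz = soup.find("ul", id="two").li

# Sibling search stays inside the first <ul>.
siblings_after = [li.string for li in foo.find_next_siblings("li")]
# Document-order search continues into the second <ul>.
all_after = [li.string for li in foo.find_all_next("li")]
# Backward searches return the nearest match first.
prev_one = baz.find_previous("li").string
prev_all = [li.string for li in baz.find_all_previous("li")]
heading = baz.find_previous("h4").string

print(siblings_after, all_after)
print(prev_one, prev_all, heading)
```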

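# CSS selectors: class selectors start with `.`, id selectors start with `#`; separate nested selectors with spaces; `select()` returns all matches as a list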


```python
print(soup.select(".panel .panel-heading"))
print(soup.select("#list-1 .element"))
```

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
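
A compact sketch of the selector syntax, assuming `html.parser` and a shortened, hypothetical version of the panel markup. `select_one` is the single-result counterpart of `select`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the panel markup.
snippet = (
    '<div class="panel">'
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Jay</li></ul>'
    '</div>'
)
soup = BeautifulSoup(snippet, "html.parser")

by_class = soup.select(".list")               # "." selects by class
by_id = soup.select("#list-2")                # "#" selects by id
nested = [li.get_text() for li in soup.select("#list-1 .element")]
first_text = soup.select_one(".element").get_text()  # first match only

print(len(by_class), len(by_id), nested, first_text)
```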
    


```python
import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.toutiao.com/a6467787316680196622/"
html = requests.get(url).text
# print(html)

def parse_page_detail(html, url):
    """Extract the page title and gallery image URLs; return None if no gallery data is found."""
    soup = BeautifulSoup(html, 'lxml')
    result = soup.select('title')
    title = result[0].get_text() if result else ''
    # The gallery data is expected as a JavaScript variable embedded in the page.
    images_pattern = re.compile('var gallery = (.*?);', re.S)
    result = re.search(images_pattern, html)
    if result:
        data = json.loads(result.group(1))
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            # for image in images: download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }
    return None  # no gallery data matched in the fetched HTML

print(parse_page_detail(html, url))
```

    None
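
The function prints `None` whenever the fetched HTML has no `var gallery = ...;` assignment or the decoded data lacks `sub_images`. The same title-plus-regex extraction can be tried offline on a made-up page. Everything in `sample_html` below is hypothetical, `extract_gallery` is an illustrative helper rather than part of the original code, and `html.parser` is assumed in place of `lxml`.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical page: the shape of the embedded data is an assumption.
sample_html = (
    '<html><head><title>Demo gallery</title></head><body><script>'
    'var gallery = {"sub_images": [{"url": "http://example.com/1.jpg"}, '
    '{"url": "http://example.com/2.jpg"}]};'
    '</script></body></html>'
)

def extract_gallery(html):
    """Return the title and image URLs, or None if no gallery data is embedded."""
    soup = BeautifulSoup(html, "html.parser")
    title_tags = soup.select("title")
    title = title_tags[0].get_text() if title_tags else ""
    # Capture everything between "var gallery = " and the first semicolon.
    match = re.search(r"var gallery = (.*?);", html, re.S)
    if not match:
        return None
    data = json.loads(match.group(1))
    if not data or "sub_images" not in data:
        return None
    return {"title": title, "images": [item.get("url") for item in data["sub_images"]]}

found = extract_gallery(sample_html)
not_found = extract_gallery("<html><head><title>No gallery</title></head></html>")
print(found)
print(not_found)
```

Note that the non-greedy pattern stops at the first semicolon, so it would truncate embedded data that itself contains a `;`.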