Web Scraping Lesson 5 (BeautifulSoup)

Latest recommended article published on 2024-08-18 10:00:00

akon_wang_hkbu

Category column: Web scraping study

BeautifulSoup

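The following content is reposted.

Beautiful Soup selectors: CSS selectors

BeautifulSoup supports most CSS selectors. Pass the selector as a string to the .select() method of a tag or of the soup object; the matches are returned as a list.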

    tag.select("string")

    soup.select("string")
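
As a minimal sketch of the calling convention described above: the two-paragraph document and all variable names below are made up for illustration, and the standard-library `html.parser` is used here only so the snippet runs without lxml (the article itself uses lxml). It also shows `select_one()`, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document, used only for this illustration.
doc = "<div><p class='x'>one</p><p>two</p></div>"
# Assumption: html.parser instead of lxml, to avoid an extra dependency.
soup = BeautifulSoup(doc, "html.parser")

paragraphs = soup.select("p")        # list of every matching tag
texts = [p.get_text() for p in paragraphs]

missing = soup.select("table")       # no match gives an empty list, not None
first = soup.select_one("p")         # first match only (None if nothing matches)

div = soup.div                       # a Tag object
inside_div = div.select("p.x")       # .select() works on a tag as well

print(texts, len(missing), first.get_text(), len(inside_div))
```

Because the result is a list, indexing with `[0]` raises `IndexError` when nothing matches; checking the length first, or using `select_one()`, avoids that.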

Example source:

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title" name="dromouse">
            <b>The Dormouse's story</b>
        </p>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="mysis" href="http://example.com/elsie" id="link1">
                <b>the first b tag</b>
                Elsie
            </a>,
            <a class="mysis" href="http://example.com/lacie" id="link2" myname="kong">
                Lacie
            </a>and
            <a class="mysis" href="http://example.com/tillie" id="link3">
                Tillie
            </a>;and they lived at the bottom of a well.
        </p>
        <p class="story">
            myStory
            <a>the end a tag</a>
        </p>
        <a>the p tag sibling</a>
    </body>
</html>
"""

soup = BeautifulSoup(html,'lxml')

    1. Selecting by tag

 
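    # Select all title tags
    soup.select("title")

    # Select every p tag that is the third p among its siblings
    soup.select("p:nth-of-type(3)")

    # Select all a tags anywhere under the body tag
    soup.select("body a")

    # Select only the a tags that are direct children of body
    soup.select("body > a")

    # Select all later siblings of id=link1 that have class mysis
    soup.select("#link1 ~ .mysis")

    # Select the sibling immediately after id=link1, if it has class mysis
    soup.select("#link1 + .mysis")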
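
To make the difference between the combinators above concrete, here is a small sketch on a stripped-down document. The markup, the ids `a1` to `a5`, the class `k`, and the helper `ids()` are all hypothetical and exist only for this example; `html.parser` is again an assumption made to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: four links inside the first p, one link directly in body.
doc = """
<body>
  <p id="p1"><a id="a1" class="k">one</a><a id="a2" class="k">two</a><a id="a3">three</a><a id="a4" class="k">four</a></p>
  <p id="p2">second</p>
  <a id="a5">direct child of body</a>
</body>
"""
soup = BeautifulSoup(doc, "html.parser")


def ids(selector):
    """Return the id attribute of every tag matched by the selector."""
    return [tag["id"] for tag in soup.select(selector)]


descendants = ids("body a")          # any depth below body
children = ids("body > a")           # direct children of body only
later_siblings = ids("#a1 ~ .k")     # all later siblings with class k
next_sibling = ids("#a1 + .k")       # only the immediately following sibling
second_p = ids("p:nth-of-type(2)")   # the p that is the 2nd p among its siblings

print(descendants, children, later_siblings, next_sibling, second_p)
```

Note that `~` skips `a3` because it lacks the class, while `+` would return an empty list if the element right after `a1` did not match `.k`.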

    2. Selecting by class name

 
    # Select the a tags whose class attribute includes mysis
    soup.select("a.mysis")
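
A short sketch of how class selectors behave when a tag carries more than one class. The markup and the second class name `big` are invented for this example, and `html.parser` is an assumption to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links and one paragraph sharing the class mysis.
doc = (
    '<a class="mysis big" id="x">A</a>'
    '<a class="mysis" id="y">B</a>'
    '<p class="mysis" id="z">C</p>'
)
soup = BeautifulSoup(doc, "html.parser")

any_tag = [t["id"] for t in soup.select(".mysis")]        # class only, any tag name
only_a = [t["id"] for t in soup.select("a.mysis")]        # a tags with class mysis
both = [t["id"] for t in soup.select("a.mysis.big")]      # a tags carrying both classes

# The same query as "a.mysis", written with find_all for comparison.
via_find_all = [t["id"] for t in soup.find_all("a", class_="mysis")]

print(any_tag, only_a, both, via_find_all)
```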

    3. Selecting by id

 
    # Select the a tag whose id attribute is link1
    soup.select("a#link1")
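
Since an id is normally unique within a page, a sketch like the following may be handy: it uses `select_one()` to get the tag directly instead of a one-element list. The markup is a hypothetical fragment, and `html.parser` is an assumption to keep the snippet free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one paragraph and one link.
doc = '<p id="intro">hi</p><a id="link1" href="http://example.com/elsie">Elsie</a>'
soup = BeautifulSoup(doc, "html.parser")

as_list = soup.select("a#link1")       # list containing the single matching a tag
wrong_tag = soup.select("p#link1")     # tag name must match too, so this is empty
link = soup.select_one("#link1")       # the tag itself, without list indexing

print(len(as_list), len(wrong_tag), link["href"], link.get_text())
```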

    4. Selecting by attribute (this also works for class)

 
    # Select all a tags that have a myname attribute
    soup.select("a[myname]")

    # Select all a tags whose href equals http://example.com/lacie
    soup.select("a[href='http://example.com/lacie']")

    # Select a tags whose href starts with http
    soup.select('a[href^="http"]')

    # Select a tags whose href ends with lacie
    soup.select('a[href$="lacie"]')

    # Select a tags whose href contains .com
    soup.select('a[href*=".com"]')
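
The attribute operators above can be checked against each other with a sketch like this one. It reuses the three links from the example source plus one hypothetical a tag without an href; the helper `names()` is made up for the illustration, and `html.parser` is an assumption to avoid requiring lxml.

```python
from bs4 import BeautifulSoup

# The three links from the example source, plus one a tag with no attributes.
doc = """
<a class="mysis" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="mysis" href="http://example.com/lacie" id="link2" myname="kong">Lacie</a>
<a class="mysis" href="http://example.com/tillie" id="link3">Tillie</a>
<a>plain</a>
"""
soup = BeautifulSoup(doc, "html.parser")


def names(selector):
    """Return the text of every tag matched by the selector."""
    return [tag.get_text() for tag in soup.select(selector)]


has_attr = names("a[myname]")                            # attribute is present
exact = names("a[href='http://example.com/lacie']")      # exact value
prefix = names('a[href^="http"]')                        # value starts with
suffix = names('a[href$="lacie"]')                       # value ends with
contains = names('a[href*=".com"]')                      # value contains
by_class = names('a[class="mysis"]')                     # class as a plain attribute

print(has_attr, exact, prefix, suffix, contains, by_class)
```

The a tag without an href is never matched by the href selectors, so attribute selectors double as a filter for links that actually point somewhere.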

    5. tag.select

 
    # Get the text of the b tag inside the first a tag
    first_a = soup.select('a')[0]

    b_text = first_a.select('b')[0].get_text()

    print(b_text)
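
Calling `.select()` on a tag limits the search to that tag's descendants. The sketch below illustrates this on a hypothetical fragment with two links (markup and variable names are invented; `html.parser` is an assumption to avoid the lxml dependency) and shows a single descendant selector that gives the same text.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two links, each wrapping its own b tag.
doc = "<p><a><b>first b</b> Elsie</a><a><b>second b</b> Lacie</a></p>"
soup = BeautifulSoup(doc, "html.parser")

all_b = [b.get_text() for b in soup.select("b")]        # searches the whole document

first_a = soup.select("a")[0]
scoped_b = [b.get_text() for b in first_a.select("b")]  # searches inside first_a only

# One descendant selector plus select_one() reaches the same b tag.
same_text = soup.select_one("a b").get_text()

print(all_b, scoped_b, same_text)
```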