Two examples of writing selector expressions (CSS selectors) for BeautifulSoup's select(selector) method

1 Introduction

There are two common ways to extract the information you need from a web page with BeautifulSoup:

  1. Use the find() or find_all() methods, which are convenient and make it easy to locate the information you need directly;
  2. Use the select(selector) method, which achieves the same effect as approach 1.

Compared with approach 1, approach 2 has one advantage: when we write a generic crawler class, that is, a class whose crawling code does not depend on the concrete tags of any particular page (the same content is marked up very differently from one site to another), it lets us pull the information-locating step out of the code and express it as data, namely as a CSS selector string.
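To make the difference concrete, here is a minimal sketch of both approaches against a tiny made-up fragment (the string html_doc and its contents are only for illustration):

from bs4 import BeautifulSoup

html_doc = '<div class="content"><h2 class="t-description-heading">Book Description</h2></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

# Approach 1: find()/find_all() -- the tag name and attributes are hard-coded in Python.
heading = soup.find('h2', class_='t-description-heading')
print(heading.get_text())                                      # Book Description

# Approach 2: select() -- the same query expressed as a CSS selector string,
# which can be kept as data and swapped out for each website.
print(soup.select('h2.t-description-heading')[0].get_text())   # Book Description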

Two worked examples of writing CSS selectors are given below.

2 Examples of writing CSS selectors

The HTML fragment we will work with is:

<div class="title-description t-description sbo-reader-content">
    <div class="content">
      <h2 class="t-description-heading">Book Description</h2>
      <span><div><p>Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz’s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It’s an ideal way to begin, whether you’re new to programming or a professional developer versed in other languages.</p><p>Complete with quizzes, exercises, and helpful illustrations,  this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3— the latest releases in the 3.X  and 2.X lines—plus all other releases in common use today. You’ll also learn some advanced language features that recently have become more common in Python code.</p><ul><li>Explore Python’s major built-in object types such as numbers, lists, and dictionaries</li><li>Create and process objects with Python statements, and learn Python’s general syntax model</li><li>Use functions to avoid code redundancy and package code for reuse</li><li>Organize statements, functions, and other tools into larger components with modules</li><li>Dive into classes: Python’s object-oriented programming tool for structuring code</li><li>Write large programs with Python’s exception-handling model and development tools</li><li>Learn advanced Python tools, including decorators, descriptors, metaclasses, and Unicode processing</li></ul></div></span>

      <div id="showMoreDescription" class="showMore hidden"><button class="more"><span class="screen-reader-text">Show and hide more</span></button></div>

      
      <div id="publisher_resources" class="publisher-resources">
        <h2 class="t-description-heading">Publisher Resources</h2>
          <p><a href="http://oreilly.com/catalog/0636920028154/errata">View/Submit Errata</a></p>
          <p><a href="http://examples.oreilly.com/0636920028154/">Download Example Code</a></p>
      </div>
      

      <div id="toc-start"></div>
    </div>
  </div>

Example 1: Suppose we want to extract the "Book Description" text from

<div class="content">
      <h2 class="t-description-heading">Book Description</h2>

We can then write the following CSS selector:

'div.content > h2.t-description-heading'

Here is a step-by-step explanation of the selector expression:

  • div.content matches a div tag whose class is "content";
  • ">" restricts the match to direct children, i.e., we look only among the direct children of div.content;
  • h2.t-description-heading works the same way: an h2 tag whose class is "t-description-heading", and here it must be a direct child of div.content. Note that without ">" the selector becomes a descendant selector and also matches headings that are not direct children, such as <h2 class="t-description-heading">Publisher Resources</h2> nested inside the publisher_resources div.
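We can verify this with a short sketch against a trimmed copy of the fragment above (only the relevant tags are kept; the variable name html_doc is illustrative):

from bs4 import BeautifulSoup

# Trimmed copy of the HTML fragment shown in section 2
html_doc = """
<div class="content">
  <h2 class="t-description-heading">Book Description</h2>
  <div id="publisher_resources" class="publisher-resources">
    <h2 class="t-description-heading">Publisher Resources</h2>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# With ">": only the h2 that is a direct child of div.content matches.
print([h2.get_text() for h2 in soup.select('div.content > h2.t-description-heading')])
# ['Book Description']

# Without ">": the descendant selector also matches the nested Publisher Resources heading.
print([h2.get_text() for h2 in soup.select('div.content h2.t-description-heading')])
# ['Book Description', 'Publisher Resources']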

Example 2: If we want to extract the text inside <span><div><p>Get a comprehensive, ..., we can write the CSS selector expression as:

'div.content > span div p'

The selector expression is explained as follows:

  • 'span div p' walks down through the descendants, looking for p tags inside a div inside a span; compare this with the HTML fragment above.
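Again, a minimal sketch against a trimmed copy of the fragment (the paragraph text is shortened; html_doc is illustrative):

from bs4 import BeautifulSoup

# Trimmed copy of the HTML fragment, keeping only the span/div/p structure
html_doc = """
<div class="content">
  <span><div>
    <p>Get a comprehensive, in-depth introduction to the core Python language ...</p>
    <p>Complete with quizzes, exercises, and helpful illustrations, ...</p>
  </div></span>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# 'div.content > span div p': a span that is a direct child of div.content,
# then a div somewhere below it, then the p tags below that div.
for p in soup.select('div.content > span div p'):
    print(p.get_text())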

3 Code for the two examples and its run result

We can write the following code, which abstracts the CSS selectors out of the crawling logic and stores them as data in the Website class:

# ch4webCrawler.py
# date: 2020-08-17
import requests
from bs4 import BeautifulSoup

class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        """
        Flexible printing function controls output
        """
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY:\n{}'.format(self.body))

class Website:
    """ 
    Contains information about website structure
    """

    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag


class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        """
        Utility function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            #print(title)
            body = self.safeGet(bs, site.bodyTag)
            #print("body:"+body)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

crawler = Crawler()

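# Each row of siteData: [site name, site URL, CSS selector for the title, CSS selector for the body]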
siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'h1', 'div.content > h2.t-description-heading']
    #['O\'Reilly Media', 'http://oreilly.com', 'h1', 'div.content > span div p']
]
websites = []
for row in siteData:
    websites.append(Website(row[0], row[1], row[2], row[3]))

crawler.parse(websites[0], 'https://www.oreilly.com/library/view/learning-python-5th/9781449355722/')

The output is as expected: with the selector from Example 1, the BODY field contains the "Book Description" heading text.

[Screenshot: run result]
