1 Introduction


There are two common ways to locate information with BeautifulSoup:

  1. Use the find() or find_all() method, which is convenient and lets you locate the information you need directly;
  2. Use the select(selector) method, which achieves the same effect as method 1.



2 Examples of Writing CSS Selectors


<div class="title-description t-description sbo-reader-content">
    <div class="content">
      <h2 class="t-description-heading">Book Description</h2>
      <span><div><p>Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz’s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It’s an ideal way to begin, whether you’re new to programming or a professional developer versed in other languages.</p><p>Complete with quizzes, exercises, and helpful illustrations,  this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3— the latest releases in the 3.X  and 2.X lines—plus all other releases in common use today. You’ll also learn some advanced language features that recently have become more common in Python code.</p><ul><li>Explore Python’s major built-in object types such as numbers, lists, and dictionaries</li><li>Create and process objects with Python statements, and learn Python’s general syntax model</li><li>Use functions to avoid code redundancy and package code for reuse</li><li>Organize statements, functions, and other tools into larger components with modules</li><li>Dive into classes: Python’s object-oriented programming tool for structuring code</li><li>Write large programs with Python’s exception-handling model and development tools</li><li>Learn advanced Python tools, including decorators, descriptors, metaclasses, and Unicode processing</li></ul></div></span>

      <div id="showMoreDescription" class="showMore hidden"><button class="more"><span class="screen-reader-text">Show and hide more</span></button></div>

      <div id="publisher_resources" class="publisher-resources">
        <h2 class="t-description-heading">Publisher Resources</h2>
          <p><a href="">View/Submit Errata</a></p>
          <p><a href="">Download Example Code</a></p>

      <div id="toc-start"></div>

Example 1: Suppose we want to scrape the Book Description text from

<div class="content">
      <h2 class="t-description-heading">Book Description</h2>

Then we can write the following CSS selector:

'div.content > h2.t-description-heading'


  • div.content matches a div tag whose class="content";
  • ">" selects direct children only, i.e., the direct child nodes of div.content;
  • h2.t-description-heading has the same meaning as above; it is a direct child of div.content. Note that if ">" is omitted, the query matches both direct and non-direct descendant nodes, so it would also return: <h2 class="t-description-heading">Publisher Resources</h2>

Example 2: If we want to scrape the text in <span><div><p>Get a comprehensive, ..., we can write the CSS selector expression as:

'div.content > span div p'


  • 'span div p' searches down level by level for tags named p; refer to the HTML code above to follow the structure.

3 Code for the Two Examples Above and Its Output


# date: 2020-08-17
import requests
from bs4 import BeautifulSoup

class Content:
    """Common base class for all articles/pages"""
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        """Flexible printing function controls output"""
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))

class Website:
    """Contains information about website structure"""

    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        """
        Utility function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems =, selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''
    def parse(self, site, url):
        """Extract content from a given page URL"""
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

crawler = Crawler()

siteData = [
    ['O\'Reilly Media', '', 'h1', 'div.content > h2.t-description-heading'],
    #['O\'Reilly Media', '', 'h1', 'div.content > span div p'],
]
websites = []
for row in siteData:
    websites.append(Website(row[0], row[1], row[2], row[3]))

crawler.parse(websites[0], '')

