A complete guide to scraping the web for the top-rated films on TV

In this article, I will show you how to scrape the web for top-rated films with the Scrapy framework. The goal of this web scraper is to find films that have a high user rating on The Movie Database. The list of these films will be stored in an SQLite database and emailed. This way, you'll never miss a blockbuster on TV again.

Finding a good web page to scrape

I start with an online TV guide to find films on Belgian TV channels. But you could easily adapt my code to use it for any other website. To make your life easier when scraping for films, make sure the website you want to scrape:

  • has HTML tags with a comprehensible class or id

  • uses classes and ids in a consistent way

  • has well-structured URLs

  • contains all relevant TV channels on one page

  • has a separate page per weekday

  • lists only films and no other program types like live shows, news, reportage, and so on, unless you can easily distinguish the films from the other program types

With the results found we will scrape The Movie Database (TMDB) for the film rating and some other information.

Deciding on what information to store

I will scrape the following information about the films:

  • film title

  • TV channel

  • the time that the film starts

  • the date the film is on TV

  • genre

  • plot

  • release date

  • link to the details page on TMDB

  • TMDB rating

You could complement this list with all actors, the director, interesting film facts, and so on – all the information you’d like to know more about.

In Scrapy this information will be stored in the fields of an Item.

Create the Scrapy project

I am going to assume that you have Scrapy installed. If not, you can follow the excellent Scrapy installation guide.

When Scrapy is installed, open the command line and go to the directory where you want to store the Scrapy project. Then run:

scrapy startproject topfilms

This will create a folder structure for the top films project as shown below. You can ignore the topfilms.db file for now. This is the SQLite database that we will create in the section on Pipelines below.
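
The exact files depend on your Scrapy version (newer versions also generate a middlewares.py, for example), but the generated layout typically looks roughly like this; topfilms.db only appears after the pipeline described below has run once:

topfilms/
    scrapy.cfg            # deploy configuration
    topfilms/
        __init__.py
        items.py          # the Item definition goes here
        pipelines.py      # the database pipeline goes here
        settings.py       # project settings
        spiders/
            __init__.py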

Defining Scrapy Items

We’ll be working with the file items.py. Items.py is created by default when creating your Scrapy project.

A scrapy.Item is a container that will be filled during the web scraping. It will hold all the fields that we want to extract from the web page(s). The contents of the Item can be accessed in the same way as a Python dict.

Open items.py and add a scrapy.Item class with the following fields:

import scrapy
class TVGuideItem(scrapy.Item):
    title = scrapy.Field()
    channel = scrapy.Field()
    start_ts = scrapy.Field()
    film_date_long = scrapy.Field()
    film_date_short = scrapy.Field()
    genre = scrapy.Field()
    plot = scrapy.Field()
    rating = scrapy.Field()
    tmdb_link = scrapy.Field()
    release_date = scrapy.Field()
    nb_votes = scrapy.Field()
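
As a quick illustration of the dict-like behaviour mentioned above, this is roughly how a TVGuideItem can be filled and read back; the values here are made up:

item = TVGuideItem()
item['title'] = 'Inception'   # set a field just like a dict key
item['channel'] = 'Canvas'    # hypothetical channel name
print(item['title'])          # read it back the same way
print(item.keys())            # the filled fields, again dict-style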

Processing Items with Pipelines

After starting a new Scrapy project, you’ll have a file called pipelines.py. Open this file and copy-paste the code shown below. Afterward, I’ll show you step-by-step what each part of the code does.

import sqlite3 as lite

con = None  # db connection


class StoreInDBPipeline(object):
    def __init__(self):
        self.setupDBCon()
        self.dropTopFilmsTable()
        self.createTopFilmsTable()

    def process_item(self, item, spider):
        self.storeInDb(item)
        return item

    def storeInDb(self, item):
        self.cur.execute("INSERT INTO topfilms(\
            title, \
            channel, \
            start_ts, \
            film_date_long, \
            film_date_short, \
            rating, \
            genre, \
            plot, \
            tmdb_link, \
            release_date, \
            nb_votes \
            ) \
            VALUES( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )",
            (
                item['title'],
                item['channel'],
                item['start_ts'],
                item['film_date_long'],
                item['film_date_short'],
                float(item['rating']),
                item['genre'],
                item['plot'],
                item['tmdb_link'],
                item['release_date'],
                item['nb_votes']
            ))
        self.con.commit()

    def setupDBCon(self):
        self.con = lite.connect('topfilms.db')
        self.cur = self.con.cursor()

    def __del__(self):
        self.closeDB()

    def createTopFilmsTable(self):
        self.cur.execute("CREATE TABLE IF NOT EXISTS topfilms(id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, \
            title TEXT, \
            channel TEXT, \
            start_ts TEXT, \
            film_date_long TEXT, \
            film_date_short TEXT, \
            rating TEXT, \
            genre TEXT, \
            plot TEXT, \
            tmdb_link TEXT, \
            release_date TEXT, \
            nb_votes \
            )")

    def dropTopFilmsTable(self):
        self.cur.execute("DROP TABLE IF EXISTS topfilms")

    def closeDB(self):
        self.con.close()

First, we start by importing the SQLite package and give it the alias lite. We also initialize a variable con which is used for the database connection.

Creating a class to store Items in the database

Next, you create a class with a logical name. After enabling the pipeline in the settings file (more on that later), this class will be called.

class StoreInDBPipeline(object):

Defining the constructor method

The constructor method is the method with the name __init__. This method is automatically run when creating an instance of the StoreInDBPipeline class.

def __init__(self):
    self.setupDBCon()
    self.dropTopFilmsTable()
    self.createTopFilmsTable()

In the constructor method, we launch three other methods which are defined below the constructor method.

SetupDBCon Method

With the method setupDBCon, we create the topfilms database (if it didn’t exist yet) and make a connection to it with the connect function.

def setupDBCon(self):
    self.con = lite.connect('topfilms.db')
    self.cur = self.con.cursor()

Here we use the alias lite for the SQLite package. Secondly, we create a Cursor object with the cursor function. With this Cursor object, we can execute SQL statements in the database.

DropTopFilmsTable Method

The second method that is called in the constructor is dropTopFilmsTable. As the name says, it drops the table in the SQLite database.

Each time the web scraper is run, the topfilms table will be completely removed and recreated. It is up to you whether you want to do that as well. If you want to do some querying or analysis of the films' data, you could keep the scraping results of each run.

I just want to see the top rated films of the coming days and nothing more. Therefore I decided to recreate the table on each run.

def dropTopFilmsTable(self):
    self.cur.execute("DROP TABLE IF EXISTS topfilms")

With the Cursor object cur we execute the DROP statement.

CreateTopFilmsTable Method

After dropping the top films table, we need to create it. This is done by the last method call in the constructor method.

def createTopFilmsTable(self):
    self.cur.execute("CREATE TABLE IF NOT EXISTS topfilms(id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, \
    title TEXT, \
    channel TEXT, \
    start_ts TEXT, \
    film_date_long TEXT, \
    film_date_short TEXT, \
    rating TEXT, \
    genre TEXT, \
    plot TEXT, \
    tmdb_link TEXT, \
    release_date TEXT, \
    nb_votes \
    )")

Again we use the Cursor object cur to execute the CREATE TABLE statement. The fields that are added to the topfilms table are the same as in the Scrapy Item we created before. To keep things easy, I use exactly the same names in the SQLite table as in the Item. Only the id field is extra.

Sidenote: a good application to look at your SQLite databases is the SQLite Manager plugin in Firefox. You can watch this SQLite Manager tutorial on Youtube to learn how to use this plugin.
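
If you prefer not to install a browser plugin, a few lines of Python are enough to peek into the database after a crawl. This is just a quick inspection sketch, assuming the default topfilms.db file and the table created above:

import sqlite3 as lite

con = lite.connect('topfilms.db')
# Print every stored film with its channel and rating, best rated first
for row in con.execute("SELECT title, channel, rating FROM topfilms ORDER BY rating DESC"):
    print(row)
con.close()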

Process_item Method

This method must be implemented in the Pipeline class, and it must return a dict, an Item, or raise a DropItem exception. In our web scraper, we will return the item.

def process_item(self, item, spider):
    self.storeInDb(item)
    return item

In contrast with the other methods explained, it has two extra arguments: the item that was scraped and the spider that scraped it. From this method, we launch the storeInDb method and afterward return the item.
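
As a side note: if you wanted to skip certain films instead of storing them, this is where the DropItem exception would come in. A minimal sketch of such an extra pipeline (the class name DropUnratedPipeline is made up and not part of this project) could look like this:

from scrapy.exceptions import DropItem

class DropUnratedPipeline(object):  # hypothetical extra pipeline
    def process_item(self, item, spider):
        if not item.get('rating'):
            # Scrapy discards the item and logs the reason
            raise DropItem("No rating found for %s" % item.get('title'))
        return item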

StoreInDb Method

This method executes an INSERT statement to insert the scraped item into the SQLite database.

def storeInDb(self, item):
    self.cur.execute("INSERT INTO topfilms(\
    title, \
    channel, \
    start_ts, \
    film_date_long, \
    film_date_short, \
    rating, \
    genre, \
    plot, \
    tmdb_link, \
    release_date, \
    nb_votes \
    ) \
    VALUES( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )",
                     (
                         item['title'],
                         item['channel'],
                         item['start_ts'],
                         item['film_date_long'],
                         item['film_date_short'],
                         float(item['rating']),
                         item['genre'],
                         item['plot'],
                         item['tmdb_link'],
                         item['release_date'],
                         item['nb_votes']
                     ))
    self.con.commit()

The values for the table fields come from the item, which is an argument of this method. These values are simply accessed as dict values (remember that an Item is nothing more than a dict?).

Every constructor has a... destructor

The counterpart of the constructor method is the destructor method with the name __del__. In the destructor method for this pipelines class, we close the connection to the database.

def __del__(self):
    self.closeDB()

CloseDB Method

def closeDB(self):
    self.con.close()

In this last method, we close the database connection with the close function. So now we have written a fully functional pipeline. There is still one last step left to enable the pipeline.

Enabling the pipeline in settings.py

Open the settings.py file and add the following code:

ITEM_PIPELINES = {
    'topfilms.pipelines.StoreInDBPipeline':1
}

The integer value indicates the order in which the pipelines are run. As we have only one pipeline, we assign it the value 1.
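
If you later add more pipelines, lower numbers run first. A hypothetical example that combines the DropUnratedPipeline sketched earlier with the storage pipeline (the numbers just need to be between 0 and 1000) could look like this:

ITEM_PIPELINES = {
    'topfilms.pipelines.DropUnratedPipeline': 100,  # hypothetical, runs first
    'topfilms.pipelines.StoreInDBPipeline': 200,    # runs second
}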

Creating a Spider in Scrapy

Now we’ll be looking at the core of Scrapy, the Spider. This is where the heavy lifting of your web scraper will be done. I’ll show you step-by-step how to create one.

Importing the necessary packages

First of all, we’ll import the necessary packages and modules. We use the CrawlSpider module to follow links throughout the online TV guide.

Rule and LinkExtractor are used to determine which links we want to follow.

RuleLinkExtractor用于确定我们要关注的链接。

The config module contains some constants like DOM_1, DOM_2 and START_URL that are used in the Spider. The config module is found one directory up from the current directory. That's why you see two dots before the config module.
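
The config module itself is not shown in this article. A minimal sketch of what it could contain is given below; every value is a placeholder, not the one I actually used:

# config.py (one directory above the spiders folder)
DOM_1 = 'example-tvguide.be'            # placeholder domain of the TV guide
DOM_2 = 'themoviedb.org'
START_URL = 'https://www.example-tvguide.be/tv-gids/'  # placeholder start page
ALLOWED_CHANNELS = ['Canvas', 'Een']    # placeholder channel names
FROMADDR = 'you@gmail.com'              # used later by the email extension
TOADDR = 'you@gmail.com'
UNAME = 'you@gmail.com'
PW = 'your-app-password'
GMAIL = 'smtp.gmail.com:587'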

And lastly, we import the TVGuideItem. This TVGuideItem will be used to contain the information during the scraping.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from fuzzywuzzy import fuzz
from ..config import *
from topfilms.items import TVGuideItem

Telling the Spider where to go

Secondly, we subclass the CrawlSpider class. This is done by passing CrawlSpider as an argument to the TVGuideSpider class definition.

We give the Spider a name, provide the allowed_domains (e.g. themoviedb.org) and the start_urls. The start_urls is in my case the web page of the TV guide, so you should change this to your own preferred website.

With rules and the deny argument, we tell the Spider which URLs not to follow from the start URL. The URLs not to follow are specified with a regular expression.

I am not interested in the films that were shown yesterday, so I don't allow the Spider to follow URLs ending with "gisteren".

OK, but which URLs should the Spider follow? For that, I use the restrict_xpaths argument. It says to follow all links with the class "button button--beta". These are in fact the URLs with films for the coming week.

Finally, with the callback argument we let the Spider know what to do when it is following one of the URLs. It will execute the function parse_by_day. I’ll explain that in the next part.

class TVGuideSpider(CrawlSpider):
    name = "tvguide"
    allowed_domains = [DOM_1, DOM_2]
    start_urls = [START_URL]

    # Extract the links from the navigation per day
    # We will not crawl the films for yesterday
    rules = (
        Rule(LinkExtractor(allow=(), deny=(r'\/gisteren'), restrict_xpaths=('//a[@class="button button--beta"]',)), callback="parse_by_day", follow=True),
    )

Parsing the followed URLs

The parse_by_day function, part of the TVGuideSpider, scrapes the web pages with the overview of all films per channel per day. The response argument comes from the Request that has been launched when running the web scraping program.

On the web page being scraped you need to find the HTML elements that are used to show the information we are interested in. Two good tools for this are the Chrome Developer Tools and the Firebug plugin in Firefox.

One thing we want to store is the date for the films we are scraping. This date can be found in the paragraph (p) in the div with class="grid__col__inner". Clearly, this is something you should modify for the page you are scraping.

With the xpath method of the Response object, we extract the text in the paragraph. I learned a lot of this in the tutorial on how to use the xpath function.

By using extract_first, we make sure that we do not store this date as a list. Otherwise, this will give us issues when storing the date in the SQLite database.
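
To illustrate the difference between the two methods, here is a small sketch using a made-up HTML snippet; extract() always gives a list, extract_first() gives a single string (or None if nothing matches):

from scrapy.selector import Selector

sel = Selector(text='<p>zondag, 7 mei 2017</p>')   # made-up example markup
print(sel.xpath('//p/text()').extract())           # ['zondag, 7 mei 2017'] - a list
print(sel.xpath('//p/text()').extract_first())     # 'zondag, 7 mei 2017' - a single string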

Afterwards, I perform some data cleaning on film_date_long and create film_date_short with the format YYYYMMDD. I created this YYYYMMDD format to sort the films chronologically later on.

Next, the TV channel is scraped. If it is in the list of ALLOWED_CHANNELS (defined in the config module), we continue to scrape the title and starting time. This information is stored in the item, which is initialized by TVGuideItem().

After this, we want to continue scraping on The Movie Database. We will use the URL https://www.themoviedb.org/search?query= to show search results for the film being scraped. To this URL, we want to add the film title (url_part in the code). We simply re-use the URL part that is found in the link on the TV guide web page.

With that URL, we create a new request and continue on TMDB. With request.meta['item'] = item we add the already scraped data to the request. This way we can continue to fill up our current TVGuideItem.

Yielding the request actually launches it.

def parse_by_day(self, response):
    film_date_long = response.xpath('//div[@class="grid__col__inner"]/p/text()').extract_first()
    film_date_long = film_date_long.rsplit(',', 1)[-1].strip()  # Remove day name and white spaces
    # Create a film date with a short format like YYYYMMDD to sort the results chronologically
    film_day_parts = film_date_long.split()
    months_list = ['januari', 'februari', 'maart',
                   'april', 'mei', 'juni', 'juli',
                   'augustus', 'september', 'oktober',
                   'november', 'december']
    year = str(film_day_parts[2])
    month = str(months_list.index(film_day_parts[1]) + 1).zfill(2)
    day = str(film_day_parts[0]).zfill(2)
    film_date_short = year + month + day

    for col_inner in response.xpath('//div[@class="grid__col__inner"]'):
        chnl = col_inner.xpath('.//div[@class="tv-guide__channel"]/h6/a/text()').extract_first()
        if chnl in ALLOWED_CHANNELS:
            for program in col_inner.xpath('.//div[@class="program"]'):
                item = TVGuideItem()
                item['channel'] = chnl
                item['title'] = program.xpath('.//div[@class="title"]/a/text()').extract_first()
                item['start_ts'] = program.xpath('.//div[@class="time"]/text()').extract_first()
                item['film_date_long'] = film_date_long
                item['film_date_short'] = film_date_short
                detail_link = program.xpath('.//div[@class="title"]/a/@href').extract_first()
                url_part = detail_link.rsplit('/', 1)[-1]
                # Extract information from the Movie Database www.themoviedb.org
                request = scrapy.Request("https://www.themoviedb.org/search?query=" + url_part, callback=self.parse_tmdb)
                request.meta['item'] = item  # Pass the item with the request to the detail page
                yield request

Scraping additional information on The Movie Database

As you can see in the request created in the function parse_by_day, we use the callback function parse_tmdb. This function is used during the request to scrape the TMDB website.

In the first step, we get the item information that was passed by the parse_by_day function.

The page with search results on TMDB can possibly list multiple search results for the same film title (the url_part passed in the query). We also check whether there are results at all with if tmdb_titles.

We use the fuzzywuzzy package to perform fuzzy matching on the film titles. In order to use the fuzzywuzzy package we need to add the import statement together with the previous import statements.

from fuzzywuzzy import fuzz

If we find a 90% match, we use that search result to do the rest of the scraping. We do not look at the other search results anymore. To do that we use the break statement.
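
To get a feel for what fuzz.ratio returns, here is a small sketch with made-up titles; the result is an integer between 0 and 100, so only near-identical titles clear the 90 threshold:

from fuzzywuzzy import fuzz

print(fuzz.ratio("The Shawshank Redemption", "The Shawshank Redemption"))  # 100, exact match
print(fuzz.ratio("The Shawshank Redemption", "Shawshank Redemption"))      # high score, passes the threshold
print(fuzz.ratio("The Shawshank Redemption", "The Godfather"))             # low score, rejected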

Next, we gather the genre, rating and release_date from the search results page, in a similar way to how we used the xpath function before. To get a consistent format for the release date, we execute some data processing with the split and join functions.

Again we want to launch a new request to the details page on TMDB. This request will call the parse_tmdb_detail function to extract the film plot and number of votes on TMDB. This is explained in the next section.

def parse_tmdb(self, response):
    item = response.meta['item']  # Use the passed item

    tmdb_titles = response.xpath('//a[@class="title result"]/text()').extract()
    if tmdb_titles:  # Check if there are results on TMDB
        for tmdb_title in tmdb_titles:
            match_ratio = fuzz.ratio(item['title'], tmdb_title)
            if match_ratio > 90:
                item['genre'] = response.xpath('.//span[@class="genres"]/text()').extract_first()
                item['rating'] = response.xpath('//span[@class="vote_average"]/text()').extract_first()
                release_date = response.xpath('.//span[@class="release_date"]/text()').extract_first()
                release_date_parts = release_date.split('/')
                item['release_date'] = "/".join(
                    [release_date_parts[1].strip(), release_date_parts[0].strip(), release_date_parts[2].strip()])
                tmdb_link = "https://www.themoviedb.org" + response.xpath(
                    '//a[@class="title result"]/@href').extract_first()
                item['tmdb_link'] = tmdb_link
                # Extract more info from the detail page
                request = scrapy.Request(tmdb_link, callback=self.parse_tmdb_detail)
                request.meta['item'] = item  # Pass the item with the request to the detail page
                yield request
                break  # We only consider the first match
    else:
        return

Scraping the film plot from the details page

The last function we’ll discuss is a short one. As before we get the item passed by the parse_tmdb function and scrape the details page for the plot and number of votes.

At this stage, we are finished scraping the information for the film. In other words, the item for the film is completely filled up. Scrapy will then use the code written in the pipelines to process these data and put it in the database.

def parse_tmdb_detail(self, response):
    item = response.meta['item']  # Use the passed item
    item['nb_votes'] = response.xpath('//span[@itemprop="ratingCount"]/text()').extract_first()
    item['plot'] = response.xpath('.//p[@id="overview"]/text()').extract_first()
    yield item

Using Extensions in Scrapy

In the section about Pipelines, we already saw how we store the scraping results in an SQLite database. Now I will show you how you can send the scraping results via email. This way you get a nice overview of the top rated films for the coming week in your mailbox.

Importing the necessary packages

We will be working with the file extensions.py, which lives in the same directory as settings.py (create it there if your Scrapy version does not generate it automatically). We start by importing the packages which we'll use later in this file.

import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured
import smtplib
import sqlite3 as lite
from config import *

The logging package is not strictly required, but it can be useful for debugging your program, or just for writing some information to the log. The signals module will help us to know when the spider has been opened and closed. We will send the email with the films after the spider has done its job.

From the scrapy.exceptions module we import the NotConfigured exception. This will be raised when the extension is not configured in the settings.py file. Concretely, the parameter MYEXT_ENABLED must be set to True. We'll see this later in the code.

The smtplib package is imported to be able to send the email. I use my Gmail address to send the email, but you could adapt the code in config.py to use another email service.

Lastly, we import the sqlite3 package to extract the top-rated films from the database and import config to get our constants.

Creating the SendEmail class in the extensions

First, we define the logger object. With this object we can write messages to the log at certain events. Then we create the SendEmail class with the constructor method. In the constructor, we assign FROMADDR and TOADDR to the corresponding attributes of the class. These constants are set in the config.py file. I used my Gmail address for both attributes.

首先,我们定义logger对象。 使用此对象,我们可以在某些事件下将消息写入日志。 然后,我们使用构造函数方法创建SendEmail类。 在构造函数中,我们将FROMADDRTOADDR分配给TOADDR的相应属性。 这些常量在config.py文件中设置。 我将我的Gmail地址用于这两个属性。

logger = logging.getLogger(__name__)


class SendEmail(object):
    def __init__(self):
        self.fromaddr = FROMADDR
        self.toaddr = TOADDR

Instantiating the extension object

The first method of the SendEmail object is from_crawler. The first check we do is whether MYEXT_ENABLED is enabled in the settings.py file. If this is not the case, we raise a NotConfigured exception. When this happens, the rest of the code in the extension is not executed.

In the settings.py file we need to add the following code to enable this extension.

settings.py文件中,我们需要添加以下代码以启用此扩展。

MYEXT_ENABLED = True
EXTENSIONS = {
    'topfilms.extensions.SendEmail': 500,
    'scrapy.telnet.TelnetConsole': None
}

So we set the Boolean flag MYEXT_ENABLED to True. Then we add our own extension SendEmail to the EXTENSIONS dictionary. The integer value of 500 specifies the order in which the extension must be executed. I also had to disable the TelnetConsole, otherwise sending the email did not work. This extension is disabled by putting None instead of an integer order value.

Next, we instantiate the extension object with the cls() function. To this extension object we connect some signals. We are interested in the spider_opened and spider_closed signals. And lastly we return the ext object.

@classmethod
def from_crawler(cls, crawler):
    # first check if the extension should be enabled and raise
    # NotConfigured otherwise
    if not crawler.settings.getbool('MYEXT_ENABLED'):
        raise NotConfigured
    # instantiate the extension object
    ext = cls()
    # connect the extension object to signals
    crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
    crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
    # return the extension object
    return ext

Define the actions in the spider_opened event

When the spider has been opened, we simply want to write this to the log. Therefore we use the logger object which we created at the top of the code. With the info method we write a message to the log. spider.name is replaced by the name we defined in the TVGuideSpider.py file.

def spider_opened(self, spider):
    logger.info("opened spider %s", spider.name)

Sending the email after the spider_closed event

In the last method of the SendEmail class we send the email containing the overview with top rated films.

SendEmail类的最后一个方法中,我们发送包含带有最高评分电影概述的电子邮件。

Again we send a notification to the log that the spider has been closed. Secondly, we create a connection to the SQLite database containing all the films of the coming week for the ALLOWED_CHANNELS. We select the films with a rating >= 6.5. You can change the rating to a higher or lower threshold as you wish. The resulting films are then sorted by film_date_short, which has the YYYYMMDD format, and by the starting time start_ts.

We fetch all rows in the cursor cur and check whether we have some results with the len function. It is possible to have no results when you set the threshold rating too high, for example.

With for row in data we go through each resulting film. We extract all the interesting information from the row. For some data we apply some encoding with encode('ascii','ignore'). This is to ignore some of the special characters like é, à, è, and so on. Otherwise we get errors when sending the email.

for row in data我们遍历每部产生的电影。 我们从该row提取所有有趣的信息。 对于某些数据,我们使用encode('ascii','ignore')一些编码。 这是为了忽略某些特殊字符,例如é,à,è等。 否则,我们在发送电子邮件时会出错。

When all data about the film is gathered, we compose a string variable topfilm. Each topfilm is then concatenated to the variable topfilms_overview, which will be the message of the email we send. If we have no film in our query result, we mention this in a short message.

At the end, we send the message with the Gmail address, thanks to the smtplib package.

def spider_closed(self, spider):
    logger.info("closed spider %s", spider.name)
    # Getting films with a rating above a threshold
    topfilms_overview = ""
    con = lite.connect('topfilms.db')
    cur = con.execute(
        "SELECT title, channel, start_ts, film_date_long, plot, genre, release_date, rating, tmdb_link, nb_votes "
        "FROM topfilms "
        "WHERE rating >= 6.5 "
        "ORDER BY film_date_short, start_ts")


    data = cur.fetchall()
    if len(data) > 0:  # Check if we have records in the query result
        for row in data:
            title = row[0].encode('ascii', 'ignore')
            channel = row[1]
            start_ts = row[2]
            film_date_long = row[3]
            plot = row[4].encode('ascii', 'ignore')
            genre = row[5]
            release_date = row[6].rstrip()
            rating = row[7]
            tmdb_link = row[8]
            nb_votes = row[9]
            topfilm = ' - '.join([title, channel, film_date_long, start_ts])
            topfilm = topfilm + "\r\n" + "Release date: " + release_date
            topfilm = topfilm + "\r\n" + "Genre: " + str(genre)
            topfilm = topfilm + "\r\n" + "TMDB rating: " + rating + " from " + nb_votes + " votes"
            topfilm = topfilm + "\r\n" + plot
            topfilm = topfilm + "\r\n" + "More info on: " + tmdb_link
            topfilms_overview = "\r\n\r\n".join([topfilms_overview, topfilm])
    con.close()
    if len(topfilms_overview) > 0:
        message = topfilms_overview
    else:
        message = "There are no top rated films for the coming week."
    msg = "\r\n".join([
        "From: " + self.fromaddr,
        "To: " + self.toaddr,
        "Subject: Top Films Overview",
        message
    ])
    username = UNAME
    password = PW
    server = smtplib.SMTP(GMAIL)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    server.sendmail(self.fromaddr, self.toaddr, msg)
    server.quit()

Result of sending emails via Extensions

The final result of this piece of code is an overview of the top rated films in your mailbox. Great! Now you don't have to look this up on the online TV guide anymore.

Tricks to avoid IP banning

When you make many requests in a short period of time, you risk being banned by the server. In this final section, I’ll show you some tricks to avoid IP banning.

Delaying your requests

One simple way to avoid IP banning is to pause between each request. In Scrapy this can be done by simply setting a parameter in the settings.py file. As you probably noticed, the settings.py file has a lot of parameters commented out.

避免IP禁止的一种简单方法是在每个请求之间暂停 。 在Scrapy中,只需在settings.py文件中设置一个参数即可。 您可能已经注意到,settings.py文件中有很多参数已被注释掉。

Search for the parameter DOWNLOAD_DELAY and uncomment it. I set the pause length to 2 seconds. Depending on how many requests you have to make, you can change this. But I would set it to at least 1 second.

搜索参数DOWNLOAD_DELAY并取消注释。 我将暂停时间设置为2秒 。 根据您必须发出的请求数量,可以更改此设置。 但我将其设置为至少1秒。

DOWNLOAD_DELAY=2
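
On top of that, Scrapy can randomize the pause for you. The RANDOMIZE_DOWNLOAD_DELAY setting is enabled by default, so each wait is drawn from a range around DOWNLOAD_DELAY, which makes the request pattern look a little less robotic:

RANDOMIZE_DOWNLOAD_DELAY = True  # already the default: wait between 0.5 * and 1.5 * DOWNLOAD_DELAY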

A more advanced way of avoiding IP banning

By default, each time you make a request, you do this with the same user agent. Thanks to the package fake_useragent we can easily change the user agent for each request.

All credits for this piece of code go to Alecxe who wrote a nice Python script to make use of the fake_useragent package.

First, we create a folder scrapy_fake_useragent in the root directory of our web scraper project. In this folder we add two files:

  • __init__.py, which is an empty file

  • middleware.py

To use this middleware we need to enable it in the settings.py file. This is done with the code:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

First, we disable the default UserAgentMiddleware of Scrapy by specifying None instead of an integer value. Then we enable our own middleware RandomUserAgentMiddleware. Intuitively, middleware is a piece of code that is executed during a request.

In the file middleware.py we add the code to randomize the user agent for each request. Make sure you have the fake_useragent package installed. From the fake_useragent package we import the UserAgent class. This contains a list of different user agents. In the constructor of the RandomUserAgentMiddleware class, we instantiate the UserAgent object. In the method process_request we set the user agent to a random user agent from the ua object in the header of the request.

from fake_useragent import UserAgent
class RandomUserAgentMiddleware(object):
    def __init__(self):
        super(RandomUserAgentMiddleware, self).__init__()
        self.ua = UserAgent()
    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', self.ua.random)

Conclusion

That was it! I hope you now have a clear view on how to use Scrapy for your web scraping projects.

Translated from: https://www.freecodecamp.org/news/scrape-the-web-for-top-rated-movies-on-tv/
