Scraping IMDB Top 250 Movie Data

qq^^614136809

Article link: https://blog.csdn.net/D0126_/article/details/142335975

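1. Problem background
A user writing a Python script to scrape the IMDB Top 250 movie data ran into trouble: he could not understand how the code worked, and it raised errors when run.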

2. Solution
To solve this problem, we need to analyze and modify the code:

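First, we need to understand what the code does. Its main purpose is to scrape the Top 250 movie data from the IMDB website and save it as a CSV file. The process involves the following steps:

1. Visit the IMDB Top 250 page and save the page HTML.
2. Extract the URL of each movie from that HTML.
3. Visit each movie's page and scrape its title, duration, genres, directors, writers, actors, rating, and number of ratings.
4. Save the scraped data as a CSV file.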
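
The URL extraction step can be sketched with the standard library alone. This is a minimal illustration, not the original script's code (which relies on the pattern library): the `MovieLinkParser` class, the `extract_movie_urls` function, and the sample HTML are hypothetical, and the sketch assumes that movie links on the chart page have the form `/title/tt<digits>/`.

```python
import re
from html.parser import HTMLParser

# Assumption: links to movie pages look like /title/tt0111161/...
TITLE_HREF = re.compile(r"/title/tt\d+/")


class MovieLinkParser(HTMLParser):
    """Collect unique movie URLs from <a href="..."> tags, in page order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = TITLE_HREF.match(href)
        if match:
            # Drop any query string and make the URL absolute.
            url = "https://www.imdb.com" + match.group(0)
            if url not in self.urls:
                self.urls.append(url)


def extract_movie_urls(html):
    """Return the movie page URLs found in the given HTML string."""
    parser = MovieLinkParser()
    parser.feed(html)
    return parser.urls


# Made-up sample markup standing in for the saved Top 250 page.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/title/tt0111161/?ref_=chttp_1">The Shawshank Redemption</a></td></tr>
  <tr><td><a href="/title/tt0068646/?ref_=chttp_2">The Godfather</a></td></tr>
  <tr><td><a href="/title/tt0111161/?ref_=chttp_1"><img src="poster.jpg"></a></td></tr>
  <tr><td><a href="/chart/top/">Top 250</a></td></tr>
</table>
"""

movie_urls = extract_movie_urls(SAMPLE_HTML)
```

Duplicate links (for example a poster and a title that point to the same movie) are collapsed, so each movie page would be visited only once in the next step.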

Once we understand what the code does, we need to check that it is correct: first that its syntax is valid, and then that its logic is sound.

We found one problem in the code: in the scrape_movie_page function, the for loop was not indented, so the function's code block was not executed correctly.
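
To see why indentation matters here, the following sketch compiles two tiny versions of a function without running them. The snippets and the `syntax_error_of` helper are hypothetical stand-ins, not the user's actual code; they only show how Python reacts when a loop that belongs inside a function is left at the outer indentation level.

```python
# A function whose loop is NOT indented: the def has no body at all.
BROKEN = (
    "def scrape_movie_page(dom):\n"
    "for span in dom:\n"
    "    print(span)\n"
)

# The same function with the loop indented into the function body.
FIXED = (
    "def scrape_movie_page(dom):\n"
    "    for span in dom:\n"
    "        print(span)\n"
)


def syntax_error_of(source):
    """Return the name of the error raised when compiling source, or None."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return type(exc).__name__
    return None


broken_result = syntax_error_of(BROKEN)
fixed_result = syntax_error_of(FIXED)
```

If the function already has a docstring as its body, the same mistake may compile without error, and the loop then runs at module level rather than inside the function, which is harder to spot.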

The corrected code is as follows:

# Requires the pattern library (pattern.web). clean_unicode() and
# concat_strings() are helper functions from the rest of the script and are
# not shown in this excerpt.
from pattern.web import plaintext, strip_between


def scrape_movie_page(dom):
    '''
    Scrape the IMDB page for a single movie

    Args:
        dom: pattern.web.DOM instance representing the page of 1 single
            movie.

    Returns:
        A tuple of strings representing the following (in order): title,
        duration, genre(s) (semicolon separated if several), director(s)
        (semicolon separated if several), writer(s) (semicolon separated if
        several), actor(s) (semicolon separated if several), rating, number
        of ratings.
    '''
    # The class names used below are the ones this script targets; they
    # depend on the page layout and may not match the live site.

    # Title: remove the nested <span> elements from the header, then strip
    # the remaining markup.
    title = clean_unicode(dom.by_class('header')[0].content)
    title = plaintext(strip_between('<span', '</span>', title))

    # The infobar holds the runtime (<time>) and the genre links.
    infobar = dom.by_class('infobar')[0]
    duration = clean_unicode(infobar.by_tag('time')[0].content)

    # All infobar links except the last one are treated as genres.
    genres = []
    for genre in infobar.by_tag('a')[:-1]:
        genres.append(clean_unicode(genre.content))

    directors = []
    writers = []
    actors = []

    # The first three text blocks list the directors, writers and actors;
    # the itemprop attribute of each span tells which is which.
    text_blocks = dom.by_class('txt-block')[:3]
    for t in text_blocks:
        for s in t.by_tag('span'):
            itemprop = s.attributes.get('itemprop')
            if itemprop == 'director':
                director = s.by_tag('span')[0].by_tag('a')[0].content
                directors.append(clean_unicode(director))
            elif itemprop == 'writer':
                writer = s.by_tag('span')[0].by_tag('a')[0].content
                writers.append(clean_unicode(writer))
            elif itemprop == 'actors':
                actor = s.by_tag('span')[0].by_tag('a')[0].content
                actors.append(clean_unicode(actor))

    # Rating and number of ratings; left empty if the spans are missing.
    rating = ''
    ratings_count = ''
    for s in dom.by_class('star-box-details')[0].by_tag('span'):
        if s.attributes.get('itemprop') == 'ratingValue':
            rating = clean_unicode(s.content)
        if s.attributes.get('itemprop') == 'ratingCount':
            ratings_count = clean_unicode(s.content)

    # Format the lists as single strings.
    genres = concat_strings(genres)
    directors = concat_strings(directors)
    writers = concat_strings(writers)
    actors = concat_strings(actors)

    # Return everything of interest for this movie (all strings as specified
    # in the docstring of this function).
    return (title, duration, genres, directors, writers, actors, rating,
            ratings_count)
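
The article does not show the `concat_strings` helper or the CSV output step, so the following is only a sketch of how they might look. The body of `concat_strings` is an assumption based on the docstring ("semicolon separated if several"); `HEADER`, `write_rows`, and the sample row are hypothetical, and the row contains placeholder values rather than real IMDB data.

```python
import csv
import io

# Column order follows the return value described in the docstring.
HEADER = ["title", "duration", "genres", "directors", "writers",
          "actors", "rating", "ratings_count"]


def concat_strings(items):
    """Join a list of strings with semicolons (assumed helper behaviour)."""
    return ";".join(items)


def write_rows(out, rows):
    """Write the header and one CSV line per movie to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Placeholder row in the shape that scrape_movie_page returns.
sample_row = (
    "Example Movie",
    "120 min",
    concat_strings(["Crime", "Drama"]),
    concat_strings(["Director A"]),
    concat_strings(["Writer A", "Writer B"]),
    concat_strings(["Actor A", "Actor B"]),
    "8.5",
    "1,000",  # a count with a thousands separator gets quoted by csv
)

buffer = io.StringIO()
write_rows(buffer, [sample_row])
csv_text = buffer.getvalue()
```

Using semicolons inside a field keeps multi-valued columns such as actors in a single CSV cell, while the `csv` module takes care of quoting any value that itself contains a comma. To write to disk instead of a buffer, a file opened with `open(path, "w", newline="")` can be passed as `out`.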