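A user ran into trouble while writing a Python script to scrape the IMDB Top 250 movie data: they could not follow how the code worked, and the script raised errors when run.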
2. Solution
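To resolve this, we need to analyze the code and then modify it.
First, we need to understand what the code does. Its main purpose is to scrape the Top 250 movie data from the IMDB website and save it as a CSV file. The process involves the following steps:
- Visit the IMDB Top 250 page and save its HTML.
- Extract the URL of each movie from that HTML.
- Visit each movie's page and scrape its title, duration, genres, directors, writers, actors, rating and number of ratings.
- Save the scraped data to a CSV file.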
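The last step above can be sketched with Python's standard csv module. This is a minimal illustration rather than the original script's code: the function name save_csv, the column names in HEADER and the sample row are all assumptions made up for the example.

```python
import csv
import io

# Column order mirrors the fields listed above; the header names are
# illustrative assumptions, not taken from the original script.
HEADER = ['title', 'duration', 'genres', 'directors', 'writers',
          'actors', 'rating', 'n_ratings']

def save_csv(f, rows):
    """Write a header line followed by one row per movie to the open file f."""
    writer = csv.writer(f)
    writer.writerow(HEADER)
    writer.writerows(rows)

# Usage with an in-memory buffer and one made-up row (not real IMDB data).
buffer = io.StringIO()
save_csv(buffer, [('Example Movie', '120 min', 'Drama;Crime', 'A. Director',
                   'B. Writer', 'C. Actor;D. Actor', '9.0', '1,000')])
csv_text = buffer.getvalue()
```

Because values such as a ratings count can contain commas, leaving the quoting to csv.writer is safer than joining the fields by hand.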
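Once the purpose of the code is clear, we need to check that it is correct: first its syntax, then its logic.
Inspecting the code reveals a problem in the scrape_movie_page function: the for loops and their bodies were not indented. Python uses indentation to decide which statements belong to a function or a loop, so the function's code block could not be executed as intended; this kind of mistake typically surfaces as an IndentationError.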
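The effect of a missing indent can be reproduced in isolation. The sketch below is not part of the scraper; the two source strings and the helper name compiles are made up for the demonstration, which only relies on the built-in compile function.

```python
# Two tiny source strings: the loop body is unindented in the first one.
broken = "for n in [1, 2, 3]:\nprint(n)\n"
fixed = "for n in [1, 2, 3]:\n    print(n)\n"

def compiles(source):
    """Return True if Python can compile the source, False on an IndentationError."""
    try:
        compile(source, '<example>', 'exec')
        return True
    except IndentationError:
        return False

broken_ok = compiles(broken)  # the unindented body is rejected
fixed_ok = compiles(fixed)    # the indented body compiles
```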
The corrected code is as follows:
from pattern.web import plaintext, strip_between

# clean_unicode() and concat_strings() are helper functions defined elsewhere
# in the same script.

def scrape_movie_page(dom):
    '''
    Scrape the IMDB page for a single movie

    Args:
        dom: pattern.web.DOM instance representing the page of 1 single
            movie.

    Returns:
        A tuple of strings representing the following (in order): title,
        duration, genre(s) (semicolon separated if several), director(s)
        (semicolon separated if several), writer(s) (semicolon separated if
        several), actor(s) (semicolon separated if several), rating, number
        of ratings.
    '''
    # The caller downloads each movie page and passes its DOM in, so this
    # function reads from `dom` and handles exactly one movie per call.
    title = clean_unicode(dom.by_class('header')[0].content)
    # Drop the nested <span> elements, then strip the remaining markup.
    title = plaintext(strip_between('<span', '</span>', title))

    infobar = dom.by_class('infobar')[0]
    duration = clean_unicode(infobar.by_tag('time')[0].content)

    # All links in the info bar except the last are treated as genres.
    genres = []
    for genre in infobar.by_tag('a')[:-1]:
        genres.append(clean_unicode(genre.content))

    # Directors, writers and actors are identified by their itemprop attribute.
    directors = []
    writers = []
    actors = []
    for t in dom.by_class('txt-block')[:3]:
        for s in t.by_tag('span'):
            itemprop = s.attributes.get('itemprop')
            if itemprop == 'director':
                director = s.by_tag('span')[0].by_tag('a')[0].content
                directors.append(clean_unicode(director))
            if itemprop == 'writer':
                writer = s.by_tag('span')[0].by_tag('a')[0].content
                writers.append(clean_unicode(writer))
            if itemprop == 'actors':
                actor = s.by_tag('span')[0].by_tag('a')[0].content
                actors.append(clean_unicode(actor))

    # Rating and number of ratings; both stay empty strings if not found.
    rating = ''
    ratings_count = ''
    for s in dom.by_class('star-box-details')[0].by_tag('span'):
        itemprop = s.attributes.get('itemprop')
        if itemprop == 'ratingValue':
            rating = clean_unicode(s.content)
        if itemprop == 'ratingCount':
            ratings_count = clean_unicode(s.content)

    # Format the lists as single strings.
    genres = concat_strings(genres)
    directors = concat_strings(directors)
    writers = concat_strings(writers)
    actors = concat_strings(actors)

    # Return everything of interest for this movie (all strings as specified
    # in the docstring of this function).
    return (title, duration, genres, directors, writers, actors, rating,
            ratings_count)
With the indentation corrected, the code runs and scrapes the IMDB Top 250 data.
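The listing calls a concat_strings helper that is not shown here. Going by the docstring ("semicolon separated if several"), it might look roughly like the sketch below; the body is an assumption, not the original script's implementation.

```python
def concat_strings(strings):
    """Join a list of strings into one semicolon-separated string.

    Hypothetical stand-in for the helper used by scrape_movie_page.
    """
    return ';'.join(strings)

# Made-up inputs to show the behaviour, including the empty case.
genres = concat_strings(['Crime', 'Drama'])
no_writers = concat_strings([])
```

An empty list yields an empty string, so a movie page with no writers listed would still produce a valid CSV field.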