Complete code for scraping the Douban Top 250 movie chart with the Scrapy framework

Note: this is for reference only. If you plan to hand it in as coursework, change some of it first.

db.py:

import scrapy
from ..items import DoubanItem


class DbSpider(scrapy.Spider):
    name = "db"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]
    page_num = 0  # class-level counter, so it survives across parse() calls

    def parse(self, response):
        node_list = response.xpath('//div[@class="info"]')
        for node in node_list:
            # Yield a DoubanItem (not a plain dict) so items.py and the
            # pipeline stay in sync
            item = DoubanItem()
            item['movie_name'] = node.xpath('.//div[@class="hd"]/a/span/text()').get()
            # The first text node of the <p> holds the director line
            item['director'] = node.xpath('.//div[@class="bd"]/p/text()').get('').strip()
            item['score'] = node.xpath('.//span[@class="rating_num"]/text()').get()
            # A few entries have no quote, so default to '' instead of None
            item['description'] = node.xpath('.//p[@class="quote"]/span/text()').get('')
            yield item

        # The original put page_num = 0 inside parse(), which reset the
        # counter on every call and re-requested the same page forever;
        # a class attribute keeps the count across pages.
        if node_list and self.page_num < 3:
            self.page_num += 1
            page_url = "https://movie.douban.com/top250?start={}&filter=".format(self.page_num * 25)
            yield scrapy.Request(page_url, callback=self.parse)


# https://movie.douban.com/top250?start=25&filter=
# https://movie.douban.com/top250?start=50&filter=
# https://movie.douban.com/top250?start=75&filter=
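Hard-coding the page count works for a fixed Top 250, but a more robust pattern is to follow the "next" link each list page renders at the bottom. A minimal sketch of the tail of parse(), assuming the page keeps its current <span class="next"> markup:

# Drop-in replacement for the page-counting block at the end of parse()
next_href = response.xpath('//span[@class="next"]/a/@href').get()
if next_href:
    # The href is relative ("?start=25&filter="), so let Scrapy resolve it
    yield scrapy.Request(response.urljoin(next_href), callback=self.parse)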

items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    movie_name = scrapy.Field()
    director = scrapy.Field()
    score = scrapy.Field()
    # must match the key the spider fills in ('description', not 'desc')
    description = scrapy.Field()

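Field names must match the keys the spider assigns, otherwise Scrapy raises a KeyError when the item is filled. A quick illustrative check (the sample values are invented, and the import path depends on your project name, assumed here to be douban):

from douban.items import DoubanItem

item = DoubanItem(movie_name='肖申克的救赎', director='导演: 弗兰克·德拉邦特',
                  score='9.7', description='希望让人自由。')
print(dict(item))  # dict(item) is exactly what the JSON pipeline serializes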



main.py:

import os.path
import sys
from scrapy.cmdline import execute

# Make sure the project root is importable, then launch the spider
currentFile = os.path.abspath(__file__)
currentPath = os.path.dirname(currentFile)
sys.path.append(currentPath)
execute(["scrapy", "crawl", "db"])
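main.py is just a convenience for launching the crawl from an IDE; from the project root the equivalent command is scrapy crawl db. Adding an output feed is handy for a quick sanity check. A sketch (the -O overwrite flag needs Scrapy 2.1+; older versions only have -o, which appends):

# Same launcher, but also dump the items to a file for a quick look
execute(["scrapy", "crawl", "db", "-O", "top250_check.json"])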

pipelines.py:

import json


class DoubanPipeline:
    def open_spider(self, spider):
        # One JSON object per line (JSON Lines), opened once per crawl
        self.f = open('maoer1.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        json_str = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(json_str)
        return item

    def close_spider(self, spider):
        self.f.close()
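The pipeline only runs if it is enabled in settings.py. A minimal sketch, assuming the project is named douban (adjust the dotted path to your own project name). Douban also tends to reject the default Scrapy User-Agent, so a browser-like one is usually needed:

# settings.py -- assumed project name 'douban', adjust to yours
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,  # lower number = runs earlier
}
# Douban often rejects the default Scrapy UA; send a browser-like one
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
# Disable the robots.txt check if your requests are being filtered by it
ROBOTSTXT_OBEY = False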

Loading the scraped data into MySQL:

import mysql.connector
import json

conn = mysql.connector.connect(
    host="127.0.0.1",
    user="root",
    password="010208",
    database="spider",
    port=3306,
    charset="utf8"
)

cursor = conn.cursor()

# maoer1.json is JSON Lines (one object per line), so parse it line by
# line; json.load() on the whole file would fail here
with open('maoer1.json', 'r', encoding='utf-8') as file:
    for line in file:
        entry = json.loads(line)
        description = entry.get('description', '')  # default if the field is missing
        movie_name = entry.get('movie_name', '')
        director = entry.get('director', '')
        score = entry.get('score', '')

        sql = "INSERT INTO spider10 (description, movie_name, director, score) VALUES (%s, %s, %s, %s)"
        cursor.execute(sql, (description, movie_name, director, score))
conn.commit()

cursor.close()
conn.close()
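The script assumes a spider10 table already exists. A hypothetical definition matching the four inserted columns (the column sizes are my guesses, not from the original post; adjust as needed):

import mysql.connector

conn = mysql.connector.connect(host='127.0.0.1', user='root', password='010208',
                               database='spider', port=3306, charset='utf8')
cursor = conn.cursor()
# Hypothetical schema -- sizes are assumptions, adjust to your data
cursor.execute('''
    CREATE TABLE IF NOT EXISTS spider10 (
        id INT AUTO_INCREMENT PRIMARY KEY,
        movie_name VARCHAR(255),
        director VARCHAR(255),
        score VARCHAR(16),
        description VARCHAR(512)
    ) DEFAULT CHARSET = utf8mb4
''')
conn.commit()
cursor.close()
conn.close()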

Visualization:

import json
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from wordcloud import WordCloud
import jieba

with open('./maoer1.json', 'r', encoding='utf-8') as file:
    movies = [json.loads(line) for line in file]
df = pd.DataFrame(movies)
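Equivalently, pandas can read the JSON Lines file in one call:

df = pd.read_json('./maoer1.json', lines=True)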

# Any CJK-capable font works; this assumes NotoSerifSC-Black.ttf sits
# next to the script
font_path = './NotoSerifSC-Black.ttf'
font_prop = FontProperties(fname=font_path)
plt.rcParams['font.family'] = font_prop.get_name()

plt.figure(figsize=(10, 6))
plt.barh(df['movie_name'], df['score'].astype(float), color='skyblue')
plt.xlabel('评分', fontproperties=font_prop)
plt.ylabel('电影名称', fontproperties=font_prop)
plt.title('电影评分柱状图', fontproperties=font_prop)
plt.yticks(fontproperties=font_prop)
plt.gca().invert_yaxis()  # highest-ranked movie at the top
plt.show()


# Some movies have no quote; fillna guards against NaN before joining
text = ' '.join(df['description'].fillna(''))
wordlist = jieba.cut(text, cut_all=False)
wl_space_split = " ".join(wordlist)

wordcloud = WordCloud(
    font_path=font_path,
    width=800,
    height=400,
    background_color='white'
).generate(wl_space_split)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('电影描述词云', fontproperties=font_prop)
plt.show()
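If the cloud ends up dominated by one-character function words, filtering the jieba tokens first usually gives a cleaner picture. A small sketch (the stopword set here is a tiny illustrative sample, not a real list):

# Drop single-character tokens and a few common stopwords before building the cloud
stopwords = {'的', '是', '了', '和', '在'}  # tiny illustrative sample
tokens = [w for w in jieba.cut(text, cut_all=False)
          if len(w) > 1 and w not in stopwords]
wl_space_split = ' '.join(tokens)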

Done. These three pieces together were my final-project submission (I took the exam long ago, and the grade turned out fine). Take a look if you are interested, and feel free to offer pointers. I know there are simpler ways to do this, but laziness won and I did not write them up. I wrote all this on a whim, so that is that.

If you have any questions about this post, you are welcome to reach out and we can learn from each other; as long as I am online, I will definitely reply.

Below is another concrete Douban Top 250 crawler built on the Scrapy framework, this time following each movie through to its detail page:

import scrapy


class DoubanMovieItem(scrapy.Item):
    # The data fields we need
    name = scrapy.Field()
    score = scrapy.Field()
    director = scrapy.Field()
    actors = scrapy.Field()


class DoubanMovieSpider(scrapy.Spider):
    name = "douban_movie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        # Collect the link to each movie's detail page
        movie_links = response.css(".hd a::attr(href)").extract()
        for link in movie_links:
            yield scrapy.Request(link, callback=self.parse_movie)

        # Grab the next-page link and keep going; the href is relative,
        # so resolve it against the current URL
        next_page = response.css(".next a::attr(href)").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_movie(self, response):
        item = DoubanMovieItem()
        item["name"] = response.css("[property='v:itemreviewed']::text").extract_first()
        item["score"] = response.css(".rating_num::text").extract_first()
        item["director"] = response.css("[rel='v:directedBy']::text").extract_first()
        item["actors"] = response.css("[rel='v:starring']::text").extract()
        yield item

In the code above, we first define the data fields we need: movie name, score, director, and actors. We then define a Spider class named DoubanMovieSpider whose start_urls attribute sets the initial URL. In parse we collect the link to each movie in the list and pass each link on to parse_movie via yield scrapy.Request for further processing, then pick up the next-page link and continue. In parse_movie we use CSS selectors to extract the data we need, store it in a DoubanMovieItem object, and finally return that object with yield.
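Whichever spider you run, it is worth rate-limiting so Douban does not block you. Illustrative settings.py values (tune to taste):

# settings.py -- illustrative politeness settings
DOWNLOAD_DELAY = 1.0         # roughly one request per second at most
AUTOTHROTTLE_ENABLED = True  # let Scrapy back off automatically under load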
