2020-05-25


咸鱼一条,

Category: Notes. Tags: python

Article link: https://blog.csdn.net/weixin_46821220/article/details/106331493


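Today I used the Scrapy framework to crawl data from the 58921 website. Because the box-office figures on that site are rendered as images rather than text, I also called the Baidu AI interface to read them. The code is as follows: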

# -*- coding: utf-8 -*-
import re

import scrapy

import Move.baiduAI as baidu


class MsSpider(scrapy.Spider):
    name = "mS"
    allowed_domains = ["58921.com"]
    start_urls = ['http://58921.com/alltime']

    def parse(self, response):
        tr_list = response.xpath("//div[@class='table-responsive']/table/tbody/tr")
        item = {"move_name": [], "box_office": []}
        for i, tr in enumerate(tr_list):
            name = tr.xpath("./td/a/text()").extract_first()
            image_url = tr.xpath("./td/img/@src").extract_first()
            # Call the Baidu AI interface to read the box-office figure from the image
            match = re.search(r"\d+(?:\.\d+)?", baidu.run(image_url))
            if match is None:
                # No number recognized: skip the row so both lists stay aligned
                continue
            item["move_name"].append(name)
            item["box_office"].append(float(match.group()))
            print("Progress:", i)
        yield item
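
The number-extraction step in the spider can be tried on its own, without running a crawl. Below is a minimal sketch, assuming the recognition call returns a plain string with the figure somewhere in it; `extract_number` is a hypothetical helper name and is not part of the original project.

```python
import re
from typing import Optional

# Matches an integer or a decimal number such as "12" or "56.39".
# The dot is escaped so it only matches a literal decimal point.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_number(text: Optional[str]) -> Optional[float]:
    """Return the first number found in text, or None if there is none.

    Hypothetical helper for illustration; the original spider does this inline.
    """
    if not text:
        return None
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


if __name__ == "__main__":
    print(extract_number("56.39"))
    print(extract_number("no digits here"))
```

Returning None instead of raising lets the caller decide whether to skip the row or record a placeholder when the image could not be read.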

In the pipeline file I call the matplotlib library to visualize the processed data:

import matplotlib
from matplotlib import pyplot as plt

# Use a font with Chinese glyphs so the movie names render correctly
font = {
    'family': 'MicroSoft YaHei',
    'weight': 'bold',
    'size': 10
}

matplotlib.rc("font", **font)


class MovePipeline(object):
    def process_item(self, item, spider):
        positions = range(len(item["move_name"]))
        plt.figure(figsize=(20, 8), dpi=80)
        # One horizontal bar per movie, labeled with the movie name
        plt.barh(positions, item["box_office"], height=0.2)
        plt.yticks(positions, item["move_name"])
        plt.show()
        return item
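
With many rows, a horizontal bar chart gets crowded, so it can help to sort and trim the data before plotting. The following is a minimal sketch, assuming the item holds the same two parallel lists as in the pipeline; `top_entries` is a hypothetical helper, and the default limit of 20 is an arbitrary choice.

```python
from typing import Dict, List, Tuple


def top_entries(item: Dict[str, list], limit: int = 20) -> Tuple[List[str], List[float]]:
    """Return the `limit` highest box-office rows as (names, values).

    The result is ordered ascending, because barh draws index 0 at the
    bottom; this puts the largest bar at the top of the chart.
    Hypothetical helper for illustration, not part of the original project.
    """
    pairs = list(zip(item["box_office"], item["move_name"]))
    # Sort by box office only, highest first, then keep the top `limit` rows
    pairs = sorted(pairs, key=lambda pair: pair[0], reverse=True)[:limit]
    pairs.reverse()
    names = [name for _, name in pairs]
    values = [value for value, _ in pairs]
    return names, values
```

The returned lists can be passed to `plt.barh(range(len(names)), values)` and `plt.yticks(range(len(names)), names)` in place of the raw item lists.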

The result is as follows:
[Image: horizontal bar chart of the scraped box-office figures]
