Data Collection and Cleaning (II)

Practice 6 – Installing and Configuring Scrapy

Installing Scrapy on Linux

  1. Activate the Python virtual environment
  2. Install Twisted
wget https://twistedmatrix.com/Releases/Twisted/17.1/Twisted-17.1.0.tar.bz2
tar -jxvf Twisted-17.1.0.tar.bz2
cd Twisted-17.1.0
python3 setup.py install
  3. Install Scrapy

pip install -i https://pypi.doubanio.com/simple/ scrapy

Creating a Scrapy Project

  1. Create the Scrapy project

scrapy startproject douban

  2. Add the first spider
cd douban
scrapy genspider douban_spider movie.douban.com  # allowed domain

Practice 7 – Scraping Douban Movie Data with Scrapy

  • The four steps of a Scrapy crawl:
    1. Create the project
    2. Define the targets
    3. Build the spider
    4. Store the content
  1. Modify the settings (settings.py)
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False
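Alongside the user agent, a crawl delay is worth setting so the spider does not hammer the site. `DOWNLOAD_DELAY` is a standard Scrapy setting; the value below is only a suggestion, not part of the original tutorial:

```python
# settings.py (optional): wait between requests to reduce load on the site
# and lower the chance of being rate-limited. 0.5 s is an arbitrary choice.
DOWNLOAD_DELAY = 0.5
```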
  2. Define the targets (items.py)
# items.py
import scrapy


class DoubanItem(scrapy.Item):
    serial_number = scrapy.Field()
    movie_name = scrapy.Field()
    introduce = scrapy.Field()
    stars = scrapy.Field()
    evaluate = scrapy.Field()
    describe = scrapy.Field()
  3. Build the spider
# douban_spider.py
import scrapy
from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Parse the current page
        movie_list = response.xpath("//ol[@class='grid_view']/li")
        for item in movie_list:
            douban_item = DoubanItem()
            douban_item["serial_number"] = item.xpath(".//div[@class='pic']/em/text()").extract_first()
            douban_item["movie_name"] = item.xpath(".//div[@class='info']/div[@class='hd']/a/span[1]/text()").extract_first()
            content = item.xpath(".//div[@class='info']/div[@class='bd']/p[1]/text()").extract()
            content_s = ""
            for i_content in content:
                line = "".join(i_content.split())
                content_s = content_s + line + "/"
            douban_item["introduce"] = content_s
            douban_item["stars"] = item.xpath(".//div[@class='info']//div[@class='star']/span[@class='rating_num']/text()").extract_first()
            douban_item["evaluate"] = item.xpath(".//div[@class='info']//div[@class='star']/span[4]/text()").extract_first()
            douban_item["describe"] = item.xpath(".//div[@class='info']/div[@class='bd']//span[@class='inq']/text()").extract_first()

            yield douban_item

        # Find the next page
        next_link = response.xpath("//span[@class='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request("https://movie.douban.com/top250" + next_link, callback=self.parse)
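Two small pieces of the parse() logic can be checked in isolation. The sketch below reproduces the whitespace-stripping join used to build `introduce`, and shows that the stdlib `urljoin` yields the same next-page URL as the string concatenation above; the sample strings are made-up placeholders, not real scraped data:

```python
from urllib.parse import urljoin

def clean_join(lines):
    """Strip all whitespace inside each line and join the pieces with '/'
    (mirrors the loop that builds douban_item['introduce'])."""
    out = ""
    for line in lines:
        out += "".join(line.split()) + "/"
    return out

# Hypothetical sample of what the <p> text nodes look like after extract()
sample = ["\n  1994 / USA  ", "  Dir: Frank Darabont  "]
print(clean_join(sample))  # 1994/USA/Dir:FrankDarabont/

# Building the next-page URL: plain concatenation works here because the
# extracted href is a bare query string such as '?start=25&filter='.
next_link = "?start=25&filter="
print(urljoin("https://movie.douban.com/top250", next_link))
# https://movie.douban.com/top250?start=25&filter=
```

In a real spider, `response.urljoin(next_link)` is the more robust way to build the absolute URL, since it also handles relative paths.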

To make running the spider more convenient, you can add a script that executes the following command:

scrapy crawl douban_spider

Create a new main.py file

# main.py
from scrapy import cmdline


cmdline.execute('scrapy crawl douban_spider'.split())
  4. Store the content

Add the database connection settings to settings.py

mongo_host = '127.0.0.1'
mongo_port = 27017
mongo_db_name = 'douban'
mongo_db_collection = 'douban_movie'
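A variant worth considering is reading the same values from environment variables, so the connection details are not hard-coded; the `MONGO_*` variable names below are an assumption for illustration, not a Scrapy convention:

```python
# settings.py (sketch): the same connection values as above, but
# overridable via environment variables, falling back to the defaults.
import os

mongo_host = os.environ.get('MONGO_HOST', '127.0.0.1')
mongo_port = int(os.environ.get('MONGO_PORT', '27017'))
mongo_db_name = os.environ.get('MONGO_DB_NAME', 'douban')
mongo_db_collection = os.environ.get('MONGO_DB_COLLECTION', 'douban_movie')
```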

Add your own pipeline in pipelines.py

import pymongo
from douban.settings import mongo_host, mongo_port, mongo_db_name, mongo_db_collection


class DoubanPipeline(object):
    def __init__(self):
        host = mongo_host
        port = mongo_port
        dbname = mongo_db_name
        sheetname = mongo_db_collection
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.post.insert_one(data)  # Collection.insert() was removed in PyMongo 4
        return item
 

Enable the custom pipeline in settings.py

# Configure item pipelines
ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}
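The number assigned to each pipeline (0–1000) sets the order in which pipelines run: lower values run first. For example, with the hypothetical `CleaningPipeline` below (not part of this project), items would be cleaned before `DoubanPipeline` writes them to MongoDB:

```python
ITEM_PIPELINES = {
    'douban.pipelines.CleaningPipeline': 200,  # hypothetical earlier stage
    'douban.pipelines.DoubanPipeline': 300,    # runs second, stores the item
}
```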