Scrapy Installation and Demo

1. Installing Scrapy

Installing Scrapy directly with pip install scrapy is not recommended: Scrapy has many dependencies, and installing them one by one is tedious. Install Anaconda instead; once it is set up, open the Anaconda Prompt.
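
Installing through conda pulls in Scrapy together with its dependencies in one step (a minimal sketch; the conda-forge channel also carries Scrapy if the default channel does not have a build for your platform):

conda install scrapy

Once that finishes, create a new Scrapy project: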

scrapy startproject maoyan

 

After the project is created, Scrapy prints the path of the new project directory. Open that project in PyCharm, and make sure PyCharm is using the interpreter from the Anaconda environment where Scrapy is installed.

 

2. Project Layout

 

Here is what each file in the project is for:

The spiders directory is where the spider code goes, i.e. the parsing of the pages you want to crawl.

items.py defines the field names of the information you want to scrape; it acts as a container that holds the scraped data.

pipelines.py post-processes the scraped data; database operations usually go here.

settings.py holds the project-wide configuration: global settings, the user agent (and proxies, if you use them), which pipelines run, and so on.
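
For reference, this is the layout scrapy startproject maoyan generates (middlewares.py appears in newer Scrapy versions):

maoyan/
    scrapy.cfg            # deployment configuration
    maoyan/               # the project's Python package
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # your spiders go here
            __init__.py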

3. The Code

Create a new Python file named maoyan.py under the spiders directory:

import scrapy

from maoyan.items import MaoyanItem


class MaoyanSpider(scrapy.Spider):  # a spider must subclass scrapy.Spider
    name = "maoyan"  # the spider's name, used by `scrapy crawl`
    allowed_domains = ['maoyan.com']
    start_urls = ['http://maoyan.com/board/4?offset=0']  # the URL to crawl

    def parse(self, response):  # parse the response of each request
        item = MaoyanItem()
        item['stars'] = response.css('p.star::text').extract()
        item['title'] = response.css('p.name a::text').extract()
        item['time'] = response.css('p.releasetime::text').extract()
        # Maoyan splits each score into two tags (integer and fraction parts),
        # so this selector returns two text nodes per movie
        item['score'] = response.css('p.score ::text').extract()
        print(item['title'])
        yield item
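
If any of these selectors comes back empty (Maoyan tweaks its markup from time to time, and may reject requests without a browser-like user agent), Scrapy's interactive shell is a quick way to test selectors before running the whole spider:

scrapy shell 'http://maoyan.com/board/4?offset=0'
>>> response.css('p.name a::text').extract()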



items.py defines the fields to scrape: the movie title, cast, release time, and score.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    stars = scrapy.Field()  # cast
    title = scrapy.Field()  # movie title
    time = scrapy.Field()   # release time
    score = scrapy.Field()  # rating

pipelines.py writes the scraped data to the database. It needs the pymysql package, installed from the command prompt with:

pip install pymysql
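
The pipeline below assumes a MySQL database named spider containing a moive table (keeping the original table name) whose title column has a unique key, so the ON DUPLICATE KEY clause can skip duplicates. A sketch of the schema, with assumed column types:

CREATE TABLE moive (
    title       VARCHAR(128) NOT NULL,
    star        VARCHAR(255),
    releasetime VARCHAR(64),
    score       VARCHAR(8),
    UNIQUE KEY uk_title (title)
);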

Then write the pipeline:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class MaoyanPipeline(object):
    def __init__(self):
        self.connect = pymysql.connect(host="localhost", user="root",
                                       password="168168", database="spider")
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        try:
            # "on duplicate key update title = title" is a no-op update:
            # rows whose title already exists are skipped instead of erroring
            sql = """insert into moive(title, star, releasetime, score)
                     values (%s, %s, %s, %s)
                     on duplicate key update title = title"""
            for i in range(len(item['title'])):
                # rejoin the two halves of the score (integer + fraction parts)
                score = item['score'][2 * i] + item['score'][2 * i + 1]
                self.cursor.execute(sql, (item['title'][i], item['stars'][i],
                                          item['time'][i], score))
            self.connect.commit()
        except Exception as error:
            print(error)
        return item

    def close_spider(self, spider):
        # close the database connection when the spider finishes
        self.connect.close()

Finally, in settings.py, enable the pipeline and set the user agent. Both entries are present but commented out in the generated file, so you can simply add them:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'


ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
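
One more setting can matter: newly generated Scrapy projects ship with ROBOTSTXT_OBEY = True, and if the site's robots.txt disallows the crawl the spider will fetch nothing. In that case, disable it:

ROBOTSTXT_OBEY = False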

4. Running the Spider

In the Anaconda Prompt, cd into the project directory (this step is required), then run:

scrapy crawl maoyan

While it runs, the console prints the contents of each item.

You can also open the database and see the inserted rows.
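
If you just want to sanity-check the scraped data without setting up MySQL, Scrapy's built-in feed export can dump the items to a file instead:

scrapy crawl maoyan -o movies.json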
