1. Installing Scrapy
Installing Scrapy with a plain pip install scrapy is not recommended: Scrapy has many dependencies, and installing them one by one is tedious. Installing through Anaconda is easier. After installing Anaconda, open the Anaconda Prompt, install Scrapy (conda install scrapy pulls in the dependencies for you), and then create a project with
scrapy startproject maoyan
Once the project has been created, Scrapy prints the path of the generated project directory. Open that directory in PyCharm, and take care to select the right interpreter in PyCharm: it should point to the Anaconda environment that Scrapy was installed into.
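Put together, the steps in the Anaconda Prompt look roughly like this (the exact conda invocation is an assumption; any channel that provides Scrapy will do):

conda install scrapy          # installs Scrapy together with its dependencies
scrapy startproject maoyan    # generates the project skeleton
cd maoyan                     # later commands are run from inside the project directory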
2. Project layout
The generated files serve the following purposes:
the spiders directory holds the main spider code, i.e. the parsing of the pages you want to crawl;
items.py defines the field names of the information you want to scrape and acts as the container the scraped data is stored in;
pipelines.py further processes the data you collect; database-related code usually lives here;
settings.py holds the project-wide settings, such as the user agent and which pipelines should run.
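For reference, a freshly generated project looks roughly like this (middlewares.py may or may not be present depending on the Scrapy version):

maoyan/
    scrapy.cfg            # deployment configuration
    maoyan/
        __init__.py
        items.py          # item field definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines (e.g. database writes)
        settings.py       # project-wide settings
        spiders/          # spider code goes here
            __init__.py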
3. The code
Create a new Python file, maoyan.py, inside the spiders directory:
import scrapy
from maoyan.items import MaoyanItem


class MaoyanSpider(scrapy.Spider):  # the spider must inherit from scrapy.Spider
    name = "maoyan"                 # the spider name used by "scrapy crawl"
    allowed_domains = ['maoyan.com']
    start_urls = ['http://maoyan.com/board/4?offset=0']  # the URL to crawl

    def parse(self, response):      # parse the response returned for each request
        item = MaoyanItem()
        item['stars'] = response.css('p.star::text').extract()
        item['title'] = response.css('p.name a::text').extract()
        item['time'] = response.css('p.releasetime::text').extract()
        item['score'] = response.css('p.score ::text').extract()
        print(item['title'])
        yield item
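If the CSS selectors stop matching (Maoyan's page markup can change), the Scrapy shell is a convenient place to try them out interactively before touching the spider; assuming the page is reachable with the default settings, a session looks like this:

scrapy shell "http://maoyan.com/board/4?offset=0"
>>> response.css('p.name a::text').extract()
>>> response.css('p.score ::text').extract()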
items.py defines the fields to scrape: the film title, the cast, the release date and the score.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    stars = scrapy.Field()
    title = scrapy.Field()
    time = scrapy.Field()
    score = scrapy.Field()
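An Item behaves much like a dict, except that only the fields declared above may be assigned; a quick illustration (not part of the project files):

item = MaoyanItem()
item['title'] = ['Farewell My Concubine']  # fine: 'title' is a declared field
item['rank'] = 1                           # raises KeyError: 'rank' was never declared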
pipelines.py writes the scraped information to the database. This needs the pymysql package, which can be installed from the command prompt with
pip install pymysql
Then write the file:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class MaoyanPipeline(object):
    def __init__(self):
        # connect to the local MySQL database "spider"
        self.connect = pymysql.connect(host="localhost", user="root",
                                       password="168168", database="spider")
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        try:
            sql = """insert into moive(title, star, releasetime, score)
                     values (%s, %s, %s, %s)
                     on duplicate key update title = title"""
            # each item carries the lists for one page; the score is split by the
            # page into an integer part and a fraction part, so the two text nodes
            # are concatenated back together here
            for i in range(0, len(item['title'])):
                self.cursor.execute(sql, (item['title'][i],
                                          item['stars'][i],
                                          item['time'][i],
                                          item['score'][2 * i] + item['score'][2 * i + 1]))
            self.connect.commit()
        except Exception as error:
            print(error)
        return item
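The pipeline assumes that a MySQL database named spider already exists and contains a table whose columns match the insert statement above; the on duplicate key update clause only has an effect if the table has a unique key, here assumed to be on title. A minimal schema along those lines (the column types are assumptions) could be:

create database if not exists spider default charset utf8mb4;
use spider;
create table if not exists moive (
    title       varchar(100) primary key,  -- the key that "on duplicate key update" relies on
    star        varchar(200),
    releasetime varchar(100),
    score       varchar(10)
);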
Finally, enable the pipeline in settings.py and set a user agent. Both entries are commented out by default, so you can add them directly:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
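Depending on the Scrapy version and on what the site's robots.txt allows, two more settings may be needed; treat these as assumptions rather than requirements:

ROBOTSTXT_OBEY = False   # recent Scrapy versions obey robots.txt by default, which can block the crawl
DOWNLOAD_DELAY = 1       # a small delay between requests is gentler on the site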
4. Running the spider
In the Anaconda Prompt, make sure to cd into the project directory first, then run
scrapy crawl maoyan
After the run, the item contents appear in the console output, and the same rows can be seen when you query the database.
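For example, a quick check in the MySQL client (table name as used in the pipeline above):

select title, star, releasetime, score from moive limit 10;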