1. Scrapy Basics
- Definition
  Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.
- Crawlers and Scrapy
  A general crawler framework:
  How the Scrapy framework runs:
- The Scrapy framework is built from five core components:
  - Item Pipeline
  - Spiders
  - Scheduler
  - Downloader
  - Engine
- Workflow (a minimal spider illustrating this flow follows the list):
1) When a target site address is entered, the engine finds the Spider that handles that site and asks it for the first URL(s) to crawl.
2) The engine receives the first URL(s) to crawl from the Spider and schedules them in the Scheduler as Requests.
3) The engine asks the Scheduler for the next URL to crawl.
4) The Scheduler returns the next URL to crawl to the engine, and the engine forwards it through the downloader middleware (request direction) to the Downloader.
5) Once the page finishes downloading, the Downloader generates a Response for it and sends it back through the downloader middleware (response direction) to the engine.
6) The engine receives the Response from the Downloader and sends it through the spider middleware (input direction) to the Spider for processing.
7) The Spider processes the Response and returns the scraped Items, along with any new Requests, to the engine.
8) The engine passes the scraped Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the Scheduler.
9) Repeat (from step 2) until there are no more Requests in the Scheduler, then the engine closes the site.
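This flow is easiest to see from the spider's side: everything `parse()` yields is routed by the engine, Items toward the Item Pipeline (steps 7-8) and Requests back to the Scheduler (steps 2-4). A minimal sketch, assuming the public demo site quotes.toscrape.com (not part of the Douban example later in this post):
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']  # assumed demo site

    def parse(self, response):
        # Each yielded dict is an Item: the engine hands it to the Item Pipeline
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}
        # Each yielded Request goes back to the Scheduler, restarting the cycle
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```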
- Scrapy project structure
  DouBan/
  ├── DouBan
  │   ├── __init__.py
  │   ├── items.py          # data model (template) for the items to store
  │   ├── middlewares.py    # middleware definitions
  │   ├── pipelines.py      # data storage and post-processing module
  │   ├── __pycache__
  │   ├── settings.py
  │   └── spiders           # core directory holding the spiders
  │       ├── __init__.py
  │       └── __pycache__
  └── scrapy.cfg
2. Using Scrapy
- Create a project
scrapy startproject project_name
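For example, the DouBan project whose structure is shown above would have been created with:
scrapy startproject DouBan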
- Generate a spider
scrapy genspider example example.com
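This writes a skeleton spider into the spiders/ directory. For the command above it would look roughly like this (the stock template, not project-specific code):
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass  # parsing logic goes here
```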
- Fill in each file to complete the crawler
items.py: defines the crawler's data model, similar to an entity class.
pipelines.py: pipeline file, responsible for processing the data the spider returns.
spiders directory: holds the spider classes, which inherit from Scrapy's spider base classes.
scrapy.cfg: base Scrapy project configuration.
__init__.py: package initialization file.
settings.py: configuration for the crawler as a whole.
- Run the crawler
scrapy crawl spider_name
Note that crawl takes the spider's name attribute, not the project name.
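For the Douban example below, whose spider is named douban, that would be:
scrapy crawl douban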
Example: scraping the Douban Top 250 movie list
- douban.py
```python
import scrapy
from scrapy import Request
from scrapy_redis.spiders import RedisSpider

from DouBan.items import DoubanItem


class DoubanSpider(RedisSpider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # RedisSpider reads its start URLs from this Redis list instead of start_urls
    redis_key = 'douban:start_urls'
    # The URL to push into Redis to kick off the crawl
    url = 'https://movie.douban.com/top250'

    def parse(self, response):
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            # Create a fresh item per movie; a single shared item would be
            # overwritten on every iteration before the detail page is parsed.
            item = DoubanItem()
            item['title'] = movie.xpath('.//span[@class="title"]/text()').extract()[0]
            item['rating_num'] = movie.xpath('.//span[@class="rating_num"]/text()').extract()[0]
            name = movie.xpath('.//div[@class="bd"]/p/text()').extract()[0].strip()
            # The credits line reads "导演: ... 主演: ..."; keep the director part
            item['author'] = name.split('主演')[0]
            evaluates = movie.xpath('.//span[@class="inq"]/text()').extract()
            item['evaluate'] = evaluates[0] if evaluates else ''  # some entries have no tagline
            item['comment_num'] = movie.xpath('.//div[@class="star"]/span/text()').extract()[1]
            item['img_url'] = movie.xpath('.//div[@class="pic"]/a/img/@src').extract()[0]
            item['detail_url'] = movie.xpath('.//div[@class="pic"]/a/@href').extract()[0]
            # Follow the detail page, carrying the half-filled item in meta
            yield Request(item['detail_url'], meta={'item': item},
                          callback=self.detailParser)

    def detailParser(self, response):
        item = response.request.meta['item']
        item['movie_length'] = response.xpath(
            '//span[@property="v:runtime"]/text()').extract()[0]
        yield item
```
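Because DoubanSpider is a RedisSpider, `scrapy crawl douban` idles until a start URL is pushed onto the douban:start_urls list. A sketch of kicking it off with the redis-py client, using the host and port configured in settings.py below:
```python
import redis

# Connect to the Redis instance configured in settings.py
r = redis.Redis(host='172.25.254.46', port=6379)
# Pushing the start URL wakes up every worker running this spider
r.lpush('douban:start_urls', 'https://movie.douban.com/top250')
```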
- items.py
```python
import scrapy


class DoubanItem(scrapy.Item):
    # Fields scraped from the Top 250 list page
    title = scrapy.Field()
    rating_num = scrapy.Field()
    author = scrapy.Field()
    evaluate = scrapy.Field()
    comment_num = scrapy.Field()
    img_url = scrapy.Field()
    detail_url = scrapy.Field()
    img_path = scrapy.Field()      # filled in by the images pipeline
    movie_length = scrapy.Field()  # filled in from the detail page
```
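A DoubanItem behaves like a dict whose keys are restricted to the declared fields, which catches field-name typos early. A small usage sketch:
```python
from DouBan.items import DoubanItem

item = DoubanItem()
item['title'] = '肖申克的救赎'  # declared field: OK
# item['year'] = 1994          # undeclared field: raises KeyError
print(dict(item))              # {'title': '肖申克的救赎'}
```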
- pipelines.py
```python
import json

import pymysql
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class DoubanPipeline(object):
    """Default pass-through pipeline."""

    def process_item(self, item, spider):
        return item


class AddScoreNum(object):
    """Add one point to every rating before it is stored."""

    def process_item(self, item, spider):
        if item['rating_num']:
            score = float(item['rating_num'])
            item['rating_num'] = str(score + 1)
            return item
        raise DropItem('Missing rating_num, scrape failed')


class JsonWritePipeline(object):
    """Write every item to douban.json."""

    def open_spider(self, spider):
        self.file = open('douban.json', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), indent=4, ensure_ascii=False)
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()


class MysqlPipeline(object):
    """Store items in MySQL, creating the table on spider start."""

    def open_spider(self, spider):
        self.connect = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            db='scrapyProject',
            user='root',
            password='westos',
            charset='utf8',
            use_unicode=True,
            autocommit=True,
        )
        self.cursor = self.connect.cursor()
        self.cursor.execute(
            "create table if not exists doubanTop("
            "title varchar(50) unique,"
            "rating_num float,"
            "author varchar(100),"
            "comment_num int,"
            "evaluate varchar(100));")

    def process_item(self, item, spider):
        # Parameterized query: pymysql handles quoting and escaping
        insert_sqli = ("insert into doubanTop"
                       "(title, rating_num, author, comment_num, evaluate) "
                       "values (%s, %s, %s, %s, %s)")
        try:
            self.cursor.execute(insert_sqli, (
                item['title'], item['rating_num'], item['author'],
                item['comment_num'], item['evaluate']))
            self.connect.commit()
        except Exception:
            self.connect.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()


class MyImagesPipeline(ImagesPipeline):
    """Download each item's poster and record the local path."""

    def get_media_requests(self, item, info):
        yield scrapy.Request(item['img_url'])

    def item_completed(self, results, item, info):
        img_paths = [x['path'] for ok, x in results if ok]
        if not img_paths:
            raise DropItem('Item contains no images')
        item['img_path'] = img_paths[0]
        return item
```
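Pipelines with a lower number in ITEM_PIPELINES run first, so a post-processing step such as AddScoreNum must get a smaller value than the storage pipelines that consume its output. As a further illustration of the process_item contract, a hypothetical pipeline that discards duplicate titles with DropItem:
```python
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    """Hypothetical example: drop items whose title was already seen."""

    def open_spider(self, spider):
        self.seen_titles = set()

    def process_item(self, item, spider):
        if item['title'] in self.seen_titles:
            # DropItem stops the item from reaching later pipelines
            raise DropItem('Duplicate title: %s' % item['title'])
        self.seen_titles.add(item['title'])
        return item
```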
- settings.py
```python
BOT_NAME = 'DouBan'

SPIDER_MODULES = ['DouBan.spiders']
NEWSPIDER_MODULE = 'DouBan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'DouBan (+http://www.yourdomain.com)'
from fake_useragent import UserAgent
ua = UserAgent()
USER_AGENT = ua.random   # one random UA is picked per run, at import time

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    # 'DouBan.pipelines.MyImagesPipeline': 2,
    # 'DouBan.pipelines.DoubanPipeline': 300,
    # 'DouBan.pipelines.JsonWritePipeline': 200,
    'scrapy_redis.pipelines.RedisPipeline': 100,
    # 'DouBan.pipelines.AddScoreNum': 100,
    # 'DouBan.pipelines.MysqlPipeline': 200,
}

# scrapy-redis: share the dupefilter and request queue across workers
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
REDIS_HOST = '172.25.254.46'
REDIS_PORT = 6379

# ImagesPipeline settings (used when the image pipelines are enabled)
IMAGES_STORE = './images'
IMAGES_THUMBS = {
    'small': (100, 100),
    'big': (270, 270),
}
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```
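With RedisPipeline enabled, serialized items accumulate in a Redis list (named '<spider>:items' by scrapy-redis's default key pattern), so a run can be checked from any machine. A sketch using redis-py:
```python
import redis

r = redis.Redis(host='172.25.254.46', port=6379)
# RedisPipeline stores items under '<spider name>:items' by default
print(r.llen('douban:items'))   # number of scraped items so far
```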
Result:
The scraped data can be written to a CSV file or stored in a database (or, with the Redis pipeline enabled as above, collected in Redis).
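If the Redis pipeline is disabled, the simplest way to get a CSV is Scrapy's built-in feed export, which infers the format from the file extension:
scrapy crawl douban -o douban.csv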