Scrapy
Installing on Windows
Taking PyCharm as an example: run pip install scrapy directly in the built-in terminal, or search for and add the package under Settings → Project Interpreter.
Installing on Linux
Install the dependencies according to the official guide:
- Link: https://docs.scrapy.org/
- Dependencies: sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
If this step fails with "Unable to locate package <name>", don't panic: refresh the package index with sudo apt-get update, then rerun the install and it will go through.
- sudo apt-get install python3 python3-dev
Setting up a virtual environment
- Create:
  - virtualenv -p /usr/bin/python3 env
  Here env is the environment's name; anything recognizable at a glance is best.
- Activate:
  - source env/bin/activate
- Deactivate:
  - deactivate
Installing Scrapy inside the virtual environment
- pip install scrapy
This pulls in quite a few dependencies, so be patient. On a poor connection the downloads may time out repeatedly; if an attempt fails, just rerun the command until everything installs.
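If the installs keep timing out, pip can be given a longer network timeout and pointed at a closer package index. The mirror URL below is only an example; substitute any index you trust:

```shell
# One-off: longer network timeout plus an alternative package index
pip install scrapy --timeout 120 -i https://pypi.tuna.tsinghua.edu.cn/simple
```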
Creating a project
- scrapy startproject example (example being the project name)
- Creating a spider:
  - Follow the prompt and change into the project directory (cd example)
  - scrapy genspider examplespider examplespider.com
Running the project
- Option 1, from the command line: scrapy crawl examplespider
- Option 2, create a run.py:

```python
from scrapy.cmdline import execute

# The two lines below are equivalent; the second is just quicker to type.
# execute(['scrapy', 'crawl', 'examplespider'])
execute('scrapy crawl examplespider'.split())
```

Then run run.py.
Three ways to add headers
- Global
  - Set DEFAULT_REQUEST_HEADERS in settings.py; this is project-wide and affects every spider.
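A minimal illustration of the global form (the values are examples; in a real project these lines go into the generated settings.py):

```python
# settings.py -- applies to every spider in the project
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
```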
- Per-spider
  - Option 1: add custom_settings in the spider file under spiders/, like so:

```python
custom_settings = {
    'DEFAULT_REQUEST_HEADERS': {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        'User-Agent': 'your user agent string',
    }
}
```
  - Option 2: set the header on the request object itself:

```python
seq = scrapy.Request(url=url, callback=self.parse)
# set the User-Agent here
seq.headers['User-Agent'] = ''
yield seq
```
-
Practice beats theory: scraping Douban's popular movies
- A quick look at the page shows that the "recent popular movies" list is loaded dynamically, so open the browser's developer tools and watch the network requests as the list loads; that reveals the JSON endpoint behind it (used as base_url in the spider below).
- Create the project: scrapy startproject HotMovie
- Enter the project: cd HotMovie
- Create the spider: scrapy genspider douban douban.com
- At this stage the files we need to edit are douban.py, MySqlHelper.py, items.py, pipelines.py, and settings.py.
douban.py
This file generates the URLs to crawl and parses out the fields we want to collect.
```python
# -*- coding: utf-8 -*-
import scrapy
import json

from HotMovie.items import DoubanItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']

    # Douban rejects the default headers, so they must be set.
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
            'Host': 'movie.douban.com',
            'Referer': 'https://movie.douban.com/explore',
        }
    }
    # start_urls = ['http://www.douban.com']

    def start_requests(self):
        base_url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start={}"
        page = 20
        for i in range(page):
            url = base_url.format(i * 20)
            seq = scrapy.Request(url=url, callback=self.parse)
            # The User-Agent could also be set here instead:
            # seq.headers['User-Agent'] = ''
            yield seq

    def parse(self, response):
        json_str = response.body.decode('utf-8')
        result = json.loads(json_str)
        items_list = result['subjects']
        for item in items_list:
            url = item['url']
            yield scrapy.Request(url=url, callback=self.get_detail)
        # filename = 'result.json'
        # with open(filename, 'a') as f:
        #     f.write(json_str)
        # self.log("File {} downloaded!".format(filename))

    def get_detail(self, response):
        # movie title
        name = response.xpath('//span[@property="v:itemreviewed"]/text()').extract_first()
        # release year (strip the surrounding parentheses)
        year = response.xpath('//span[@class="year"]/text()').extract_first()[1:-1]
        # director
        director = response.xpath('//a[@rel="v:directedBy"]/text()').extract_first()
        # screenwriters -- a list, there may be several
        screenwriter_list = response.xpath('//div[@id="info"]/span[2]/span[2]/a/text()').extract()
        # main cast -- also a list
        actor_list = response.xpath('//div[@id="info"]/span[3]/span[2]/a/text()').extract()
        # genres -- also a list
        type_list = response.xpath('//span[@property="v:genre"]/text()').extract()
        # text nodes not wrapped in their own tags
        others = response.xpath('//div[@id="info"]/text()').extract()
        # clean up: drop whitespace, newlines, and slashes
        simplify = [i for i in others if i.strip() != "\n" and i.strip() != "" and i.strip() != "/"]
        # production country/region
        made = simplify[1]
        # language
        language = simplify[2]
        # release dates
        publish_time = response.xpath('//span[@property="v:initialReleaseDate"]/text()').extract()
        # runtime
        length = response.xpath('//span[@property="v:runtime"]/text()').extract_first()
        # alternative titles
        other_name = simplify[-1]
        # rating
        score = float(response.xpath('//strong[@property="v:average"]/text()').extract_first())
        # number of people who rated
        rate_people = int(response.xpath('//div[@class="rating_sum"]/a/span/text()').extract_first())
        # synopsis
        summary = response.xpath('//div[@class="indent"]/span/text()').extract_first().strip()
        # poster URL
        poster = response.xpath('//div[@id="mainpic"]/a/img/@src').extract_first()

        item = DoubanItem()
        item['name'] = name
        item['year'] = year
        item['director'] = director
        item['screenwriter_list'] = ','.join(screenwriter_list)
        item['actor_list'] = ','.join(actor_list)
        item['type_list'] = ','.join(type_list)
        # item['simplify'] = simplify
        item['made'] = made
        item['language'] = language
        item['publish_time'] = ','.join(publish_time)
        item['length'] = length
        item['other_name'] = other_name
        item['score'] = score
        item['rate_people'] = rate_people
        item['summary'] = summary
        item['poster'] = poster
        yield item
```
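The parse step above hinges on the shape of the JSON the endpoint returns: an object with a subjects array whose entries carry the detail url. A standalone sketch with a made-up payload (field names taken from the spider, the URL is invented):

```python
import json

# Hypothetical sample of the endpoint's JSON, reduced to the fields the spider touches.
json_str = '{"subjects": [{"title": "Example", "url": "https://movie.douban.com/subject/1/"}]}'
result = json.loads(json_str)
# Same extraction as in parse(): collect the detail-page URLs.
urls = [item['url'] for item in result['subjects']]
print(urls)  # → ['https://movie.douban.com/subject/1/']
```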
MySqlHelper.py
A small helper class that wraps the database connection.
```python
import pymysql


class SqlHelper(object):
    def __init__(self):
        self.conn = pymysql.connect(host="localhost", port=3306, user='root',
                                    password='root', db='scrapy', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def insert_data(self, sql, data):
        self.cursor.execute(sql, data)
        self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()


# quick manual test -- optional
if __name__ == '__main__':
    helper = SqlHelper()
    sql = "INSERT INTO txt (con) VALUES (%s)"
    data = ('this is a test row',)
    helper.insert_data(sql, data)
```
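For readers without a MySQL server handy, the same helper pattern can be tried against the stdlib sqlite3 module; this is a sketch of the idea, not part of the project (note sqlite3 uses ? placeholders where pymysql uses %s):

```python
import sqlite3


class SqliteHelper:
    """Same insert-helper pattern as SqlHelper above, on stdlib sqlite3."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.cursor = self.conn.cursor()

    def insert_data(self, sql, data):
        # parameterized insert, committed immediately
        self.cursor.execute(sql, data)
        self.conn.commit()


helper = SqliteHelper()
helper.cursor.execute('CREATE TABLE txt (con TEXT)')
helper.insert_data('INSERT INTO txt (con) VALUES (?)', ('a test row',))
count = helper.cursor.execute('SELECT COUNT(*) FROM txt').fetchone()[0]
print(count)  # → 1
```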
items.py
This part is fairly mechanical; just be careful with the field names and keep them consistent everywhere.
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class HotmovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class DoubanItem(scrapy.Item):
    name = scrapy.Field()
    year = scrapy.Field()
    director = scrapy.Field()
    screenwriter_list = scrapy.Field()
    actor_list = scrapy.Field()
    type_list = scrapy.Field()
    # simplify = scrapy.Field()
    made = scrapy.Field()
    language = scrapy.Field()
    publish_time = scrapy.Field()
    length = scrapy.Field()
    other_name = scrapy.Field()
    score = scrapy.Field()
    rate_people = scrapy.Field()
    summary = scrapy.Field()
    poster = scrapy.Field()

    def insert_data_sql(self):
        sql = ("INSERT INTO hotmovie (`name`,`year`,director,screenwriter_list,actor_list,"
               "type_list,made,`language`,publish_time,`length`,other_name,score,rate_people,"
               "summary,poster) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        data = (self['name'], self['year'], self['director'], self['screenwriter_list'],
                self['actor_list'], self['type_list'], self['made'], self['language'],
                self['publish_time'], self['length'], self['other_name'], self['score'],
                self['rate_people'], self['summary'], self['poster'])
        return (sql, data)
```
pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from HotMovie.MySqlHelper import SqlHelper


class HotmoviePipeline(object):
    def process_item(self, item, spider):
        return item


# After writing your own pipeline, remember to register it in settings.py (see below).
class DoubanPipeline(object):
    def __init__(self):
        self.helper = SqlHelper()

    def process_item(self, item, spider):
        # only items that know how to build their own INSERT get stored
        if 'insert_data_sql' in dir(item):
            (sql, data) = item.insert_data_sql()
            self.helper.insert_data(sql, data)
        return item
```
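The 'insert_data_sql' in dir(item) check is plain duck typing: any item that defines insert_data_sql gets stored, everything else just passes through. A minimal sketch with two throwaway classes standing in for items:

```python
class WithSql:
    """Stands in for DoubanItem: knows how to build its own INSERT."""
    def insert_data_sql(self):
        return ('INSERT INTO hotmovie ...', ('data',))


class WithoutSql:
    """Stands in for an item with no insert_data_sql method."""
    pass


stored = []
for item in (WithSql(), WithoutSql()):
    # same check as in DoubanPipeline.process_item
    if 'insert_data_sql' in dir(item):
        sql, data = item.insert_data_sql()
        stored.append((sql, data))

print(len(stored))  # → 1  (only the item defining insert_data_sql is stored)
```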
settings.py
Tweak the configuration. The main points:
1. robots.txt: don't obey it.

```python
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```

2. Register the custom pipeline:

```python
ITEM_PIPELINES = {
    # 'HotMovie.pipelines.HotmoviePipeline': 300,
    'HotMovie.pipelines.DoubanPipeline': 300,
}
```
run_douban.py
You can run the spider with scrapy crawl douban, or create this file and run it instead:

```python
from scrapy.cmdline import execute

# execute(['scrapy', 'crawl', 'douban'])  # this form works too
execute('scrapy crawl douban'.split())
```
Results
Don't crawl too frequently; keep to the gentleman's agreement.
SQL to create the table (the column names must match the INSERT in items.py, and reserved-looking words are backtick-quoted for safety):

```sql
CREATE TABLE hotmovie (
    id INT AUTO_INCREMENT PRIMARY KEY,
    `name` VARCHAR(255),
    `year` INT,
    director VARCHAR(255),
    screenwriter_list VARCHAR(255),
    actor_list VARCHAR(255),
    type_list VARCHAR(255),
    made VARCHAR(255),
    `language` VARCHAR(255),
    publish_time VARCHAR(255),
    `length` VARCHAR(100),
    other_name VARCHAR(255),
    score VARCHAR(11),
    rate_people VARCHAR(100),
    summary TEXT,
    poster VARCHAR(255)
) DEFAULT CHARSET = utf8mb4;
```
My IP has already been banned; I was hitting the site a bit too often, and I also botched the database setup today. Time to set up some proxies. See you next time.