Practice 6 – Installing and Configuring Scrapy
Installing Scrapy on Linux
- Activate the Python virtual environment
- Install Twisted
wget https://twistedmatrix.com/Releases/Twisted/17.1/Twisted-17.1.0.tar.bz2
tar -jxvf Twisted-17.1.0.tar.bz2
cd Twisted-17.1.0
python3 setup.py install
- Install Scrapy (the PyPI package is named scrapy)
pip install -i https://pypi.doubanio.com/simple/ scrapy
Creating a Scrapy Project
- Create the Scrapy project
scrapy startproject douban
- Add the first spider
cd douban
scrapy genspider douban_spider movie.douban.com  # the allowed domain
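The genspider command creates a spider file from Scrapy's default template. With that template it looks roughly like the sketch below (the exact contents vary by Scrapy version; this is a generated skeleton, not code to copy):

```python
# douban/spiders/douban_spider.py — approximate generated skeleton
import scrapy

class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'                  # used by `scrapy crawl douban_spider`
    allowed_domains = ['movie.douban.com']  # requests outside this domain are filtered out
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        pass
```

The sections below fill in `start_urls` and the `parse()` method.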
Practice 7 – Crawling Douban Movie Reviews with Scrapy
- The four steps of a Scrapy crawl:
1. Create the project
2. Define the targets
3. Build the spider
4. Store the content
- Modify settings.py
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
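Douban throttles aggressive clients, so alongside the custom user agent it is common to also slow the crawl down in settings.py. A hedged sketch (the values below are assumptions to tune, not part of the original walkthrough):

```python
# Optional politeness settings in settings.py (assumed values; tune as needed)
DOWNLOAD_DELAY = 0.5                # wait ~0.5 s between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap parallel requests per domain
```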
- Define the targets
# items.py
import scrapy

class DoubanItem(scrapy.Item):
    serial_number = scrapy.Field()  # ranking number
    movie_name = scrapy.Field()     # movie title
    introduce = scrapy.Field()      # brief introduction line
    stars = scrapy.Field()          # rating score
    evaluate = scrapy.Field()       # number of ratings
    describe = scrapy.Field()       # one-line description (quote)
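A scrapy.Item behaves like a dict restricted to its declared fields: assigning to an undeclared key raises KeyError. A plain-Python stand-in sketch of that behaviour (set_field is a helper introduced here so the example runs without Scrapy installed; it is not part of the project):

```python
# Stand-in illustrating scrapy.Item's dict-like behaviour: only the
# fields declared on DoubanItem may be assigned.
DECLARED_FIELDS = {"serial_number", "movie_name", "introduce",
                   "stars", "evaluate", "describe"}

def set_field(item, key, value):
    if key not in DECLARED_FIELDS:
        raise KeyError(f"{key} is not a declared field")
    item[key] = value

item = {}
set_field(item, "movie_name", "肖申克的救赎")
print(item["movie_name"])  # 肖申克的救赎
try:
    set_field(item, "director", "Frank Darabont")  # not declared on DoubanItem
except KeyError as e:
    print("rejected:", e)
```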
- Build the spider
# douban_spider.py
import scrapy
from douban.items import DoubanItem

class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # handle the current page
        movie_list = response.xpath("//ol[@class='grid_view']/li")
        for item in movie_list:
            douban_item = DoubanItem()
            douban_item["serial_number"] = item.xpath(".//div[@class='pic']/em/text()").extract_first()
            douban_item["movie_name"] = item.xpath(".//div[@class='info']/div[@class='hd']/a/span[1]/text()").extract_first()
            # the introduction spans several text nodes; strip whitespace and join with '/'
            content = item.xpath(".//div[@class='info']/div[@class='bd']/p[1]/text()").extract()
            content_s = ""
            for i_content in content:
                line = "".join(i_content.split())
                content_s = content_s + line + "/"
            douban_item["introduce"] = content_s
            douban_item["stars"] = item.xpath(".//div[@class='info']//div[@class='star']/span[@class='rating_num']/text()").extract_first()
            douban_item["evaluate"] = item.xpath(".//div[@class='info']//div[@class='star']/span[4]/text()").extract_first()
            douban_item["describe"] = item.xpath(".//div[@class='info']/div[@class='bd']//span[@class='inq']/text()").extract_first()
            yield douban_item
        # find the next page, if any
        next_link = response.xpath("//span[@class='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request("https://movie.douban.com/top250" + next_link, callback=self.parse)
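The whitespace-cleaning loop inside parse() is easy to check in isolation. A minimal sketch (clean_introduce is a helper name introduced here for illustration, not part of the project):

```python
def clean_introduce(lines):
    """Collapse all whitespace in each extracted text node and join the
    pieces with '/', exactly as the loop in DoubanSpiderSpider.parse() does."""
    content_s = ""
    for i_content in lines:
        line = "".join(i_content.split())  # drop spaces, tabs, newlines
        content_s = content_s + line + "/"
    return content_s

raw = ["\n  导演: 弗兰克·德拉邦特  ", "\n  1994 / 美国 / 犯罪 剧情\n  "]
print(clean_introduce(raw))  # 导演:弗兰克·德拉邦特/1994/美国/犯罪剧情/
```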
To make running the spider more convenient, add a script that executes the following command:
scrapy crawl douban_spider
Create a main.py file:
# main.py
from scrapy import cmdline
cmdline.execute('scrapy crawl douban_spider'.split())
- Store the content
Add the database configuration to settings.py:
mongo_host = '127.0.0.1'
mongo_port = 27017
mongo_db_name = 'douban'
mongo_db_collection = 'douban_movie'
Add your own pipeline in pipelines.py:
import pymongo
from douban.settings import mongo_host, mongo_port, mongo_db_name, mongo_db_collection

class DoubanPipeline(object):
    def __init__(self):
        # connect to MongoDB and keep a handle on the target collection
        client = pymongo.MongoClient(host=mongo_host, port=mongo_port)
        mydb = client[mongo_db_name]
        self.post = mydb[mongo_db_collection]

    def process_item(self, item, spider):
        data = dict(item)           # convert the Item to a plain dict for pymongo
        self.post.insert_one(data)  # insert() is deprecated/removed in pymongo; use insert_one()
        return item
Enable the custom pipeline in settings.py:
# Configure item pipelines
ITEM_PIPELINES = {
'douban.pipelines.DoubanPipeline': 300,
}
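The number 300 is the pipeline's priority: Scrapy runs enabled pipelines in ascending order of this value (conventionally 0–1000), so a lower number runs earlier. A quick sketch of the ordering rule (CleanupPipeline is a hypothetical second stage, not part of the project):

```python
# Scrapy orders item pipelines by their integer value, smallest first.
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
    'douban.pipelines.CleanupPipeline': 100,  # hypothetical earlier stage
}
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order)
# ['douban.pipelines.CleanupPipeline', 'douban.pipelines.DoubanPipeline']
```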