Your First Scrapy Project

Adolf_1993

Last modified: 2024-06-07 07:56:41

First published: 2022-12-06 11:41:08

Article link: https://blog.csdn.net/nnjy_1993/article/details/128195946

1. Install Scrapy

pip install scrapy

2. Create a project

scrapy startproject <project_name>

- Example:

scrapy startproject myscrapyPro

3. Create a spider

cd into the project directory, then run:

scrapy genspider <spider_name> <allowed_domain>

Example:

scrapy genspider myspider www.xxx.com

4. Fill in the spider

import scrapy

class mySpider(scrapy.Spider):  # inherits from scrapy.Spider
    # Spider name, used by `scrapy crawl`
    name = 'mypro'
    # Domains the spider is allowed to crawl
    allowed_domains = ['xxx.com']
    # URLs the crawl starts from
    start_urls = ['http://www.xxx.com']

    # Data-extraction callback; receives the response handed over by the downloader middlewares
    def parse(self, response):
        # Print the response to check the result
        print(response)

        # A Scrapy response supports XPath queries directly
        names = response.xpath('//div[@class="tea_con"]//li/div/h3/text()')
        print(names)

        # To get the actual text, group the nodes by <li> first
        li_list = response.xpath('//div[@class="tea_con"]//li')
        for li in li_list:
            # Collect the fields in a dict
            item = {}
            # Locate elements with Scrapy's XPath selectors, then call extract() or extract_first() to get the result
            item['name'] = li.xpath('.//h3/text()').extract_first()  # name
            item['level'] = li.xpath('.//h4/text()').extract_first()  # level
            item['text'] = li.xpath('.//p/text()').extract_first()  # introduction
            print(item)

4.1 First, open settings.py and change the User-Agent (UA)
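
As a rough sketch of what this change might look like: USER_AGENT is the Scrapy setting that controls the default User-Agent header. The browser string below is only an illustrative value, not one taken from the original project, so substitute whatever your own browser sends.

```python
# settings.py (fragment)
# The UA string is an illustrative assumption; replace it with your browser's.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)
```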

4.2 Set the log level

LOG_LEVEL = "ERROR"  # show only error messages

5. Run the spider from the terminal

scrapy crawl <spider_name>

To suppress log output, add --nolog:

scrapy crawl <spider_name> --nolog

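5.1 Methods for extracting data and attributes

response.xpath() returns a list-like object (a SelectorList) whose elements are Selector objects. It supports the usual list operations, plus a few extra methods:
extract(): returns a list of strings
extract_first(): returns the first string in the list, or None if the list is empty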

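5.2 Common attributes of the response object

1. Response URL

print(response.url)

2. Request URL

print(response.request.url)

3. Response headers

print(response.headers)

4. Request headers

print(response.request.headers)

5. HTML body (raw bytes; response.text gives the decoded string)

print(response.body)

6. Response status code

print(response.status)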

6. Save the data

6.1 From the command line, Scrapy can export directly to CSV, JSON, or XML

Run:

scrapy crawl <spider_name> -o <file_name>.csv
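
The same export can also be configured in settings.py rather than on the command line. As a hedged sketch: Scrapy 2.1 and later read a FEEDS dictionary that maps output paths to export options. The file names here are hypothetical.

```python
# settings.py (fragment)
# Output paths are illustrative assumptions; each key is a file to write.
FEEDS = {
    "teachers.csv": {"format": "csv", "encoding": "utf-8"},
    "teachers.json": {"format": "json", "encoding": "utf-8"},
}
```

With this in place, a plain scrapy crawl <spider_name> writes both files, provided the spider yields its items instead of printing them.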

6.2 Storing data with a pipeline

6.2.1 Import the myproItem class

import scrapy

# Import myproItem from the project's items module
from myscrapyPro.items import myproItem

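6.2.2 Instantiate myproItem and yield the item

# Data-extraction callback; receives the response handed over by the downloader middlewares
def parse(self, response):
    # To get the actual text, group the nodes by <li> first
    li_list = response.xpath('//div[@class="tea_con"]//li')
    for li in li_list:
        # Instantiate the item object
        item = myproItem()

        # Locate elements with Scrapy's XPath selectors, then call extract() or extract_first() to get the result
        item['name'] = li.xpath('.//h3/text()').extract_first()  # name
        item['level'] = li.xpath('.//h4/text()').extract_first()  # level
        item['text'] = li.xpath('.//p/text()').extract_first()  # introduction

        # Hand the item over to the engine, which passes it to the pipelines
        yield item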

6.2.3 Define the fields in the item class

import scrapy


class myproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    level = scrapy.Field()
    text = scrapy.Field()

6.3 Define the methods in the pipeline file

class ItcastproPipeline:

    def __init__(self):
        print("Spider started")
        self.f = open('123.txt', mode='w', encoding='utf-8')

    def process_item(self, item, spider):
        # Pull the string fields out of the item
        name = item['name']
        level = item['level']
        text = item['text']

        self.f.write(name)
        self.f.write('\n')
        self.f.write(level)
        self.f.write('\n')
        self.f.write(text)
        self.f.write('\n')

        return item

    def __del__(self):
        print("Spider finished")
        self.f.close()

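6.3.1 Alternatively, use the open_spider and close_spider hooks (each receives the spider)

class ItcastproPipeline:

    def open_spider(self, spider):
        # Open the file when the spider starts
        self.f = open('123.txt', mode='w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write the data; extract_first() may return None, so fall back to ''
        for key in ('name', 'level', 'text'):
            self.f.write((item[key] or '') + '\n')

        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.f.close()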
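
Writing one field per line makes the file hard to read back. As an alternative sketch (not the original author's pipeline), the same three hooks can write one JSON object per line. The class name JsonLinesPipeline and the path teachers.jl are hypothetical; the class would be registered in ITEM_PIPELINES like any other pipeline.

```python
import json


class JsonLinesPipeline:
    # Hypothetical output path; adjust to taste.
    path = "teachers.jl"

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.f = open(self.path, mode="w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and scrapy.Item instances alike;
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(line + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.f.close()
```

Because json.dumps serializes None as null, a missing field no longer breaks the write, and each line can be parsed back with json.loads.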

6.4 Enable the pipeline in settings.py

ITEM_PIPELINES = {
    'myscrapyPro.pipelines.ItcastproPipeline': 400
}

6.5 Run scrapy crawl <spider_name>
