Scrapy Crawler Tutorial

小毅哥哥

Article link: https://blog.csdn.net/qq_19678579/article/details/86497071

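Contents

Scrapy Crawler Tutorial
I. Environment setup
- 1. Go to the project directory
- 2. Install the `pipenv` environment and the `scrapy` framework
II. Starting the project
III. File reference
IV. Using PyCharm
V. Error messages
- 1. ModuleNotFoundError: No module named 'win32api'
- 2. Garbled text when printing page content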

I. Environment setup

Use a pipenv virtual environment:

1. Go to the project directory

cd /Users/xiaoyigege/Desktop/Python/ptest

2. Install the `pipenv` environment and the `scrapy` framework

Create the environment

pipenv install

Change the package source (the `url` entry under `[[source]]` in the Pipfile)

url = "https://pypi.tuna.tsinghua.edu.cn/simple"

Install the framework

pipenv install scrapy

Test scrapy (the test passes if page content is returned)

pipenv run scrapy fetch "http://www.baidu.com"

II. Starting the project

1. Create the project

Create the project inside the ptest folder

pipenv run scrapy startproject ITcast

The resulting directory structure, viewed from the ptest folder, is as follows:

.
├── ITcast
│   ├── ITcast
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── __pycache__
│   └── scrapy.cfg
├── Pipfile
└── Pipfile.lock

Go into the ITcast project folder and create the spider file
Project path: /Users/xiaoyigege/Desktop/Python/ptest/ITcast

cd ITcast
pipenv run scrapy genspider itcast "www.itcast.cn"

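2. Write the spider and implement the scraping logic in code

3. Save the results to a local file

In the project directory /Users/xiaoyigege/Desktop/Python/ptest/ITcast:

Create a folder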

mkdir data

Enter the data folder

cd data

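Run the spider (the pipeline opens itcast_pipeline.json with a relative path, so the output file is written to the current directory, data)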

pipenv run scrapy crawl itcast

.
├── ITcast
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       └── itcast.py
├── data
│   └── itcast_pipeline.json
└── scrapy.cfg

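III. File reference

1. The settings.py file

# Pipeline configuration: the lower the value, the higher the priority (it runs earlier)
ITEM_PIPELINES = {
    'ITcast.pipelines.ItcastPipeline': 300,
    'ITcast.pipelines.xxxPipeline': 400,   # xxxPipeline is a placeholder pipeline name
}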
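
To make the priority rule concrete, here is a minimal sketch of how such a mapping orders the pipelines. It only sorts a plain dictionary and is not Scrapy's own code; `pipeline_order` is a hypothetical helper name, and `xxxPipeline` is still just the placeholder from the settings above.

```python
# Same shape as ITEM_PIPELINES in settings.py; xxxPipeline is a placeholder name.
ITEM_PIPELINES = {
    'ITcast.pipelines.xxxPipeline': 400,
    'ITcast.pipelines.ItcastPipeline': 300,
}


def pipeline_order(pipelines):
    """Return the pipeline paths sorted so the lowest value comes first."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path for path, _priority in ordered]


if __name__ == "__main__":
    # Items pass through the pipelines in this order.
    for path in pipeline_order(ITEM_PIPELINES):
        print(path)
```

Whatever order the entries are written in, the pipeline with the value 300 handles each item before the one with 400.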

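2. The pipelines.py file

import json  # standard-library JSON module

# Pipeline class; this is the name referenced in settings.py
class ItcastPipeline(object):
    # Runs once per crawl. Only needed here because the items are saved to a local file.
    def __init__(self):
        # Relative path: the file is created in the directory the crawl is started from.
        # encoding="utf-8" matters because ensure_ascii=False writes Chinese text as-is.
        self.f = open("itcast_pipeline.json", "w", encoding="utf-8")

    # Required method: called for every item the spider yields
    def process_item(self, item, spider):
        # Append ",\n" so that each item ends up on its own line
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    # Optional: close the file when the spider finishes
    def close_spider(self, spider):
        self.f.close()

# xxxPipeline: the name of a further pipeline, to be registered in settings.py
# class xxxPipeline(object):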
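
Because every line written by the pipeline ends with a comma, the output file as a whole is not a valid JSON document. The following is a minimal sketch for reading it back, assuming exactly one item per line as written above; `load_items` is a hypothetical helper, and the sample line is made-up placeholder data.

```python
import json


def load_items(lines):
    """Parse lines of the form '{...},' into a list of dicts."""
    items = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(","):
            line = line[:-1]  # drop the trailing comma the pipeline appends
        items.append(json.loads(line))
    return items


if __name__ == "__main__":
    # Placeholder data in the same shape the pipeline writes.
    sample = ['{"name": "A", "title": "B", "info": "C"},\n']
    print(load_items(sample))
```

An open file can be passed directly, for example `load_items(open("itcast_pipeline.json", encoding="utf-8"))`, since iterating over a file yields its lines.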

3. The items.py file

import scrapy

class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Teacher's name
    name = scrapy.Field()
    # Teacher's title
    title = scrapy.Field()
    # Teacher's description
    info = scrapy.Field()

4. The spiders/itcast.py file

import scrapy
from ITcast.items import ItcastItem  # import the item class

class ItcastSpider(scrapy.Spider):
    # Spider name (required); this is the argument passed when starting the crawl
    name = 'itcast'
    # Allowed domains (optional); the spider only crawls within these domains
    allowed_domains = ['www.itcast.cn']
    # Start URL list; the first batch of requests is taken from this list
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):

        node_list = response.xpath("//div[@class='li_txt']")

        # Process each matched teacher node
        for node in node_list:
            # Create an item object to hold the extracted fields
            item = ItcastItem()

            # extract() converts the XPath results into a list of Unicode strings
            name = node.xpath("./h3/text()").extract()
            title = node.xpath("./h4/text()").extract()
            info = node.xpath("./p/text()").extract()

            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            # Hand each item to the pipeline, then resume here for the next loop iteration
            yield item
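
The spider indexes each result with `[0]`. When an XPath expression matches nothing, `extract()` returns an empty list and the index raises an IndexError. A minimal sketch of a guard in plain Python follows; `first_or_default` is a hypothetical helper and not part of the original spider. Scrapy selectors also provide `extract_first()` for this purpose, which may be the simpler choice inside a real spider.

```python
def first_or_default(values, default=""):
    """Return the first element of a list, or `default` if the list is empty."""
    return values[0] if values else default


if __name__ == "__main__":
    # A non-empty result list yields its first element.
    print(first_or_default(["first", "second"]))
    # An empty result list yields the default instead of raising IndexError.
    print(repr(first_or_default([])))
```

In the spider this would be used as `item['name'] = first_or_default(name)`.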

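IV. Using PyCharm

Find the location of the project's virtual environment

pipenv --venv  # prints the path of the current virtual environment

Start PyCharm and open the project named ITcast
Open the project settings and search for Project Interpreter
Click the settings button at the top right of Project Interpreter and choose Add Local
Choose VirtualEnv Environment
Copy the environment path /Users/xiaoyigege/.local/share/virtualenvs/ptest-uyD_8yIs/bin/python to the clipboard (you have to append /bin/python to the printed path yourself), paste it into the Interpreter field under Existing environment, and click OK.
PyCharm now uses the Python environment created specifically for ITcast.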
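
The path-joining step can be sketched as follows. This assumes the macOS/Linux virtualenv layout used in this article (`bin/python`); on Windows the interpreter normally lives under `Scripts\python.exe` instead. `interpreter_path` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def interpreter_path(venv_dir):
    """Append bin/python to a virtualenv directory (macOS/Linux layout)."""
    return str(PurePosixPath(venv_dir) / "bin" / "python")


if __name__ == "__main__":
    # The directory below is the one printed by `pipenv --venv` in this article.
    venv = "/Users/xiaoyigege/.local/share/virtualenvs/ptest-uyD_8yIs"
    print(interpreter_path(venv))
```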

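V. Error messages

1. ModuleNotFoundError: No module named 'win32api'

On Windows 7, running the Scrapy crawl command fails with ModuleNotFoundError: No module named 'win32api' when the pypiwin32 module is missing, so pypiwin32 has to be installed.

Install the pypiwin32 dependency in the current virtual environment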

pipenv install pypiwin32

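2. Garbled text when printing page content

In the spider file:

def parse(self, response):
    # Decode the raw response bytes as UTF-8 so that Chinese text prints correctly
    html1 = response.body
    html1 = html1.decode('utf-8')

    print(html1)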
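
Decoding as UTF-8 only works when the page really is UTF-8 encoded. The sketch below tries several encodings in turn. The fallback to `gbk` is an assumption (it is common on older Chinese sites, but nothing in this article confirms it for any particular page), and `decode_body` is a hypothetical helper name.

```python
def decode_body(body, encodings=("utf-8", "gbk")):
    """Try each encoding in turn and return the first successful decode."""
    for encoding in encodings:
        try:
            return body.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    # Last resort: decode with the first encoding and mark undecodable bytes
    return body.decode(encodings[0], errors="replace")


if __name__ == "__main__":
    print(decode_body("中文".encode("utf-8")))
    print(decode_body("中文".encode("gbk")))
```

Inside `parse` this would be called as `html1 = decode_body(response.body)`.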
