Python, Part 3: Creating Your First Scrapy Spider Project (Building a Search Engine with a Distributed Crawler)

码点

Last modified: 2022-03-20 11:05:57

First published: 2022-03-18 16:24:09

Article link: https://blog.csdn.net/qq_31939617/article/details/123575456

1. Install pywin32
Open a cmd window and run:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pywin32

2. Install Twisted, the networking framework that Scrapy is built on

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple Twisted

3. Install Scrapy

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy

4. Check the Scrapy version

scrapy version
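The same check can be done from Python, which is handy when several interpreters are installed and it is unclear which one pip installed into. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The three packages installed in steps 1 to 3
    for name in ("pywin32", "Twisted", "Scrapy"):
        print(name, installed_version(name) or "not installed")
```

If any of the three is reported as not installed, the interpreter running the script is probably not the one pip installed into.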

5. Change to the workspace directory where the spider project should live, then create the project

scrapy startproject ArticleSpider

The project is created successfully.

6. Open the project in PyCharm and look at the generated project structure
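As a rough guide to what the project tree should contain, the sketch below checks for the files that Scrapy's default project template generates. The file list reflects the default template as I understand it and may differ slightly between Scrapy versions; the helper name missing_project_files is made up for this example.

```python
from pathlib import Path

# Files the default `scrapy startproject ArticleSpider` template is expected
# to generate (assumption: may vary between Scrapy versions)
EXPECTED_FILES = (
    "scrapy.cfg",
    "ArticleSpider/__init__.py",
    "ArticleSpider/items.py",
    "ArticleSpider/middlewares.py",
    "ArticleSpider/pipelines.py",
    "ArticleSpider/settings.py",
    "ArticleSpider/spiders/__init__.py",
)


def missing_project_files(project_root):
    """Return the expected files that are absent under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the outer ArticleSpider folder
    print(missing_project_files(".") or "layout looks complete")
```

The outer ArticleSpider folder holds scrapy.cfg, and the inner ArticleSpider package holds the settings, items, pipelines, and the spiders package where the spider files go.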

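7. Use the scrapy genspider command to generate the main spider file, passing the spider name and the domain to crawl
Run it from inside the project directory: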

scrapy genspider jobbole blog.jobbole.com

8. Look at the project structure again: a new jobbole.py file has been added

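9. Start the spider with scrapy crawl and check that the URL is reachable. A 200 response status means the request succeeded and the spider skeleton is working.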

scrapy crawl jobbole

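10. Create a main.py file to make debugging easier
Right-click the project name ArticleSpider, then choose New > Python File.

This creates main.py.
11. Add the project path and the crawl command to main.py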

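from scrapy.cmdline import execute

import sys
import os

# Put the project root (the folder containing this file) on the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Scrapy looks for scrapy.cfg starting from the working directory, so print it as a sanity check
print(os.getcwd())
# Same as running "scrapy crawl jobbole" on the command line
execute(["scrapy", "crawl", "jobbole"])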
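The list passed to execute holds the same tokens you would type on the command line, one per element. As a small illustration, the standard library's shlex module can do the splitting; the helper name to_argv is made up for this sketch.

```python
import shlex


def to_argv(command_line):
    """Split a shell-style command line into the argument list form."""
    return shlex.split(command_line)


if __name__ == "__main__":
    # The resulting list has the same shape as the one given to execute()
    print(to_argv("scrapy crawl jobbole"))
```

Writing the list by hand, as main.py does, is just as good; the point is only that each command-line word becomes one list element.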

12. Configure settings.py
Set ROBOTSTXT_OBEY to False so that Scrapy does not filter requests according to the site's robots.txt:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
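To make concrete what this setting switches off, here is a sketch of a robots.txt check using Python's standard urllib.robotparser. It is not how Scrapy implements the check internally; the robots.txt content, the example.com URLs, and the helper name is_allowed are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this sketch
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/", "http://example.com/private/page"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "ArticleSpider", url))
```

With ROBOTSTXT_OBEY left at True, requests that a check like this rejects would be dropped before they are sent. Whether ignoring robots.txt is acceptable depends on the site you are crawling.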

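13. Run the project (run main.py)
14. Check the printed output

It works: the output matches step 9.

15. Scrape the data
Our goal is to extract the title, date, read count, and content.
16. Get the XPath. In Chrome, press F12, click the element picker, and select the title on the page. Then right-click the highlighted title node in the panel on the right, choose Copy XPath, and you get: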

//*[@id="index-left-graphic"]/div[1]/div[2]/div[1]/a/h1

This is the path to the title we need.
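To see how such a path walks the document tree, the sketch below evaluates it with the standard library's xml.etree.ElementTree against a tiny, made-up page. The markup is hypothetical and far simpler than the real site. ElementTree supports only a subset of XPath (no text() step, for instance), whereas Scrapy's selectors handle full XPath, so this is an illustration of the path logic only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real page is far larger
SAMPLE_PAGE = """
<html><body>
  <div id="index-left-graphic">
    <div>
      <div>thumbnail</div>
      <div>
        <div><a href="#"><h1>Sample title</h1></a></div>
        <div>Sample summary</div>
      </div>
    </div>
  </div>
</body></html>
"""

# The copied XPath, with a leading "." because ElementTree paths are relative
TITLE_PATH = './/*[@id="index-left-graphic"]/div[1]/div[2]/div[1]/a/h1'


def find_title(page_source):
    """Return the text of the node matched by TITLE_PATH, or None."""
    root = ET.fromstring(page_source)
    node = root.find(TITLE_PATH)
    return None if node is None else node.text


if __name__ == "__main__":
    print(find_title(SAMPLE_PAGE))
```

Each div[n] step picks the n-th div child of the node matched so far, which is why a copied path like this breaks as soon as the page layout changes.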

17. Back in the project's jobbole.py, extract the content from the response

import re

import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        # extract() returns a list of strings; [0] takes the first match
        title = response.xpath('//*[@id="index-left-graphic"]/div[1]/div[2]/div[1]/a/h1/text()').extract()[0]
        print(title)

        content = response.xpath('//*[@id="index-left-graphic"]/div[1]/div[2]/div[2]/text()').extract()[0]
        print(content)

        date = response.xpath('//*[@id="index-left-graphic"]/div[1]/div[2]/div[3]/div[1]/span[1]/text()').extract()[0]
        print(date)

        # The read-count text mixes digits with other characters, so keep only the number
        read1 = response.xpath('//*[@id="index-left-graphic"]/div[1]/div[2]/div[3]/div[1]/span[2]/text()').extract()[0]
        read = re.findall(r"\d+\.?\d*", read1)[0]
        print(read)

Result:
Nothing here is particularly difficult; the only extra step is pulling the number out of the read-count string.
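The number extraction can be tried on its own, without sending any request. The sketch below uses the pattern \d+\.?\d* (one or more digits, an optional dot, then optional further digits) and returns a default instead of raising when the text contains no number. The sample strings and the helper name first_number are made up for this example.

```python
import re

# One or more digits, an optional decimal point, then optional further digits
NUMBER_PATTERN = re.compile(r"\d+\.?\d*")


def first_number(text, default=None):
    """Return the first integer or decimal found in text, or default."""
    match = NUMBER_PATTERN.search(text)
    return match.group() if match else default


if __name__ == "__main__":
    for sample in ("1.9k reads", "2 likes", "no digits here"):
        print(repr(sample), "->", first_number(sample))
```

Note that the backslash before the second d matters: without it the pattern looks for the literal letter d and a value such as 1.9 would be cut off after the dot.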

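18. Debugging inside the code sends a new request on every run, which slows things down. A useful extra technique is to debug interactively with scrapy shell in a cmd window and copy the expression into the code once it returns what you want. First change to the project directory, then run: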

scrapy shell http://blog.jobbole.com/

Result:
The response object here is the same one we get in jobbole.py.
19. Get the title

title=response.xpath('//*[@id="index-left-graphic"]/div[1]/div[2]/div[1]/a/h1/text()')

The other fields can be retrieved in much the same way.