Scrapy

Scrapy study notes

Scrapy uses an asynchronous networking library (Twisted) to handle network communication.

Installing Scrapy

pip install Scrapy

Possible pitfalls

Error 1

Building Twisted fails with "Microsoft Visual C++ 14.0 is required".

Solution

Install offline. Visit this site:

https://www.lfd.uci.edu/~gohlke/pythonlibs/

Pick the wheel that matches your Python version.
Then install the downloaded file with this command:

pip install xxx.whl
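For example, assuming 64-bit Python 3.8 on Windows, the Twisted wheel downloaded from that page would be installed roughly like this (the filename below is only illustrative; use the one that matches your own Python version and architecture):

pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl

Again, that exact filename is just an example, not a required version.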

Error 2

Running scrapy bench reports an error.

Solution

pip install pywin32

How Scrapy works

[Scrapy architecture diagram: the engine coordinates the scheduler, downloader, spiders, and item pipelines, with downloader and spider middlewares in between.]
This diagram is very important!

Straight into a project

Create a new project

Crawl target

Collect the IP proxies listed on the xicidaili site.

scrapy startproject xicidailiSpider

D:\untitled1>scrapy startproject xicidailiSpider
New Scrapy project 'xicidailiSpider', using template directory 'd:\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    D:\untitled1\xicidailiSpider

You can start your first spider with:
    cd xicidailiSpider
    scrapy genspider example example.com

After the project has been created, use cd to move into the folder:

D:\untitled1>cd  D:\untitled1\xicidailiSpider

Move into the directory that contains the project's scrapy.cfg file (the project root that startproject just created).

D:\untitled1\xicidailiSpider>scrapy genspider xicidaili xicidaili.com

Here xicidaili is the spider name (it must not duplicate the project name) and xicidaili.com is the domain.
This generates a new file under the project's spiders/ folder; that file is our spider:

import scrapy


# Create the spider class; it inherits from scrapy.Spider, the most basic
# spider class (the other spider classes all inherit from it)
class XicidailiSpider(scrapy.Spider):
    name = 'xicidaili'  # spider name -- must be unique
    allowed_domains = ['xicidaili.com']  # domains the spider is allowed to crawl
    start_urls = ['http://xicidaili.com/']  # URLs the crawl starts from

    # Parse the response: extract data or follow-up URLs; response holds the page source
    def parse(self, response):
        pass

Extracting data

1. Regular expressions (the foundation; a must-know, but hard to master)
2. XPath: extract data from the HTML
3. CSS selectors

On some sites the proxy IP and the port sit in separate columns. In that case you can first select every whole row with //tr, loop over the matched rows with a for loop, and then run another XPath on each row. Those inner XPath expressions start with a dot (.), which means "search only within the current element" rather than the whole document.
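A minimal sketch of that row-by-row approach, meant to sit inside parse(); the column positions (td[1] for the IP, td[2] for the port) are assumptions that vary from site to site:

        # Sketch: iterate over whole rows, then query inside each row with "."
        # td[1]/td[2] positions are assumptions and depend on the target page
        for tr in response.xpath("//tr"):
            ip = tr.xpath("./td[1]/text()").get()    # "." = search only inside this <tr>
            port = tr.xpath("./td[2]/text()").get()
            if ip and port:
                print(f"{ip}:{port}")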

import scrapy


# Create the spider class; it inherits from scrapy.Spider, the most basic
# spider class (the other spider classes all inherit from it)
class XicidailiSpider(scrapy.Spider):
    name = 'xicidaili'  # spider name -- must be unique
    allowed_domains = ['xiladaili.com']  # domains the spider is allowed to crawl
    start_urls = ['http://www.xiladaili.com/gaoni/']  # URLs the crawl starts from

    # Parse the response: extract data or follow-up URLs; response holds the page source
    def parse(self, response):
        # Extract the data
        data = response.xpath("/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()")
        print(data)

I temporarily switched to a different site here, because the xicidaili site was already gone by the time I was learning this.

Run the spider:

scrapy crawl xicidaili

The trailing xicidaili is our spider name.
If you run into anti-crawling measures, try adding a User-Agent header.
Do this by editing the settings.py file.
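A minimal sketch of the change in settings.py (the UA string below is just an example; any current browser User-Agent works):

# settings.py -- the User-Agent string here is only an example value
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"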

2021-08-11 10:02:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiladaili.com/gaoni/> (referer: None)
[<Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='103.103.3.6:8080'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='27.192.200.7:9000'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='113.237.3.178:9999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='61.37.223.152:8080'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='118.117.188.171:3256'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='45.228.188.241:999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='104.254.238.122:20171'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='211.24.95.49:47615'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='47.104.66.204:80'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='191.101.39.193:80'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='112.104.28.117:3128'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='103.205.15.97:8080'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='185.179.30.130:8080'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='190.108.88.97:999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='81.30.220.116:8080'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='218.88.204.125:3256'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='178.62.56.172:80'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='178.134.208.126:50824'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='193.149.225.163:80'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='182.87.136.228:9999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='175.42.122.142:9999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='192.109.165.128:80'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='181.3.91.56:10809'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='182.84.144.91:3256'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='188.166.125.206:38892'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='118.194.242.57:80'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='167.172.180.46:33555'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='58.255.7.90:9999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='190.85.244.70:999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='118.99.100.164:8080'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='182.84.145.181:3256'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='94.180.72.40:3128'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='45.236.168.183:999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='01.20.217.52:8080'>, <Selector 
xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='175.146.211.158:9999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='114.233.189.228:9999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='125.113.133.47:8118'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='171.35.213.44:9999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='131.153.151.250:43064'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='106.45.220.42:3256'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='177.229.194.30:999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='191.101.39.110:80'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='195.46.124.94:4444'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='113.237.1.179:9999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='182.84.144.12:3256'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='36.56.102.35:9999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='131.153.151.250:8003'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='45.225.88.220:999'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='195.9.61.22:45225'>, <Selector xpath='/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()' data='43.239.152.254:8080'>]

The output is messy and the actual strings have not been pulled out of the selectors yet, so modify the code:

import scrapy


# Create the spider class; it inherits from scrapy.Spider, the most basic
# spider class (the other spider classes all inherit from it)
class XicidailiSpider(scrapy.Spider):
    name = 'xicidaili'  # spider name -- must be unique
    allowed_domains = ['xiladaili.com']  # domains the spider is allowed to crawl
    start_urls = ['http://www.xiladaili.com/gaoni/']  # URLs the crawl starts from

    # Parse the response: extract data or follow-up URLs; response holds the page source
    def parse(self, response):
        # Extract the data
        data = response.xpath("/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()")
        print(data.getall())

2021-08-11 10:10:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiladaili.com/gaoni/> (referer: None)
['103.103.3.6:8080', '27.192.200.7:9000', '113.237.3.178:9999', '61.37.223.152:8080', '118.117.188.171:3256', '45.228.188.241:999', '104.254.238.122:20171', '211.24.95.49:47615', '47.104.66.204:80', '191.101.39.193:80', '112.104.28.117:3128', '103.205.15.97:8080', '185.179.30.130:8080', '190.108.88.97:999', '81.30.220.116:8080', '218.88.204.125:3256', '178.62.56.172:80', '178.134.208.126:50824', '193.149.225.163:80', '182.87.136.228:9999', '175.42.122.142:9999', '192.109.165.128:80', '181.3.91.56:10809', '182.84.144.91:3256', '188.166.125.206:38892', '118.194.242.57:80', '167.172.180.46:33555', '58.255.7.90:9999', '190.85.244.70:999', '118.99.100.164:8080', '182.84.145.181:3256', '94.180.72.40:3128', '45.236.168.183:999', '01.20.217.52:8080', '175.146.211.158:9999', '114.233.189.228:9999', '125.113.133.47:8118', '171.35.213.44:9999', '131.153.151.250:43064', '106.45.220.42:3256', '177.229.194.30:999', '191.101.39.110:80', '195.46.124.94:4444', '113.237.1.179:9999', '182.84.144.12:3256', '36.56.102.35:9999', '131.153.151.250:8003', '45.225.88.220:999', '195.9.61.22:45225', '43.239.152.254:8080']

If there is only one matched element, get() is enough; if there are several, use getall().
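A minimal sketch of the difference, using the data selector from the code above:

first = data.get()        # first matched text only, e.g. '103.103.3.6:8080'; None if nothing matched
all_ips = data.getall()   # list of every matched text, e.g. ['103.103.3.6:8080', '27.192.200.7:9000', ...]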

Pagination

import scrapy


# Create the spider class; it inherits from scrapy.Spider, the most basic
# spider class (the other spider classes all inherit from it)
class XicidailiSpider(scrapy.Spider):
    name = 'xicidaili'  # spider name -- must be unique
    allowed_domains = ['xiladaili.com']  # domains the spider is allowed to crawl
    start_urls = ['http://www.xiladaili.com/gaoni/']  # URLs the crawl starts from

    # Parse the response: extract data or follow-up URLs; response holds the page source
    def parse(self, response):
        # Extract the data
        data = response.xpath("/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()")
        print(data.getall())

        # Pagination
        next_page = response.xpath("/html/body/div/div[3]/nav/ul/li[16]/a/@href")
        if next_page:
            print(next_page.get())

2021-08-11 10:24:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-08-11 10:24:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiladaili.com/gaoni/> (referer: None)
['103.103.3.6:8080', '27.192.200.7:9000', '113.237.3.178:9999', '61.37.223.152:8080', '118.117.188.171:3256', '45.228.188.241:999', '104.254.238.122:20171', '211.24.95.49:47615', '47.104.66.204:80', '191.101.39.193:80', '112.104.28.117:3128', '103.205.15.97:8080', '185.179.30.130:8080', '190.108.88.97:999', '81.30.220.116:8080', '218.88.204.125:3256', '178.62.56.172:80', '178.134.208.126:50824', '193.149.225.163:80', '182.87.136.228:9999', '175.42.122.142:9999', '192.109.165.128:80', '181.3.91.56:10809', '182.84.144.91:3256', '188.166.125.206:38892', '118.194.242.57:80', '167.172.180.46:33555', '58.255.7.90:9999', '190.85.244.70:999', '118.99.100.164:8080', '182.84.145.181:3256', '94.180.72.40:3128', '45.236.168.183:999', '01.20.217.52:8080', '175.146.211.158:9999', '114.233.189.228:9999', '125.113.133.47:8118', '171.35.213.44:9999', '131.153.151.250:43064', '106.45.220.42:3256', '177.229.194.30:999', '191.101.39.110:80', '195.46.124.94:4444', '113.237.1.179:9999', '182.84.144.12:3256', '36.56.102.35:9999', '131.153.151.250:8003', '45.225.88.220:999', '195.9.61.22:45225', '43.239.152.254:8080']
/gaoni/2/
2021-08-11 10:24:35 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-11 10:24:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 303,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 39781,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.542365,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 8, 11, 2, 24, 35, 437527),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 8, 11, 2, 24, 34, 895162)}
2021-08-11 10:24:35 [scrapy.core.engine] INFO: Spider closed (finished)

The extracted URL is only a relative path, so it has to be joined onto the base URL; for example, joining /gaoni/2/ onto http://www.xiladaili.com/gaoni/ gives http://www.xiladaili.com/gaoni/2/.

import scrapy


# Create the spider class; it inherits from scrapy.Spider, the most basic
# spider class (the other spider classes all inherit from it)
class XicidailiSpider(scrapy.Spider):
    name = 'xicidaili'  # spider name -- must be unique
    allowed_domains = ['xiladaili.com']  # domains the spider is allowed to crawl
    start_urls = ['http://www.xiladaili.com/gaoni/']  # URLs the crawl starts from

    # Parse the response: extract data or follow-up URLs; response holds the page source
    def parse(self, response):
        # Extract the data
        data = response.xpath("/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()")
        print(data.getall())

        # Pagination
        next_page = response.xpath("/html/body/div/div[3]/nav/ul/li[16]/a/@href")
        if next_page:
            print(next_page.get())
            # Join the relative path onto the current URL (urljoin expects a string, so use .get())
            next_url = response.urljoin(next_page.get())
            # Issue a new request; callback hands the response back to parse for processing
            yield scrapy.Request(next_url, callback=self.parse)  # parse is now a generator

*** Do not write () after the callback: pass self.parse, not self.parse(). I'm not running this any further here, for fear of getting banned. ***
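To actually keep the proxies instead of only printing them, parse() could yield one dict per address and the crawl could write them to a feed file; a minimal sketch (the field name proxy is my own choice, not from the original code):

    def parse(self, response):
        for ip in response.xpath("/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()").getall():
            yield {"proxy": ip}  # each yielded dict becomes one record in the output feed ("proxy" is a made-up field name)

Then run the crawl with an output file, for example:

scrapy crawl xicidaili -o proxies.json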
