For work I need to find customers in South America, so I thought of https://www.paginasamarillas.com, the yellow-pages site for Spanish-speaking countries; a quick search there turns up plenty of listings.
It's also a good chance to practice Scrapy.
Source code:
./spiders/paginasamarillas_spider.py
from scrapy import Request
from scrapy.spiders import Spider
from paginasamarillas.items import PaginasAmarillasItem
import time


class PaginasAmarillasSpider(Spider):
    name = "empaque_flexible"

    def start_requests(self):
        # First page of the flexible-packaging category
        url = 'http://www.paginasamarillas.com.co/servicios/empaque-flexible'
        yield Request(url)

    def parse(self, response):
        # Each listing sits in its own div.col-sm-10 block
        empresas = response.xpath('//div[@class="col-sm-10"]')
        for empresa in empresas:
            item = PaginasAmarillasItem()
            item['nombre'] = empresa.xpath('.//span[@class="semibold"]/text()').extract_first()
            item['sitio'] = empresa.xpath('.//div[@class="url"]/a/@href').extract_first()
            item['des'] = empresa.css('div.col-sm-12.infoBox p::text').extract_first()
            yield item
        # Queue the remaining result pages; note that time.sleep() blocks the
        # whole crawler, so this is only a crude way of slowing requests down
        for i in range(2, 35):
            time.sleep(5)
            next_url = "http://www.paginasamarillas.com.co/servicios/empaques-y-envases-flexibles?page=" + str(i)
            yield Request(next_url)
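With the spider in place, it can be run from the project root and the items dumped straight to a file using Scrapy's built-in feed export (the output filename below is just an example):

scrapy crawl empaque_flexible -o empresas.csv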
./items.py
import scrapy


class PaginasAmarillasItem(scrapy.Item):
    # Fields scraped for each company listing
    nombre = scrapy.Field()  # company name
    sitio = scrapy.Field()   # website URL
    des = scrapy.Field()     # short description
./settings.py
At this point we discover that our little spider usually can't crawl anything at all. Why? Is something wrong with the code? No! Scrapy has a hidden gotcha: by default it obeys the site's robots.txt rules, so if the site says no crawling, it simply won't crawl. Pretty funny, right?! So in the project's settings.py, find ROBOTSTXT_OBEY and set it to False.
The USER_AGENT also needs to be set, because the default one basically tells the server "I'm a crawler, please reject me!"
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
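As a side note, instead of calling time.sleep() inside parse() (which blocks the entire crawler), Scrapy's own DOWNLOAD_DELAY setting can throttle requests; a minimal sketch, with 5 seconds chosen only to match the delay used in the spider above:

# Wait between consecutive requests instead of sleeping inside the spider
DOWNLOAD_DELAY = 5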