Scrapy is an open-source, collaborative web-crawling framework for Python, used to extract data from websites. The two examples below illustrate how it works.
Case 1: Scraping basic information from a site
-
Create the project
First, create a new Scrapy project by running the following command in your terminal:

```shell
scrapy startproject car_spider
```
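For orientation, `startproject` generates a standard skeleton (the inner module shares the project name), which is where the files referenced below live:

```
car_spider/
├── scrapy.cfg            # deploy configuration
└── car_spider/           # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spider code goes here
        └── __init__.py
```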
-
Define the Item
In car_spider/car_spider/items.py, define the data structure you want to scrape:
```python
import scrapy


class CarSpiderItem(scrapy.Item):
    brand = scrapy.Field()
    mileage = scrapy.Field()
    licensing_date = scrapy.Field()
    location = scrapy.Field()
    price = scrapy.Field()
```
-
Write the Spider
Create a file named car_spider.py in the car_spider/car_spider/spiders directory and define the Spider:
```python
import scrapy
from car_spider.items import CarSpiderItem


class CarSpider(scrapy.Spider):
    name = 'car_spider'
    allowed_domains = ['car_spider.com']
    start_urls = ['https://www.car_spider.com/']

    def parse(self, response):
        for sel in response.xpath('//ul[@class="viewlist_ul"]/li'):
            item = CarSpiderItem()
            item['brand'] = sel.xpath('.//div[@class="cards-bottom"]/h4/text()').extract()
            # Note: the next three fields share one selector, so each receives
            # the full list of cards-unit values; adjust the XPath expressions
            # to the real page markup to separate them.
            item['mileage'] = sel.xpath(".//p[@class='cards-unit']/text()").extract()
            item['licensing_date'] = sel.xpath(".//p[@class='cards-unit']/text()").extract()
            item['location'] = sel.xpath(".//p[@class='cards-unit']/text()").extract()
            item['price'] = sel.xpath('.//a/div[2]/div[1]/span[1]/text()').extract()
            yield item
```
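Since `extract()` returns a list of raw strings, often padded with whitespace, it is common to normalize values before storing them. A stdlib-only sketch (`clean_values` is a hypothetical helper, not part of Scrapy):

```python
def clean_values(raw):
    """Strip whitespace from each extracted string and drop empty entries."""
    return [s.strip() for s in raw if s and s.strip()]


# Example: the kind of list an XPath .extract() call might return for one card
raw_mileage = ['\n  3.2万公里  ', '\n']
print(clean_values(raw_mileage))  # ['3.2万公里']
```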
-
Run the Spider
Start the crawl from the terminal (the argument must match the spider's `name` attribute, `car_spider`):

```shell
cd car_spider
scrapy crawl car_spider
```
Case 2: Scraping multi-page data and storing it in a database via a pipeline
Suppose we want to crawl a site's full product listing and save the data to a MySQL database.
- Define the Item

```python
import scrapy


class CarSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    brand = scrapy.Field()
    mileage = scrapy.Field()
    licensing_date = scrapy.Field()
    location = scrapy.Field()
    price = scrapy.Field()
```
- Write the Spider

```python
import scrapy
from car_spider.items import CarSpiderItem


class CarSpider(scrapy.Spider):
    name = "car_spider"
    allowed_domains = ['car_spider.com']
    start_urls = [
        'http://car_spider.com/products',
    ]

    def parse(self, response):
        for sel in response.xpath('//ul[@class="viewlist_ul"]/li'):
            item = CarSpiderItem()
            item['brand'] = sel.xpath('.//div[@class="cards-bottom"]/h4/text()').extract()
            item['mileage'] = sel.xpath(".//p[@class='cards-unit']/text()").extract()
            item['licensing_date'] = sel.xpath(".//p[@class='cards-unit']/text()").extract()
            item['location'] = sel.xpath(".//p[@class='cards-unit']/text()").extract()
            item['price'] = sel.xpath('.//a/div[2]/div[1]/span[1]/text()').extract()
            yield item
        # The positional index a[9] assumes a fixed pagination layout; a text-
        # or class-based selector is more robust if the markup allows it.
        next_page = response.xpath('//*[@id="listpagination"]/a[9]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
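Under the hood, `response.follow` resolves a relative `href` against the current page URL before scheduling the request; that resolution step is exactly stdlib `urljoin` (the URLs below are illustrative):

```python
from urllib.parse import urljoin

page_url = 'http://car_spider.com/products'  # plays the role of response.url
next_href = '/products?page=2'               # a value pulled from @href

print(urljoin(page_url, next_href))  # http://car_spider.com/products?page=2

# Purely relative paths resolve against the current directory the same way
print(urljoin('http://car_spider.com/a/b', 'c'))  # http://car_spider.com/a/c
```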
-
Configure the pipeline
Enable the pipeline in settings.py (the dotted path must match your project module, car_spider here):

```python
ITEM_PIPELINES = {
    'car_spider.pipelines.ProductsPipeline': 300,
}
```
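The integer values (0–1000) set pipeline order, with lower numbers running first, so several pipelines can be chained. A quick check of how that ordering works (the `CleaningPipeline` entry is hypothetical):

```python
ITEM_PIPELINES = {
    'car_spider.pipelines.ProductsPipeline': 300,
    'car_spider.pipelines.CleaningPipeline': 100,  # hypothetical earlier stage
}

# Scrapy runs pipelines in ascending priority order
order = [path for path, prio in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order[0])  # car_spider.pipelines.CleaningPipeline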
-
Write the pipeline
In car_spider/pipelines.py, write the code that saves items to MySQL:
```python
import pymysql


class ProductsPipeline:
    def __init__(self):
        self.conn = pymysql.connect(
            host='localhost', user='root', password='password',
            db='mydatabase', charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor,
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Assumes the item fields hold scalar strings; join or clean
        # list values from extract() before inserting.
        sql = """
            INSERT INTO cars (brand, mileage, licensing_date, location, price)
            VALUES (%s, %s, %s, %s, %s)
        """
        self.cursor.execute(sql, (item['brand'], item['mileage'],
                                  item['licensing_date'], item['location'],
                                  item['price']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Release the database connection when the crawl finishes
        self.cursor.close()
        self.conn.close()
```
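If MySQL isn't available locally, the same parameterized-insert pattern can be exercised with stdlib `sqlite3` (note its `?` placeholders instead of PyMySQL's `%s`); a minimal in-memory sketch with made-up sample values:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE cars (
        brand TEXT, mileage TEXT, licensing_date TEXT, location TEXT, price TEXT
    )
""")

item = {'brand': 'Audi A4L', 'mileage': '3.2万公里',
        'licensing_date': '2020-06', 'location': '北京', 'price': '18.5万'}

# Same pattern as the pipeline's process_item, with sqlite3 placeholders
conn.execute(
    "INSERT INTO cars (brand, mileage, licensing_date, location, price) "
    "VALUES (?, ?, ?, ?, ?)",
    (item['brand'], item['mileage'], item['licensing_date'],
     item['location'], item['price']),
)
conn.commit()

row = conn.execute("SELECT brand, price FROM cars").fetchone()
print(row)  # ('Audi A4L', '18.5万')
conn.close()
```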
- Run the Spider

```shell
scrapy crawl car_spider
```
The examples above show Scrapy both without and with pagination, covering multi-page crawling and data storage. Hopefully this helps you get started with web scraping in Scrapy!