The previous crawler article covered connecting to the database directly; this one uses an asynchronous networking framework for the job: the adbapi module in twisted.enterprise. adbapi's ConnectionPool method creates a database connection pool object. Each connection in the pool works in its own thread and internally still uses a blocking driver such as pymysql to access the database. adbapi's runInteraction method calls our insert_db function asynchronously: the item we pass to runInteraction is forwarded as insert_db's second argument (the first is a transaction object wrapping a cursor), and once insert_db returns, the connection object automatically calls commit.
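To make those mechanics concrete, here is a minimal standalone sketch outside Scrapy (the demo table, column, and credentials are placeholders):

from twisted.enterprise import adbapi
from twisted.internet import reactor

# A pool of connections; each one lives in its own worker thread and
# talks to MySQL through pymysql.
dbpool = adbapi.ConnectionPool('pymysql', host='localhost', port=3306,
                               db='db_name', user='root', passwd='123456',
                               charset='utf8')

def insert_row(tx, value):
    # tx wraps a pymysql cursor; commit happens automatically when this
    # function returns, rollback if it raises.
    tx.execute('INSERT INTO demo(col) VALUES (%s)', (value,))

d = dbpool.runInteraction(insert_row, 'hello')  # returns a Deferred
d.addBoth(lambda _: reactor.stop())             # stop the reactor either way
reactor.run()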
The site crawled in this article is http://www.ci.commerce.ca.us/bids.aspx . First create a new Scrapy project (the name is up to you; this one is called CSDN2), and start with items.py:
from scrapy import Item, Field

class Csdn2Item(Item):
    title = Field()
    dueDate = Field()
    web_url = Field()
    bid_url = Field()    # link to the bid document
    issuedate = Field()
    category = Field()   # bid category
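A Scrapy Item behaves like a dict, but only declared fields may be assigned, which catches typos early. A quick illustration (the values are made up):

from CSDN2.items import Csdn2Item

item = Csdn2Item()
item['title'] = 'Example RFP'     # fine: 'title' is a declared Field
item['dueDate'] = '01/31/2024'
# item['deadline'] = '...'        # would raise KeyError: undeclared field
print(dict(item))                 # {'title': 'Example RFP', 'dueDate': '01/31/2024'}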
Next, create the spider file Commerce.py under the spiders folder:
import scrapy
import datetime
from scrapy.http import Request
from CSDN2.items import Csdn2Item

class CommerceSpider(scrapy.Spider):
    name = 'commerce'
    start_urls = ['http://www.ci.commerce.ca.us/bids.aspx']
    domain = 'http://www.ci.commerce.ca.us/'

    def parse(self, response):
        # Locate every row of the bid listing table via XPath
        result_list = response.xpath("//*[@id='BidsLeftMargin']/..//div[5]//tr")
        for result in result_list:
            item = Csdn2Item()
            title = result.xpath("./td[2]/span[1]/a/text()").extract_first()
            if title:
                item['title'] = title
                item['web_url'] = self.start_urls[0]
                url = self.domain + result.xpath("./td[2]/span[1]/a/@href").extract_first()
                item['bid_url'] = url
                # Hand the half-filled item to the detail-page callback via meta
                yield Request(url, callback=self.parse_content, meta={'item': item})

    def parse_content(self, response):
        item = response.meta['item']
        item['category'] = response.xpath(
            "//span[@class='BidDetailSpec']//text()").extract_first()
        item['issuedate'] = response.xpath(
            "//span[(text()='Publication Date/Time:')]/following::span[1]/text()").extract_first()
        expire_date = response.xpath(
            "//span[(text()='Closing Date/Time:')]/following::span[1]/text()").extract_first()
        # If the closing date is "Open Until Contracted", default it to tomorrow
        if expire_date and 'Open Until Contracted' in expire_date:
            expire_date = (datetime.datetime.now() + datetime.timedelta(days=1)).strftime('%m/%d/%Y')
        item['dueDate'] = expire_date
        yield item
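Before moving on to the pipeline, the XPath expressions can be sanity-checked interactively in scrapy shell (a session sketch; the row index and results will depend on the live page):

scrapy shell http://www.ci.commerce.ca.us/bids.aspx
>>> rows = response.xpath("//*[@id='BidsLeftMargin']/..//div[5]//tr")
>>> len(rows)                                                    # how many listing rows matched
>>> rows[1].xpath("./td[2]/span[1]/a/text()").extract_first()    # a bid title
>>> rows[1].xpath("./td[2]/span[1]/a/@href").extract_first()     # its relative link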
Next comes pipelines.py:
# -*- coding: utf-8 -*-
from twisted.enterprise import adbapi  # make sure the twisted package is installed

class Csdn2Pipeline(object):

    def open_spider(self, spider):
        # spider.settings replaces the old `from scrapy.conf import settings`,
        # which has been removed in newer Scrapy versions
        settings = spider.settings
        # Create the connection pool object; each connection runs in its
        # own thread and uses pymysql to access the database
        self.dbpool = adbapi.ConnectionPool(
            'pymysql',
            host=settings['MYSQL_HOST'],
            port=settings['MYSQL_PORT'],
            db=settings['MYSQL_DB_NAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
        )

    def process_item(self, item, spider):
        # Schedule insert_db to run asynchronously in the pool
        self.dbpool.runInteraction(self.insert_db, item)
        return item

    def insert_db(self, tx, item):
        # tx is the transaction/cursor object supplied by adbapi;
        # commit is called automatically after this function returns
        values = (
            item['title'],
            item['dueDate'],
            item['web_url'],
            item['bid_url'],
            item['issuedate'],
            item['category'],
        )
        sql = ('INSERT INTO bids(title,dueDate,web_url,bid_url,issuedate,category) '
               'VALUES (%s,%s,%s,%s,%s,%s)')
        try:
            tx.execute(sql, values)
            print('insert succeeded')
        except Exception as e:
            print('insert failed:', e)

    def close_spider(self, spider):
        self.dbpool.close()
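One refinement worth considering: runInteraction returns a Deferred, so if insert_db is allowed to raise instead of swallowing exceptions (adbapi then rolls the transaction back automatically), failures can be logged through an errback. A sketch against the pipeline above; handle_error is a hypothetical helper:

    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self.insert_db, item)
        # The Failure object wraps whatever insert_db raised; the extra
        # arguments after the callable are passed through to it.
        d.addErrback(self.handle_error, item, spider)
        return item

    def handle_error(self, failure, item, spider):
        spider.logger.error('insert failed for %r: %s', item['title'], failure)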
In settings.py:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'CSDN2.pipelines.Csdn2Pipeline': 300,
}

# MySQL connection settings
MYSQL_HOST = 'localhost'
MYSQL_DB_NAME = 'db_name'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306
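These settings can also be overridden per run with Scrapy's -s flag, which is handy for pointing the pipeline at a different database without editing the file:

scrapy crawl commerce -s MYSQL_HOST=127.0.0.1 -s MYSQL_DB_NAME=test_db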
Before crawling, open Navicat (or any other MySQL client) and create the bids table with the six columns used above.
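A DDL sketch that matches the pipeline's INSERT statement (the column types, lengths, and surrogate key are my assumptions; adjust as needed):

CREATE TABLE bids (
    id        INT AUTO_INCREMENT PRIMARY KEY,  -- assumed surrogate key
    title     VARCHAR(255),
    dueDate   VARCHAR(32),    -- stored as text, e.g. '01/31/2024'
    web_url   VARCHAR(255),
    bid_url   VARCHAR(255),
    issuedate VARCHAR(64),
    category  VARCHAR(128)
) DEFAULT CHARSET = utf8;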
Then run the spider from the terminal:
scrapy crawl commerce
If everything is wired up correctly, the console prints "insert succeeded" for each item and the rows appear in the bids table.