If you have never used Scrapy before, take a look at the official Scrapy documentation first and then come back to this article.
For how to create a Scrapy project, see http://blog.csdn.net/chenguolinblog/article/details/19699865
Below we go through the project file by file; the full source code is available on GitHub at the end of the article.
(1) The first file is spider.py. This is the spider we define ourselves to crawl the pages; see the comments in the code for details, and the short sketch after the code for what the XPath in the callback actually extracts.
__author__ = 'chenguolin'
"""
Date: 2014-03-06
"""

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule   # CrawlSpider is a predefined spider; with Rule we can customize the link-crawling rules
from scrapy.selector import HtmlXPathSelector          # HtmlXPathSelector is used to parse the response
from firstScrapy.items import FirstscrapyItem


class firstScrapy(CrawlSpider):
    name = "firstScrapy"                   # the spider's name must be unique
    allowed_domains = ["yuedu.baidu.com"]  # the domains the spider is allowed to crawl
    start_urls = ["http://yuedu.baidu.com/book/list/0?od=0&show=1&pn=0"]  # the first page to crawl

    # Two rules are defined below: the first matches the book pages we want to parse, with myparse as the callback;
    # the second matches the "next page" links, which are simply followed without a callback.
    rules = [Rule(SgmlLinkExtractor(allow=('/ebook/[^/]+fr=booklist', )), callback='myparse'),
             Rule(SgmlLinkExtractor(allow=('/book/list/[^/]+pn=[^/]+', )), follow=True)]

    # callback function
    def myparse(self, response):
        x = HtmlXPathSelector(response)
        item = FirstscrapyItem()
        # get item
        item['link'] = response.url
        item['title'] = ""
        strlist = x.select("//h1/@title").extract()
        if len(strlist) > 0:
            item['title'] = strlist[0]
        # return the item
        return item
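To make the callback easier to follow, here is a minimal, self-contained sketch of what the //h1/@title XPath returns. The sample HTML and URL are made up purely for illustration; the real book pages are of course much larger.

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# Made-up page fragment: the book pages carry the book name in the
# title attribute of an <h1> tag, which is what myparse extracts.
body = '<html><body><h1 title="Example Book">Example Book</h1></body></html>'
response = HtmlResponse(url="http://yuedu.baidu.com/ebook/example?fr=booklist",
                        body=body, encoding='utf-8')

hxs = HtmlXPathSelector(response)
print hxs.select("//h1/@title").extract()   # ['Example Book']
print hxs.select("//h2/@title").extract()   # [] -- this is why myparse checks len(strlist) > 0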
(2) The second file is items.py, which defines the fields we need. Since we only scrape each book's "name" and "link", both fields are str.
from scrapy.item import Item, Field


class FirstscrapyItem(Item):
    title = Field(serializer=str)
    link = Field(serializer=str)
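An Item behaves much like a Python dict, which is why myparse can assign to item['title'] and item['link']. A quick sketch (the field values here are made up):

from firstScrapy.items import FirstscrapyItem

item = FirstscrapyItem()
item['title'] = "Example Book"                         # fields are accessed like dict keys
item['link'] = "http://yuedu.baidu.com/ebook/example"
print item['title'], item['link']
print item.keys()                                      # only the fields that were set are returned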
(3) The third file is pipelines.py. Since we need to write to a database, it uses Twisted's adbapi to connect to MySQL asynchronously. Remember to register the pipeline in settings.py (see the note after the code).
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from twisted.enterprise import adbapi   # Twisted's asynchronous database API
import MySQLdb
import MySQLdb.cursors


class FirstscrapyPipeline(object):
    def __init__(self):   # set up the MySQL connection pool with the database information
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                            db='bookInfo',
                                            user='root',
                                            passwd='123456',
                                            cursorclass=MySQLdb.cursors.DictCursor,
                                            charset='utf8',
                                            use_unicode=False)

    # process_item is the method the pipeline framework calls for every item by default
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        return item

    # insert the data into the database
    def _conditional_insert(self, tx, item):
        sql = "insert into book values (%s, %s)"
        tx.execute(sql, (item["title"], item["link"]))
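As the comment at the top of the file says, the pipeline only runs if it is registered in the project's settings.py. A minimal sketch, where the module path follows this project's layout and the order value 300 is an arbitrary choice:

# settings.py
# Recent Scrapy versions expect a dict mapping the pipeline path to an order value;
# very old versions also accepted a plain list of paths.
ITEM_PIPELINES = {
    'firstScrapy.pipelines.FirstscrapyPipeline': 300,
}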
(4) A screenshot of the scraped data in a MySQL GUI tool on Ubuntu
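The pipeline's insert statement assumes the bookInfo database and its book table already exist. The exact schema is only visible in the screenshot, so the column types below are an assumption; a minimal sketch for creating the table with MySQLdb, assuming the bookInfo database has already been created:

import MySQLdb

# Assumed schema: two VARCHAR columns matching the (title, link) insert in the pipeline.
conn = MySQLdb.connect(user='root', passwd='123456', db='bookInfo', charset='utf8')
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS book (
                   title VARCHAR(255),
                   link  VARCHAR(255)
               )""")
conn.commit()
conn.close()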
(5) You can clone the whole project directly from my GitHub: https://github.com/chenguolin/firstScrapyProject.git
==================================
== from: 陈国林 ==
== email: cgl1079743846@gmail.com ==
== Please credit the source when reposting, thanks! ==
==================================