First, set up the environment: I am using Python 2.7 with PyCharm.
For installing and configuring Scrapy, see http://cuiqingcai.com/912.html, which explains it in detail. One note: even after downloading pip, I still ran into many problems installing with pip, so I used easy_install instead.
1 Create a new project; the directory structure can then be seen in PyCharm.
In cmd, change into the PyCharm project location directory (mine is D:/MyPython) and create a new Scrapy project named MyTest (scrapy startproject MyTest).
Then go into the spiders directory of MyTest and create a new spider file, qiubai.py.
2 You can now see the structure of the current project directory (items.json is the file the scraped data will be saved to; ignore it for now).
3
The directory mainly contains: settings.py, items.py, pipelines.py, __init__.py, and a spiders directory; the spiders directory contains __init__.py and the newly created qiubai.py.
settings.py: the project configuration file
items.py: defines the data fields to extract
pipelines.py: processes, filters, and stores the scraped data
spiders directory: where the spiders live
4
qiubai.py
# -*- coding: utf-8 -*-
import scrapy
from MyTest.items import MyTestItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor as sle
from scrapy.selector import HtmlXPathSelector


class MytestSpider(CrawlSpider):
    name = "mytest"
    allowed_domains = ["qiushibaike.com"]
    start_urls = (
        'http://www.qiushibaike.com/hot',
    )
    # Rules describing which URLs to follow and which callback parses them.
    # Note: the callback must not be named 'parse', otherwise CrawlSpider's
    # own parse() method is overridden and the rules stop working.
    rules = [
        Rule(sle(allow=(r"/hot/page/\d{1,}\?s=\d{1,}",)), follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        items = []
        sel = HtmlXPathSelector(response)
        # each joke post sits in a div with this class
        sites_even = sel.select('//div[@class="article block untagged mb15"]')
        for site in sites_even:
            item = MyTestItem()
            item['author'] = site.select('div[@class="author clearfix"]/a/h2/text()').extract()
            item['content'] = site.select('div[@class="content"]/text()').extract()
            items.append(item)
            # print the item with Chinese characters readable instead of \u escapes
            print repr(item).decode("unicode-escape") + '\n'
        return items
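As a side note, SgmlLinkExtractor and HtmlXPathSelector are deprecated in newer Scrapy releases. A rough sketch of the same spider written against the current LinkExtractor / response.xpath API might look like this (untested against the live site, and the page markup may have changed since):
# -*- coding: utf-8 -*-
from MyTest.items import MyTestItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MytestSpider(CrawlSpider):
    name = "mytest"
    allowed_domains = ["qiushibaike.com"]
    start_urls = ['http://www.qiushibaike.com/hot']

    rules = [
        Rule(LinkExtractor(allow=(r"/hot/page/\d+\?s=\d+",)),
             follow=True, callback='parse_item'),
    ]

    def parse_item(self, response):
        # same XPaths as above, but using the non-deprecated selector API
        for site in response.xpath('//div[@class="article block untagged mb15"]'):
            item = MyTestItem()
            item['author'] = site.xpath('div[@class="author clearfix"]/a/h2/text()').extract()
            item['content'] = site.xpath('div[@class="content"]/text()').extract()
            yield item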
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class MyTestItem(Item):
    # define the fields for your item here
    author = Field()
    content = Field()
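Since the pipeline later calls dict(item), it helps to know that a Scrapy Item behaves like a dict. A quick interactive check (the sample values are made up) would look roughly like this:
>>> from MyTest.items import MyTestItem
>>> item = MyTestItem(author=[u'someone'], content=[u'a joke'])
>>> dict(item)
{'author': [u'someone'], 'content': [u'a joke']}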
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs


class MytestPipeline(object):
    def __init__(self):
        # open the output file once when the pipeline is created
        self.file = codecs.open('items.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters readable in the output file
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # note: the pipeline hook is called close_spider, not spider_closed;
        # with the wrong name Scrapy never calls it and the file is not closed
        self.file.close()
settings.py
BOT_NAME = 'MyTest'
SPIDER_MODULES = ['MyTest.spiders']
NEWSPIDER_MODULE = 'MyTest.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'
ITEM_PIPELINES = {
    'MyTest.pipelines.MytestPipeline': 300,
}
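Besides running scrapy crawl mytest from cmd, the spider can also be started from a plain Python script; a minimal sketch, assuming Scrapy 1.0 or newer (the file name run.py is my own choice):
# run.py - start the crawl without the scrapy command-line tool
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from MyTest.spiders.qiubai import MytestSpider

process = CrawlerProcess(get_project_settings())  # picks up settings.py
process.crawl(MytestSpider)
process.start()  # blocks until the crawl finishes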
After getting the scraped data stored as a JSON file, I tried storing the data in a MySQL database instead:
First check whether Python can talk to MySQL: in a Python shell, try import MySQLdb. If there is no error, the required package is already installed; if it fails, run
easy_install mysql_python to install it.
Then add the database connection settings to settings.py:
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'scrapytest1'
MYSQL_USER = 'root'
MYSQL_PASSWD = '123456'
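Before wiring this into Scrapy, it may be worth checking the credentials by hand. A small sanity-check script using the same values (adjust them to your own setup):
import MySQLdb

# connect with the same values as in settings.py
conn = MySQLdb.connect(host='localhost', user='root', passwd='123456',
                       db='scrapytest1', charset='utf8')
cur = conn.cursor()
cur.execute("SELECT VERSION()")
print cur.fetchone()
conn.close()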
Then add a new pipeline class to pipelines.py:
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy import log


class MySQLStorePipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # note: the argument is "settings", not "setting",
        # and the charset must be 'utf8', not 'utf-8'
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        # a Twisted connection pool so inserts do not block the crawl
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        log.msg('database connection configured')
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the insert in a thread from the pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def handle_error(self, e):
        log.err(e)

    def _conditional_insert(self, tx, item):
        # the encoding of content caused most of the trouble here
        str_content = str(item['content']).decode('unicode-escape')
        tx.execute("insert into qiubai(author,content) values(%s,%s)",
                   (item['author'], str_content))
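The pipeline assumes a qiubai table already exists in the scrapytest1 database. A one-off helper script to create it could look like this (only the column names author and content are fixed by the INSERT above, and id is auto-increment as noted in the summary below; the column types are my own guess):
import MySQLdb

# one-off script to create the table the pipeline inserts into;
# column types are assumptions, only the names match the INSERT statement
conn = MySQLdb.connect(host='localhost', user='root', passwd='123456',
                       db='scrapytest1', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS qiubai (
        id INT AUTO_INCREMENT PRIMARY KEY,
        author VARCHAR(255),
        content TEXT
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()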
Then update settings.py again to enable the new pipeline:
ITEM_PIPELINES = {
    # 'MyTest.pipelines.MytestPipeline': 300,
    'MyTest.pipelines.MySQLStorePipeline': 300,
}
Summary:
1 For the URL rules to take effect, the spider needs to inherit from CrawlSpider.
2 XPath expressions can be tested in the Chrome console with $x(''); to check whether they are correct.
3 To keep Chinese characters readable when storing JSON, use (see the short demonstration after this list):
json.dumps(dict(item), ensure_ascii=False)
4 Encoding caused many problems and cost a lot of time and effort:
1 Create the database with the right character set: CREATE DATABASE **** DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
2 When configuring dbpool, charset must be set to 'utf8'; 'utf-8' also causes problems.
3 When inserting into the database, the encoding of content caused errors, while author worked even without special handling; this probably depends on what content contains.
5 During testing, because the qiubai table's id is auto-increment it kept growing, and deleting rows with DELETE does not reset the id.
Using truncate table qiubai; clears the table data and resets the id.
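To make summary point 3 concrete, here is a tiny demonstration (the sample text is made up, not taken from the crawl):
# -*- coding: utf-8 -*-
import json

item = {'author': u'某人', 'content': u'今天发生了一件糗事'}
print json.dumps(item)                       # Chinese comes out as \uXXXX escapes
print json.dumps(item, ensure_ascii=False)   # Chinese stays readable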