Learning Python Scraping with the Scrapy Framework (Part 2): Crawling the Zongheng Monthly Ticket Ranking
This post follows the same approach as the first one and is mainly for practice; see the first post for the finer details.
Project resource link
1. Create the Scrapy project
scrapy startproject douban
cd douban
scrapy genspider book qidian.com
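For reference, these commands produce a project skeleton roughly like the following (the exact set of files can vary slightly across Scrapy versions):

```text
douban/
├── scrapy.cfg
└── douban/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── book.py        # created by `scrapy genspider book qidian.com`
```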
2. Define the data model
– items.py
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    img_addr = scrapy.Field()
    writer = scrapy.Field()
    tag = scrapy.Field()
    detail_addr = scrapy.Field()
    intro = scrapy.Field()
3. Write the spider
Target URL: Zongheng monthly ticket ranking
– book.py
import scrapy
from scrapy import Request
from douban.items import DoubanItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['qidian.com']
    start_urls = ['https://www.qidian.com/finish?action=hidden&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=2&page=1']
    url_sets = set()  # pagination URLs already scheduled, to avoid re-crawling

    def parse(self, response):
        books = response.xpath("//div[@class='book-img-text']/ul[@class='all-img-list cf']/li")
        for obj in books:
            item = DoubanItem()
            item['title'] = obj.xpath("./div[@class='book-mid-info']/h4/a/text()").extract()[0]
            item['img_addr'] = obj.xpath("./div[@class='book-img-box']/a/img/@src").extract()[0]
            item['writer'] = obj.xpath("./div[@class='book-mid-info']/p[@class='author']/a[@class='name']/text()").extract()[0]
            item['tag'] = obj.xpath("./div[@class='book-mid-info']/p[@class='author']/a[2]/text()").extract()[0]
            item['detail_addr'] = obj.xpath("./div[@class='book-mid-info']/h4/a/@href").extract()[0]
            item['intro'] = obj.xpath("./div[@class='book-mid-info']/p[@class='intro']/text()").extract()[0]
            yield item

        # Follow pagination links; the hrefs are scheme-relative ("//www.qidian.com/...")
        urls = response.xpath("//div[@class='lbf-pagination']/ul/li/a/@href").extract()
        for url in urls:
            url = 'https:' + url
            if url.startswith('https://www.qidian.com') and url not in self.url_sets:
                self.url_sets.add(url)
                # make_requests_from_url was deprecated in Scrapy 1.4 and removed
                # in 2.0; build the Request explicitly instead
                yield Request(url, callback=self.parse)
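The pagination handling above completes scheme-relative hrefs with "https:", keeps only qidian.com links, and uses a set to skip pages it has already scheduled. That logic can be isolated and tested on its own; here is a minimal sketch (the helper name `new_page_urls` is made up for illustration, it is not part of the project):

```python
def new_page_urls(hrefs, seen):
    """Yield absolute qidian.com URLs that have not been seen before."""
    for href in hrefs:
        url = 'https:' + href          # hrefs are scheme-relative
        if url.startswith('https://www.qidian.com') and url not in seen:
            seen.add(url)              # remember it so duplicates are skipped
            yield url

seen = set()
hrefs = ['//www.qidian.com/finish?page=2',
         '//www.qidian.com/finish?page=2',   # duplicate, skipped
         '//example.com/other']              # foreign domain, skipped
print(list(new_page_urls(hrefs, seen)))
# ['https://www.qidian.com/finish?page=2']
```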
4. Write the data-processing script
The database uses two tables: one stores the novel info and the other stores the tags, with the two linked. Once the data has been scraped, a small program can do some simple cleanup, after which you can run some basic data analysis. The program that tidies up the database is also included in the linked resources.
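One possible shape for those two tables is sketched below. Only `shownovel_book` and its columns are taken from the pipeline code; the tag table and the linking column are assumptions based on the description above:

```sql
-- shownovel_book matches the insert in pipelines.py;
-- shownovel_tag and the foreign key are illustrative assumptions.
CREATE TABLE shownovel_book (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    img_addr VARCHAR(255),
    writer VARCHAR(255),
    tag VARCHAR(64),
    detail_addr VARCHAR(255),
    intro TEXT
);

CREATE TABLE shownovel_tag (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) UNIQUE
);
-- A post-processing step could then replace shownovel_book.tag
-- with a foreign key referencing shownovel_tag.id.
```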
– pipelines.py
import pymysql


class DoubanPipeline:
    def process_item(self, item, spider):
        # Save the cover image (optional; requires `import os, urllib.request`)
        # url = item['img_addr']
        # req = urllib.request.Request(url)
        # with urllib.request.urlopen(req) as pic:
        #     data = pic.read()
        # file_name = os.path.join(r'D:\bookpic', item['title'] + '.jpg')
        # with open(file_name, 'wb') as fp:
        #     fp.write(data)

        # Save to the database
        info = [item['title'], item['img_addr'], item['writer'], item['tag'], item['detail_addr'], item['intro']]
        connection = pymysql.connect(host='localhost', user='root', password='', database='topnovel', charset='utf8')
        try:
            with connection.cursor() as cursor:
                sql = 'insert into shownovel_book (title, img_addr, writer, tag, detail_addr, intro) values (%s, %s, %s, %s, %s, %s)'
                affectedcount = cursor.execute(sql, info)
                print('Inserted {0} row(s)'.format(affectedcount))
            connection.commit()
        except pymysql.DatabaseError:
            connection.rollback()
        finally:
            connection.close()
        return item
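The insert uses driver-level parameter binding rather than string formatting, which avoids SQL injection and quoting problems. The same pattern can be tried without a MySQL server using the stdlib `sqlite3` module; this is only a self-contained sketch (the row data is made up, and note that sqlite3 uses `?` placeholders where pymysql uses `%s`):

```python
import sqlite3

# In-memory database mirroring the shownovel_book table from pipelines.py
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE shownovel_book
                (title TEXT, img_addr TEXT, writer TEXT,
                 tag TEXT, detail_addr TEXT, intro TEXT)''')

info = ['Example Title', '//img.example/x.jpg', 'Some Writer',
        'Fantasy', '//book.example/1', 'A short intro']

# Parameterized insert: values are bound by the driver, never interpolated
cur = conn.execute(
    'insert into shownovel_book (title, img_addr, writer, tag, detail_addr, intro) '
    'values (?, ?, ?, ?, ?, ?)', info)
conn.commit()
print(cur.rowcount)  # 1
```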
5. Edit the configuration
– add the following to settings.py
ITEM_PIPELINES = {
    # must match the project and class created above, not 'novel.pipelines.NovelPipeline'
    'douban.pipelines.DoubanPipeline': 100,
}
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
6. Run the spider
scrapy crawl book