别总存本地了,爬书内容太大了,存进数据库吧。数据库:MySQL5.5.29;可视化工具:navicat fo mysql。/20171230
navicat中的“截断表”就是"truncate",而“清空表"delete"。/20180102
两者区别总结:
- truncate是整体删除(速度较快), delete是逐条删除(速度较慢)。
- truncate不写服务器log,delete写服务器log,也就是truncate效率比delete高的原因。
- truncate不激活trigger(触发器),但是会重置Identity(标识列、自增字段),相当于自增列会被置为初始值,又重新从1开始记录,而不是接着原来的ID数。而delete删除以后,Identity依旧是接着被删除的最近的那一条记录ID加1后进行记录。
- 如果只需删除表中的部分记录,只能使用DELETE语句配合where条件。 DELETE FROM wp_comments WHERE……
__author__ = '姜枫渔火'
import requests
import re
import pymysql
conn = pymysql.connect(
host = 'localhost',
port = 3306,
user = 'root',
passwd = '1a2b3c4d',
db = 'test',
charset = 'utf8' # 不是'utf-8'
)
cursor = conn.cursor()
def GetSortNovelList():
response = requests.get('http://www.quanshuwang.com/list/1_1.html')
response.encoding = 'gbk' # Headers没有返回编码,查看源代码网页为'gbk'格式
result = response.text
# print(result)
pattern = re.compile(r'<a target="_blank" title="(.+?)" href="(.+?)"')
novel_list = re.findall(pattern, result)
return novel_list
def GetNovelInfo(url):
response = requests.get(url)
response.encoding = 'gbk'
result = response.text
pattern = re.compile(r'<meta property="og:image" content="(.+?)".+?category" content="(.+?)"/.+?'
r'author" content="(.+?)".+?status" content="(.+?)"', re.S)
# print(re.findall(pattern, result))
img_url, sort, author, status = re.findall(pattern, result)[0]
chapter_list_url = re.findall(r'.+?<a href="(.+?)" class="reader"', result)[0]
return img_url, sort, author, status, chapter_list_url
def GetChapterList(url):
response = requests.get(url)
response.encoding = 'gbk'
result = response.text
pattern = re.compile(r'<li><a href="(.+?)" title=".+?">(.+?)</a></li>')
chapter_content = re.findall(pattern, result)
return chapter_content
def GetChapterContent(url):
response = requests.get(url)
response.encoding = 'gbk'
result = response.text
pattern = re.compile(r'style5\(\);</script>(.+?)<script type="text/javascript"',re.S)
chapter_url = re.findall(pattern, result)[0]
return chapter_url
for novel_name, novel_url in GetSortNovelList():
# print(novel_name, novel_url)
# GetNovelInfo(novel_url)
img_url, sort, author, status, chapter_list_url = GetNovelInfo(novel_url)
# print(sort, novel_name, img_url, status, author)
cursor.execute("insert into novel(sortname, novelname, imgurl, description, status, author)values"
"('{}', '{}', '{}', '{}', '{}', '{}')".format(sort, novel_name, img_url, u'这里是描述', status, author))
conn.commit()
lastrowid = cursor.lastrowid # 刚才插入数据的ID值,主键,自增
print('lastrowid = ' + str(lastrowid))
for chapter_url, chapter_name in GetChapterList(chapter_list_url):
chapter_content = GetChapterContent(chapter_url)
cursor.execute("insert into chapter(novel_id, title, content) values({}, '{}', '{}')".format(lastrowid, chapter_name, chapter_content))
conn.commit()
cursor.close()
conn.close()