利用pymysql爬虫后导入数据库

别总存本地了,爬书内容太大了,存进数据库吧。数据库:MySQL5.5.29;可视化工具:navicat fo mysql。/20171230

navicat中的“截断表”就是"truncate",而“清空表"delete"。/20180102

两者区别总结:

  • truncate是整体删除(速度较快), delete是逐条删除(速度较慢)。
  • truncate不写服务器log,delete写服务器log,也就是truncate效率比delete高的原因。
  • truncate不激活trigger(触发器),但是会重置Identity(标识列、自增字段),相当于自增列会被置为初始值,又重新从1开始记录,而不是接着原来的ID数。而delete删除以后,Identity依旧是接着被删除的最近的那一条记录ID加1后进行记录。
  • 如果只需删除表中的部分记录,只能使用DELETE语句配合where条件。 DELETE FROM wp_comments WHERE……

__author__ = '姜枫渔火'
import requests
import re
import pymysql

conn = pymysql.connect(
    host = 'localhost',
    port = 3306,
    user = 'root',
    passwd = '1a2b3c4d',
    db = 'test',
    charset = 'utf8'    # 不是'utf-8'
)

cursor = conn.cursor()

def GetSortNovelList():
    response = requests.get('http://www.quanshuwang.com/list/1_1.html')
    response.encoding = 'gbk'    # Headers没有返回编码,查看源代码网页为'gbk'格式
    result = response.text
    # print(result)
    pattern = re.compile(r'<a target="_blank" title="(.+?)" href="(.+?)"')
    novel_list = re.findall(pattern, result)
    return novel_list

def GetNovelInfo(url):
    response = requests.get(url)
    response.encoding = 'gbk'
    result = response.text
    pattern = re.compile(r'<meta property="og:image" content="(.+?)".+?category" content="(.+?)"/.+?'
                         r'author" content="(.+?)".+?status" content="(.+?)"', re.S)
    # print(re.findall(pattern, result))
    img_url, sort, author, status = re.findall(pattern, result)[0]
    chapter_list_url = re.findall(r'.+?<a href="(.+?)" class="reader"', result)[0]
    return img_url, sort, author, status, chapter_list_url

def GetChapterList(url):
    response = requests.get(url)
    response.encoding = 'gbk'
    result = response.text
    pattern = re.compile(r'<li><a href="(.+?)" title=".+?">(.+?)</a></li>')
    chapter_content = re.findall(pattern, result)
    return chapter_content

def GetChapterContent(url):
    response = requests.get(url)
    response.encoding = 'gbk'
    result = response.text
    pattern = re.compile(r'style5\(\);</script>(.+?)<script type="text/javascript"',re.S)
    chapter_url = re.findall(pattern, result)[0]
    return chapter_url


for novel_name, novel_url in GetSortNovelList():
    # print(novel_name, novel_url)
    # GetNovelInfo(novel_url)
    img_url, sort, author, status, chapter_list_url = GetNovelInfo(novel_url)
    # print(sort, novel_name, img_url, status, author)
    cursor.execute("insert into novel(sortname, novelname, imgurl, description, status, author)values"
                  "('{}', '{}', '{}', '{}', '{}', '{}')".format(sort, novel_name, img_url, u'这里是描述', status, author))
    conn.commit()
    lastrowid = cursor.lastrowid    # 刚才插入数据的ID值,主键,自增
    print('lastrowid = ' + str(lastrowid))
    for chapter_url, chapter_name in GetChapterList(chapter_list_url):
        chapter_content = GetChapterContent(chapter_url)
        cursor.execute("insert into chapter(novel_id, title, content) values({}, '{}', '{}')".format(lastrowid, chapter_name, chapter_content))
        conn.commit()

cursor.close()
conn.close()


  • 3
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值