Information Technology Handbook visualization progress report: scraping data with Python 3 and BeautifulSoup and saving it to a MySQL database...

Latest recommended article published 2022-12-11 14:38:32

weixin_30906701

Tags: python, database, crawler

Original link: http://www.cnblogs.com/w-honey/p/10583205.html

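Our teacher gave us a Word document containing a handbook that explains popular terms from the information industry, and asked us to store its text in a database and then display it on a front end.

The first problem was getting the data into MySQL. Everyone had their own approach; mine was to convert the Word document to an HTML file, then use scraping techniques to extract the content and save it to the database.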
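
An HTML file exported from Word is not necessarily UTF-8, so reading it is the first place this approach can fail. The following is a minimal sketch of reading such a file with a fallback list of encodings. The helper name `read_html` and the candidate list (`utf-8-sig`, then `gbk`) are assumptions for illustration, not part of the original code; the right encoding depends on how the file was exported.

```python
from pathlib import Path


def read_html(path, encodings=("utf-8-sig", "gbk")):
    """Return the file's text, trying each candidate encoding in order.

    The candidate list is an assumption; adjust it to match the export.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            # utf-8-sig also strips a leading byte order mark if present
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError(f"none of {encodings} could decode {path}")
```

Decoding the raw bytes explicitly makes a wrong guess fail loudly instead of silently producing garbled text.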

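At the time of writing I had only just finished loading the data into the database, so this post covers my scraper code. The next post will cover displaying the MySQL data in a WeChat mini program.

Python has many libraries for scraping; I used BeautifulSoup, which parses HTML into a searchable tree. First import it at the top of the file: from bs4 import BeautifulSoup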

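Some commonly used BeautifulSoup functions:

find()  # returns the first matching tag, or None if nothing matches

find_all(name, attrs, recursive, string, limit, **kwargs)  # searches the descendants of the current tag, checks each against the filters, and returns a list of matches

select()  # similar to find_all(), but takes a CSS selector; for example select('#id') returns the element with that id. find_all() is the one I use more often

get_text()  # returns the text inside a tag, including the text of its descendants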
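
To make the list above concrete, here is a small sketch that runs the four calls on a made-up HTML snippet. The snippet and its `id` and `class` values are invented for illustration. It uses Python's built-in `html.parser` so it runs without extra packages beyond `bs4`, whereas the code later in this post uses the `lxml` parser.

```python
from bs4 import BeautifulSoup

# Invented sample markup, not taken from the handbook
html = """
<div id="toc">
  <p class="entry"><a href="#t1">1 Big data</a></p>
  <p class="entry"><a href="#t2">2 Cloud computing</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                        # first matching tag, or None
entries = soup.find_all("p", {"class": "entry"})   # list-like result of all matches
toc = soup.select("#toc")                          # CSS selector: element with id="toc"

first_text = first_link.get_text()                 # text inside the first <a>
entry_texts = [p.get_text() for p in entries]      # text of each entry paragraph
print(first_text, entry_texts, len(toc))
```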

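Some classmates wrote their scrapers with XPath instead.

XPath is a language for locating information in XML documents. 
BeautifulSoup is a library rather than a language: it searches the tree that BeautifulSoup() builds from the document. 
re (regular expressions) works only on strings, not on a parsed tree.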
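
The difference between a tree query and a string match can be sketched with the standard library alone. This is an illustration on an invented, well-formed snippet, and it assumes `xml.etree.ElementTree` as a stand-in, which supports only a limited subset of XPath; a full XPath engine such as the one in `lxml` accepts more expressions.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample document; it must be well-formed XML for ElementTree
doc = (
    "<toc>"
    "<a class='term'>1 Big data</a>"
    "<a class='note'>skip</a>"
    "<a class='term'>2 Cloud computing</a>"
    "</toc>"
)

# XPath-style query: operates on a parsed tree and filters by attribute
root = ET.fromstring(doc)
xpath_hits = [a.text for a in root.findall(".//a[@class='term']")]

# Regular expression: operates on the raw string and knows nothing about nesting
regex_hits = re.findall(r"<a class='term'>(.*?)</a>", doc)

print(xpath_hits)
print(regex_hits)
```

Both approaches find the same two entries here, but the regular expression depends on the exact attribute order and quoting in the source text, which is why a parser is usually the safer choice for HTML.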

Here is the code:

from bs4 import BeautifulSoup
import pymysql

# The data comes from a local file
def read_file(path):
    # The encoding may need changing: 'ANSI' (an alias of 'mbcs') exists only on Windows
    with open(path, 'r', encoding='ANSI') as f:
        text = f.read()
    return text.strip().replace('\ufeff', '')

# Parse the table of contents, then the entry bodies
def parse_data(data):
    # Each chunk runs from one MsoToc1 (top-level TOC) paragraph to the next
    for chunk in data.split('class=MsoToc1')[1:]:
        bs = BeautifulSoup(chunk, 'lxml')
        title1 = ""
        title2 = ""
        title3 = ""
        try:
            for tag in bs.select('a'):
                parts = tag.get_text().split(' ')
                label = parts[0].rstrip()
                if '第' in label and '篇' in label:
                    # "第…篇" marks a part heading
                    title1 = parts[1].replace('.', '')
                elif '第' in label and '章' in label:
                    # "第…章" marks a chapter heading
                    title2 = parts[1].replace('.', '')
                else:
                    # Anything else is a numbered term entry; label is its index
                    title3 = parts[1].replace('.', '')
                    save(label, title1, title2, title3)
        except Exception as e:
            print("Bad data, skipping the rest of this section:", e)
    # The entry bodies live in the WordSection3 block
    bigdiv = data.split('class=WordSection3')[1]
    for chunk in bigdiv.split('class=3132020')[1:]:
        # Restore the opening tag that split() consumed
        soup = BeautifulSoup('<p class=3132020 ' + chunk, 'lxml')
        index = int(soup.find('p', {'class': '3132020'}).get_text().split(' ')[0])
        content = ""
        for tag in soup.find_all('p', {'class': '4'}):
            content += tag.get_text() + '\r\n'
        update(index, content)
# Insert one table-of-contents row into the database
def save(index, title1, title2, title3):
    db = pymysql.connect(host='localhost', user='root', password='root',
                         database='jaovo_msg', charset='utf8')
    cursor = db.cursor()  # cursor used to run statements
    # Placeholders let the driver escape the values, so a quote in a title cannot break the SQL
    sql = "INSERT INTO datasfromhtml(`index`,title1,title2,title3) VALUES (%s,%s,%s,%s)"
    try:
        cursor.execute(sql, (int(index), title1, title2, title3))
        db.commit()
    except pymysql.MySQLError:
        # Roll back on error
        db.rollback()
    finally:
        # Close the database connection
        db.close()

# Update the content column of an existing row
def update(index, content):
    db = pymysql.connect(host='localhost', user='root', password='root',
                         database='jaovo_msg', charset='utf8')
    cursor = db.cursor()  # cursor used to run statements
    # Placeholders let the driver escape the values, so a quote in the content cannot break the SQL
    sql = "UPDATE datasfromhtml SET content = %s WHERE `index` = %s"
    try:
        cursor.execute(sql, (content, int(index)))
        db.commit()
    except pymysql.MySQLError:
        # Roll back on error
        db.rollback()
    finally:
        # Close the database connection
        db.close()

if __name__ == '__main__':
    html = read_file('../resource/HB.htm')
    parse_data(html)
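
One thing worth checking in the database step is how values reach the SQL text. Handbook entries can contain quote characters, and splicing them into the statement with Python string formatting then produces invalid SQL. The sketch below shows the problem and the usual alternative of keeping the statement and its values separate. The helper `build_insert` and the sample titles are hypothetical names for illustration; the table and column names are the ones used in this post.

```python
def build_insert(index, title1, title2, title3):
    """Return (sql, params) meant to be passed to cursor.execute(sql, params)."""
    sql = (
        "INSERT INTO datasfromhtml(`index`, title1, title2, title3) "
        "VALUES (%s, %s, %s, %s)"
    )
    return sql, (int(index), title1, title2, title3)


# Splicing a value that contains a quote yields an unbalanced SQL string literal
broken = "INSERT INTO t(title) VALUES ('%s')" % "Moore's law"
print(broken)

# Keeping values separate leaves the statement untouched; the driver escapes them
sql, params = build_insert("12", "Part 1", "Chapter 2", "Moore's law")
print(sql)
print(params)
# With PyMySQL this pair would be run as: cursor.execute(sql, params)
```

The `%s` markers here are driver placeholders, not Python format specifiers, so they are used for numbers as well as strings and are never wrapped in quotes.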

Reposted from: https://www.cnblogs.com/w-honey/p/10583205.html