Python: Easily Extract Table Content from an HTML Page and Write It to a Database

Author: chaodaibing

Article link: https://blog.csdn.net/chaodaibing/article/details/115524212

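As I mentioned in an earlier post, Selenium makes it easy to grab the table content of a web page, but it requires installing a browser and downloading the matching WebDriver, which is not very convenient. I found a more convenient way: Python's built-in html module (its html.parser submodule). Because it ships with Python, the parsing step needs no extra installation.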
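
Before the full parser, it may help to see how HTMLParser drives its callbacks. The following is a minimal sketch, not part of the original post: the class name EventLogger and the sample HTML string are made up for illustration. It simply records every start tag, end tag and non-empty text node in the order the parser reports them.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks in the order they fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():                      # ignore whitespace-only text
            self.events.append(("data", data.strip()))

# Made-up sample document with one header row and one data row
logger = EventLogger()
logger.feed("<table><tr><th>name</th></tr><tr><td>Alice</td></tr></table>")
logger.close()
for event in logger.events:
    print(event)
```

The text of a cell arrives in handle_data between the start and end callbacks of its th or td tag, which is why a table parser only needs to remember which kind of cell it is currently inside.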

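from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
  def __init__(self):
    super().__init__()
    self.in_td = False
    self.in_th = False
    self.data = []      # all rows
    self.line = []      # cells of the current row
    self.header = []    # column headers
    self.cell = []      # text fragments of the current cell

  def close_cell(self):
    # Store the finished cell; empty cells are kept as '' so columns stay aligned
    text = ''.join(self.cell).strip()
    if self.in_th: self.header.append(text)
    if self.in_td: self.line.append(text)
    self.in_th = self.in_td = False
    self.cell = []

  def handle_starttag(self, tag, attrs):
    # Tags nested inside a cell (<a>, <span>, <br> ...) must not reset the flags
    if tag in ('th', 'td', 'tr'):
      self.close_cell()              # also covers cells whose closing tag was omitted
    if tag == 'th': self.in_th = True
    if tag == 'td': self.in_td = True
    if tag == 'tr': self.line = []   # start collecting a new row

  def handle_endtag(self, tag):
    if tag in ('th', 'td', 'tr'):
      self.close_cell()
    if tag == 'tr' and self.line:    # end of row
      self.data.append(self.line)    # pack the row

  def handle_data(self, data):
    if self.in_th or self.in_td:     # text inside a header or data cell
      self.cell.append(data)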
     
if __name__ == '__main__':
  import pandas as pd
  from sqlalchemy import create_engine

  with open("target.html", "r", encoding="utf-8") as f:  # read the HTML file
    cnt = f.read()
  parser = MyHTMLParser()
  parser.feed(cnt)            # feed the document to the parser
  data = parser.data          # table rows
  header = parser.header      # column headers
  df = pd.DataFrame(data, columns=header)
  # Export to Excel (pandas needs openpyxl installed to write .xlsx)
  df.to_excel('dump.xlsx', index=False)
  # Or export to MySQL (replace user, pass and dbname with your own values)
  conn = create_engine("mysql+pymysql://user:pass@127.0.0.1:3306/dbname")
  df.to_sql(name='temp', con=conn, if_exists='replace', index=False)
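
If pandas and SQLAlchemy are not installed, the parsed header and rows can also be written to a database with the standard library alone. The following is a minimal sketch under stated assumptions, not the method used above: it swaps MySQL for an in-memory SQLite database, the header and data values are made-up stand-ins for parser.header and parser.data, and the table name html_table and the helpers quote_ident and write_rows are hypothetical.

```python
import sqlite3

# Hypothetical sample values standing in for parser.header and parser.data
header = ["name", "age"]
data = [["Alice", "30"], ["Bob", "25"]]

def quote_ident(name):
    """Quote a table or column name for SQLite by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def write_rows(conn, table, header, rows):
    """Recreate `table` with one TEXT column per header entry and insert the rows."""
    columns = ", ".join(quote_ident(h) + " TEXT" for h in header)
    placeholders = ", ".join("?" for _ in header)
    # Dropping first mimics if_exists='replace'
    conn.execute("DROP TABLE IF EXISTS " + quote_ident(table))
    conn.execute("CREATE TABLE " + quote_ident(table) + " (" + columns + ")")
    # Skip rows whose cell count does not match the header
    good_rows = [row for row in rows if len(row) == len(header)]
    conn.executemany(
        "INSERT INTO " + quote_ident(table) + " VALUES (" + placeholders + ")",
        good_rows,
    )
    conn.commit()
    return len(good_rows)

conn = sqlite3.connect(":memory:")   # pass a file path instead to keep the data
inserted = write_rows(conn, "html_table", header, data)
stored = conn.execute('SELECT * FROM "html_table"').fetchall()
print(inserted, stored)
```

Every column is stored as TEXT here because the parser yields strings; converting numeric columns is left to the caller.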

Here target.html is the HTML file that was downloaded beforehand. To fetch the page directly from the web instead, do this:

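import requests

url = "https://example.com/table.html"   # placeholder: address of the page that holds the table
res = requests.get(url, timeout=10)
res.raise_for_status()   # stop early on an HTTP error
res.encoding = 'utf-8'   # avoid garbled text; use the page's real encoding if it is not UTF-8
cnt = res.text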
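
In keeping with the built-in-modules theme, the download step can also be done without requests. This is a rough sketch using urllib.request from the standard library, not something from the original post: the function names decode_body and fetch_html are hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption that will not suit every site.

```python
from urllib.request import urlopen

def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode response bytes, preferring the charset declared by the server."""
    return raw.decode(charset or fallback, errors="replace")

def fetch_html(url, timeout=10):
    """Download a page and return its text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        # get_content_charset() returns None when the Content-Type header has no charset
        charset = resp.headers.get_content_charset()
        return decode_body(resp.read(), charset)

# Decoding demo with made-up bytes; no network access needed
print(decode_body("表格".encode("gbk"), "gbk"))
```

The returned string can be passed to parser.feed() exactly like the cnt variable above.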
