Work Automation (Continued): HTML Parsing

power_water

Category: Technical articles. Tags: python, linux

Article link: https://blog.csdn.net/lijinbinlaa/article/details/106180283

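This post is really about climbing out of a pit someone else dug.

The Excel file sent over by the other party's engineers actually contains embedded HTML. Excel reports an error when opening it, and as a result the Python spreadsheet modules cannot open the file as xlsx.

Solution: add one step before opening the file. Open it with plain file I/O, parse the HTML, extract the data, and write it back out as a real Excel file.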
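One way to decide whether that extra step is needed is to look at the first bytes of the file. The sketch below is not part of the original project; the helper names sniff_spreadsheet and sniff_file are hypothetical. It relies only on the well-known file signatures: a real .xls file starts with the OLE2 header, a real .xlsx file is a zip archive, and an HTML export starts with markup.

```python
import codecs

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls container
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx is a zip archive


def sniff_spreadsheet(head):
    """Classify a file from its first bytes: 'xls', 'xlsx', 'html' or 'unknown'."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Drop a UTF-8 byte order mark, then leading whitespace, before checking for markup.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    if head.lstrip().startswith(b"<"):
        return "html"
    return "unknown"


def sniff_file(path):
    """Read only the first 512 bytes of the file and classify them."""
    with open(path, "rb") as fp:
        return sniff_spreadsheet(fp.read(512))
```

Only files reported as 'html' would then go through the parsing step; everything else can be opened by the usual modules.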

I used the HTML parsing module that ships with Python (html.parser) and wrapped it in a private class that does the parsing.
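As a minimal illustration of the idea, here is a stripped-down sketch that collects table cells with html.parser. It is a simplified variant, not the class used in the project: the names TableRows and parse_table are hypothetical, and it returns plain lists instead of dicts keyed by column name.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, None outside <tr>
        self._cell = None   # text pieces of the cell being read, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html_text):
    """Return all table rows found in html_text as lists of strings."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Keeping the state on the parser instance avoids module-level globals, at the cost of deviating from the layout of the full listing further down.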

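Problems and solutions:

1. Chinese characters in the file were not recognized:

Open the file with codecs and specify the encoding explicitly.

2. The xlwt module could not write the file because it was too large:

xlwt writes the legacy .xls format, which is limited to 65536 rows per sheet, so anything larger cannot be written with it. openpyxl, which writes .xlsx, solves this.

3. Data format problems:

Some values must be stored as float rather than string, so a type conversion is needed. A _num marker in the configuration file identifies the columns that need to be converted to float.

However, the title row shows up again after about sixty thousand rows, which made the conversion fail.

Solution: before converting, check whether the value is Chinese text; if it is, skip the conversion.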
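The conversion rule from the third item can be sketched on its own as follows. The function names convert_cell and convert_row are hypothetical, and the fallback for empty or malformed cells is an extra assumption that the original code does not make.

```python
import re

# CJK Unified Ideographs block; a match means the cell holds Chinese text.
CJK = re.compile(r"[\u4e00-\u9fff]")


def convert_cell(column, text, marker="_num"):
    """Return float(text) for marked columns, unless text is a repeated title."""
    if marker not in column:
        return text
    if CJK.search(text):
        return text  # repeated title row: keep the Chinese text as-is
    try:
        return float(text)
    except ValueError:
        # Assumption: an empty or malformed cell is kept as a string.
        return text


def convert_row(columns, cells):
    """Apply convert_cell to each cell, pairing it with its column key."""
    return [convert_cell(c, t) for c, t in zip(columns, cells)]
```

For example, with the column keys ["name", "amount_num"], a data row ["a", "12.5"] becomes ["a", 12.5], while a repeated title row such as ["名称", "金额"] passes through unchanged.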

The full project source can be downloaded from Gitee: git@gitee.com:powerbinbin/auto_sys_dialy.git

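import codecs
import re
from html.parser import HTMLParser

from openpyxl import Workbook

# set to True to print every parsed row
debug = False

# columns whose key contains this marker are written as float
pat = '_num'
# column keys (taken from data_format), shared with MyHTMLParser
key = []
# parsed rows, one dict per <tr>, filled in by MyHTMLParser
datax = []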

class htmlTrans():
    def __init__(self, file, sheet=None, data_format=None):
        print("class-htmlTrans:init...[file,,, sheet]=[%s,,,%s]" %(file, sheet))
        print("class-htmlTrans:init...[data_format]='%s'" %(data_format))

        # file is a (name, full path) pair: ("xx.xls", "C:\\path\\xx.xls")
        self.fp = codecs.open(file[1], "r", 'utf-8')
        self.par = MyHTMLParser()

        # reset the module-level state before use
        global key, datax
        datax = []
        key = data_format
        print("key===", key)

        self.outwb = Workbook()
        if sheet is None:
            self.outws = self.outwb.create_sheet(title="NNew")
        else:
            self.outws = self.outwb.create_sheet(title=sheet)

        # output name: same base name with an .xlsx extension
        # (rsplit keeps any earlier dots in the file name)
        file_src = file[0]
        self.file_dst = file_src.rsplit('.', 1)[0] + '.xlsx'
        print("class-htmlTrans:init done\n")

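    def is_chinese(self, string):
        # True if any character is in the CJK Unified Ideographs range,
        # which marks a repeated title cell rather than a numeric value
        return any(u"\u4e00" <= ch <= u"\u9fff" for ch in string)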

    def read(self):
        print("class-htmlTrans:read...")
        data = self.fp.read()
        # the whole file is in memory now, so release the handle
        self.fp.close()
        self.par.feed(data)
        print("class-htmlTrans:read done\n")

    def write(self):
        print("class-htmlTrans:writing...")
        index_r = 1

        for row in datax:
            # skip rows that produced no cells
            if not row:
                continue

            index_c = 1
            for i in key:
                value = row[i]
                # keys containing '_num' mark columns that must be stored as float
                if re.search(pat, i) is not None:
                    # the title row comes again around line 65000 (e.g. 债权协议编号),
                    # so Chinese text is written unchanged instead of converted
                    if not self.is_chinese(value):
                        value = float(value)
                self.outws.cell(row=index_r, column=index_c).value = value
                index_c += 1

            index_r += 1

        self.outwb.save(self.file_dst)
        self.outwb.close()
        print("class-htmlTrans:write done [%s]\n" %self.file_dst)
        return self.file_dst

    def work(self):
        self.read()
        self.write()
        return self.file_dst

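class MyHTMLParser(HTMLParser):
    tempstr = str()   # text of the cell currently being read
    count_tr = 0      # number of rows seen so far
    dic = {}          # cells of the current row, keyed by column key

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            # a new row: reset the cell counters and the row dict
            self.count_td = 0
            self.count_th = 0
            self.dic = {}
        elif tag == 'td' or tag == 'th':
            self.tempstr = ''

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.count_tr += 1

            datax.append(self.dic)
            if debug:
                print("tr%d==%s" %(self.count_tr, self.dic))
        elif tag == 'td' or tag == 'th':
            # map the n-th cell of the row to the n-th column key
            self.dic[key[self.count_th%len(key)]] = self.tempstr
            self.count_th += 1

    def handle_data(self, data):
        # append rather than overwrite, so that text split by nested tags
        # or followed by whitespace inside the cell is not lost
        self.tempstr += data.strip()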
