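This article was written in Word and converted to this format by the program it describes. If you like, you can jump straight to the complete code, configure it, and use it 💕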
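1 Problem description
My Word document contains a table laid out as follows ("Point No." spans two rows, "Image coordinates" spans two columns, and "Ground coordinates" spans three columns):
Point No. | Image coordinates | | Ground coordinates | | |
 | x(mm) | y(mm) | X(mm) | Y(mm) | Z(mm) |
1 | -86.15 | -68.99 | 36589.41 | 25273.32 | 2195.17 |
2 | -53.4 | 82.21 | 37631.08 | 31324.51 | 728.69 |
3 | -14.78 | -76.63 | 39100.97 | 24934.98 | 2386.5 |
4 | 10.46 | 64.43 | 40426.54 | 30319.81 | 757.31 |
How can a table in a Word document be converted into the Markdown + HTML format that CSDN can display?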
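2 Table formats supported by CSDN
2.1 Markdown format
The basic Markdown table syntax is as follows (the text after each # is an annotation, not part of the syntax):
|column1| column 2| column 3|  # column headers
|:----------|:----------:|-----------:|  # a ':' on the left (right) means left (right) alignment; a ':' on both sides means centered
|cell1 |cell2 |cell3 |  # cell contents
So to reproduce the table from the problem description, you would enter: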
|Point No.|Image x(mm)|Image y(mm)|Ground X(mm)|Ground Y(mm)|Ground Z(mm)|
|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
|1|-86.15|-68.99|36589.41|25273.32|2195.17|
|2|-53.4|82.21|37631.08|31324.51|728.69|
|3|-14.78|-76.63|39100.97|24934.98|2386.5|
|4|10.46|64.43|40426.54|30319.81|757.31|
The result looks like this:
Point No. | Image x(mm) | Image y(mm) | Ground X(mm) | Ground Y(mm) | Ground Z(mm) |
---|---|---|---|---|---|
1 | -86.15 | -68.99 | 36589.41 | 25273.32 | 2195.17 |
2 | -53.4 | 82.21 | 37631.08 | 31324.51 | 728.69 |
3 | -14.78 | -76.63 | 39100.97 | 24934.98 | 2386.5 |
4 | 10.46 | 64.43 | 40426.54 | 30319.81 | 757.31 |
Markdown's built-in table syntax cannot merge cells. If you need merged cells, you have to write the table in HTML instead.
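The pipe-table syntax in this section is regular enough to generate from data. Below is a minimal sketch, not part of the original program: the function name `to_markdown_table` and its `align` parameter are made up for illustration, and it assumes every row has as many values as the header.

```python
def to_markdown_table(header, rows, align="center"):
    """Build a Markdown pipe table; align is 'left', 'center', or 'right'."""
    # The position of ':' in the separator row controls the alignment
    markers = {"left": ":---", "center": ":---:", "right": "---:"}
    marker = markers[align]
    lines = [
        "|" + "|".join(header) + "|",
        "|" + "|".join(marker for _ in header) + "|",
    ]
    for row in rows:
        lines.append("|" + "|".join(str(value) for value in row) + "|")
    return "\n".join(lines)


header = ["Point No.", "Image x(mm)", "Image y(mm)"]
rows = [[1, -86.15, -68.99], [2, -53.4, 82.21]]
print(to_markdown_table(header, rows))
```

Every cell here occupies exactly one column, which is the limitation noted above: this syntax has no way to express a merged cell.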
2.2 HTML format
The basic HTML table syntax is:
<table>
<tr>
<td>Column 1</td>
<td>Column 2</td>
</tr>
<tr>
<td colspan="2">Merged row 1</td>
</tr>
<tr>
<td colspan="2">Merged row 2</td>
</tr>
</table>
To reproduce the table from the problem description, you would enter:
<table>
<tr>
<td rowspan=2>Point No.</td><td colspan=2>Image coordinates</td><td colspan=3>Ground coordinates</td>
</tr>
<tr>
<td>x(mm)</td><td>y(mm)</td><td>X(mm)</td><td>Y(mm)</td><td>Z(mm)</td>
</tr>
<tr>
<td>1</td><td>-86.15</td><td>-68.99</td><td>36589.41</td><td>25273.32</td><td>2195.17</td>
</tr>
<tr>
<td>2</td><td>-53.4</td><td>82.21</td><td>37631.08</td><td>31324.51</td><td>728.69</td>
</tr>
<tr>
<td>3</td><td>-14.78</td><td>-76.63</td><td>39100.97</td><td>24934.98</td><td>2386.5</td>
</tr>
<tr>
<td>4</td><td>10.46</td><td>64.43</td><td>40426.54</td><td>30319.81</td><td>757.31</td>
</tr>
</table>
The result looks like this:
Point No. | Image coordinates | | Ground coordinates | | |
 | x(mm) | y(mm) | X(mm) | Y(mm) | Z(mm) |
1 | -86.15 | -68.99 | 36589.41 | 25273.32 | 2195.17 |
2 | -53.4 | 82.21 | 37631.08 | 31324.51 | 728.69 |
3 | -14.78 | -76.63 | 39100.97 | 24934.98 | 2386.5 |
4 | 10.46 | 64.43 | 40426.54 | 30319.81 | 757.31 |
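The HTML above can also be assembled from a description of each cell and its spans. This is a minimal sketch under assumed conventions: the `html_table` function and the `(text, colspan, rowspan)` tuple layout are hypothetical and only illustrate how `colspan` and `rowspan` fit together.

```python
def html_table(rows):
    """rows: list of rows; each cell is a (text, colspan, rowspan) tuple."""
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        cells = []
        for text, colspan, rowspan in row:
            attrs = ""
            if colspan > 1:
                attrs += ' colspan="{}"'.format(colspan)
            if rowspan > 1:
                attrs += ' rowspan="{}"'.format(rowspan)
            cells.append("<td{}>{}</td>".format(attrs, text))
        parts.append("".join(cells))
        parts.append("</tr>")
    parts.append("</table>")
    return "\n".join(parts)


# The two header rows of the example table
header_rows = [
    [("Point No.", 1, 2), ("Image coordinates", 2, 1), ("Ground coordinates", 3, 1)],
    [("x(mm)", 1, 1), ("y(mm)", 1, 1), ("X(mm)", 1, 1), ("Y(mm)", 1, 1), ("Z(mm)", 1, 1)],
]
print(html_table(header_rows))
```

Note that the second row lists only five cells: the slot under "Point No." is already covered by its rowspan, so it must not be written again.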
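3 Converting the table
3.1 Reading the table
3.1.1 Ignoring merged cells
If the table has no merged cells, we can iterate over every row in the table object's rows, and then over every cell in each row's cells: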
import docx  # python-docx


def Get_table_format(table):
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)


if __name__ == '__main__':
    doc = docx.Document(r'表格篇.docx')  # open the .docx file
    Get_table_format(doc.tables[0])  # print the contents of the first table in the file
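3.1.2 Handling merged cells
The previous subsection prints the table row by row. Read this way, cells that are merged across a row share the same address. Printing both cell and cell.text for the target table returns the following: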
<docx.table._Cell object at 0x000002002A386560>
Point No.
<docx.table._Cell object at 0x000002002A3866E0>
Image coordinates
<docx.table._Cell object at 0x000002002A3866E0>
Image coordinates
<docx.table._Cell object at 0x000002002A386740>
Ground coordinates
<docx.table._Cell object at 0x000002002A386740>
Ground coordinates
<docx.table._Cell object at 0x000002002A386740>
Ground coordinates
<docx.table._Cell object at 0x000002002A3866B0>
Point No.
<docx.table._Cell object at 0x000002002A386860>
x(mm)
<docx.table._Cell object at 0x000002002A3868C0>
y(mm)
<docx.table._Cell object at 0x000002002A386920>
X(mm)
<docx.table._Cell object at 0x000002002A386980>
Y(mm)
<docx.table._Cell object at 0x000002002A3869E0>
Z(mm)
<docx.table._Cell object at 0x000002002A386A70>
1
<docx.table._Cell object at 0x000002002A386AD0>
-86.15
<docx.table._Cell object at 0x000002002A386B30>
-68.99
<docx.table._Cell object at 0x000002002A386B90>
36589.41
<docx.table._Cell object at 0x000002002A386BF0>
25273.32
<docx.table._Cell object at 0x000002002A386C50>
2195.17
<docx.table._Cell object at 0x000002002A386CE0>
2
<docx.table._Cell object at 0x000002002A386D40>
-53.4
<docx.table._Cell object at 0x000002002A386DA0>
82.21
<docx.table._Cell object at 0x000002002A386E00>
37631.08
<docx.table._Cell object at 0x000002002A386E60>
31324.51
<docx.table._Cell object at 0x000002002A386EC0>
728.69
<docx.table._Cell object at 0x000002002A386F50>
3
<docx.table._Cell object at 0x000002002A386FB0>
-14.78
<docx.table._Cell object at 0x000002002A387010>
-76.63
<docx.table._Cell object at 0x000002002A387070>
39100.97
<docx.table._Cell object at 0x000002002A3870D0>
24934.98
<docx.table._Cell object at 0x000002002A387130>
2386.5
<docx.table._Cell object at 0x000002002A3871C0>
4
<docx.table._Cell object at 0x000002002A387220>
10.46
<docx.table._Cell object at 0x000002002A3864D0>
64.43
<docx.table._Cell object at 0x000002002A3872B0>
40426.54
<docx.table._Cell object at 0x000002002A3863E0>
30319.81
<docx.table._Cell object at 0x000002002A3866B0>
757.31
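A closer look shows that cells merged down a column still get different addresses when the table is read row by row. So we also need to find the repeated cells column by column, combine them with the ones found row by row, remove the duplicates, and only then convert the text.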
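To see why counting repeated objects gives the span sizes, here is a minimal sketch that uses stand-in objects instead of real python-docx cells. `FakeCell`, `grid`, and `spans` are hypothetical names. The sketch assumes one shared grid in which a merged cell is the same object in every slot it covers.

```python
class FakeCell:
    """Hypothetical stand-in for a table cell; compared by identity."""

    def __init__(self, text):
        self.text = text


point = FakeCell("Point No.")
image = FakeCell("Image coordinates")
ground = FakeCell("Ground coordinates")
x, y = FakeCell("x(mm)"), FakeCell("y(mm)")
X, Y, Z = FakeCell("X(mm)"), FakeCell("Y(mm)"), FakeCell("Z(mm)")

# grid[i][j] is the cell object that covers row i, column j
grid = [
    [point, image, image, ground, ground, ground],
    [point, x, y, X, Y, Z],
]


def spans(grid, i, j):
    """Return (colspan, rowspan) for the cell covering row i, column j."""
    cell = grid[i][j]
    colspan = sum(1 for other in grid[i] if other is cell)
    column = [row[j] for row in grid]
    rowspan = sum(1 for other in column if other is cell)
    return colspan, rowspan


print(spans(grid, 0, 0))  # (1, 2): "Point No." covers two rows
print(spans(grid, 0, 1))  # (2, 1): "Image coordinates" covers two columns
print(spans(grid, 0, 3))  # (3, 1): "Ground coordinates" covers three columns
```

In the real table, as the printout above shows, a row-by-row read does not share objects between rows. That is why the function that follows keeps one list built by row and a second list built by column.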
'''
Get_table_format(table): convert a table to HTML
input:  a table object
output: the table as an HTML-format string
'''
def Get_table_format(table):
    table_text = '<table>\n'
    # Store the cell objects in 2-D lists, once by row and once by column
    row_cells = [list(row.cells) for row in table.rows]
    col_cells = [list(col.cells) for col in table.columns]
    row_temp, col_temp = [], []  # cells already written, used to skip repeats
    for i in range(len(table.rows)):
        table_text = table_text + '<tr>\n'
        for j in range(len(table.columns)):
            col_counts = row_cells[i].count(row_cells[i][j])  # repeats within the row give the colspan
            row_counts = col_cells[j].count(col_cells[j][i])  # repeats within the column give the rowspan
            if row_cells[i][j] not in row_temp and col_cells[j][i] not in col_temp:  # skip cells already written
                cell_text = row_cells[i][j].text
                if col_counts == 1 and row_counts == 1:  # not merged
                    table_text = table_text + '<td>' + cell_text + '</td>'
                elif col_counts != 1 and row_counts == 1:  # merged horizontally
                    table_text = table_text + '<td colspan={}>'.format(col_counts) + cell_text + '</td>'
                elif col_counts == 1 and row_counts != 1:  # merged vertically
                    table_text = table_text + '<td rowspan={}>'.format(row_counts) + cell_text + '</td>'
                else:  # merged both horizontally and vertically
                    table_text = table_text + '<td colspan={0} rowspan={1}>'.format(col_counts, row_counts) + cell_text + '</td>'
                row_temp.append(row_cells[i][j])
                col_temp.append(col_cells[j][i])
        table_text = table_text + '\n</tr>\n'
    table_text = table_text + '</table>\n'
    return table_text
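One thing the conversion does not handle is cell text that itself contains HTML characters such as < or &, which would be read as markup. A possible refinement, sketched here with a hypothetical helper `td` that is not part of the original program, is to pass the text through the standard library's `html.escape` before wrapping it:

```python
import html


def td(text, colspan=1, rowspan=1):
    """Wrap cell text in a <td> tag, escaping HTML special characters."""
    attrs = ""
    if colspan > 1:
        attrs += ' colspan="{}"'.format(colspan)
    if rowspan > 1:
        attrs += ' rowspan="{}"'.format(rowspan)
    return "<td{}>{}</td>".format(attrs, html.escape(text))


print(td("a < b & c"))  # <td>a &lt; b &amp; c</td>
print(td("Image coordinates", colspan=2))
```

Whether escaping is wanted depends on the document: if the Word cells deliberately contain HTML snippets that should render, leave the text as it is.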
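3.2 Preserving the order of paragraphs and tables
Finally, we need to keep the text and the tables in their original order. python-docx does not seem to provide an official iter_block_items method yet, but a workable approach has been posted on GitHub. The source is as follows: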
from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)
This approach yields the document's blocks one at a time, in the order they appear in the .docx file. Each block is either a Paragraph or a Table, and print(block) shows something like the following:
<docx.text.paragraph.Paragraph object at 0x0000019543C53DF0>
<docx.table.Table object at 0x0000019543C53E80>
Based on this, and following Jeff Pan96's approach, we can call what we need from the main block:
if __name__ == '__main__':
    doc = docx.Document(r'表格篇.docx')  # open the .docx file
    md_text = ''  # Table has no text attribute, so each table is converted to text and collected in md_text
    is_code = False  # whether the current text is inside a code block
    for block in iter_block_items(doc):
        # block.style.name returns names such as 'Heading 1', 'Normal', 'Normal Table';
        # checking the type instead also catches tables that use another style
        if isinstance(block, Table):
            md_text = md_text + Get_table_format(block) + '\n'
        elif "```" in block.text:
            md_text = md_text + block.text + '\n'  # keep the fence line as-is
            is_code = not is_code  # a fence line starts or ends a code block
        else:
            if not is_code:
                for run in block.runs:  # each run in the paragraph
                    Get_run_format(run)  # apply the text formatting
                Get_paragraph_format(block)  # apply the paragraph formatting
            md_text = md_text + block.text + '\n'  # append the paragraph text
    print(md_text)
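The code-block handling in the main block is a small two-state toggle: every line containing a fence flips the state, and lines inside a code block skip the formatting step. Below is a minimal sketch of just that logic on plain strings. The `mark_code_lines` function is hypothetical, and the fence is built with `"`" * 3` only so that the example can sit inside a fenced block itself.

```python
FENCE = "`" * 3  # the three-backtick code fence marker


def mark_code_lines(lines):
    """Return (line, inside_code) pairs; fence lines themselves count as code."""
    inside = False
    result = []
    for line in lines:
        if FENCE in line:
            result.append((line, True))
            inside = not inside  # a fence line opens or closes a code block
        else:
            result.append((line, inside))
    return result


sample = ["Some text", FENCE + "python", "x = 1", FENCE, "More text"]
for line, inside_code in mark_code_lines(sample):
    print(inside_code, line)
```

Only the lines marked False would go through Get_run_format and Get_paragraph_format, so code is copied to the output without font tags.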
4 Complete code
import docx  # python-docx
from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


# Append </font> only once: adding several breaks the TOC display, e.g. 1</font>.1 </font>Title</font>2</font>
def exist_font(run):
    if '</font>' not in run.text:
        run.text = run.text + '</font>'
'''
Get_run_format(run): rewrite a run's text with its formatting as HTML tags
input: a run object
'''
def Get_run_format(run):
    # TODO: add any other run-level format conversions you need
    # Bold (font.bold is True, False, or None when inherited)
    if run.font.bold:
        run.text = '<b>' + run.text + '</b>'
    # Italic
    if run.font.italic:
        run.text = '<i>' + run.text + '</i>'
    # Underline
    if run.font.underline:
        run.text = '<u>' + run.text + '</u>'
    # Strikethrough
    if run.font.strike:
        run.text = '<s>' + run.text + '</s>'
    # Text color
    if run.font.color.rgb is not None:
        exist_font(run)
        run.text = '<font color=#{}>'.format(run.font.color.rgb) + run.text
    # Highlight
    if run.font.highlight_color is not None:
        # I tried <mark style="background:red">text entered here</mark>, which
        # https://blog.csdn.net/ningmengshuxiawo/article/details/109112540 says can change the highlight to red,
        # but the background color did not change when I tried it, so the default yellow highlight is used
        run.text = '<mark>' + run.text + '</mark>'
    # Font face; when none is set, text containing Chinese defaults to 楷体
    if run.font.name is None:
        # Chinese characters fall in the Unicode range \u4e00-\u9fff
        Is_Chi = any('\u4e00' <= ch <= '\u9fff' for ch in run.text)
        exist_font(run)
        if Is_Chi:
            run.text = '<font face="楷体">' + run.text
        else:  # digits and English text use Times New Roman
            run.text = '<font face="Times New Roman">' + run.text
    else:
        exist_font(run)
        run.text = '<font face="{}">'.format(run.font.name) + run.text
    # Font size, default 3 (sizes are stored in EMU; 12700 EMU = 1 pt)
    if run.font.size is None or run.font.size == 152400:
        font_size = 3  # default, or 12 pt (小四)
    elif run.font.size < 95250:
        font_size = 1  # smaller than 7.5 pt (六号)
    elif run.font.size < 152400:
        font_size = 2  # from 7.5 pt up to 12 pt
    elif run.font.size <= 177800:
        font_size = 4  # above 12 pt, up to 14 pt (四号)
    elif run.font.size < 203200:
        font_size = 5  # between 14 pt and 16 pt (三号)
    elif run.font.size < 279400:
        font_size = 6  # from 16 pt up to 22 pt (二号)
    else:
        font_size = 7  # 22 pt and larger
    exist_font(run)
    run.text = '<font size={}>'.format(font_size) + run.text
'''
Get_paragraph_format(paragraph): rewrite a paragraph's text with its formatting
input: a paragraph object
'''
def Get_paragraph_format(paragraph):
    # TODO: add any other paragraph-level format conversions you need
    # # Line spacing = line height - font size
    # if paragraph.paragraph_format.line_spacing is not None:
    #     for run in paragraph.runs:
    #         exist_font(run)
    #         run.text = '<font style="line-height:{};">'.format(paragraph.paragraph_format.line_spacing) + run.text
    # else:
    #     for run in paragraph.runs:
    #         exist_font(run)
    #         run.text = '<font style="line-height:1.0;">' + run.text
    # # Space before the paragraph
    # if paragraph.paragraph_format.space_before is not None:
    #     pass
    # else:
    #     pass
    # Headings: docx supports up to 9 heading levels, but Markdown supports only 6
    # paragraph.style.name returns 'Heading <level number>'
    if 'Heading' in paragraph.style.name:  # is the paragraph a heading?
        level = min(int(paragraph.style.name[-1]), 6)  # clamp to Markdown's 6 levels
        for run in paragraph.runs:
            i = run.text.index('size=') + 5  # position of the font-size digit written by Get_run_format
            run.text = run.text[:i] + '{}'.format(7 - level) + run.text[i + 1:]  # level-1 headings get font size 6
        paragraph.text = '#' * level + ' ' + paragraph.text
    # First-line indent
    # The indent can be given in Pt, Cm, Mm, Inches, etc. To indent by a number of characters you have to convert it yourself,
    # because the width of a character depends on the font size (五号 = 10.5pt = 3.70mm = 14px = 0.146inch)
    if paragraph.paragraph_format.first_line_indent is not None:  # does the paragraph use a first-line indent?
        paragraph.text = ' ' * 2 + paragraph.text
'''
Get_table_format(table): convert a table to HTML
input:  a table object
output: the table as an HTML-format string
'''
def Get_table_format(table):
    table_text = '<table>\n'
    # Store the cell objects in 2-D lists, once by row and once by column
    row_cells = [list(row.cells) for row in table.rows]
    col_cells = [list(col.cells) for col in table.columns]
    row_temp, col_temp = [], []  # cells already written, used to skip repeats
    for i in range(len(table.rows)):
        table_text = table_text + '<tr>\n'
        for j in range(len(table.columns)):
            col_counts = row_cells[i].count(row_cells[i][j])  # repeats within the row give the colspan
            row_counts = col_cells[j].count(col_cells[j][i])  # repeats within the column give the rowspan
            if row_cells[i][j] not in row_temp and col_cells[j][i] not in col_temp:  # skip cells already written
                cell_text = row_cells[i][j].text
                if col_counts == 1 and row_counts == 1:  # not merged
                    table_text = table_text + '<td>' + cell_text + '</td>'
                elif col_counts != 1 and row_counts == 1:  # merged horizontally
                    table_text = table_text + '<td colspan={}>'.format(col_counts) + cell_text + '</td>'
                elif col_counts == 1 and row_counts != 1:  # merged vertically
                    table_text = table_text + '<td rowspan={}>'.format(row_counts) + cell_text + '</td>'
                else:  # merged both horizontally and vertically
                    table_text = table_text + '<td colspan={0} rowspan={1}>'.format(col_counts, row_counts) + cell_text + '</td>'
                row_temp.append(row_cells[i][j])
                col_temp.append(col_cells[j][i])
        table_text = table_text + '\n</tr>\n'
    table_text = table_text + '</table>\n'
    return table_text


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


if __name__ == '__main__':
    doc = docx.Document(r'表格篇.docx')  # open the .docx file
    md_text = ''  # Table has no text attribute, so each table is converted to text and collected in md_text
    is_code = False  # whether the current text is inside a code block
    for block in iter_block_items(doc):
        # block.style.name returns names such as 'Heading 1', 'Normal', 'Normal Table';
        # checking the type instead also catches tables that use another style
        if isinstance(block, Table):
            md_text = md_text + Get_table_format(block) + '\n'
        elif "```" in block.text:
            md_text = md_text + block.text + '\n'  # keep the fence line as-is
            is_code = not is_code  # a fence line starts or ends a code block
        else:
            if not is_code:
                for run in block.runs:  # each run in the paragraph
                    Get_run_format(run)  # apply the text formatting
                Get_paragraph_format(block)  # apply the paragraph formatting
            md_text = md_text + block.text + '\n'  # append the paragraph text
    print(md_text)
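One piece of the listing that is easy to check in isolation is the mapping from a Word font size to an HTML font size of 1 to 7. Below is a minimal standalone sketch; the `html_font_size` function is hypothetical. It uses the same EMU thresholds as the listing (1 pt = 12700 EMU) and assumes that any size above 12 pt and up to 14 pt maps to 4.

```python
def html_font_size(size_emu):
    """Map a font size in EMU (or None when unset) to an HTML font size 1-7."""
    if size_emu is None or size_emu == 152400:
        return 3  # default, or exactly 12 pt
    if size_emu < 95250:
        return 1  # smaller than 7.5 pt
    if size_emu < 152400:
        return 2  # from 7.5 pt up to 12 pt
    if size_emu <= 177800:
        return 4  # above 12 pt, up to 14 pt (assumed boundary)
    if size_emu < 203200:
        return 5  # between 14 pt and 16 pt
    if size_emu < 279400:
        return 6  # from 16 pt up to 22 pt
    return 7  # 22 pt and larger


for points in (6, 10.5, 12, 14, 15, 18, 26):
    print(points, "pt ->", html_font_size(int(points * 12700)))
```

Keeping the mapping in a pure function like this makes the thresholds easy to adjust and test without opening a Word document.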
Don't forget to leave a like 💕