pdf复制出乱码解决方案

qq_35742456

已于 2022-10-15 22:31:41 修改

阅读量3k

点赞数 3

文章标签： python

于 2022-10-06 23:47:53 首次发布

本文链接：https://blog.csdn.net/qq_35742456/article/details/127189081

版权

pdf复制出乱码解决方案

偶然发现有一类pdf文档中的文字虽然能被选中，在拷贝时却出现了乱码。遂拿来研究一下

一、pdf格式基础

网上的pdf格式介绍文章包括adobe官方文档已经介绍的相当全面了，这里不加赘述
pdf格式介绍文章
 一个极小的pdf文件demo
pdf1.7官方文档
总结下来本问题需要了解一下几点（最近两天粗略看的，不知道理解的对不对）：

pdf中的内容是以对象构成，类似于以下的方式进行组织

class Object {
	int id;
	int ref;
	virtual std::string name()=0;
}
class Page : public Object{
	virtual std::string name() {
		return "Page";
	}
	Object* content;
	Object* image;
	Object* font;
}
class Content: public Object{
	virtual std::string name() {
		return "Content";
	}
	Object* contentstream;
	....
}
class Image: public Object{
	virtual std::string name() {
		return "Image";
	}
	Object *imagestream;
	...
}
class Font: public Object{
	virtual std::string name() {
		return "Font";
	}
	Object *fontstream;
	...
}
class Stream:....

pdf中的文字对象大多是流的形式构成，流中的内容是由某种压缩算法指定，需要解压后才能看到原始的值，流中的文本对象以BT开始，以ET结束，其中内容有两种表示方式，一种是十六进制的方式，用尖括号表示，例如<31323334>，表示文本0x31,0x32,0x33,0x34，即1234，也可以直接使用明文表示，比如(1234)，也表示文本1234.
文本中使用的字体在BT和ET之间，末尾以Tf结束，比如/G1 1 Tf，表示使用G1字体，大小为1，G1字体到底是什么参见下面的5
ToUnicode，自行阅读pdf1.7文档5.9.2 ToUnicode CMaps
pdf中某一页的字体有一个特殊的映射表，一个可能的Page对象如下所示

94 0 obj
<< /Type /Page /Parent 3 0 R /Resources 97 0 R /Contents 95 0 R /MediaBox
[0 0 595.2756 841.8898] >>
endobj
97 0 obj
<< /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /ColorSpace << /Cs1 7 0 R
>> /ExtGState << /Gs4 59 0 R /Gs3 60 0 R /Gs1 26 0 R >> /Font << /TT4 20 0 R
/G1 8 0 R /TT1 11 0 R >> /XObject << /Im14 98 0 R >> >>

94号对象是Page，可以看出，他的Resouce是97号对象，其中字体存在如下映射

/Font << /TT4 20 0 R /G1 8 0 R /TT1 11 0 R >>

TT4表示第20个字体对象，G1表示第8个字体对象，TT1表示第11个字体对象，所以上述的G1字体应该存在obj 8里

8 0 obj
<< /Type /Font /Subtype /Type0 /Encoding /Identity-H /DescendantFonts [260 0 R]
/BaseFont /QWWOYD+STSongti-SC-Bold /ToUnicode 261 0 R >>
endobj

该字体的名字是QWWOYD+STSongti-SC-Bold，存在第260个对象中，此时可以根据对象的引用继续查找，找到该字体对应的二进制编码，解压后就是相应的ttf/ocf/cff字体文件
也可以使用pdffont工具直接提取pdf中的字体文件（链接）

二、ttf格式基础

ttf是一种字体文件，存储着某个unicode/ascii编码到对应字体的映射，可以利用fontcreator等软件读取整个文件，值得注意的是，这个映射不一定包含了所有的字体，可以包含某几个特定的字体（例如在pdf中提取的ttf文件可能只有几KB，显然没有包含所有字体），而且其中的unicode等编码的映射关系也是自定义的。也就是说，在生成pdf时，假如使用可嵌入字体，可能unicode到对应字体之间的关系是错的。本来“的”这个字的unicode编码是U+7684，但是可能在ttf文件中被定义成了U+0001

三、解决乱码

1.提取文字，识别文字

由于ttf中unicode编码的映射顺序是乱的，无法直接根据序号做一一映射，有两种方法进行提取文字识别：

a.和原文件比对

由于字体文件中有字体名，可以从网上找一份完整的ttf文件，和该文件进行比对，找出每个文字在原ttf文件上的位置
可以通过fontTools库中TTFont.saveXML方法把ttf文件中的关键信息提取成xml文件，发现每个字符都在ttf文件上维护了一个如下的点集，类似于svg文件的格式，这个点集中的pt标签猜测代表了绘制文字时需要走的“路径”。

<TTGlyph name="glyph1" xMin="110" yMin="-165" xMax="937" yMax="783">
      <contour>
        <pt x="641" y="529" on="1"/>
        <pt x="596" y="438" on="0"/>
        <pt x="489" y="298" on="1"/>
        ...
      </contour>\
    <instructions/>
  </TTGlyph>

对比了一下完整的ttf文件和从pdf中dump出的ttf文件，发现不是很一致，不过大部分的x，y坐标时相同的。有可能在生成pdf时，对字模进行了重编码的优化？不过可以通过启发式的算法（更为通俗的叫法是——乱搞）来生成pdf中字体自定义的unicode编码到真实unicode编码之间的映射。
记pdf中字体某个unicode编码i对应的点集为 $TARGET_i=\{(x_1,y_1),(x_2,y_2),...,(x_n,y_n)\}$ ,真实的unicode编码j对应的点集为 $BASE_j=\{(x_1,y_1),(x_2,y_2),...,(x_m,y_m)\}$ ,，定义 $score_2)_{ij}$ 为二者元素个数的差 $card(TARGET_i)-card(BASE_j)|$ ，定义 $score_1)_{ij}$ 为二者的交集元素的个数 $card(TARGET_i \cap BASE_j)$ ，定义两个字模之间的整体差异 $score_{ij} = -(score_1)_{ij}+(score_2)_{ij}$ ，我们需要做的就是对一个pdf中的字模i，找到真实编码中的字模j，使得 $score_{ij}$ 最大，从而字模i真实的unicode编码可以认为就是j。
经过测试，上述的算法可以识别测试用的pdf中所有的文字（340个）对应的unicode编码。出于时间问题，只写了暴力手段进行字模搜索的代码

from fontTools.ttLib import TTFont
from fontTools.pens.basePen import BasePen
from reportlab.graphics.shapes import Path
from reportlab.lib import colors  
import json

class ReportLabPen(BasePen):

    """A pen for drawing onto a reportlab.graphics.shapes.Path object."""

    def __init__(self, glyphSet, path=None):
        BasePen.__init__(self, glyphSet)
        if path is None:
            path = Path()
        self.path = path

    def _moveTo(self, p):
        (x,y) = p
        self.path.moveTo(x,y)

    def _lineTo(self, p):
        (x,y) = p
        self.path.lineTo(x,y)

    def _curveToOne(self, p1, p2, p3):
        (x1,y1) = p1
        (x2,y2) = p2
        (x3,y3) = p3
        self.path.curveTo(x1, y1, x2, y2, x3, y3)

    def _closePath(self):
        self.path.closePath()

baseFontName = "STSongti-SC-Bold"
fontName = "QWWOYD+STSongti-SC-Bold"

def getTTFMap(font_):
    fontName="%s.ttf"%font_
    font = TTFont(fontName)
    gs = font.getGlyphSet()
    ascii_map={}
    reverse_map=font.getReverseGlyphMap()
    for key in reverse_map:
        ascii_map[reverse_map[key]]=key
    updated_map={}
    for table in font["cmap"].tables:
        if len(table.cmap)<len(ascii_map):
            updated_map=table.cmap    
    for key in updated_map:
        ascii_map[key]=updated_map[key]
    ttfinfo={}
    for index in ascii_map:
        i=ascii_map[index]
        if i[0] == '.':#跳过'.notdef', '.null'
            continue
        g = gs[i]
        pen = ReportLabPen(gs, Path(fillColor=colors.black, strokeWidth=1))
        g.draw(pen)
        ttfinfo[index]=pen.path.points
        continue
    return ttfinfo

baseTTFInfo=getTTFMap(baseFontName)
TTFInfo=getTTFMap(fontName)
ttfMap={}
def getMap(lst):
    mp={}
    for item in lst:
        if item in mp.keys():
            mp[item]+=1
        else:
            mp[item]=1 
    return mp
for key in TTFInfo:
    maxn=-1e9
    dst=-1
    for each in baseTTFInfo:
        target = TTFInfo[key]
        base = baseTTFInfo[each]
        
        score1=-abs(len(target)-len(base))
        targetMp=getMap(target)
        baseMp=getMap(base)
        score2=0
        for pos in targetMp:
            if pos in baseMp:
                score2+=min(targetMp[pos],baseMp[pos])
        score=score1+score2

        if (score>=maxn):
            maxn=score
            dst=each
    ttfMap[key]=chr(dst)
    print("key %d is %s"%(key,chr(dst)))
        
lst=json.dumps(ttfMap)
open("%s.json"%fontName,"w").write(lst)

b.OCR+手工辅助识别

刚开始，笔者尚未从网上找到STSongti-SC-Bold这个字体对应的ttf文件，对于这种情况，只能退而求其次，用网上现有的模型跑神经网络或者用封装好的库进行OCR，其中识别出现问题的保存在文件中手工辅助识别，python代码如下：

from fontTools.ttLib import TTFont
from fontTools.pens.basePen import BasePen
from reportlab.graphics.shapes import Path
from reportlab.lib import colors
from reportlab.graphics import renderPM
from reportlab.graphics.shapes import Group, Drawing    
import json

class ReportLabPen(BasePen):

   """A pen for drawing onto a reportlab.graphics.shapes.Path object."""

   def __init__(self, glyphSet, path=None):
       BasePen.__init__(self, glyphSet)
       if path is None:
           path = Path()
       self.path = path

   def _moveTo(self, p):
       (x,y) = p
       self.path.moveTo(x,y)

   def _lineTo(self, p):
       (x,y) = p
       self.path.lineTo(x,y)

   def _curveToOne(self, p1, p2, p3):
       (x1,y1) = p1
       (x2,y2) = p2
       (x3,y3) = p3
       self.path.curveTo(x1, y1, x2, y2, x3, y3)

   def _closePath(self):
       self.path.closePath()

import easyocr
fontName = "QWWOYD+STSongti-SC-Bold"


available_chars={}
for i in range(32,128):
   available_chars[chr(i)]=i
available_chars["space"]=ord(" ")

def ttfToImage(font_,imagePath,fmt="png"):
   fontName="%s.ttf"%font_
   font = TTFont(fontName)
   gs = font.getGlyphSet()
   ascii_map={}
   reverse_map=font.getReverseGlyphMap()
   for key in reverse_map:
       ascii_map[reverse_map[key]]=key
   updated_map={}
   for table in font["cmap"].tables:
       if len(table.cmap)<len(ascii_map):
           updated_map=table.cmap    
   for key in updated_map:
       ascii_map[key]=updated_map[key]
   reader = easyocr.Reader(['ch_sim','en']) # need to run only once to load model into memory
   ttfinfo={}
   for index in ascii_map:
       i=ascii_map[index]
       if i[0] == '.':#跳过'.notdef', '.null'
           continue
       print((index,i))
       g = gs[i]
       pen = ReportLabPen(gs, Path(fillColor=colors.black, strokeWidth=1))
       g.draw(pen)
       w, h = g.width, g.width
       g = Group(pen.path)
       g.translate(0, 200)
       
       d = Drawing(w/4, h/4)
       d.add(g)
       d.scale(0.1,0.1)
       imageFile = "tmp.png"
       renderPM.drawToFile(d, imageFile, fmt)
       
       result = reader.readtext(imageFile,detail=0)
       if (len(result)==1 and len(result[0])==1):
           ttfinfo[index]=result[0][0]
       else:
           ttfinfo[index]=None
           print("%s recognize failed"%i)
           renderPM.drawToFile(d, "images/%s.png"%i, fmt)
   lst=json.dumps(ttfinfo)
   open("%s.json"%font_,"w").write(lst)
   return ttfinfo

ttfinfo=ttfToImage(font_=fontName,imagePath="images")

for key in ttfinfo.keys():
   if (ttfinfo[key]==None):
       s=input("charater in index %s:"%key)
       print("charater in index %s is %s"%(key,json.dumps(s)))
       ttfinfo[key]=s

lst=json.dumps(ttfinfo)
open("%s.json"%fontName,"w").write(lst)

上述两种方式的代码最终都输出一个json文件，key是ttf文件中自定义的unicode编码，value是真实的字体所对应的unicode编码

2.修改pdf中的文字，使其可复制

a.选择pdf开发库

知乎上有一篇文章(链接)总结了一下操作pdf常用的python库，总结就是从文章中盗的该图片：
在这里插入图片描述
看了一下pymupdf和borb的repo，感觉borb的文档比较齐全，选择直接用borb进行开发。
但是borb的文档中貌似没有给直接从流层面修改文字的api，参考了一下源码的写法，只需要新建一个stream替换原有的即可。（新建stream的demo见源码中为pdf添加JavaScript代码片段链接）

b.修改pdf

一个简单的思路是，直接用真实的编码替换原pdf中的字符编码，但是会出现一个问题是虽然可以复制了，但是输出的pdf文档会是乱码。
修复方案

用pdf编辑软件替换pdf中的字体，只要替换成非pdf中的内嵌字体，按理说就可以正常显示
直接修改pdf文件，向其中硬编码进对应字体的完整ttf文件

方案一读者可以自行实现，下面主要介绍方案二的实现
使用常见字体（本文中以楷体为例）替换pdf中的内嵌字体

from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
commom_unicode_font=TrueTypeFont.true_type_font_from_file(Path("C:/windows/fonts/simkai.ttf"))
for page_number in range(pages):
    page = pdf.get_page(page_number)
    font_resource = page["Resources"]["Font"]
    for font_name in font_resource:
        if (full_name == replace_font_name):
          font_resource[font_name]=commom_unicode_font
          full_name=font_resource[font_name]["BaseFont"]

然后利用把pdf中的unicode编码改成真实的unicode编码

for mpos in range(rec,pos,4):
    index = int(''.join(list_text[mpos:mpos+4]),16)
    newcode = ord(ttfinfo[str(index)])
    code = "%04x"%newcode
    list_text[mpos],list_text[mpos+1],list_text[mpos+2],list_text[mpos+3]=code[0],code[1],code[2],code[3]

此时转换的pdf文件依然是乱码，原因是系统中的楷体对应的ttf文件也维护了一套自己的ToUnicode表，需要将真实的unicode再转换成simkai.ttf中自定义的unicode编码：

for mpos in range(rec,pos,4):
    index = int(''.join(list_text[mpos:mpos+4]),16)
    newcode = commom_unicode_font.unicode_to_character_identifier((ttfinfo[str(index)]))
    code = "%04x"%newcode
    list_text[mpos],list_text[mpos+1],list_text[mpos+2],list_text[mpos+3]=code[0],code[1],code[2],code[3]

c.完整代码

from borb.pdf import PDF
from pathlib import Path
from borb.io.read.types import Name
from borb.io.read.types import Decimal as bDecimal
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont

import zlib
import json

in_file_handle=open("1.pdf", "rb")
pdf = PDF.loads(in_file_handle)
pages = int(pdf.get_xmp_document_info().get_number_of_pages())

replace_font_name="QWWOYD+STSongti-SC-Bold"

ttfinfo=json.loads(open("%s.json"%replace_font_name,"r").read())

commom_unicode_font=TrueTypeFont.true_type_font_from_file(Path("C:/windows/fonts/simkai.ttf"))
for page_number in range(pages):
   print("converting page %d"%page_number)
   
   page = pdf.get_page(page_number)
   
   stream = page["Contents"]
   font_resource = page["Resources"]["Font"]
   font_map={}
   for font_name in font_resource:
       full_name=font_resource[font_name]["BaseFont"]
       if (full_name == replace_font_name):
         font_resource[font_name]=commom_unicode_font
         full_name=font_resource[font_name]["BaseFont"]          
       font_map[str(font_name)]=str(full_name)
   text=stream["DecodedBytes"].decode("latin1")
   list_text=list(text)
   
   pos=text.find("Tf <")
   while (pos!=-1):
       fontpos=pos-2
       while (text[fontpos]!=' ' and text[fontpos]!='\n' and text[fontpos]!='\r'):
           fontpos-=1
       font_front_pos=fontpos
       while (text[font_front_pos]!='/'):
           font_front_pos-=1
       text_font=text[font_front_pos+1:fontpos]
       text_font=font_map[text_font]
       
       pos+=4
       # if (text[pos-1]=='('):
       #     rec = pos
       #     pos=text.find(")",pos)
       #     print(text[rec-15:pos+5])
       #     pos=text.find("Tf ",pos)
       #     continue
       rec=pos
       pos=text.find(">",pos)
       if (pos == -1):
           break
       for mpos in range(rec,pos,4):
           index = int(''.join(list_text[mpos:mpos+4]),16)
           newcode = commom_unicode_font.unicode_to_character_identifier((ttfinfo[str(index)]))
           code = "%04x"%newcode
           list_text[mpos],list_text[mpos+1],list_text[mpos+2],list_text[mpos+3]=code[0],code[1],code[2],code[3]
       pos=text.find("Tf <",pos)
   text_new=''.join(list_text)

   
   stream[Name("DecodedBytes")] = bytes(text_new, "latin1")
   stream[Name("Bytes")] = zlib.compress(stream[Name("DecodedBytes")], 9)
   stream[Name("Length")] = bDecimal(len(stream[Name("Bytes")]))
   page[Name("Contents")]=stream
   
#store the PDF
with open("output.pdf", "wb") as pdf_file_handle:
   PDF.dumps(pdf_file_handle, pdf)