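I am using pytesseract, Pillow, and cv2 to run OCR on an image and extract its text. Because my input is a scanned PDF document, I first convert it to an image (JPEG) and then try to extract the text. I am about halfway there. The input is a table, and its header row is not recognized because the header has a black background. I also tried getStructuringElement, but could not find a way to make it work. Here is what I have done so far:
import cv2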
import os
import numpy as np
import pytesseract
#import pillow
#Since scanned PDF can't be handled by pdf2image, convert the scanned PDF into a JPEG format using the below code-
filename = path
from pdf2image import convert_from_path
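pages = convert_from_path(filename, 500)  # render each page at 500 DPI
for page in pages:
    # note: every page is written to the same destination, so later pages overwrite earlier ones
    page.save("dest", 'JPEG')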
imgname = "path"
oriimg = cv2.imread(imgname,cv2.IMREAD_COLOR)
cv2.imshow("original image", oriimg)
cv2.waitKey(0)
#img = cv2.resize(oriimg,None,fx=0.5,fy=0.5,interpolation=cv2.INTER_CUBIC)
img = cv2.resize(oriimg,(