Using AI to denoise scanned newspapers

Latest recommended article published on 2024-06-11 10:57:20

勃斯李

Original link: https://medium.com/scmp-inside-the-wonton/using-ai-to-denoise-scanned-newspaper-88f4e34f59c0

TL;DR

We are improving the OCR text of our historical digital news archive. The archive comprises PDFs produced from microfilms and original newsprint. The PDF quality varies, which leads to poor OCR results. Our hypothesis is that improving the image quality will in turn improve OCR accuracy.

We came across a machine learning technique called the Denoising Autoencoder (DAE), which encodes and decodes an image by running it through a convolutional neural network (CNN). Because the process is lossy, the model can be trained to discard the unwanted signal: “the noise”. The end result is a clearer PDF image. Although more effort is needed to refine the model, the results so far are promising.

Background

We have over 100 years of news archives in the form of microfilms and newsprint. A vendor scanned them and converted them to PDF and TIFF formats. The files were then passed to another vendor to extract the text (OCR).

The OCR results were extremely poor because the microfilms had suffered wear and tear through usage. Although not visible to the human eye, vertical lines appear across the text in the scans. Ink patches and dense background noise caused by paper quality and the printing process, as well as misaligned and skewed text, added to the complexity.

The OCR vendor we engaged made our re-OCR job easier. Articles are clipped into individual PDFs, each accompanied by an XML file containing the article title, subtitle, byline and category. The XML file also records the coordinates of the article in the original paper, in case we have to go back to the original for reference.

Therefore, all we have to do is find the best way to denoise the PDF images and then OCR them again with Tesseract.

The traditional approaches

At the very beginning, we looked for simple image processing techniques to remove the noise: median blurring, dilation and erosion, and adaptive thresholding.

Although these experiments removed most of the noise, some character details were also eroded.

Median blurring is very effective against the salt-and-pepper noise we are seeking to remove, but the lines remained. Dilation and erosion gave a better result than median blurring, but traces of noise remained across the whole paragraph and hurt OCR accuracy. Most of our scanned newspapers are already binarized, so adaptive thresholding cannot help in this case. Adaptive thresholding works best when the background intensity varies considerably and the text is darker than the background noise.
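The behaviour of median blurring described above can be reproduced on a toy example. The sketch below is not the code used in our experiments: it is a plain-Python 3x3 median filter applied to a hypothetical 7x7 binarized patch (1 = ink, 0 = paper), and the two-pixel scratch width is an assumption chosen for illustration. It shows an isolated speck disappearing while the wider vertical line survives. In practice a library routine such as OpenCV's cv2.medianBlur would do the filtering.

```python
from statistics import median


def median_filter(img, size=3):
    """Apply a size x size median filter to a 2D list of 0/1 pixels.

    Border pixels are handled by clamping coordinates to the image edge
    (edge replication).
    """
    h, w = len(img), len(img[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = int(median(window))
    return out


# Hypothetical 7x7 binarized patch: 1 = ink, 0 = paper.
# Columns 4 and 5 hold a two-pixel-wide vertical scratch (assumed width);
# the pixel at row 1, column 1 is an isolated speck.
noisy = [[0] * 7 for _ in range(7)]
for row in noisy:
    row[4] = row[5] = 1
noisy[1][1] = 1

filtered = median_filter(noisy)

for row in filtered:
    print("".join("#" if p else "." for p in row))
```

On a binary image, a 3x3 median keeps a pixel as ink only when at least 5 of the 9 pixels in its window are ink. A lone speck fails that test, but a long line two or more pixels wide fills 6 of the 9 and passes through untouched.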

The AI approach

After several trials with the above techniques, we started to look for a more “intelligent” approach. We came across the term “autoencoder”.

Denoising Autoencoder

A Denoising Autoencoder (DAE) is a type of artificial neural network composed of an encoder and a decoder. The encoder converts the input into a compressed representation, while the decoder reconstructs from that compressed form an output as close as possible to the original input. A Convolutional Neural Network (CNN) handles feature extraction, learning which features are useful to keep during the process.

Our goal is to train a neural model to output a cleaned image from the original noisy one. Under supervised training, we feed the model the noisy image, and the encoder learns which features to retain and which “noise” to ignore. The decoder learns to generate the output, which is compared with our manually cleaned samples. We repeat the training process until validation accuracy stops improving, stopping before overfitting sets in.
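The stopping rule in the last sentence can be sketched in a few lines. This is an illustration under stated assumptions, not our training code: it monitors validation loss rather than accuracy, the function name early_stopping_epoch and the patience and min_delta parameters are hypothetical, and the loss values are made up. Keras offers comparable behaviour through its EarlyStopping callback.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has failed
    to improve by more than min_delta for `patience` consecutive epochs.
    If that never happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: they improve until epoch 3, then rise again,
# which is the usual sign that the model has started to overfit.
history = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.24, 0.28]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(f"stop at epoch {stop_epoch}, keep weights from epoch {best_epoch}")
```

The weights worth keeping are those from the best epoch, not the epoch at which training stopped, so a real training loop would also checkpoint the model whenever the validation loss improves.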

The encoding process is lossy, so the encoder could also be used to compress images, much as JPEG does. In general, though, JPEG compresses better, because the encoder works well only on the type of data it was trained on.

DAE example with Keras

from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from tensorflow.keras.models import Model

# Encoder: two convolution + max-pooling stages (28x28 -> 14x14 -> 7x7)
input_img = Input(shape=(28, 28, 1))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# Decoder: two convolution + upsampling stages (7x7 -> 14x14 -> 28x28), then a 1-filter output layer
x = Conv2D(32, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

Code sample courtesy of Francois Chollet

In the example above, the input image is 28 by 28 pixels with 1 channel (grayscale). The encoder processes the input with 2 convolution layers. Each has 32 filters with a 3x3 kernel and produces 32 feature maps. After each convolution, the feature maps are downsized (subsampled) by max pooling with a 2x2 window, which halves their width and height.

Although some information is lost, this helps speed up training and reduces system resource usage. After the 2 convolutions, the decoder reconstructs the output from the compressed representation. Another 2 convolution layers are used here, each followed this time by upsampling, which restores the output size to match the input size. In the last convolution layer, notice that the number of filters is set to 1, because the output must have 1 channel to match the input.

If you run the above Python code together with autoencoder.summary(), you will get the following model details. The output shape is the same for layers input_1 and conv2d_4.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 28, 28, 1)]       0
_________________________________________________________________
conv2d (Conv2D)              (None, 28, 28, 32)        320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 32)        9248
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 32)          0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 7, 7, 32)          9248
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 14, 14, 32)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 14, 14, 32)        9248
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 28, 28, 32)        0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 28, 28, 1)         289
=================================================================
Total params: 28,353
Trainable params: 28,353
Non-trainable params: 0
_________________________________________________________________
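The shapes and parameter counts in this summary can be checked by hand. Each Conv2D layer has kernel height x kernel width x input channels x filters weights, plus one bias per filter. The sketch below is a plain-Python calculation rather than anything Keras runs. It assumes 3x3 kernels, padding='same' and 2x2 pooling and upsampling as in the example, and the helper names (conv_params, pool_same, upsample) are made up for illustration.

```python
import math


def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a Conv2D layer."""
    kh, kw = kernel
    return kh * kw * in_channels * filters + filters


def pool_same(size, pool=2):
    """Spatial size after max pooling with padding='same' and stride == pool."""
    return math.ceil(size / pool)


def upsample(size, factor=2):
    """Spatial size after UpSampling2D with the given factor."""
    return size * factor


# (layer name, kind, filters) in the order used by the Keras example
layers = [
    ("conv2d", "conv", 32),
    ("max_pooling2d", "pool", None),
    ("conv2d_1", "conv", 32),
    ("max_pooling2d_1", "pool", None),
    ("conv2d_2", "conv", 32),
    ("up_sampling2d", "up", None),
    ("conv2d_3", "conv", 32),
    ("up_sampling2d_1", "up", None),
    ("conv2d_4", "conv", 1),
]

size, channels = 28, 1  # input_1: 28 x 28 pixels, 1 channel
report = []
for name, kind, filters in layers:
    params = 0
    if kind == "conv":
        # padding='same' keeps the spatial size; only the channel count changes
        params = conv_params((3, 3), channels, filters)
        channels = filters
    elif kind == "pool":
        size = pool_same(size)
    else:
        size = upsample(size)
    report.append((name, (size, size, channels), params))

total = sum(params for _, _, params in report)
for name, shape, params in report:
    print(f"{name:16s} {shape} {params}")
print("Total params:", total)
```

For instance, conv2d sees 1 input channel, so it has 3 x 3 x 1 x 32 + 32 = 320 parameters, while conv2d_1 sees 32 channels and has 3 x 3 x 32 x 32 + 32 = 9,248.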

We are still looking for a good model that can handle our use case, but choosing a DAE is definitely the right approach to solving our problem “intelligently”.