Receipt Parsing with OCR and Large Language Models (LLMs) | With Code

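In this tutorial, I will show how to use OCR to capture data from a receipt, and then use a large language model (LLM) to extract the relevant details, such as the total amount, the receipt's date and time, and other related information.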

To retrieve the information from the receipt, I will use Azure OpenAI.

Building the OCR Output Data

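Let's start by installing docTR and the required libraries on your computer. I will not go through the installation process in detail; if you are interested, detailed instructions are available in the Git repository: https://github.com/mindee/doctr#installation

Let's test that the installation succeeded without errors by running the following code on the provided receipt image, which is in JPEG format.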

[Image: the sample receipt]

import os
import json


# Let's pick the desired backend
# os.environ['USE_TF'] = '1'
os.environ['USE_TORCH'] = '1'


import matplotlib.pyplot as plt


from doctr.io import DocumentFile
from doctr.models import ocr_predictor


# Read the file
doc = DocumentFile.from_images("receipt.jpg")
print(f"Number of pages: {len(doc)}")

If there are no errors, you will get the following output:

Number of pages: 1

Next, let's instantiate a pretrained model.

# Instantiate a pretrained model
predictor = ocr_predictor(pretrained=True)

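Run the predictor and export the result in a JSON-like format. Note that export() returns a Python dictionary, not a JSON string.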

result = predictor(doc)


# JSON export
json_export = result.export()
print(json_export)

You will get the following output:

{'pages': [{'page_idx': 0, 'dimensions': (600, 600), 'orientation': {'value': None, 'confidence': None}, 'language': {'value': None, 'confidence': None}, 'blocks': [{'geometry': ((0.2734375, 0.0), (0.6875, 0.1162109375)), 'lines': [{'geometry': ((0.33984375, 0.0), (0.6171875, 0.0234375)), 'words': [{'value': '#01-901', 'confidence': 0.9932250380516052, 'geometry': ((0.33984375, 0.001953125), (0.416015625, 0.0234375))}, {'value': 'SINGAPORE', 'confidence': 0.9812156558036804, 'geometry': ((0.4208984375, 0.001953125), (0.54296875, 0.01953125))}, {'value': '380011', 'confidence': 0.562835156917572, 'geometry': ((0.5458984375, 0.0), (0.6171875, 0.017578125))}]}, {'geometry': ((0.2734375, 0.017578125), (0.6875, 0.05078125)), 'words': [{'value': 'GST', 'confidence': 0.9999666213989258, 'geometry': ((0.2734375, 0.02734375), (0.3212890625, 0.0498046875))}, {'value': 'Reg:', 'confidence': 0.9997168183326721, 'geometry': ((0.322265625, 0.02734375), (0.3671875, 0.05078125))}, {'value': 'M2-0065333-5', 'confidence': 0.6861922740936279, 'geometry': ((0.3720703125, 0.0234375), (0.5087890625, 0.0439453125))}, {'value': 'UEN:', 'confidence': 0.9687079787254333, 'geometry': ((0.5087890625, 0.0205078125), (0.5625, 0.0419921875))}, {'value': '198304925E', 'confidence': 0.9952959418296814, 'geometry': ((0.56640625, 0.017578125), (0.6875, 0.0380859375))}]}, {'geometry': ((0.3603515625, 0.0439453125), (0.6015625, 0.0693359375)), 'words': [{'value': 'Phone', 'confidence': 0.9936328530311584, 'geometry': ((0.3603515625, 0.0498046875), (0.423828125, 0.068359375))}, {'value': ':', 'confidence': 0.9998807907104492, 'geometry': ((0.423828125, 0.0478515625), (0.4404296875, 0.0693359375))}, {'value': '67472780', 'confidence': 0.9968281388282776, 'geometry': ((0.4365234375, 0.0458984375), (0.5380859375, 0.0673828125))}, {'value': 'Fax:-', 'confidence': 0.9917964935302734, 'geometry': ((0.5380859375, 0.0439453125), (0.6015625, 0.0654296875))}]}, {'geometry': ((0.3720703125, 0.0703125), 
(0.5888671875, 0.095703125)), 'words': [{'value': 'Manager:', 'confidence': 0.6913022398948669, 'geometry': ((0.3720703125, 0.07421875), (0.4609375, 0.095703125))}, {'value': 'SIVAKUMAR', 'confidence': 0.9983320832252502, 'geometry': ((0.4658203125, 0.0703125), (0.5888671875, 0.0908203125))}]}, {'geometry': ((0.373046875, 0.09375), (0.5869140625, 0.1162109375)), 'words': [{'value': 'Contact', 'confidence': 0.992266833782196, 'geometry': ((0.373046875, 0.0966796875), (0.4482421875, 0.115234375))}, {'value': 'No.:', 'confidence': 0.9826020002365112, 'geometry': ((0.4482421875, 0.09375), (0.4912109375, 0.1162109375))}, {'value': '88008584', 'confidence': 0.8402541875839233, 'geometry': ((0.494140625, 0.09375), (0.5869140625, 0.1123046875))}]}], 'artefacts': []}, {'geometry': ((0.3046875, 0.134765625), (0.6611328125, 0.2314453125)), 'lines': [{'geometry': ((0.3056640625, 0.134765625), (0.6572265625, 0.16015625)), 'words': [{'value': 'Terminal:', 'confidence': 0.8031894564628601, 'geometry': ((0.3056640625, 0.142578125), (0.396484375, 0.16015625))}, {'value': 'BK0003', 'confidence': 0.8097429275512695, 'geometry': ((0.404296875, 0.140625), (0.4814453125, 0.1591796875))}, {'value': '13/02/2022', 'confidence': 0.8739034533500671, 'geometry': ((0.4892578125, 0.1376953125), (0.595703125, 0.158203125))}, {'value': '19:21', 'confidence': 0.9997132420539856, 'geometry': ((0.603515625, 0.134765625), (0.6572265625, 0.15625))}]}, {'geometry': ((0.3046875, 0.1630859375), (0.6611328125, 0.1904296875)), 'words': [{'value': 'ReceiptTaxInvoice', 'confidence': 0.4457036852836609, 'geometry': ((0.3046875, 0.166015625), (0.4892578125, 0.1904296875))}, {'value': 'BKA3500490695', 'confidence': 0.504152774810791, 'geometry': ((0.49609375, 0.1630859375), (0.6611328125, 0.18359375))}]}, {'geometry': ((0.3662109375, 0.1884765625), (0.59765625, 0.208984375)), 'words': [{'value': 'Quotation', 'confidence': 0.8169445991516113, 'geometry': ((0.3662109375, 0.1904296875), (0.458984375, 
0.208984375))}, {'value': 'No.', 'confidence': 0.9977673292160034, 'geometry': ((0.4609375, 0.1884765625), (0.498046875, 0.208984375))}, {'value': ':', 'confidence': 0.9996732473373413, 'geometry': ((0.4990234375, 0.189453125), (0.5126953125, 0.2080078125))}, {'value': 'S031362', 'confidence': 0.5456238985061646, 'geometry': ((0.51171875, 0.1884765625), (0.59765625, 0.20703125))}]}, {'geometry': ((0.34375, 0.208984375), (0.6220703125, 0.2314453125)), 'words': [{'value': 'Cashier:', 'confidence': 0.9858759045600891, 'geometry': ((0.34375, 0.212890625), (0.4228515625, 0.2314453125))}, {'value': 'HONG', 'confidence': 0.9993447661399841, 'geometry': ((0.43359375, 0.2099609375), (0.4990234375, 0.2314453125))}, {'value': 'THI', 'confidence': 0.9992380142211914, 'geometry': ((0.5, 0.2099609375), (0.537109375, 0.2294921875))}, {'value': 'BE', 'confidence': 0.9985008239746094, 'geometry': ((0.5390625, 0.208984375), (0.572265625, 0.2314453125))}, {'value': 'DAO', 'confidence': 0.9940517544746399, 'geometry': ((0.5732421875, 0.208984375), (0.6220703125, 0.228515625))}]}], 'artefacts': []}, {'geometry': ((0.2451171875, 0.24609375), (0.40234375, 0.26953125)), 'lines': [{'geometry': ((0.2451171875, 0.24609375), (0.40234375, 0.26953125)), 'words': [{'value': 'No', 'confidence': 0.9999253749847412, 'geometry': ((0.2451171875, 0.24609375), (0.2822265625, 0.26953125))}, {'value': 'Description', 'confidence': 0.9901004433631897, 'geometry': ((0.294921875, 0.248046875), (0.40234375, 0.26953125))}]}], 'artefacts': []}, {'geometry': ((0.564453125, 0.2421875), (0.7177734375, 0.26953125)), 'lines': [{'geometry': ((0.564453125, 0.2421875), (0.7177734375, 0.26953125)), 'words': [{'value': 'Qty', 'confidence': 0.9939969778060913, 'geometry': ((0.564453125, 0.2421875), (0.6064453125, 0.26953125))}, {'value': 'Amount', 'confidence': 0.9966546297073364, 'geometry': ((0.640625, 0.2431640625), (0.7177734375, 0.26171875))}]}], 'artefacts': []}, {'geometry': ((0.2578125, 0.2724609375), 
(0.5908203125, 0.298828125)), 'lines': [{'geometry': ((0.2578125, 0.2724609375), (0.5908203125, 0.298828125)), 'words': [{'value': '1.', 'confidence': 0.9985117316246033, 'geometry': ((0.2578125, 0.2744140625), (0.2919921875, 0.298828125))}, {'value': '#OTIS', 'confidence': 0.9894990921020508, 'geometry': ((0.2919921875, 0.275390625), (0.3642578125, 0.2978515625))}, {'value': 'BARISTA', 'confidence': 0.42725348472595215, 'geometry': ((0.3662109375, 0.2763671875), (0.458984375, 0.2939453125))}, {'value': 'OAT', 'confidence': 0.999354898929596, 'geometry': ((0.4609375, 0.2744140625), (0.5068359375, 0.2939453125))}, {'value': 'MILK', 'confidence': 0.9774147272109985, 'geometry': ((0.5087890625, 0.2724609375), (0.5634765625, 0.2939453125))}, {'value': '1L', 'confidence': 0.9945043325424194, 'geometry': ((0.5595703125, 0.2724609375), (0.5908203125, 0.29296875))}]}], 'artefacts': []}, {'geometry': ((0.2490234375, 0.30859375), (0.45703125, 0.40234375)), 'lines': [{'geometry': ((0.306640625, 0.30859375), (0.45703125, 0.326171875)), 'words': [{'value': '9421906089017', 'confidence': 0.9027230143547058, 'geometry': ((0.306640625, 0.30859375), (0.45703125, 0.326171875))}]}, {'geometry': ((0.3046875, 0.3330078125), (0.4208984375, 0.3544921875)), 'words': [{'value': '2', 'confidence': 0.9997554421424866, 'geometry': ((0.3046875, 0.3330078125), (0.322265625, 0.3544921875))}, {'value': 'for', 'confidence': 0.9995049238204956, 'geometry': ((0.3212890625, 0.3330078125), (0.3525390625, 0.3544921875))}, {'value': '$11.95', 'confidence': 0.9979997277259827, 'geometry': ((0.353515625, 0.333984375), (0.4208984375, 0.3525390625))}]}, {'geometry': ((0.2490234375, 0.3798828125), (0.3828125, 0.40234375)), 'words': [{'value': 'Total', 'confidence': 0.9654089212417603, 'geometry': ((0.2490234375, 0.3798828125), (0.302734375, 0.40234375))}, {'value': 'Amount', 'confidence': 0.9976258873939514, 'geometry': ((0.3056640625, 0.3818359375), (0.3828125, 0.400390625))}]}], 'artefacts': []}, 
{'geometry': ((0.529296875, 0.3056640625), (0.607421875, 0.32421875)), 'lines': [{'geometry': ((0.529296875, 0.3056640625), (0.607421875, 0.32421875)), 'words': [{'value': '4x6.95', 'confidence': 0.629564642906189, 'geometry': ((0.529296875, 0.3056640625), (0.607421875, 0.32421875))}]}], 'artefacts': []}, {'geometry': ((0.6513671875, 0.3017578125), (0.724609375, 0.47265625)), 'lines': [{'geometry': ((0.662109375, 0.3017578125), (0.7216796875, 0.3232421875)), 'words': [{'value': '27.80', 'confidence': 0.9991148114204407, 'geometry': ((0.662109375, 0.3017578125), (0.7216796875, 0.3232421875))}]}, {'geometry': ((0.66796875, 0.328125), (0.7216796875, 0.349609375)), 'words': [{'value': '-3.90', 'confidence': 0.9843301177024841, 'geometry': ((0.66796875, 0.328125), (0.7216796875, 0.349609375))}]}, {'geometry': ((0.6513671875, 0.375), (0.7216796875, 0.3974609375)), 'words': [{'value': '$23.90', 'confidence': 0.9994686245918274, 'geometry': ((0.6513671875, 0.375), (0.7216796875, 0.3974609375))}]}, {'geometry': ((0.65234375, 0.4111328125), (0.72265625, 0.4326171875)), 'words': [{'value': '$23.90', 'confidence': 0.9990628361701965, 'geometry': ((0.65234375, 0.4111328125), (0.72265625, 0.4326171875))}]}, {'geometry': ((0.666015625, 0.4501953125), (0.724609375, 0.47265625)), 'words': [{'value': '$0.00', 'confidence': 0.9990418553352356, 'geometry': ((0.666015625, 0.4501953125), (0.724609375, 0.47265625))}]}], 'artefacts': []}, {'geometry': ((0.248046875, 0.416015625), (0.4931640625, 0.560546875)), 'lines': [{'geometry': ((0.2509765625, 0.416015625), (0.4931640625, 0.4384765625)), 'words': [{'value': 'MASIERICOOID', 'confidence': 0.16996675729751587, 'geometry': ((0.2509765625, 0.416015625), (0.4931640625, 0.4384765625))}]}, {'geometry': ((0.2490234375, 0.455078125), (0.376953125, 0.48046875)), 'words': [{'value': 'Change', 'confidence': 0.9970219731330872, 'geometry': ((0.2490234375, 0.4560546875), (0.330078125, 0.48046875))}, {'value': 'Due', 'confidence': 0.9999706745147705, 
'geometry': ((0.33203125, 0.455078125), (0.376953125, 0.4775390625))}]}, {'geometry': ((0.2490234375, 0.48828125), (0.447265625, 0.51171875)), 'words': [{'value': 'Items', 'confidence': 0.9890830516815186, 'geometry': ((0.2490234375, 0.490234375), (0.306640625, 0.51171875))}, {'value': 'Purchased', 'confidence': 0.9993000030517578, 'geometry': ((0.310546875, 0.4892578125), (0.4189453125, 0.509765625))}, {'value': ':', 'confidence': 0.9981997013092041, 'geometry': ((0.419921875, 0.490234375), (0.43359375, 0.509765625))}, {'value': '4', 'confidence': 0.9994581341743469, 'geometry': ((0.4296875, 0.48828125), (0.447265625, 0.509765625))}]}, {'geometry': ((0.248046875, 0.53125), (0.3935546875, 0.560546875)), 'words': [{'value': '#Total', 'confidence': 0.9086952209472656, 'geometry': ((0.248046875, 0.53125), (0.322265625, 0.560546875))}, {'value': 'Saving', 'confidence': 0.9651548862457275, 'geometry': ((0.3232421875, 0.5341796875), (0.3935546875, 0.5595703125))}]}], 'artefacts': []}, {'geometry': ((0.4296875, 0.5322265625), (0.4970703125, 0.5546875)), 'lines': [{'geometry': ((0.4296875, 0.5322265625), (0.4970703125, 0.5546875)), 'words': [{'value': '-', 'confidence': 0.43670952320098877, 'geometry': ((0.4296875, 0.5361328125), (0.4453125, 0.55078125))}, {'value': '$3.90', 'confidence': 0.9483895301818848, 'geometry': ((0.4365234375, 0.5322265625), (0.4970703125, 0.5546875))}]}], 'artefacts': []}, {'geometry': ((0.2509765625, 0.564453125), (0.6005859375, 0.5908203125)), 'lines': [{'geometry': ((0.2509765625, 0.564453125), (0.6005859375, 0.5908203125)), 'words': [{'value': 'GST', 'confidence': 0.9998136162757874, 'geometry': ((0.2509765625, 0.568359375), (0.30078125, 0.5908203125))}, {'value': '%', 'confidence': 0.999920129776001, 'geometry': ((0.3017578125, 0.5673828125), (0.3271484375, 0.5908203125))}, {'value': 'Exclude', 'confidence': 0.8899426460266113, 'geometry': ((0.353515625, 0.568359375), (0.43359375, 0.5869140625))}, {'value': 'GST', 'confidence': 
0.9998469352722168, 'geometry': ((0.435546875, 0.5654296875), (0.4853515625, 0.587890625))}, {'value': 'GST', 'confidence': 0.998401939868927, 'geometry': ((0.5048828125, 0.564453125), (0.5546875, 0.5869140625))}, {'value': 'Amt', 'confidence': 0.850462794303894, 'geometry': ((0.5546875, 0.564453125), (0.6005859375, 0.5869140625))}]}], 'artefacts': []}, {'geometry': ((0.6533203125, 0.5654296875), (0.734375, 0.6171875)), 'lines': [{'geometry': ((0.6533203125, 0.5654296875), (0.7314453125, 0.583984375)), 'words': [{'value': 'Amount', 'confidence': 0.9848493337631226, 'geometry': ((0.6533203125, 0.5654296875), (0.7314453125, 0.583984375))}]}, {'geometry': ((0.6630859375, 0.5947265625), (0.734375, 0.6171875)), 'words': [{'value': '$23.90', 'confidence': 0.9978439807891846, 'geometry': ((0.6630859375, 0.5947265625), (0.734375, 0.6171875))}]}], 'artefacts': []}, {'geometry': ((0.2802734375, 0.599609375), (0.2978515625, 0.6220703125)), 'lines': [{'geometry': ((0.2802734375, 0.599609375), (0.2978515625, 0.6220703125)), 'words': [{'value': '7', 'confidence': 0.9998346567153931, 'geometry': ((0.2802734375, 0.599609375), (0.2978515625, 0.6220703125))}]}], 'artefacts': []}, {'geometry': ((0.4140625, 0.5986328125), (0.484375, 0.6201171875)), 'lines': [{'geometry': ((0.4140625, 0.5986328125), (0.484375, 0.6201171875)), 'words': [{'value': '$22.34', 'confidence': 0.9993184804916382, 'geometry': ((0.4140625, 0.5986328125), (0.484375, 0.6201171875))}]}], 'artefacts': []}, {'geometry': ((0.541015625, 0.5966796875), (0.6005859375, 0.6181640625)), 'lines': [{'geometry': ((0.541015625, 0.5966796875), (0.6005859375, 0.6181640625)), 'words': [{'value': '$1.56', 'confidence': 0.9944227337837219, 'geometry': ((0.541015625, 0.5966796875), (0.6005859375, 0.6181640625))}]}], 'artefacts': []}, {'geometry': ((0.4404296875, 0.666015625), (0.5400390625, 0.6875)), 'lines': [{'geometry': ((0.4404296875, 0.666015625), (0.5400390625, 0.6875)), 'words': [{'value': 'MASTER', 'confidence': 
0.8670670986175537, 'geometry': ((0.4404296875, 0.666015625), (0.5400390625, 0.6875))}]}], 'artefacts': []}, {'geometry': ((0.248046875, 0.701171875), (0.6865234375, 0.74609375)), 'lines': [{'geometry': ((0.248046875, 0.701171875), (0.642578125, 0.7216796875)), 'words': [{'value': 'DatelTime:', 'confidence': 0.8654562830924988, 'geometry': ((0.248046875, 0.7041015625), (0.337890625, 0.7216796875))}, {'value': '13022022192100', 'confidence': 0.6854404211044312, 'geometry': ((0.35546875, 0.7041015625), (0.525390625, 0.71875))}, {'value': '(Contactiess)', 'confidence': 0.5816012024879456, 'geometry': ((0.5361328125, 0.701171875), (0.642578125, 0.7216796875))}]}, {'geometry': ((0.248046875, 0.7255859375), (0.6865234375, 0.74609375)), 'words': [{'value': 'Mercid', 'confidence': 0.8570956587791443, 'geometry': ((0.248046875, 0.7275390625), (0.3134765625, 0.74609375))}, {'value': '000001050644651', 'confidence': 0.7285884022712708, 'geometry': ((0.34375, 0.7265625), (0.4970703125, 0.744140625))}, {'value': 'Terminal', 'confidence': 0.9665992259979248, 'geometry': ((0.5068359375, 0.7265625), (0.5830078125, 0.744140625))}, {'value': '-', 'confidence': 0.93905109167099, 'geometry': ((0.591796875, 0.728515625), (0.6015625, 0.7421875))}, {'value': '51523260', 'confidence': 0.9988250136375427, 'geometry': ((0.6015625, 0.7255859375), (0.6865234375, 0.7431640625))}]}], 'artefacts': []}, {'geometry': ((0.2490234375, 0.75), (0.4697265625, 0.79296875)), 'lines': [{'geometry': ((0.25, 0.75), (0.4111328125, 0.7724609375)), 'words': [{'value': 'Approval', 'confidence': 0.9914907813072205, 'geometry': ((0.25, 0.7509765625), (0.326171875, 0.7724609375))}, {'value': ':', 'confidence': 0.9201642274856567, 'geometry': ((0.3330078125, 0.7509765625), (0.3466796875, 0.7705078125))}, {'value': 'R69046', 'confidence': 0.9995259046554565, 'geometry': ((0.3427734375, 0.75), (0.4111328125, 0.7685546875))}]}, {'geometry': ((0.2490234375, 0.771484375), (0.4697265625, 0.79296875)), 'words': [{'value': 
'RefNo', 'confidence': 0.9922246932983398, 'geometry': ((0.2490234375, 0.771484375), (0.3125, 0.79296875))}, {'value': '000011076745', 'confidence': 0.9994035959243774, 'geometry': ((0.34375, 0.771484375), (0.4697265625, 0.7890625))}]}], 'artefacts': []}, {'geometry': ((0.5078125, 0.748046875), (0.576171875, 0.814453125)), 'lines': [{'geometry': ((0.5078125, 0.748046875), (0.5595703125, 0.7666015625)), 'words': [{'value': 'Batch', 'confidence': 0.9954745173454285, 'geometry': ((0.5078125, 0.748046875), (0.5595703125, 0.7666015625))}]}, {'geometry': ((0.5078125, 0.7685546875), (0.552734375, 0.7880859375)), 'words': [{'value': 'Card', 'confidence': 0.9997015595436096, 'geometry': ((0.5078125, 0.7685546875), (0.552734375, 0.7880859375))}]}, {'geometry': ((0.5078125, 0.7958984375), (0.576171875, 0.814453125)), 'words': [{'value': 'Amount', 'confidence': 0.9982516169548035, 'geometry': ((0.5078125, 0.7958984375), (0.576171875, 0.814453125))}]}], 'artefacts': []}, {'geometry': ((0.6015625, 0.7470703125), (0.6669921875, 0.7646484375)), 'lines': [{'geometry': ((0.6015625, 0.7470703125), (0.6669921875, 0.7646484375)), 'words': [{'value': '000435', 'confidence': 0.9871779680252075, 'geometry': ((0.6015625, 0.7470703125), (0.6669921875, 0.7646484375))}]}], 'artefacts': []}, {'geometry': ((0.65625, 0.7685546875), (0.7373046875, 0.857421875)), 'lines': [{'geometry': ((0.6728515625, 0.7685546875), (0.7138671875, 0.783203125)), 'words': [{'value': '1641', 'confidence': 0.9989182949066162, 'geometry': ((0.6728515625, 0.7685546875), (0.7138671875, 0.783203125))}]}, {'geometry': ((0.666015625, 0.7939453125), (0.732421875, 0.8125)), 'words': [{'value': '$23.90', 'confidence': 0.9973084926605225, 'geometry': ((0.666015625, 0.7939453125), (0.732421875, 0.8125))}]}, {'geometry': ((0.65625, 0.8330078125), (0.7373046875, 0.857421875)), 'words': [{'value': '$23.90', 'confidence': 0.9831066131591797, 'geometry': ((0.65625, 0.8330078125), (0.7373046875, 0.857421875))}]}], 'artefacts': []}, 
{'geometry': ((0.4208984375, 0.8369140625), (0.57421875, 0.9228515625)), 'lines': [{'geometry': ((0.4345703125, 0.8369140625), (0.560546875, 0.8603515625)), 'words': [{'value': 'Net', 'confidence': 0.9999843835830688, 'geometry': ((0.4345703125, 0.8369140625), (0.4765625, 0.8603515625))}, {'value': 'Amount', 'confidence': 0.9871867895126343, 'geometry': ((0.4775390625, 0.8369140625), (0.560546875, 0.8583984375))}]}, {'geometry': ((0.4208984375, 0.8984375), (0.57421875, 0.9228515625)), 'words': [{'value': 'APPROVED', 'confidence': 0.9999109506607056, 'geometry': ((0.4208984375, 0.8984375), (0.57421875, 0.9228515625))}]}], 'artefacts': []}]}]}
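The export is a nested dictionary: pages contain blocks, blocks contain lines, and lines contain words. As a minimal sketch of how it can be walked, the helper below joins the words of each line into one string. The name `export_to_text_lines` is introduced here for illustration, and `sample_export` is a hand-trimmed excerpt of the output above, not a new OCR result.

```python
def export_to_text_lines(export):
    """Return one string per OCR line, in the order the lines appear in the export."""
    text_lines = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                words = [word["value"] for word in line.get("words", [])]
                if words:
                    text_lines.append(" ".join(words))
    return text_lines


# Hand-trimmed excerpt of the export shown above (geometry and confidence omitted)
sample_export = {
    "pages": [{
        "blocks": [{
            "lines": [
                {"words": [{"value": "Terminal:"}, {"value": "BK0003"},
                           {"value": "13/02/2022"}, {"value": "19:21"}]},
                {"words": [{"value": "Cashier:"}, {"value": "HONG"},
                           {"value": "THI"}, {"value": "BE"}, {"value": "DAO"}]},
            ]
        }]
    }]
}

for text in export_to_text_lines(sample_export):
    print(text)
```

The same function should work on the full `json_export` dictionary, since it only relies on the `pages`, `blocks`, `lines`, `words`, and `value` keys visible in the output.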

Let's display the output using matplotlib.

synthetic_pages = result.synthesize()
plt.figure(figsize=(18, 16))  # Adjust the width and height as needed
plt.imshow(synthetic_pages[0]); plt.axis('off'); plt.show()

[Image: the synthesized page rendered with matplotlib]

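We need to remove irrelevant information from the JSON output: the dimensions, orientation, and language fields, as well as the geometry of blocks and lines. My focus is solely on the word-level values and geometry highlighted in the box below; confidence is ignored.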

[Image: JSON output with the word-level value and geometry fields highlighted]
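The geometry values are relative coordinates in the range 0 to 1, given as ((xmin, ymin), (xmax, ymax)). If pixel positions are ever needed, they have to be computed before the dimensions field is dropped. The sketch below shows one way to do that; `to_pixel_box` is a hypothetical helper, and it assumes that dimensions is ordered as (height, width).

```python
def to_pixel_box(geometry, dimensions):
    """Convert a relative ((xmin, ymin), (xmax, ymax)) box to integer pixel coordinates."""
    height, width = dimensions  # assumption: the export reports (height, width)
    (xmin, ymin), (xmax, ymax) = geometry
    return (
        round(xmin * width),
        round(ymin * height),
        round(xmax * width),
        round(ymax * height),
    )


# Geometry copied from the '19:21' word in the export above; the page is 600 x 600
time_geometry = ((0.603515625, 0.134765625), (0.6572265625, 0.15625))
print(to_pixel_box(time_geometry, (600, 600)))
```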

The following code removes the irrelevant information from the JSON output.

# Define a function to remove fields recursively
def remove_fields(obj, fields):
    if isinstance(obj, list):
        for item in obj:
            remove_fields(item, fields)
    elif isinstance(obj, dict):
        for key in list(obj.keys()):
            if key in fields:
                del obj[key]
            else:
                remove_fields(obj[key], fields)


# Helper that removes every 'geometry' key at any depth, including those on words.
# It is not called below, because the word-level geometry must be kept.
def remove_geometry(data):
    if isinstance(data, list):
        for item in data:
            remove_geometry(item)
    elif isinstance(data, dict):
        if 'geometry' in data:
            del data['geometry']
        for key, value in data.items():
            remove_geometry(value)


# Fields to remove
fields_to_remove = ['confidence', 'page_idx', 'dimensions', 'orientation', 'language', 'artefacts']


# Remove the specified fields
remove_fields(json_export, fields_to_remove)


# Remove 'geometry' from 'blocks' and 'lines'
for page in json_export['pages']:
    for block in page['blocks']:
        if 'geometry' in block:
            del block['geometry']
        for line in block.get('lines', []):
            if 'geometry' in line:
                del line['geometry']


# Convert the modified data back to JSON
modified_json = json.dumps(json_export, separators=(',', ':'))


# Print the modified JSON
print(modified_json)

Next, save the output to a file named OCR.txt.

# modified_json is already a JSON string, so write it to the file directly
with open("OCR.txt", "w", encoding="utf-8") as file:
    file.write(modified_json)
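As a quick sanity check, the saved file can be read back and parsed as JSON. The sketch below is self-contained: it writes a tiny stand-in for OCR.txt to a temporary directory rather than relying on the real file, and `load_ocr_words` is a name made up for this example. One detail worth noticing is that the geometry tuples come back as lists after the JSON round trip.

```python
import json
import os
import tempfile


def load_ocr_words(path):
    """Load the cleaned OCR JSON from disk and return a flat list of word dicts."""
    with open(path, "r", encoding="utf-8") as file:
        data = json.load(file)
    return [
        word
        for page in data["pages"]
        for block in page["blocks"]
        for line in block["lines"]
        for word in line["words"]
    ]


# Stand-in for the cleaned export, using one word from the output above
sample = {"pages": [{"blocks": [{"lines": [{"words": [
    {"value": "$23.90", "geometry": ((0.6513671875, 0.375), (0.7216796875, 0.3974609375))}
]}]}]}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "OCR.txt")
    with open(sample_path, "w", encoding="utf-8") as file:
        file.write(json.dumps(sample, separators=(",", ":")))
    words = load_ocr_words(sample_path)

print(words)
```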

The resulting output looks like this:

[Image: the cleaned JSON output saved in OCR.txt]

Now we are ready to feed this information to the LLM.

Feeding the Data to the LLM

We continue by importing the LangChain library and entering the Azure OpenAI API settings.

from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import RetrievalQA


import os


os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = ""
os.environ["OPENAI_API_BASE"] = ""
os.environ["OPENAI_API_KEY"] = ""

We load the OCR.txt file, split its contents, and insert them into a FAISS database as vectors computed with OpenAI embeddings.

embedding_model = OpenAIEmbeddings(chunk_size=10)
OCR_Content = TextLoader('OCR.txt').load()
text_splitter = CharacterTextSplitter(chunk_overlap=100)
content = text_splitter.split_documents(OCR_Content)
faiss_db = FAISS.from_documents(content, embedding_model)
retriever = faiss_db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
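Conceptually, a similarity retriever with k set to 4 returns the four stored chunks whose embedding vectors lie closest to the embedding of the query. The toy sketch below illustrates only that idea, using hand-made 2-D vectors and Euclidean distance in plain Python. It is not how FAISS is implemented, and the chunk labels and vectors are invented for illustration.

```python
import math


def top_k_nearest(query_vector, indexed_chunks, k=4):
    """Return the labels of the k chunks whose vectors are closest to the query."""
    ranked = sorted(indexed_chunks, key=lambda item: math.dist(query_vector, item[1]))
    return [label for label, _ in ranked[:k]]


# Invented (label, vector) pairs standing in for embedded text chunks
toy_index = [
    ("header", (0.9, 0.1)),
    ("items", (0.2, 0.8)),
    ("totals", (0.1, 0.9)),
    ("payment", (0.5, 0.5)),
    ("footer", (0.95, 0.05)),
]

print(top_k_nearest((0.0, 1.0), toy_index, k=2))
```

With real embeddings the vectors have far more dimensions, but the ranking step works the same way: the chunks nearest to the query are the ones passed to the LLM as context.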

We set the temperature to 0 and use the gpt-4 deployment. We also build a prompt template, in which I state the task explicitly:

llm = AzureChatOpenAI(
    temperature=0,
    deployment_name="gpt-4",
)


prompt_template = """


Task: Analyze the JSON receipt data provided and group "value" entries with similar "geometry" proximity under "words," then summarize this information into one concise sentence.
    
JSON Data:
{context}
    
User questions: 
{question}
       
Respond to the user in JSON format and include the key-value pairs:


"""
QA_PROMPT = PromptTemplate(
    template=prompt_template, input_variables=['context', 'question']
)

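The prompt instructs the model to analyze the provided JSON receipt data, group the "value" entries under "words" whose "geometry" is close together, and then summarize this information in one concise sentence.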
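To make "geometry proximity" concrete, here is a deterministic sketch of the kind of grouping the prompt asks the LLM to perform: words whose vertical centers are close together are treated as one row and read from left to right. This is only an illustration and not what the model does internally; `group_words_into_rows` and the `tolerance` value of 0.012 are assumptions chosen for this example. The sample words and their geometry are copied from the export above.

```python
def group_words_into_rows(words, tolerance=0.012):
    """Group words into rows when their vertical centers differ by at most `tolerance`."""
    def y_center(word):
        (_, ymin), (_, ymax) = word["geometry"]
        return (ymin + ymax) / 2

    def x_min(word):
        return word["geometry"][0][0]

    rows = []
    for word in sorted(words, key=y_center):
        center = y_center(word)
        # Compare against the first word of the current row
        if rows and abs(center - rows[-1]["y"]) <= tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"y": center, "words": [word]})

    # Within each row, read the words from left to right
    return [" ".join(w["value"] for w in sorted(row["words"], key=x_min)) for row in rows]


sample_words = [
    {"value": "Total", "geometry": ((0.2490234375, 0.3798828125), (0.302734375, 0.40234375))},
    {"value": "Amount", "geometry": ((0.3056640625, 0.3818359375), (0.3828125, 0.400390625))},
    {"value": "$23.90", "geometry": ((0.6513671875, 0.375), (0.7216796875, 0.3974609375))},
    {"value": "Change", "geometry": ((0.2490234375, 0.4560546875), (0.330078125, 0.48046875))},
    {"value": "Due", "geometry": ((0.33203125, 0.455078125), (0.376953125, 0.4775390625))},
    {"value": "$0.00", "geometry": ((0.666015625, 0.4501953125), (0.724609375, 0.47265625))},
]

for row in group_words_into_rows(sample_words):
    print(row)
```

On this receipt, docTR placed the labels and the amounts in different blocks, so a geometric grouping like this is what reconnects "Total Amount" with "$23.90".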

qa_chain = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=retriever, 
    chain_type_kwargs={"prompt": QA_PROMPT},
    verbose=True
)


question = """


Please extract the following details:
Amount, 
Receipt/Invoice number, 
Date & Time,
Line Items


"""


result = qa_chain({"query": question})
print(result["result"])

The code above uses RetrievalQA with a specific question to extract the amount, the receipt number, the date and time, and the line items.


Here is the output:

[Image: the JSON response returned by the LLM]

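It accurately extracted the amount, the receipt number, and the receipt's date and time. Additional fine-tuning is needed to improve the line-item output.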
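Since the prompt asks for a JSON response, the reply can be parsed and turned into typed values for downstream use. The sketch below is a minimal example under stated assumptions: `parse_llm_json` is a hypothetical helper, and `sample_reply` is an illustrative string built from values on the receipt, not the model's actual output. The key names are also assumed, so they may need adjusting to whatever the model really returns.

```python
import json
from datetime import datetime


def parse_llm_json(reply):
    """Parse the text between the first '{' and the last '}' of a reply as JSON."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


# Illustrative reply only; the key names are assumptions
sample_reply = """Here is the result:
{"Amount": "$23.90", "Receipt/Invoice number": "BKA3500490695", "Date & Time": "13/02/2022 19:21"}
"""

details = parse_llm_json(sample_reply)
amount = float(details["Amount"].lstrip("$"))
timestamp = datetime.strptime(details["Date & Time"], "%d/%m/%Y %H:%M")

print(amount, timestamp.isoformat())
```

Slicing from the first brace to the last one tolerates replies that wrap the JSON in extra prose, but it will still fail if the model returns malformed JSON, so a try/except around the call is advisable in practice.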

