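In this tutorial, I'll show how to capture data from a receipt with OCR and then use a large language model (LLM) to extract the relevant details, such as the total amount, the receipt's date and time, and other related information.
To retrieve the information from the receipt, I'll use Azure OpenAI.
Building the OCR Output Data
Let's start by installing docTR and the required libraries on your machine. I won't go through the installation in detail; if you're interested, detailed instructions are available in the Git repository: https://github.com/mindee/doctr#installation
To check that the installation succeeded without errors, run the following code on the provided receipt image (a JPEG file).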
import os
import json
# Let's pick the desired backend
# os.environ['USE_TF'] = '1'
os.environ['USE_TORCH'] = '1'
import matplotlib.pyplot as plt
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
# Read the file
doc = DocumentFile.from_images("receipt.jpg")
print(f"Number of pages: {len(doc)}")
If there are no errors, you'll get the following output:
Number of pages: 1
Next, let's instantiate a pretrained model.
# Instantiate a pretrained model
predictor = ocr_predictor(pretrained=True)
Run the predictor and export the result as a JSON-style Python dictionary.
result = predictor(doc)
# JSON export
json_export = result.export()
print(json_export)
You'll get the following output:
{'pages': [{'page_idx': 0, 'dimensions': (600, 600), 'orientation': {'value': None, 'confidence': None}, 'language': {'value': None, 'confidence': None}, 'blocks': [{'geometry': ((0.2734375, 0.0), (0.6875, 0.1162109375)), 'lines': [{'geometry': ((0.33984375, 0.0), (0.6171875, 0.0234375)), 'words': [{'value': '#01-901', 'confidence': 0.9932250380516052, 'geometry': ((0.33984375, 0.001953125), (0.416015625, 0.0234375))}, {'value': 'SINGAPORE', 'confidence': 0.9812156558036804, 'geometry': ((0.4208984375, 0.001953125), (0.54296875, 0.01953125))}, {'value': '380011', 'confidence': 0.562835156917572, 'geometry': ((0.5458984375, 0.0), (0.6171875, 0.017578125))}]}, {'geometry': ((0.2734375, 0.017578125), (0.6875, 0.05078125)), 'words': [{'value': 'GST', 'confidence': 0.9999666213989258, 'geometry': ((0.2734375, 0.02734375), (0.3212890625, 0.0498046875))}, {'value': 'Reg:', 'confidence': 0.9997168183326721, 'geometry': ((0.322265625, 0.02734375), (0.3671875, 0.05078125))}, {'value': 'M2-0065333-5', 'confidence': 0.6861922740936279, 'geometry': ((0.3720703125, 0.0234375), (0.5087890625, 0.0439453125))}, {'value': 'UEN:', 'confidence': 0.9687079787254333, 'geometry': ((0.5087890625, 0.0205078125), (0.5625, 0.0419921875))}, {'value': '198304925E', 'confidence': 0.9952959418296814, 'geometry': ((0.56640625, 0.017578125), (0.6875, 0.0380859375))}]}, {'geometry': ((0.3603515625, 0.0439453125), (0.6015625, 0.0693359375)), 'words': [{'value': 'Phone', 'confidence': 0.9936328530311584, 'geometry': ((0.3603515625, 0.0498046875), (0.423828125, 0.068359375))}, {'value': ':', 'confidence': 0.9998807907104492, 'geometry': ((0.423828125, 0.0478515625), (0.4404296875, 0.0693359375))}, {'value': '67472780', 'confidence': 0.9968281388282776, 'geometry': ((0.4365234375, 0.0458984375), (0.5380859375, 0.0673828125))}, {'value': 'Fax:-', 'confidence': 0.9917964935302734, 'geometry': ((0.5380859375, 0.0439453125), (0.6015625, 0.0654296875))}]}, {'geometry': ((0.3720703125, 0.0703125), 
(0.5888671875, 0.095703125)), 'words': [{'value': 'Manager:', 'confidence': 0.6913022398948669, 'geometry': ((0.3720703125, 0.07421875), (0.4609375, 0.095703125))}, {'value': 'SIVAKUMAR', 'confidence': 0.9983320832252502, 'geometry': ((0.4658203125, 0.0703125), (0.5888671875, 0.0908203125))}]}, {'geometry': ((0.373046875, 0.09375), (0.5869140625, 0.1162109375)), 'words': [{'value': 'Contact', 'confidence': 0.992266833782196, 'geometry': ((0.373046875, 0.0966796875), (0.4482421875, 0.115234375))}, {'value': 'No.:', 'confidence': 0.9826020002365112, 'geometry': ((0.4482421875, 0.09375), (0.4912109375, 0.1162109375))}, {'value': '88008584', 'confidence': 0.8402541875839233, 'geometry': ((0.494140625, 0.09375), (0.5869140625, 0.1123046875))}]}], 'artefacts': []}, {'geometry': ((0.3046875, 0.134765625), (0.6611328125, 0.2314453125)), 'lines': [{'geometry': ((0.3056640625, 0.134765625), (0.6572265625, 0.16015625)), 'words': [{'value': 'Terminal:', 'confidence': 0.8031894564628601, 'geometry': ((0.3056640625, 0.142578125), (0.396484375, 0.16015625))}, {'value': 'BK0003', 'confidence': 0.8097429275512695, 'geometry': ((0.404296875, 0.140625), (0.4814453125, 0.1591796875))}, {'value': '13/02/2022', 'confidence': 0.8739034533500671, 'geometry': ((0.4892578125, 0.1376953125), (0.595703125, 0.158203125))}, {'value': '19:21', 'confidence': 0.9997132420539856, 'geometry': ((0.603515625, 0.134765625), (0.6572265625, 0.15625))}]}, {'geometry': ((0.3046875, 0.1630859375), (0.6611328125, 0.1904296875)), 'words': [{'value': 'ReceiptTaxInvoice', 'confidence': 0.4457036852836609, 'geometry': ((0.3046875, 0.166015625), (0.4892578125, 0.1904296875))}, {'value': 'BKA3500490695', 'confidence': 0.504152774810791, 'geometry': ((0.49609375, 0.1630859375), (0.6611328125, 0.18359375))}]}, {'geometry': ((0.3662109375, 0.1884765625), (0.59765625, 0.208984375)), 'words': [{'value': 'Quotation', 'confidence': 0.8169445991516113, 'geometry': ((0.3662109375, 0.1904296875), (0.458984375, 
0.208984375))}, {'value': 'No.', 'confidence': 0.9977673292160034, 'geometry': ((0.4609375, 0.1884765625), (0.498046875, 0.208984375))}, {'value': ':', 'confidence': 0.9996732473373413, 'geometry': ((0.4990234375, 0.189453125), (0.5126953125, 0.2080078125))}, {'value': 'S031362', 'confidence': 0.5456238985061646, 'geometry': ((0.51171875, 0.1884765625), (0.59765625, 0.20703125))}]}, {'geometry': ((0.34375, 0.208984375), (0.6220703125, 0.2314453125)), 'words': [{'value': 'Cashier:', 'confidence': 0.9858759045600891, 'geometry': ((0.34375, 0.212890625), (0.4228515625, 0.2314453125))}, {'value': 'HONG', 'confidence': 0.9993447661399841, 'geometry': ((0.43359375, 0.2099609375), (0.4990234375, 0.2314453125))}, {'value': 'THI', 'confidence': 0.9992380142211914, 'geometry': ((0.5, 0.2099609375), (0.537109375, 0.2294921875))}, {'value': 'BE', 'confidence': 0.9985008239746094, 'geometry': ((0.5390625, 0.208984375), (0.572265625, 0.2314453125))}, {'value': 'DAO', 'confidence': 0.9940517544746399, 'geometry': ((0.5732421875, 0.208984375), (0.6220703125, 0.228515625))}]}], 'artefacts': []}, {'geometry': ((0.2451171875, 0.24609375), (0.40234375, 0.26953125)), 'lines': [{'geometry': ((0.2451171875, 0.24609375), (0.40234375, 0.26953125)), 'words': [{'value': 'No', 'confidence': 0.9999253749847412, 'geometry': ((0.2451171875, 0.24609375), (0.2822265625, 0.26953125))}, {'value': 'Description', 'confidence': 0.9901004433631897, 'geometry': ((0.294921875, 0.248046875), (0.40234375, 0.26953125))}]}], 'artefacts': []}, {'geometry': ((0.564453125, 0.2421875), (0.7177734375, 0.26953125)), 'lines': [{'geometry': ((0.564453125, 0.2421875), (0.7177734375, 0.26953125)), 'words': [{'value': 'Qty', 'confidence': 0.9939969778060913, 'geometry': ((0.564453125, 0.2421875), (0.6064453125, 0.26953125))}, {'value': 'Amount', 'confidence': 0.9966546297073364, 'geometry': ((0.640625, 0.2431640625), (0.7177734375, 0.26171875))}]}], 'artefacts': []}, {'geometry': ((0.2578125, 0.2724609375), 
(0.5908203125, 0.298828125)), 'lines': [{'geometry': ((0.2578125, 0.2724609375), (0.5908203125, 0.298828125)), 'words': [{'value': '1.', 'confidence': 0.9985117316246033, 'geometry': ((0.2578125, 0.2744140625), (0.2919921875, 0.298828125))}, {'value': '#OTIS', 'confidence': 0.9894990921020508, 'geometry': ((0.2919921875, 0.275390625), (0.3642578125, 0.2978515625))}, {'value': 'BARISTA', 'confidence': 0.42725348472595215, 'geometry': ((0.3662109375, 0.2763671875), (0.458984375, 0.2939453125))}, {'value': 'OAT', 'confidence': 0.999354898929596, 'geometry': ((0.4609375, 0.2744140625), (0.5068359375, 0.2939453125))}, {'value': 'MILK', 'confidence': 0.9774147272109985, 'geometry': ((0.5087890625, 0.2724609375), (0.5634765625, 0.2939453125))}, {'value': '1L', 'confidence': 0.9945043325424194, 'geometry': ((0.5595703125, 0.2724609375), (0.5908203125, 0.29296875))}]}], 'artefacts': []}, {'geometry': ((0.2490234375, 0.30859375), (0.45703125, 0.40234375)), 'lines': [{'geometry': ((0.306640625, 0.30859375), (0.45703125, 0.326171875)), 'words': [{'value': '9421906089017', 'confidence': 0.9027230143547058, 'geometry': ((0.306640625, 0.30859375), (0.45703125, 0.326171875))}]}, {'geometry': ((0.3046875, 0.3330078125), (0.4208984375, 0.3544921875)), 'words': [{'value': '2', 'confidence': 0.9997554421424866, 'geometry': ((0.3046875, 0.3330078125), (0.322265625, 0.3544921875))}, {'value': 'for', 'confidence': 0.9995049238204956, 'geometry': ((0.3212890625, 0.3330078125), (0.3525390625, 0.3544921875))}, {'value': '$11.95', 'confidence': 0.9979997277259827, 'geometry': ((0.353515625, 0.333984375), (0.4208984375, 0.3525390625))}]}, {'geometry': ((0.2490234375, 0.3798828125), (0.3828125, 0.40234375)), 'words': [{'value': 'Total', 'confidence': 0.9654089212417603, 'geometry': ((0.2490234375, 0.3798828125), (0.302734375, 0.40234375))}, {'value': 'Amount', 'confidence': 0.9976258873939514, 'geometry': ((0.3056640625, 0.3818359375), (0.3828125, 0.400390625))}]}], 'artefacts': []}, 
{'geometry': ((0.529296875, 0.3056640625), (0.607421875, 0.32421875)), 'lines': [{'geometry': ((0.529296875, 0.3056640625), (0.607421875, 0.32421875)), 'words': [{'value': '4x6.95', 'confidence': 0.629564642906189, 'geometry': ((0.529296875, 0.3056640625), (0.607421875, 0.32421875))}]}], 'artefacts': []}, {'geometry': ((0.6513671875, 0.3017578125), (0.724609375, 0.47265625)), 'lines': [{'geometry': ((0.662109375, 0.3017578125), (0.7216796875, 0.3232421875)), 'words': [{'value': '27.80', 'confidence': 0.9991148114204407, 'geometry': ((0.662109375, 0.3017578125), (0.7216796875, 0.3232421875))}]}, {'geometry': ((0.66796875, 0.328125), (0.7216796875, 0.349609375)), 'words': [{'value': '-3.90', 'confidence': 0.9843301177024841, 'geometry': ((0.66796875, 0.328125), (0.7216796875, 0.349609375))}]}, {'geometry': ((0.6513671875, 0.375), (0.7216796875, 0.3974609375)), 'words': [{'value': '$23.90', 'confidence': 0.9994686245918274, 'geometry': ((0.6513671875, 0.375), (0.7216796875, 0.3974609375))}]}, {'geometry': ((0.65234375, 0.4111328125), (0.72265625, 0.4326171875)), 'words': [{'value': '$23.90', 'confidence': 0.9990628361701965, 'geometry': ((0.65234375, 0.4111328125), (0.72265625, 0.4326171875))}]}, {'geometry': ((0.666015625, 0.4501953125), (0.724609375, 0.47265625)), 'words': [{'value': '$0.00', 'confidence': 0.9990418553352356, 'geometry': ((0.666015625, 0.4501953125), (0.724609375, 0.47265625))}]}], 'artefacts': []}, {'geometry': ((0.248046875, 0.416015625), (0.4931640625, 0.560546875)), 'lines': [{'geometry': ((0.2509765625, 0.416015625), (0.4931640625, 0.4384765625)), 'words': [{'value': 'MASIERICOOID', 'confidence': 0.16996675729751587, 'geometry': ((0.2509765625, 0.416015625), (0.4931640625, 0.4384765625))}]}, {'geometry': ((0.2490234375, 0.455078125), (0.376953125, 0.48046875)), 'words': [{'value': 'Change', 'confidence': 0.9970219731330872, 'geometry': ((0.2490234375, 0.4560546875), (0.330078125, 0.48046875))}, {'value': 'Due', 'confidence': 0.9999706745147705, 
'geometry': ((0.33203125, 0.455078125), (0.376953125, 0.4775390625))}]}, {'geometry': ((0.2490234375, 0.48828125), (0.447265625, 0.51171875)), 'words': [{'value': 'Items', 'confidence': 0.9890830516815186, 'geometry': ((0.2490234375, 0.490234375), (0.306640625, 0.51171875))}, {'value': 'Purchased', 'confidence': 0.9993000030517578, 'geometry': ((0.310546875, 0.4892578125), (0.4189453125, 0.509765625))}, {'value': ':', 'confidence': 0.9981997013092041, 'geometry': ((0.419921875, 0.490234375), (0.43359375, 0.509765625))}, {'value': '4', 'confidence': 0.9994581341743469, 'geometry': ((0.4296875, 0.48828125), (0.447265625, 0.509765625))}]}, {'geometry': ((0.248046875, 0.53125), (0.3935546875, 0.560546875)), 'words': [{'value': '#Total', 'confidence': 0.9086952209472656, 'geometry': ((0.248046875, 0.53125), (0.322265625, 0.560546875))}, {'value': 'Saving', 'confidence': 0.9651548862457275, 'geometry': ((0.3232421875, 0.5341796875), (0.3935546875, 0.5595703125))}]}], 'artefacts': []}, {'geometry': ((0.4296875, 0.5322265625), (0.4970703125, 0.5546875)), 'lines': [{'geometry': ((0.4296875, 0.5322265625), (0.4970703125, 0.5546875)), 'words': [{'value': '-', 'confidence': 0.43670952320098877, 'geometry': ((0.4296875, 0.5361328125), (0.4453125, 0.55078125))}, {'value': '$3.90', 'confidence': 0.9483895301818848, 'geometry': ((0.4365234375, 0.5322265625), (0.4970703125, 0.5546875))}]}], 'artefacts': []}, {'geometry': ((0.2509765625, 0.564453125), (0.6005859375, 0.5908203125)), 'lines': [{'geometry': ((0.2509765625, 0.564453125), (0.6005859375, 0.5908203125)), 'words': [{'value': 'GST', 'confidence': 0.9998136162757874, 'geometry': ((0.2509765625, 0.568359375), (0.30078125, 0.5908203125))}, {'value': '%', 'confidence': 0.999920129776001, 'geometry': ((0.3017578125, 0.5673828125), (0.3271484375, 0.5908203125))}, {'value': 'Exclude', 'confidence': 0.8899426460266113, 'geometry': ((0.353515625, 0.568359375), (0.43359375, 0.5869140625))}, {'value': 'GST', 'confidence': 
0.9998469352722168, 'geometry': ((0.435546875, 0.5654296875), (0.4853515625, 0.587890625))}, {'value': 'GST', 'confidence': 0.998401939868927, 'geometry': ((0.5048828125, 0.564453125), (0.5546875, 0.5869140625))}, {'value': 'Amt', 'confidence': 0.850462794303894, 'geometry': ((0.5546875, 0.564453125), (0.6005859375, 0.5869140625))}]}], 'artefacts': []}, {'geometry': ((0.6533203125, 0.5654296875), (0.734375, 0.6171875)), 'lines': [{'geometry': ((0.6533203125, 0.5654296875), (0.7314453125, 0.583984375)), 'words': [{'value': 'Amount', 'confidence': 0.9848493337631226, 'geometry': ((0.6533203125, 0.5654296875), (0.7314453125, 0.583984375))}]}, {'geometry': ((0.6630859375, 0.5947265625), (0.734375, 0.6171875)), 'words': [{'value': '$23.90', 'confidence': 0.9978439807891846, 'geometry': ((0.6630859375, 0.5947265625), (0.734375, 0.6171875))}]}], 'artefacts': []}, {'geometry': ((0.2802734375, 0.599609375), (0.2978515625, 0.6220703125)), 'lines': [{'geometry': ((0.2802734375, 0.599609375), (0.2978515625, 0.6220703125)), 'words': [{'value': '7', 'confidence': 0.9998346567153931, 'geometry': ((0.2802734375, 0.599609375), (0.2978515625, 0.6220703125))}]}], 'artefacts': []}, {'geometry': ((0.4140625, 0.5986328125), (0.484375, 0.6201171875)), 'lines': [{'geometry': ((0.4140625, 0.5986328125), (0.484375, 0.6201171875)), 'words': [{'value': '$22.34', 'confidence': 0.9993184804916382, 'geometry': ((0.4140625, 0.5986328125), (0.484375, 0.6201171875))}]}], 'artefacts': []}, {'geometry': ((0.541015625, 0.5966796875), (0.6005859375, 0.6181640625)), 'lines': [{'geometry': ((0.541015625, 0.5966796875), (0.6005859375, 0.6181640625)), 'words': [{'value': '$1.56', 'confidence': 0.9944227337837219, 'geometry': ((0.541015625, 0.5966796875), (0.6005859375, 0.6181640625))}]}], 'artefacts': []}, {'geometry': ((0.4404296875, 0.666015625), (0.5400390625, 0.6875)), 'lines': [{'geometry': ((0.4404296875, 0.666015625), (0.5400390625, 0.6875)), 'words': [{'value': 'MASTER', 'confidence': 
0.8670670986175537, 'geometry': ((0.4404296875, 0.666015625), (0.5400390625, 0.6875))}]}], 'artefacts': []}, {'geometry': ((0.248046875, 0.701171875), (0.6865234375, 0.74609375)), 'lines': [{'geometry': ((0.248046875, 0.701171875), (0.642578125, 0.7216796875)), 'words': [{'value': 'DatelTime:', 'confidence': 0.8654562830924988, 'geometry': ((0.248046875, 0.7041015625), (0.337890625, 0.7216796875))}, {'value': '13022022192100', 'confidence': 0.6854404211044312, 'geometry': ((0.35546875, 0.7041015625), (0.525390625, 0.71875))}, {'value': '(Contactiess)', 'confidence': 0.5816012024879456, 'geometry': ((0.5361328125, 0.701171875), (0.642578125, 0.7216796875))}]}, {'geometry': ((0.248046875, 0.7255859375), (0.6865234375, 0.74609375)), 'words': [{'value': 'Mercid', 'confidence': 0.8570956587791443, 'geometry': ((0.248046875, 0.7275390625), (0.3134765625, 0.74609375))}, {'value': '000001050644651', 'confidence': 0.7285884022712708, 'geometry': ((0.34375, 0.7265625), (0.4970703125, 0.744140625))}, {'value': 'Terminal', 'confidence': 0.9665992259979248, 'geometry': ((0.5068359375, 0.7265625), (0.5830078125, 0.744140625))}, {'value': '-', 'confidence': 0.93905109167099, 'geometry': ((0.591796875, 0.728515625), (0.6015625, 0.7421875))}, {'value': '51523260', 'confidence': 0.9988250136375427, 'geometry': ((0.6015625, 0.7255859375), (0.6865234375, 0.7431640625))}]}], 'artefacts': []}, {'geometry': ((0.2490234375, 0.75), (0.4697265625, 0.79296875)), 'lines': [{'geometry': ((0.25, 0.75), (0.4111328125, 0.7724609375)), 'words': [{'value': 'Approval', 'confidence': 0.9914907813072205, 'geometry': ((0.25, 0.7509765625), (0.326171875, 0.7724609375))}, {'value': ':', 'confidence': 0.9201642274856567, 'geometry': ((0.3330078125, 0.7509765625), (0.3466796875, 0.7705078125))}, {'value': 'R69046', 'confidence': 0.9995259046554565, 'geometry': ((0.3427734375, 0.75), (0.4111328125, 0.7685546875))}]}, {'geometry': ((0.2490234375, 0.771484375), (0.4697265625, 0.79296875)), 'words': [{'value': 
'RefNo', 'confidence': 0.9922246932983398, 'geometry': ((0.2490234375, 0.771484375), (0.3125, 0.79296875))}, {'value': '000011076745', 'confidence': 0.9994035959243774, 'geometry': ((0.34375, 0.771484375), (0.4697265625, 0.7890625))}]}], 'artefacts': []}, {'geometry': ((0.5078125, 0.748046875), (0.576171875, 0.814453125)), 'lines': [{'geometry': ((0.5078125, 0.748046875), (0.5595703125, 0.7666015625)), 'words': [{'value': 'Batch', 'confidence': 0.9954745173454285, 'geometry': ((0.5078125, 0.748046875), (0.5595703125, 0.7666015625))}]}, {'geometry': ((0.5078125, 0.7685546875), (0.552734375, 0.7880859375)), 'words': [{'value': 'Card', 'confidence': 0.9997015595436096, 'geometry': ((0.5078125, 0.7685546875), (0.552734375, 0.7880859375))}]}, {'geometry': ((0.5078125, 0.7958984375), (0.576171875, 0.814453125)), 'words': [{'value': 'Amount', 'confidence': 0.9982516169548035, 'geometry': ((0.5078125, 0.7958984375), (0.576171875, 0.814453125))}]}], 'artefacts': []}, {'geometry': ((0.6015625, 0.7470703125), (0.6669921875, 0.7646484375)), 'lines': [{'geometry': ((0.6015625, 0.7470703125), (0.6669921875, 0.7646484375)), 'words': [{'value': '000435', 'confidence': 0.9871779680252075, 'geometry': ((0.6015625, 0.7470703125), (0.6669921875, 0.7646484375))}]}], 'artefacts': []}, {'geometry': ((0.65625, 0.7685546875), (0.7373046875, 0.857421875)), 'lines': [{'geometry': ((0.6728515625, 0.7685546875), (0.7138671875, 0.783203125)), 'words': [{'value': '1641', 'confidence': 0.9989182949066162, 'geometry': ((0.6728515625, 0.7685546875), (0.7138671875, 0.783203125))}]}, {'geometry': ((0.666015625, 0.7939453125), (0.732421875, 0.8125)), 'words': [{'value': '$23.90', 'confidence': 0.9973084926605225, 'geometry': ((0.666015625, 0.7939453125), (0.732421875, 0.8125))}]}, {'geometry': ((0.65625, 0.8330078125), (0.7373046875, 0.857421875)), 'words': [{'value': '$23.90', 'confidence': 0.9831066131591797, 'geometry': ((0.65625, 0.8330078125), (0.7373046875, 0.857421875))}]}], 'artefacts': []}, 
{'geometry': ((0.4208984375, 0.8369140625), (0.57421875, 0.9228515625)), 'lines': [{'geometry': ((0.4345703125, 0.8369140625), (0.560546875, 0.8603515625)), 'words': [{'value': 'Net', 'confidence': 0.9999843835830688, 'geometry': ((0.4345703125, 0.8369140625), (0.4765625, 0.8603515625))}, {'value': 'Amount', 'confidence': 0.9871867895126343, 'geometry': ((0.4775390625, 0.8369140625), (0.560546875, 0.8583984375))}]}, {'geometry': ((0.4208984375, 0.8984375), (0.57421875, 0.9228515625)), 'words': [{'value': 'APPROVED', 'confidence': 0.9999109506607056, 'geometry': ((0.4208984375, 0.8984375), (0.57421875, 0.9228515625))}]}], 'artefacts': []}]}]}
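The export above is nested as pages, then blocks, then lines, then words. As a minimal sketch of how to walk that hierarchy, the helper below yields each word's text and bounding box. The name `iter_words` and the tiny hand-made `sample` dictionary are illustrative assumptions, not part of docTR; the sample only mirrors the structure and a few values of the output shown above.

```python
def iter_words(export):
    """Yield (text, geometry) for every word in a docTR-style export dict."""
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    yield word["value"], word["geometry"]


# Hand-made sample (assumption) that mirrors the structure of the export above
sample = {"pages": [{"blocks": [{"lines": [{"words": [
    {"value": "Total", "confidence": 0.9654,
     "geometry": ((0.2490234375, 0.3798828125), (0.302734375, 0.40234375))},
    {"value": "Amount", "confidence": 0.9976,
     "geometry": ((0.3056640625, 0.3818359375), (0.3828125, 0.400390625))},
]}]}]}]}

words = list(iter_words(sample))
print(" ".join(text for text, _ in words))
```

With the real export, you would call `iter_words(json_export)` instead of passing the sample.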
Let's render the recognized output with matplotlib.
synthetic_pages = result.synthesize()
plt.figure(figsize=(18, 16)) # Adjust the width and height as needed
plt.imshow(synthetic_pages[0]); plt.axis('off'); plt.show()
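We need to remove the irrelevant information from the JSON output: the dimensions, orientation, and language fields, plus the geometry attached to blocks and lines. My focus is only on the word-level value and geometry data highlighted in the box below; the confidence scores are dropped as well.
The following code strips the irrelevant information from the JSON output.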
# Define a function to remove fields recursively
def remove_fields(obj, fields):
    if isinstance(obj, list):
        for item in obj:
            remove_fields(item, fields)
    elif isinstance(obj, dict):
        # Iterate over a copy of the keys so deleting is safe
        for key in list(obj.keys()):
            if key in fields:
                del obj[key]
            else:
                remove_fields(obj[key], fields)

# Helper that removes the 'geometry' key at every level, including words.
# It is not called below, because the word-level geometry must be kept.
def remove_geometry(data):
    if isinstance(data, list):
        for item in data:
            remove_geometry(item)
    elif isinstance(data, dict):
        if 'geometry' in data:
            del data['geometry']
        for key, value in data.items():
            remove_geometry(value)

# Fields to remove
fields_to_remove = ['confidence', 'page_idx', 'dimensions', 'orientation', 'language', 'artefacts']

# Remove the specified fields
remove_fields(json_export, fields_to_remove)

# Remove 'geometry' from 'blocks' and 'lines' only (words keep theirs)
for page in json_export['pages']:
    for block in page['blocks']:
        if 'geometry' in block:
            del block['geometry']
        for line in block.get('lines', []):
            if 'geometry' in line:
                del line['geometry']

# Serialize the modified data to a compact JSON string
modified_json = json.dumps(json_export, separators=(',', ':'))

# Print the modified JSON
print(modified_json)
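The code above edits the export in place. If you would rather keep the original export untouched, one alternative is to build a new, reduced structure that copies only what is needed. This is a sketch under that assumption, not the approach used in this tutorial; `compact_words` and the small `sample` dictionary are hypothetical names introduced for illustration.

```python
import json


def compact_words(export):
    """Return a new dict keeping only each word's value and geometry."""
    return {
        "pages": [
            {"blocks": [
                {"lines": [
                    {"words": [
                        {"value": w["value"], "geometry": w["geometry"]}
                        for w in line.get("words", [])
                    ]}
                    for line in block.get("lines", [])
                ]}
                for block in page.get("blocks", [])
            ]}
            for page in export.get("pages", [])
        ]
    }


# Hand-made sample (assumption) shaped like the docTR export
sample = {"pages": [{"page_idx": 0, "dimensions": (600, 600), "blocks": [{
    "geometry": ((0.27, 0.0), (0.69, 0.12)), "artefacts": [], "lines": [{
        "geometry": ((0.27, 0.02), (0.69, 0.05)), "words": [
            {"value": "GST", "confidence": 0.9999,
             "geometry": ((0.2734375, 0.02734375), (0.3212890625, 0.0498046875))},
        ]}]}]}]}

compact = compact_words(sample)
compact_json = json.dumps(compact, separators=(",", ":"))
print(compact_json)
```

Note that `json.dumps` writes the geometry tuples as JSON arrays, so they come back as lists if the string is parsed again.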
Next, save the output to a file named OCR.txt.
# modified_json is already a JSON string, so it can be written directly
with open("OCR.txt", "w") as file:
    file.write(modified_json)
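The file now holds the compact JSON, containing only the word values and their geometry.
We are now ready to feed this information to the LLM.
Feeding the LLM
We continue by importing the LangChain library and entering the Azure OpenAI API settings.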
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import RetrievalQA
import os
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = ""
os.environ["OPENAI_API_BASE"] = ""
os.environ["OPENAI_API_KEY"] = ""
We load the OCR.txt file, split its content, and insert it into a FAISS database as vectors produced with OpenAI embeddings.
embedding_model = OpenAIEmbeddings(chunk_size=10)
OCR_Content = TextLoader('OCR.txt').load()
text_splitter = CharacterTextSplitter(chunk_overlap=100)
content = text_splitter.split_documents(OCR_Content)
faiss_db = FAISS.from_documents(content, embedding_model)
retriever = faiss_db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
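To make the retrieval step less of a black box, here is a rough, dependency-free sketch of what a similarity search with `k` results does: score every stored chunk against the query vector and keep the best `k`. It ranks by cosine similarity for simplicity; the FAISS store may use a different metric, such as Euclidean distance. The functions `cosine` and `top_k` and the three-dimensional toy vectors are made up for illustration, and real embeddings have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=4):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs; the vectors are invented, not real embeddings
toy_chunks = [
    ("total amount $23.90", [0.9, 0.1, 0.0]),
    ("cashier name", [0.1, 0.9, 0.0]),
    ("date 13/02/2022", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.5]

best = top_k(query_vec, toy_chunks, k=2)
print(best)
```

If there are fewer chunks than `k`, the sketch simply returns all of them.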
We set the temperature to 0 and use a gpt-4 deployment. We also set up a prompt template, in which I state the task explicitly:
llm = AzureChatOpenAI(
temperature=0,
deployment_name="gpt-4",
)
prompt_template = """
Task: Analyze the JSON receipt data provided and group "value" entries with similar "geometry" proximity under "words," then summarize this information into one concise sentence.
JSON Data:
{context}
User questions:
{question}
Respond to the user in JSON format and include the key-value pairs:
"""
QA_PROMPT = PromptTemplate(
template=prompt_template, input_variables=['context', 'question']
)
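In other words, the prompt tells the model to analyze the provided JSON receipt data, group the "value" entries under "words" whose "geometry" is close together, and then summarize that information in one concise sentence.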
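To give a feel for what grouping by "geometry" proximity means, here is a deterministic sketch that clusters words into rows when their vertical centers are close, then orders each row from left to right. This is only an illustration of the idea the prompt delegates to the LLM, not something the tutorial's pipeline runs. The function `group_into_rows` and the `tolerance` value are assumptions; the sample words and coordinates are copied from the OCR output above.

```python
def group_into_rows(words, tolerance=0.015):
    """Group (value, geometry) pairs whose vertical centers are within tolerance."""
    def y_center(geometry):
        (_, y_min), (_, y_max) = geometry
        return (y_min + y_max) / 2

    def x_min(geometry):
        return geometry[0][0]

    rows = []  # each row keeps a running mean of its centers and its words
    for value, geometry in sorted(words, key=lambda w: y_center(w[1])):
        yc = y_center(geometry)
        if rows and abs(yc - rows[-1]["y"]) <= tolerance:
            row = rows[-1]
            row["words"].append((value, geometry))
            row["y"] += (yc - row["y"]) / len(row["words"])
        else:
            rows.append({"y": yc, "words": [(value, geometry)]})
    # Read each row left to right
    return [
        " ".join(v for v, _ in sorted(r["words"], key=lambda w: x_min(w[1])))
        for r in rows
    ]


receipt_words = [
    ("Total", ((0.2490234375, 0.3798828125), (0.302734375, 0.40234375))),
    ("Amount", ((0.3056640625, 0.3818359375), (0.3828125, 0.400390625))),
    ("$23.90", ((0.6513671875, 0.375), (0.7216796875, 0.3974609375))),
    ("Change", ((0.2490234375, 0.4560546875), (0.330078125, 0.48046875))),
    ("Due", ((0.33203125, 0.455078125), (0.376953125, 0.4775390625))),
    ("$0.00", ((0.666015625, 0.4501953125), (0.724609375, 0.47265625))),
]

rows = group_into_rows(receipt_words)
print(rows)  # ['Total Amount $23.90', 'Change Due $0.00']
```

A fixed tolerance like this works only when the receipt is not skewed; the LLM-based grouping in this tutorial avoids having to tune such a threshold.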
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
chain_type_kwargs={"prompt": QA_PROMPT},
verbose=True
)
question = """
Please extract the following details:
Amount,
Receipt/Invoice number,
Date & Time,
Line Items
"""
result = qa_chain({"query": question})
print(result["result"])
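The code above uses RetrievalQA with a specific question to extract the amount, receipt/invoice number, date and time, and line items.
In the resulting output, the amount, receipt number, and receipt date and time were extracted accurately. The line items need additional tuning to improve.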
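Because the prompt asks for a JSON reply, downstream code will usually want to parse it. Models sometimes wrap JSON in a Markdown code fence or add stray text, so a defensive parser helps. The sketch below is one way to do that; `parse_json_reply` is a hypothetical helper, and `sample_reply` is a hand-written stand-in for `result["result"]`, not actual model output.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this listing clean
FENCED_RE = re.compile(
    r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$", re.DOTALL
)


def parse_json_reply(reply):
    """Parse a model reply as JSON, tolerating a surrounding code fence."""
    text = reply.strip()
    fenced = FENCED_RE.match(text)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # caller decides whether to retry or log the raw reply


# Hand-written example reply (assumption), using values visible in the OCR output
sample_reply = (
    FENCE + 'json\n{"Amount": "$23.90", "Date & Time": "13/02/2022 19:21"}\n' + FENCE
)

parsed = parse_json_reply(sample_reply)
print(parsed)
```

Returning `None` on failure keeps the sketch simple; in practice you might re-ask the model or store the raw reply for inspection.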