Document Verification for KYC with an AI-OCR and Computer Vision Tool


Original article: https://medium.com/@thavani.shiva3/document-verification-for-kyc-with-ai-ocr-computer-vision-tool-3485d85d75f6


AI-OCR is a tool created using Deep Learning and Computer Vision. This tool is useful in the process of Document Verification & KYC for Financial Institutions.


First of all, let's begin with the basics: what is KYC?

"Know Your Customer", or KYC, is the process by which a business verifies the identity of its customers and clients, either before or at the start of doing business with them. Banks, digital payment companies, and other financial institutions are now required by RBI norms to complete the KYC process for their customers before allowing them full access to all services.

The KYC process requires document verification and validation by an executive on the company's side. During COVID this became a difficult task, as people were not willing to step out of their houses, hence the need for E-KYC backed by AI and Computer Vision.

To make the process more efficient and robust, the power of Artificial Intelligence can be leveraged and I will walk through this technique in detail in this article.


How can AI help in KYC?

The AI-OCR tool automatically captures and extracts customer data and creates an editable, searchable copy of it for efficient KYC completion. The tool is well suited to document verification because it runs verification and validation checks on the text extracted by OCR.

In this tutorial, I will use the Aadhaar card as the example document, as Aadhaar is accepted everywhere in the country.

Unique Features of the Aadhaar Card

Every document has some unique features that set it apart from others. The Aadhaar card has these:

1. Emblem
2. GOVERNMENT OF INDIA [GOI] symbol
3. QR code
4. Design of the Aadhaar card

So it is our choice which features to train the machine learning model on. I chose the emblem, the GOI symbol, and the Aadhaar card design. Earlier versions of Aadhaar had a barcode instead of a QR code, so I left the QR code out because I wanted my model to be robust and efficient at the same time.

These features make the Aadhaar card distinguishable from other documents and help in validating whether the submitted document is an Aadhaar or not.


METHODOLOGY

STEP 1: Pre-Processing of Input Images

Before a document can be verified it needs to be processed, because it may contain noise that can affect the validation. The image is therefore processed first, and the noise is removed with a Gaussian filter or any other suitable noise-removal filter; the choice depends on the quality of the input data.

Once the noise is removed, the region of interest (in this case the Aadhaar card) is extracted, which can be done with the Canny edge detection algorithm. We extract the region of interest to discard any irrelevant data present in the input image.
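To make the noise-removal idea concrete, here is a minimal pure-Python sketch of Gaussian smoothing on a tiny grayscale image. It is an illustration only, not the code used in this tool: the function names `gaussian_kernel` and `gaussian_blur`, the 5x5 kernel size, and sigma = 1.0 are all assumptions. In practice a library routine such as OpenCV's `cv2.GaussianBlur`, followed by `cv2.Canny` for the edges, would most likely be used instead.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalized size x size Gaussian kernel (size must be odd)."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    kernel = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in kernel)
    # Normalize so the weights sum to 1 and overall brightness is preserved.
    return [[value / total for value in row] for row in kernel]


def gaussian_blur(image, size=5, sigma=1.0):
    """Blur a grayscale image given as a list of rows of numbers.

    Pixels outside the image are taken from the nearest border pixel.
    """
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(size):
                for kx in range(size):
                    # Clamp the sample position to the image borders.
                    sy = min(max(y + ky - half, 0), height - 1)
                    sx = min(max(x + kx - half, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky][kx]
            out[y][x] = acc
    return out
```

A single bright pixel (a stand-in for noise) is spread over its neighbours and its peak drops, which is the effect we want before running edge detection.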

STEP 2: Extracting Region of Interest

After the region of interest is extracted, feature extraction and object recognition are applied, and the features specific to the Aadhaar card are detected and marked.

For this, we need to train a custom object detection model. I chose the TensorFlow Object Detection API, as I like its model zoo :-D

STEP 3: Training an Object Detection Model

Here are the steps, in brief, for training a custom object detection model:

Step I: Collect a comprehensive dataset of Aadhaar card images, and make sure there are enough images for each Aadhaar card in the dataset. While creating a dataset, always take care of quality, as

QUALITY > QUANTITY.

Step II: After creating the dataset, we need to label it. For this I used Labellerr, as it is an amazing tool that speeds up the whole data-labeling process. We labeled the images with three categories, namely "emblem", "goisymbol", and "aadhaarcard".

Step III: With all the prerequisites ready, we train the model. This step requires a proper environment setup and a lot of computational power, and it is the most time-consuming step in the whole process. We use the TensorFlow Object Detection API with the "Faster R-CNN ResNet101" configuration. This model requires somewhat more computational power than others, but its accuracy compensates for the extra cost. We divide the dataset into two parts, one for training the model and the other for evaluating it as a test set.

This step took almost 34 hours of processing time. We aim to bring the training loss down to the order of 10^(-2), since a lower loss generally means the model predicts the features of the Aadhaar card more accurately. During training we keep an eye on the training parameters and statistics in TensorBoard and closely monitor the loss curve. TensorBoard is the visualization tool that ships with TensorFlow, and its graph shows the loss of the model as training progresses. After this, we get the trained model in the form of a frozen inference graph (.pb), which we use to detect and recognize features on the input Aadhaar card image.
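The train/test division mentioned in Step III can be sketched as below. This is a minimal illustration under my own assumptions: the 80/20 ratio, the fixed seed, and the helper name `split_dataset` are not taken from the original project, and the file names in the usage note are made up.

```python
import random


def split_dataset(image_paths, test_fraction=0.2, seed=42):
    """Shuffle the labeled images reproducibly and split them into
    (train, test) lists. The ratio and seed are illustrative assumptions."""
    paths = sorted(image_paths)          # stable starting order
    rng = random.Random(seed)            # private RNG, does not touch global state
    rng.shuffle(paths)
    n_test = max(1, round(len(paths) * test_fraction))
    return paths[n_test:], paths[:n_test]
```

For example, `split_dataset([f"aadhaar_{i:03d}.jpg" for i in range(10)])` returns 8 training paths and 2 test paths with no overlap. Keeping the test images out of training is what makes the evaluation meaningful.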

Now, our machine has learned the features of the Aadhaar card and it is ready for evaluation so let’s go ahead and see how it performs :3


STEP 4: Verification of Document

The trained model is used to verify whether the input document is a valid Aadhaar card. If it is, we proceed to the next step; otherwise the document is declared invalid and the process ends.
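One possible way to turn the detector's output into this yes/no decision is sketched below. Everything here is an assumption made for illustration: I assume the detections have already been reduced to `(label, score)` pairs, that all three trained labels must be present, and that a score of 0.8 is a reasonable cut-off. The article's actual rule and threshold may differ.

```python
# The three classes the model was trained on (from the labeling step).
REQUIRED_LABELS = {"emblem", "goisymbol", "aadhaarcard"}


def looks_like_aadhaar(detections, min_score=0.8):
    """Return True if every required feature was detected with enough confidence.

    `detections` is an iterable of (label, score) pairs; this shape and the
    0.8 threshold are hypothetical choices, not the tool's real interface.
    """
    confident = {label for label, score in detections if score >= min_score}
    return REQUIRED_LABELS <= confident
```

With this rule, a PAN card that only triggers the emblem detector is rejected, while a card on which all three features are found moves on to the OCR step.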

STEP 5: Extracting Data using OCR

After the submitted document is verified to be an Aadhaar card, the information on it is extracted by means of Optical Character Recognition (OCR). This information contains:

Name: XXXX
DOB: XX-XX-XXXX
Gender: XXXX
Aadhaar Number: 0000 1111 2222
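Once the OCR engine returns raw text, the fields still have to be picked out of it. Here is a minimal sketch using regular expressions. It assumes the OCR text looks like the labeled lines shown above, which real cards and real OCR output may not; the class `AadhaarFields`, the function `parse_aadhaar_text`, and the patterns are hypothetical, and the name in the usage note is made up.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AadhaarFields:
    name: Optional[str]
    dob: Optional[str]
    gender: Optional[str]
    aadhaar_number: Optional[str]   # 12 digits, spaces removed


# Patterns are assumptions about the OCR text layout, not a specification.
NAME_RE = re.compile(r"^\s*Name\s*[:\-]\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)
DOB_RE = re.compile(r"\b(\d{2}[-/]\d{2}[-/]\d{4})\b")
GENDER_RE = re.compile(r"\b(female|male|transgender)\b", re.IGNORECASE)
NUMBER_RE = re.compile(r"\b(\d{4})\s?(\d{4})\s?(\d{4})\b")


def parse_aadhaar_text(text):
    """Extract name, DOB, gender and Aadhaar number from OCR text.

    Any field that cannot be found is returned as None.
    """
    name = NAME_RE.search(text)
    dob = DOB_RE.search(text)
    gender = GENDER_RE.search(text)
    number = NUMBER_RE.search(text)
    return AadhaarFields(
        name=name.group(1) if name else None,
        dob=dob.group(1) if dob else None,
        gender=gender.group(1).capitalize() if gender else None,
        aadhaar_number="".join(number.groups()) if number else None,
    )
```

Calling `parse_aadhaar_text("Name: Asha Kumari\nDOB: 01-01-1990\nGender: Female\nAadhaar Number: 0000 1111 2222")` fills all four fields, with the number normalized to `000011112222`. Missing fields come back as `None`, so the later validation steps can reject the document instead of guessing.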

STEP 6: Validation of Data extracted using OCR

This information is processed and validated against an existing database. Name, DOB, and gender are verified by comparing them with the customer records held in the database of the financial institution concerned.
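The comparison with the customer record could look roughly like the sketch below. The record is represented as a plain dictionary because the article does not describe the institution's database; the keys, the accepted date formats, and the helper names are all assumptions. The point is only that names and dates should be normalized before they are compared, since OCR output and stored records rarely share one format.

```python
from datetime import datetime

# Date formats we are willing to accept; an illustrative assumption.
DATE_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def normalize_text(value):
    """Collapse whitespace and ignore letter case."""
    return " ".join(value.split()).casefold()


def parse_dob(value):
    """Return a date for any accepted format, or None if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def compare_with_record(extracted, record):
    """Compare OCR fields with a customer record, field by field.

    Both arguments are dicts with 'name', 'dob' and 'gender' keys
    (hypothetical schema). Returns a dict of booleans.
    """
    dob_a, dob_b = parse_dob(extracted["dob"]), parse_dob(record["dob"])
    return {
        "name": normalize_text(extracted["name"]) == normalize_text(record["name"]),
        "dob": dob_a is not None and dob_a == dob_b,
        "gender": normalize_text(extracted["gender"]) == normalize_text(record["gender"]),
    }
```

The document passes this layer only if every value in the returned dictionary is `True`, for example `all(compare_with_record(extracted, record).values())`.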

STEP 7: Validation of Aadhaar number

The Aadhaar number is verified using the Verhoeff algorithm and then cross-checked with the UIDAI database. Only if all these conditions are met is the document verified as a valid Aadhaar card.

Aadhaar number fact: The actual UIDAI Aadhaar number is 11 digits, not 12. Do not be surprised: the first 11 digits of the 12-digit number displayed on your Aadhaar card are the actual UID number, and the 12th digit is the checksum computed with the Verhoeff algorithm.

Verhoeff algorithm: The Verhoeff algorithm is a checksum formula for error detection developed by the Dutch mathematician Jacobus Verhoeff and first published in 1969 (source: Wikipedia).
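For readers who want to see the checksum in action, here is a compact sketch of the Verhoeff algorithm using its standard multiplication, permutation, and inverse tables. The function names, and the decision to accept the number with or without spaces, are my own assumptions; the cross-check with the UIDAI database is not covered here.

```python
# Multiplication table of the dihedral group D5.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
# Position-dependent permutation table.
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
# Inverse of each element in D5.
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(digits):
    """Return the check digit to append to a string of decimal digits."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]


def verhoeff_is_valid(digits):
    """Return True if the string (check digit included) passes the checksum."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def aadhaar_checksum_ok(number):
    """Check that `number` has 12 digits (spaces allowed) and a valid checksum."""
    digits = number.replace(" ", "")
    return len(digits) == 12 and digits.isdigit() and verhoeff_is_valid(digits)
```

Using the small example from the Wikipedia article, `verhoeff_check_digit("236")` is 3, so `verhoeff_is_valid("2363")` is true while `verhoeff_is_valid("2364")` is false. Passing this check only shows that the number is well formed, not that it was ever issued, which is why the UIDAI cross-check is a separate step.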

Compilation of Scores obtained from all Verification Layers

There are three verification layers in the system; one more layer can be added if you have a legal contract with UIDAI. The three layers supported by this AI-OCR tool are:

I. Document template check using Computer Vision.
II. Cross-verification of name, date of birth, and gender with the database of the financial institution.
III. Validation of the 12-digit Aadhaar number, based on the Verhoeff algorithm.
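The three layers can be combined into one final decision along the lines of the sketch below. The class and function names, the idea of counting passed layers as the "score", and the wording of the rejection reasons are assumptions for illustration; the article does not spell out how the scores are compiled.

```python
from dataclasses import dataclass


@dataclass
class LayerResults:
    template_ok: bool   # Layer I: document template check
    details_ok: bool    # Layer II: name, DOB and gender match the records
    checksum_ok: bool   # Layer III: Verhoeff checksum of the Aadhaar number


@dataclass
class Decision:
    valid: bool
    layers_passed: int
    reason: str


def compile_decision(results):
    """Accept the document only if every layer passed; otherwise report
    the first layer that failed, in the order the checks are run."""
    checks = [
        (results.template_ok, "document does not match the Aadhaar template"),
        (results.details_ok, "name, date of birth or gender does not match the records"),
        (results.checksum_ok, "Aadhaar number fails the Verhoeff checksum"),
    ]
    layers_passed = sum(1 for ok, _ in checks if ok)
    for ok, reason in checks:
        if not ok:
            return Decision(False, layers_passed, reason)
    return Decision(True, layers_passed, "all verification layers passed")
```

For instance, `compile_decision(LayerResults(True, False, True))` is rejected with two of three layers passed and a reason pointing at the record mismatch.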

Here is the Process Flow of Complete Algorithm


Now our AI-OCR and Computer Vision tool is ready. Let's see how it performs in different scenarios or, in technical terms, test cases.

TEST CASES

Scenario I: The user submits a PAN card or any other document instead of Aadhaar.

In this scenario, the user submits a document other than the Aadhaar card. The trained machine learning model detects this, and the system notifies the user to submit a valid image of the Aadhaar card again. This can happen when the user is in a hurry and submits the wrong document by mistake.

Scenario II: The user submits an image in the format of an Aadhaar card, but not an actual Aadhaar card.

In this scenario, someone tries to dupe the software by submitting an image that follows the Aadhaar card template but is not an actual Aadhaar card. The software handles this kind of mischief and labels the image as an invalid Aadhaar card.

Scenario III: The user mistakenly submits someone else's Aadhaar card instead of his or her own.

In this scenario, the document is a valid Aadhaar card but it does not belong to the user. The software checks this using the information extracted from the Aadhaar card. This can happen because of a user mismatch or human error.

Scenario IV: The Aadhaar submitted by the user carries his details, but it is not genuine.

This scenario arises when the Aadhaar number does not satisfy the Verhoeff checksum used by UIDAI, for example when it is a randomly made-up 12-digit number. The actual Aadhaar number has 11 digits, and the last digit is a checksum that validates it. This adds a further check to the verification process.

Scenario V: The Aadhaar submitted by the user is genuine and belongs to him.

This is the acceptable scenario: all details submitted by the user are genuine, including the Aadhaar number and name.

All these scenarios are handled by the methodology and process flow discussed above. If any of Scenarios I to IV arises, the system rejects the submitted document and marks it as invalid.

Scenario V is the ideal scenario, but our system should be robust enough to handle all kinds of intentional or unintentional errors.

SAMPLE OUTCOMES OF THE TOOL

SAMPLE OUTCOME OF THE OCR TOOL

All these input images were freely available on Google, and none of them was leaked or taken from any classified database.

Final Outcome

This AI-OCR tool is useful for all financial institutions, as KYC has been mandated by the Reserve Bank of India (RBI), and especially so in the post-COVID world, where every effort is being made to reduce human-to-human interaction. The tool addresses both issues and helps financial institutions carry out the whole process efficiently.

Tools & Technologies Used

Python: A programming language well suited to carrying out the AI tasks.
Google Cloud OCR: To extract the text from the Aadhaar card and validate it.
TensorFlow: To train our ML model on Aadhaar features.
Labellerr: To annotate the images for training the model.
OpenCV: To pre-process the images and make their format suitable for the training step.
Docker: To containerize the whole application and deploy it on cloud platforms.

About Me

I am a passionate programmer who is willing to take on work outside his comfort zone, from developing challenging large-scale software to small weekend hackathons.

As my daily routine, I am pursuing Computer Science Engineering at Thapar Institute of Engineering and Technology (TIET).

Connect with me on LinkedIn.