[Python3] Batch data organization with the Baidu OCR API


沐浴清风z


Article link: https://blog.csdn.net/daliangliangliangge/article/details/102306999


Contents

Purpose
Python interfaces
    - 1. Constructing the request
    - 2. Baidu's Python library
Writing the code
Closing notes

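Purpose

My current work involves a large number of images that need to be transcribed. Doing this purely by hand is too slow, and because the images share a fixed layout, OCR can speed things up considerably. The recognized output still has to be checked by a person, but that is far more efficient than manual entry. Python has quite a few OCR libraries; here we use Baidu's API to do table recognition.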

Python interfaces

Register a Baidu account and open the Baidu OCR console (Baidu AI).

In the console, add the recognition types you need. Baidu offers quite a few, and each type comes with a free quota of calls.
The application list shows your API Key and Secret Key, which are needed later for recognition.

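With these in hand you can call the API. There are two ways to do so: use Baidu's own Python library and call its methods, or construct the requests yourself, parse the token out of the response, and continue from there. The first is clearly simpler, but for completeness we implement both. (Baidu OCR API documentation)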

1. Constructing the request

# -*- coding: utf-8 -*-
import base64
import os

import requests

if __name__ == "__main__":
    # Parent folder of the current working directory
    last_path = os.path.abspath(os.path.dirname(os.getcwd()))
    # Folder holding the images to recognize
    data_path = os.path.join(last_path, "data", "kb")
    # List of image file names
    data_list = os.listdir(data_path)

    # Step 1: exchange the API Key and Secret Key for an access token
    base_url = "https://aip.baidubce.com/oauth/2.0/token"
    grant_type = "client_credentials"
    # Fill in with your own values
    client_id = ""
    client_secret = ""
    params = {
        "grant_type": grant_type,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    rq = requests.post(base_url, data=params).json()
    access_token = rq["access_token"]

    # Step 2: send the base64-encoded image to the recognition endpoint
    result_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic?access_token={}".format(access_token)
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    with open(os.path.join(data_path, data_list[0]), "rb") as f:
        base64_data = base64.b64encode(f.read()).decode()
    data = {"image": base64_data}
    result_rq = requests.post(result_url, headers=headers, data=data).json()
    print(result_rq)
    print(result_url)

[Image: recognition result]

Baidu's API documentation is already very detailed; construct the request directly from the documentation, send it, and you get the result.
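The JSON printed by the script above still has to be unpacked before it is useful. Here is a minimal sketch, assuming the response has the shape I understand Baidu's general text recognition endpoints to return: a `words_result` list whose items carry a `words` string on success, and `error_code` / `error_msg` fields on failure. The helper name `extract_lines` is my own and is not part of any Baidu library.

```python
def extract_lines(response):
    """Return the recognized text lines from an OCR response dict.

    Assumed layout: success responses carry a "words_result" list of
    {"words": str} items; failures carry "error_code" and "error_msg".
    """
    if "error_code" in response:
        raise RuntimeError(
            "OCR request failed: {} {}".format(
                response["error_code"], response.get("error_msg", "")
            )
        )
    return [item["words"] for item in response.get("words_result", [])]


if __name__ == "__main__":
    # Hand-written sample in the assumed layout, not real API output
    sample = {
        "words_result_num": 2,
        "words_result": [{"words": "first line"}, {"words": "second line"}],
    }
    for line in extract_lines(sample):
        print(line)
```

Raising on `error_code` keeps a failed call (for example an expired token) from being mistaken for an image with no text.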

2. Baidu's Python library

To make the API easier to call, Baidu has packaged these functions into a library, baidu-aip. Run pip install baidu-aip to install it. (Library API documentation)

# -*- coding: utf-8 -*-
import os

from aip import AipOcr

config = {
    # Fill in with your own values
    'appId': '',
    'apiKey': '',
    'secretKey': ''
}

client = AipOcr(**config)


def get_file_content(file):
    # Read the image file as raw bytes
    with open(file, 'rb') as fp:
        return fp.read()


def img_to_str(image_path):
    image = get_file_content(image_path)
    # Call the table recognition method tableRecognition; use a different
    # method for a different recognition type
    result = client.tableRecognition(image, {
        'result_type': 'excel',
    })
    return result


if __name__ == '__main__':
    last_path = os.path.abspath(os.path.dirname(os.getcwd()))
    data_path = os.path.join(last_path, "data", "kb")
    data_list = os.listdir(data_path)
    for i in data_list:
        result = img_to_str(os.path.join(data_path, i))
        print(result)
        # Stop after the first image while testing
        break

The result is shown below. At this point you only need to read result_data with pandas' read_excel.
[Image: API result]
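As a rough illustration of that step, the sketch below pulls the spreadsheet URL out of the returned dict and hands it to `pd.read_excel`. It assumes the URL sits at `result["result"]["result_data"]`, which is how this article's final script accesses it; the function names `result_data_url` and `read_result_table` are hypothetical helpers, not part of baidu-aip.

```python
def result_data_url(result):
    """Pull the spreadsheet URL out of a table-recognition result dict.

    Assumed layout (taken from this article): result["result"]["result_data"].
    """
    if "error_code" in result:
        raise RuntimeError(
            "table recognition failed: {}".format(result.get("error_msg"))
        )
    url = result.get("result", {}).get("result_data", "")
    if not url:
        raise ValueError("no result_data in response")
    return url


def read_result_table(result):
    """Load the recognized table into a DataFrame, without a header row."""
    # Imported here so result_data_url can be used without pandas installed
    import pandas as pd

    return pd.read_excel(result_data_url(result), header=None)
```

Calling `read_result_table(img_to_str(path))` would then return the table as a DataFrame, provided pandas has an Excel engine installed that can read the file format the API returns.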

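Writing the code

In practice I called Baidu's table recognition endpoint. The first time I used it I put no limit on how often I called it, so I exceeded the free quota, and the paid calls beyond that quota are, in my opinion, rather expensive. So now, whenever I use the API, I save the returned result as an Excel file and read that file directly. The benefit is that each image is submitted only once, and later data processing is fast as well.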
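One way to make the save-once habit concrete is a small on-disk cache keyed by image name. This is only a sketch under assumptions of my own: the `fetch` callable is a hypothetical function you supply (it calls the OCR API and writes the spreadsheet to the given path), and the `.xls` suffix is a guess, so use whatever format the API actually returns.

```python
import os


def cached_excel_path(image_path, cache_dir):
    """Map an image file to the Excel file that caches its OCR result."""
    stem = os.path.splitext(os.path.basename(image_path))[0]
    return os.path.join(cache_dir, stem + ".xls")  # suffix is an assumption


def excel_for_image(image_path, cache_dir, fetch):
    """Return a local Excel path for image_path, calling fetch at most once.

    fetch(image_path, dest_path) is a caller-supplied (hypothetical) function
    that calls the OCR API and writes the returned spreadsheet to dest_path.
    """
    os.makedirs(cache_dir, exist_ok=True)
    dest = cached_excel_path(image_path, cache_dir)
    if not os.path.exists(dest):
        # The only place where a billable API call can happen
        fetch(image_path, dest)
    return dest
```

A `fetch` for this article could call `img_to_str` and then download `result["result"]["result_data"]` with `urllib.request.urlretrieve(url, dest)`. Re-running the cleaning script then costs no further API calls.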

# -*- coding: utf-8 -*-
'''
author: shikailiang
function: call the Baidu API to recognize existing images, then clean the data with pandas
'''
import os
import re

import pandas as pd
from aip import AipOcr

config = {
    # Fill in with your own values
    'appId': '',
    'apiKey': '',
    'secretKey': ''
}
client = AipOcr(**config)


def get_file_content(file):
    # Read the image file as raw bytes
    with open(file, 'rb') as fp:
        return fp.read()


def img_to_str(image_path):
    image = get_file_content(image_path)
    # Call the table recognition method tableRecognition; use a different
    # method for a different recognition type
    result = client.tableRecognition(image, {
        'result_type': 'excel',
    })
    return result


def istime(row0):
    # Data cleaning: decide which of the two names (七坝 / 兴隆洲) the row contains
    which_str = "识别错误"  # sentinel meaning "recognition error"
    which_series = row0.str.contains("七坝", na=False).sum()
    which_series2 = row0.str.contains("兴隆洲", na=False).sum()
    if which_series == 1 or which_series2 == 1:
        which_str = ("七坝" * which_series) + ("兴隆洲" * which_series2)
    return which_str


def url_to_data(url, j):
    # Read the Excel file into pandas and clean the data
    df = pd.read_excel(url, header=None)
    df.dropna(how="all", inplace=True)
    df = df.fillna("错误值")  # placeholder meaning "bad value"
    df = df[(df.iloc[:, 0] != "A") & (df.iloc[:, 1] != "A")]
    df = df.loc[:, df.iloc[0, :] != 1]
    row0 = df.iloc[0, :]
    which_str = istime(row0)
    if which_str == "识别错误":
        # Fall back to the sheet named "header"
        df1 = pd.read_excel(url, header=None, sheet_name="header")
        row0 = df1.iloc[:, 0]
        which_str = istime(row0)
    try:
        time_str = "+".join(row0.tolist())
        # Extract the time with a regular expression
        time_list = re.search(r".*?(\d.*?)\+.*?(\d.*?)\+", time_str)
        if time_list is None:
            time_list = re.search(r".*?(\d.*?)\+.*?(\d.*?)$", time_str)
        time_value = " ".join(time_list.groups())
        labels = df.iloc[:, 0]
        dx_kb = df[labels.str.contains("电销", na=False)].iloc[0, 1]
        total_kb = df[labels.str.contains("合计", na=False)].iloc[0, 1]
        total_cy = df[labels.str.contains("合计", na=False)].iloc[0, 2]
        fyc_mx = ",".join(
            df[labels.str.contains("非油船|非油", na=False)].iloc[0, 1:].tolist()
        ).replace(",错误值", "").replace("错误值", "")
        fyc_xc = ",".join(
            df[labels.str.contains("新船", na=False)].iloc[0, 1:].tolist()
        ).replace(",错误值", "").replace("错误值", "")
        print(which_str, time_value, dx_kb, total_kb, total_cy, fyc_mx, fyc_xc)
    except Exception:
        print("Image {} failed".format(j))
        # Dump the cleaned frame for manual checking (.xlsx, because recent
        # pandas versions no longer write .xls)
        df.to_excel(str(j) + ".xlsx")


if __name__ == '__main__':
    last_path = os.path.abspath(os.path.dirname(os.getcwd()))
    data_path = os.path.join(last_path, "data", "kb")
    # Excel folder (defined in the original script but not used below)
    data_path1 = os.path.join(last_path, "data", "excel")
    data_list = os.listdir(data_path)
    # Number the images from 1 so the failure message points at the right one
    for j, i in enumerate(data_list, start=1):
        result = img_to_str(os.path.join(data_path, i))
        url = result["result"]["result_data"]
        url_to_data(url, j)

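When the images are regular, the final results are satisfactory, but there are many unexpected cases. So for OCR, if regular expressions are enough to extract the data, recognize the image as plain text and avoid table recognition where possible: the algorithm's handling of cell borders is not that reliable, whereas the text itself is almost always recognized.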
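To show what the plain-text route might look like, here is a small sketch that parses one recognized line into a label and its numbers with a regular expression. The line format (a text label followed by space-separated numbers, such as "合计 120 35") is an assumption for illustration, and `parse_row` is a hypothetical helper; real images will need their own pattern.

```python
import re

# Assumed line shape: a non-digit label, then one or more numbers
ROW_PATTERN = re.compile(r"^(?P<label>\D+?)\s*(?P<numbers>\d[\d\s.]*)$")


def parse_row(line):
    """Split a recognized text line into (label, [numbers]), or return None."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    numbers = re.findall(r"\d+(?:\.\d+)?", match.group("numbers"))
    # Drop spaces and ASCII or full-width colons around the label
    label = match.group("label").strip(" :：")
    return label, [float(number) for number in numbers]
```

Lines that do not fit the pattern come back as `None`, so they can be collected and checked by hand rather than silently mis-parsed.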

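Closing notes

OCR has been around for a long time, and the way it was handled then is broadly similar to today: binarize the image, remove outlier points (that is, denoise), and classify the remaining features, since different characters have different features. Now that Python is so widespread, perhaps we can implement our own text recognition without relying on third-party OCR libraries, which would be interesting in its own right. With machine learning developing quickly, techniques such as image denoising make building our own OCR easier. Perhaps we should start with relatively simple digit recognition and build our own OCR from there.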
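The first two steps mentioned here, binarization and denoising, can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions of my own (a fixed threshold of 128, an image given as rows of 0-255 grayscale integers, and noise defined as an ink pixel with no ink among its eight neighbours); it is not how any particular OCR engine works.

```python
def binarize(gray, threshold=128):
    """Turn a grayscale image (rows of 0-255 ints) into 0/1 pixels; 1 = ink."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def remove_isolated_pixels(binary):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            # Count ink pixels in the 3x3 window, excluding the centre
            neighbours = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0
    return cleaned
```

Feature extraction and classification would follow on the cleaned pixels; digit recognition is a natural first target because there are only ten classes.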
