pytesser: A Python Module for Recognizing CAPTCHAs

yedoubushishen

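pytesser does reasonably well on simple digits and English letters, but it cannot recognize complex images or, out of the box, Chinese. The module also requires the PIL library.

To recognize other languages, download the corresponding language data package, put it in tessdata, and pass an additional language argument when calling image_to_string().

The blog post reproduced below covers installing and using pytesser, improving a low recognition rate, and modifying the source code to recognize other languages. I made the changes, but... it kept raising errors. (baffled face)

For installing pytesser, see: http://www.th7.cn/Program/Python/201602/768304.shtml

What follows is reposted from another blogger.

Source: http://blog.csdn.net/hk_jh/article/details/8961449?utm_source=tuicool&utm_medium=referral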

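pytesser is a module built on Google's open-source OCR project (Tesseract). Importing it in Python lets you convert the text in an image into a string.

Link: https://code.google.com/p/pytesser/

pytesser calls tesseract: your Python code calls the pytesser module, and pytesser in turn runs tesseract to recognize the text in the image.

The steps for the whole process are as follows:

1. First, download pytesser from code.google.com: https://code.google.com/p/pytesser/downloads/detail?name=pytesser_v0.0.1.zip

It needs no installation; you can put it under \Lib\site-packages\ in the Python installation folder and use it directly.

pytesser ships with tesseract.exe and the English data package (by default it recognizes English only), plus some sample images, so it is ready to use once unzipped.

You can test it with the following code: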

 >>> from pytesser import *  
 >>> image = Image.open('fnord.tif')  # Open image object using PIL  
 >>> print image_to_string(image)     # Run tesseract.exe on image  
 fnord  
 >>> print image_file_to_string('fnord.tif')  
 fnord  

 from pytesser import *
 #im = Image.open('fnord.tif')
 #im = Image.open('phototest.tif')
 #im = Image.open('eurotext.tif')
 im = Image.open('fonts_test.png')
 text = image_to_string(im)
 print text

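Note: this module requires the PIL library.

2. Improving a low recognition rate

You can enhance the image (for example, by raising its contrast) or convert it to black and white. Either can improve the recognition rate considerably: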

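 import ImageEnhance  # with Pillow: from PIL import ImageEnhance
 image1 = Image.open('fonts_test.png')  # any image opened with PIL
 enhancer = ImageEnhance.Contrast(image1)
 image2 = enhancer.enhance(4)  # a factor above 1 increases contrast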

You can then call image_to_string on image2 to run the recognition.
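The "convert it to black and white" option mentioned above can be sketched as a fixed-threshold binarization. This is only a sketch and not part of pytesser: the helper names `threshold_table` and `binarize` and the cutoff of 128 are assumptions chosen for illustration, and a suitable cutoff depends on the image. It relies on PIL's `Image.convert('L')` (grayscale conversion) and `Image.point()` (apply a 256-entry lookup table).

```python
def threshold_table(threshold=128):
    """Build a 256-entry lookup table for Image.point().

    Gray values below `threshold` map to 0 (black); all others map to 255 (white).
    """
    return [0 if value < threshold else 255 for value in range(256)]


def binarize(image, threshold=128):
    """Return a black-and-white copy of a PIL image.

    `image` is expected to be a PIL Image object, for example the
    result of Image.open(...). The cutoff is an assumption to tune per image.
    """
    grayscale = image.convert("L")  # 8-bit grayscale, one band
    return grayscale.point(threshold_table(threshold))
```

The result of `binarize(image1)` could then be passed to image_to_string in the same way as image2.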

3. Recognizing other languages

tesseract is a command-line program with the following usage:

tesseract imagename outbase [-l lang] [-psm N] [configfile...]

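imagename is the name of the input image.

outbase is the base name of the output text file; the result is written to outbase.txt.

-l lang sets the language to recognize; the default is English.

For details, see http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html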
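The usage line above can also be assembled programmatically. The sketch below builds the argument list from the parameters just described. The function name `build_tesseract_args` is hypothetical, and the `-psm` spelling follows the usage line shown here, so it may differ in other Tesseract versions. The function only constructs the list and does not run the program.

```python
def build_tesseract_args(imagename, outbase, lang=None, psm=None, exe="tesseract"):
    """Return the command line for tesseract as a list of strings.

    lang and psm are optional; when omitted, tesseract falls back to
    its defaults (English, automatic page segmentation).
    """
    args = [exe, imagename, outbase]
    if lang:
        args += ["-l", lang]
    if psm is not None:
        args += ["-psm", str(psm)]
    return args


# Example: recognize Simplified Chinese, treating the image as a single text line.
command = build_tesseract_args("fnord.tif", "temp", lang="chi_sim", psm=7)
# command == ['tesseract', 'fnord.tif', 'temp', '-l', 'chi_sim', '-psm', '7']
```

A list like this can be handed to subprocess.Popen, which is how call_tesseract in the pytesser source invokes the executable.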

Follow these steps to recognize other languages:

(1) Download the data package for the other language:

https://code.google.com/p/tesseract-ocr/downloads/list

Put the language package in pytesser's tessdata folder.

Next, modify the parameters in pytesser.py. Here is an example:

 """OCR in Python using the Tesseract engine from Google 
 http://code.google.com/p/pytesser/ 
 by Michael J.T. O'Kelly 
 V 0.0.2, 5/26/08"""  
   
 import Image  
 import subprocess  
 import os  
 import StringIO  
   
 import util  
 import errors  
   
   
 tesseract_exe_name = 'dlltest' # Name of executable to be called at command line  
 scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format  
 scratch_text_name_root = "temp" # Leave out the .txt extension  
 _cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation  
 _language = "" # Tesseract uses English if language is not given  
 _pagesegmode = "" # Tesseract uses fully automatic page segmentation if psm is not given (psm is available in v3.01)  
   
 _working_dir = os.getcwd()  
   
 def call_tesseract(input_filename, output_filename, language, pagesegmode):  
         """Calls external tesseract.exe on input file (restrictions on types), 
         outputting output_filename+'txt'"""  
         current_dir = os.getcwd()  
         error_stream = StringIO.StringIO()  
         try:  
                 os.chdir(_working_dir)  
                 args = [tesseract_exe_name, input_filename, output_filename]  
                 if len(language) > 0:  
                         args.append("-l")  
                         args.append(language)  
                 if len(str(pagesegmode)) > 0:  
                         args.append("-psm")  
                         args.append(str(pagesegmode))  
                 try:  
                         proc = subprocess.Popen(args)  
                 except (TypeError, AttributeError):  
                         proc = subprocess.Popen(args, shell=True)  
                 retcode = proc.wait()  
                 if retcode!=0:  
                         error_text = error_stream.getvalue()  
                         errors.check_for_errors(error_stream_text = error_text)  
         finally:  # Guarantee that we return to the original directory  
                 error_stream.close()  
                 os.chdir(current_dir)  
   
 def image_to_string(im, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag):  
         """Converts im to file, applies tesseract, and fetches resulting text. 
         If cleanup=True, delete scratch files after operation."""  
         try:  
                 util.image_to_scratch(im, scratch_image_name)  
                 call_tesseract(scratch_image_name, scratch_text_name_root, lang, psm)  
                 result = util.retrieve_result(scratch_text_name_root)  
         finally:  
                 if cleanup:  
                         util.perform_cleanup(scratch_image_name, scratch_text_name_root)  
         return result  
   
 def image_file_to_string(filename, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag, graceful_errors=True):  
         """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True, 
         converts to compatible format and then applies tesseract.  Fetches resulting text. 
         If cleanup=True, delete scratch files after operation. Parameter lang specifies used language. 
         If lang is empty, English is used. Page segmentation mode parameter psm is available in Tesseract 3.01. 
         psm values are: 
         0 = Orientation and script detection (OSD) only. 
         1 = Automatic page segmentation with OSD. 
         2 = Automatic page segmentation, but no OSD, or OCR 
         3 = Fully automatic page segmentation, but no OSD. (Default) 
         4 = Assume a single column of text of variable sizes. 
         5 = Assume a single uniform block of vertically aligned text. 
         6 = Assume a single uniform block of text. 
         7 = Treat the image as a single text line. 
         8 = Treat the image as a single word. 
         9 = Treat the image as a single word in a circle. 
         10 = Treat the image as a single character."""  
         try:  
                 try:  
                         call_tesseract(filename, scratch_text_name_root, lang, psm)  
                         result = util.retrieve_result(scratch_text_name_root)  
                 except errors.Tesser_General_Exception:  
                         if graceful_errors:  
                                 im = Image.open(filename)  
                                 result = image_to_string(im, lang, psm, cleanup)
                         else:  
                                 raise  
         finally:  
                 if cleanup:  
                         util.perform_cleanup(scratch_image_name, scratch_text_name_root)  
         return result  
           
   
 if __name__=='__main__':  
         im = Image.open('phototest.tif')  
         text = image_to_string(im, cleanup=False)  
         print text  
         text = image_to_string(im, psm=2, cleanup=False)  
         print text  
         try:  
                 text = image_file_to_string('fnord.tif', graceful_errors=False)  
         except errors.Tesser_General_Exception, value:  
                 print "fnord.tif is incompatible filetype.  Try graceful_errors=True"  
                 #print value  
         text = image_file_to_string('fnord.tif', graceful_errors=True, cleanup=False)  
         print "fnord.tif contents:", text  
         text = image_file_to_string('fonts_test.png', graceful_errors=True)  
         print text  
         text = image_file_to_string('fonts_test.png', lang="eng", psm=4, graceful_errors=True)  
         print text  

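The version above is the one provided in the source repository. If all you need is to recognize another language, adding a single language parameter is enough. Here is my example: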

 """OCR in Python using the Tesseract engine from Google 
 http://code.google.com/p/pytesser/ 
 by Michael J.T. O'Kelly 
 V 0.0.1, 3/10/07"""  
   
 import Image  
 import subprocess  
 import util  
 import errors  
   
 tesseract_exe_name = 'tesseract' # Name of executable to be called at command line  
 scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format  
 scratch_text_name_root = "temp" # Leave out the .txt extension  
 cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation  
   
 def call_tesseract(input_filename, output_filename, language):  
     """Calls external tesseract.exe on input file (restrictions on types), 
     outputting output_filename+'txt'"""  
     args = [tesseract_exe_name, input_filename, output_filename, "-l", language]  
     proc = subprocess.Popen(args)  
     retcode = proc.wait()  
     if retcode!=0:  
         errors.check_for_errors()  
   
 def image_to_string(im, cleanup = cleanup_scratch_flag, language = "eng"):  
     """Converts im to file, applies tesseract, and fetches resulting text. 
     If cleanup=True, delete scratch files after operation."""  
     try:  
         util.image_to_scratch(im, scratch_image_name)  
         call_tesseract(scratch_image_name, scratch_text_name_root,language)  
         text = util.retrieve_text(scratch_text_name_root)  
     finally:  
         if cleanup:  
             util.perform_cleanup(scratch_image_name, scratch_text_name_root)  
     return text  
   
 def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True, language = "eng"):  
     """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True, 
     converts to compatible format and then applies tesseract.  Fetches resulting text. 
     If cleanup=True, delete scratch files after operation."""  
     try:  
         try:  
             call_tesseract(filename, scratch_text_name_root, language)  
             text = util.retrieve_text(scratch_text_name_root)  
         except errors.Tesser_General_Exception:  
             if graceful_errors:  
                 im = Image.open(filename)  
                 text = image_to_string(im, cleanup, language)
             else:  
                 raise  
     finally:  
         if cleanup:  
             util.perform_cleanup(scratch_image_name, scratch_text_name_root)  
     return text  
       
   
 if __name__=='__main__':  
     im = Image.open('phototest.tif')  
     text = image_to_string(im)  
     print text  
     try:  
         text = image_file_to_string('fnord.tif', graceful_errors=False)  
     except errors.Tesser_General_Exception, value:  
         print "fnord.tif is incompatible filetype.  Try graceful_errors=True"  
         print value  
     text = image_file_to_string('fnord.tif', graceful_errors=True)  
     print "fnord.tif contents:", text  
     text = image_file_to_string('fonts_test.png', graceful_errors=True)  
     print text  

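When calling image_to_string, just add the corresponding language argument. For Simplified Chinese the value is chi_sim; for Traditional Chinese it is chi_tra.

In other words, it is the XXX from the name of the downloaded XXX.traineddata language file. For example, if the downloaded Chinese package is chi_sim.traineddata, the argument is chi_sim:

 text = image_to_string(self.im, language = 'chi_sim')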
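Because the argument is simply the file name without its .traineddata suffix, the usable language codes can be listed straight from the tessdata folder. A minimal sketch, assuming the folder layout described above; `available_languages` is a hypothetical helper, not a pytesser function.

```python
import os


def available_languages(tessdata_dir):
    """List language codes found in a tessdata folder.

    Every file named XXX.traineddata contributes the code XXX;
    other files in the folder are ignored.
    """
    suffix = ".traineddata"
    codes = []
    for filename in os.listdir(tessdata_dir):
        if filename.endswith(suffix):
            codes.append(filename[:-len(suffix)])
    return sorted(codes)
```

Checking the requested code against this list before calling image_to_string is one way to catch a missing or misplaced language package early.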

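At this point, the image recognition is complete.

One more note: the Chinese text may be recognized correctly but still show up as garbled characters. In that case, decode text according to the encoding in use, for example:

text.decode("utf8")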
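The decoding step can be made a little more defensive. The sketch below is written for Python 3, where the contents of the output file would be read as bytes. It tries UTF-8 first and then falls back to other encodings. The helper name `decode_ocr_output` and the fallback list are assumptions for illustration, not part of pytesser.

```python
def decode_ocr_output(raw, encodings=("utf-8", "gbk")):
    """Decode OCR output bytes, trying each encoding in order.

    If none of the encodings fits, decode with the first one and
    replace undecodable bytes instead of raising.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode(encodings[0], errors="replace")
```

Putting UTF-8 first matters, because a permissive encoding tried earlier could "succeed" on UTF-8 bytes and silently produce the wrong characters.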
