Web Scraping with Python, Part 11: Image Recognition and Text Processing

Latest recommended article published on 2024-08-03 15:35:50

CopperDong

Article link: https://blog.csdn.net/QFire/article/details/78934670

This part uses a few Python libraries to recognize and work with the text in online images.

Converting an image into text is generally called optical character recognition (OCR).

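11.1 Overview of OCR Libraries

Many libraries can do image processing, but this section focuses on just two: Pillow and Tesseract.

Pillow is an image-processing library forked from the Python Imaging Library (PIL), which targeted Python 2.x; Pillow adds support for Python 3.x.

Tesseract is an OCR library, currently sponsored by Google, and is widely regarded as one of the best and most accurate open-source OCR systems. Besides its high accuracy, it is also very flexible: it can be trained to recognize any font (as long as the style stays consistent), and it can recognize Unicode characters.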

sudo apt-get install tesseract-ocr

export TESSDATA_PREFIX=/usr/local/share/ # where the trained data files are stored
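
Before calling Tesseract from Python, it can help to confirm that the binary is on the PATH and to see what TESSDATA_PREFIX is currently set to. The following is a minimal sketch using only the standard library; the helper name tesseract_environment is hypothetical and not part of Tesseract or Pillow.

```python
import os
import shutil


def tesseract_environment():
    """Report where the tesseract binary is and what TESSDATA_PREFIX holds.

    Hypothetical helper for illustration; it only inspects the environment
    and does not run Tesseract.
    """
    return {
        # Full path to the executable, or None if it is not on the PATH
        "binary": shutil.which("tesseract"),
        # Value of the exported variable, or None if it is unset
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    info = tesseract_environment()
    print("tesseract binary:", info["binary"])
    print("TESSDATA_PREFIX:", info["tessdata_prefix"])
```

If "binary" comes back as None, the subprocess calls later in this article will fail because the tesseract command cannot be found.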

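11.2 Processing Well-Formatted Text

Well-formatted text has the following characteristics:

It uses one standard font (no handwriting)
Even if it was photocopied or photographed, the type is still crisp, with no stray marks or smudges
It is neatly aligned, with no slanted text
It does not run off the image, and it is not cut off or pressed tightly against the edge of the image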

$ tesseract text.tif textoutput && cat textoutput.txt

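You can clean up the image with a Python script first. Using the Pillow library, create a threshold filter that removes the gradient background color and leaves only the text, which makes the image clearer and easier for Tesseract to read: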

# -*- coding: utf-8 -*-
from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)
    # Apply a threshold filter to the image, then save it
    image = image.point(lambda x: 0 if x < 125 else 255)
    image.save(newFilePath)
    # Call the system tesseract command to run OCR on the cleaned image
    subprocess.call(["tesseract", newFilePath, "output"])
    # Open the output file and print the result
    with open("output.txt", "r") as outputFile:
        print(outputFile.read())

cleanFile("text_2.jpg", "text_2_clean.png")
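
To see what the threshold filter does to individual pixel values, here is a minimal sketch that applies the same rule to a plain list of numbers, without Pillow. The function name threshold and the sample row are made up for illustration; the cutoff of 125 matches the script above.

```python
def threshold(value, cutoff=125):
    """Map one 0-255 channel value to pure black (0) or pure white (255)."""
    return 0 if value < cutoff else 255


# A hypothetical row of grayscale pixels: dark text on a light gradient
row = [12, 90, 124, 125, 180, 240]
cleaned = [threshold(v) for v in row]
print(cleaned)  # → [0, 0, 0, 255, 255, 255]
```

Everything darker than the cutoff becomes black and everything else becomes white, which is how the gradient background disappears. Pillow's Image.point applies the function to each band, so on a color image every channel is thresholded separately; converting to grayscale first with image.convert("L") is a common variation, and the best cutoff depends on the image.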

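Scraping text from images on a website

The example uses the large-print edition of Tolstoy's War and Peace on Amazon: http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200

The script downloads the preview images, runs OCR on each one, and prints the text of each image.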

# -*- coding: utf-8 -*-
import time
from urllib.request import urlretrieve
import subprocess
from selenium import webdriver

# Create a new Selenium driver.
# This script uses the Selenium 3 API (executable_path, find_element_by_*);
# newer Selenium 4 releases removed these in favor of driver.find_element(By.ID, ...).
# PhantomJS sometimes has trouble locating elements where Firefox does not,
# so Firefox is used here. To try PhantomJS instead, uncomment the next line
# and comment out the Firefox line.
#driver = webdriver.PhantomJS(executable_path='phantomjs-2.1.1-linux-x86_64/bin/phantomjs')
driver = webdriver.Firefox(executable_path="/usr/lib/python3.4/geckodriver")

driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
time.sleep(2)

print("finished 1")
# Click the book preview button
driver.find_element_by_id("sitbLogoImg").click()
imageList = set()

# Wait for the page to finish loading
time.sleep(5)
# While the right arrow is clickable, keep turning pages
while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
    driver.find_element_by_id("sitbReaderRightPageTurner").click()
    time.sleep(2)
    # Collect the newly loaded pages (several can load at once; the set drops duplicates).
    # find_elements (plural) returns a list; find_element returns a single element
    # and cannot be iterated.
    pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
    for page in pages:
        image = page.get_attribute("src")
        imageList.add(image)
driver.quit()

# Run Tesseract on the image URLs we collected
for image in sorted(imageList):
    urlretrieve(image, "page.jpg")
    p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()  # wait for Tesseract to finish and drain its pipes
    with open("page.txt", "r") as f:
        print(f.read())
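
The scraping loop collects image URLs in a set because the reader can load several pages per click, so the same page may be seen more than once. A minimal sketch of that behavior follows; the URLs and batches are made up for illustration and are not real Amazon data.

```python
# Hypothetical batches of image URLs, as if seen after successive clicks
batches = [
    ["http://example.com/p1.jpg", "http://example.com/p2.jpg"],
    ["http://example.com/p2.jpg", "http://example.com/p3.jpg"],
    ["http://example.com/p3.jpg", "http://example.com/p10.jpg"],
]

image_urls = set()
for batch in batches:
    for url in batch:
        image_urls.add(url)  # adding a URL that is already present is a no-op

# sorted() orders the URLs as strings, so "p10" sorts before "p2"
for url in sorted(image_urls):
    print(url)
```

Six URLs go in and four unique ones come out. Note that sorting the set orders the URLs lexicographically, which is not necessarily the reading order of the pages; if page order matters, it may be safer to record URLs in a list as they are first seen.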

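11.3 Reading CAPTCHAs and Training Tesseract

A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a Turing test used to tell humans from bots.

The popular PHP content management system Drupal has a well-known CAPTCHA module (https://www.drupal.org/project/captcha) that can generate CAPTCHAs of varying difficulty.

Training Tesseract
An online tool: Tesseract OCR Chopper (http://pp19dd.com/tesseract-ocr-chopper/)

I wrote a Python solution (https://github.com/REMitchell/tesseract-trainer) to handle the same ...

I recommend reading the Tesseract documentation (https://github.com/tesseract-ocr/tesseract/wiki) carefully.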

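11.4 Retrieving CAPTCHAs and Submitting Solutions

A common approach is to download the CAPTCHA image to disk, clean it up, run Tesseract on it, and then return a recognition result that meets the website's requirements.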

http://pythonscraping.com/humans-only

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
import subprocess
import requests
from PIL import Image
from PIL import ImageOps

def cleanImage(imagePath):
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x<143 else 255)
    borderImage = ImageOps.expand(image,border=20,fill='white')
    borderImage.save(imagePath)

html = urlopen("http://www.pythonscraping.com/humans-only")
bsObj = BeautifulSoup(html, "html.parser")
#Gather prepopulated form values
imageLocation = bsObj.find("img", {"title": "Image CAPTCHA"})["src"]
formBuildId = bsObj.find("input", {"name":"form_build_id"})["value"]
captchaSid = bsObj.find("input", {"name":"captcha_sid"})["value"]
captchaToken = bsObj.find("input", {"name":"captcha_token"})["value"]

captchaUrl = "http://pythonscraping.com"+imageLocation
urlretrieve(captchaUrl, "captcha.jpg")
cleanImage("captcha.jpg")
p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.communicate()  # wait for Tesseract to finish and drain its pipes
with open("captcha.txt", "r") as f:
    # Clean any whitespace characters
    captchaResponse = f.read().replace(" ", "").replace("\n", "")
print("Captcha solution attempt: " + captchaResponse)

if len(captchaResponse) == 5:
    params = {"captcha_token": captchaToken, "captcha_sid": captchaSid,
              "form_id": "comment_node_page_form", "form_build_id": formBuildId,
              "captcha_response": captchaResponse, "name": "Ryan Mitchell",
              "subject": "I come to seek the Grail",
              "comment_body[und][0][value]": "...and I am definitely not a bot"}
    r = requests.post("http://www.pythonscraping.com/comment/reply/10",
                      data=params)
    # Name the parser explicitly, as in the first BeautifulSoup call
    responseObj = BeautifulSoup(r.text, "html.parser")
    if responseObj.find("div", {"class": "messages"}) is not None:
        print(responseObj.find("div", {"class": "messages"}).get_text())
else:
    print("There was a problem reading the CAPTCHA correctly!")
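
The whitespace cleanup and the length check in the script above can be factored into one small helper, which makes the "meets the website's requirements" step easier to test on its own. This is a minimal sketch; the function name normalize_captcha is hypothetical, and the default length of 5 is taken from the check in the script, so other sites may need a different value.

```python
def normalize_captcha(raw_text, expected_length=5):
    """Strip whitespace from OCR output; return None if the length is wrong.

    expected_length=5 mirrors the check in the script above (assumption:
    the target site always uses five-character CAPTCHAs).
    """
    candidate = "".join(raw_text.split())  # removes spaces, tabs, and newlines
    return candidate if len(candidate) == expected_length else None


print(normalize_captcha("a B3 kq\n"))  # → aB3kq
print(normalize_captcha("ab\n"))       # → None
```

Returning None for a result of the wrong length lets the caller skip the form submission and fetch a fresh CAPTCHA instead of posting a guess that is certain to fail.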