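As a deep learning practitioner (an "alchemist", as we jokingly call ourselves), I deal with large amounts of image data every day, and that usually means writing Python scripts to process the images automatically. Python is known for its concise syntax, but its execution speed is mediocre. When a job involves tens of thousands of images or more, a lot of time goes into data preprocessing. Multiprocessing lets the program use all the cores of a multi-core CPU and can greatly improve throughput. (In CPython the global interpreter lock prevents threads from running Python bytecode on several cores at once, so multithreading is not recommended for this kind of CPU-bound work.)
Take color-to-grayscale conversion as an example: convert every image in one folder to grayscale and write the results to another folder: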
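import multiprocessing
import time
from pathlib import Path

import cv2


def bgr2gray(img_file):
    """Read one image, convert it to grayscale, and save it under the 'gray' folder."""
    img = cv2.imread(str(img_file))
    if img is None:  # not a readable image, skip it
        return
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Output path: the input path with "val2017" replaced by "gray"
    cv2.imwrite(str(img_file).replace("val2017", "gray"), img_gray)


if __name__ == '__main__':
    img_files = list(Path(r"../datasets/val2017").glob("*"))
    # cv2.imwrite does not create missing folders, so make sure the output folder exists
    Path(r"../datasets/gray").mkdir(parents=True, exist_ok=True)

    # Single process
    start = time.time()
    for img_file in img_files:
        bgr2gray(img_file)
    end = time.time()
    print(end - start, 's')

    # Multiple processes
    start = time.time()
    p = multiprocessing.Pool(8)  # start 8 worker processes
    p.map(bgr2gray, img_files)
    p.close()
    p.join()
    end = time.time()
    print(end - start, 's')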
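The output path in the script above is built with str.replace, which rewrites every occurrence of "val2017" in the path, including one that happens to sit in a parent folder or in the file name. A minimal sketch of a stricter alternative with pathlib is shown here; `gray_path` and the example file name are hypothetical and only serve as illustration, they are not part of the original script.

```python
from pathlib import Path


def gray_path(img_file, out_dir):
    """Return the output path: same file name as img_file, placed inside out_dir."""
    return Path(out_dir) / Path(img_file).name


# Hypothetical input file, used only to show the mapping.
src = Path("../datasets/val2017/example.jpg")
dst = gray_path(src, "../datasets/gray")
print(dst)
```

Inside bgr2gray the call would then become cv2.imwrite(str(gray_path(img_file, out_dir)), img_gray), with out_dir chosen by the caller.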
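Results:
In the first three runs the input folder held 128 images, and the multi-process version was actually slower than the single-process one, probably because creating the processes takes time. In the last three runs the folder held 5000 images, and the multi-process version was much faster than the single-process one. The speedup is not proportional to the number of processes, though: throughput stays below the process count times the single-process throughput, is bounded by the machine's hardware, and gradually saturates as more processes are added.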
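The effect described above can be illustrated with a very rough cost model: a fixed cost for starting the pool plus the time of the busiest worker. This is only a sketch under simplifying assumptions (identical cost per image, no disk or memory bottleneck), and the numbers in it are hypothetical, not the measurements from my runs.

```python
import math


def estimated_seconds(n_images, per_image_s, workers, startup_s):
    """Rough wall-clock estimate: pool startup cost plus the busiest worker's share."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    images_per_worker = math.ceil(n_images / workers)
    overhead = startup_s if workers > 1 else 0.0  # a plain loop has no pool to start
    return overhead + images_per_worker * per_image_s


# Hypothetical numbers: 2 ms per image, 0.5 s to start an 8-process pool.
for n in (128, 5000):
    single = estimated_seconds(n, 0.002, 1, 0.5)
    multi = estimated_seconds(n, 0.002, 8, 0.5)
    print(f"{n} images: single {single:.3f}s, 8 processes {multi:.3f}s")
```

With these made-up numbers the pool loses on 128 images because the startup cost dominates, and wins on 5000 images, but by less than a factor of 8. Real saturation from disk I/O or memory bandwidth is not modeled.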
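Now a more complex example: repeatedly take 4 images from a folder, combine them into a mosaic, and write the result to another folder as 0.jpg, 1.jpg, 2.jpg, and so on, so that both folders end up with the same number of images. Unlike the previous example, each process has to save its images under a fixed naming scheme, so the processes must coordinate. The program below shows two approaches from the multiprocessing library, Process and Pool. Note that the Pool approach has to pass more than one argument to the worker function, which is why it differs from the previous example.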
from functools import partial
from multiprocessing import Pool, Process
from pathlib import Path

import cv2
import numpy as np

s = 512  # side length of one tile; the mosaic canvas is 2s x 2s
PROCESS_NUM = 8  # number of worker processes


def load_image(f, img_size):
    """Read an image and resize it so that its longer side equals img_size."""
    im = cv2.imread(f)  # BGR
    assert im is not None, f"Image not found or unreadable: {f}"
    h0, w0 = im.shape[:2]  # original height and width
    r = img_size / max(h0, w0)  # resize ratio
    if r != 1:  # resize only when the longer side is not already img_size
        interp = cv2.INTER_LINEAR if r > 1 else cv2.INTER_AREA
        im = cv2.resize(im, (int(w0 * r), int(h0 * r)), interpolation=interp)
    return im, (h0, w0), im.shape[:2]  # im, hw_original, hw_resized
def mosaic4(input_imgs, output_img_dir, n):
    """Method 1 worker: build len(input_imgs) // PROCESS_NUM mosaics, numbered from n."""
    # With the fork start method every child inherits the parent's NumPy RNG state,
    # so re-seed here; otherwise all processes draw the same "random" images.
    np.random.seed()
    for i in range(len(input_imgs) // PROCESS_NUM):
        choice = np.random.choice(len(input_imgs), 4)  # indices of the 4 source images
        yc, xc = (int(np.random.uniform(-x, 2 * s + x)) for x in [-s // 2, -s // 2])  # mosaic center y, x
        img4 = np.full((s * 2, s * 2, 3), 114, dtype=np.uint8)  # gray canvas that holds the 4 tiles

        img, _, (h, w) = load_image(str(input_imgs[choice[0]]), s)  # top left
        x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc  # xmin, ymin, xmax, ymax (large image)
        x1b, y1b, x2b, y2b = w - (x2a - x1a), h - (y2a - y1a), w, h  # xmin, ymin, xmax, ymax (small image)
        img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]

        img, _, (h, w) = load_image(str(input_imgs[choice[1]]), s)  # top right
        x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, s * 2), yc
        x1b, y1b, x2b, y2b = 0, h - (y2a - y1a), min(w, x2a - x1a), h
        img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]

        img, _, (h, w) = load_image(str(input_imgs[choice[2]]), s)  # bottom left
        x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(s * 2, yc + h)
        x1b, y1b, x2b, y2b = w - (x2a - x1a), 0, w, min(y2a - y1a, h)
        img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]

        img, _, (h, w) = load_image(str(input_imgs[choice[3]]), s)  # bottom right
        x1a, y1a, x2a, y2a = xc, yc, min(xc + w, s * 2), min(s * 2, yc + h)
        x1b, y1b, x2b, y2b = 0, 0, min(w, x2a - x1a), min(y2a - y1a, h)
        img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]

        cv2.imwrite(output_img_dir + str(i + n) + ".jpg", img4)
def mosaic_4(output_img_dir, img_list):
    """Method 2 worker: img_list holds 4 image paths followed by the output index (as a string)."""
    yc, xc = (int(np.random.uniform(-x, 2 * s + x)) for x in [-s // 2, -s // 2])  # mosaic center y, x
    img4 = np.full((s * 2, s * 2, 3), 114, dtype=np.uint8)  # gray canvas that holds the 4 tiles

    img, _, (h, w) = load_image(img_list[0], s)  # top left
    x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc  # xmin, ymin, xmax, ymax (large image)
    x1b, y1b, x2b, y2b = w - (x2a - x1a), h - (y2a - y1a), w, h  # xmin, ymin, xmax, ymax (small image)
    img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]

    img, _, (h, w) = load_image(img_list[1], s)  # top right
    x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, s * 2), yc
    x1b, y1b, x2b, y2b = 0, h - (y2a - y1a), min(w, x2a - x1a), h
    img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]

    img, _, (h, w) = load_image(img_list[2], s)  # bottom left
    x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(s * 2, yc + h)
    x1b, y1b, x2b, y2b = w - (x2a - x1a), 0, w, min(y2a - y1a, h)
    img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]

    img, _, (h, w) = load_image(img_list[3], s)  # bottom right
    x1a, y1a, x2a, y2a = xc, yc, min(xc + w, s * 2), min(s * 2, yc + h)
    x1b, y1b, x2b, y2b = 0, 0, min(w, x2a - x1a), min(y2a - y1a, h)
    img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]

    cv2.imwrite(output_img_dir + img_list[4] + ".jpg", img4)
if __name__ == '__main__':
    input_img_dir = r"../datasets/val2017/"
    output_img_dir = r"../datasets/mosaic/"
    Path(output_img_dir).mkdir(parents=True, exist_ok=True)  # cv2.imwrite does not create folders
    input_imgs = list(Path(input_img_dir).glob('*'))
    N = len(input_imgs)

    # Method 1: create the Process objects by hand.
    # Process i writes the outputs numbered from i * N // PROCESS_NUM.
    processes = []
    for i in range(PROCESS_NUM):
        processes.append(Process(target=mosaic4, args=(input_imgs, output_img_dir, i * N // PROCESS_NUM)))
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()

    # Method 2: process pool.
    img_lists = []  # one entry per mosaic: 4 image paths plus the output index
    for i in range(N):
        choice = np.random.choice(N, 4)  # 4 random image indices for this mosaic
        img_list = [str(input_imgs[j]) for j in choice]  # paths of the 4 images
        img_list.append(str(i))  # output index, used as the file name
        img_lists.append(img_list)
    p = Pool(PROCESS_NUM)
    # partial binds output_img_dir, because map passes only one argument to the function
    func = partial(mosaic_4, output_img_dir)
    p.map(func, img_lists)
    p.close()
    p.join()
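One detail of Method 1 is worth spelling out: each process produces N // PROCESS_NUM mosaics, so when N is not a multiple of PROCESS_NUM some output numbers are never written and the two folders end up with different counts. The sketch below shows one possible way to split the output indices into contiguous, non-overlapping ranges that cover every index. `split_ranges` is a hypothetical helper for illustration, not part of the program above.

```python
def split_ranges(total, parts):
    """Split range(total) into `parts` contiguous ranges whose sizes differ by at most 1."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    # Boundary i is i * total // parts, the same start offset Method 1 uses.
    bounds = [i * total // parts for i in range(parts + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(parts)]


# Example: 10 outputs shared by 4 processes.
for worker_id, idx_range in enumerate(split_ranges(10, 4)):
    print(worker_id, list(idx_range))
```

Each process would then loop over its own range and save to str(index) + ".jpg". The start offsets match i * N // PROCESS_NUM from Method 1, and the remainder is covered as well.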
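Measured with the timing code, a single process takes 134.6 s to process 5000 images, Method 1 with 8 processes takes 47.2 s, and Method 2 with 8 processes takes 25.3 s, so the Pool approach was the fastest of the three in this test. While debugging I noticed that with Process the individual processes started and finished at noticeably different times, whereas with Pool the start and end times were very close. My guess is that the process pool has a more efficient implementation, so its scheduling overhead is lower than creating the processes yourself in a loop.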
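To check the observation about start and end times, each task can report which process ran it and when. Here is a minimal sketch using only the standard library. `timed_call` and `summarize` are hypothetical helper names, and `work` is a stand-in for the real per-image function. The demonstration runs serially, so it shows only the bookkeeping, not real parallel timings.

```python
import os
import time


def timed_call(func, arg):
    """Run func(arg) and return (pid, start, end) as wall-clock timestamps."""
    start = time.time()
    func(arg)
    end = time.time()
    return os.getpid(), start, end


def summarize(records):
    """Group (pid, start, end) records by pid: {pid: (first_start, last_end, n_tasks)}."""
    summary = {}
    for pid, start, end in records:
        first, last, count = summary.get(pid, (start, end, 0))
        summary[pid] = (min(first, start), max(last, end), count + 1)
    return summary


def work(x):
    """Stand-in for the real per-image function."""
    return x * x


# Serial demonstration of the bookkeeping.
records = [timed_call(work, i) for i in range(5)]
print(summarize(records))
```

With a pool, the records would instead come from p.map(partial(timed_call, work), items), with partial imported from functools. Comparing first_start and last_end per pid then shows how evenly the processes were kept busy.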