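Requirement: process a PDF file by adding the text recognized by OCR to a Word document and using the page image as the Word background, which in effect converts the PDF to Word. The Java class below handles the image side: it covers the text regions of the page image with a white block.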
import java.awt.AlphaComposite;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.imageio.ImageIO;
public class WaterPic {
    public static void main(String[] args) {
        // Sample data: each map marks the position of one text block, which is also
        // the rectangle to be blanked out of the image (all values in pixels).
        Map<String, Integer> map0 = new HashMap<String, Integer>();
        map0.put("width", 869);
        map0.put("height", 254);
        map0.put("horizontal", 77);
        map0.put("vertical", 424);
        Map<String, Integer> map1 = new HashMap<String, Integer>();
        map1.put("width", 786);
        map1.put("height", 100);
        map1.put("horizontal", 159);
        map1.put("vertical", 703);
        Map<String, Integer> map2 = new HashMap<String, Integer>();
        map2.put("width", 686);
        map2.put("height", 149);
        map2.put("horizontal", 260);
        map2.put("vertical", 826);
        Map<String, Integer> map3 = new HashMap<String, Integer>();
        map3.put("width", 797);
        map3.put("height", 129);
        map3.put("horizontal", 148);
        map3.put("vertical", 998);
        Map<String, Integer> map4 = new HashMap<String, Integer>();
        map4.put("width", 870);
        map4.put("height", 99);
        map4.put("horizontal", 73);
        map4.put("vertical", 1128);
        List<Map<String, Integer>> list = new ArrayList<Map<String, Integer>>();
        list.add(map0);
        list.add(map1);
        list.add(map2);
        list.add(map3);
        list.add(map4);
        // Blank out every region in turn; each call rewrites the source image in place.
        WaterPic w = new WaterPic();
        for (Map<String, Integer> m : list) {
            w.watermark("C:/Users/Administrator/Desktop/fileSource/elang.png", m.get("horizontal"), m.get("vertical"), m.get("width"), m.get("height"), 1f);
        }
    }
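    /**
     * Draws the overlay image (block.png) onto the source image at the given
     * rectangle and writes the result back to the source file.
     *
     * @param sourceFilePath
     *            path of the source image; it is overwritten with the result
     * @param x
     *            X offset from the top-left corner
     * @param y
     *            Y offset from the top-left corner
     * @param width
     *            width of the region to cover
     * @param height
     *            height of the region to cover
     * @param alpha
     *            opacity from 0.0 to 1.0: fully transparent to fully opaque
     */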
    public void watermark(String sourceFilePath, int x, int y, int width, int height, float alpha) {
        File file = new File(sourceFilePath);
        // block.png is a plain white (blank) image
        File waterFile = new File("C:/Users/Administrator/Desktop/fileSource/block.png");
        try {
            // Load the base image
            BufferedImage buffImg = ImageIO.read(file);
            // Load the overlay image
            BufferedImage waterImg = ImageIO.read(waterFile);
            // Create a Graphics2D object for drawing on the base image
            Graphics2D g2d = buffImg.createGraphics();
            // Set the compositing rule that controls blending and transparency
            g2d.setComposite(AlphaComposite.getInstance(AlphaComposite.SRC_ATOP, alpha));
            // Draw the overlay scaled to the target rectangle
            g2d.drawImage(waterImg, x, y, width, height, null);
            g2d.dispose(); // release the system resources held by the graphics context
            // Save the image back to the source path, using the file extension as the format name
            int temp = sourceFilePath.lastIndexOf(".") + 1;
            ImageIO.write(buffImg, sourceFilePath.substring(temp), new File(sourceFilePath));
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }
}
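As an alternative sketch, not the code used above: the same blanking can be done by filling the rectangles with white directly. This avoids the extra block.png and lets all regions be painted on one in-memory image before a single write. `RegionMasker` and `maskRegions` are hypothetical names, and the demo runs on a small generated image rather than the real page scan.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

class RegionMasker {
    /** Paints each {x, y, width, height} region white on the given image, in place. */
    static void maskRegions(BufferedImage img, int[][] regions) {
        Graphics2D g2d = img.createGraphics();
        try {
            g2d.setColor(Color.WHITE);
            for (int[] r : regions) {
                g2d.fillRect(r[0], r[1], r[2], r[3]);
            }
        } finally {
            g2d.dispose(); // always release the graphics context
        }
    }

    public static void main(String[] args) {
        // Assumption: a 100x100 all-black test image stands in for the page scan.
        BufferedImage img = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        maskRegions(img, new int[][] {{10, 10, 20, 20}});
        // A pixel inside the region is now white, one outside is still black.
        System.out.println(Integer.toHexString(img.getRGB(15, 15) & 0xFFFFFF));
        System.out.println(Integer.toHexString(img.getRGB(50, 50) & 0xFFFFFF));
    }
}
```

With a real page, the image would be read once with `ImageIO.read`, masked, and written once with `ImageIO.write`, instead of being read and rewritten for every region.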
Before the cutout, the image looks like this:
And after the cutout:
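The next step is to use this image as the background of the Word document and then, working in Word, place each paragraph of text into the document as a text box.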
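Placing a text box, like blanking a region, needs an origin plus a width and height. Assuming the positions come from hOCR `bbox x0 y0 x1 y1` values, the conversion can be sketched as follows. `HocrBbox` and `toRect` are hypothetical names; the sample value is the bbox of `block_1_2` from the OCR output, which yields the same numbers as the first hard-coded rectangle in `main`.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HocrBbox {
    // Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
    private static final Pattern BBOX =
            Pattern.compile("bbox (\\d+) (\\d+) (\\d+) (\\d+)");

    /** Returns {x, y, width, height} parsed from an hOCR title attribute value. */
    static int[] toRect(String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) {
            throw new IllegalArgumentException("no bbox in: " + title);
        }
        int x0 = Integer.parseInt(m.group(1));
        int y0 = Integer.parseInt(m.group(2));
        int x1 = Integer.parseInt(m.group(3));
        int y1 = Integer.parseInt(m.group(4));
        // bbox stores two corners; width and height are the differences.
        return new int[] {x0, y0, x1 - x0, y1 - y0};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toRect("bbox 77 424 946 678")));
        // prints [77, 424, 869, 254]
    }
}
```

The values are in image pixels; mapping them to Word units depends on the page size and resolution, which this sketch does not cover.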
First, take the hOCR file produced by running OCR on the image and turn it into an HTML file; simply changing the file extension is enough.
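If the extension change should happen in code rather than by hand, a minimal sketch could copy the file under a new name. `HocrToHtml` and `copyAsHtml` are hypothetical names, and copying instead of renaming is a choice made here so that the original OCR output is kept.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class HocrToHtml {
    /** Copies e.g. page.hocr to page.html in the same directory and returns the new path. */
    static Path copyAsHtml(Path hocr) throws IOException {
        String name = hocr.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        Path html = hocr.resolveSibling(base + ".html");
        return Files.copy(hocr, html, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary file stands in for the real OCR output.
        Path dir = Files.createTempDirectory("hocr-demo");
        Path hocr = dir.resolve("page.hocr");
        Files.write(hocr, "<div class='ocr_page'></div>".getBytes(StandardCharsets.UTF_8));
        System.out.println(copyAsHtml(hocr).getFileName());
    }
}
```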
The HTML file looks like this:
<div class='ocr_page' id='page_1' title='image "/data/translate/mupdf/fileSource/elang.png"; bbox 0 0 1002 1417; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 72 0 956 406">
<p class='ocr_par' id='par_1_1' lang='rus' title="bbox 72 0 956 406">
<span class='ocr_line' id='line_1_1' title="bbox 72 0 956 406; baseline 0 1011; x_size 169.33334; x_descenders 42.333336; x_ascenders 42.333332"><span class='ocrx_word' id='word_1_1' title='bbox 72 0 956 406; x_wconf 95'> </span>
</span>
</p>
</div>
<div class='ocr_carea' id='block_1_2' title="bbox 77 424 946 678">
<p class='ocr_par' id='par_1_2' lang='rus' title="bbox 77 424 946 678">
<span class='ocr_line' id='line_1_2' title="bbox 77 424 946 448; baseline 0 -7; x_size 21; x_descenders 3; x_ascenders 5">
<span class='ocrx_word' id='word_1_2' title='bbox 77 425 246 445; x_wconf 96'>Дисциплины:</span>
<span class='ocrx_word' id='word_1_3' title='bbox 257 430 295 446; x_wconf 95'>два</span>
<span class='ocrx_word' id='word_1_4' title='bbox 309 430 450 448; x_wconf 95'>иностранных</span>
<span class='ocrx_word' id='word_1_5' title='bbox 461 430 522 444; x_wconf 96'>языка</span>
<span class='ocrx_word' id='word_1_6' title='bbox 534 424 667 446; x_wconf 95'>(английский,</span>
<span class='ocrx_word' id='word_1_7' title='bbox 680 424 789 445; x_wconf 94'>немецкий,</span>
<span class='ocrx_word' id='word_1_8' title='bbox 802 425 946 448; x_wconf 96'>французский,</span>
</span>
<span class='ocr_line' id='line_1_3' title="bbox 79 450 946 473; baseline 0.003 -6; x_size 21; x_descenders 4; x_ascenders 5">
<span class='ocrx_word' id='word_1_9' title='bbox 79 451 202 470; x_wconf 95'>испанский,</span>
<span class='ocrx_word' id='word_1_10' title='bbox 212 451 306 472; x_wconf 95'>датский,</span>
<span class='ocrx_word' id='word_1_11' title='bbox 317 452 449 473; x_wconf 95'>норвежский,</span>
<span class='ocrx_word' id='word_1_12' title='bbox 459 451 568 473; x_wconf 94'>шведский,</span>
<span class='ocrx_word' id='word_1_13' title='bbox 578 450 692 469; x_wconf 95'>китайский,</span>
<span class='ocrx_word' id='word_1_14' title='bbox 701 450 806 472; x_wconf 95'>турецкий,</span>
<span class='ocrx_word' id='word_1_15' title='bbox 815 456 878 468; x_wconf 96'>языки</span>
<span class='ocrx_word' id='word_1_16' title='bbox 888 456 946 473; x_wconf 96'>стран</span>
</span>
<span class='ocr_line' id='line_1_4' title="bbox 78 477 946 500; baseline 0.001 -7; x_size 23; x_descenders 6; x_ascenders 4">
<span class='ocrx_word' id='word_1_17' title='bbox 78 481 250 494; x_wconf 96'>постсоветского</span>
<span class='ocrx_word' id='word_1_18' title='bbox 259 477 373 500; x_wconf 96'>зарубежья</span>
<span class='ocrx_word' id='word_1_19' title='bbox 382 483 393 495; x_wconf 92'>и</span>
<span class='ocrx_word' id='word_1_20' title='bbox 401 479 442 500; x_wconf 92'>др.),</span>
<span class='ocrx_word' id='word_1_21' title='bbox 452 479 539 500; x_wconf 96'>История</span>
<span class='ocrx_word' id='word_1_22' title='bbox 548 480 723 498; x_wconf 96'>международных</span>
<span class='ocrx_word' id='word_1_23' title='bbox 731 477 857 497; x_wconf 93'>отношений,</span>
<span class='ocrx_word' id='word_1_24' title='bbox 866 478 946 494; x_wconf 91'>Консти-</span>
</span>
<span class='ocr_line' id='line_1_5' title="bbox 77 504 946 526; baseline 0.002 -8; x_size 23.95467; x_descenders 5.6516395; x_ascenders 5.5"><span class='ocrx_word' id='word_1_25' title='bbox 77 506 200 524; x_wconf 92'>туционное</span> <span class='ocrx_word' id='word_1_26' title='bbox 213 507 279 524; x_wconf 96'>право</span> <span class='ocrx_word' id='word_1_27' title='bbox 292 504 424 525; x_wconf 96'>зарубежных</span> <span class='ocrx_word' id='word_1_28' title='bbox 436 509 500 526; x_wconf 91'>стран,</span> <span class='ocrx_word' id='word_1_29' title='bbox 513 504 607 525; x_wconf 96'>Мировая</span> <span class='ocrx_word' id='word_1_30' title='bbox 619 506 733 519; x_wconf 96'>экономика</span> <span class='ocrx_word' id='word_1_31' title='bbox 746 506 757 518; x_wconf 96'>и</span> <span class='ocrx_word' id='word_1_32' title='bbox 770 507 946 525; x_wconf 96'>международные</span>
</span>
<span class='ocr_line' id='line_1_6' title="bbox 77 528 946 551; baseline 0.001 -7; x_size 23.95467; x_descenders 5.6516395; x_ascenders 5.5"><span class='ocrx_word' id='word_1_33' title='bbox 77 532 251 544; x_wconf 94'>экономические</span> <span class='ocrx_word' id='word_1_34' title='bbox 264 533 391 548; x_wconf 92'>отношения,</span> <span class='ocrx_word' id='word_1_35' title='bbox 403 531 568 546; x_wconf 96'>Экономическая</span> <span class='ocrx_word' id='word_1_36' title='bbox 580 532 714 548; x_wconf 94'>дипломатия,