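Main library used: PDFTextStream-2.2.1.jar (a PDFTextStream object, fed from a FileInputStream, reads the PDF, and a RegionOutputTarget object extracts the text inside a given region of a page)
inputFilePath is the file path plus the file name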
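File file = new File(this.inputFilePath);
InputStream is = new FileInputStream(file);
this.stream = new PDFTextStream(is, this.inputFilePath);
StringBuffer sb = new StringBuffer();
// Size of the region to read; pages after the first use a narrower region.
int w = 680;
int h = 1600;
for (int i = 0; i < this.stream.getPageCnt(); i++) {
    try {
        if (i > 0) {
            w = 580;
        }
        // Register one named region, pipe the page through it, and collect its text.
        RegionOutputTarget tgt1 = new RegionOutputTarget();
        tgt1.addRegion(1, 1, w, h, "all");
        Page p1 = this.stream.getPage(i);
        p1.pipe(tgt1);
        sb.append(tgt1.getRegionText("all"));
    } catch (Exception e) {
        // Placeholder error handling; adapt as needed.
        e.printStackTrace();
    }
}
String allTxt = sb.toString();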
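Use allRows = allTxt.split("\n"); to turn each line of the content into one array element.
To find the row that contains a string key, use
allRows[m].toUpperCase().contains(key.toUpperCase());
To find the column position of key within a row, use
allRows[row].toUpperCase().indexOf(key.toUpperCase());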
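The two lookups above can be wrapped in small helpers. This is only a minimal sketch: the class name TextLocator and the method names findRow and findColumn are made up for illustration, and returning -1 when the key is not found is an assumption, not something the original code states.

```java
import java.util.Locale;

// Hypothetical helper class; the names are illustrative only.
final class TextLocator {

    /** Index of the first row that contains key (case-insensitive), or -1 if none does. */
    static int findRow(String[] allRows, String key) {
        String upperKey = key.toUpperCase(Locale.ROOT);
        for (int m = 0; m < allRows.length; m++) {
            if (allRows[m].toUpperCase(Locale.ROOT).contains(upperKey)) {
                return m;
            }
        }
        return -1;
    }

    /** Column of key inside the given row (case-insensitive), or -1 if it is absent. */
    static int findColumn(String[] allRows, int row, String key) {
        return allRows[row].toUpperCase(Locale.ROOT)
                .indexOf(key.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String allTxt = "Invoice No: 12345\nTotal Amount: 99.50\nThank you";
        String[] allRows = allTxt.split("\n");
        int row = findRow(allRows, "total amount");
        int col = findColumn(allRows, row, "amount");
        System.out.println(row + "," + col); // prints "1,6"
    }
}
```

Locale.ROOT is used so the upper-casing does not depend on the machine's default locale.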
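Getting the text of a given area
getAreaValue(int startRow, int endRow, int beginPos, int endPos, String allTxt)
How getAreaValue works:
Loop with for from the start row to the end row
for (int i = startRow; i <= endRow && i < allRows.length; ++i)
For each row, take the substring from the start column to the end column (tag_end below)
allRows[i].substring(beginPos, tag_end);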
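Putting those steps together, getAreaValue might look roughly like the sketch below. The original body is not shown, so several details here are assumptions: the class name AreaExtractor is made up, the end column is clamped to each line's length, lines shorter than beginPos contribute an empty string, and the extracted pieces are joined with "\n".

```java
// Hypothetical wrapper class; only the getAreaValue signature comes from the notes above.
final class AreaExtractor {

    static String getAreaValue(int startRow, int endRow, int beginPos, int endPos, String allTxt) {
        String[] allRows = allTxt.split("\n");
        StringBuilder sb = new StringBuilder();
        int begin = Math.max(beginPos, 0);
        for (int i = Math.max(startRow, 0); i <= endRow && i < allRows.length; ++i) {
            String line = allRows[i];
            // Assumption: clamp the end column so short lines do not throw.
            int tagEnd = Math.min(endPos, line.length());
            if (begin < tagEnd) {
                sb.append(line, begin, tagEnd);
            }
            // Assumption: one output line per input row.
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String allTxt = "abcdef\nghi\njklmnop";
        // Rows 0..2, columns 1 (inclusive) to 4 (exclusive).
        System.out.print(getAreaValue(0, 2, 1, 4, allTxt));
    }
}
```

Clamping avoids a StringIndexOutOfBoundsException when a row is shorter than endPos, which is common in text extracted from PDFs.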
Problems encountered:
1. How to remove Chinese characters
public static String pureAscii(String strTem) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < strTem.length(); i++) {
        char ch = strTem.charAt(i);
        // char is unsigned, so only the upper bound needs checking
        if (ch < 127) {
            sb.append(ch);
        }
    }
    return sb.toString();
}
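As an alternative to the loop in pureAscii, the same filtering can be sketched with a regular expression. This is a different technique from the one used above, and the class name AsciiFilter is made up for illustration. The character class keeps code units 0 to 126, matching the ch < 127 rule.

```java
// Hypothetical class; a regex-based variant of the pureAscii idea.
final class AsciiFilter {

    /** Removes every character outside the range 0x00..0x7E. */
    static String pureAscii(String text) {
        return text.replaceAll("[^\\x00-\\x7E]", "");
    }

    public static void main(String[] args) {
        // \u6587\u672C are two Chinese characters, written as escapes
        // so the example does not depend on the source file encoding.
        System.out.println(pureAscii("PDF\u6587\u672Cabc123")); // prints "PDFabc123"
    }
}
```

Note that both versions drop all non-ASCII characters, not only Chinese ones, so accented Latin letters and full-width punctuation are removed as well.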