Reading values at specific positions in Word and PDF files

Latest recommended article published on 2024-02-29 13:19:00

samir张三



Category: Tools. Tags: tools, word, pdf

Article link: https://blog.csdn.net/u013786328/article/details/81905789


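To read the values at specific positions in a Word or PDF contract, this article proposes the following solution: copy the document, replace each position to be read with a placeholder of the form ${username}, then compare the content of the original file with that of the modified file and match up the differences to find the values, e.g. username=张三.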


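I recently had a requirement to read the values at certain positions in contracts stored as Word or PDF files. The positions to read are not fixed in advance; they have to be user-definable.

After thinking it over, I took the following approach.

1. Make a copy of the Word/PDF file.

2. Edit the copy as required, changing only the positions whose values need to be read, and replace each of them with a placeholder of the form ${username}.

3. Use a program to read the full text content of both the original Word/PDF file and the modified copy.

4. Compare the two extracted texts.

5. Match up the places that were changed, e.g. username=张三, and output the result.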
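
The matching idea in steps 4 and 5 can be illustrated with a much simpler technique than the diff used later in this article. The sketch below is not the author's method: it turns the template into a regular expression in which every ${name} placeholder becomes a capture group, and it assumes the two texts are character-for-character identical outside the placeholders. The class name `PlaceholderRegexSketch` and the method `extract` are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's code.
class PlaceholderRegexSketch {

    // Matches a placeholder such as ${username} and captures its name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /**
     * Extracts the value behind every placeholder of the template.
     * Assumption: template and original differ ONLY at the placeholders.
     * Returns an empty map if the original does not fit the template.
     */
    static Map<String, String> extract(String template, String original) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so it matches verbatim.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            regex.append("(.*?)"); // one capture group per placeholder
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Map<String, String> values = new LinkedHashMap<>();
        Matcher full = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(original);
        if (full.matches()) {
            for (int i = 0; i < names.size(); i++) {
                values.put(names.get(i), full.group(i + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String template = "User: ${username} | Password: ${pwd}";
        String original = "User: Zhang San | Password: 123456";
        for (Map.Entry<String, String> e : extract(template, original).entrySet()) {
            System.out.println(e.getKey() + "------" + e.getValue());
        }
    }
}
```

The limitation of this sketch is the assumption itself: any other difference between the two texts (extra whitespace from text extraction, a stray trailing character) makes the whole match fail, and adjacent placeholders with nothing between them are ambiguous. A diff can still line up the unchanged parts in such cases.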

Reading the Word/PDF content is omitted here; only the key part (the content comparison) is shown below.

Original file content

World模板替换
的点点滴滴多多多多多多多多多多 用户：张三 多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多
的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多 密码：123456 多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多
的点点滴滴多多多公司： 三七 多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多    
多的点点滴滴多多多多 年龄：18 多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多

Modified file content

World模板替换
的点点滴滴多多多多多多多多多多 用户：${username} 多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多
的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多 密码：${pwd} 多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多
的点点滴滴多多多公司： ${company} 多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多    
多的点点滴滴多多多多 年龄：${age} 多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多

Output

company------三七
pwd------123456
age------18
username------张三

The code is as follows:

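import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Note: StringUtil (blank check) and Diff (one diff entry exposing `operation` and `text`)
// are not shown in this article; import them from wherever they live in your project.

/**
 * Compares a placeholder template with the original text and extracts the value
 * behind each placeholder.
 *
 * @author Samir
 *
 */
public class CompareUtil {
	
	private static final Logger logger = LoggerFactory.getLogger(CompareUtil.class);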

	/**
	 * Compare two texts.
	 * Example:
	 * String altbe = "World模板替换\r\n"
				+ "的点点滴滴多多多多多多多多多多 用户：${username} 多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多 密码：${pwd} 多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "的点点滴滴多多多公司： ${company} 多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多    \r\n"
				+ "多的点点滴滴多多多多 年龄：${age} 多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "";
		String altaf = "World模板替换\r\n" + "的点点滴滴多多多多多多多多多多 用户：张三 多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多 密码：123456 多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "的点点滴滴多多多公司： 三七 多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多    \r\n"
				+ "多的点点滴滴多多多多 年龄：18 多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多sfdsfds\r\n"
				+ "";
	 * 
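	 * @param altbe the modified text, containing ${...} placeholders
	 * @param altaf the original text, containing the actual values
	 * @return map from placeholder name to the value found in the original text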
	 */
	public static Map<String, String> compareText(String altbe,String altaf){
		Map<String, String> smap = new HashMap<>();
		try {
			if(StringUtil.isBlank(altbe) || StringUtil.isBlank(altaf)) {
				return smap;
			}
			Map<Integer, String> imap = new HashMap<>();
			Integer num = 0;
			List<Integer> beList = rememberSpacing(altbe);
			List<Integer> afList = rememberSpacing(altaf);
			altbe = altbe.replace(" ", "");
			altaf = altaf.replace(" ", "");
			Diff_match_patch diff_match_patch = new Diff_match_patch();
			LinkedList<Diff> t = diff_match_patch.diff_main(altbe, altaf);
			StringBuffer s1 = new StringBuffer();
			StringBuffer s2 = new StringBuffer();
			Integer indexBe = 0;
			Integer indexAf = 0;
			for (Diff diff : t) {
				StringBuffer diffTextBe = new StringBuffer(diff.text);
				StringBuffer diffTextAf = new StringBuffer(diff.text);
				if (diff.operation.toString().equalsIgnoreCase("EQUAL")) {
					addSpacing(beList, indexBe, diffTextBe);
					addSpacing(afList, indexAf, diffTextAf);
					s1.append(diffTextBe);
					s2.append(diffTextAf);
					indexBe += diffTextBe.length();
					indexAf += diffTextAf.length();
				}
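				// A DELETE is text present only in altbe (the template), e.g. ${username}.
				Integer[] resa = appendString2("DELETE", diff, s1, s2, beList, indexBe,smap,imap,num);
				indexBe = resa[0];
				num = resa[1];
				// An INSERT is text present only in altaf (the original), e.g. the actual value.
				Integer[] resb = appendString2("INSERT", diff, s2, s1, afList, indexAf,smap,imap,num);
				indexAf = resb[0];
				num = resb[1];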
			}
			System.out.println(s1.toString());
			System.out.println(s2.toString());
		} catch (Exception e) {
			logger.error("Text comparison failed",e);
		}
		return smap;
	}
	
	/** Records the index of every space in str so the spaces can be restored after diffing. */
	private static List<Integer> rememberSpacing(String str) {
		List<Integer> list = new ArrayList<Integer>();
		for (int i = 0; i < str.length(); i++) {
			if (' ' == str.charAt(i)) {
				list.add(i);
			}
		}
		return list;
	}

	/** Re-inserts into str the recorded spaces that fall inside the segment starting at index. */
	private static void addSpacing(List<Integer> list, Integer index, StringBuffer str) {
		for (Integer o : list) {
			if (o >= index && o < index + str.length()) {
				str.insert(o - index, ' ');
			}
		}
	}

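	/**
	 * Handles one DELETE or INSERT entry: restores its spaces, appends it to sbOne wrapped in
	 * highlight markup, and, on every second change, stores the text under the placeholder
	 * name recorded for the previous change. Returns {new index, updated change counter}.
	 */
	private static Integer[] appendString2(String type, Diff diff, StringBuffer sbOne, StringBuffer sbTwo,
			List<Integer> list, Integer i,Map<String, String> smap,Map<Integer, String> imap,Integer num) {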
		Integer[] res = new Integer[2];
		Integer result = i;
		if (type.equals(diff.operation.toString())) {
			StringBuffer sb = new StringBuffer(diff.text);
			for (Integer o : list) {
				if (o >= i && o < i + sb.length()) {
					sb.insert(o - i, ' ');
				}
			}
			sbOne.append("<em class='f-required'>").append(sb).append("</em>");
			result = i + sb.length();
			num++;
			String str = getStr(sb,imap,num);
			if (num % 2 == 0 && null != imap.get(num - 1)) {
				smap.put(imap.get(num - 1), str);
			}
		}
		res[0] = result;
		res[1] = num;
		return res;
	}

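	/** Trims the text; if it is a ${name} placeholder, strips the wrapper and records the name under num. */
	private static String getStr(StringBuffer sb,Map<Integer, String> imap,Integer num) {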
		String str = sb.toString().trim();
		if (str.startsWith("${") && str.endsWith("}")) {
			str = str.substring(2, str.length() - 1);
			imap.put(num, str);
		}
		return str;
	}
	
	/*public static void main(String[] arge) {
		String altbe = "World模板替换\r\n"
				+ "的点点滴滴多多多多多多多多多多 用户：${username} 多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多 密码：${pwd} 多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "的点点滴滴多多多公司： ${company} 多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多    \r\n"
				+ "多的点点滴滴多多多多 年龄：${age} 多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "";
		String altaf = "World模板替换\r\n" + "的点点滴滴多多多多多多多多多多 用户：张三 多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多 密码：123456 多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多\r\n"
				+ "的点点滴滴多多多公司： 三七 多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多    \r\n"
				+ "多的点点滴滴多多多多 年龄：18 多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多的点点滴滴多多多多多多多多多多多多多多多多多多多多多多多多sfdsfds\r\n"
				+ "";
		Map<String, String> map = compareText(altbe, altaf);
		Set<String> sets = map.keySet();
		for (String str : sets) {
			System.out.println(str + "------" + map.get(str));
		}
	}*/
}
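
The pairing rule in appendString2 above relies on the changes alternating strictly (a placeholder removed, then a value inserted), which it tracks by checking whether the counter num is even. As a hedged illustration of the same idea, the sketch below remembers the pending placeholder explicitly and drops it when unchanged text is reached, so a stray change that is not a placeholder does not shift the pairing. `Op` and `Edit` are hypothetical stand-ins for the article's Operation and Diff types, and `DiffPairingSketch` is an illustrative name, not the author's implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; Op and Edit stand in for Operation and Diff.
class DiffPairingSketch {

    enum Op { DELETE, INSERT, EQUAL }

    static final class Edit {
        final Op op;
        final String text;

        Edit(Op op, String text) {
            this.op = op;
            this.text = text;
        }
    }

    /**
     * Pairs each deleted ${name} placeholder with the text inserted right after it.
     * Assumes the template is the "old" side of the diff, so placeholders show up
     * as DELETE entries and the real values as INSERT entries.
     */
    static Map<String, String> pair(List<Edit> edits) {
        Map<String, String> values = new LinkedHashMap<>();
        String pending = null; // placeholder name waiting for its value
        for (Edit e : edits) {
            String t = e.text.trim();
            switch (e.op) {
            case DELETE:
                // Only a well-formed ${name} starts a new pair.
                pending = (t.startsWith("${") && t.endsWith("}"))
                        ? t.substring(2, t.length() - 1)
                        : null;
                break;
            case INSERT:
                if (pending != null) {
                    values.put(pending, t);
                    pending = null;
                }
                break;
            case EQUAL:
                // Unchanged text ends any unfinished pair.
                pending = null;
                break;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<Edit> edits = Arrays.asList(
                new Edit(Op.EQUAL, "User: "),
                new Edit(Op.DELETE, "${username}"),
                new Edit(Op.INSERT, "Zhang San"),
                new Edit(Op.EQUAL, " Age: "),
                new Edit(Op.DELETE, "${age}"),
                new Edit(Op.INSERT, "18"));
        System.out.println(pair(edits));
    }
}
```

Like the original code, this only works when the diff reports each placeholder as a single DELETE entry; if the value happens to share characters with the placeholder text, a character-level diff may split the change into several entries.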

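import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;
import java.util.Map;

/**
 * Text diff implementation. The code appears to be adapted from the open-source
 * diff-match-patch library; the Diff class it uses is not shown in this excerpt.
 *
 * @author Samir
 *
 */
public class Diff_match_patch {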

	// Defaults.
	// Set these on your diff_match_patch instance to override the defaults.

	/**
	 * Number of seconds to map a diff before giving up (0 for infinity).
	 */
	public float Diff_Timeout = 1.0f;
	/**
	 * Cost of an empty edit operation in terms of edit characters.
	 */
	public short Diff_EditCost = 4;
	/**
	 * The size beyond which the double-ended diff activates. Double-ending is twice
	 * as fast, but less accurate.
	 */
	public short Diff_DualThreshold = 32;
	/**
	 * At what point is no match declared (0.0 = perfection, 1.0 = very loose).
	 */
	public float Match_Threshold = 0.5f;
	/**
	 * How far to search for a match (0 = exact location, 1000+ = broad match). A
	 * match this many characters away from the expected location will add 1.0 to
	 * the score (0.0 is a perfect match).
	 */
	public int Match_Distance = 1000;
	/**
	 * When deleting a large block of text (over ~64 characters), how close does the
	 * contents have to match the expected contents. (0.0 = perfection, 1.0 = very
	 * loose). Note that Match_Threshold controls how closely the end points of a
	 * delete need to match.
	 */
	public float Patch_DeleteThreshold = 0.5f;
	/**
	 * Chunk size for context length.
	 */
	public short Patch_Margin = 4;
	/**
	 * The number of bits in an int.
	 */
	private int Match_MaxBits = 32;

	/**
	 * Internal class for returning results from diff_linesToChars(). Other less
	 * paranoid languages just use a three-element array.
	 */
	protected static class LinesToCharsResult {
		protected String chars1;
		protected String chars2;
		protected List<String> lineArray;

		protected LinesToCharsResult(String chars1, String chars2, List<String> lineArray) {
			this.chars1 = chars1;
			this.chars2 = chars2;
			this.lineArray = lineArray;
		}
	}

	// DIFF FUNCTIONS
	/**
	 * The data structure representing a diff is a Linked list of Diff objects:
	 * {Diff(Operation.DELETE, "Hello"), Diff(Operation.INSERT, "Goodbye"),
	 * Diff(Operation.EQUAL, " world.")} which means: delete "Hello", add "Goodbye"
	 * and keep " world."
	 */
	public enum Operation {
		DELETE, INSERT, EQUAL
	}

	/**
	 * Find the differences between two texts. Run a faster slightly less optimal
	 * diff. This method allows the 'checklines' of diff_main() to be optional. Most
	 * of the time checklines is wanted, so default to true.
	 * 
	 * @param text1
	 *            Old string to be diffed.
	 * @param text2
	 *            New string to be diffed.
	 * @return Linked List of Diff objects.
	 */
	public LinkedList<Diff> diff_main(String text1, String text2) {
		return diff_main(text1, text2, true);
	}

	/**
	 * Find the differences between two texts. Simplifies the problem by stripping
	 * any common prefix or suffix off the texts before diffing.
	 * 
	 * @param text1
	 *            Old string to be diffed.
	 * @param text2
	 *            New string to be diffed.
	 * @param checklines
	 *            Speedup flag. If false, then don't run a line-level diff first to
	 *            identify the changed areas. If true, then run a faster slightly
	 *            less optimal diff
	 * @return Linked List of Diff objects.
	 */
	public LinkedList<Diff> diff_main(String text1, String text2, boolean checklines) {
		// Check for equality (speedup)
		LinkedList<Diff> diffs;
		if (text1.equals(text2)) {
			diffs = new LinkedList<Diff>();
			diffs.add(new Diff(Operation.EQUAL, text1));
			return diffs;
		}
		// Trim off common prefix (speedup)
		int commonlength = diff_commonPrefix(text1, text2);
		String commonprefix = text1.substring(0, commonlength);
		text1 = text1.substring(commonlength);
		text2 = text2.substring(commonlength);
		// Trim off common suffix (speedup)
		commonlength = diff_commonSuffix(text1, text2);
		String commonsuffix = text1.substring(text1.length() - commonlength);
		text1 = text1.substring(0, text1.length() - commonlength);
		text2 = text2.substring(0, text2.length() - commonlength);
		// Compute the diff on the middle block
		diffs = diff_compute(text1, text2, checklines);

		// Restore the prefix and suffix
		if (commonprefix.length() != 0) {
			diffs.addFirst(new Diff(Operation.EQUAL, commonprefix));
		}
		if (commonsuffix.length() != 0) {
			diffs.addLast(new Diff(Operation.EQUAL, commonsuffix));
		}
		diff_cleanupMerge(diffs);
		return diffs;
	}

	/**
	 * Find the differences between two texts. Assumes that the texts do not have
	 * any common prefix or suffix.
	 * 
	 * @param text1
	 *            Old string to be diffed.
	 * @param text2
	 *            New string to be diffed.
	 * @param checklines
	 *            Speedup flag. If false, then don't run a line-level diff first to
	 *            identify the changed areas. If true, then run a faster slightly
	 *            less optimal diff
	 * @return Linked List of Diff objects.
	 */
	protected LinkedList<Diff> diff_compute(String text1, String text2, boolean checklines) {
		LinkedList<Diff> diffs = new LinkedList<Diff>();
		if (text1.length() == 0) {
			// Just add some text (speedup)
			diffs.add(new Diff(Operation.INSERT, text2));
			return diffs;
		}
		if (text2.length() == 0) {
			// Just delete some text (speedup)
			diffs.add(new Diff(Operation.DELETE, text1));
			return diffs;
		}
		String longtext = text1.length() > text2.length() ? text1 : text2;
		String shorttext = text1.length() > text2.length() ? text2 : text1;
		int i = longtext.indexOf(shorttext);
		if (i != -1) {
			// Shorter text is inside the longer text (speedup)
			Operation op = (text1.length() > text2.length()) ? Operation.DELETE : Operation.INSERT;
			diffs.add(new Diff(op, longtext.substring(0, i)));
			diffs.add(new Diff(Operation.EQUAL, shorttext));
			diffs.add(new Diff(op, longtext.substring(i + shorttext.length())));
			return diffs;
		}
		longtext = shorttext = null; // Garbage collect.
		// Check to see if the problem can be split in two.
		String[] hm = diff_halfMatch(text1, text2);
		if (hm != null) {
			// A half-match was found, sort out the return data.
			String text1_a = hm[0];
			String text1_b = hm[1];
			String text2_a = hm[2];
			String text2_b = hm[3];
			String mid_common = hm[4];
			// Send both pairs off for separate processing.
			LinkedList<Diff> diffs_a = diff_main(text1_a, text2_a, checklines);
			LinkedList<Diff> diffs_b = diff_main(text1_b, text2_b, checklines);
			// Merge the results.
			diffs = diffs_a;
			diffs.add(new Diff(Operation.EQUAL, mid_common));
			diffs.addAll(diffs_b);
			return diffs;
		}
		// Perform a real diff.
		if (checklines && (text1.length() < 100 || text2.length() < 100)) {
			checklines = false; // Too trivial for the overhead.
		}
		List<String> linearray = null;
		if (checklines) {
			// Scan the text on a line-by-line basis first.
			LinesToCharsResult b = diff_linesToChars(text1, text2);
			text1 = b.chars1;
			text2 = b.chars2;
			linearray = b.lineArray;
		}
		diffs = diff_map(text1, text2);
		if (diffs == null) {
			// No acceptable result.
			diffs = new LinkedList<Diff>();
			diffs.add(new Diff(Operation.DELETE, text1));
			diffs.add(new Diff(Operation.INSERT, text2));
		}
		if (checklines) {
			// Convert the diff back to original text.
			diff_charsToLines(diffs, linearray);
			// Eliminate freak matches (e.g. blank lines)
			diff_cleanupSemantic(diffs);
			// Rediff any replacement blocks, this time character-by-character.
			// Add a dummy entry at the end.
			diffs.add(new Diff(Operation.EQUAL, ""));
			int count_delete = 0;
			int count_insert = 0;
			String text_delete = "";
			String text_insert = "";
			ListIterator<Diff> pointer = diffs.listIterator();
			Diff thisDiff = pointer.next();
			while (thisDiff != null) {
				switch (thisDiff.operation) {
				case INSERT:
					count_insert++;
					text_insert += thisDiff.text;
					break;
				case DELETE:
					count_delete++;
					text_delete += thisDiff.text;
					break;
				case EQUAL:
					// Upon reaching an equality, check for prior redundancies.
					if (count_delete >= 1 && count_insert >= 1) {
						// Delete the offending records and add the merged ones.
						pointer.previous();
						for (int j = 0; j < count_delete + count_insert; j++) {
							pointer.previous();
							pointer.remove();
						}
						for (Diff newDiff : diff_main(text_delete, text_insert, false)) {
							pointer.add(newDiff);
						}
					}
					count_insert = 0;
					count_delete = 0;
					text_delete = "";
					text_insert = "";
					break;
				}
				thisDiff = pointer.hasNext() ? pointer.next() : null;
			}
			diffs.removeLast(); // Remove the dummy entry at the end.
		}
		return diffs;
	}

	/**
	 * Split two texts into a list of strings. Reduce the texts to a string of
	 * hashes where each Unicode character represents one line.
	 * 
	 * @param text1
	 *            First string.
	 * @param text2
	 *            Second string.
	 * @return An object containing the encoded text1, the encoded text2 and the
	 *         List of unique strings. The zeroth element of the List of unique
	 *         strings is intentionally blank.
	 */
	protected LinesToCharsResult diff_linesToChars(String text1, String text2) {
		List<String> lineArray = new ArrayList<String>();
		Map<String, Integer> lineHash = new HashMap<String, Integer>();
		// e.g. linearray[4] == "Hello\n"
		// e.g. linehash.get("Hello\n") == 4
		// "\x00" is a valid character, but various debuggers don't like it.
		// So we'll insert a junk entry to avoid generating a null character.
		lineArray.add("");
		String chars1 = diff_linesToCharsMunge(text1, lineArray, lineHash);
		String chars2 = diff_linesToCharsMunge(text2, lineArray, lineHash);
