Software Engineering Assignment 2: Paper Plagiarism Checker

Latest recommended article published on 2024-09-13 21:40:38

J_Ziiro

Tags: java

Article link: https://blog.csdn.net/HJackzong/article/details/129375395

1. GitCode assignment repository

3121005261 GitCode

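2. PSP table

PSP2.1	Personal Software Process Stages	Estimated time (min)	Actual time (min)
Planning	Planning	30	60
Estimate	Estimate how long the task will take	120	120
Development	Development	90	90
Analysis	Requirements analysis (including learning new technologies)	90	150
Design Spec	Write the design document	60	70
Design Review	Design review	40	45
Coding Standard	Coding standard (set suitable conventions for the current development)	20	20
Design	Detailed design	50	50
Coding	Coding	250	330
Code Review	Code review	40	60
Test	Testing (self-testing, fixing code, committing changes)	40	40
Reporting	Reporting	120	120
Test Report	Test report	60	60
Size Measurement	Measure the workload	30	30
Postmortem & Process Improvement Plan	Postmortem and process improvement plan	30	60
Total	Total	1250	1440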

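3. Design and implementation of the module interfaces

3.1. Three classes, six methods

Class	Method
main (entry point for debugging)	splitString (splits the string using a regular expression and a Map)
file (file handling)	isChinese (uses Pattern and a regular expression to test whether text is Chinese)
mysimhash (SimHash utility class)	hash (iterates over the strings in the Map and computes a hash value for each one)
	getHamDistance (computes the Hamming distance between two SimHash strings; smaller means more similar)
	main method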
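
To make the role of getHamDistance concrete, here is a minimal sketch of a Hamming distance over two SimHash bit strings. The class name HammingSketch and the method name hammingDistance are made up for illustration, and the sketch assumes both fingerprints are "0"/"1" strings of equal length; the real getHamDistance may differ.

```java
class HammingSketch {
    /** Counts the positions at which two equal-length bit strings differ. */
    static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("fingerprints must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        // The strings differ at the third and fifth positions.
        System.out.println(hammingDistance("10110", "10011")); // prints 2
    }
}
```

Two identical fingerprints give a distance of 0, and the distance can never exceed the fingerprint length.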

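3.2. Relationships between the classes and methods

1. The file class uses file I/O utilities to read the file at the given path.
2. The type of the file is checked; a non-txt file causes an exception to be thrown.
3. The mysimhash class splits each of the two texts, hashes every piece, and merges the hashes by weight.
4. Finally, main calls getHamDistance in the mysimhash class to compute the Hamming distance between the two strings.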
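
Steps 1 and 2 above could look roughly like the sketch below. It is not the code of the file class: the names TxtReaderSketch and readTxt are hypothetical, UTF-8 encoding is an assumption, and IOException is used for the rejection only because that is the exception caught in section 6.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TxtReaderSketch {
    /** Reads a UTF-8 text file, rejecting any path that does not end in .txt. */
    static String readTxt(String pathName) throws IOException {
        if (pathName == null || !pathName.toLowerCase().endsWith(".txt")) {
            throw new IOException("only .txt files are supported: " + pathName);
        }
        Path path = Paths.get(pathName);
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file so the demo does not depend on a fixed path.
        Path tmp = Files.createTempFile("simhash-demo", ".txt");
        Files.write(tmp, "sample text".getBytes(StandardCharsets.UTF_8));
        System.out.println(readTxt(tmp.toString()));
        Files.delete(tmp);
    }
}
```

Checking the extension before touching the disk means a wrong file type fails fast, without reading anything.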

3.3. Key algorithm: SimHash flowchart and code implementation

SimHash overview and flowchart:

SimHash algorithm implementation

[Figure: SimHash algorithm flowchart]

4. Performance improvements to the module interfaces

4.1. Performance analysis chart (generated by JProfiler 11)

4.2. The most expensive function

    // Computes the SimHash fingerprint.
    // Requires: import java.math.BigInteger; import java.util.Map;
    public BigInteger simhash() {
        int[] v = new int[this.hashbits];
        // Split the content into a map of token -> weight.
        Map<String, Integer> hashData = splitString();
        // Every token currently gets the same fixed weight; the per-token
        // weight stored in the map would be hashData.get(word).
        int weight = 20;
        for (String word : hashData.keySet()) {
            // Hash the token, then let each of its bits vote on position j of v.
            BigInteger t = this.hash(word);
            for (int j = 0; j < this.hashbits; j++) {
                if (t.testBit(j)) {
                    v[j] += weight;
                } else {
                    v[j] -= weight;
                }
            }
        }

        // Collapse the vote vector into the fingerprint: bit i is 1 when
        // v[i] > 0, in both the BigInteger and the string representation.
        BigInteger fingerprint = BigInteger.ZERO;
        StringBuilder simHashBuffer = new StringBuilder();
        for (int i = 0; i < this.hashbits; i++) {
            if (v[i] > 0) {
                fingerprint = fingerprint.setBit(i);
                simHashBuffer.append("1");
            } else {
                simHashBuffer.append("0");
            }
        }
        this.strSimHash = simHashBuffer.toString();
        return fingerprint;
    }

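4.3. Performance optimization

splitString uses two string arrays: the text is first loaded, unstructured, into one string array through a buffered stream, and is then split with a regular expression into a second string array. The intermediate storage can get large; perhaps the buffering class could be reworked?
The regular expression checks every Chinese character encoding, and each Chinese character is stored in the string array as its own string, which slows things down and takes extra space.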
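
One possible direction for the two issues above, sketched under assumptions: compile the pattern once, scan the text in a single pass, and count whole runs of Chinese characters directly into the Map instead of going through two arrays of single characters. The class name SplitSketch is hypothetical, and the range \u4e00-\u9fa5 is a commonly used approximation of Chinese characters rather than the expression the project actually uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitSketch {
    // Compiled once and reused; matches a run of consecutive Chinese characters.
    private static final Pattern CHINESE_RUN = Pattern.compile("[\\u4e00-\\u9fa5]+");

    /** Maps each run of Chinese characters to the number of times it occurs. */
    static Map<String, Integer> splitString(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = CHINESE_RUN.matcher(text);
        while (m.find()) {
            // merge() inserts 1 for a new run and adds 1 for a repeated one.
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Punctuation falls outside the range, so it acts as the separator.
        System.out.println(splitString("今天天气晴朗，适合出门。今天天气晴朗"));
    }
}
```

Whether runs between punctuation marks are the right granularity for SimHash is a separate question; a word segmenter would produce finer tokens, at the cost of an extra dependency.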

5. Test module

5.1. Test class (main)

5.1.2

5.2. Results

Note: the smaller the Hamming distance, the more similar the two texts.
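
As a small worked example of that note, the sketch below computes the Hamming distance of two BigInteger fingerprints with XOR and a bit count, then turns it into a ratio between 0 and 1. The names SimilaritySketch, HASH_BITS and similarity are hypothetical, the 64-bit width is taken from the comment in the original simhash code, and the normalization is an assumption for illustration, not something the project is known to report.

```java
import java.math.BigInteger;

class SimilaritySketch {
    // Assumed fingerprint width in bits.
    static final int HASH_BITS = 64;

    /** Number of bit positions in which the two fingerprints differ. */
    static int hammingDistance(BigInteger a, BigInteger b) {
        return a.xor(b).bitCount();
    }

    /** 1.0 for identical fingerprints, 0.0 when all HASH_BITS bits differ. */
    static double similarity(BigInteger a, BigInteger b) {
        return 1.0 - (double) hammingDistance(a, b) / HASH_BITS;
    }

    public static void main(String[] args) {
        BigInteger a = new BigInteger("1011", 2);
        BigInteger b = new BigInteger("1001", 2);
        System.out.println(hammingDistance(a, b)); // prints 1
        System.out.println(similarity(a, b));      // prints 0.984375
    }
}
```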

5.4. Code coverage

mysimhash class:
file class:
main class:

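6. Exception handling

Design goal: guard against an empty file path.
Scenario: when the path being read is empty, the exception is caught and reported.
Code: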

		file t1 = null;
		try {
			t1 = new file("D:\\test\\check_txt\\orig_0.8_del.txt");
		} catch (IOException e) {
			// The path is empty or the file cannot be read: report it and print the stack trace.
			System.out.println("File is empty!");
			e.printStackTrace();
		}