之前有一个需求,有一批项目数据,需要对项目数据的标题按照一定进行清洗,清洗完之后去除重复标题的数据,得到最终结果。
已知项目数据一共有四种状态,分别为公告,预告,结果与变更。在这个需求里,公告、预告为同一规则,结果与变更为另一规则,规则关键词如下:
具体规则如下:
由此可知,我需要去除目标数据中所有规则关键词的组合词。
由于当时需求比较紧急,因此便马上根据规则写了一套很粗糙的程序。
具体思路如下:
- 由于担心在程序运行期间跑关键词组合不太方便,因此先用几个for循环依次将所有组合词跑出来。然后将这些组合词复制后赋值给一些常量数组。
- 依据规则对数据进行清洗。
- 将清洗完的数据插入result表,在插入的时候判断该表是否有相同标题的数据,若有则不插入。从而获得去重后的数据。
看起来似乎不错,但是其实第一步就没有成功,因为跑出来的关键词实在是太多了。如果如果赋值给常量数组,空间不足,项目无法启动。
因此对第一步进行改进。
- 由于无法将这些数据赋值给常量数组,所以我将这些跑出来的数据存在了表里面。程序运行的时候用一个list去接受这些表的数据就可以。
- 此时,需求方说需要保留原标题,于是启用了备用字段spare4,用来存放标题。只要在规则程序开始前,将title字段复制给spare4,针对spare4进行清洗即可。
- 同时,我发现上一个思路的最后一个步骤似乎并不需要。我只需要对清洗完的数据进行一个group by spare,progid 就可以获得最终数据。
在修改完程序之后,测试了一下效率,后来发现并不是很快,因为我将所有状态的数据的清洗都放在了一个函数里面,需要串行执行。假设我有一百万的数据,公告、预告占五十万,结果、变更占五十万。这样的话,我就需要等待前一个状态清洗完毕才可以对另一个状态进行清洗。
既然如此,为什么不多开几个线程来跑不同状态的数据呢?
于是我开了四个线程,分别跑四个状态的两类数据。
因此,初代程序如下:
/**去重流程
*1.将标题复制到spare4,并且对spare4的标题进行清洗。
* 提供参数
*2.将标题复制到另一张表,根据spare4判断是否有内容相等的标题,如果有,只复制第一个。第二个则不复制,完成去重的最后步骤。最后,从这个表中导出数据即可(暂时不做, 可自行在数据库中group by spare4)。
*/
/**
* 去重函数 progid=3
*/
@Scheduled(fixedDelay = 60*60*60*1000)
public void clean_3(){
// 该sql用于传入给cleanTheTitle。
String spare4_sql_1="select * from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=3 limit ";
// 该sql用于判断何时跳出循环。
String spare4_sql_2="select title,id,contentId from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=3 limit ";
// 清洗类别,如果是2、3类别的招中标信息,则type=1否则type=1;
int type=3;
// 源数据所在的表名
String table="big_customer_data_zhengfu_2";
cleanTheSpare4(spare4_sql_1,0,spare4_sql_2,type,table);
}
/**
* 去重函数 progid:=2
*/
@Scheduled(fixedDelay = 60*60*60*1000)
public void clean_2(){
//该sql用于传入给cleanTheTitle。
String spare4_sql_1="select * from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=2 limit ";
// 该sql用于判断何时跳出循环。
String spare4_sql_2="select title,id,contentId from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=2 limit ";
// 清洗类别,如果是2、3类别的招中标信息,则type=1否则type=1;
int type=2;
// 源数据所在的表名
String table="big_customer_data_zhengfu_2";
cleanTheSpare4(spare4_sql_1,0,spare4_sql_2,type,table);
}
/**
* 去重函数 1
*/
@Scheduled(fixedDelay = 60*60*60*1000)
public void clean_1(){
String spare4_sql_11="select * from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=1 limit ";
// 该sql用于判断何时跳出循环。
String spare4_sql_22="select title,id,contentId from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=1 limit ";
// 清洗类别,如果是2、3类别的招中标信息,则type=1否则type=1;
// 源数据所在的表名
String table="big_customer_data_zhengfu_2";
int type_1=1;
cleanTheSpare4(spare4_sql_11,0,spare4_sql_22,type_1,table);
}
/**
* 去重函数 0
*/
@Scheduled(fixedDelay = 60*60*60*1000)
public void clean_0(){
String spare4_sql_11="select * from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=0 limit ";
// 该sql用于判断何时跳出循环。
String spare4_sql_22="select title,id,contentId from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=0 limit ";
// 源数据所在的表名
String table="big_customer_data_zhengfu_2";
// 清洗类别,如果是2、3类别的招中标信息,则type=1否则type=1;
int type_1=0;
cleanTheSpare4(spare4_sql_11,0,spare4_sql_22,type_1,table);
}
/**
* 将方法抽象出来,便于直接调用。
* 需要输入表名,taskId,progid。
* 不同的progid有不同的清洗方式
* @param table
* @param taskId
* @param progid
*/
public void cleanAbstract(String table,int taskId,int progid){
String spare4Sql1="select * from" +table+" where taskId="+taskId+"and progid="+progid+" limit ";
// 该sql用于判断何时跳出循环。
String spare4Sql2="selectid from "+table+" where taskId="+taskId+"and progid="+progid+" limit ";
// 源数据所在的表名
cleanTheSpare4(spare4Sql1,0,spare4Sql2,progid,table);
}
/**
* 将标题复制到spare4,并且对spare4的标题进行清洗。
*/
public void cleanTheSpare4(String sql_1,int startnum,String sql_2,int type,String table){
int k=0;
int num=1000;
int start=startnum;
while(true){
// 循环取出1000条数据,对每一条数据遍历关键词
String sql=sql_1+start+","+num;
List<Map<String, Object>> WORD = shujuzuJdbcTemplate.queryForList(sql_2+start+","+num);
cleanTheTitle(sql,type,table);
start+=1000;
log.info("类型"+type+"执行到了:"+start);
if(WORD.size()<1000){
break;
}
}
log.info("类型"+type+"全部结束了!");
}
/**
* 对标题进行去重
* 先根据传入的sql获取到list。再根据第一规则去去重。classify代表的是哪一种分类,1表示公告、预告,2表示结果、变更。
* sql="select * from 表 order by 重复的字段,ID",
* 获取到对应的title,去数据库中查该title的数据。并获取该list。
* 最后在走一遍整体去重。
* @param sql
*/
public void cleanTheTitle(String sql,int classify,String table) {
// 获取所需要的数据
saveTheTitle(sql,table);
if(classify==1||classify==0){
ruleFirstTwoTwo(sql,table);
}
if(classify==2 ||classify==3){
ruleSecondTwoTwo(sql,table);
}
}
/**目前未执行。
* 复制,并且去重。
*/
public void cloneAndClean(String sql_1,String sql_2,String sql_3){
// 把符合条件的数据存在数据库中,并打上标签。
int k=0;
int num=1000;
int start=0;
while(true){
// 循环取出1000条数据,对每一条数据遍历,如果没有重复的title,那就插入
List<Map<String, Object>> WORD = shujuzuJdbcTemplate.queryForList(sql_1+start+","+num);
for(Map<String,Object> map:WORD){
//判断是否有该标题 若有就不弄了
if(map.get("spare4")!=null){
if(shujuzuJdbcTemplate.queryForList(sql_2,map.get("spare4")).size()==0){
// 存入
shujuzuJdbcTemplate.update(sql_3,map.get("id"));
}
}
}
start+=1000;
log.info("执行到了:"+start);
if(WORD.size()<1000){
break;
}
}
log.info("复制结束");
}
/**
*针对progid=0||progid=1的数据进行去重
* 获取要去重的列表cleanList
* 调用clean_it函数进行相应去重
* @param sql
* @param table
*/
public void ruleFirstTwoTwo(String sql,String table){
List<Map<String, Object>> cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_xbd_bdx","first_xbd_bdx");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_abd_abbd","first_abd_abbd");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_bbbd","first_bbbd");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_bbbed","first_bbbed");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_bbd_bbed","first_bbd_bbed");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_bd_bed","first_bd_bed");
String str="";
cleanList=shujuzuJdbcTemplate.queryForList(sql);
for(Map<String, Object> clean:cleanList){
str=clean.get("spare4").toString();
str=str.replaceAll("\\[.*?]","");
str=str.replaceAll("\\【.*?】","");
str=str.replaceAll(":","");
str=str.replaceAll(":","");
str=str.replaceAll("\\[|\\]","");
str=str.replaceAll("\\【|\\】","");
str=str.replaceAll("_","");
str=str.replaceAll("-","");
str=str.replaceAll("\\.","");
if(!clean.get("spare4").equals(str)){
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
// log.info("words_x结束了");
}
/**
* 针对变更、结果的次序二
* 1顺次:删除“cdx”“xcd”的完全匹配的词组;
* 2顺次:删除“abcd”的完全匹配的词组;
* 3顺次:删除“abbcd”的完全匹配的词组;
* 4顺次:删除“bbbcd”“bbbed”的完全匹配的词组;
* 5顺次:删除“bbcd”“bbced”的完全匹配的词组;
* 6顺次:删除“bcd”“bced”的完全匹配的词组;
* 7顺次:删除“ccd”“cced”的完全匹配的词组;
* 8顺次:删除“cc”“cd”“cec”“ced”的完全匹配的词组;
* 9顺次:删除“x”;
*/
public void saveTheTitle(String sql,String table){
// 获取数据
String str="";
List<Map<String, Object>> cleanList = shujuzuJdbcTemplate.queryForList(sql);
// 将每一个对象的title存在spare1。
for(Map<String, Object> clean:cleanList){
if(null!=clean.get("title")){
str=clean.get("title").toString();
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
}
public void clean_it( List<Map<String, Object>> cleanList,String table,String colunmName,String tableName){
String str=null;
List<Map<String, Object>> cleanWords = shujuzuJdbcTemplate.queryForList("select "+colunmName+" from "+tableName);
for(Map<String, Object> clean:cleanList){
if( null!=clean.get("spare4")){
str=clean.get("spare4").toString();
if(str!= null ||str!= ""){
for(Map<String, Object> cleanwords:cleanWords){
if (str.contains(cleanwords.get(colunmName).toString())){
str=str.replaceAll(cleanwords.get(colunmName).toString(),"");
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
}
}
}
}
/**
* 针对progid=2||progid=3的数据进行去重
* 获取要去重的列表cleanList
* 调用clean_it函数进行相应去重
* @param sql
* @param table
*/
public void ruleSecondTwoTwo(String sql,String table){
List<Map<String, Object>> cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_cdx_xcd","second_cdx_xcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_abcd","second_abcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_abbcd","second_abbcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bbbcd","second_bbbcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bbbed","second_bbbed");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bbcd","second_bbcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bbced","second_bbced");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bcd","second_bcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bced","second_bced");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_ccd","second_ccd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_cc","second_cc");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_cd","second_cd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_cec","second_cec");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_ced","second_ced");
List<Map<String, Object>> WORDS_X = shujuzuJdbcTemplate.queryForList("select words_x from words_x");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
String str="";
for(Map<String, Object> clean:cleanList){
if(null!=clean.get("spare4")){
str=clean.get("spare4").toString();
str=str.replaceAll("\\[.*?]","");
str=str.replaceAll("\\【.*?】","");
str=str.replaceAll(":","");
str=str.replaceAll(":","");
str=str.replaceAll("\\[|\\]","");
str=str.replaceAll("\\【|\\】","");
str=str.replaceAll("_","");
str=str.replaceAll("-","");
str=str.replaceAll("\\.","");
if(!clean.get("spare4").equals(str)){
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
}
}
测试了一下,如果是在本地跑数据的话,速度很慢,大概十五分钟跑完两万的数据。
正好那时候遇上一个需求,需要对标题进行去重,数据量为八十万。按这么来算的话,就需要十个小时。
这谁顶得住呀。必须得放服务器,肯定得放服务器。
将程序放在服务器上跑了一批测试数据,大概三分钟跑完两万数据。我们组只有30%的带宽,不过速度看起来还可以。按这么来算的话,八十万数据就需要两个小时,还是等得起的。
理论上来说,这个去重的程序并不是简单地对数据库进行读写,规则较为复杂,关键词组太多,这都是程序运行缓慢的原因。但是,它实在是太慢了!必须得收拾它,肯定得收拾它!
怎么收拾呢?首先我们来看看我上面贴出来的破代码。其实上面的程序还算可以看,因为我将部分重复步骤的代码提出为了函数,否则,那将更加的又臭又长。
好吧,我承认,它现在看起来也是又臭又长。
你看,这里,每一次匹配出组合之前,我都要去数据库拿关键词,在清洗完这一批关键词之后,我都要对spare4进行更新。
public void clean_it( List<Map<String, Object>> cleanList,String table,String colunmName,String tableName){
String str=null;
List<Map<String, Object>> cleanWords = shujuzuJdbcTemplate.queryForList("select "+colunmName+" from "+tableName);
for(Map<String, Object> clean:cleanList){
if( null!=clean.get("spare4")){
str=clean.get("spare4").toString();
if(str!= null ||str!= ""){
for(Map<String, Object> cleanwords:cleanWords){
if (str.contains(cleanwords.get(colunmName).toString())){
str=str.replaceAll(cleanwords.get(colunmName).toString(),"");
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
}
}
}
}
而一个数据,需要经过七组或者九组关键词的清洗,而一批数据为1000条,100万数据,那就是1000批数据。这么下来,IO操作就太多了。
所以,从这一段代码,就可以发现两处需要改进的地方。
- 首先,每一次清洗的时候都需要获取关键词,这一点没有必要。可以在程序刚启动的时候获取到关键词,赋值给String数组,后续直接匹配数组即可。
- 其次,每一次关键词的清洗,都需要对spare4进行update操作,没有必要。完全可以用一个string 变量去接收spare4的值,对该变量进行清洗,在清洗结束的时候再进行update操作即可。那既然想到这里,该点还可以继续优化。可以直接用变量去接收title的值,对变量进行清洗,清洗完后将该值更新给spare4字段。因此,清洗启动前也无需将title赋值到spare4中了。
此外,刚才说到我将跑出来的数据全部都放在数据库里面了,存了多个表,那其它表中的数据想要直接用我这个程序进行清洗,岂不是还需要将我的表给复制到对应的数据库中去?那也太麻烦了吧。
你瞧,这么多个表,多麻烦呀。
想到这里,我突然又纳闷了,我当初为何要存这么多个表呢?
噢。是因为规则清洗是有步骤的,所以用到关键词的时候,只需要依次获取对应表中的数据即可。
可是,既然有顺序,那我将关键词在一个表中按顺序存储不就得了???何须这么多表?
一共有两类数据,对应两类规则,那我存两个表就得了。
若是觉得两个表也不合适,那直接加一个字段,存0、1,用来区分两类规则的关键词即可,这样就只需要一个表就可以了。反正检索数据的时候,按id来排序,又不会乱序,为什么不呢?
所以又有了一个改进:
- 将数据库中的关键词表存在一个表中,增加一个type字段,用以区分两种类别的关键词,这样的话,程序只需要获取一次数据,也只需要进行一次清洗,无需对多个数组进行匹配。
那么,又一个问题来了。若是我偏偏不想将关键词组存在数据库中呢?
也行,不存在数据库中,那么就放在内存中呗,从内存中取数据总比走磁盘快吧。
因此,在进行清洗之前,先将关键词组合跑出来就行。
private String[] GONG_YU_GAO_A = {"定点", "协议", "单一来源", "单一", "资格", "资格预审", "竞争", "竞争性", "公开"};
private String[] GONG_YU_GAO_B = {"招标", "采购", "磋商", "入围", "谈判", "议价", "询价", "比价", "询比价", "比选", "项目"};
private String[] GONG_YU_GAO_D = {"公告", "公示", "信息"};
private String[] GONG_YU_GAO_E = {"的"};
private String[] BIAN_RESULT_A = {"定点", "协议", "单一来源", "单一", "资格", "资格预审", "竞争", "竞争性", "公开"};
private String[] BIAN_RESULT_B = {"招标", "采购", "磋商", "入围", "谈判", "议价", "询价", "比价", "询比价", "比选", "项目"};
private String[] BIAN_RESULT_C = {"合同", "成交", "结果", "中标", "变更", "候选", "候选人", "成交人", "中标人"};
private String[] BIAN_RESULT_D = {"公告", "公示", "结果"};
private String[] BIAN_RESULT_E = {"的"};
private String[] ALL_X = {":", ":", "\\[", "\\]", "\\【", "\\】", "_", "-", "."};
private List<String> GONG_YU_GAO_bdx = make(GONG_YU_GAO_B, GONG_YU_GAO_D, ALL_X);
private List<String> GONG_YU_GAO_xbd = make(ALL_X, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_abd = make(GONG_YU_GAO_A, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_abbd = make(GONG_YU_GAO_A, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbbd = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbbed = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbd = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbed = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bd = make(GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bed = make(GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> BIAN_RESULT_cdx = make(BIAN_RESULT_C, BIAN_RESULT_D, ALL_X);
private List<String> BIAN_RESULT_xcd = make(BIAN_RESULT_C, BIAN_RESULT_D, ALL_X);
private List<String> BIAN_RESULT_abcd = make(BIAN_RESULT_A, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_abbcd = make(BIAN_RESULT_A, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbbcd = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbbed = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbcd = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbced = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bcd = make(BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bced = make(BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_ccd = make(BIAN_RESULT_C, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cced = make(BIAN_RESULT_C, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cc = make(BIAN_RESULT_C, BIAN_RESULT_C);
private List<String> BIAN_RESULT_cd = make(BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cec = make(BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_C);
make函数如下:
这样就不存在一开始的无法启动项目的问题了。
改进后的标题去重完整代码如下:
public class CleanTitleService {
private String[] GONG_YU_GAO_A = {"定点", "协议", "单一来源", "单一", "资格", "资格预审", "竞争", "竞争性", "公开"};
private String[] GONG_YU_GAO_B = {"招标", "采购", "磋商", "入围", "谈判", "议价", "询价", "比价", "询比价", "比选", "项目"};
private String[] GONG_YU_GAO_D = {"公告", "公示", "信息"};
private String[] GONG_YU_GAO_E = {"的"};
private String[] BIAN_RESULT_A = {"定点", "协议", "单一来源", "单一", "资格", "资格预审", "竞争", "竞争性", "公开"};
private String[] BIAN_RESULT_B = {"招标", "采购", "磋商", "入围", "谈判", "议价", "询价", "比价", "询比价", "比选", "项目"};
private String[] BIAN_RESULT_C = {"合同", "成交", "结果", "中标", "变更", "候选", "候选人", "成交人", "中标人"};
private String[] BIAN_RESULT_D = {"公告", "公示", "结果"};
private String[] BIAN_RESULT_E = {"的"};
private String[] ALL_X = {":", ":", "\\[", "\\]", "\\【", "\\】", "_", "-", "."};
private List<String> GONG_YU_GAO_bdx = make(GONG_YU_GAO_B, GONG_YU_GAO_D, ALL_X);
private List<String> GONG_YU_GAO_xbd = make(ALL_X, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_abd = make(GONG_YU_GAO_A, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_abbd = make(GONG_YU_GAO_A, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbbd = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbbed = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbd = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbed = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bd = make(GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bed = make(GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> BIAN_RESULT_cdx = make(BIAN_RESULT_C, BIAN_RESULT_D, ALL_X);
private List<String> BIAN_RESULT_xcd = make(BIAN_RESULT_C, BIAN_RESULT_D, ALL_X);
private List<String> BIAN_RESULT_abcd = make(BIAN_RESULT_A, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_abbcd = make(BIAN_RESULT_A, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbbcd = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbbed = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbcd = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbced = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bcd = make(BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bced = make(BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_ccd = make(BIAN_RESULT_C, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cced = make(BIAN_RESULT_C, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cc = make(BIAN_RESULT_C, BIAN_RESULT_C);
private List<String> BIAN_RESULT_cd = make(BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cec = make(BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_C);
@Autowired
@Qualifier("shujuzuJdbcTemplate")
private JdbcTemplate shujuzuJdbcTemplate;
// @Scheduled(fixedDelay = 60*60*60*1000)
public void goToCheck() {
log.info("去重开始");
Integer limit = 1000;
int num=0;
String taskId = "201907081530";
String table=" big_customer_data ";
while (true) {
List<Map<String, Object>> maps = shujuzuJdbcTemplate.queryForList("select id,contentId,title,progid from "+table+" where taskId = ? AND spare4 is null order by id limit ?", taskId, limit);
for (Map<String, Object> map : maps) {
shujuzuJdbcTemplate.update("update "+table+" set spare4 = ? where id = ?", gongTitle(map.get("title").toString(), map.get("progid").toString()), map.get("id"));
}
num+=1000;
log.info("执行到了:"+num);
if (maps.size() < limit) {
log.info("去重结束");
break;
}
}
}
private String gongTitle(String orgTitle, String progid) {
if (orgTitle == null || orgTitle.equals("")) {
return "";
}
Integer sum = 0;
if (progid.equals("0") || progid.equals("1")) {
sum = 7;
} else if (progid.equals("2") || progid.equals("3")) {
sum = 9;
}
boolean have = false;
for (Integer i = 1; i <= sum; i++) {
boolean x=(sum==7&&i==7)||(sum==9&&i==9);
if(x){
orgTitle=deleteX(orgTitle);
}
else{
for (String s : getKeys(i, progid)) {
if (orgTitle.contains(s)) {
orgTitle = orgTitle.replaceAll(s, "");
have = true;
break;
}
}
}
if (have) {
break;
}
}
return orgTitle;
}
private List<String> getKeys(Integer i, String progid) {
List<String> keys = new ArrayList<>();
if (progid.equals("0") | progid.equals("1")) {
switch (i) {
case 1:
keys.addAll(GONG_YU_GAO_bdx);
keys.addAll(GONG_YU_GAO_xbd);
break;
case 2:
keys.addAll(GONG_YU_GAO_abd);
break;
case 3:
keys.addAll(GONG_YU_GAO_abbd);
break;
case 4:
keys.addAll(GONG_YU_GAO_bbbd);
keys.addAll(GONG_YU_GAO_bbbed);
break;
case 5:
keys.addAll(GONG_YU_GAO_bbd);
keys.addAll(GONG_YU_GAO_bbed);
break;
case 6:
keys.addAll(GONG_YU_GAO_bd);
keys.addAll(GONG_YU_GAO_bed);
break;
case 7:
Collections.addAll(keys, ALL_X);
break;
default:
break;
}
} else if (progid.equals("2") | progid.equals("3")) {
switch (i) {
case 1:
keys.addAll(BIAN_RESULT_cdx);
keys.addAll(BIAN_RESULT_xcd);
break;
case 2:
keys.addAll(BIAN_RESULT_abcd);
break;
case 3:
keys.addAll(BIAN_RESULT_abbcd);
break;
case 4:
keys.addAll(BIAN_RESULT_bbbcd);
keys.addAll(BIAN_RESULT_bbbed);
break;
case 5:
keys.addAll(BIAN_RESULT_bbcd);
keys.addAll(BIAN_RESULT_bbced);
break;
case 6:
keys.addAll(BIAN_RESULT_bcd);
keys.addAll(BIAN_RESULT_bced);
break;
case 7:
keys.addAll(BIAN_RESULT_ccd);
keys.addAll(BIAN_RESULT_cced);
break;
case 8:
keys.addAll(BIAN_RESULT_cc);
keys.addAll(BIAN_RESULT_cd);
keys.addAll(BIAN_RESULT_cec);
break;
case 9:
Collections.addAll(keys, ALL_X);
break;
default:
break;
}
}
return keys;
}
private String deleteX(String str){
str=str.replaceAll("\\[.*?]","");
str=str.replaceAll("\\【.*?】","");
str=str.replaceAll(":","");
str=str.replaceAll(":","");
str=str.replaceAll("\\[|\\]","");
str=str.replaceAll("\\【|\\】","");
str=str.replaceAll("_","");
str=str.replaceAll("-","");
str=str.replaceAll("\\.","");
return str;
}
private List<String> make(String[] stringOne, String[] stringTwo) {
List<String> stringSet = new ArrayList<>();
for (String s1 : stringOne) {
for (String s2 : stringTwo) {
stringSet.add(s1 + s2);
}
}
return stringSet;
}
private List<String> make(String[] stringOne, String[] stringTwo, String[] stringThree) {
List<String> stringSet = new ArrayList<>();
for (String s1 : stringOne) {
for (String s2 : stringTwo) {
for (String s3 : stringThree) {
stringSet.add(s1 + s2 + s3);
}
}
}
return stringSet;
}
private List<String> make(String[] stringOne, String[] stringTwo, String[] stringThree, String[] stringFor) {
List<String> stringSet = new ArrayList<>();
for (String s1 : stringOne) {
for (String s2 : stringTwo) {
for (String s3 : stringThree) {
for (String s4 : stringFor) {
stringSet.add(s1 + s2 + s3 + s4);
}
}
}
}
return stringSet;
}
private List<String> make(String[] stringOne, String[] stringTwo, String[] stringThree, String[] stringFor, String[] stringFive) {
List<String> stringSet = new ArrayList<>();
for (String s1 : stringOne) {
for (String s2 : stringTwo) {
for (String s3 : stringThree) {
for (String s4 : stringFor) {
for (String s5 : stringFive) {
stringSet.add(s1 + s2 + s3 + s4 + s5);
}
}
}
}
}
return stringSet;
}
}
嗯,的确,这段代码读起来会比第一版舒服一些,少了数据库表的限制。
不过,对于上文提到的,如果只是在原版代码上进行改进,在数据库里面只保留一个表,这样的话代码也会变得很简洁,读起来也会很好。
用第一版的同一批数据对这段代码进行了测试,如果是在本地跑,两万数据需要9分钟。如果需要是放在服务器,则一分钟就跑完了。
放在表格里对比一下:
(两万数据) | 本地 | 服务器 |
第一版 | 15分钟 | 9分钟 |
第二版 | 3分钟 | 1分钟 |
也就是说,更新之后的代码,跑八十万数据,放在服务器,只需要40分钟即可,缩短三分之二的时间。还是很可观的。
不过,第二版代码目前只起了一个线程,如果将起四个线程,因为项目数据中,基本上是公告和结果两类数据平分天下,因此花费的时间则需折半,也久是说只要二十分钟就可以将八十万数据清洗完成,还是很舒服的。
好啦,今天的优化就到这里,所以来总结一下:
- 首先,读内存肯定比走磁盘快,所以根据自身情况来选择是否需要在数据库中建立额外的表来达到目的。
- 其次,尽量减少IO次数,比如第一版程序,每匹配完一组关键词就对spare4字段进行update,这种方式建议不要使用,能最后进行update就最后进行。
- 再者,程序能放服务器就放服务器,本地跑程序肯定不急服务器的带宽,有服务器不用是笨蛋。
- 最后,要多去思考,问题的解决方式肯定有多种,各有千秋。