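Today I came across this problem:
Find the two-character combination that occurs most frequently in the text below.
(Example: in the string "1252336528952", the two-character combination "52" occurs 3 times, which is the highest frequency.)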
oneofthecentralresultsofairesearchinthe1970swasthattoachievegoodperformanceaisystemsmusthav
elargeamountsofknowledgeknowledgeispowertheslogangoeshumansclearlyusevastamountsofknowledge
andifaiistoachieveitslongtermgoalsaisystemsmustalsousevastamountssincehandcodinglargeamount
sofknowledgeintoasystemisslowtediousanderrorpronemachinelearningtechniqueshavebeendeveloped
toautomaticallyacquireknowledgeoftenintheformofifthenrulesproductionsunfortunatelythishasof
tenledtoautilityproblemminton1988bthelearninghascausedanoverallslowdowninthesystemforexampl
einmanysystemslearnedrulesareusedtoreducethenumberofbasicstepsthesystemtakesinordertosolvep
roblemsbypruningthesystemssearchspaceforinstancebutinordertodetermineateachstepwhichrulesar
eapplicablethesystemmustmatchthemagainstitscurrentsituationusingcurrenttechniquesthematcher
slowsdownasmoreandmorerulesareacquiredsoeachsteptakeslongerandlongerthisectcanoutweighthere
ductioninthenumberofstepstakensothatthenetresultisaslowdownthishasbeenobservedinseveralrece
ntsystemsminton1988aetzioni1990tambeetal1990cohen1990ofcoursetheproblemofslowdownfromincrea
singmatchcostisnotrestrictedtosystemsinwhichthepurposeofrulesistoreducethenumberofproblemso
lvingstepsasystemacquiringnewrulesforanypurposecanslowdowniftherulessignicantlyincreasethem
atchcostandintuitivelyoneexpectsthatthemoreproductionsthereareinasystemthehigherthetotalmat
chcostwillbethethesisofthisresearchisthatwecansolvethisprobleminabroadclassofsystemsbyimpro
vingthematchalgorithmtheyuseinessenceouraimistoenablethescalingupofthenumberofrulesinproduc
tionsystemsweadvancethestateoftheartinproductionmatchalgorithmsdevelopinganimprovedmatchalg
orithmwhoseperformancescaleswellonasignicantlybroaderclassofsystemsthanexistingalgorithmsfu
rthermorewedemonstratethatbyusingthisimprovedmatchalgorithmwecanreduceoravoidtheutilityprob
leminalargeclassofmachinelearningsystems
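As a quick sanity check of the example in the problem statement, the count for "52" can be verified with a few lines of Java. This is only a minimal sketch: the class name `OccurrenceCountSketch` and the method `countOverlapping` are hypothetical names chosen for illustration, and the sketch assumes that overlapping occurrences should be counted (for example, "aa" occurs twice in "aaa").

```java
// Hypothetical helper class; the names are illustrative only.
class OccurrenceCountSketch {

    // Counts every occurrence of needle in text, including overlapping ones,
    // by restarting the search one character after each match.
    static int countOverlapping(String text, String needle) {
        if (text == null || needle == null || needle.isEmpty()) {
            return 0; // an empty needle would otherwise match at every index
        }
        int count = 0;
        int from = text.indexOf(needle);
        while (from != -1) {
            count++;
            from = text.indexOf(needle, from + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOverlapping("1252336528952", "52")); // prints 3
        System.out.println(countOverlapping("aaa", "aa"));           // prints 2
    }
}
```

The first call reproduces the count of 3 given in the example; the second shows why the search restarts at `from + 1` rather than after the whole match.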
I find this problem quite interesting. I have implemented it in C before; this time I solve it in Java. Here is the code:
package com.company;

import java.io.*;
import java.util.*;
import java.util.List;

/**
 * @projectName test
 * @title Test3
 * @package com.company
 * @description Finds the substrings that occur most often in a string
 * @author IT_CREAT
 * @date 2020/5/24 22:32
 * @version 1.0.0
 */
public class Test3 {
    /**
     * The string to test
     */
    public static String testTtr = "oneofthecentralresultsofairesearchinthe1970swasthattoachievegoodper" +
            "formanceaisystemsmusthavelargeamountsofknowledgeknowledgeispowertheslogangoeshumansclearly" +
            "usevastamountsofknowledgeandifaiistoachieveitslongtermgoalsaisystemsmustalsousevastamounts" +
            "sincehandcodinglargeamountsofknowledgeintoasystemisslowtediousanderrorpronemachinelearning" +
            "techniqueshavebeendevelopedtoautomaticallyacquireknowledgeoftenintheformofifthenrulesprodu" +
            "ctionsunfortunatelythishasoftenledtoautilityproblemminton1988bthelearninghascausedanoveral" +
            "lslowdowninthesystemforexampleinmanysystemslearnedrulesareusedtoreducethenumberofbasicstep" +
            "sthesystemtakesinordertosolveproblemsbypruningthesystemssearchspaceforinstancebutinorderto" +
            "determineateachstepwhichrulesareapplicablethesystemmustmatchthemagainstitscurrentsituation" +
            "usingcurrenttechniquesthematcherslowsdownasmoreandmorerulesareacquiredsoeachsteptakeslonge" +
            "randlongerthisectcanoutweighthereductioninthenumberofstepstakensothatthenetresultisaslowdo" +
            "wnthishasbeenobservedinseveralrecentsystemsminton1988aetzioni1990tambeetal1990cohen1990ofc" +
            "oursetheproblemofslowdownfromincreasingmatchcostisnotrestrictedtosystemsinwhichthepurposeo" +
            "frulesistoreducethenumberofproblemsolvingstepsasystemacquiringnewrulesforanypurposecanslow" +
            "downiftherulessignicantlyincreasethematchcostandintuitivelyoneexpectsthatthemoreproduction" +
            "sthereareinasystemthehigherthetotalmatchcostwillbethethesisofthisresearchisthatwecansolvet" +
            "hisprobleminabroadclassofsystemsbyimprovingthematchalgorithmtheyuseinessenceouraimistoenab" +
            "lethescalingupofthenumberofrulesinproductionsystemsweadvancethestateoftheartinproductionma" +
            "tchalgorithmsdevelopinganimprovedmatchalgorithmwhoseperformancescaleswellonasignicantlybro" +
            "aderclassofsystemsthanexistingalgorithmsfurthermorewedemonstratethatbyusingthisimprovedmat" +
            "chalgorithmwecanreduceoravoidtheutilityprobleminalargeclassofmachinelearningsystems";

    /**
     * Keys of the returned map
     */
    public enum ReturnKey {
        COUNT, SUBSTRINGS
    }

    /**
     * Finds the set of substrings that occur most often in a text file.
     *
     * @param chainNumber how many consecutive characters form one substring, i.e. the substring length
     * @param filePath    path of the file to read
     * @return the most frequent substrings and their occurrence count
     */
    public static Map<ReturnKey, Object> searchMostSubstringsByFile(int chainNumber, String filePath) {
        List<String> mostSubstrings = new ArrayList<>();
        Map<ReturnKey, Object> returnMap = new LinkedHashMap<>(2);
        returnMap.put(ReturnKey.COUNT, 0);
        returnMap.put(ReturnKey.SUBSTRINGS, mostSubstrings);
        if (strIsEmpty(filePath)) {
            return returnMap;
        }
        File file = new File(filePath);
        if (file.exists()) {
            // try-with-resources closes the reader even if reading fails.
            // FileReader decodes the file with the platform default charset.
            try (FileReader fileReader = new FileReader(file)) {
                char[] readChar = new char[1024];
                StringBuilder waitParsingStr = new StringBuilder();
                int readLength;
                // Any line breaks in the file become part of the text and are counted as characters
                while ((readLength = fileReader.read(readChar)) != -1) {
                    waitParsingStr.append(readChar, 0, readLength);
                }
                return searchMostSubstrings(chainNumber, waitParsingStr.toString());
            } catch (IOException e) {
                System.out.println(e.getMessage());
            }
        }
        return returnMap;
    }

    /**
     * Finds the set of substrings that occur most often in a string.
     *
     * @param chainNumber    how many consecutive characters form one substring, i.e. the substring length
     * @param waitParsingStr the string to parse
     * @return the most frequent substrings and their occurrence count
     */
    public static Map<ReturnKey, Object> searchMostSubstrings(int chainNumber, String waitParsingStr) {
        // The most frequent substrings found so far; this list is returned to the caller
        List<String> mostSubstrings = new ArrayList<>();
        Map<ReturnKey, Object> returnMap = new LinkedHashMap<>(2);
        returnMap.put(ReturnKey.COUNT, 0);
        returnMap.put(ReturnKey.SUBSTRINGS, mostSubstrings);
        // Reject null or empty input and non-positive lengths before touching the string
        if (strIsEmpty(waitParsingStr) || chainNumber <= 0) {
            return returnMap;
        }
        // Length of the string to parse
        int waitParsingStrSize = waitParsingStr.length();
        System.out.println("Size of string to parse : " + waitParsingStrSize + " , content of string to parse : " + waitParsingStr);
        if (chainNumber > waitParsingStrSize) {
            return returnMap;
        }
        // Highest occurrence count seen so far
        int mostSubstringCount = 0;
        // All distinct substrings that have already been counted
        Set<String> substrings = new HashSet<>();
        // Slide a window of chainNumber characters over the string, one character at a time
        for (int i = 0; i + chainNumber <= waitParsingStrSize; i++) {
            String substr = waitParsingStr.substring(i, i + chainNumber);
            // Skip substrings that have already been counted (add returns false for duplicates)
            if (!substrings.add(substr)) {
                continue;
            }
            // Number of times this substring occurs in the whole string
            int substrCount = countStr(waitParsingStr, substr);
            if (substrCount > mostSubstringCount) {
                // New maximum: discard the previous candidates and keep only this substring
                mostSubstrings.clear();
                mostSubstrings.add(substr);
            } else if (substrCount == mostSubstringCount) {
                // Tie with the current maximum: keep this substring as well
                mostSubstrings.add(substr);
            }
            // Remember the highest count seen so far
            mostSubstringCount = Math.max(substrCount, mostSubstringCount);
        }
        returnMap.put(ReturnKey.COUNT, mostSubstringCount);
        return returnMap;
    }

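    /**
     * Counts how often sToFind occurs in str, including overlapping occurrences.
     *
     * @param str     the original string
     * @param sToFind the string to look for
     * @return the number of times sToFind occurs in str
     */
    private static int countStr(String str, String sToFind) {
        // An empty search string would match at every index and never terminate
        if (sToFind.isEmpty()) {
            return 0;
        }
        int num = 0;
        // Restart the search one character after each match so that overlapping
        // occurrences (for example "aa" in "aaa") are counted as well
        int index = str.indexOf(sToFind);
        while (index != -1) {
            num++;
            index = str.indexOf(sToFind, index + 1);
        }
        return num;
    }
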
    /**
     * Checks whether a string is null or empty.
     *
     * @param str the string to check
     * @return true if the string is null or empty, false otherwise
     */
    private static boolean strIsEmpty(String str) {
        return str == null || str.isEmpty();
    }

    public static void main(String[] args) {
        Map<ReturnKey, Object> returnKeyObjectMap1 = searchMostSubstrings(2, testTtr);
        System.out.println("Highest occurrence count of a substring : " + returnKeyObjectMap1.get(ReturnKey.COUNT));
        System.out.println("Most frequent substrings : " + returnKeyObjectMap1.get(ReturnKey.SUBSTRINGS));
        Map<ReturnKey, Object> returnKeyObjectMap2 = searchMostSubstringsByFile(2, "C:\\Users\\Administrator\\Desktop\\test\\src\\com\\company\\test.txt");
        System.out.println("Highest occurrence count of a substring : " + returnKeyObjectMap2.get(ReturnKey.COUNT));
        System.out.println("Most frequent substrings : " + returnKeyObjectMap2.get(ReturnKey.SUBSTRINGS));
    }
}
The output looks like this:
Size of string to parse : 1860 , content of string to parse : oneofthecentralresultsofairesearchinthe1970swasthattoachievegoodperformanceaisystemsmusthavelargeamountsofknowledgeknowledgeispowertheslogangoeshumansclearlyusevastamountsofknowledgeandifaiistoachieveitslongtermgoalsaisystemsmustalsousevastamountssincehandcodinglargeamountsofknowledgeintoasystemisslowtediousanderrorpronemachinelearningtechniqueshavebeendevelopedtoautomaticallyacquireknowledgeoftenintheformofifthenrulesproductionsunfortunatelythishasoftenledtoautilityproblemminton1988bthelearninghascausedanoverallslowdowninthesystemforexampleinmanysystemslearnedrulesareusedtoreducethenumberofbasicstepsthesystemtakesinordertosolveproblemsbypruningthesystemssearchspaceforinstancebutinordertodetermineateachstepwhichrulesareapplicablethesystemmustmatchthemagainstitscurrentsituationusingcurrenttechniquesthematcherslowsdownasmoreandmorerulesareacquiredsoeachsteptakeslongerandlongerthisectcanoutweighthereductioninthenumberofstepstakensothatthenetresultisaslowdownthishasbeenobservedinseveralrecentsystemsminton1988aetzioni1990tambeetal1990cohen1990ofcoursetheproblemofslowdownfromincreasingmatchcostisnotrestrictedtosystemsinwhichthepurposeofrulesistoreducethenumberofproblemsolvingstepsasystemacquiringnewrulesforanypurposecanslowdowniftherulessignicantlyincreasethematchcostandintuitivelyoneexpectsthatthemoreproductionsthereareinasystemthehigherthetotalmatchcostwillbethethesisofthisresearchisthatwecansolvethisprobleminabroadclassofsystemsbyimprovingthematchalgorithmtheyuseinessenceouraimistoenablethescalingupofthenumberofrulesinproductionsystemsweadvancethestateoftheartinproductionmatchalgorithmsdevelopinganimprovedmatchalgorithmwhoseperformancescaleswellonasignicantlybroaderclassofsystemsthanexistingalgorithmsfurthermorewedemonstratethatbyusingthisimprovedmatchalgorithmwecanreduceoravoidtheutilityprobleminalargeclassofmachinelearningsystems
Highest occurrence count of a substring : 53
Most frequent substrings : [th]
Size of string to parse : 1860 , content of string to parse : oneofthecentralresultsofairesearchinthe1970swasthattoachievegoodperformanceaisystemsmusthavelargeamountsofknowledgeknowledgeispowertheslogangoeshumansclearlyusevastamountsofknowledgeandifaiistoachieveitslongtermgoalsaisystemsmustalsousevastamountssincehandcodinglargeamountsofknowledgeintoasystemisslowtediousanderrorpronemachinelearningtechniqueshavebeendevelopedtoautomaticallyacquireknowledgeoftenintheformofifthenrulesproductionsunfortunatelythishasoftenledtoautilityproblemminton1988bthelearninghascausedanoverallslowdowninthesystemforexampleinmanysystemslearnedrulesareusedtoreducethenumberofbasicstepsthesystemtakesinordertosolveproblemsbypruningthesystemssearchspaceforinstancebutinordertodetermineateachstepwhichrulesareapplicablethesystemmustmatchthemagainstitscurrentsituationusingcurrenttechniquesthematcherslowsdownasmoreandmorerulesareacquiredsoeachsteptakeslongerandlongerthisectcanoutweighthereductioninthenumberofstepstakensothatthenetresultisaslowdownthishasbeenobservedinseveralrecentsystemsminton1988aetzioni1990tambeetal1990cohen1990ofcoursetheproblemofslowdownfromincreasingmatchcostisnotrestrictedtosystemsinwhichthepurposeofrulesistoreducethenumberofproblemsolvingstepsasystemacquiringnewrulesforanypurposecanslowdowniftherulessignicantlyincreasethematchcostandintuitivelyoneexpectsthatthemoreproductionsthereareinasystemthehigherthetotalmatchcostwillbethethesisofthisresearchisthatwecansolvethisprobleminabroadclassofsystemsbyimprovingthematchalgorithmtheyuseinessenceouraimistoenablethescalingupofthenumberofrulesinproductionsystemsweadvancethestateoftheartinproductionmatchalgorithmsdevelopinganimprovedmatchalgorithmwhoseperformancescaleswellonasignicantlybroaderclassofsystemsthanexistingalgorithmsfurthermorewedemonstratethatbyusingthisimprovedmatchalgorithmwecanreduceoravoidtheutilityprobleminalargeclassofmachinelearningsystems
Highest occurrence count of a substring : 53
Most frequent substrings : [th]
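The code takes the number of characters per combination as a parameter, so it works not only for the 2 characters asked for in the problem but also for longer or shorter combinations. It can also find the most frequent character combinations in a text file.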
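One possible variation, offered only as a sketch: the occurrence counts can be collected in a single pass over the text with a `HashMap`, instead of rescanning the whole text for every distinct combination. The class name `NGramCounterSketch` and the methods `countNGrams` and `mostFrequent` are hypothetical names introduced for illustration, and the sketch assumes that overlapping occurrences count, since every window position is tallied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names are illustrative only.
class NGramCounterSketch {

    // Tallies every window of n consecutive characters in a single pass.
    static Map<String, Integer> countNGrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        if (text == null || n <= 0 || n > text.length()) {
            return counts; // nothing to count
        }
        for (int i = 0; i + n <= text.length(); i++) {
            // merge inserts 1 for a new key or adds 1 to the existing count
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Returns all combinations that share the highest count, sorted alphabetically.
    static List<String> mostFrequent(Map<String, Integer> counts) {
        int max = 0;
        List<String> best = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            int count = entry.getValue();
            if (count > max) {
                max = count;
                best.clear();
                best.add(entry.getKey());
            } else if (count == max) {
                best.add(entry.getKey());
            }
        }
        Collections.sort(best); // HashMap iteration order is unspecified
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countNGrams("1252336528952", 2);
        System.out.println(mostFrequent(counts) + " x " + counts.get("52")); // prints [52] x 3
    }
}
```

The trade-off is memory for time: the map holds one entry per distinct combination, but the text is scanned only once, which matters mainly for long inputs such as large files.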