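While searching today for how to tell whether a character is Chinese, I came across this interview question:
Question:
Write a function that truncates a string. The input is a string and a byte count, and the output is the string truncated to that many bytes. A Chinese character must never be cut in half: for "我ABC" and 4 the result should be "我AB", and for "我ABC汉DEF" and 6 the result should be "我ABC", not "我ABC" plus half of "汉".
I wrote a more general version that can extract any byte range of a string and supports different encodings: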
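import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.regex.Pattern;

// Compiled once instead of on every call; matches a single character in U+4E00..U+9FA5.
private static final Pattern CHINESE_CHAR = Pattern.compile("[\\u4e00-\\u9fa5]");

/** Returns true if ch is a Chinese character in the range U+4E00..U+9FA5. */
public static boolean isChinseseChar(char ch) {
    return CHINESE_CHAR.matcher(String.valueOf(ch)).find();
}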
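/**
 * Extracts the part of a string that lies between two byte offsets without
 * splitting a Chinese character: a character that is only partly inside the
 * byte range is left out of the result.
 * For example, with a two-byte encoding such as GBK, "我ABC" cut at 4 bytes gives "我AB",
 * and "我ABC汉DEF" cut at 6 bytes gives "我ABC" rather than "我ABC" plus half of "汉".
 * @param source the source string
 * @param startPos byte offset at which to start
 * @param endPos byte offset at which to stop (exclusive)
 * @param charset name of the encoding used to count bytes
 * @return the extracted substring
 * @throws UnsupportedEncodingException if the named charset is not supported
 */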
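public static String cutString(String source, int startPos, int endPos, String charset) throws UnsupportedEncodingException {
    byte[] bs = source.getBytes(charset);
    if (startPos > bs.length)
        throw new IllegalArgumentException("startPos is greater than the total number of bytes in \"" + source + "\"");
    if (endPos > bs.length)
        endPos = bs.length;
    int offset = 0; // byte offset at which the character at index i starts
    int factor = "编".getBytes(charset).length; // bytes per Chinese character in this charset
    int start = -1, end = -1; // character indexes of the result, -1 while still unknown
    for (int i = 0; i < source.length(); i++) {
        if (offset == startPos) // startPos falls exactly on a character boundary
            start = i;
        if (offset > startPos && start == -1) // offset passed startPos without ever equalling it: startPos fell inside a Chinese character, so start at the next character
            start = i;
        // Every character that is not Chinese is assumed to occupy exactly one byte.
        if (isChinseseChar(source.charAt(i)))
            offset += factor;
        else
            offset++;
        if (offset == endPos) { // character i ends exactly at endPos, so it is included
            end = i + 1;
        }
        if (offset > endPos && end == -1) { // character i crosses endPos, so it is excluded
            end = i;
        }
        if (start != -1 && end != -1)
            break;
    }
    // start is still -1 when startPos lies inside the last character or at the very end
    // of the string; without this check substring(-1, end) would throw.
    if (start == -1 || start >= end)
        return "";
    return source.substring(start, end);
}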
public static String cutString(String source, int startPos, int endPos) throws UnsupportedEncodingException {
    return cutString(source, startPos, endPos, Charset.defaultCharset().name());
}

public static String cutString(String source, int endPos) throws UnsupportedEncodingException {
    return cutString(source, 0, endPos, Charset.defaultCharset().name());
}

public static String cutString(String source, int endPos, String charset) throws UnsupportedEncodingException {
    return cutString(source, 0, endPos, charset);
}
/**
 * @param args
 * @throws IOException
 */
public static void main(String[] args) throws IOException {
    String source = "我是abc";
    System.out.println(source + "[0,4](gbk): " + cutString(source, 0, 4, "gbk"));
    System.out.println(source + "[0,4](utf-8): " + cutString(source, 0, 4, "utf-8"));
    source = "adf12我是abc";
    System.out.println(source + "[0,2](gbk): " + cutString(source, 0, 2, "gbk"));
    System.out.println(source + "[0,2](utf-8): " + cutString(source, 0, 2, "utf-8"));
    source = "add她fdf2我是abc";
    System.out.println(source + "[0,11](gbk): " + cutString(source, 0, 11, "gbk"));
    System.out.println(source + "[0,11](utf-8): " + cutString(source, 0, 11, "utf-8"));
    System.out.println(source + "[10,11](gbk): " + cutString(source, 10, 11, "gbk"));
    System.out.println(source + "[10,11](utf-8): " + cutString(source, 10, 11, "utf-8"));
    System.out.println(source + "[8,12](gbk): " + cutString(source, 8, 12, "gbk"));
    System.out.println(source + "[8,12](utf-8): " + cutString(source, 8, 12, "utf-8"));
    System.out.println(source + "[9,11](gbk): " + cutString(source, 9, 11, "gbk"));
    System.out.println(source + "[9,11](utf-8): " + cutString(source, 9, 11, "utf-8"));
    System.out.println(source + "[0,20](gbk): " + cutString(source, 0, 20, "gbk"));
    System.out.println(source + "[0,20](utf-8): " + cutString(source, 0, 20, "utf-8"));
}
Output:
我是abc[0,4](gbk): 我是
我是abc[0,4](utf-8): 我
adf12我是abc[0,2](gbk): ad
adf12我是abc[0,2](utf-8): ad
add她fdf2我是abc[0,11](gbk): add她fdf2我
add她fdf2我是abc[0,11](utf-8): add她fdf2
add她fdf2我是abc[10,11](gbk):
add她fdf2我是abc[10,11](utf-8):
add她fdf2我是abc[8,12](gbk): 2我
add她fdf2我是abc[8,12](utf-8): f2
add她fdf2我是abc[9,11](gbk): 我
add她fdf2我是abc[9,11](utf-8): 2
add她fdf2我是abc[0,20](gbk): add她fdf2我是abc
add她fdf2我是abc[0,20](utf-8): add她fdf2我是abc
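One limitation of the byte counting in cutString: every character outside U+4E00..U+9FA5 is counted as a single byte. That holds for ASCII, but not for accented letters or full-width punctuation in UTF-8, for example. As a rough sketch of one way around this, the hypothetical helper `ByteRangeCut.cut` below (my own name, not part of the code above) measures each character's encoded length with `String.getBytes` instead of guessing it. It assumes a charset that encodes each character independently and without a byte-order mark, such as GBK or UTF-8; with "UTF-16", `getBytes` would prepend a BOM to every single character and the counts would be wrong.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; the class and method names are assumptions.
class ByteRangeCut {

    /**
     * Returns the characters of source that lie entirely inside the byte range
     * [startPos, endPos) when source is encoded with the named charset.
     * A character that crosses either end of the range is left out.
     */
    static String cut(String source, int startPos, int endPos, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        StringBuilder result = new StringBuilder();
        int offset = 0; // byte offset at which the current character starts
        int i = 0;
        while (i < source.length()) {
            int codePoint = source.codePointAt(i);
            int charCount = Character.charCount(codePoint); // 2 for a surrogate pair
            String ch = source.substring(i, i + charCount);
            int byteLen = ch.getBytes(charset).length; // measured, not assumed
            if (offset + byteLen > endPos) {
                break; // this character would cross endPos
            }
            if (offset >= startPos) {
                result.append(ch); // the character starts at or after startPos
            }
            offset += byteLen;
            i += charCount;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The first sample string from the post, written with Unicode escapes
        // so the file compiles under any source encoding.
        String source = "\u6211\u662fabc";
        System.out.println(cut(source, 0, 4, "UTF-8")); // one Chinese character fits
        System.out.println(cut(source, 0, 4, "GBK"));   // two Chinese characters fit
        // U+00E9 takes two bytes in UTF-8 although it is not a Chinese character.
        System.out.println(cut("caf\u00e9!", 0, 4, "UTF-8")); // prints "caf"
    }
}
```

For the test strings in main this should give the same results as cutString, for instance "f2" for the byte range [8,12] of "add她fdf2我是abc" in UTF-8. The difference only shows up with input such as "café!", where the four-byte limit falls in the middle of "é". Unlike cutString, this sketch returns an empty string rather than throwing when startPos is past the end of the data.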