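In big-data processing, data usually has to be cleaned and transformed, and a common task is pulling values out of strings that follow a given format so they can be used as variables. The delimiters may be fixed or may vary. Take a format such as data tag:date:category:{unique id}. When only the date and the unique id are needed, the usual approach is to walk the string with code written for that one format. Could we instead define a rule string in the same format as the data and use it to extract the values? Could placeholders mark the position of each value, so that one sequential pass pulls the values out?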
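As the figure above shows, 20220809 becomes the program variable day and 1130002 becomes id: the string is compared against a known rule, the variables are assigned automatically, and the required data is extracted.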
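Approach: 1. Use placeholders to turn the varying parts into named variables, e.g. product:${day}:文具:{${id}} (文具 means "stationery")
2. Parse the placeholders in order, recording the character immediately before and after each one and whether it has a start or end marker, to build a placeholder list
3. Scan the input string in order, check each character against the placeholder list, and use each placeholder's start and end markers to extract the corresponding value
4. Store each extracted value in a Map, with the placeholder's variable name as the key.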
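The four steps can be condensed into a short sketch. This is not the byte-based implementation developed in this article: it works on String values and matches the whole literal text between placeholders rather than single marker bytes. The names PlaceholderSketch and extract are made up for illustration, and the sketch assumes that two placeholders are never directly adjacent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PlaceholderSketch {
    // Hypothetical helper: returns placeholder name -> extracted value.
    // Stops early (returning what it has so far) if the input does not fit the template.
    static Map<String, String> extract(String template, String input) {
        Map<String, String> values = new LinkedHashMap<>();
        int t = 0; // current position in the template
        int p = 0; // current position in the input
        while (true) {
            int open = template.indexOf("${", t);
            if (open < 0) break;
            int close = template.indexOf('}', open);
            if (close < 0) break;

            // The literal text in front of the placeholder must appear in the input.
            String before = template.substring(t, open);
            int at = input.indexOf(before, p);
            if (at < 0) break;
            int valueStart = at + before.length();

            // Literal text behind the placeholder, up to the next placeholder or the end.
            int nextOpen = template.indexOf("${", close + 1);
            String after = nextOpen < 0
                    ? template.substring(close + 1)
                    : template.substring(close + 1, nextOpen);
            int valueEnd = after.isEmpty() ? input.length() : input.indexOf(after, valueStart);
            if (valueEnd < 0) break;

            values.put(template.substring(open + 2, close), input.substring(valueStart, valueEnd));
            t = close + 1;
            p = valueEnd;
        }
        return values;
    }

    public static void main(String[] args) {
        // \u6587\u5177 is the category text used in the example above
        String template = "product:${day}:\u6587\u5177:{${id}}";
        String input = "product:20220809:\u6587\u5177:{11300022}";
        System.out.println(extract(template, input)); // prints {day=20220809, id=11300022}
    }
}
```

Matching the full literal text is stricter than comparing one byte on each side of the placeholder, at the cost of more string searching per value.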
Define the rule entity:
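public class RuleInstEntity {
    private String fieldCode;       // name of the variable the value is assigned to
    private byte[] startIndPrefix;  // marker bytes that precede the value (where extraction starts)
    private byte[] endIndPrefix;    // marker bytes that follow the value (where extraction ends)
    // 0: no start marker (the value is at the start of the string)
    // 1: both a start marker and an end marker
    // 2: no end marker (the value is at the end of the string)
    private int flag = 1;
    // getters and setters (used by the code that follows) are omitted
}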
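The parsing code calls accessors such as setFieldCode and getFlag that the listing above does not show. A minimal sketch of the entity with plain JavaBean-style getters and setters follows; that the original uses exactly this style of accessor is an assumption. RuleInstDemo is a hypothetical helper that builds by hand the two rules the template product:${day}:文具:{${id}} is expected to produce under the parsing logic in this article (one-byte markers, and flag 2 when fewer than two bytes follow the placeholder).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RuleInstDemo {
    // Hypothetical convenience factory, not part of the original code.
    static RuleInstEntity rule(String fieldCode, String startMarker, String endMarker, int flag) {
        RuleInstEntity r = new RuleInstEntity();
        r.setFieldCode(fieldCode);
        r.setStartIndPrefix(startMarker.getBytes(StandardCharsets.UTF_8));
        r.setEndIndPrefix(endMarker.getBytes(StandardCharsets.UTF_8));
        r.setFlag(flag);
        return r;
    }

    // Rules expected for the example template: day sits between two ':' bytes,
    // id follows '{' and is treated as having no end marker.
    static List<RuleInstEntity> exampleRules() {
        List<RuleInstEntity> rules = new ArrayList<>();
        rules.add(rule("day", ":", ":", 1));
        rules.add(rule("id", "{", "", 2));
        return rules;
    }

    public static void main(String[] args) {
        for (RuleInstEntity r : exampleRules()) {
            System.out.println(r.getFieldCode()
                    + " start=" + new String(r.getStartIndPrefix(), StandardCharsets.UTF_8)
                    + " end=" + new String(r.getEndIndPrefix(), StandardCharsets.UTF_8)
                    + " flag=" + r.getFlag());
        }
    }
}

class RuleInstEntity {
    private String fieldCode;
    private byte[] startIndPrefix;
    private byte[] endIndPrefix;
    private int flag = 1;

    public String getFieldCode() { return fieldCode; }
    public void setFieldCode(String fieldCode) { this.fieldCode = fieldCode; }
    public byte[] getStartIndPrefix() { return startIndPrefix; }
    public void setStartIndPrefix(byte[] startIndPrefix) { this.startIndPrefix = startIndPrefix; }
    public byte[] getEndIndPrefix() { return endIndPrefix; }
    public void setEndIndPrefix(byte[] endIndPrefix) { this.endIndPrefix = endIndPrefix; }
    public int getFlag() { return flag; }
    public void setFlag(int flag) { this.flag = flag; }
}
```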
Function that parses the placeholder rule (converts product:${day}:文具:{${id}} into a List<RuleInstEntity>):
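// Requires: java.util.ArrayList, java.util.List
public static List<RuleInstEntity> setRuleInstEntity(String varPattern) {
    List<RuleInstEntity> ruleInstEntities = new ArrayList<>();
    byte[] varBytes = varPattern.getBytes();
    int beg = -1;   // index of a '$' that may open a placeholder
    int start = -1; // index of the '{' right after that '$'
    int end = -1;   // index of the '}' that closes the placeholder
    for (int i = 0; i < varBytes.length; i++) {
        if (beg == -1) {
            if (varBytes[i] == '$') {
                beg = i;
            }
        } else {
            if (beg >= 0 && i == beg + 1) {
                if (varBytes[i] != '{') {
                    // '$' not followed by '{': not a placeholder
                    beg = -1;
                    start = -1;
                } else {
                    start = i;
                }
            }
            if (start >= 0 && varBytes[i] == '}') {
                end = i;
                // The variable name is the text between "${" and "}"
                byte[] varFieldCode = ByteUtils.getFliedBytes(start + 1, end, varBytes);
                RuleInstEntity ruleInstEntity = new RuleInstEntity();
                ruleInstEntity.setFieldCode(new String(varFieldCode));
                if (start == 1) {
                    // Placeholder at the very start of the pattern: no start marker
                    ruleInstEntity.setStartIndPrefix(new byte[0]);
                    ruleInstEntity.setFlag(0);
                } else {
                    // Start marker: the single byte just before '$'
                    byte[] indexPrefix = ByteUtils.getFliedBytes(start - 2, start - 1, varBytes);
                    ruleInstEntity.setStartIndPrefix(indexPrefix);
                }
                if (end + 2 < varBytes.length) {
                    // End marker: the single byte just after '}'
                    byte[] endPrefix = ByteUtils.getFliedBytes(end + 1, end + 2, varBytes);
                    ruleInstEntity.setEndIndPrefix(endPrefix);
                } else {
                    // Fewer than two bytes follow '}': treated as having no end marker
                    ruleInstEntity.setEndIndPrefix(new byte[0]);
                    ruleInstEntity.setFlag(2);
                }
                ruleInstEntities.add(ruleInstEntity);
                beg = -1;
                start = -1;
            }
        }
    }
    return ruleInstEntities;
}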
Reverse extraction from a formatted string: for example, product:20220809:文具:{11300022} is converted to day->20220809, id->11300022
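// Requires: java.util.HashMap, java.util.List, java.util.Map
public static Map<String, String> getVarMap(List<RuleInstEntity> ruleInstEntities, String varString) {
    Map<String, String> varMap = new HashMap<>();
    if (ruleInstEntities.isEmpty()) {
        return varMap; // no placeholders, nothing to extract
    }
    byte[] bytes = varString.getBytes();
    int j = 0; // index of the rule currently being matched
    byte[] sBytesPrefix = ruleInstEntities.get(j).getStartIndPrefix();
    byte[] eBytesPrefix = ruleInstEntities.get(j).getEndIndPrefix();
    int flag = ruleInstEntities.get(j).getFlag();
    String fieldCode = ruleInstEntities.get(j).getFieldCode();
    int ik = 0;     // number of end-marker bytes matched so far
    int start = -1; // index where the start marker begins (or where the value begins, for flag 0)
    int end = -1;   // index of the byte that completes the end marker
    int isk = 0;    // number of start-marker bytes matched so far
    for (int i = 0; i < bytes.length; ) {
        if (flag == 0) { // no start marker: the value begins at index 0
            if (start == -1) start = 0;
            if (bytes[i] == eBytesPrefix[ik]) {
                ik++;
                if (ik >= eBytesPrefix.length) {
                    end = i;
                    varMap.put(fieldCode, ByteUtils.getFliedString(start, end, bytes));
                    ik = 0;
                    start = -1;
                    j++;
                    // i is not advanced: this byte may also be the next rule's start marker
                    if (j < ruleInstEntities.size()) {
                        sBytesPrefix = ruleInstEntities.get(j).getStartIndPrefix();
                        eBytesPrefix = ruleInstEntities.get(j).getEndIndPrefix();
                        flag = ruleInstEntities.get(j).getFlag();
                        fieldCode = ruleInstEntities.get(j).getFieldCode();
                    } else {
                        break;
                    }
                } else {
                    i++;
                }
            } else {
                start = -1;
                ik = 0;
                i++;
            }
        } else if (flag == 1) { // both a start marker and an end marker
            if (isk < sBytesPrefix.length && bytes[i] == sBytesPrefix[isk]) {
                // Still matching the start marker
                if (isk == 0) {
                    start = i;
                }
                isk++;
                i++;
            } else {
                if (isk >= sBytesPrefix.length) {
                    // Start marker fully matched: look for the end marker
                    if (ik < eBytesPrefix.length && bytes[i] == eBytesPrefix[ik]) {
                        ik++;
                        if (ik >= eBytesPrefix.length) {
                            end = i;
                            // The value lies between the start marker and the end marker
                            varMap.put(fieldCode, ByteUtils.getFliedString(start + sBytesPrefix.length, end, bytes));
                            ik = 0;
                            start = -1;
                            j++;
                            isk = 0;
                            // i is not advanced: this byte may also be the next rule's start marker
                            if (j < ruleInstEntities.size()) {
                                sBytesPrefix = ruleInstEntities.get(j).getStartIndPrefix();
                                eBytesPrefix = ruleInstEntities.get(j).getEndIndPrefix();
                                flag = ruleInstEntities.get(j).getFlag();
                                fieldCode = ruleInstEntities.get(j).getFieldCode();
                            } else {
                                break;
                            }
                        }
                    } else {
                        if (ik > 0) {
                            // A partial end-marker match failed: start over
                            start = -1;
                            isk = 0;
                            ik = 0;
                        }
                        i++;
                    }
                } else {
                    // Start marker not matched at this byte: reset and move on
                    start = -1;
                    isk = 0;
                    ik = 0;
                    i++;
                }
            }
        } else if (flag == 2) { // no end marker
            if (isk < sBytesPrefix.length && bytes[i] == sBytesPrefix[isk]) {
                if (isk == 0) {
                    start = i;
                }
                isk++;
                if (isk >= sBytesPrefix.length) {
                    // The value runs from the byte after start up to, but not including,
                    // the last byte of the input (the closing '}' in the example)
                    varMap.put(fieldCode, ByteUtils.getFliedString(start + 1, bytes.length - 1, bytes));
                    break;
                } else {
                    i++;
                }
            } else {
                start = -1;
                isk = 0;
                i++;
            }
        }
    }
    return varMap;
}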
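For comparison, and as a way to cross-check the output of getVarMap on test inputs, the same template can be translated into a regular expression with named groups. This swaps in java.util.regex for the hand-written byte scanner, so it is a different technique from the one in this article. TemplateRegex and extract are hypothetical names, and the sketch assumes placeholder names consist of ASCII letters and digits, start with a letter, and are unique within a template.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateRegex {
    // Matches ${name}; group 1 captures the name.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z][A-Za-z0-9]*)\\}");

    // Hypothetical helper: returns name -> value, or an empty map if the input
    // does not match the template as a whole.
    static Map<String, String> extract(String template, String input) {
        Matcher ph = PLACEHOLDER.matcher(template);
        StringBuilder regex = new StringBuilder();
        List<String> names = new ArrayList<>();
        int last = 0;
        while (ph.find()) {
            // Literal text is quoted so that characters such as '{' are not regex syntax.
            regex.append(Pattern.quote(template.substring(last, ph.start())));
            // Each placeholder becomes a lazy named group.
            regex.append("(?<").append(ph.group(1)).append(">.*?)");
            names.add(ph.group(1));
            last = ph.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Map<String, String> values = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex.toString()).matcher(input);
        if (m.matches()) {
            for (String name : names) {
                values.put(name, m.group(name));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // \u6587\u5177 is the category text used in the example above
        String template = "product:${day}:\u6587\u5177:{${id}}";
        String input = "product:20220809:\u6587\u5177:{11300022}";
        System.out.println(extract(template, input)); // prints {day=20220809, id=11300022}
    }
}
```

Unlike the byte scanner, this version requires the whole input to match the template's literal text, so a mismatch yields an empty map rather than a partial result.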
The helper functions getFliedString and getFliedBytes (in ByteUtils) are implemented as follows:
// Decodes the bytes in [curPos, endPos) as a String
public static String getFliedString(int curPos, int endPos, byte[] values) {
    byte[] copy = getFliedBytes(curPos, endPos, values);
    return new String(copy);
}

// Copies the bytes in [curPos, endPos): curPos is inclusive, endPos is exclusive
public static byte[] getFliedBytes(int curPos, int endPos, byte[] values) {
    byte[] copy = new byte[endPos - curPos];
    System.arraycopy(values, curPos, copy, 0, endPos - curPos);
    return copy;
}
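Because both helpers take byte offsets, not character offsets, the positions shift once the string contains multi-byte characters such as 文具. The sketch below reproduces the two helpers and applies them to the example string. Making the charset explicit (UTF-8) is an assumption added here for reproducibility; the listing above relies on the platform default charset.

```java
import java.nio.charset.StandardCharsets;

class ByteUtils {
    // Decodes the bytes in [curPos, endPos) as UTF-8 (charset made explicit in this sketch).
    public static String getFliedString(int curPos, int endPos, byte[] values) {
        return new String(getFliedBytes(curPos, endPos, values), StandardCharsets.UTF_8);
    }

    // Copies the bytes in [curPos, endPos): curPos inclusive, endPos exclusive.
    public static byte[] getFliedBytes(int curPos, int endPos, byte[] values) {
        byte[] copy = new byte[endPos - curPos];
        System.arraycopy(values, curPos, copy, 0, endPos - curPos);
        return copy;
    }

    public static void main(String[] args) {
        // \u6587\u5177 is the category text used in the example; each character is 3 bytes in UTF-8
        String s = "product:20220809:\u6587\u5177:{11300022}";
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(s.length() + " chars, " + bytes.length + " bytes"); // 30 chars, 34 bytes
        System.out.println(getFliedString(8, 16, bytes));                // 20220809
        System.out.println(getFliedString(25, bytes.length - 1, bytes)); // 11300022
    }
}
```

The last call mirrors the flag 2 branch of getVarMap: the '{' marker is at byte 24, so the value starts at byte 25 and stops before the final '}' byte.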
Running getVarMap on the example string finally yields the desired result: