Data Desensitization for Long Text

Article link: https://blog.csdn.net/q1204989437/article/details/140182224

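Desensitizing structured data

Data desensitization (masking) is a very common requirement in day-to-day development. For structured data, such as the personal-information fields in a database, it is straightforward: Hutool's desensitization utility, DesensitizedUtil, handles it directly.

First add the dependency: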

<dependency>
  <groupId>cn.hutool</groupId>
  <artifactId>hutool-all</artifactId>
  <version>5.8.16</version>
</dependency>

For example, to mask a Chinese name:

// Output: 张*
String name = "张三";
System.out.println(DesensitizedUtil.chineseName(name));

And to mask a mobile phone number:

// Output: 138****5678
String phone = "13812345678";
System.out.println(DesensitizedUtil.mobilePhone(phone));

However, these utilities simply replace part of a string with "*", so they only work when the whole string is already a single value in a fixed format.
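To make that concrete, here is a minimal sketch of what this kind of masking boils down to. The mask helper is hypothetical (it is not part of Hutool) and only illustrates the idea of keeping a few characters at each end and starring out the rest.

```java
final class MaskSketch {
    // Hypothetical helper, not a Hutool API: keeps the first keepHead and the
    // last keepTail characters and replaces everything in between with '*'.
    static String mask(String value, int keepHead, int keepTail) {
        if (value == null || value.length() <= keepHead + keepTail) {
            return value; // nothing left to hide
        }
        StringBuilder sb = new StringBuilder(value.length());
        sb.append(value, 0, keepHead);
        for (int i = keepHead; i < value.length() - keepTail; i++) {
            sb.append('*');
        }
        sb.append(value, value.length() - keepTail, value.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("13812345678", 3, 4)); // 138****5678
        System.out.println(mask("张三", 1, 0));         // 张*
    }
}
```

The helper has no way of knowing where a phone number starts or ends inside a sentence, which is exactly why a fixed-format input is required.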

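Desensitizing unstructured long text

Long text is a different matter. Take this example:

大家好，我叫段秀英，我的电话号码是：18135881345，家庭住址是：华南山西省晋城市西青区
(Roughly: "Hello everyone, my name is Duan Xiuying, my phone number is 18135881345, and my home address is Huanan, Shanxi Province, Jincheng City, Xiqing District.")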

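Hutool cannot be applied to text like this directly. For information with a fairly fixed format, such as phone numbers and e-mail addresses, regular expressions can do the job.

Regex matching + Hutool masking

First define a regular expression for each kind of information, then locate the matches in the text, and finally mask each match with Hutool.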

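// Phone numbers come in several formats, so the regexes are kept in a list.
// The trailing [1-9]? lets this 3-3-4 pattern also take an 11th digit (it will not take a trailing 0).
List<String> phonePatternList = Arrays.asList("\\(?(\\d{3})\\)?[-. ]?(\\d{3})[-. ]?(\\d{4})[1-9]?");
List<Pattern> phoneRegexList = new ArrayList<>();
String text = "大家好，我叫段秀英，我的电话号码是：18135881345，家庭住址是：华南山西省晋城市西青区";
// Compile every regex
for (String phonePattern : phonePatternList) {
    Pattern phoneRegex = Pattern.compile(phonePattern);
    phoneRegexList.add(phoneRegex);
}
// Match every pattern and mask each hit individually
for (Pattern phoneRegex : phoneRegexList) {
    Matcher phoneMatcher = phoneRegex.matcher(text);
    StringBuffer masked = new StringBuffer();
    while (phoneMatcher.find()) {
        // Mask this match with DesensitizedUtil. Calling replaceAll inside the loop
        // would overwrite every phone number with the first one's masked form.
        String replacement = DesensitizedUtil.mobilePhone(phoneMatcher.group());
        phoneMatcher.appendReplacement(masked, Matcher.quoteReplacement(replacement));
    }
    phoneMatcher.appendTail(masked);
    text = masked.toString();
}
System.out.println(text);
// Output:
// 大家好，我叫段秀英，我的电话号码是：181****1345，家庭住址是：华南山西省晋城市西青区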
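The same find-and-replace approach carries over to e-mail addresses, the other fixed-format example mentioned above. The sketch below is only an illustration: the regex is a simplified assumption that does not cover every valid address, and the masking is written in plain Java so the snippet runs without any dependency.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailMaskSketch {
    // Simplified e-mail regex for illustration only (an assumption, not a full validator).
    private static final Pattern EMAIL =
            Pattern.compile("([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\\.[A-Za-z]{2,})");

    // Keeps the first character of the local part and the whole domain.
    static String maskEmails(String text) {
        Matcher m = EMAIL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String local = m.group(1);
            StringBuilder masked = new StringBuilder();
            masked.append(local.charAt(0));
            for (int i = 1; i < local.length(); i++) {
                masked.append('*');
            }
            masked.append('@').append(m.group(2));
            // quoteReplacement guards against '$' or '\' in the replacement text
            m.appendReplacement(out, Matcher.quoteReplacement(masked.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskEmails("Mail zhangsan@example.com or li@test.org today"));
        // Mail z*******@example.com or l*@test.org today
    }
}
```

Each match is masked on its own, so two different addresses in the same text keep their own first character and domain.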

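Possible optimization

Pattern.compile(phonePattern) is relatively expensive, and recompiling on every call wastes a lot of time. Instead, we can compile only on the first call and keep the compiled patterns in a static variable (effectively a cache), so that every later call simply reads them from there.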

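import cn.hutool.core.util.DesensitizedUtil;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextDesensitization {
    // Phone numbers come in several formats, so the regexes are kept in a list
    public static List<String> phonePatternList = Arrays.asList("\\(?(\\d{3})\\)?[-. ]?(\\d{3})[-. ]?(\\d{4})[1-9]?");
    // Cache of compiled patterns, filled on the first call
    public static List<Pattern> phoneRegexList = new ArrayList<>();

    public static void main(String[] args) {
        String text = "大家好，我叫段秀英，我的电话号码是：18135881345，家庭住址是：华南山西省晋城市西青区";
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000; i++) {
            desensitization(text);
        }
        long end = System.currentTimeMillis();
        System.out.println("Desensitization time (ms): " + (end - start));
    }

    public static void desensitization(String text) {
        // Compile the regexes on the first call only (this lazy check is not thread-safe)
        if (phoneRegexList.isEmpty()) {
            for (String phonePattern : phonePatternList) {
                phoneRegexList.add(Pattern.compile(phonePattern));
            }
        }
        // Match every pattern and mask each hit individually
        for (Pattern phoneRegex : phoneRegexList) {
            Matcher phoneMatcher = phoneRegex.matcher(text);
            StringBuffer masked = new StringBuffer();
            while (phoneMatcher.find()) {
                // Mask this match only; replaceAll here would reuse the first match's mask for all of them
                String replacement = DesensitizedUtil.mobilePhone(phoneMatcher.group());
                phoneMatcher.appendReplacement(masked, Matcher.quoteReplacement(replacement));
            }
            phoneMatcher.appendTail(masked);
            text = masked.toString();
        }
        System.out.println(text);
    }
}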

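Running the desensitization 10,000 times on the same text took 93 ms with the optimized code, versus roughly 50,000 ms without the optimization. That is a substantial difference.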
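A variation on the same caching idea is to compile the patterns in a static final field, so they are built once when the class loads and no first-call check is needed. The sketch below is one possible shape, not the code measured above: the class name, the 11-digit regex, and the plain-Java masking are all assumptions chosen to keep it dependency-free.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrecompiledPhoneMasker {
    // Compiled once at class-load time. The regex is an assumed format:
    // an 11-digit number starting with 1 that is not part of a longer digit run.
    private static final List<Pattern> PHONE_PATTERNS =
            compileAll("(?<!\\d)1\\d{10}(?!\\d)");

    private static List<Pattern> compileAll(String... regexes) {
        List<Pattern> compiled = new ArrayList<>();
        for (String regex : regexes) {
            compiled.add(Pattern.compile(regex));
        }
        return Collections.unmodifiableList(compiled);
    }

    static String maskPhones(String text) {
        for (Pattern pattern : PHONE_PATTERNS) {
            Matcher m = pattern.matcher(text);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                String phone = m.group();
                // Keep the first 3 and last 4 digits, hide the 4 in between
                String masked = phone.substring(0, 3) + "****" + phone.substring(7);
                m.appendReplacement(out, Matcher.quoteReplacement(masked));
            }
            m.appendTail(out);
            text = out.toString();
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(maskPhones("A: 18135881345, B: 13812345678"));
        // A: 181****1345, B: 138****5678
    }
}
```

Because the list is immutable and initialized by the class loader, this version can be called from several threads without extra synchronization.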

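Desensitizing plain-text sensitive information in long text

The approach above handles numeric personal information in long text, but it cannot handle person names, place names, and the like: once embedded in free text, they are much harder to match with a regex than numbers are. This is where a natural language processing tool helps. Here I use HanLP (GitHub - enguangzhang/hanlp-portable: hanlp-portable) to mask person names and place names. (For Python development there are even more tools to choose from.)

First add the dependency: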

<dependency>
  <groupId>com.hankcs</groupId>
  <artifactId>hanlp</artifactId>
  <version>portable-1.8.4</version>
</dependency>

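The rough idea: use HanLP to segment the text and tag each word's part of speech, then mask the words we want to hide. The full part-of-speech tag set is listed in the project's README. Here we mask person names ("nr") and place names ("ns"):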

// Segment the sentence and tag each word's part of speech
Segment segment = HanLP.newSegment();
List<Term> termList = segment.seg(text);

for (Term term : termList) {
    String word = term.word;
    // Tag name such as "nr" or "ns"; startsWith also covers subtypes such as "nrf" and "nsf"
    String nature = String.valueOf(term.nature);
    if (nature.startsWith("ns")) {
        // Place name: mask every character. String.replace is literal,
        // so regex metacharacters inside the word cannot break the replacement.
        text = text.replace(word, DesensitizedUtil.address(word, word.length()));
    } else if (nature.startsWith("nr")) {
        // Person name: keep the first character only
        text = text.replace(word, DesensitizedUtil.chineseName(word));
    }
}
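One thing to keep in mind: replacing by word changes every occurrence of that word in the text, including occurrences the segmenter did not tag as a name. An alternative is to rebuild the text token by token. The sketch below shows that idea without HanLP. The Token class is a hypothetical stand-in for a segmenter's output, and it assumes that concatenating the tokens reproduces the original text.

```java
import java.util.List;

final class TokenMaskSketch {
    // Hypothetical stand-in for a segmenter's output: a word plus its part-of-speech tag.
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    static String stars(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('*');
        }
        return sb.toString();
    }

    // Rebuilds the text, masking only the tokens tagged as names or places.
    static String maskTokens(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        for (Token t : tokens) {
            if (t.tag.startsWith("ns")) {
                // Place name: hide every character
                out.append(stars(t.word.length()));
            } else if (t.tag.startsWith("nr") && !t.word.isEmpty()) {
                // Person name: keep the first character
                out.append(t.word.charAt(0)).append(stars(t.word.length() - 1));
            } else {
                out.append(t.word);
            }
        }
        return out.toString();
    }
}
```

For example, the tokens 我/rr, 叫/vi, 段秀英/nr, ，/w, 住/vi, 在/p, 山西省/ns would come back as 我叫段**，住在***.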

The complete desensitization code:

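import cn.hutool.core.util.DesensitizedUtil;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.common.Term;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextDesensitization {
    // Phone numbers come in several formats, so the regexes are kept in a list
    public static List<String> phonePatternList = Arrays.asList("\\(?(\\d{3})\\)?[-. ]?(\\d{3})[-. ]?(\\d{4})[1-9]?");
    // Cache of compiled patterns, filled on the first call
    public static List<Pattern> phoneRegexList = new ArrayList<>();

    public static void main(String[] args) {
        String text = "大家好，我叫段秀英，我的电话号码是：18135881345，家庭住址是：华南山西省晋城市西青区";
        desensitization(text);
    }

    public static void desensitization(String text) {
        // Compile the regexes on the first call only (this lazy check is not thread-safe)
        if (phoneRegexList.isEmpty()) {
            for (String phonePattern : phonePatternList) {
                phoneRegexList.add(Pattern.compile(phonePattern));
            }
        }

        // Match every pattern and mask each hit individually
        for (Pattern phoneRegex : phoneRegexList) {
            Matcher phoneMatcher = phoneRegex.matcher(text);
            StringBuffer masked = new StringBuffer();
            while (phoneMatcher.find()) {
                // Mask this match only; replaceAll here would reuse the first match's mask for all of them
                String replacement = DesensitizedUtil.mobilePhone(phoneMatcher.group());
                phoneMatcher.appendReplacement(masked, Matcher.quoteReplacement(replacement));
            }
            phoneMatcher.appendTail(masked);
            text = masked.toString();
        }

        // Segment the sentence and tag each word's part of speech
        Segment segment = HanLP.newSegment();
        List<Term> termList = segment.seg(text);

        for (Term term : termList) {
            String word = term.word;
            // Tag name such as "nr" or "ns"; startsWith also covers subtypes such as "nrf" and "nsf"
            String nature = String.valueOf(term.nature);
            if (nature.startsWith("ns")) {
                // Place name: mask every character (String.replace is literal, not a regex)
                text = text.replace(word, DesensitizedUtil.address(word, word.length()));
            } else if (nature.startsWith("nr")) {
                // Person name: keep the first character only
                text = text.replace(word, DesensitizedUtil.chineseName(word));
            }
        }
        System.out.println(text);
    }
}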

Output:

大家好，我叫段**，我的电话号码是：181****1345，家庭住址是：***********

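This roughly meets the need to desensitize sensitive information in long text. HanLP still fails to recognize a fair amount of what we want it to, though, and its author provides many configuration options for trading off speed against accuracy.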