How to Protect the Privacy of Text Representations in NLP

This article examines how to protect the privacy of text representations during text mining and text analysis in Natural Language Processing (NLP), focusing on how to process text without leaking sensitive information.


Problem Overview

Recently, we have seen numerous breakthroughs in Natural Language Processing (NLP) owing to the evolution of Deep Learning (DL). The successes began with word2vec, a distributed word representation capable of projecting discrete words into a vector space. Such mappings have revolutionized how syntactic and semantic relations among words are understood and manipulated. One famous example is the analogy arithmetic that word2vec embeddings support:

vec("king") − vec("man") + vec("woman") ≈ vec("queen")
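As a quick sanity check, this analogy can be reproduced in a few lines of Python. This is a minimal sketch, assuming the pretrained Google News vectors available through gensim's downloader:

```python
# A minimal sketch of word2vec analogy arithmetic, assuming the
# pretrained Google News vectors from gensim's downloader.
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")  # large download on first use

# vec("king") - vec("man") + vec("woman") should land near vec("queen")
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.7118...)]
```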


Powered by this technique, a myriad of NLP tasks have achieved human parity and are widely deployed in commercial systems [2,3].


At the core of these accomplishments is representation learning, which extracts the information a task needs, such as semantics, sentiment, and intent. However, because of over-parameterization, DL models also memorize certain unnecessary but sensitive attributes, such as gender, age, and location.
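To make this concrete, here is a minimal sketch of what such a learned representation looks like in practice, assuming the Hugging Face transformers library with bert-base-uncased as the encoder (any pretrained encoder would do):

```python
# Extract a hidden representation from a pretrained encoder.
# Assumes: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "I saw my cardiologist in Sydney last Tuesday."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)

# Mean-pool into a single sentence vector -- the kind of representation
# a client might transmit to a server instead of the raw text.
sentence_repr = hidden.mean(dim=1)
print(sentence_repr.shape)  # torch.Size([1, 768])
```

Nothing in this vector is human-readable, yet, as discussed next, it can still encode attributes of the author that the task never required.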


This private information can be exploited by malicious parties in several settings. First, cloud AI services have become widespread: users can easily annotate their unlabelled datasets via cloud AI platforms such as Microsoft Cognitive Services or the Google Cloud API. However, if eavesdroppers intercept the intermediate representation of a user's input to a cloud AI service, they can perform reverse engineering to recover the original text. Given such privacy concerns, users are unwilling to upload their raw data to servers; instead, they can transmit only the extracted representations. Nevertheless, the input representation after the embedding layer, or an intermediate hidden representation, may still carry sensitive information that can be exploited for adversarial purposes. It has been shown that an attacker can recover private variables with higher-than-chance accuracy using only the hidden representation [4,5]. Such an attack could occur in scenarios where end users send their learned representations to the cloud for grammar correction, translation, or text-analysis tasks, as shown in Fig. 1; a sketch of this probing attack follows below.
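The attack reported in [4,5] can be illustrated with a simple probing classifier. The sketch below uses synthetic data in place of genuinely intercepted representations: it plants a weak correlation between a few dimensions of the vectors and a binary attribute (say, the author's gender), then shows that a logistic-regression "attacker" recovers it well above the 50% chance level:

```python
# A sketch of the probing attack: an eavesdropper who has collected
# (representation, private-attribute) pairs trains a simple classifier
# to recover the attribute from the hidden vectors. The data here is
# synthetic/hypothetical; the attacks in [4,5] use representations
# leaked from an actual encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend these are 768-d sentence representations intercepted in
# transit, weakly correlated with a binary attribute such as gender.
n, d = 2000, 768
gender = rng.integers(0, 2, size=n)
reprs = rng.normal(size=(n, d))
reprs[:, :5] += 0.5 * gender[:, None]  # sensitive signal leaks into a few dims

X_tr, X_te, y_tr, y_te = train_test_split(reprs, gender, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"attacker accuracy: {probe.score(X_te, y_te):.2f}")  # well above 0.50 chance
```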

