Paper reading: "AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine"

Original: http://www.sohu.com/a/229801262_100118081

AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine

Alibaba Group

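[Abstract] We propose AliMe Chat, an open-domain chatbot engine that integrates the joint results of Information Retrieval (IR) and Sequence to Sequence (Seq2Seq) based generation models. AliMe Chat uses a Seq2Seq based rerank model to optimize the joint results. Extensive experiments show that our engine outperforms both IR and generation based models. We launched AliMe Chat for a real-world industrial application and observed better results than another public chatbot.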

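1 Introduction

Chatbots have boomed over the past few years, e.g., Microsoft's XiaoIce, Apple's Siri, and Google's Google Assistant. Unlike traditional apps, where users interact through simple and structured language (e.g., "submit", "cancel", "order"), chatbots allow users to interact with them using natural language, in text or speech (and even images).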

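We are working on enabling bots to answer customer questions in the E-commerce industry. Currently, our bot serves millions of customer questions per day (mainly in Chinese, with some in English). The majority of them are business-related, but around 5% are chat-oriented (several hundred thousand in number). To offer a better user experience, it is necessary to build an open-domain chatbot engine.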

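Commonly used techniques for building open-domain chatbots include IR models (Ji et al., 2014; Yan et al., 2016b) and generation models (Bahdanau et al., 2015; Sutskever et al., 2014; Vinyals and Le, 2015). Given a question, the former retrieves the nearest question in a Question-Answer (QA) knowledge base and takes the paired answer, while the latter generates an answer with a pre-trained Seq2Seq model. Often, IR models fail to handle long-tail questions that are not close to any question in the QA base, and generation models may produce inconsistent or meaningless answers (Li et al., 2016; Serban et al., 2016).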

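To alleviate these problems, we propose a hybrid approach that integrates both IR and generation models. In our approach, we use a Seq2Seq based rerank model to optimize the joint results. Specifically, for a question, we first use an IR model to retrieve a set of QA pairs and use them as candidate answers, and then rerank the candidate answers with a Seq2Seq model: if the top candidate has a score higher than a certain threshold, it is taken as the answer; otherwise the answer is provided by a generation model (see Figure 1 for the detailed process).

Figure 1: Overview of our hybrid approach.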

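Our paper makes the following contributions:

• We propose a novel hybrid approach that uses a Seq2Seq model to optimize the joint results of IR and generation models.

• We conducted a set of experiments to assess the approach. The results show that our approach outperforms both IR and generation.

• We compared our chatbot engine with a public chatbot. Evidence suggests that our engine performs better.

• We launched AliMe Chat for a real-world industrial application.

The rest of the paper is structured as follows: Section 2 presents our hybrid approach, followed by experiments in Section 3; related work is discussed in Section 4, and Section 5 concludes our work.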

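2 Sequence to Sequence based Rerank Model

We present an overview of our approach in Figure 1. First, we construct a QA knowledge base from the chat log of our online customer service center. Based on this QA knowledge base, we then develop three models: an IR model, a generation model and a rerank model. Two points are worth noting: (1) all three models are word-based (i.e., word segmentation is needed): the input features of the IR model are words, while those of the generation model and the rerank model are word embeddings, which are pre-trained using fasttext (Bojanowski et al., 2016) and further fine-tuned in the two models; (2) our generation model and rerank model are built on the same Seq2Seq structure: the former generates an output, while the latter scores the candidate answers with respect to the input question. Given an input question q and a threshold T, the procedure of our approach is as follows: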

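• First, we use the IR model to retrieve a set of k candidate QA pairs.

• Second, we pair the question q with each candidate answer ri and compute a confidence score o(ri) = s(q, ri) for each pair, using the scoring function in Eqn. 2 of the rerank model.

• Third, we consider the answer r with the maximal score o(r) = max o(ri): if o(r) ≥ T, take the answer r; otherwise output a reply r' from the generation model.

Here, the threshold T is obtained through an empirical study, discussed in Section 3.2.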
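The three-step procedure above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `answer` and the callables `retrieve`, `score` and `generate` are hypothetical stand-ins for the IR model, the rerank scoring function s(q, r) and the generation model.

```python
from typing import Callable, List, Tuple


def answer(question: str,
           retrieve: Callable[[str], List[Tuple[str, str]]],
           score: Callable[[str, str], float],
           generate: Callable[[str], str],
           threshold: float) -> str:
    """Return the best reranked IR answer if its score reaches the
    threshold T, otherwise fall back to a generated reply.

    All three callables are hypothetical placeholders for the models.
    """
    best_answer, best_score = None, float("-inf")
    # Step 1: retrieve k candidate (question, answer) pairs.
    for _, candidate in retrieve(question):
        # Step 2: confidence score o(r_i) = s(q, r_i).
        s = score(question, candidate)
        if s > best_score:
            best_answer, best_score = candidate, s
    # Step 3: accept the top candidate only if o(r) >= T.
    if best_answer is not None and best_score >= threshold:
        return best_answer
    return generate(question)
```

With a low threshold most questions are answered by the reranked candidate; raising it shifts more questions to the generation model.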

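2.1 QA Knowledge Base

We use the chat log of our online customer service center between 2016-01-01 and 2016-06-01 as our original data source (conversations between customers and staff). We construct QA pairs from the conversations by pairing each question with an adjacent answer. When needed, we concatenate consecutive questions (or answers) together. After that, we filter out QA pairs that contain business-related keywords. Finally, we obtain 9,164,834 QA pairs.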
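The pairing step can be illustrated as follows. This is a rough sketch under several assumptions that the text does not specify: the speaker labels "customer" and "staff", the space used to join consecutive utterances, and the choice to drop a pair when a keyword appears in either the question or the answer.

```python
from itertools import groupby


def build_qa_pairs(turns, business_keywords, sep=" "):
    """Build (question, answer) pairs from a chat log.

    turns: list of (speaker, text), speaker is "customer" or "staff"
           (hypothetical labels).
    business_keywords: pairs containing any of these are dropped.
    """
    # Concatenate consecutive utterances from the same speaker.
    merged = [(speaker, sep.join(text for _, text in group))
              for speaker, group in groupby(turns, key=lambda t: t[0])]
    pairs = []
    # Pair each customer turn with the staff turn right after it.
    for (s1, q), (s2, a) in zip(merged, merged[1:]):
        if s1 == "customer" and s2 == "staff":
            # Assumption: a keyword in either side removes the pair.
            if not any(k in q or k in a for k in business_keywords):
                pairs.append((q, a))
    return pairs
```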

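2.2 IR Model

Our retrieval model uses search technology to find the most similar question for each input and then returns the paired answer. With word segmentation, we build an inverted index over all 9,164,834 questions, mapping each word to the set of questions that contain it. Given a question, we segment it into a set of words, remove stop words, extend the set with synonyms, and use the refined set to recall a set of candidate QA pairs. We then use BM25 (Robertson et al., 2009) to calculate the similarity between the input question and each retrieved question, and take the paired answer of the most similar one as the answer.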
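A small sketch of the recall-then-rank idea: an inverted index recalls candidate questions and a common form of the BM25 formula ranks them. The class name `BM25Index`, whitespace tokenization, and the parameter values k1 = 1.5 and b = 0.75 are illustrative assumptions; stop-word removal and synonym expansion are omitted, and the production system works on segmented Chinese text.

```python
import math
from collections import Counter, defaultdict


class BM25Index:
    """Toy inverted index with BM25 ranking (illustrative only)."""

    def __init__(self, questions, k1=1.5, b=0.75):
        # Assumption: whitespace tokenization stands in for segmentation.
        self.docs = [q.split() for q in questions]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        # Inverted index: word -> ids of the questions containing it.
        self.postings = defaultdict(set)
        for i, doc in enumerate(self.docs):
            for w in doc:
                self.postings[w].add(i)

    def idf(self, word):
        n = len(self.postings.get(word, ()))
        total = len(self.docs)
        return math.log((total - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query_words, doc_id):
        tf = Counter(self.docs[doc_id])
        dl = len(self.docs[doc_id])
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        return sum(self.idf(w) * tf[w] * (self.k1 + 1) / (tf[w] + norm)
                   for w in query_words if tf[w] > 0)

    def search(self, query, top_k=3):
        words = query.split()
        # Recall every question sharing at least one word with the query.
        candidates = set()
        for w in words:
            candidates |= self.postings.get(w, set())
        ranked = sorted(candidates,
                        key=lambda i: self.score(words, i), reverse=True)
        return ranked[:top_k]
```

`search` returns question ids; in a full system each id would map back to its paired answer.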

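2.3 Generation Model

Our generation model is built on the Seq2Seq structure of (Bahdanau et al., 2015). Let $\theta_i = \{y_1, y_2, \cdots, y_{i-1}, c_i\}$; the probability of generating a word $y_i$ at position $i$ is given by Eqn. 1:

$$p(y_i \mid \theta_i) = f(y_{i-1}, s_{i-1}, c_i) \tag{1}$$

where $f$ is a nonlinear function that computes the probability, $s_{i-1}$ is the hidden state of the output at position $i-1$, and $c_i$ is a context vector that depends on $(h_1, h_2, \cdots, h_m)$, the hidden states of the input sequence:

$$c_i = \sum_{j=1}^{m} \alpha_{ij} h_j$$

Here $\alpha_{ij} = a(s_{i-1}, h_j)$ is given by an alignment model that scores how well the input at position $j$ matches the output at position $i-1$ (Bahdanau et al., 2015). An example is shown in Figure 2, where $i = 3$ and $m = 4$.

Figure 2: The Seq2Seq model. Our model is mainly for Chinese.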

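We choose Gated Recurrent Units (GRU) as our Recurrent Neural Network (RNN) unit. A few important implementation details are discussed below.

Bucketing and padding. To handle questions and answers of different lengths, we use the bucket mechanism proposed in Tensorflow. We use five buckets (5, 5), (5, 10), (10, 15), (20, 30), (45, 60) to accommodate QA pairs of different lengths, e.g., a question of length 4 and an answer of length 8 are put in bucket (5, 10), and we pad questions and answers with a special symbol "PAD" when needed.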
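The bucket assignment can be sketched as below. This is a simplified illustration rather than the TensorFlow code itself: the function name `bucket_and_pad`, the inclusive length comparison, and returning `None` for over-long pairs are assumptions.

```python
BUCKETS = [(5, 5), (5, 10), (10, 15), (20, 30), (45, 60)]
PAD = "PAD"


def bucket_and_pad(question, answer, buckets=BUCKETS, pad=PAD):
    """Place a tokenized QA pair in the smallest bucket that fits both
    sides and pad each side to the bucket size.

    Returns (bucket, padded_question, padded_answer), or None when the
    pair fits no bucket (assumption: such pairs are skipped).
    """
    for q_len, a_len in buckets:
        if len(question) <= q_len and len(answer) <= a_len:
            padded_q = question + [pad] * (q_len - len(question))
            padded_a = answer + [pad] * (a_len - len(answer))
            return (q_len, a_len), padded_q, padded_a
    return None
```

For the example in the text, a question of 4 tokens and an answer of 8 tokens skip bucket (5, 5) because the answer is too long and land in (5, 10).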

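Softmax over sampled words. To speed up training, we apply softmax to a set of sampled vocabulary words (the target word and 512 random ones) rather than to the whole vocabulary. The idea is similar to the importance sampling strategy in (Jean et al., 2014).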

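Beam search decoder. In the decoding phase, we use beam search, which maintains the top-k (k = 10) output sequences at each time step t, instead of greedy search, which keeps only one at each time step t, to make our generation more reasonable.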
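To make the difference from greedy search concrete, here is a generic beam search sketch. It is not tied to the paper's decoder: the `step` callable (returning next-token probabilities for a prefix), the start and end tokens, and the absence of length normalization are all assumptions of this illustration.

```python
import math


def beam_search(step, start, end, beam_width=10, max_len=20):
    """Keep the beam_width best partial sequences at every time step.

    step(prefix) -> dict mapping each next token to its probability
    (a hypothetical stand-in for the decoder).
    Returns the sequence with the highest total log-probability.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        expanded = []
        for seq, logp in beams:
            for token, p in step(seq).items():
                if p > 0:
                    expanded.append((seq + [token], logp + math.log(p)))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, logp in expanded[:beam_width]:
            # Sequences ending with the end token leave the beam.
            (finished if seq[-1] == end else beams).append((seq, logp))
        if not beams:
            break
    candidates = finished + beams
    if not candidates:
        return [start]
    return max(candidates, key=lambda item: item[1])[0]
```

Setting `beam_width=1` reduces this to greedy search, which can commit to a locally likely first word and miss a sequence with a higher overall probability.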

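2.4 Sequence to Sequence Rerank Model

Our rerank model uses the same Seq2Seq model to score candidate answers with respect to an input question. Specifically, we choose the mean probability, denoted $s^{\text{Mean-Prob}}$ in Eqn. 2, as our scoring function (a candidate answer is treated as a word sequence $w_1, w_2, \ldots, w_n$):

$$s^{\text{Mean-Prob}} = \frac{1}{n} \sum_{i=1}^{n} p(y_i = w_i \mid \theta_i) \tag{2}$$

We also tried averaged cross-entropy and harmonic mean, but they performed worse.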
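Given the per-word probabilities that the model assigns to a candidate answer, the scoring criteria mentioned here reduce to simple aggregates. The sketch below assumes those probabilities are already available as a list; the exact definitions of the cross-entropy and harmonic-mean variants are not spelled out in the text, so the two helper functions are one plausible reading, not the authors' formulas.

```python
import math


def mean_prob(token_probs):
    """Mean probability: average of p(y_i = w_i | theta_i) over the answer."""
    return sum(token_probs) / len(token_probs)


def averaged_cross_entropy(token_probs):
    """Average negative log-probability (lower is better). Assumed form."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def harmonic_mean(token_probs):
    """Harmonic mean of the per-word probabilities. Assumed form."""
    return len(token_probs) / sum(1.0 / p for p in token_probs)
```

For instance, per-word probabilities 0.5, 0.25 and 0.25 give a mean probability of 1/3 and a harmonic mean of 0.3; the harmonic mean penalizes low-probability words more strongly.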

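3 Experiments

In our experiments, we first examined the effectiveness of the Seq2Seq model with the mean probability scoring criterion; we then evaluated the effectiveness of IR, Generation, IR + Rerank, and IR + Rerank + Generation (our approach); we also conducted an online A/B test on our approach and a baseline chatbot engine; lastly, we compared our engine with a publicly available chatbot.

For evaluation, we asked business analysts to go through the answer to each test question (two analysts for the experiment comparing with another public chatbot, and one for the other experiments) and mark it with one of three graded labels: "0" for unsuitable, "1" meaning that the answer is suitable only in certain contexts, and "2" meaning that the answer is suitable. To determine whether an answer is suitable, we defined five evaluation rules, namely "right in grammar", "semantically related", "well-spoken language", "context independent" and "not overly generalized".

An answer is labeled as suitable only if it satisfies all the rules, neutral if it satisfies the first three and breaks either of the last two, and unsuitable otherwise.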

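We use top-1 accuracy (Ptop1) as the criterion because some approaches can output more than one answer (e.g., IR). This metric measures whether the top-1 candidate is suitable or neutral, and is calculated as Ptop1 = (Nsuitable + Nneutral) / Ntotal, where Nsuitable denotes the number of questions marked as suitable (the other symbols are defined similarly).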
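The labeling rule and the metric can be written down directly. In this small sketch the function names `grade` and `p_top1` are hypothetical, and the five rules are passed as booleans in the order they are listed in the text.

```python
def grade(rules_satisfied):
    """Map five rule checks to a label: 2 suitable, 1 neutral, 0 unsuitable.

    rules_satisfied: five booleans in the order grammar, semantic
    relatedness, well-spoken language, context independence,
    not overly generalized.
    """
    first_three, last_two = rules_satisfied[:3], rules_satisfied[3:]
    if all(first_three) and all(last_two):
        return 2
    if all(first_three):
        return 1
    return 0


def p_top1(labels):
    """Ptop1 = (N_suitable + N_neutral) / N_total over top-1 labels."""
    return sum(1 for label in labels if label in (1, 2)) / len(labels)
```

For example, labels of 2, 1, 0 and 0 on four questions give a Ptop1 of 0.5.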

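3.1 Evaluating Rerank Models

We first compared two Seq2Seq models (the basic one proposed in (Cho et al., 2014), and the one presented in Section 2.4) on three mean criteria (mean probability, averaged cross-entropy and harmonic mean), using a set of 500 randomly sampled questions. Table 1 shows the Ptop1 results, which indicate that the Seq2Seq model with sMean-Prob has the best performance. We use it in our rerank model.

Table 1: Comparison of different rerank models.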

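3.2 Evaluating Candidate Approaches

We evaluated the effectiveness of the following four approaches on a set of 600 questions: IR, Generation, IR + Rerank, and IR + Rerank + Generation. The results are shown in Figure 3. Clearly, the proposed approach (IR + Rerank + Generation) has the best top-1 accuracy: with a confidence score threshold T = 0.19, Ptop1 = 60.01%. Here, questions with a score higher than 0.19 (to the left of the dashed line, 535 out of 600) are answered using rerank, and the rest are handled by generation. The Ptop1 values of the other three alternatives are 47.11%, 52.02% and 56.23%, respectively. Note that a higher Ptop1 can be achieved with a higher threshold (e.g., 0.48), or, put differently, by reranking less and generating more. We use the lower threshold because Seq2Seq generation is hard to control and interpret: at the cost of a drop in Ptop1, we gain more controllability and interpretability.

Figure 3: Top-1 accuracy of the candidate approaches.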

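3.3 Online A/B Test

We implemented the proposed approach in AliMe Chat, our online chatbot engine, and conducted an A/B test between the new approach and the existing IR approach (questions were distributed equally between the two). We randomly sampled 2136 QA pairs, of which 1089 questions were answered by IR and 1047 were handled by the hybrid approach, and compared their top-1 accuracies. As shown in Table 2, the new approach has a Ptop1 of 60.36%, much higher than that of the IR baseline (40.86%).

Table 2: Comparison with the IR model in the A/B test.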

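3.4 Comparison with a Public Chatbot

To further evaluate our approach, we compared it with a publicly available chatbot. We selected 878 of the 1047 test questions (used in the A/B test) by removing questions related to our own chatbot, and used them to test the public one. To compare its answers with ours, two business analysts were asked to choose the better response for each test question. Table 3 shows the averaged results from the two analysts; clearly our chatbot performs better (better on 37.64% of the 878 questions and worse on 18.84%). The Kappa measure between the analysts is 0.71, which indicates substantial agreement.

Table 3: Comparison with another chatbot.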
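The text does not say which Kappa statistic was used. Assuming Cohen's kappa, the usual choice for two annotators, the agreement figure could be computed as in this sketch; the label values in the example are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: the "Kappa measure" in the text is Cohen's kappa.
    """
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(count_a[c] * count_b[c]
                   for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Values between 0.61 and 0.80 are conventionally read as substantial agreement, which matches the interpretation given for 0.71.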

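3.5 Online Serving

We deployed our approach in our chatbot engine. For online serving, reranking is critical to run-time performance: if the k candidate QA pairs are ranked asynchronously, the engine has to wait for the last reranker to finish, and this gets worse when QPS (questions per second) is high. Our solution is to bundle each set of k QA pairs together and transform it into a k × n matrix (n is the maximal length of the concatenation of the k QA pairs; padding is used when needed), and then use parallel matrix multiplication in the rerank model to accelerate the computation. In our experiments, the batch approach saves 41% of the processing time compared with the asynchronous way. Specifically, more than 75% of questions take less than 150ms with rerank and less than 200ms with generation. Moreover, our engine is able to support a peak QPS of 42 on a cluster of 5 service instances, each reserving 2 cores and 4G of memory on an Intel Xeon E5-2430 server. This makes our approach applicable to industrial bots.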
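The batching idea can be illustrated by building the k × n matrix itself. This sketch only covers the padding step, under the assumptions that tokens are already mapped to integer ids and that id 0 is reserved for padding; the matrix multiplication inside the rerank model is not shown.

```python
def build_batch(question_ids, candidate_answer_ids, pad_id=0):
    """Bundle k QA pairs into a k x n matrix of token ids.

    Each row is the question concatenated with one candidate answer;
    n is the longest concatenation and shorter rows are right-padded.
    pad_id=0 is an assumption of this sketch.
    """
    rows = [question_ids + answer for answer in candidate_answer_ids]
    n = max(len(row) for row in rows)
    return [row + [pad_id] * (n - len(row)) for row in rows]
```

With all rows the same length, the k candidates can be scored in one forward pass instead of k separate ones.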

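We launched AliMe Chat as an online service and integrated it into AliMe Assist, our intelligent assistant in the E-commerce field, which supports not only chatting but also customer service (e.g., sales returns), shopping guidance and life assistance (e.g., booking flights). Figure 4 shows an example chat dialog generated by our chat service.

Figure 4: An example chat dialog of AliMe Chat.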

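4 Related Work

Closed-domain dialog systems typically use rule- or template-based methods (Williams and Zweig, 2016; Wen et al., 2016) and dialog state tracking (Henderson, 2015; Wang and Lemon, 2013; Mrksic et al., 2015). In contrast, open-domain chatbots usually adopt data-driven techniques. Commonly used ones include IR and Seq2Seq generation. IR based techniques mainly focus on finding the nearest questions in a QA knowledge base, e.g., (Isbell et al., 2000), (Ji et al., 2014), (Yan et al., 2016b). A recent study (Yan et al., 2016a) tried a neural network based method for matching. Usually, IR based models have difficulty handling long-tail questions. Seq2Seq based generation models are typically trained on a QA knowledge base or a conversation corpus and used to generate an answer for each input. In this direction, RNN based Seq2Seq models have been shown to be effective (Cho et al., 2014; Sutskever et al., 2014; Ritter et al., 2011; Shang et al., 2015; Sordoni et al., 2015; Serban et al., 2016). A basic Seq2Seq model was proposed in (Sutskever et al., 2014) and enhanced with attention by (Bahdanau et al., 2015). Further, Sordoni et al. (2015) considered context information, and Li et al. (2016) tried to make Seq2Seq models generate diversified answers by attaching a diversity-promoting objective function. Despite many advantages, Seq2Seq based generation models are still likely to produce inconsistent or meaningless answers.

Our work combines both IR based and generation based models. It differs from another recent combinational approach (Song et al., 2016), which uses an IR model to rerank the union of retrieved and generated answers. Furthermore, we found that the Seq2Seq rerank approach helps to improve the IR results significantly.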

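5 Conclusion

In this paper, we proposed a Seq2Seq based rerank approach that combines IR and generation models. We conducted a series of evaluations to assess the effectiveness of the proposed approach. The results show that our hybrid approach outperforms both models.

We implemented this new approach in an industrial chatbot and released it as an online service.

There are many interesting problems to be explored further. One of them is context, which is critical for multi-round interaction in dialog systems. Currently, we use a simple strategy to incorporate context: given a question, if fewer than three candidates are retrieved by the IR model, we enhance the question with the previous question and send the concatenation to the IR engine again. We have tried other context-aware techniques, e.g., the context-sensitive model (Sordoni et al., 2015) and the neural conversation model (Sutskever et al., 2014), but they do not scale well in our scenario. We are still exploring scalable context-aware methods. Additionally, we are working on personification, i.e., empowering our chatbot with character and emotion.

Paper download link: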

http://www.aclweb.org/anthology/P/P17/P17-2079.pdf
