Are Natural Learning Processing Capabilities a Bigger Threat Than Machine Learning Bias?

Microsoft licensed GPT-3 [0]. So my answer is yes. GPT-3 and forthcoming Natural Language Processing (NLP) models can create a bigger bias problem by generating believable “alternative facts.”


I read Dr. Hinton’s tweet yesterday. He estimates 4.4 trillion parameters to obtain… the answer to life…. I think he used this graph from:


Source: Language Models are Few-Shot Learners [1].

I may be wrong in guessing how Dr. Hinton derived 4.4 trillion parameters from the above graph. However, the number of significant figures given (four) implies a standard error of +/-0.0005 trillion (or +/-500 million) parameters.

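The implied-uncertainty arithmetic can be sanity-checked in a few lines (a back-of-envelope sketch; the four-figure count is taken from the text above):

```python
# Back-of-envelope check: half the place value of the last significant
# digit gives the uncertainty implied by a quoted figure.
sig_figs = 4
# With four significant figures, a trillions-scale value like 4.400 has its
# last digit in the 0.001-trillion place.
last_digit_place = 10.0 ** -(sig_figs - 1)   # 0.001 trillion
implied_error = last_digit_place / 2         # 0.0005 trillion, ~500 million
print(f"+/- {implied_error} trillion parameters")
```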

I will take a risk and state that, given the number of significant figures and Dr. Hinton’s background in cognitive brain science, he is being “tongue in cheek,” offering 4.4 trillion parameters as an answer to life….


Aww. Ok, maybe “answer to life…” is a better indicator of his “tongue in cheek” sentiment.


Six trillion is a better estimate 8-;j [2]

Notice how I used an emoji to show my sentiment.


In an earlier blog, I estimated (from the above graph):


I stated that:


GPT-3 has approximately 185 billion parameters. In contrast, the human brain has approximately 86 billion neurons with, on average, 7,000 synapses per neuron [3];


Using the above data, we can guess that the human brain has about 60 trillion parameters or about 300x more parameters than GPT-3.


If 10% of the human brain capacity is needed for natural language tasks, then the human brain has about 30x more parameters than GPT-3.


Using this naive model, we need an NLTG (natural language text generation) model of 6 trillion parameters to be comparable to humans’ natural language ability [5].

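Taking the article’s figures at face value (185 billion GPT-3 parameters, an estimated 60 trillion brain “parameters,” and an assumed 10% language share), the arithmetic behind the 300x, 30x, and six-trillion numbers is just:

```python
# The figures below are the article's own estimates, not measured values.
gpt3_params = 185e9            # GPT-3 parameter count, as quoted above
brain_params = 60e12           # naive human-brain "parameter" estimate
language_share = 0.10          # assumed fraction of the brain used for language

brain_ratio = brain_params / gpt3_params    # ~324, i.e. "about 300x"
nlp_params = brain_params * language_share  # 6 trillion
nlp_ratio = nlp_params / gpt3_params        # ~32, i.e. "about 30x"
print(f"{brain_ratio:.0f}x overall, {nlp_ratio:.0f}x for language, "
      f"{nlp_params:.0e} parameters needed")
```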

8-| [3].


I need to correct myself. The data from the above graph [2] shows that an NLP (Natural Language Processing) model with six trillion parameters would require a human reader 130 seconds or more to detect that its text was artificially generated [4].


In my humble opinion (IMHO), GPT-3 is almost passing the written Turing Test [4].


GPT-3 is quite impressive in some areas, and still clearly subhuman in others. I hope that with a better understanding of its strengths and weaknesses, we software engineers will be better equipped to use modern language models in real products [4].


So I still stand behind the statement:


Does it matter if a model understands what it is doing (Artificial General Intelligence) so long as the ordinary consumer has difficulty differentiating between human and generated text [5]?


Ok, why is “Microsoft licensed GPT-3” in this blog?

I reference “Microsoft licensed GPT-3” in this blog because it indicates the “generation of believable alternative facts” capability exists right now.


Futurists point to achieving AGI (Artificial General Intelligence) as either a great bane or a grand benefit to humankind [6,7,8].


I want to strongly emphasize that we can be in danger before achieving AGI (Artificial General Intelligence), because of the bleak capability of “generating believable alternative facts.”


We don’t have to wait to deal with AGI issues in the future. We have to deal with Machine Learning issues, such as face recognition and text generation, yesterday.


Microsoft Azure and Trolls

About 2 years ago:


I think we should watch out for drones. I think automated drones are potentially dangerous in a lot of ways. The computation onboard unmanned weapons isn’t efficient enough to do something useful right now. But in five or ten years, I can imagine that a drone could have onboard computation sufficient enough that it could actually be useful. You can see that drones are already getting used in warfare, but they’re [still human-controlled]. There’s no reason why they couldn’t be carrying some kind of learning system and be reasonably effective. So that’s something that I worry about a fair bit. — John Langford, Principal Researcher At Microsoft. September 2018 [8].


I agree; we should worry about and avoid a Terminator-like future [9]. However, I think text generation is a more near-term threat to facts and maybe the “free world.”


The scope of commercial and creative potential that can be unlocked through the GPT-3 model is profound, with genuinely novel capabilities — most of which we haven’t even imagined yet.


That future will be what we make of it — and I believe that we’re on the right track. — Kevin Scott, Executive Vice President and Chief Technology Officer, Microsoft. Sep 22, 2020 [10].


I have no argument with the above statement. However, I would like to know how the text-generation capability will not be misused. I guess it could be shut down like Tay [11].


Who decides what misuse is? Microsoft? Congress? Senate? The President (of the USA)? Web trolls? (Yes, the last one could be considered silly, sort of.)


I know that humans are hired to filter out content according to a company’s acceptable-content policies [12].


I could imagine a well-funded troll farm (the Russian government or whoever) distributing “alternative facts” through their Microsoft Azure accounts.


Good News

What gives me hope against the bleak “generation of believable alternative facts” is the existence of useful plagiarism-detection tools.


We need an “alternative fact”/actual-fact detector, equivalent to a spam/not-spam detector.

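As a sketch of what such a detector could look like, here is a minimal spam-filter-style naive Bayes text classifier over two made-up classes; the training sentences and labels are illustrative placeholders, not real fact-checked data:

```python
# Minimal naive Bayes "alternative fact" detector, framed exactly like a
# spam filter. Toy data only -- a real system would need a labeled corpus.
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns per-label word counts and doc counts."""
    word_counts, doc_counts = {}, Counter()
    for text, label in docs:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        doc_counts[label] += 1
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Pick the label with the highest log naive-Bayes score."""
    vocab = {w for wc in word_counts.values() for w in wc}
    best_label, best_score = None, -math.inf
    for label, wc in word_counts.items():
        n = sum(wc.values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score.
            score += math.log((wc[w] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ("water boils at 100 celsius at sea level", "fact"),
    ("the earth orbits the sun once a year", "fact"),
    ("the moon is made of green cheese", "alt-fact"),
    ("the earth is flat and the sun orbits it", "alt-fact"),
]
word_counts, doc_counts = train(docs)
print(classify("the moon is green cheese", word_counts, doc_counts))
```

The design is deliberately the same as a spam filter: the hard part is not the classifier but collecting a labeled “fact”/“alternative fact” corpus.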

Also, I think it will take less effort to create good NLP-based fact-checkers than general “alternative fact” text generators. At least as long as the “fact” population outweighs the “alternative fact” population.


I just thought of an excellent sentiment-analysis-style tool: distinguishing “fact” (zero Pinocchios) from “alternative fact” (one or more Pinocchios).


If the “fact” population falls below the “alternative fact” population, then we will need AGI.


Bad News

[ed: 9/29/2020]


In 2019, a Center on Terrorism, Extremism, and Counterterrorism (CTEC) investigation discovered that OpenAI’s GPT-2 language model, one of the most sophisticated generative AI models at the time, could be trained to produce convincing extremist manifestos.


In 2020, OpenAI developed GPT-3, a third-generation neural language model capable of sophisticated natural language generation. So good is GPT-3 that it is difficult to distinguish the quality of the text generated by the model from that written by a human — which is both beneficial and risky. OpenAI researchers and engineers warned of GPT-3’s potential dangers and called for research to mitigate risk [13,14].


Translated from: https://medium.com/swlh/are-natural-learning-processing-capabilities-a-bigger-threat-than-machine-learning-bias-3c0c7774e863
