


A little over a year ago, a research team at Carnegie Mellon University launched reCAPTCHA, a plug-in CAPTCHA service for web sites that serves the dual purpose of fighting spam bots and helping the Internet Archive and other clients make sense of digitized print content.


CAPTCHAs, those hard-to-read images that web sites sometimes ask you to decipher before submitting form data, can be an effective way to combat spam, but they're also tremendous time sinks. Each day on the web people are confronted with a whopping 200 million CAPTCHA images, and deciphering them consumes 500,000 hours, or about nine seconds per image. The reCAPTCHA system makes brilliant use of that time by putting people to work reading scanned text that optical character recognition (OCR) software had difficulty understanding.


The service, which is now employed by 40,000 web sites, uses a simple technique to get people to help figure out unknown scanned words. Each reCAPTCHA box presents users with two words: one that the system already knows (a control word) and one that it does not. If the user gets the control word right, the system can assume that the answer for the other word also has a high likelihood of being correct. If enough users enter the same thing for the unknown word, that word can in turn be used as a control word.
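As a rough illustration of the two-word check, here is a minimal Python sketch. The class name `WordPool`, its method names, and the agreement threshold of three matching answers are all assumptions made for this example; the reports cited here don't say how many matching answers reCAPTCHA requires or how it stores its words.

```python
from collections import Counter, defaultdict

# Hypothetical value: the real number of matching answers is not given.
AGREEMENT_THRESHOLD = 3


class WordPool:
    """Toy model of control words and unknown words (illustrative only)."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.control = {}                  # image id -> known transcription
        self.votes = defaultdict(Counter)  # image id -> tally of user answers

    def add_control(self, image_id, text):
        self.control[image_id] = text.strip().lower()

    def submit(self, control_id, control_answer, unknown_id, unknown_answer):
        """Return True if the user passes the CAPTCHA."""
        if control_answer.strip().lower() != self.control[control_id]:
            # Wrong control word: fail the user and ignore the other answer.
            return False

        # Control word was right, so count the answer for the unknown word.
        guess = unknown_answer.strip().lower()
        self.votes[unknown_id][guess] += 1

        # Enough users agree: promote the unknown word to a control word.
        if self.votes[unknown_id][guess] >= self.threshold:
            self.control[unknown_id] = guess
            del self.votes[unknown_id]
        return True
```

In this sketch an answer for the unknown word only counts when the same user also typed the control word correctly, which is what lets the system trust answers it has no way to check directly.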



Of those 200 million daily CAPTCHAs, reCAPTCHA serves about 4 million, which is “the equivalent of 1500 people working full-time and transcribing 60 words per minute,” according to a report in this month’s Science. The service, which is free for web sites to use, has deciphered 440 million words for clients over the past year.


According to Ars Technica, reCAPTCHA is also very accurate. In a test that used a random sample of 250 New York Times articles from different time periods, OCR software managed just 84% accuracy on its own. When combined with reCAPTCHA, though, the accuracy rating shot up to 99.1%. That, says Ars, is comparable to professional transcription services that employ two transcription experts whose work is verified by a third party.


It’s easy to see how reCAPTCHA’s use of the crowd is far more cost-effective. Further, Ars reports that software designed to crack CAPTCHA images fails on reCAPTCHA, likely because the letter distortions in scanned images are not the result of “clean mathematical transformation,” and thus are hard for a computer to correct.


reCAPTCHA is a simply brilliant use of essentially wasted time, and I’m pleased to hear that it’s working. When I first wrote about the program last year for ReadWriteWeb, I noted that in college one of my classes was part of a project to digitize old maritime journals. We used expensive overhead scanners and fancy OCR software, but even so, most of our time was spent correcting mistakes that the software had made. The reCAPTCHA system would have been a welcome addition to our work back then.


Original article: https://www.sitepoint.com/recaptcha-awesome-use-of-wasted-time-that-works/




