Machine Learning Data: Do You Really Have Rights to Use It?

This article discusses how data can be used lawfully and effectively for prediction in machine learning, drawing on a piece originally published on Medium that stresses the importance of data-use rights.


Organizations using machine learning systems require data to train their systems. But where does that data come from? And can they get into trouble if they don’t have the rights to use that data? The short answer is yes; they can get into trouble if they aren’t careful.


A few recent cases show the risks companies face when they use personal information to train AI systems allegedly without authorization. First, Burke v. Clearview AI, Inc., a class action filed in federal district court in San Diego at the end of February 2020, involves a company, Clearview, accused of “scraping” thousands of sites to obtain three billion images of individuals’ faces, used to train AI algorithms for facial recognition and identification purposes. “Scraping” refers to an automated process that scans the content of websites, collects certain content from them, stores that content, and uses it later for the collecting company’s own purposes. The basis for the complaint is that Clearview AI failed to obtain consent to use the scraped images. Moreover, given the vast scale of the scraping — obtaining three billion images — the risk to privacy is tremendous.

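The “scraping” described above can be sketched in a few lines. The following is a minimal, illustrative example using only the Python standard library: it extracts image URLs from a page’s HTML, which is the kind of content a scraper would store and reuse. The sample HTML and URLs are invented; a real scraper would first fetch pages over the network (e.g., with `urllib.request`).

```python
# Minimal sketch of web scraping: an automated process that scans page
# content, collects certain items (here, image URLs), and stores them
# for later use. The HTML is inlined so the example runs offline.
from html.parser import HTMLParser

class ImageScraper(HTMLParser):
    """Collects the src attribute of every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

# Hypothetical page content; a real scraper would download this.
page = '<html><body><img src="/faces/a.jpg"><img src="/faces/b.jpg"></body></html>'
scraper = ImageScraper()
scraper.feed(page)
print(scraper.image_urls)
```

Nothing in this sketch is unlawful by itself; as the cases above show, the legal exposure comes from what is collected (personal data such as faces) and whether consent was obtained.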

In Stein v. Clarifai, Inc., filed earlier in February, the plaintiffs’ class action complaint filed in Illinois state court claims that investors in Clarifai, who were founders of the dating site OKCupid, used their access to OKCupid’s database of profile photographs to transfer the database to Clarifai. Clarifai then supposedly used the photos to train its algorithms for analyzing images and videos, including for purposes of facial recognition. Clarifai is the defendant in this case and will have to fight claims that it wasn’t entitled to take the OKCupid photos without notifying the dating site’s users and obtaining consent. OKCupid is potentially a target too. It isn’t clear whether the plaintiffs are saying that OKCupid’s management approved the access to its database, but if it did, the plaintiffs may have claims against OKCupid as well.


Dinerstein v. Google, LLC, is a case involving questions of the right to use data for AI training purposes. This case involves Google’s use of supposedly de-identified electronic health records (EHR) from the University of Chicago and the University of California San Francisco medical centers to train Google’s medical AI systems to assist with the development of a variety of AI services, including assistance with medical diagnoses. In a class action complaint filed in late June 2019, a patient at the University of Chicago’s medical center, on behalf of a putative class, alleged injury from the sharing of medical records by the University. According to the plaintiffs, while the EHR data was supposedly de-identified, Google collects huge amounts of data to profile people, including geolocation data from Android phones, and Google can, therefore, reidentify individual patients from the de-identified data.

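The reidentification risk the Dinerstein plaintiffs allege rests on record linkage: joining a “de-identified” dataset with a second dataset on shared quasi-identifiers. The following sketch, with entirely invented data and field names, shows how a timestamp-plus-location join could in principle tie a de-identified medical record back to a device owner.

```python
# Hypothetical illustration of reidentification by linkage: join
# de-identified records with a separate geolocation log on shared
# quasi-identifiers (admission time and facility). All data invented.
deidentified_ehr = [
    {"record_id": "r1", "admitted": "2019-06-02T14:05", "facility": "ER-3",
     "diagnosis": "redacted"},
]
geolocation_log = [
    {"device_owner": "patient_x", "seen_at": "2019-06-02T14:05", "facility": "ER-3"},
]

def reidentify(ehr_rows, location_rows):
    """Map record IDs to device owners where time and facility match."""
    index = {(r["seen_at"], r["facility"]): r["device_owner"] for r in location_rows}
    return {
        row["record_id"]: index[(row["admitted"], row["facility"])]
        for row in ehr_rows
        if (row["admitted"], row["facility"]) in index
    }

print(reidentify(deidentified_ehr, geolocation_log))
```

Note that such a join proves co-occurrence, not identity: as the skepticism below points out, the device owner present at the right time and place may be a relative of the patient, not the patient.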

I am skeptical that Google intended to combine the EHR data with other data, and it isn’t clear what the plaintiffs think Google was going to do with reidentified data. There was no allegation, for instance, of pushing medical-condition-specific ads to Android users. Nor does the mere presence of an Android phone in the E.R. when an E.R. medical record was created mean that the two are linked: the patient may be a child, and the Android user the child’s parent. It isn’t even clear that geolocation is precise enough to link an Android user to the department or room where the record was created.


Regardless, organizations should consider the sources of their AI training data in their risk management plans. They should obtain any needed consents from data subjects. Certain business models, like scraping the public web for photos, are especially problematic. Under the European Union’s General Data Protection Regulation, companies that scrape public Internet sites for personal information at least have to inform individual data subjects that they have collected their personal data and provide a mechanism for opting out. While the GDPR does not generally apply in the United States, companies should consider that kind of mechanism to avoid liability in the United States.


Moreover, if the original consents covered one use, organizations should analyze whether it is necessary to reconsent the data subjects for AI training purposes. De-identification of personal data will help, although the source of the personal data may want to be made whole in case lawsuits stem from the organization’s use of that data. Also, organizations should be careful of bridging contexts: using data from one source and combining it with data from another source, thereby potentially reidentifying data subjects and violating their privacy. These measures will reduce the liability risks associated with personal data sharing.

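At its most basic, de-identification before sharing means dropping direct identifiers and replacing the subject key with a pseudonym that cannot be trivially joined against other datasets. The sketch below is a hedged illustration with invented field names and salt; real de-identification regimes (e.g., the HIPAA Safe Harbor rule) remove many more fields, and a salted hash alone does not eliminate the linkage risks discussed above.

```python
# Sketch of basic de-identification before sharing data for training:
# drop direct identifiers and replace the subject key with a salted
# hash, so recipients cannot bridge contexts by joining on a shared ID.
import hashlib

DIRECT_IDENTIFIERS = {"name", "email", "phone"}  # illustrative, not exhaustive

def deidentify(record, salt="org-private-salt"):
    """Return a copy with direct identifiers removed and the key pseudonymized."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    digest = hashlib.sha256((salt + record["subject_id"]).encode()).hexdigest()
    out["subject_id"] = digest[:12]  # pseudonym stable only within this organization
    return out

record = {"subject_id": "u1", "name": "Jane Doe",
          "email": "jane@example.com", "visit": "2019-06-01"}
print(deidentify(record))
```

Keeping the salt private is the design point: the same subject hashes consistently inside the organization, but an outside party holding another dataset keyed on `u1` cannot reproduce the pseudonym to bridge the two contexts.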

Stephen S. Wu is a shareholder with Silicon Valley Law Group in San Jose, California. He advises clients on a wide range of issues, including transactions, compliance, liability, security, and privacy matters regarding the latest technologies in areas such as robotics, artificial intelligence, automated transportation, the Internet of Things, and Big Data. He has authored or co-authored several books, book chapters, and articles and is a frequent speaker on advanced technology and data protection legal topics.


Follow him on Twitter and LinkedIn.


Originally published at https://www.airoboticslaw.com.


Translated from: https://medium.com/swlh/machine-learning-data-do-you-really-have-rights-to-use-it-6aea047ef1d

