Nearly two years on from the revelation of the Facebook-Cambridge Analytica scandal, you’d hope that Facebook has taken steps to stop bad actors from exploiting data that we share online. While Facebook has taken some of these steps, one loophole still exists. It’s a loophole which allows bad actors to see incredibly private information about users, information that the user may not share with their closest friends. That loophole is scraping.

What is Scraping?


Scraping is simply the act of taking public information from websites. By public information, we’re talking about the sort of information that’s accessible to anyone who views the site. If you wanted to store weather data, you could scrape a weather site. If you wanted to store sports results, you could scrape match reports. If the data is publicly available, chances are that you can scrape it.
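To make the weather example concrete, here is a minimal sketch of what pulling data out of a page looks like. It assumes a made-up page layout: the SAMPLE_PAGE snippet, the "city" and "temp" class names, and the WeatherParser and scrape_weather helpers are all hypothetical, and a real site would need its own parsing rules. It uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for a public weather page.
SAMPLE_PAGE = """
<html><body>
  <h1>Forecast</h1>
  <table>
    <tr><td class="city">London</td><td class="temp">14</td></tr>
    <tr><td class="city">Leeds</td><td class="temp">11</td></tr>
  </table>
</body></html>
"""


class WeatherParser(HTMLParser):
    """Collects (city, temperature) pairs from table cells tagged by class."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # class of the <td> we are inside, if any
        self.row = {}              # fields gathered for the current table row
        self.readings = []         # finished (city, temperature) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.current_field = dict(attrs).get("class")

    def handle_data(self, data):
        # Only keep text that sits inside a cell we care about.
        if self.current_field in ("city", "temp"):
            self.row[self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self.current_field = None
        elif tag == "tr" and "city" in self.row and "temp" in self.row:
            self.readings.append((self.row["city"], int(self.row["temp"])))
            self.row = {}


def scrape_weather(html):
    """Return the list of (city, temperature) pairs found in the HTML."""
    parser = WeatherParser()
    parser.feed(html)
    return parser.readings


print(scrape_weather(SAMPLE_PAGE))
```

The same pattern, fetching a page and picking out the parts of its HTML that carry the data, is what a person does by hand when copying values into a file.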

In the above examples, you could scrape the data manually. That is, you could visit all the pages whose data you wanted to store, and copy it into a file. This isn’t how scraping is normally done, however.

Typically, people code bots which scrape web pages for them. These bots can visit a huge number of sites and monitor them 24/7, to ensure that they capture any data which is displayed on these sites.

What Do Bots Actually Scrape?


The most common scraping bots actually power search engines. These bots scrape sites, looking for all the other sites which the original site links to. If the bot can find links to other sites, it then scrapes these too. The bot looks for sites which the new site links to, and so on. The process carries on and on, until the bots have found every site available on the internet (or at least every site that is linked to by at least one other).
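The link-following process described above can be sketched as a breadth-first crawl. This is a simplified illustration rather than how any particular search engine works: the LINKS dictionary is a hypothetical in-memory stand-in for the web, and the crawl function and the .example hostnames are invented for the sketch.

```python
from collections import deque

# Hypothetical link graph: each key is a site, each value the sites it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
    "e.example": ["a.example"],  # no site links here
}


def crawl(start, links):
    """Breadth-first crawl: visit a site, then queue every unseen site it links to."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        site = queue.popleft()
        order.append(site)
        for target in links.get(site, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(crawl("a.example", LINKS))
```

Starting from a.example, the crawl reaches b.example, c.example, and d.example, but never e.example, because no other site links to it. That is the caveat in the parenthesis above: a crawler can only find sites that at least one known site points to.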

From this data, search engines such as Google and Bing are able to build comprehensive databases of sites, and use these to deliver search results. Every time you make a search, the search engine is calling upon masses of data which it has gained from scraping sites.

This is a fairly benign use of scraping. Here, scraping is being employed in a way which benefits everyone involved. The search engines (Google, Bing, etc.) benefit because they can deliver relevant sites to users. Users benefit because they can search for sites on these engines. The sites benefit because search engines afford them greater visibility.

From Good to Bad


Not all uses of scraping are as benevolent though. Just as scraping can be used to create all-encompassing search engines, scraping can be used to mine huge troves of personal data.

One such way of mining personal data is to scrape social media sites, such as Facebook. Scraping users’ profile pages can give basic information about them, who they’re friends with, and what photos they’ve posted.

Facebook realise the potential harm to user privacy from allowing anyone to scrape profiles. For this reason, most elements of a typical Facebook profile are set to private, meaning that they can’t be viewed by anyone who that user hasn’t added as a friend. If you try to scrape a random person’s Facebook account, you may not be able to pull much information other than their name, their profile picture, and any old posts on their timeline which haven’t been made private.

Facebook’s attempts to prevent profile scraping are praiseworthy, but they don’t go far enough. This is because some of the most valuable information that users create while using Facebook products isn’t surfaced on their profile at all.

Pages & Groups


Facebook pages and groups are two products which many of us are familiar with. By liking pages, we can express affinity for certain brands or causes, and add their content to our timelines. By joining groups we can become part of online communities, and share with others that share our interests or identities.

The sheer number of pages and groups that exist on Facebook is testament to the value they bring to people. The fact that there are so many pages and groups on Facebook also means that there is a wealth of data to be gained from knowing who likes what page, and who is part of what group.

Some of the pages we like, or groups that we’re members of, are fairly benign. If an advertiser wants to see if I like cycling, they don’t need to see that I like fan pages for a number of professional cyclists to work this out. They can simply rely on any of the many cycling-related ‘interest audiences’ which Facebook offers them for targeting.

But what if an advertiser wants to target someone based on much more personal attributes? What if an advertiser wants to target someone based on their sexuality, their religion, or their ethnicity?

A quick look at the groups and pages that exist on Facebook shows that there are huge numbers of them which appeal to people with these kinds of attributes. If you’re LGBTQ you might like the LGBTQ Nation page, if you’re Muslim you might be a member of United Muslims, and if you’re black you might be a fan of Black Lives Matter.

If a bad actor had access to which pages and groups you followed, they’d be able to deduce a great deal about what sort of person you are. So, can bad actors access this data?

Scraping Groups & Pages


For every page, there is a list of people that like that page. For every group, there is a list of people that are members of that group. Facebook don’t make these lists readily available, but that doesn’t mean that they are particularly hard to find.

Say you want to find all the people who are members of a particular group. Let’s start with a group which doesn’t appeal to people with protected characteristics, like Running Events. This is a UK-based group for people to share running events. We can find a list of its members simply by appending /members to its URL.

The full list of group members doesn’t immediately populate; you have to keep scrolling down in order to see the full list. This would be somewhat tedious for a human to do, meaning that a manual approach wouldn’t scale. Unfortunately for users’ privacy, it’s fairly simple to write bots which can not only access the member list of a group, but which can also keep scrolling down the page as a human would, causing Facebook to load more members.

Once the bot has scrolled all the way to the bottom, it can now start scraping the page. It does this by saving the page’s HTML, and looking for markers which indicate users’ profile URLs. It programmatically runs through the entire HTML, and saves the profile URL of each user.

At this point the bot has potentially done its job. It’s successfully scraped the profile URL of every person who is a member of Running Events. This on its own is already a worrying feat. In addition to scraping profile URLs, the bot could then scrape those URLs to pull data points such as people’s names, and whatever other attributes they make public on their profile.

If the actor behind the bot already has some data on these people, they could augment this data based on the results of their scraping. For example, if a running retailer already has a comprehensive customer database, they could use the data they’ve scraped to learn which of their users are interested in running events.

Custom Audiences: Enriching The Data


Perhaps whoever is carrying out this scraping doesn’t just want to know who is into running events, but wants to specifically target people with advertisements; how do they do this?

Facebook allows you to upload customer data into its ad platform in order to target those customers, a process known as custom audience creation. Facebook don’t want to allow advertisers to be able to target users whose profiles they’ve scraped, so you can’t simply give Facebook the list of profile URLs you’ve just found.

To be able to target these users, you need to enrich the data, and be able to find email addresses and phone numbers for the users whose profiles have been scraped. If you can pass Facebook emails and phone numbers, in addition to their names, then Facebook will have enough fields to be able to match your data against Facebook users, effectively letting you target the people whose profiles you’ve scraped.

So, if all you know is someone’s name and Facebook profile, how do you get their email address or phone number?


People Search Engines


Search engines let us search sites. People search engines (PSEs) let us search people. It’s as simple as that.

While very few of us will have ever used a PSE, there are a variety of them available online. PSEs all work in the same way. They hold huge databases of personal details, and allow searchers to look up the people in them by providing one or more of the fields stored in the database.

One of the fields that these databases hold is social media profile URLs. By providing a Facebook profile URL to a PSE, the engine is able to find the user in its database with that same profile URL, and tell you everything else it knows about that user.

Some of the most well-known PSEs available currently are Pipl and CatchID. Both Russian companies, they offer APIs which allow users to upload hundreds of thousands of social media profiles to them. In return, users are offered everything that the PSEs know about the profile that’s been uploaded. This often includes phone numbers and emails.

If someone were to scrape a list of people who belong to a particular Facebook group, or who like a certain page, they could easily upload their profile URLs to a PSE. The PSE would, in most cases, be able to find a phone number and email address for the person whose profile URL was uploaded. If you have a list of people’s names, emails, and phone numbers, you can then upload this into Facebook in order to target these people with ads.

Think this all sounds like too much work? Worry not, there are services which can handle the scraping and data enrichment for you. One such service is LeadEnforce, which automates the whole process of scraping group members and page fans, and enriches this data with people search engines like Pipl and CatchID. LeadEnforce plans start at $99 a month.

What Does This Mean for User Privacy?


When we’re reminded of how much sensitive information we express through our group memberships and our page likes, it’s easy to see why the above presents a huge threat to user privacy online. If an LGBTQ person likes the LGBTQ Nation page, or if a Muslim is a member of the United Muslims group, they’re exposing pieces of sensitive personal information to any bad actor with the technical know-how to build a scraping bot.

Once a bad actor has access to this information, there are countless ways it can be abused. Scraped data could be used to serve voter-suppression ads to specific minorities, reducing their electoral turnout by suggesting that opposition candidates dislike their minority group. Scraped data could be used to target pharmaceutical ads to people with specific medical conditions, conditions which the bad actor has gleaned from member lists of groups like The Hairloss Crusaders.

It isn’t just about who you show ads to; it’s also about who you don’t show ads to. A homophobic restaurant owner could scrape data from local LGBT pages and set up their ads so that they don’t show to these users. A loan provider could create audiences of those in debt management groups, and ensure these people don’t see any of their loan ads. The possibilities, sadly, are endless.

Teaching Facebook What a Minority Looks Like


The threat posed by scraping isn’t limited to the people whose data is being scraped. By uploading data of people who like a certain page or group, a bad actor can teach Facebook what these people look like. A bad actor can do this by creating a lookalike audience from the data they upload.

Facebook populates the lookalike audience with its users who most closely resemble those whose details have been uploaded. In this way, a bad actor could use lookalikes to find audiences of people with protected characteristics.

If a bad actor wanted to target or exclude Jews from seeing their ads on Facebook, they could upload data for people who like Jewish pages or groups, and create a lookalike audience from that data. The lookalike audience would likely contain plenty of people who aren’t Jewish, but crucially it would likely over-index on people who are Jewish. This means that the proportion of people in the lookalike audience who are Jewish would be much higher than, say, the national average.
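To put a number on what over-indexing means, here is a small worked example. The figures are invented purely for illustration, not measurements of any real audience, and the over_index helper is a hypothetical name for a simple ratio.

```python
def over_index(share_in_audience, share_in_population):
    """Ratio of a group's share of an audience to its share of the wider population."""
    if share_in_population <= 0:
        raise ValueError("population share must be positive")
    return share_in_audience / share_in_population


# Hypothetical figures, for illustration only.
audience_share = 0.20    # the group makes up 20% of the lookalike audience
population_share = 0.02  # the same group makes up 2% of the national population

ratio = over_index(audience_share, population_share)
print(f"Over-index: about {ratio:.0f}x the national average")
```

With these made-up shares, four in five people in the audience are still outside the group, yet the group is roughly ten times more concentrated there than in the population at large. That concentration is what makes the audience useful for targeting or exclusion.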

Beyond Just Advertising


To make matters worse, the danger doesn’t just stop with advertising. Bad actors could create entire databases of people based on specific characteristics, and use this to inform business decisions. A health insurance provider could scrape pages and groups related to medical conditions en masse, and use this information to deny people coverage, or inflate prices.

Arguably you wouldn’t even need to scrape page or group member lists for this. If you want to see all of a user’s page likes or groups then you can, if they haven’t been set to private, simply append /likes or /groups to their Facebook profile URL to get a complete list. In just seconds, you can learn things about a person that even their best friends may not know.

To Wrap Up


Facebook may disallow web scraping in their terms and conditions, but the fact that they make it so easy to carry out implies that they don’t see it as a serious issue. With the amount of data exposed by being able to see someone’s page likes, or their groups, the threat to user privacy is severe. For as long as Facebook don’t take steps to actually prevent web scraping, it will remain an ongoing threat to user privacy.

