How do I prevent site scraping? [closed]

This post is translated from: How do I prevent site scraping? [closed]

I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy Artist names here and there and then do google searches for them).

How can I prevent screen scraping? Is it even possible?


#1

Reference: https://stackoom.com/question/dgsI/如何防止网站抓取-关闭


#2

Things that might work against beginner scrapers:

  • IP blocking
  • use lots of ajax
  • check referer request header (see the sketch after this list)
  • require login
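
A minimal sketch of the referer check, assuming a plain Node.js HTTP server written in TypeScript; the hostname and the block behaviour are placeholder assumptions, not a drop-in solution:

    import { createServer } from "node:http";

    // Hypothetical allowed origin; replace with your own domain.
    const ALLOWED_HOST = "www.example-music-site.com";

    const server = createServer((req, res) => {
      const referer = req.headers["referer"] ?? "";
      // Inner-page requests arriving with a foreign or missing referer are
      // often bots, but browsers can strip the header too, so treat this as
      // a signal rather than proof.
      if (req.url !== "/" && !referer.includes(ALLOWED_HOST)) {
        res.statusCode = 403;
        res.end("Forbidden");
        return;
      }
      res.end("Normal page content");
    });

    server.listen(8080);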

Things that will help in general:

  • change your layout every week
  • robots.txt

Things that will help but will make your users hate you:

  • captcha

#3

One way would be to serve the content as XML attributes, URL encoded strings, preformatted text with HTML encoded JSON, or data URIs, then transform it to HTML on the client. Here are a few sites which do this:

  • Skechers : XML

     <document filename="" height="" width="" title="SKECHERS" linkType="" linkUrl="" imageMap="" href="http://www.bobsfromskechers.com" alt="BOBS from Skechers" title="BOBS from Skechers" />
  • Chrome Web Store : JSON

     <script type="text/javascript" src="https://apis.google.com/js/plusone.js">{"lang": "en", "parsetags": "explicit"}</script> 
  • Bing News : data URL

     <script type="text/javascript"> //<![CDATA[ (function() { var x;x=_ge('emb7'); if(x) { x.src='data:image/jpeg;base64,/*...*/'; } }() ) 
  • Protopage : URL Encoded Strings

     unescape('Rolling%20Stone%20%3a%20Rock%20and%20Roll%20Daily') 
  • TiddlyWiki : HTML Entities + preformatted JSON

      <pre> {"tiddlers": { "GettingStarted": { "title": "GettingStarted", "text": "Welcome to TiddlyWiki, } } } </pre>
  • Amazon : Lazy Loading

     amzn.copilot.jQuery=i;amzn.copilot.jQuery(document).ready(function(){d(b);f(c,function() {amzn.copilot.setup({serviceEndPoint:h.vipUrl,isContinuedSession:true})})})},f=function(i,h){var j=document.createElement("script");j.type="text/javascript";j.src=i;j.async=true;j.onload=h;a.appendChild(j)},d=function(h){var i=document.createElement("link");i.type="text/css";i.rel="stylesheet";i.href=h;a.appendChild(i)}})(); amzn.copilot.checkCoPilotSession({jsUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-js/cs-copilot-customer-js-min-1875890922._V1_.js', cssUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-css/cs-copilot-customer-css-min-2367001420._V1_.css', vipUrl : 'https://copilot.amazon.com' 
  • XMLCalabash : Namespaced XML + Custom MIME type + Custom File extension

      <p:declare-step type="pxp:zip"> <p:input port="source" sequence="true" primary="true"/> <p:input port="manifest"/> <p:output port="result"/> <p:option name="href" required="true" cx:type="xsd:anyURI"/> <p:option name="compression-method" cx:type="stored|deflated"/> <p:option name="compression-level" cx:type="smallest|fastest|default|huffman|none"/> <p:option name="command" select="'update'" cx:type="update|freshen|create|delete"/> </p:declare-step> 

If you view source on any of the above, you see that scraping will simply return metadata and navigation.
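
A minimal sketch of the same idea in the browser, along the lines of the TiddlyWiki example: ship the data as escaped JSON inside a <pre> and only turn it into DOM nodes with script. The element IDs and payload shape here are made up for illustration:

    // Assumes markup like:
    //   <pre id="payload" style="display:none">{"artists": ["..."]}</pre>
    //   <ul id="artist-list"></ul>
    // Reading textContent decodes any HTML entities for us.
    const pre = document.getElementById("payload");
    const target = document.getElementById("artist-list");

    if (pre && target && pre.textContent) {
      const data = JSON.parse(pre.textContent) as { artists: string[] };
      for (const name of data.artists) {
        const li = document.createElement("li");
        li.textContent = name;
        target.appendChild(li);
      }
    }

A scraper that never executes script sees only the escaped blob; one that runs a headless browser still gets everything, so this raises the bar rather than closing the door.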


#4

A quick approach to this would be to set a booby/bot trap.

  1. Make a page that, if it's opened a certain number of times or even opened at all, will collect certain information like the IP and whatnot (you can also consider irregularities or patterns, but this page shouldn't have to be opened at all).

  2. Make a link to this page that is hidden with CSS display:none; or left:-9999px; position:absolute; and try to place it somewhere it is less likely to be ignored, such as within your content rather than your footer, as bots can sometimes choose to skip certain parts of a page.

  3. In your robots.txt file, set a whole bunch of disallow rules for pages you don't want friendly bots (LOL, like they have happy faces!) to gather information on, and set this page as one of them.

  4. Now, if a friendly bot comes through, it should ignore that page. Right, but that still isn't good enough. Make a couple more of these pages, or somehow re-route a page to accept different names, then place more disallow rules for these trap pages in your robots.txt file alongside the pages you want ignored.

  5. Collect the IP of these bots, or of anyone that enters these pages. Don't ban them, but make a function to display noodled text in your content, like random numbers, copyright notices, or specific text strings, or display scary pictures: basically anything to hinder your good content. You can also set links that point to a page which will take forever to load, i.e. in PHP you can use the sleep() function. This will fight the crawler back if it has some sort of detection to skip pages that take too long to load, as some well-written bots are set to process X amount of links at a time (see the sketch after this list).

  6. If you have made specific text strings/sentences, why not go to your favorite search engine and search for them? It might show you where your content is ending up.
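
A minimal sketch of steps 1 and 5 as a plain Node.js server in TypeScript; the /trap path, the delay, and the "noodled" junk response are illustrative assumptions:

    import { createServer } from "node:http";

    // IPs that have ever touched the hidden trap page (step 1).
    const trappedIps = new Set<string>();

    const server = createServer((req, res) => {
      const ip = req.socket.remoteAddress ?? "unknown";

      // /trap is linked only via a CSS-hidden anchor and is disallowed in
      // robots.txt, so neither a human nor a polite bot should ever load it.
      if (req.url === "/trap") {
        trappedIps.add(ip);
      }

      if (trappedIps.has(ip)) {
        // Step 5: don't ban, poison. Delay the response, then serve junk.
        setTimeout(() => {
          res.end("4861 copyright-notice 9273 lorem 1182"); // noodled text
        }, 30_000); // the PHP sleep() idea, non-blocking in Node
        return;
      }

      res.end("Real page content");
    });

    server.listen(8080);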

Anyway, if you think tactically and creatively, this could be a good starting point. The best thing to do would be to learn how a bot works.

I'd also think about scrambling some IDs or the way attributes on the page element are displayed:

<a class="someclass" href="../xyz/abc" rel="nofollow" title="sometitle"> 

that changes its form every time, as some bots might be set to look for specific patterns in your pages or targeted elements, for example:

<a title="sometitle" href="../xyz/abc" rel="nofollow" class="someclass"> 

id="p-12802" > id="p-00392"

#5

Sorry, it's really quite hard to do this...

I would suggest that you politely ask them to not use your content (if your content is copyrighted).

If it is and they don't take it down, then you can take further action and send them a cease and desist letter.

Generally, whatever you do to prevent scraping will probably end up having a more negative effect, e.g. on accessibility, bots/spiders, etc.


#6

Your best option is unfortunately fairly manual: look for traffic patterns that you believe are indicative of scraping and ban their IP addresses (a sketch of this follows below).

Since you're talking about a public site, making the site search-engine friendly will also make it scraping-friendly. If a search engine can crawl and scrape your site, then a malicious scraper can as well. It's a fine line to walk.
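
A minimal sketch of the pattern-spotting part in TypeScript, with a made-up threshold; in practice you would review the flagged IPs by hand before banning anything:

    // Flag IPs that fetch pages much faster than any human would.
    const hits = new Map<string, number[]>(); // ip -> request timestamps (ms)

    // Hypothetical threshold: over 60 requests a minute looks automated.
    const WINDOW_MS = 60_000;
    const MAX_REQUESTS = 60;

    function recordHit(ip: string, now = Date.now()): boolean {
      const recent = (hits.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
      recent.push(now);
      hits.set(ip, recent);
      return recent.length > MAX_REQUESTS; // true => candidate for a ban
    }

Each request handler would call recordHit(req.socket.remoteAddress) and queue any flagged address for manual review rather than banning it automatically.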
