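Selenium Blocked and Unblocked: Notes on a Crawl That Died Before It Began, and How Taobao and Meituan Are Cracking Down on Crawlers

Have you ever had a crawler die before it even got started?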

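I have recently been building a crawler for a foreign website whose Chinese name is 蝙蝠 ("Bat"). One netizen describes it like this: "A trade-intermediary site with a PR value of 6; fairly reliable." It lists information on many companies, such as phone numbers, addresses, and lines of business, and my goal is to collect the information for specific companies.

This post is not about how to get past Taobao or Meituan. It uses one case to explain the techniques and ideas those sites have in common, because before long even scraping with Selenium will run into strong anti-crawling mechanisms: in this contest the other side has already started fighting back against Selenium.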

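Let's begin. First I turned on a VPN and opened the site by hand, and it returned the page below:

This was my very first manual visit to the site. However strong an anti-crawling mechanism is, I had never seen one where the page would not even open. Surprise after surprise.

The page originally also carried a Google image CAPTCHA, which I solved by hand. It then asks you to fill in some information, and after submitting it I got the following result:

The gist is that they need to review the request. What kind of website is this? On a first visit it demands a Google image CAPTCHA, an access application, and an access review. If I were a legitimate customer, the experience would be extremely poor.

Fine, perhaps the site somehow knew in advance that I was a crawler and refused me for that reason. So next I tried changing proxies, including visiting directly from an overseas Chrome cluster and from Amazon EC2 dynamic IPs, with the following result:

I felt insulted. The crawl really was dead before it started; I had never run into a site this tough.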

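I then tried accessing the site in several different ways:

VPN on, browser opened by hand. Result: Access To Website Blocked

VPN on, using the driver. Result: Access To Website Blocked

A Chrome cluster in the US. Result: Access To Website Blocked

Amazon EC2 dynamic IPs. Result: Access To Website Blocked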

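After many failed attempts I made a bold guess: the site must have an IP whitelist. To test the idea I used a local machine in the US (not one supplied by a cloud provider), and the site opened right away:

There it was, this unfamiliar page: access had finally succeeded. So my guess is that the site filters on IP, and that when the exit IP belongs to a VPN or to a cloud-server provider, access is refused.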
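The guess above can be illustrated with a minimal sketch of how a server might refuse exit IPs that fall inside known data-center ranges. This is not the site's actual logic, which is unknown. The names `DATACENTER_RANGES`, `is_datacenter_ip`, and `decide` are hypothetical, and the ranges are reserved documentation networks used purely as placeholders.

```python
import ipaddress

# Hypothetical list of data-center ranges. These are reserved
# documentation networks, used here only as placeholders.
DATACENTER_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_datacenter_ip(ip: str) -> bool:
    """Return True if ip falls inside any listed data-center range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)


def decide(ip: str) -> str:
    """Refuse data-center exits and let everything else through."""
    return "blocked" if is_datacenter_ip(ip) else "allowed"


print(decide("203.0.113.7"))  # inside a listed range
print(decide("192.0.2.10"))   # outside every listed range
```

A real deployment would presumably rely on published cloud-provider ranges or a commercial IP reputation feed rather than a hand-written list.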

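I thought that was the end of it, but I was being naive. Next, using the local US IP together with the driver, I still got the result below:

I silently cursed the site, and then it dawned on me that its anti-crawling technique might be on a par with Taobao's and Meituan's: it inspects the environment to detect whether the browser is being controlled by a driver. Here is how I analyzed it.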
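As a rough illustration of what such an environment check might look at, here is a minimal server-side sketch that scans a reported browser fingerprint for automation hints. It is an assumption, not the site's real rule set. The function `automation_hints` and the `webdriver` field (meant to mirror the browser's `navigator.webdriver` property) are hypothetical; `userAgent` and `plugins` are field names that appear in the captured payload later in this post.

```python
def automation_hints(fp: dict) -> list:
    """Return the reasons a fingerprint looks automated (empty if none)."""
    hints = []
    # Hypothetical field mirroring navigator.webdriver, which browsers
    # set to true when they are driven through WebDriver.
    if fp.get("webdriver"):
        hints.append("webdriver flag set")
    # Headless Chrome announces itself in its default user agent.
    if "HeadlessChrome" in fp.get("userAgent", ""):
        hints.append("headless user agent")
    # An empty plugin list is treated here as a weak extra signal.
    if not fp.get("plugins"):
        hints.append("no plugins reported")
    return hints


normal = {
    "userAgent": "Mozilla/5.0 Chrome/71.0.3578.98 Safari/537.36",
    "plugins": "ChromePDFPlugin",
    "webdriver": False,
}
driven = {"userAgent": "Mozilla/5.0 HeadlessChrome/71.0", "plugins": "", "webdriver": True}

print(automation_hints(normal))
print(automation_hints(driven))
```

Any single signal can be spoofed, which is presumably why sites in this category collect a large fingerprint rather than one flag.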

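The only thing left was to analyze the network requests:

The HTTP 409 error stood out immediately. I then went through the requests that preceded the 409 one by one, and the effort paid off: I found a POST request that had succeeded: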
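The search just described, walking back from the 409 to the last POST that succeeded, can be sketched over a captured request log. The log entries, the `example.com` URLs, and the function name `last_ok_post_before_409` are all hypothetical; in practice the entries would come from the browser's network panel or an exported capture.

```python
from typing import Optional

# Hypothetical capture in request order; the URLs are placeholders.
requests_log = [
    {"method": "GET", "url": "https://example.com/", "status": 200},
    {"method": "GET", "url": "https://example.com/static/check.js", "status": 200},
    {"method": "POST", "url": "https://example.com/verify", "status": 200},
    {"method": "GET", "url": "https://example.com/companies", "status": 409},
]


def last_ok_post_before_409(entries: list) -> Optional[dict]:
    """Return the most recent 2xx POST that precedes the first 409, if any."""
    for index, entry in enumerate(entries):
        if entry["status"] == 409:
            # Walk backwards through everything sent before the 409.
            for earlier in reversed(entries[:index]):
                if earlier["method"] == "POST" and 200 <= earlier["status"] < 300:
                    return earlier
            return None
    return None


found = last_ok_post_before_409(requests_log)
print(found["url"] if found else "no successful POST before a 409")
```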

Then look at the data submitted in the POST:

p=%7B%22proof%22%3A%22b3%3A1545721180848%3AHcCAd39S7g05mx3PpKsI%22%2C%22fp2%22%3A%7B%22userAgent%22%3A%22Mozilla%2F5.0(WindowsNT10.0%3BWOW64)AppleWebKit%2F537.36(KHTML%2ClikeGecko)Chrome%2F71.0.3578.98Safari%2F537.36%22%2C%22language%22%3A%22zh-CN%22%2C%22screen%22%3A%7B%22width%22%3A1920%2C%22height%22%3A1080%2C%22availHeight%22%3A1040%2C%22availWidth%22%3A1920%2C%22pixelDepth%22%3A24%2C%22innerWidth%22%3A968%2C%22innerHeight%22%3A889%2C%22outerWidth%22%3A1539%2C%22outerHeight%22%3A1020%2C%22devicePixelRatio%22%3A1%7D%2C%22timezone%22%3A8%2C%22indexedDb%22%3Atrue%2C%22addBehavior%22%3Afalse%2C%22openDatabase%22%3Atrue%2C%22cpuClass%22%3A%22unknown%22%2C%22platform%22%3A%22Win32%22%2C%22doNotTrack%22%3A%22unknown%22%2C%22plugins%22%3A%22ChromePDFPlugin%3A%3APortableDocumentFormat%3A%3Aapplication%2Fx-google-ug_shaders%3BWEBGL_depth_texture%3BWEBKIT_WEBGL_depth_texture%3BWEBGL_draw_buffers%3BWEBGL_lose_context%3BWEBKIT_WEBGL_lose_context%22%2C%22aliasedlinewidthrange%22%3A%22%5B1%2C1%5D%22%2C%22aliasedpointsizerange%22%3A%22%5B1%2C1024%5D%22%2C%22alphabits%22%3A8%2C%22antialiasing%22%3A%22yes%22%2C%22bluebits%22%3A8%2C%22depthbits

·········

The data is clearly URL-encoded, so the next step is to URL-decode it:

from urllib import parse

# Value of the "p" field from the captured POST body.
# The capture is cut off in the source, so the decoded text also ends mid-field.
encoded = "%7B%22proof%22%3A%22b3%3A1545721180848%3AHcCAd39S7g05mx3PpKsI%22%2C%22fp2%22%3A%7B%22userAgent%22%3A%22Mozilla%2F5.0(WindowsNT10.0%3BWOW64)AppleWebKit%2F537.36(KHTML%2ClikeGecko)Chrome%2F71.0.3578.98Safari%2F537.36%22%2C%22language%22%3A%22zh-CN%22%2C%22screen%22%3A%7B%22width%22%3A1920%2C%22height%22%3A1080%2C%22availHeight%22%3A1040%2C%22availWidth%22%3A1920%2C%22pixelDepth%22%3A24%2C%22innerWidth%22%3A968%2C%22innerHeight%22%3A889%2C%22outerWidth%22%3A1539%2C%22outerHeight%22%3A1020%2C%22devicePixelRatio%22%3A1%7D%2C%22timezone%22%3A8%2C%22indexedDb%22%3Atrue%2C%22addBehavior%22%3Afalse%2C%22openDatabase%22%3Atrue%2C%22cpuClass%22%3A%22unknown%22%2C%22platform%22%3A%22Win32%22%2C%22doNotTrack%22%3A%22unknown%22%2C%22plugins%22%3A%22ChromePDFPlugin%3A%3APortableDocumentFormat%3A%3Aapplication%2Fx-google-chrome-pdfpdf%3BChromePDFViewer%3A%3A%3A%3Aapplication%2Fpdfpdf%3BNativeClient%3A%3A%3A%3Aapplication%2Fx-nacl%2Capplication%2Fx-pnacl%22%2C%22canvas%22%3A%7B%22winding%22%3A%22yes%22%2C%22towebp%22%3Atrue%2C%22blending%22%3Atrue%2C%22img%22%3A%22f41a4128d68ea76d3784bd619744f7b83fe826eb%22%7D%2C%22webGL%22%3A%7B%22img%22%3A%22bd6549c125f67b18985a8c509803f4b883ff810c%22%2C%22extensions%22%3A%22ANGLE_instanced_arrays%3BEXT_blend_minmax%3BEXT_color_buffer_half_float%3BEXT_disjoint_timer_query%3BEXT_frag_depth%3BEXT_shader_texture_lod%3BEXT_texture_filter_anisotropic%3BWEBKIT_EXT_texture_filter_anisotropic%3BEXT_sRGB%3BOES_element_index_uint%3BOES_standard_derivatives%3BOES_texture_float%3BOES_texture_float_linear%3BOES_texture_half_float%3BOES_texture_half_float_linear%3BOES_vertex_array_object%3BWEBGL_color_buffer_float%3BWEBGL_compressed_texture_s3tc%3BWEBKIT_WEBGL_compressed_texture_s3tc%3BWEBGL_compressed_texture_s3tc_srgb%3BWEBGL_debug_renderer_info%3BWEBGL_debug_shaders%3BWEBGL_depth_texture%3BWEBKIT_WEBGL_depth_texture%3BWEBGL_draw_buffers%3BWEBGL_lose_context%3BWEBKIT_WEBGL_lose_context%22%2C%22aliasedlinewidthrange%22%3A%22%5B1%2C1%5D%22%2C%22aliasedpointsizerange%22%3A%22%5B1%2C1024%5D%22%2C%22alphabits%22%3A8%2C%22antialiasing%22%3A%22yes%22%2C%22bluebits%22%3A8%2C%22depthbits%22%3A24%2C%22greenbits%22%3A8%2C%22maxanisotropy%22%3A16%2C%22maxcombinedtextureimageunits%22%3A32%2C%22maxcubemaptexturesize%22%3A16384%2C%22maxfragmentuniformvectors%22%3A1024%2C%22maxrenderbuffersize%22%3A16384%2C%22maxtextureimageunits%22%3A16%2C%22maxtexturesize%22%3A16384%2C%22maxvaryingvectors%22%3A30%2C%22maxvertexattribs%22%3A16%2C%22maxvertextextureimageunits%22%3A16%2C%22maxvertexuniformvectors%22%3A4096%2C%22maxviewportdims%22%3A%22%5B16384%2C16384%5D%22%2C%22redbits%22%3A8%2C%22renderer%22%3A%22WebKitWebGL%22%2C%22shadinglanguageversion%22%3A%22WebGLGLSLES1.0(OpenGLESGLSLES1.0Chromium)%22%2C%22stencilbits%22%3A0%2C%22vendor%22%3A%22WebKit%22%2C%22version%22%3A%22WebGL1.0(OpenGLES2.0Chromium)%22%2C%22vertexshaderhighfloatprecision%22%3A23%2C%22vertexshaderhighfloatprecisionrangeMin%22%3A127%2C%22vertexshaderhighfloatprecisionrangeMax%22%3A127%2C%22vertexshadermediumfloatprecision%22%3A23%2C%22vertexshadermediumfloatprecisionrangeMin%22%3A127%2C%22vertexshadermediumfloatprecisionrangeMax%22%3A127%2C%22vertexshaderlowfloatprecision%22%3A23%2C%22vertexshaderlowfloatprecisionrangeMin%22%3A127%2C%22vertexshaderlowfloatprecisionrangeMax%22%3A127%2C%22fragmentshaderhighfloatprecision%22%3A23%2C%22fragmentshaderhighfloatprecisionrangeMin%22%3A127%2C%22fragmentshaderhighfloatprecisionrangeMax%22%3A127%2C%22fragmentshadermediumfloatprecision%22%3A23%2C%22fragmentshadermediumfloatprecisionrangeMin%22%3A127%2C%22fragmentshadermediumfloatprecisionrangeMax%22%3A127%2C%22fragmentshaderlowfloatprecision%22%3A23%2C%22fragmentshaderlowfloatprecisionrangeMin%22%3A127%2C%22fragmentshaderlowfloatprecisionrangeMax%22%3A127%2C%22verte"
print(parse.unquote(encoded))
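Once decoded, the value is JSON, so the fingerprint fields can be inspected directly instead of being read as one long string. The sketch below assumes a complete request body; because the capture in this post is cut off, it uses an abbreviated body assembled from a few fields that do appear in the capture (`proof`, `language`, `timezone`, `platform`). The variable names are illustrative.

```python
import json
from urllib.parse import parse_qs

# Abbreviated, well-formed body built from fields seen in the capture.
# The real body carries many more fingerprint fields under "fp2".
body = (
    "p=%7B%22proof%22%3A%22b3%3A1545721180848%3AHcCAd39S7g05mx3PpKsI%22"
    "%2C%22fp2%22%3A%7B%22language%22%3A%22zh-CN%22%2C%22timezone%22%3A8"
    "%2C%22platform%22%3A%22Win32%22%7D%7D"
)

# parse_qs splits the form body and percent-decodes each value.
fields = parse_qs(body)
data = json.loads(fields["p"][0])

print(data["proof"])
for name, value in data["fp2"].items():
    print(f"{name}: {value!r}")
```

Reading the payload this way makes it easier to compare a fingerprint collected from a hand-driven browser with one collected under the driver, field by field.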
