Two ways to forge HTTP headers with urllib2 in Python

weixin_30273931

Tags: xhtml python

Original link: http://www.cnblogs.com/skying555/p/4592590.html

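When scraping web pages, you often need to forge the request headers so that the site serves your script the same content it would serve a browser.

Below, we use the header support in urllib2 to send forged headers when fetching a page.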

Method 1: pass a headers dict to urllib2.Request

 
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # Filename: urllib2-header.py

    import urllib2
    import sys

    # Fetch a page, sending forged headers (method 1)
    url = "http://www.jb51.net"
    send_headers = {
        'Host': 'www.jb51.net',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive'
    }

    req = urllib2.Request(url, headers=send_headers)
    r = urllib2.urlopen(req)

    html = r.read()            # body of the page
    receive_header = r.info()  # response headers

    # sys.getfilesystemencoding()
    # Re-encode for the local system so the output is not garbled
    html = html.decode('utf-8', 'replace').encode(sys.getfilesystemencoding())

    print receive_header
    # print '####################################'
    print html
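urllib2 exists only in Python 2; in Python 3 the same classes live in urllib.request. As a rough sketch of Method 1 under that assumption, the snippet below builds the request from a headers dict and prints the headers it would carry, without touching the network. The helper name build_request is made up for illustration and does not come from the original article.

```python
import urllib.request


def build_request(url, headers):
    """Hypothetical helper: return a Request carrying the given headers.

    Nothing is sent until the request is passed to urlopen().
    """
    return urllib.request.Request(url, headers=headers)


send_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

req = build_request("http://www.jb51.net", send_headers)

# Request stores header names via str.capitalize(),
# so 'User-Agent' is kept internally as 'User-agent'.
for name, value in req.header_items():
    print(name, "=", value)

# To actually send the request (needs network access):
# with urllib.request.urlopen(req) as r:
#     html = r.read().decode('utf-8', 'replace')
```

Inspecting the Request object first is a cheap way to confirm which headers are attached before any traffic is generated.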

Method 2: add headers one at a time with add_header()

 
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # Filename: urllib2-header.py

    import urllib2
    import sys

    url = 'http://www.jb51.net'
    req = urllib2.Request(url)

    # Attach each forged header to the request individually
    req.add_header('Referer', 'http://www.jb51.net/')
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0')

    r = urllib2.urlopen(req)

    html = r.read()            # body of the page
    receive_header = r.info()  # response headers

    # Re-encode for the local system so the output is not garbled
    html = html.decode('utf-8').encode(sys.getfilesystemencoding())

    print receive_header
    print '#####################################'
    print html
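For readers on Python 3, where urllib2 has become urllib.request, Method 2 might look roughly like the sketch below. It also shows one possible alternative to hard-coding 'utf-8' when decoding: reading the charset from the response's Content-Type header. The function decode_body and the sample GBK payload are assumptions made for illustration, not part of the original article, and the example runs offline by faking the response headers with email.message.Message.

```python
import urllib.request
from email.message import Message

url = 'http://www.jb51.net'
req = urllib.request.Request(url)

# Same add_header() calls as in Method 2
req.add_header('Referer', 'http://www.jb51.net/')
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0')

print(req.header_items())


def decode_body(raw, headers, default='utf-8'):
    """Hypothetical helper: decode bytes using the charset declared in
    the Content-Type header, falling back to `default` if none is given."""
    charset = headers.get_content_charset() or default
    return raw.decode(charset, 'replace')


# Offline stand-in for a response's headers and body (illustrative data only)
sample_headers = Message()
sample_headers['Content-Type'] = 'text/html; charset=gbk'
sample_raw = '\u4f60\u597d'.encode('gbk')

print(decode_body(sample_raw, sample_headers))

# With a real response (needs network access):
# with urllib.request.urlopen(req) as r:
#     html = decode_body(r.read(), r.headers)
```

In Python 3 the decoded result is already a str, so the extra .encode(sys.getfilesystemencoding()) step from the Python 2 code is normally unnecessary before printing.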

Reposted from: https://www.cnblogs.com/skying555/p/4592590.html
