Getting Started with Python: Scraping the App Store Charts with a Spider (Part 1)

Original article: http://www.cnblogs.com/etodream/p/3918264.html


Straight to the good stuff!

Environment: Python 2.7.5 on Windows.

Open http://www.apple.com/cn/itunes/charts/free-apps/

As the screenshot above shows, the page is encoded in UTF-8.
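The encoding can also be read from the page source rather than by eye. Below is a minimal sketch that looks for a `charset` declaration in a `<meta>` tag. The helper name `detect_charset` and the sample markup are assumptions made for illustration and are not taken from the Apple page; the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_META_CHARSET = re.compile(
    br'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def detect_charset(html_bytes, default='utf-8'):
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    match = _META_CHARSET.search(html_bytes)
    if match is None:
        return default
    return match.group(1).decode('ascii').lower()


# Hypothetical sample markup, standing in for the downloaded page.
sample = b'<html><head><meta charset="utf-8" /></head><body></body></html>'
print(detect_charset(sample))
```

A regular expression is only a rough check. The `Content-Type` response header, when present, is another place where the charset may be declared.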

After some inner struggle, I wrote the code below (criticism is welcome, just go easy on me).

#coding=utf-8
import urllib2
import urllib
import re
import thread
import time


#----------- App Store charts -----------
class Spider_Model:

    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False

    def GetCon(self):
        myUrl = "http://www.apple.com/cn/itunes/charts/free-apps/"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        req = urllib2.Request(myUrl, headers = headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into a unicode string
        print myPage

print ' '
myModel = Spider_Model()
myModel.GetCon()

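The scraped page and the Python source file both use UTF-8, so yours truly assumed there would be no problem.

Printed output:

Time to pull out the trump card: www.baidu.com

Found the cause: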

http://blog.csdn.net/lf8289/article/details/2465196

http://www.crifan.com/unicodeencodeerror_gbk_codec_can_not_encode_character_in_position_illegal_multibyte_sequence/
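The mismatch can be reproduced without any network access. The sketch below assumes the Windows console uses GBK (code page 936), which is typical on Chinese-language Windows but is not confirmed by the original post. The sample string is made up. It shows two symptoms that can occur in this situation: UTF-8 bytes read as GBK turn into garbage, and strict encoding to GBK fails on a character GBK does not contain.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Made-up sample text; U+00A9 (the copyright sign) has no GBK mapping.
text = u'排行榜 \xa9'
utf8_bytes = text.encode('utf-8')   # what a UTF-8 page delivers over the wire

# Symptom 1: UTF-8 bytes interpreted as GBK no longer match the original text.
garbled = utf8_bytes.decode('gbk', 'replace')
print(garbled != text)

# Symptom 2: strict encoding to GBK raises UnicodeEncodeError.
try:
    text.encode('gbk')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc.encoding, exc.reason)
```

Either way, the remedy is the same: decode the downloaded bytes using the page's encoding first, then encode to whatever the console expects.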

A round of frantic edits follows...

#coding=gbk   (source file encoding changed to gbk)
import urllib2
import urllib
import re
import thread
import time


#----------- App Store charts -----------
class Spider_Model:

    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False

    def GetCon(self):
        myUrl = "http://www.apple.com/cn/itunes/charts/free-apps/"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        req = urllib2.Request(myUrl, headers = headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into a unicode string
        # The page is UTF-8: decode it, then encode to GBK
        # ('ignore' drops characters that GBK cannot encode).
        # Despite its name, unicodePage ends up holding GBK-encoded bytes.
        unicodePage = myPage.decode('utf-8').encode('gbk','ignore')
        print unicodePage

print ' '
myModel = Spider_Model()
myModel.GetCon()

Result of running it:
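Separately from the run above, the decode-then-encode step can be examined on its own. The sketch below is an illustration rather than the author's code: the helper `to_console_bytes` and the sample string are hypothetical. It compares the `'ignore'` error handler used in the script with `'replace'`, which leaves a visible `?` where a character was lost. It runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def to_console_bytes(raw, source='utf-8', target='gbk', errors='ignore'):
    """Decode `raw` bytes from `source`, then re-encode them as `target`."""
    return raw.decode(source).encode(target, errors)


# Made-up stand-in for the fetched page; U+00A9 has no GBK mapping.
page = u'免费 App \xa9 2014'.encode('utf-8')

dropped = to_console_bytes(page, errors='ignore')    # unmappable characters vanish
marked = to_console_bytes(page, errors='replace')    # unmappable characters become '?'

print(dropped.decode('gbk') == u'免费 App  2014')
print(marked.decode('gbk') == u'免费 App ? 2014')
```

`'ignore'` keeps the output tidy but hides data loss, while `'replace'` makes it obvious where something was dropped. Which one suits a scraper depends on whether the lost characters matter downstream.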

Reposted from: https://www.cnblogs.com/etodream/p/3918264.html
