Fetching Web Pages with Python

Latest recommended article published 2024-05-17 08:30:00

yuyuyu0012586

[python]

 #!/usr/bin/env python
 # 1.py
 # use UTF-8
 # Python 3.3.0

 # get the code of a given URL as an html text string
 # Python3 uses urllib.request.urlopen()
 # instead of Python2's urllib.urlopen() or urllib2.urlopen()
 # http://blog.csdn.net/zsuguangh/article/details/6226385
 import urllib.request

 # the with statement closes the connection automatically
 with urllib.request.urlopen("http://www.baidu.com") as fp:
     mybytes = fp.read()

 # note that Python3 reads the html code as bytes,
 # not as a string; convert it to a string with decode()
 mystr = mybytes.decode("utf-8")     # treat the received data as UTF-8 (so Chinese text is parsed and displayed correctly)

 print(mystr)
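
The script above assumes the page is UTF-8. A server often names the charset in its `Content-Type` response header, and the headers object on the response returned by `urlopen()` exposes it through `get_content_charset()`. The following is a minimal sketch of that idea, not part of the original post: `decode_body` and `fetch_text` are hypothetical helper names, and the UTF-8 fallback is an assumption.

```python
import urllib.request
from email.message import Message


def decode_body(body, headers, default="utf-8"):
    """Decode response bytes using the charset named in the headers.

    `headers` is any email.message.Message-like object, such as the
    `.headers` attribute of the response returned by urlopen().
    Falls back to `default` (an assumption) when no charset is declared.
    """
    charset = headers.get_content_charset() or default
    # errors="replace" substitutes U+FFFD for bad bytes instead of raising
    return body.decode(charset, errors="replace")


def fetch_text(url):
    """Hypothetical helper: fetch `url` and return its body as str."""
    with urllib.request.urlopen(url) as fp:
        return decode_body(fp.read(), fp.headers)


# Offline demonstration with a hand-built header object (no network needed)
demo_headers = Message()
demo_headers["Content-Type"] = "text/html; charset=GBK"
demo_text = decode_body("\u4e2d\u6587".encode("gbk"), demo_headers)
```

Calling `fetch_text("http://www.baidu.com")` would perform the real request. If the header names a codec Python does not know, `decode()` raises `LookupError`, so a caller may want to catch that and retry with the default.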

---------------------------------------------------------------------------------------------------------------------------------------------------------

2. Detecting the HTML encoding (really just a matter of string parsing)

---------------------------------------------------------------------------------------------------------------------------------------------------------
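
To see why the encoding has to be known before decoding, here is a small offline illustration (an added sketch, not from the original post). It encodes the same two Chinese characters as GBK and as UTF-8, then shows that decoding the GBK bytes as UTF-8 fails. The variable names are chosen for illustration only.

```python
# "\u4e2d\u6587" is the word for "Chinese (language)" written in Chinese
text = "\u4e2d\u6587"
gbk_bytes = text.encode("gbk")       # 2 bytes per character in GBK
utf8_bytes = text.encode("utf-8")    # 3 bytes per character in UTF-8

try:
    gbk_bytes.decode("utf-8")        # wrong codec for these bytes
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# Decoding with the matching codec round-trips cleanly
round_trip = gbk_bytes.decode("gbk")
```

The same page bytes are therefore readable or not depending on which codec is passed to `decode()`, which is what the script in this section tries to determine from the HTML itself.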

[python]

 #!/usr/bin/env python
 # 2.py
 # use UTF-8
 # Python 3.3.0

 # get the code of a given URL as an html text string
 # Python3 uses urllib.request.urlopen()
 # get the encoding used first
 # tested with Python 3.1 with the Editra IDE

 import urllib.request

 def extract(text, sub1, sub2):
     """
     extract a substring from text between the first
     occurrences of substrings sub1 and sub2;
     return "" if sub1 does not occur in text
     """
     if sub1 not in text:
         return ""
     return text.split(sub1, 1)[-1].split(sub2, 1)[0]

 with urllib.request.urlopen("http://www.baidu.com") as fp:    # open the URL
     mybytes = fp.read()                                        # read the HTML data

 # decode leniently so the markup can be searched as text, then
 # look for "charset=" in the HTML data to find the encoding
 markup = mybytes.decode("ascii", errors="ignore").lower()
 encoding = extract(markup, 'charset=', '"').strip()
 print('-'*50)
 print("Encoding type = %s" % encoding)
 print('-'*50)

 if encoding:
     # note that Python3 reads the html code as bytes,
     # not as a string; convert it to a string with decode()
     mystr = mybytes.decode(encoding)
     print(mystr)
 else:
     print("Encoding type not found!")
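
Splitting on `charset=` and `"` handles `content="text/html; charset=utf-8"`, but it yields an empty string for the HTML5 form `<meta charset="utf-8">`, where the quote comes immediately after the equals sign. As a hedged alternative, a regular expression over the raw bytes can accept both forms. This is a sketch under stated assumptions, not the original author's method: `sniff_charset` and `decode_html` are hypothetical names, and the 4096-byte search window and UTF-8 default are arbitrary choices.

```python
import codecs
import re

# Matches charset=utf-8, charset="utf-8" and charset='utf-8'
_CHARSET_RE = re.compile(
    rb"""charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def sniff_charset(raw, limit=4096):
    """Return the charset declared near the top of `raw`, or None."""
    match = _CHARSET_RE.search(raw[:limit])
    if match is None:
        return None
    name = match.group(1).decode("ascii").lower()
    try:
        codecs.lookup(name)      # reject names Python has no codec for
    except LookupError:
        return None
    return name


def decode_html(raw, default="utf-8"):
    """Decode page bytes with the sniffed charset, else `default`."""
    charset = sniff_charset(raw) or default
    return raw.decode(charset, errors="replace")
```

With the bytes read earlier, `decode_html(mybytes)` would replace the `extract()` call and the `if encoding:` branch. Checking the name with `codecs.lookup()` first avoids a `LookupError` from `decode()` when a page declares a codec Python does not provide.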
