Everything the urllib module can do can also be done with urllib2; when you need to access URL resources in a more flexible way, use the urllib2 module.
Basic methods of the urllib2 module:
fp = urllib2.urlopen("http://www.baidu.com")
print fp.read()          # read the resource from the file-like object
print fp.geturl()        # the URL actually retrieved (after redirects)
print fp.info().items()  # the response headers
You can also build a request object with the Request class and pass it to urlopen to achieve the same thing. When the optional data parameter is None, the URL resource is fetched with a GET request; when data is not None, the data is sent to the URL resource with a POST request.
request = urllib2.Request("http://www.baidu.com",data='data')
fp = urllib2.urlopen(request)
print fp.read()
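In Python 3, urllib and urllib2 were merged into urllib.request, but the GET/POST rule described above is unchanged: Request picks its HTTP method from the data argument. A minimal sketch that shows this without touching the network:

```python
from urllib import request

# Request with data=None defaults to GET; supplying a body switches to POST.
get_req = request.Request("http://www.baidu.com")
post_req = request.Request("http://www.baidu.com", data=b"data")

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

get_method() reports the method urlopen would use, so the behavior can be checked before any request is sent.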
Handlers in urllib2
Handlers include ProxyHandler, HTTPBasicAuthHandler, and many other processing classes.
You can construct and install these handlers with the build_opener method.
# use ProxyHandler to create a proxy server handler
proxy_handler = urllib2.ProxyHandler({'http': 'http://www.baidu.com'})
# create a basic-authentication handler
proxy_auth_handler = urllib2.HTTPBasicAuthHandler()
# set the user name and password
proxy_auth_handler.add_password('realm', 'host', 'user', 'passwd')
# use build_opener to construct a custom opener object
opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
# call install_opener to install it
urllib2.install_opener(opener)
# urlopen in the urllib2 module now fetches the URL through the specified proxy server
urllib2.urlopen("https://exmail.qq.com/login")
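The same handler chain in Python 3 uses urllib.request; a sketch with a placeholder proxy address (not a real proxy), which builds and installs the opener without making a request:

```python
from urllib import request

# hypothetical local proxy address, for illustration only
proxy_handler = request.ProxyHandler({'http': 'http://127.0.0.1:8080'})
auth_handler = request.HTTPBasicAuthHandler()
auth_handler.add_password('realm', 'host', 'user', 'passwd')

# chain the handlers into one opener and install it globally
opener = request.build_opener(proxy_handler, auth_handler)
request.install_opener(opener)  # later urlopen calls route through it
```

build_opener returns an OpenerDirector; install_opener makes it the default for module-level urlopen.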
Parsing HTML documents
HTMLParser is a module for parsing HTML documents.
Example:
import urllib
import urlparse
import HTMLParser
class CheckHTML(HTMLParser.HTMLParser):
    available = True
    def handle_data(self, data):
        if "404 Not Found" in data or "Error 404" in data:
            self.available = False

check_urls = ["index", "test", "help", "news", "faq", "download"]
for url in check_urls:
    new_url = urlparse.urljoin("http://www.python.org/", url)
    fp = urllib.urlopen(new_url)
    data = fp.read()
    fp.close()
    p = CheckHTML()
    p.feed(data)
    p.close()
    if p.available:
        print new_url, "==>OK"
    else:
        print new_url, "==> Not Found"
Result:
http://www.python.org/index ==> Not Found
http://www.python.org/test ==> Not Found
http://www.python.org/help ==>OK
http://www.python.org/news ==>OK
http://www.python.org/faq ==> Not Found
http://www.python.org/download ==>OK
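The same checker can be written for Python 3's html.parser module. In this sketch it is fed a literal document instead of a live page, so it runs without network access:

```python
from html.parser import HTMLParser

# flag stays True unless a 404 marker appears in the page text
class CheckHTML(HTMLParser):
    def __init__(self):
        super().__init__()
        self.available = True
    def handle_data(self, data):
        if "404 Not Found" in data or "Error 404" in data:
            self.available = False

p = CheckHTML()
p.feed("<html><body><h1>404 Not Found</h1></body></html>")
p.close()
print(p.available)  # False
```

handle_data fires for every run of text between tags, which is why searching the text content catches error pages regardless of their markup.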
Extracting all hyperlinks from an HTML document
Example:
import sgmllib
import urllib
class LinkDemo(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.links = []
    def start_a(self, attributes):
        for link in attributes:
            tag, attr = link[:2]
            if tag == "href":
                self.links.append(attr)

f = urllib.urlopen("http://www.baidu.com")
data = f.read()
f.close()
ld = LinkDemo()
ld.feed(data)
for i, link in enumerate(ld.links):
    print i, "== >", link
Result:
0 == > /
1 == > javascript:;
2 == > javascript:;
3 == > javascript:;
4 == > /
5 == > javascript:;
6 == > https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
7 == > http://news.baidu.com
8 == > https://www.hao123.com
9 == > http://map.baidu.com
10 == > http://v.baidu.com
11 == > http://tieba.baidu.com
12 == > http://xueshu.baidu.com
13 == > https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
14 == > http://www.baidu.com/gaoji/preferences.html
15 == > http://www.baidu.com/more/
16 == > //www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=
17 == > http://tieba.baidu.com/f?kw=&fr=wwwt
18 == > http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
19 == > http://music.taihe.com/search?fr=ps&ie=utf-8&key=
20 == > http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
21 == > http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
22 == > http://map.baidu.com/m?word=&fr=ps01000
23 == > http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
24 == > //www.baidu.com/more/
25 == > //www.baidu.com/cache/sethelp/help.html
26 == > http://home.baidu.com
27 == > http://ir.baidu.com
28 == > http://e.baidu.com/?refer=888
29 == > http://www.baidu.com/duty/
30 == > http://jianyi.baidu.com/
31 == > http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
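sgmllib was removed in Python 3; the same link extractor can be built on html.parser. A sketch, fed a small literal fragment so it runs offline:

```python
from html.parser import HTMLParser

# collect the href attribute of every <a> tag
class LinkDemo(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

ld = LinkDemo()
ld.feed('<a href="/">home</a> <a href="http://news.baidu.com">news</a>')
print(ld.links)  # ['/', 'http://news.baidu.com']
```

Unlike SGMLParser's per-tag start_a hook, html.parser delivers every tag to handle_starttag, so the tag name must be checked explicitly.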
Processing HTML documents with htmllib
import htmllib
import urllib
import formatter
import string
class LinkDemo(htmllib.HTMLParser):
    def __init__(self, verbose=0):
        f = formatter.NullFormatter()
        htmllib.HTMLParser.__init__(self, f, verbose)
        self.links = {}  # maps anchor text to a list of hrefs
    def anchor_bgn(self, href, name, type):  # called when an anchor tag opens
        self.save_bgn()
        self.link = href
    def anchor_end(self):  # called when an anchor tag closes
        text = string.strip(self.save_end())
        if self.link and text:
            self.links[text] = self.links.get(text, []) + [self.link]

f = urllib.urlopen("http://www.baidu.com")
data = f.read()
f.close()
ld = LinkDemo()
ld.feed(data)
ld.close()
for text, hrefs in ld.links.items():
    print text, "== >", hrefs
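htmllib is also gone in Python 3, but the same text-to-hrefs mapping can be reproduced with html.parser by tracking anchor state by hand. A sketch, run against a literal fragment:

```python
from html.parser import HTMLParser

# build {anchor text: [href, ...]} like the htmllib example above
class LinkDemo(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None   # href of the currently open <a>, if any
        self._text = []     # text fragments collected inside it
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a":
            text = "".join(self._text).strip()
            if self._href and text:
                self.links[text] = self.links.get(text, []) + [self._href]
            self._href = None

ld = LinkDemo()
ld.feed('<a href="/news">News</a><a href="/more">More</a>')
ld.close()
print(ld.links)  # {'News': ['/news'], 'More': ['/more']}
```

The manual _href/_text bookkeeping plays the role of htmllib's save_bgn/save_end helpers, which html.parser does not provide.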