Python Notes: the urllib Module and HTML Document Parsing

pfxia

Categories: python2, technical

Article link: https://blog.csdn.net/xioaf12/article/details/103486601

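For opening URLs, almost everything the urllib module does can also be done with urllib2 (helpers such as urllib.urlencode and urllib.quote still live only in urllib). When you need more flexible access to a URL resource, for example POST data, proxies or authentication, urllib2 is the module to use.

Basic usage of the urllib2 module: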

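import urllib2

fp = urllib2.urlopen("http://www.baidu.com")
print fp.read()  # read the response body from the file-like object
print fp.geturl()  # the URL actually retrieved (after any redirects)
print fp.info().items()  # response headers as (name, value) pairs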
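
The object returned by urlopen behaves like a file with a few extra methods. The sketch below exercises read(), geturl() and info() without any network access by opening a temporary local file through a file: URL. The temporary file and the try/except import shim (so the code runs on Python 2 and Python 3) are assumptions added for illustration and are not part of the original article.

```python
import os
import tempfile

try:  # Python 3: urllib2 was merged into urllib.request
    from urllib.request import urlopen, pathname2url
except ImportError:  # Python 2, as used in this article
    from urllib2 import urlopen
    from urllib import pathname2url

# Write a small local file so the example needs no network access.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"hello world")

# Build a file: URL for the temporary file (a stand-in for a web page).
local_url = "file:" + pathname2url(path)

fp = urlopen(local_url)
try:
    body = fp.read()                          # whole content as a byte string
    final_url = fp.geturl()                   # URL the response came from
    length = fp.info().get("Content-Length")  # header lookup ignores case
finally:
    fp.close()
    os.remove(path)

print(body)
print(final_url)
print(length)
```

The same three calls work unchanged on a response obtained from an http:// URL.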

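The same result can be achieved by building a request object with the Request class and passing it to urlopen. When the optional data argument is None, the URL is fetched with GET; when data is not None, it is sent to the URL with POST.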

# data is not None, so this request is sent with POST
request = urllib2.Request("http://www.baidu.com",data='data')
fp = urllib2.urlopen(request)
print fp.read()
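
The GET/POST rule can be checked without sending anything, because a Request object reports the method it would use through get_method(). The following is a minimal sketch; the URL and the form fields are hypothetical, and the import shim is only there so the code runs on both Python 2 and Python 3.

```python
try:  # Python 3
    from urllib.request import Request
    from urllib.parse import urlencode
except ImportError:  # Python 2
    from urllib2 import Request
    from urllib import urlencode

URL = "http://www.example.com/search"  # hypothetical endpoint

# No data: urlopen would issue a GET request.
get_request = Request(URL)

# With data: urlopen would issue a POST request carrying this body.
# Form fields are URL-encoded, then converted to bytes (required on Python 3).
form = urlencode({"q": "python", "page": "1"}).encode("ascii")
post_request = Request(URL, data=form)

print(get_request.get_method())
print(post_request.get_method())
```

Encoding the fields with urlencode rather than passing a raw string keeps special characters in the values from corrupting the request body.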

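Handlers in urllib2

urllib2 ships with many handler classes, including ProxyHandler and HTTPBasicAuthHandler.
The build_opener function builds an opener from such handlers, and install_opener installs it as the default opener used by urlopen.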

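# Create a proxy handler with ProxyHandler (maps a URL scheme to a proxy URL)
proxy_handler=urllib2.ProxyHandler({'http':'http://www.baidu.com'})
# Create a handler for HTTP basic authentication
proxy_auth_handler=urllib2.HTTPBasicAuthHandler()
# Register the user name and password for a realm and host
proxy_auth_handler.add_password('realm','host','user','passwd')
# Build a custom opener from the handlers with build_opener
opener=urllib2.build_opener(proxy_handler,proxy_auth_handler)
# Install it as the global default opener with install_opener
urllib2.install_opener(opener)
# urlopen now uses the installed opener. Only the 'http' scheme is mapped above,
# so http:// URLs go through the specified proxy; an https:// URL like this one
# is proxied only if an 'https' entry is added to the ProxyHandler dict
urllib2.urlopen("https://exmail.qq.com/login")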
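
An opener can also be used directly, without install_opener, which keeps the proxy and credentials from affecting every other urlopen call in the process. The sketch below is one way to set that up; the proxy address, site URL and credentials are hypothetical, and HTTPPasswordMgrWithDefaultRealm is used here (an assumption, not the article's choice) so that the realm does not have to be known in advance.

```python
try:  # Python 3
    from urllib.request import (ProxyHandler, HTTPBasicAuthHandler,
                                HTTPPasswordMgrWithDefaultRealm, build_opener)
except ImportError:  # Python 2
    from urllib2 import (ProxyHandler, HTTPBasicAuthHandler,
                         HTTPPasswordMgrWithDefaultRealm, build_opener)

# Hypothetical proxy address; both schemes are mapped so https:// URLs use it too.
PROXY = "http://proxy.example.com:3128"
proxy_handler = ProxyHandler({"http": PROXY, "https": PROXY})

# Passing None as the realm makes these credentials apply to any realm
# under the given URL prefix.
password_mgr = HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://www.example.com/", "user", "passwd")
auth_handler = HTTPBasicAuthHandler(password_mgr)

# build_opener only constructs the opener; nothing is installed globally.
opener = build_opener(proxy_handler, auth_handler)

# Credentials are looked up by URL prefix when a server requests authentication.
credentials = password_mgr.find_user_password(None, "http://www.example.com/private/")
print(credentials)

# opener.open("http://www.example.com/private/") would send the request via the proxy.
```

Calling opener.open(url) instead of urllib2.urlopen(url) is enough to route a single request through these handlers.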

Parsing HTML documents

HTMLParser is a module for parsing HTML documents.
Example:

import urllib
import urlparse
import HTMLParser

class CheckHTML(HTMLParser.HTMLParser):
    # Assume the page exists until a "not found" message shows up in its text
    available = True
    def handle_data(self, data):
        # Called for each run of text between tags
        if "404 Not Found" in data or "Error 404" in data:
            self.available=False

check_urls=["index","test","help","news","faq","download"]
for url in check_urls:
    # Resolve each relative path against the site root
    new_url = urlparse.urljoin("http://www.python.org/",url)
    fp = urllib.urlopen(new_url)
    data = fp.read()
    fp.close()
    p = CheckHTML()
    p.feed(data)
    p.close()

    if p.available:
        print new_url,"==>OK"
    else:
        print new_url,"==> Not Found"

Result:
http://www.python.org/index ==> Not Found
http://www.python.org/test ==> Not Found
http://www.python.org/help ==>OK
http://www.python.org/news ==>OK
http://www.python.org/faq ==> Not Found
http://www.python.org/download ==>OK
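
Matching "404 Not Found" anywhere in the page text also flags pages that merely mention the phrase. One possible refinement, sketched here under the assumption that error pages announce the 404 in their <title>, is to look only at the title text. The class name, helper function and sample pages are hypothetical, and the import shim lets the code run on Python 2 and Python 3.

```python
try:  # Python 3
    from html.parser import HTMLParser
except ImportError:  # Python 2
    from HTMLParser import HTMLParser


class TitleCheck(HTMLParser):
    """Collect the text inside <title> and ignore everything else."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_title:
            self.title_parts.append(data)


def is_available(html):
    """Return False only when the page title mentions 404."""
    parser = TitleCheck()
    parser.feed(html)
    parser.close()
    return "404" not in "".join(parser.title_parts)


ERROR_PAGE = "<html><head><title>Error 404</title></head><body>gone</body></html>"
NORMAL_PAGE = ("<html><head><title>Help</title></head>"
               "<body><p>A 404 Not Found error means the page is missing.</p></body></html>")

print(is_available(ERROR_PAGE))
print(is_available(NORMAL_PAGE))
```

When the response object is at hand, its HTTP status code is a more direct signal than any text match.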

Extracting all hyperlinks from an HTML document

Example:

import sgmllib
import urllib

class LinkDemo(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.links=[]

    # Called for every <a> start tag; attributes is a list of (name, value) pairs
    def start_a(self,attributes):
        for attribute in attributes:
            name,value=attribute[:2]
            if name == "href":
                self.links.append(value)

f = urllib.urlopen("http://www.baidu.com")
data = f.read()
f.close()

ld = LinkDemo()
ld.feed(data)
ld.close()

for i,link in enumerate(ld.links):
    print i,"== >",link

Result:

0 == > /
1 == > javascript:;
2 == > javascript:;
3 == > javascript:;
4 == > /
5 == > javascript:;
6 == > https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
7 == > http://news.baidu.com
8 == > https://www.hao123.com
9 == > http://map.baidu.com
10 == > http://v.baidu.com
11 == > http://tieba.baidu.com
12 == > http://xueshu.baidu.com
13 == > https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
14 == > http://www.baidu.com/gaoji/preferences.html
15 == > http://www.baidu.com/more/
16 == > //www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=
17 == > http://tieba.baidu.com/f?kw=&fr=wwwt
18 == > http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
19 == > http://music.taihe.com/search?fr=ps&ie=utf-8&key=
20 == > http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
21 == > http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
22 == > http://map.baidu.com/m?word=&fr=ps01000
23 == > http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
24 == > //www.baidu.com/more/
25 == > //www.baidu.com/cache/sethelp/help.html
26 == > http://home.baidu.com
27 == > http://ir.baidu.com
28 == > http://e.baidu.com/?refer=888
29 == > http://www.baidu.com/duty/
30 == > http://jianyi.baidu.com/
31 == > http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
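
The list above mixes absolute URLs, site-relative paths such as /, scheme-relative URLs such as //www.baidu.com/more/, and javascript: pseudo-links. A common follow-up step is to drop the pseudo-links and resolve the rest against the page URL with urljoin. The sketch below does this on an inline HTML sample; it uses HTMLParser rather than sgmllib (a swapped-in parser, since sgmllib does not exist in Python 3), and the class, function and sample markup are hypothetical.

```python
try:  # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin
except ImportError:  # Python 2
    from HTMLParser import HTMLParser
    from urlparse import urljoin

BASE_URL = "http://www.baidu.com"  # page the links were taken from


class LinkCollector(HTMLParser):
    """Collect the raw href value of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def absolute_links(html, base_url):
    """Return the page's links as absolute URLs, skipping javascript: entries."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    result = []
    for href in collector.links:
        if href.lower().startswith("javascript:"):
            continue
        result.append(urljoin(base_url, href))
    return result


SAMPLE_HTML = ('<a href="/">home</a>'
               '<a href="javascript:;">menu</a>'
               '<a href="//www.baidu.com/more/">more</a>'
               '<a href="http://news.baidu.com">news</a>')

links = absolute_links(SAMPLE_HTML, BASE_URL)
for i, link in enumerate(links):
    print("%d == > %s" % (i, link))
```

urljoin leaves URLs that are already absolute untouched, so it can be applied to every extracted link without checking its form first.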

Processing HTML documents with htmllib

import htmllib
import urllib
import formatter
import string

class LinkDemo(htmllib.HTMLParser):
    def __init__(self,verbose=0):
        f = formatter.NullFormatter()
        htmllib.HTMLParser.__init__(self,f,verbose)
        # Maps link text to a list of hrefs, so it must be a dict, not a list
        self.links={}
        self.link=None

    def anchor_bgn(self, href, name, type):  # called when an anchor tag opens
        self.save_bgn()
        self.link=href

    def anchor_end(self):  # called when an anchor tag closes
        text=string.strip(self.save_end())
        if self.link and text:
            self.links[text] = self.links.get(text,[])+[self.link]

f = urllib.urlopen("http://www.baidu.com")
data = f.read()
f.close()

ld = LinkDemo()
ld.feed(data)
ld.close()

# Each key is the link text; each value is the list of hrefs that used it
for text,hrefs in ld.links.items():
    print text,"== >",hrefs
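
The same mapping from link text to a list of hrefs can be built without htmllib, which was removed in Python 3. The sketch below is an illustrative equivalent based on HTMLParser, not the article's method; the class name and the sample markup are hypothetical, and it runs offline on both Python 2 and Python 3.

```python
try:  # Python 3
    from html.parser import HTMLParser
except ImportError:  # Python 2
    from HTMLParser import HTMLParser


class AnchorTextMap(HTMLParser):
    """Map the visible text of each link to the list of hrefs that use it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = {}
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if self._href and text:
                # Several anchors may share the same text, so keep a list.
                self.links.setdefault(text, []).append(self._href)
            self._href = None


SAMPLE_HTML = ('<a href="http://news.baidu.com">news</a>'
               '<a href="http://map.baidu.com"> map </a>'
               '<a href="http://news.baidu.com/guonei">news</a>'
               '<a name="top">no href</a>')

parser = AnchorTextMap()
parser.feed(SAMPLE_HTML)
parser.close()

for text, hrefs in sorted(parser.links.items()):
    print("%s == > %s" % (text, hrefs))
```

Anchors without an href, such as the named anchor in the sample, are skipped, which mirrors the "if self.link and text" check in the htmllib version.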